Saturday, March 24, 2007

Estimating Query Hardness

David Carmel, Elad Yom-Tov, Adam Darlow, Dan Pelleg: What Makes A Query Difficult?, SIGIR 2006: 390-397.

Vishwa Vinay, Ingemar J. Cox, Natasa Milic-Frayling, Ken Wood: On Ranking the Effectiveness of Searches, SIGIR 2006: 398-404.

Yun Zhou, Bruce Croft: Query Performance Prediction in Web Search Environments, SIGIR 2007.

Friday, March 23, 2007

Wikipedia XML Corpus

There is a large XML corpus based on Wikipedia avalilable for research.

Sociocentric vs. Egocentric

While sociocentric network/graph anlysis is oriented toward the global 'society', egocentric network/graph anlysis is centered around an individual node therefore provides a personalized view of its local neighborhood environment.

Saturday, March 17, 2007

Edge Separator vs. Vertex Separator

An edge separator is a subset of the graph's edges whose removal from the graph makes the graph disconnected. Defferent weakness measures of the edge separator lead to different ways of graph partitioning or clustering, such as normalized cuts, conductance, edge betweenness, modularity, and relative neighborhoods. However, regardless of the weakness measure used, edge separators sometimes fail to capture the cohesion of graphs, especially in the presence of overlapping clusters. While the existence of a weak edge separator in a graph is sufficient to make the graph noncohesive, it is not a necessary condition.

A vertex separator is a subset of the graph's nodes whose removal leaves the graph disconnected. Note that sparse vertex separators subsume weak edge separators: if the graph has a weak edge separator, then it must also have a sparse vertex separator, but the converse is not true.

--- Ziv Bar-Yossef, Ido Guy, Ronny Lempel, Yoelle S. Maarek and Vladimir Soroka, Cluster Ranking with an Application to Mining Mailbox Networks, Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), pp. 63--74, Hong Kong, China, 2006.

Thursday, March 15, 2007

Motzkin-Straus

The Motzkin-Straus theorem (formulation) establishes a remarkable connection between the maximal/maximum cliques of an unweighted undirected graph G and the local/global solutions of a quadratic programming problem. It has motivated several clique-finding techniques. There are generalizations of this result to edge-weighted or vertex-weighted graphs.

Wednesday, March 07, 2007

10 Challenging Problems in Data Mining Research

On ICDM'05, Qiang Yang and Xindong Wu presented 10 Challenging Problems in Data Mining Research.
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data and High Speed Data Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

I am particularly interested in Problem 2, 4 and 5 which share one common theme: Learning/Mining with Very Large Graphs.

Sunday, March 04, 2007

Free eBook on Matrix Computation

The Matrix Cookbook is a mathematical desktop reference on matrices that is freely available online.