A pretty striking finding in the WWW'08 paper by Leskovec et al. is that in nearly every network dataset they examined, there are tight but almost trivial communities at very small scales (up to around 100 nodes), while at larger scales the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like".
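To make "community-like" concrete: the paper scores a candidate community by its conductance (the fraction of edge volume leaving the set), so lower is better. Below is a minimal sketch with networkx; the karate club graph and the candidate node set are stand-ins of my own, not data from the paper.

```python
# Rough illustration of scoring a node set by conductance (lower = tighter community).
import networkx as nx

G = nx.karate_club_graph()  # stand-in example graph, not one of the paper's datasets

community = {0, 1, 2, 3, 7, 13}   # hypothetical candidate community
phi = nx.conductance(G, community)
print(f"conductance of candidate community: {phi:.3f}")

# Scanning the best achievable conductance as a function of set size gives the
# "network community profile" whose minimum the paper finds at small scales.
```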
Wednesday, August 12, 2009
Thursday, July 30, 2009
Spectral Graph Partitioning
There are a number of methods in the family of Spectral Graph Partitioning, including the traditional min-cut and various balanced-cut criteria (such as ratio cut, average cut, normalized cut and min-max cut). Each method uses a different objective function and consequently a different definition of the partition (cluster) indicator vector. The following two tutorials on Spectral Clustering both contain a good summary of these methods.
[1] Spectral Clustering, ICML 2004 Tutorial by Chris Ding
[2] A Tutorial on Spectral Clustering by Ulrike von Luxburg
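As a rough illustration (my own minimal example, not code from either tutorial), the normalized-cut flavour can be sketched in a few lines of numpy/scipy: form the normalized Laplacian, take its second-smallest eigenvector as the relaxed cluster indicator, and split by sign.

```python
# Minimal normalized-cut style spectral bipartition (details such as the
# exact thresholding rule vary between the methods surveyed above).
import numpy as np
from scipy.linalg import eigh

def spectral_bipartition(A):
    """Split a graph (dense adjacency matrix A) into two clusters."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    eigvals, eigvecs = eigh(L)  # eigenvalues in ascending order
    # Second-smallest eigenvector gives the relaxed indicator; split by sign.
    indicator = D_inv_sqrt @ eigvecs[:, 1]
    return indicator >= 0

# Toy example: two triangles joined by a single edge.
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
print(spectral_bipartition(A))  # the two triangles land in different clusters
```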
Sunday, June 28, 2009
SIFT and SURF
Scale-Invariant Feature Transform (or SIFT) is one of the most popular algorithms in computer vision for detecting and describing local features in images. The algorithm was published by David Lowe in 1999 and is patented by the University of British Columbia.
The SIFT approach to image feature generation takes an image and transforms it into a "large collection of local feature vectors". Each of these feature vectors is invariant to scaling, rotation or translation of the image, and the approach shares many properties with neuron responses in primate vision. To extract these features, the SIFT algorithm applies a four-stage filtering approach: (1) Scale-Space Extrema Detection, (2) Keypoint Localisation, (3) Orientation Assignment, (4) Keypoint Descriptor generation.
Speeded Up Robust Features (SURF) is said to have similar performance to SIFT, while at the same time being faster.
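For a quick hands-on feel, here is a minimal sketch using OpenCV's implementations of both detectors (this assumes an OpenCV build that includes SIFT and the contrib xfeatures2d module; the image filename is a placeholder).

```python
# Minimal sketch: extract SIFT (and, where available, SURF) features with OpenCV.
import cv2

img = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename

# SIFT: detect keypoints and compute 128-dimensional descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # N keypoints, (N, 128) descriptors

# SURF lives in the opencv-contrib "xfeatures2d" module and may be disabled
# in some builds for patent reasons.
try:
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_surf, desc_surf = surf.detectAndCompute(img, None)
    print(len(kp_surf), desc_surf.shape)  # 64-dimensional descriptors by default
except (AttributeError, cv2.error):
    print("SURF not available in this OpenCV build")
```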
Monday, May 25, 2009
Centrality measures in network analysis
There are four measures of centrality that are widely used in network analysis: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Google's PageRank is a variant of the eigenvector centrality measure.
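A minimal sketch of computing all four measures (plus PageRank) with networkx, on a stand-in example graph:

```python
# Compute the four standard centrality measures and PageRank on a small graph.
import networkx as nx

G = nx.karate_club_graph()  # stand-in example network

degree      = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness   = nx.closeness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
pagerank    = nx.pagerank(G, alpha=0.85)  # damped variant of eigenvector centrality

# Nodes ranked most central by eigenvector centrality:
top = sorted(eigenvector, key=eigenvector.get, reverse=True)[:5]
print(top)
```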
Dirichlet prior for smoothing
Using a Dirichlet distribution as the prior for smoothing in statistical language modeling leads to additive smoothing (a.k.a. Lidstone smoothing), which includes Laplace smoothing (i.e., add one) and Jeffreys-Perks smoothing (i.e., add half, a.k.a. Expected Likelihood Estimation) as special cases. This family of smoothing methods can be regarded as a document-dependent extension of linearly interpolated smoothing.
It has been shown that Laplace smoothing, though the most popular in textbooks, is often inferior to Lidstone smoothing with a value less than one when modeling natural language data, e.g., for text classification tasks (see Athena: Mining-based interactive management of text databases).
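As a concrete sketch of the additive family (a toy example of my own, not code from the cited paper): with pseudo-count delta, the smoothed unigram estimate is (c(w) + delta) / (N + delta * |V|).

```python
# Additive (Lidstone) smoothing for a unigram language model:
# delta = 1 gives Laplace smoothing, delta = 0.5 gives Jeffreys-Perks.
from collections import Counter

def lidstone_prob(word, counts, vocab_size, delta=0.5):
    """P(word) = (count(word) + delta) / (total + delta * |V|)."""
    total = sum(counts.values())
    return (counts[word] + delta) / (total + delta * vocab_size)

doc = "the cat sat on the mat".split()
counts = Counter(doc)
V = 10000  # assumed vocabulary size

print(lidstone_prob("the", counts, V, delta=1.0))   # Laplace (add one)
print(lidstone_prob("dog", counts, V, delta=0.5))   # Jeffreys-Perks (add half), unseen word
```

Writing mu = delta * |V| turns the estimate into (N/(N+mu)) * (c(w)/N) + (mu/(N+mu)) * (1/|V|), i.e., a mixture of the maximum-likelihood estimate and a uniform background whose weight depends on the document length N, which is the sense in which the family is a document-dependent extension of linear interpolation.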