A very crude, but often good enough, way to achieve parallel processing (e.g., on multi-core computers) is to partition a large input data file into small chunks, run the program on each chunk in parallel, and then merge the output files back into one. Fortunately, this can be done easily with careful, repeated use of two Unix utilities: split, which breaks the input into chunks, and cat, which concatenates the outputs.
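For readers who prefer to stay inside one script, the same split / process / merge pattern can be sketched in Python. This is only an illustrative sketch, not the split-and-cat recipe itself: the helper names (split_into_chunks, count_words, process_in_parallel) are made up for this example, the data is held in memory rather than in chunk files, and the per-chunk "program" is a toy word counter.

```python
import concurrent.futures as cf


def split_into_chunks(lines, n_chunks):
    """Partition a list of lines into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(chunk):
    """Toy stand-in for the real program: one output line per input line."""
    return [str(len(line.split())) for line in chunk]


def process_in_parallel(lines, worker, n_chunks=4,
                        executor_cls=cf.ProcessPoolExecutor):
    """Split the input, run worker on every chunk, and merge the outputs."""
    chunks = split_into_chunks(lines, n_chunks)
    with executor_cls() as pool:
        # map() yields results in chunk order, so the merge step is a plain
        # concatenation, just like cat on the ordered output files.
        return [out for part in pool.map(worker, chunks) for out in part]
```

With the default process pool, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes, and the worker must be a top-level function so that it can be pickled.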
Monday, October 04, 2010
Friday, September 10, 2010
nDCG
The choice of the gain and discount functions for the popular IR performance measure normalised Discounted Cumulative Gain (nDCG) has been discussed and empirically justified, through analysis of variance (ANOVA), in a CIKM-2009 paper.
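To make the roles of the two functions concrete, here is a minimal sketch of nDCG in Python. It assumes one widely used pairing, an exponential gain 2^rel - 1 and a logarithmic discount log2(rank + 1); that pairing is an assumption made for illustration and is not necessarily the choice the paper argues for.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)  # gain divided by discount
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Swapping in a different gain or discount only requires changing the single expression inside dcg, which makes it easy to compare alternatives.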
Wednesday, August 11, 2010
LNRE
Here is a good tutorial with Matlab examples about Statistical Estimation for Large Numbers of Rare Events (LNRE).
Friday, June 18, 2010
VLFeat - a computer vision toolbox
VLFeat is an open source computer vision library that implements popular
- feature extraction algorithms (such as SIFT, MSER, and quick shift),
- clustering algorithms (such as integer k-means, hierarchical k-means, and agglomerative information bottleneck), and
- matching algorithms (such as randomized kd-trees).
Tuesday, June 01, 2010
Bloom filters and Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) with l bits is achieved by carrying out l independent random cuts of the Euclidean space: if two data points are on the same side of all these cuts, they are very likely to be nearest neighbours. In this sense, I think Bloom filters (which also rely on a number of independent hash functions) can be conceptually considered the extreme case of LSH: each of their cuts tries to separate one data point from the rest.
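The "random cuts" picture of LSH can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each cut is a random hyperplane through the origin with a Gaussian normal vector (so it really measures angular rather than strictly Euclidean closeness), and the names make_cuts and lsh_signature are invented for this example.

```python
import random


def make_cuts(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(point, cuts):
    """One bit per cut: which side of the hyperplane the point falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(normal, point)) >= 0 else 0
        for normal in cuts
    )
```

Two points that receive the same signature were on the same side of every cut, so they land in the same hash bucket and become candidate nearest neighbours; more bits mean finer buckets.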