Research on Search: 2010

Monday, October 04, 2010

A Poor Man's Parallel Processing

A very crude, but often good enough, method to achieve parallel processing (e.g., on multi-core computers) is to partition the large input data file into small chunks, run the program to process each of them in parallel, and then merge the output results file back. Fortunately, this process can be done easily with the wise iterative usage of two Unix utilities: split and cat.

Friday, September 10, 2010

nDCG

The choice of the gain and discount function for the popular IR performance measure normalised Discounted Cumulative Gain (nDCG) has been discussed and empirically justified in a CIKM-2009 paper through analysis of variance (ANOVA).

Wednesday, August 11, 2010

LNRE

Here is a good tutorial with Matlab examples about Statistical Estimation for Large Numbers of Rare Events (LNRE).

Friday, June 18, 2010

VLFeat - a computer vision toolbox

The VLFeat open source computer vision library that implements popular

feature extraction algorithms (such as SIFT, MSER, and quick shift),
clustering algorithms (such as integer k-means, hierarchical k-means, and agglomerative information bottleneck), and
matching algorithms (such as randomized kd-trees).

It is written in C for efficiency and compatibility, with interfaces in MATLAB for ease of use, and detailed documentation throughout.

Tuesday, June 01, 2010

Bloom filters and Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) of l-bits is achieved by carrying out l independent random cuts of the Euclidean space: if two data points are in the same side of all these cuts, they are very likely to be nearest neighbours. In this sense, I think Bloom filters (that also relies on a number of independent hashing functions) can be conceptually considered as the extreme case of LSH: each of its cuts tries to separate one data point from the rest.

Monday, May 31, 2010

An application of Bloom filters

It is said that Google's BigTable uses Bloom filters to reduce the disk lookups for non-existent rows or columns.

Tuesday, May 04, 2010

A suffix tree implementation with Unicode support

It seems that there is currently no suffix tree implementation with Unicode support publicly available online. So I adapted Thomas Mailund's suffix tree implementation in C with a Python binding and put it here. The changes that I made to the code were mainly to make it support Unicode text and be compatible with new version Python. It also includes an example program all_comsubstr.py that illustrates the extraction of common substrings from two Chinese strings (encoded in UTF-8).

Monday, May 03, 2010

Longest Common Substring

Given two strings, S of length m and T of length n, their longest common substrings can be found in O(m+n) time using a generalised suffix tree, or in O(mn) time through dynamic programming (e.g., the Python code here).

Sunday, March 28, 2010

Bayesian inference for the Gaussian

Given the prior probability
$p(\mu) = \mathcal{N}(\x_0,\sigma_0^2)$
and the likelihood
$p(x_1|\mu) = \mathcal{N}(\mu,\sigma_1^2)$,
the expectation of the posterior probability
$p(\mu|x_1)$
has a very simple and elegant form:
$(\alpha \x_0 + \beta x_1) / (\alpha + \beta)$
where
$\alpha = 1/(\sigma_0^2)$ and $\beta = 1/(\sigma_1^2)$
are the precisions.

Please refer to Bishop's PRML book section 2.3.6.

Wednesday, February 03, 2010

Comparing Data Analysis Packages

A succinct comparison of data analysis packages including R, Matlab, SciPy, Excel, SAS, SPSS and Stata, can be found here. I recently tried Stata, but found its language syntax ugly and awkward.

Research on Search

Monday, October 04, 2010

A Poor Man's Parallel Processing

Friday, September 10, 2010

nDCG

Wednesday, August 11, 2010

LNRE

Friday, June 18, 2010

VLFeat - a computer vision toolbox

Tuesday, June 01, 2010

Bloom filters and Locality Sensitive Hashing

Monday, May 31, 2010

An application of Bloom filters

Tuesday, May 04, 2010

A suffix tree implementation with Unicode support

Monday, May 03, 2010

Longest Common Substring

Sunday, March 28, 2010

Bayesian inference for the Gaussian

Wednesday, February 03, 2010

Comparing Data Analysis Packages

About Me

Archive

Links

Monday, October 04, 2010

Friday, September 10, 2010

Wednesday, August 11, 2010

Friday, June 18, 2010

Tuesday, June 01, 2010

Monday, May 31, 2010

Tuesday, May 04, 2010

Monday, May 03, 2010

Sunday, March 28, 2010

Wednesday, February 03, 2010

About Me

Archive

Links

Subscribe To