Wednesday, September 07, 2011

The Hungarian algorithm in clustering evaluation

The Hungarian algorithm (aka Kuhn–Munkres algorithm or Munkres assignment algorithm) can solve the assignment problem in polynomial time O(n^3). It can be used to find the optimal mapping from discovered clusters to the ground-truth categories which serves as the basis for some performance measures of clustering.

Sunday, August 28, 2011

Fastest membership test in Python

What is the most efficient method to check whether an item is in a given group or not? In Python, it seems that set (or frozenset) would be slightly faster than dict and much much faster than list.

Friday, August 26, 2011

Submodular functions

Intuitively, a submodular function over the subsets demonstrates "diminishing returns", which is related to the concept of marginal utility in economics. Its usefulness for machine learning is well explained and illustrated by the Beyond Convexity tutorial. There is a Matlab toolbox for submodular function optimization available that is developed by Andreas Krause.

L1 regularisation Is efficient for selecting relevant features

Andrew Ng has proven in his ICML-2004 paper that sample complexity grows linearly in the number of irrelevant features when using L2 regularisation (in logistic regression, support vector machine, and back-propagation neural network), but only logarithmically when using L1 regularisation (in logistic regression).

Sunday, July 03, 2011

New linear-time algorithm for suffix array construction

Juha Kärkkäinen, Peter Sanders , and Stefan Burkhardt: Linear Work Suffix Array Construction, Journal of the ACM (JACM), Volume 53 Issue 6, November 2006.
As the authors have said, this algorithm narrows the gap between suffix tree and suffix array, which are widely used and largely interchangeable index structures on strings and sequences. Usually theoreticians prefer the former due to linear-time construction algorithms and more explicit structure while practitioners prefer the latter due to its simplicity and space efficiency. Now there is one more reason for practitioners to stick with suffix array.

Friday, July 01, 2011

Research Impact for REF

The British government's emphasis on the practical impact of research in REF reminds me of Feynman's following words.

Physics [research] is like sex: sure, it may give some practical results, but that's not why we do it.

Friday, June 17, 2011


Pablo Castells, Jun Wang, Ruben Lara, and Dell Zhang are organising an ACM RecSys-2011 workshop on Novelty and Diversity in Recommender Systems (DiveRS). A special issue of ACM TIST in the scope of the workshop will be announced after the conference. Authors of accepted papers will be invited to submit an extended version.

A couple of metrics

It is often desirable to measure the dissimilarity or distance between items using a proper metric.