Research on Search: 2008

Monday, December 01, 2008

Python(x,y)

Python(x,y) is a free development environment for scientific and engineering computation based on Python, Qt and Eclipse. It is probably the most complete scientific-oriented Python distribution available for researchers.

Sunday, November 30, 2008

ShowMeDo

ShowMeDo contains a large collection of short video tutorials for Python programming and a lot more.

OpenCalais

Thompson/Reuters provides a free web-service OpenCalais that can be used for information extraction from unstructured text/html/xml documents. It can recognize not only named entities, but also facts and events. There is a Python interface available to this Web service called python-calais.

Friday, November 21, 2008

Efficient SVM for Ranking

Ranking Vector Machine improves the efficiency of the standard Ranking SVM (as implemented in SVM-light) by (1) using L_1 norm instead of L_2 norm for regularisation; and (2) using a subset of the original instance vectors rather than the pairwise difference vectors as support vectors. It is said to be much faster than Ranking SVM if non-linear kernels are employed, though with a lower accuracy (especially on small datasets).

Friday, October 31, 2008

Six Sigma

The well-known management concept six sigma actually comes from the empirical rule that if one has six standard deviations (sigma) between the mean of a process and the nearest specification limit in a short-term study, there will be practically no items that fail to meet the specifications (i.e., 3.4 defective parts per million opportunities, corresponding to 4.5 sigma) in the long-term, assuming that the process is normally distributed.

Tuesday, October 21, 2008

Laplace Approximation

Laplace approximation is a simple and popular method that approximates a probability density function p(z) by fitting a Gaussian distribution to the local quadratic expansion of p(z) around its mode where p'(z) vanishes. One important usage of this technique is to compute the integral of p(z). A good explanation of this method from David MacKay can be found here.

Tuesday, August 05, 2008

Loss Functions and Prior Distributions

Andrew Gelman has an article that explains the relationship between loss functions and prior distributions. In a word, using L2 or L1 norm as regularisers for regresion is essentially equivalent to choosing Gaussian or Laplace priors for the parameters, i.e., L2 or L1 corresponds to the MAP of Gaussian or Laplace.

Monday, July 28, 2008

Using Python for Machine Learning

Python plus a C/C++ library seems a good combination for agile development in machine learning. Please refer to John Langford's post on Programming Languages for Machine Learning Implementations. There is a list of open source machin learning tools that support Python scripting. Orange is probably the most comprehensive one among them, e.g., it also contains data mining algorithms such as association rules.

Monday, July 14, 2008

Python defaultdict

I recently learned the trick of using the defaultdict class for frequency counting and smoothing from Peter Norvig's influential technical article How to Write a Spelling Corrector.

As its name suggests, defaultdict is like a regular Python dict except that a default value (factory in fact) can be specified in advance. For example, the following piece of code uses the stadard dict to build a term-frequency hash table.


tf = {}
for t in words:
    tf[t] = tf.get(t, 0) + 1

It can be simpler and faster by making use of the defaultdict from the collections module.


import collections
tf = collections.defaultdict{int}
for t in words:
    tf[t] += 1

PS: I am glad to see that my colleague Rogger Mitton's work on spell-checking is cited by the above article.

Saturday, July 12, 2008

The Strength of Weak Ties, in Online Advertising

The Strength of Weak Ties was studied by sociologists 35 years ago and re-discovered by physicists about 10 years ago in the research of small-world networks. This theory actually provides a good explanation of the phenomenon noticed by Anand Rajaraman recently: Affinity and Herding Determine the Effectiveness of Social Media Advertising.

Friday, June 27, 2008

To be a leader

Paper number does not really make a lot of sense. To be a leader, you should have a clear focus and have real impact. For example, the only reason why one receives an academic award is that he contributes a lot to a specific research direction but not that he has published hundreds of papers.

The above words are said to be from Bruce Croft (a leading computer scientist in the field of information retrieval), though I could not find his original text.

Thursday, June 19, 2008

SAGE

I really believe that Python, but not Matlab, is the future for scientific and engineering computing. SAGE, a free open-source alternative to Matlab etc., has started to attract attention from the research community. Since it uses Python instead of an obscure language designed for a particular mathematics program, one can write programs that combine serious mathematics with anything else, such as natual language processing.

Sunday, June 01, 2008

Right Data vs. Better Models

Yehuda Koren, a member of the famous BellKor team in the Netflix Prize competition, recently wrote an article about the three levels of addressing the Netflix Prize. The insight that deciding what data to model is more important than choosing what model to use seems to be complement to the issue of more data vs. better algorithms. This is also supported by the successful story of Amazon's recommendation systems.

Thursday, May 01, 2008

Generating all permutations: a non-recursive algorithm

A natural idea for generating all permutations of n elements is to write a recursive algorithm with O(n!) time complexity. The algorithm used in C++ STL function next_permutation is a more efficient solution --- it generates all unique permutations non-recursively and moreover in lexicographic order. The algorithm is given by the great computer scientist Donald Knuth in his book The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Here is his Draft of Section 7.2.1.2: Generating All Permutations.

Monday, April 28, 2008

ParsCit

ParsCit is an open-source reference string parsing package developed by Min-Yen Kan et al. It is based on the Conditional Random Fields (CRF) toolkit CRF++. It is being used by the well-known computer science digital library CiteSeer^x.

Tuesday, April 08, 2008

More Data vs. Better Algorithms

The recent blog posts from Anand Rajaraman that more data usually beats better algorithms (part 1, part 2 and part 3) reminds me of a talk by David Hand two years ago --- Classifier Technology and the Illusion of Progress. There has also been discussons on a this issue in Hal Daume III's blog post about Heuristics.

Laplacian Kernel, Resistance Distance and Commute Time

The Laplacian kernel for a graph is interestingly connected to the resistance distance (the total resistance between two nodes) and the commute time (the average length of a random walk between two nodes) over the graph.

Sunday, March 30, 2008

Human Insight vs. Number Crunching

It is interesting to read the story This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize together with the book Super Crunchers. On one hand, human intuition is usually less reliable and less accurate than number crunching. On the other hand, a little human insight goes a long way, which may sometimes point out the right direction for number crunching.

Sunday, March 09, 2008

Shogun

Shogun is a machine learning toolbox focusing on large scale kernel methods especially SVM. It supports

state-of-the-art SVM implementations,

various kernels including string kernels,

a number of linear methods and

interfaceing with Matlab and Python etc.

Sunday, March 02, 2008

SVDLIBC

SVDLIBC is a C library for computing Singular Value Decomposition (SVD). It is actually an improved version of the SVDPACKC library. Since the algorithm that it implemented, las2, has relatively low precision for low order singular values, it is particularly suitable for truncated SVD (as in LSI).

Thursday, February 28, 2008

Apache Mahout

The Lucene PMC has launched a project, Mahout, to build scalable Apache-licensed machine learning libraries based on the MapReduce mechanism. It will start with imlementing those ten machine learning algorithms described in Map-Reduce for Machine Learning on Multicore with Hadoop.