## Monday, December 01, 2008

### Python(x,y)

Python(x,y) is a free development environment for scientific and engineering computation based on Python, Qt and Eclipse. It is probably the most complete scientific-oriented Python distribution available for researchers.

## Sunday, November 30, 2008

### OpenCalais

Thomson Reuters provides a free web service, OpenCalais, that can be used for information extraction from unstructured text/HTML/XML documents. It recognizes not only named entities but also facts and events. There is a Python interface to this web service called python-calais.

## Friday, November 21, 2008

### Efficient SVM for Ranking

Ranking Vector Machine improves the efficiency of the standard Ranking SVM (as implemented in SVM-light) by (1) using the L_1 norm instead of the L_2 norm for regularisation, and (2) using a subset of the original instance vectors, rather than the pairwise difference vectors, as support vectors. It is said to be much faster than Ranking SVM when *non-linear* kernels are employed, though at the cost of somewhat lower accuracy (especially on small datasets).
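For reference, the pairwise-difference reduction that standard Ranking SVM relies on (and that Ranking Vector Machine avoids) can be sketched in a few lines; the function name and the toy data below are my own illustration:

```python
def pairwise_differences(features, relevance):
    """Reduce a ranking problem to binary classification: for every pair of
    items i, j in the same query with relevance[i] > relevance[j], the
    difference vector x_i - x_j becomes a positive training example.
    (The mirrored difference x_j - x_i would serve as the negative example.)"""
    pairs = []
    n = len(features)
    for i in range(n):
        for j in range(n):
            if relevance[i] > relevance[j]:
                diff = [a - b for a, b in zip(features[i], features[j])]
                pairs.append((diff, +1))
    return pairs

# Two items, the first more relevant: one positive difference vector results.
pairs = pairwise_differences([[1.0, 0.0], [0.0, 1.0]], [2, 1])
print(pairs)  # [([1.0, -1.0], 1)]
```

With many items per query the number of such pairs grows quadratically, which is exactly the cost that motivates using a subset of the original instance vectors instead.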

## Friday, October 31, 2008

### Six Sigma

The well-known management concept six sigma actually comes from the empirical rule that if one has six standard deviations (sigma) between the mean of a process and the nearest specification limit in a short-term study, there will be *practically* no items that fail to meet the specifications in the long term, assuming that the process is normally distributed. Concretely, this means at most 3.4 defective parts per million opportunities, which corresponds to a 4.5 sigma margin once the conventional 1.5 sigma long-term drift of the process mean is subtracted from the six sigma design margin.
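The 3.4 DPMO figure can be checked directly from the standard normal tail probability at 4.5 sigma (6 sigma minus the assumed 1.5 sigma drift); a minimal sketch:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Six sigma design margin minus the conventional 1.5 sigma long-term drift.
defects_per_million = upper_tail(6 - 1.5) * 1e6
print(round(defects_per_million, 1))  # 3.4
```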

## Tuesday, October 21, 2008

### Laplace Approximation

Laplace approximation is a simple and popular method that approximates a probability density function *p(z)* by fitting a Gaussian distribution based on the local quadratic expansion of *log p(z)* around the mode of *p(z)*, where *p'(z)* vanishes. One important usage of this technique is to compute the integral of *p(z)*. A good explanation of this method from David MacKay can be found here.
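As a quick numerical check (my own example, not from MacKay's note): take the unnormalised density p(z) = z^a e^(-z), whose exact integral is Gamma(a+1). The Laplace estimate is p(z0) sqrt(2 pi / A), where z0 = a is the mode and A = -(log p)''(z0) = 1/a is the curvature there; for a = 10 it is within about 1% of the truth:

```python
import math

a = 10.0
z0 = a                           # mode of p(z) = z^a * exp(-z)
A = a / z0**2                    # curvature -(log p)''(z0) = a / z0^2
p_mode = z0**a * math.exp(-z0)   # density value at the mode
laplace = p_mode * math.sqrt(2 * math.pi / A)  # Laplace estimate of the integral
exact = math.gamma(a + 1)        # the true integral, Gamma(a + 1) = 10!
print(laplace / exact)           # roughly 0.992
```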

## Tuesday, August 05, 2008

### Loss Functions and Prior Distributions

Andrew Gelman has an article that explains the relationship between loss functions and prior distributions. In a word, using the L2 or L1 norm as a regulariser for regression is essentially equivalent to placing a Gaussian or Laplace prior on the parameters: the L2- or L1-regularised solution corresponds to the MAP estimate under a Gaussian or Laplace prior, respectively.
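In symbols (a standard derivation, not Gelman's exact notation): with Gaussian observation noise of variance sigma^2 and a zero-mean Gaussian prior w ~ N(0, tau^2 I), the negative log-posterior is

```latex
-\log p(\mathbf{w} \mid D)
  = -\log p(D \mid \mathbf{w}) - \log p(\mathbf{w}) + \mathrm{const}
  = \frac{1}{2\sigma^2} \sum_i \bigl(y_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2
    + \frac{1}{2\tau^2} \lVert \mathbf{w} \rVert_2^2 + \mathrm{const},
```

so minimising squared loss with an L2 penalty of weight lambda = sigma^2 / tau^2 is exactly MAP estimation under the Gaussian prior; swapping in a Laplace prior proportional to exp(-|w|/b) turns the penalty term into an L1 norm.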

## Monday, July 28, 2008

### Using Python for Machine Learning

Python plus a C/C++ library seems a good combination for agile development in machine learning. Please refer to John Langford's post on Programming Languages for Machine Learning Implementations. There is a list of open-source machine learning tools that support Python scripting. Orange is probably the most comprehensive among them; e.g., it also contains data mining algorithms such as association rules.

## Monday, July 14, 2008

### Python defaultdict

I recently learned the trick of using the defaultdict class for frequency counting and smoothing from Peter Norvig's influential technical article How to Write a Spelling Corrector.

As its name suggests, defaultdict is like a regular Python dict except that a default value (in fact, a factory) can be specified in advance. For example, the following piece of code uses the standard dict to build a term-frequency hash table.

```python
tf = {}
for t in words:
    tf[t] = tf.get(t, 0) + 1
```

This can be made simpler and faster by using the defaultdict from the collections module.

```python
import collections

tf = collections.defaultdict(int)
for t in words:
    tf[t] += 1
```
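For the smoothing use mentioned above, the default factory can just as well return a pseudo-count instead of zero; a minimal sketch (the add-one variant here is my own illustration, not from Norvig's article):

```python
import collections

words = ["the", "cat", "sat", "on", "the", "mat"]

# Add-one (Laplace) smoothed counts: unseen terms start from a pseudo-count of 1.
tf = collections.defaultdict(lambda: 1)
for t in words:
    tf[t] += 1

print(tf["the"])  # 3: two occurrences plus the pseudo-count
print(tf["dog"])  # 1: unseen term, just the pseudo-count
```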

PS: I am glad to see that my colleague Roger Mitton's work on spell-checking is cited by the above article.

## Saturday, July 12, 2008

### The Strength of Weak Ties, in Online Advertising

The Strength of Weak Ties was studied by sociologists 35 years ago and re-discovered by physicists about 10 years ago in the research of small-world networks. This theory actually provides a good explanation of the phenomenon noticed by Anand Rajaraman recently: Affinity and Herding Determine the Effectiveness of Social Media Advertising.

## Friday, June 27, 2008

### To be a leader

The number of papers one publishes does not really mean much. To be a leader, you should have a clear focus and real impact. For example, the reason one receives an academic award is that one has contributed a lot to a specific research direction, not that one has published hundreds of papers.

The above words are said to be from Bruce Croft (a leading computer scientist in the field of information retrieval), though I could not find his original text.

## Thursday, June 19, 2008

### SAGE

I really believe that Python, not Matlab, is the future for scientific and engineering computing. SAGE, a free open-source alternative to Matlab etc., has started to attract attention from the research community. Since it uses Python instead of an obscure language designed for a particular mathematics program, one can write programs that combine serious mathematics with anything else, such as *natural language processing*.

## Sunday, June 01, 2008

### Right Data vs. Better Models

Yehuda Koren, a member of the famous BellKor team in the Netflix Prize competition, recently wrote an article about the three levels of addressing the Netflix Prize. The insight that *deciding what data to model is more important than choosing what model to use* seems complementary to the issue of more data vs. better algorithms. This is also supported by the success story of Amazon's recommendation systems.

## Thursday, May 01, 2008

### Generating all permutations: a non-recursive algorithm

A natural idea for generating all permutations of n elements is to write a recursive algorithm with O(n!) time complexity. The algorithm used in C++ STL function *next_permutation* is a more efficient solution --- it generates all *unique* permutations *non-recursively* and moreover *in lexicographic order*. The algorithm is given by the great computer scientist Donald Knuth in his book The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Here is his Draft of Section 7.2.1.2: Generating All Permutations.
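A minimal Python sketch of the same lexicographic successor step (Knuth's Algorithm L; the function name is my own):

```python
def next_permutation(a):
    """Advance list a in place to its next permutation in lexicographic order.
    Returns False (leaving a unchanged, as the last permutation) when none remains."""
    # Find the rightmost position j with a[j] < a[j+1].
    j = len(a) - 2
    while j >= 0 and a[j] >= a[j + 1]:
        j -= 1
    if j < 0:
        return False  # a is entirely non-increasing: the last permutation
    # Find the rightmost l with a[j] < a[l] and swap the two elements.
    l = len(a) - 1
    while a[j] >= a[l]:
        l -= 1
    a[j], a[l] = a[l], a[j]
    # The suffix after j is non-increasing; reverse it to make it minimal.
    a[j + 1:] = reversed(a[j + 1:])
    return True
```

Starting from a sorted list and calling this repeatedly yields each *unique* permutation exactly once, even with repeated elements: for example [1, 1, 2] yields 3 permutations rather than 3! = 6.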

## Monday, April 28, 2008

### ParsCit

ParsCit is an open-source reference string parsing package developed by Min-Yen Kan et al. It is based on the Conditional Random Fields (CRF) toolkit CRF++. It is being used by the well-known computer science digital library CiteSeer^x.

## Tuesday, April 08, 2008

### More Data vs. Better Algorithms

The recent blog posts from Anand Rajaraman arguing that *more data usually beats better algorithms* (part 1, part 2 and part 3) remind me of a talk by David Hand two years ago --- Classifier Technology and the Illusion of Progress. There has also been discussion of this issue in Hal Daume III's blog post about Heuristics.

### Laplacian Kernel, Resistance Distance and Commute Time

The Laplacian kernel for a graph is interestingly connected to the resistance distance (the total resistance between two nodes) and the commute time (the average length of a random walk between two nodes) over the graph.
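The connection is concrete enough to verify numerically. Writing L+ for the Moore-Penrose pseudoinverse of the graph Laplacian (the kernel matrix), the resistance distance is R(i, j) = L+\[i, i\] + L+\[j, j\] - 2 L+\[i, j\], and the average commute time is C(i, j) = 2m R(i, j), where m is the number of edges. A sketch on the 3-node path graph:

```python
import numpy as np

# Path graph 0 - 1 - 2 (two unit edges, so m = 2).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian: degree matrix minus adjacency
Lp = np.linalg.pinv(L)           # Laplacian kernel (Moore-Penrose pseudoinverse)

def resistance(i, j):
    """Resistance distance between nodes i and j via the Laplacian kernel."""
    return Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]

m = A.sum() / 2                  # number of edges
print(resistance(0, 2))          # 2.0: two unit resistors in series
print(2 * m * resistance(0, 2))  # 8.0: average commute time between nodes 0 and 2
```

The commute time of 8 matches a direct random-walk calculation: the expected hitting time from node 0 to node 2 is 4, and by symmetry the return trip is another 4.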

## Sunday, March 30, 2008

### Human Insight vs. Number Crunching

It is interesting to read the story This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize together with the book Super Crunchers. On one hand, human intuition is usually less reliable and less accurate than number crunching. On the other hand, a little human insight goes a long way, which may sometimes point out the right direction for number crunching.

## Sunday, March 09, 2008

## Sunday, March 02, 2008

### SVDLIBC

SVDLIBC is a C library for computing Singular Value Decomposition (SVD). It is actually an improved version of the SVDPACKC library. Since the algorithm it implements, *las2*, has relatively low precision for the smaller singular values, it is particularly suitable for truncated SVD (as in LSI), where only the largest singular values and their singular vectors are needed.
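For illustration, the truncated SVD that SVDLIBC is typically used for can be sketched with numpy (dense and in-memory here, unlike SVDLIBC's sparse Lanczos routine; the toy matrix is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # a small term-document-like matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                             # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A

# By the Eckart-Young theorem, the Frobenius error equals the norm of the
# dropped singular values -- the smaller ones las2 computes less precisely.
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))  # True
```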

## Thursday, February 28, 2008

### Apache Mahout

The Lucene PMC has launched a project, Mahout, to build scalable Apache-licensed machine learning libraries based on the MapReduce mechanism. It will start by implementing, on Hadoop, the ten machine learning algorithms described in the paper Map-Reduce for Machine Learning on Multicore.