Monday, July 28, 2008

Using Python for Machine Learning

Python plus a C/C++ library seems to be a good combination for agile development in machine learning. See John Langford's post on Programming Languages for Machine Learning Implementations. There is a list of open source machine learning tools that support Python scripting. Orange is probably the most comprehensive among them; for example, it also includes data mining algorithms such as association rules.

Monday, July 14, 2008

Python defaultdict

I recently learned the trick of using the defaultdict class for frequency counting and smoothing from Peter Norvig's influential technical article How to Write a Spelling Corrector.

As its name suggests, defaultdict is like a regular Python dict except that a default value (in fact, a factory function that produces it) can be specified in advance. For example, the following piece of code uses the standard dict to build a term-frequency hash table.


tf = {}
for t in words:
    tf[t] = tf.get(t, 0) + 1

It can be made simpler and faster by using defaultdict from the collections module.

import collections
tf = collections.defaultdict(int)  # int() returns 0, the default count
for t in words:
    tf[t] += 1
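
The smoothing part of the trick comes from the factory: any zero-argument callable can be passed, so a lambda returning 1 gives every unseen word a pseudo-count of one. Here is a minimal sketch in that spirit; the function name train and the sample sentence are only illustrative assumptions, not code taken from the article.

```python
import collections

def train(words):
    # The factory is called with no arguments whenever a missing key is read.
    # Returning 1 means every word, seen or unseen, starts with a count of one.
    model = collections.defaultdict(lambda: 1)
    for w in words:
        model[w] += 1
    return model

# Illustrative sample text (an assumption, not real training data).
words = "the quick brown fox jumps over the lazy dog the end".split()
model = train(words)

print(model["the"])    # three occurrences plus the pseudo-count
print(model["zebra"])  # never seen, so the factory supplies the default
```

Note that merely reading a missing key, as in model["zebra"], also inserts it into the dictionary; use model.get(key) or a membership test when that side effect is unwanted.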


PS: I am glad to see that my colleague Roger Mitton's work on spell-checking is cited in the above article.

Saturday, July 12, 2008

The Strength of Weak Ties, in Online Advertising

The Strength of Weak Ties was studied by sociologists 35 years ago and rediscovered by physicists about 10 years ago in research on small-world networks. This theory provides a good explanation of a phenomenon noticed recently by Anand Rajaraman: Affinity and Herding Determine the Effectiveness of Social Media Advertising.