Monday, December 01, 2008
Python(x,y)
Python(x,y) is a free development environment for scientific and engineering computing based on Python, Qt and Eclipse. It is probably the most complete science-oriented Python distribution available to researchers.
Sunday, November 30, 2008
OpenCalais
Thomson Reuters provides OpenCalais, a free web service that can be used for information extraction from unstructured text/HTML/XML documents. It recognizes not only named entities but also facts and events. A Python interface to this web service, python-calais, is available.
Friday, November 21, 2008
Efficient SVM for Ranking
Ranking Vector Machine improves the efficiency of the standard Ranking SVM (as implemented in SVM-light) by (1) using the L_1 norm instead of the L_2 norm for regularisation, and (2) using a subset of the original instance vectors, rather than the pairwise difference vectors, as support vectors. It is said to be much faster than Ranking SVM when non-linear kernels are employed, though at some cost in accuracy (especially on small datasets).
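The pairwise-difference construction that the standard Ranking SVM relies on (and whose vectors the Ranking Vector Machine avoids using as support vectors) can be sketched as follows; the data layout here is just an illustrative assumption:

```python
def pairwise_differences(items):
    """items: list of (feature_vector, relevance) judgements for one query.
    Returns classification examples (x_i - x_j, +1) and (x_j - x_i, -1)
    for every pair where item i is judged more relevant than item j."""
    pairs = []
    for xi, yi in items:
        for xj, yj in items:
            if yi > yj:
                diff = [a - b for a, b in zip(xi, xj)]
                pairs.append((diff, +1))
                pairs.append(([-d for d in diff], -1))
    return pairs
```

With k judged items per query this produces O(k^2) training vectors, which is exactly the blow-up that motivates working with a subset of the original instance vectors instead.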
Friday, October 31, 2008
Six Sigma
The well-known management concept Six Sigma actually comes from the empirical rule that if there are six standard deviations (sigma) between the process mean and the nearest specification limit in a short-term study, then practically no items will fail to meet the specification in the long term (i.e., 3.4 defective parts per million opportunities, which corresponds to 4.5 sigma once the conventional 1.5-sigma long-term drift is subtracted), assuming the process is normally distributed.
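The 3.4 figure can be checked directly from the standard normal tail; a minimal sketch using Python's statistics module:

```python
from statistics import NormalDist

# A "six sigma" process with the conventional 1.5-sigma long-term drift
# leaves 6 - 1.5 = 4.5 sigma between the mean and the nearest
# specification limit; the defect rate is the upper tail beyond 4.5.
tail = 1 - NormalDist().cdf(4.5)
dpmo = tail * 1_000_000  # defects per million opportunities
print(round(dpmo, 1))  # prints 3.4
```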
Tuesday, October 21, 2008
Laplace Approximation
Laplace approximation is a simple and popular method that approximates a probability density function p(z) by fitting a Gaussian distribution to the local quadratic expansion of ln p(z) around its mode, where p'(z) vanishes. One important use of this technique is to approximate the integral of p(z). A good explanation of this method by David MacKay can be found here.
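As a quick illustration of using the approximation to compute an integral, take the unnormalised density f(z) = z^a e^(-z), whose exact integral over (0, ∞) is Γ(a+1); fitting a Gaussian at the mode recovers Stirling's formula (the choice of f and of a here is purely illustrative):

```python
import math

a = 10.0
# log f(z) = a*ln(z) - z, so the mode solves a/z - 1 = 0, i.e. z0 = a.
z0 = a
# Curvature at the mode: A = -(d^2/dz^2) log f(z0) = a / z0**2 = 1/a.
A = a / z0**2
# Laplace approximation of the integral: f(z0) * sqrt(2*pi / A).
laplace = (z0**a * math.exp(-z0)) * math.sqrt(2 * math.pi / A)
exact = math.gamma(a + 1)  # the exact integral of f over (0, inf)
print(laplace / exact)  # close to 1 -- this is Stirling's approximation
```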
Tuesday, August 05, 2008
Loss Functions and Prior Distributions
Andrew Gelman has an article that explains the relationship between loss functions and prior distributions. In short, using the L2 or L1 norm as a regulariser for regression is essentially equivalent to placing a Gaussian or Laplace prior on the parameters: the L2- or L1-regularised solution is the MAP estimate under the Gaussian or Laplace prior, respectively.
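A minimal numeric check of this equivalence in one dimension, on made-up data: with noise variance σ² and prior variance τ², the ridge (L2) estimate with λ = σ²/τ² should coincide with the posterior mode.

```python
# One-parameter linear model y = w*x + noise, noise variance sigma2,
# Gaussian prior w ~ N(0, tau2).  Toy data, purely for illustration:
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.4, 1.1, 1.4, 2.1]
sigma2, tau2 = 0.25, 1.0
lam = sigma2 / tau2  # the equivalent L2 regularisation strength

# Ridge / L2-regularised least squares, closed form in one dimension:
w_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# MAP estimate: minimise the negative log posterior on a fine grid.
def neg_log_post(w):
    lik = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    prior = w * w / (2 * tau2)
    return lik + prior

w_map = min((w / 10000 for w in range(-50000, 50000)), key=neg_log_post)
print(abs(w_ridge - w_map) < 1e-3)  # True: the two estimates coincide
```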
Monday, July 28, 2008
Using Python for Machine Learning
Python plus a C/C++ library seems a good combination for agile development in machine learning. Please refer to John Langford's post on Programming Languages for Machine Learning Implementations. There is a list of open-source machine learning tools that support Python scripting. Orange is probably the most comprehensive among them; e.g., it also contains data mining algorithms such as association rules.
Monday, July 14, 2008
Python defaultdict
I recently learned the trick of using the defaultdict class for frequency counting and smoothing from Peter Norvig's influential technical article How to Write a Spelling Corrector.
As its name suggests, defaultdict behaves like a regular Python dict except that a default value (in fact, a default factory) can be specified in advance. For example, the following piece of code uses the standard dict to build a term-frequency hash table.
tf = {}
for t in words:
    tf[t] = tf.get(t, 0) + 1
This can be made simpler and faster by using defaultdict from the collections module.
import collections
tf = collections.defaultdict(int)
for t in words:
    tf[t] += 1
PS: I am glad to see that my colleague Roger Mitton's work on spell-checking is cited in the above article.
Saturday, July 12, 2008
The Strength of Weak Ties, in Online Advertising
The Strength of Weak Ties was studied by sociologists 35 years ago and rediscovered by physicists about 10 years ago in research on small-world networks. This theory actually provides a good explanation of the phenomenon recently noticed by Anand Rajaraman: Affinity and Herding Determine the Effectiveness of Social Media Advertising.
Friday, June 27, 2008
To be a leader
The number of papers one publishes does not really mean much. To be a leader, you should have a clear focus and real impact. For example, the reason one receives an academic award is that one has contributed substantially to a specific research direction, not that one has published hundreds of papers.
The above words are said to be from Bruce Croft (a leading computer scientist in the field of information retrieval), though I could not find his original text.
Thursday, June 19, 2008
SAGE
I really believe that Python, not Matlab, is the future of scientific and engineering computing. SAGE, a free open-source alternative to Matlab and the like, has started to attract attention from the research community. Since it uses Python instead of an obscure language designed for a particular mathematics program, one can write programs that combine serious mathematics with anything else, such as natural language processing.
Sunday, June 01, 2008
Right Data vs. Better Models
Yehuda Koren, a member of the famous BellKor team in the Netflix Prize competition, recently wrote an article about the three levels of addressing the Netflix Prize. The insight that deciding what data to model is more important than choosing what model to use seems complementary to the issue of more data vs. better algorithms. This is also supported by the success story of Amazon's recommendation system.
Thursday, May 01, 2008
Generating all permutations: a non-recursive algorithm
A natural idea for generating all permutations of n elements is to write a recursive algorithm (the total work is necessarily O(n!) either way). The algorithm behind the C++ STL function next_permutation is a more elegant solution --- it generates each successor non-recursively and in place, handles duplicate elements, and produces the permutations in lexicographic order. The algorithm is described by the great computer scientist Donald Knuth in his book The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Here is his Draft of Section 7.2.1.2: Generating All Permutations.
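Knuth's algorithm (Algorithm L in that fascicle) can be sketched in Python; this mirrors what next_permutation does:

```python
def next_permutation(a):
    """Transform list a into its lexicographic successor, in place.
    Returns False when a is already the last (descending) permutation."""
    # 1. Find the largest j with a[j] < a[j+1].
    j = len(a) - 2
    while j >= 0 and a[j] >= a[j + 1]:
        j -= 1
    if j < 0:
        return False  # a is in descending order: no successor
    # 2. Find the largest l > j with a[j] < a[l].
    l = len(a) - 1
    while a[l] <= a[j]:
        l -= 1
    # 3. Swap them and reverse the (descending) tail to ascending.
    a[j], a[l] = a[l], a[j]
    a[j + 1:] = reversed(a[j + 1:])
    return True

# Starting from a sorted list yields every permutation exactly once:
a = [1, 2, 3]
perms = [tuple(a)]
while next_permutation(a):
    perms.append(tuple(a))
print(perms)  # all 6 permutations, in lexicographic order
```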
Monday, April 28, 2008
ParsCit
ParsCit is an open-source reference string parsing package developed by Min-Yen Kan et al. It is based on the Conditional Random Fields (CRF) toolkit CRF++. It is being used by the well-known computer science digital library CiteSeerX.
Tuesday, April 08, 2008
More Data vs. Better Algorithms
The recent blog posts from Anand Rajaraman arguing that more data usually beats better algorithms (part 1, part 2 and part 3) remind me of a talk by David Hand two years ago --- Classifier Technology and the Illusion of Progress. There has also been discussion of this issue in Hal Daume III's blog post about Heuristics.
Laplacian Kernel, Resistance Distance and Commute Time
The Laplacian kernel for a graph is interestingly connected to the resistance distance (the total resistance between two nodes) and the commute time (the average length of a random walk between two nodes) over the graph.
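As a concrete check, here is a small numpy sketch on a three-node path graph: the resistance distance between the two endpoints should be 2 (two unit resistors in series), and the commute time equals 2m times the resistance distance, where m is the number of edges. The resistance distance comes from the Moore-Penrose pseudoinverse L+ of the Laplacian as R_ij = L+_ii + L+_jj - 2 L+_ij.

```python
import numpy as np

# Path graph 0 - 1 - 2 with unit-weight edges.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian
Lp = np.linalg.pinv(L)           # Moore-Penrose pseudoinverse

def resistance(i, j):
    return Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]

m = A.sum() / 2                  # number of edges
r = resistance(0, 2)
commute = 2 * m * r              # average round trip 0 -> 2 -> 0
print(r, commute)                # resistance ~ 2.0, commute time ~ 8.0
```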
Sunday, March 30, 2008
Human Insight vs. Number Crunching
It is interesting to read the story This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize together with the book Super Crunchers. On one hand, human intuition is usually less reliable and less accurate than number crunching. On the other hand, a little human insight goes a long way, which may sometimes point out the right direction for number crunching.
Sunday, March 09, 2008
Sunday, March 02, 2008
SVDLIBC
SVDLIBC is a C library for computing the Singular Value Decomposition (SVD). It is actually an improved version of the SVDPACKC library. Since the algorithm it implements, las2, has relatively low precision for the smaller singular values, it is particularly suitable for truncated SVD (as in LSI), where only the largest singular values are needed.
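SVDLIBC itself is C, but the idea of a truncated SVD is easy to illustrate with numpy on a tiny made-up term-document matrix; by the Eckart-Young theorem, the rank-k reconstruction is the best rank-k approximation in the Frobenius norm:

```python
import numpy as np

# A toy term-document count matrix (rows: terms, columns: documents).
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep only the k largest singular values (as LSI does)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# The approximation error is exactly the energy in the discarded
# singular values: ||A - A_k||_F = sqrt(sum of s[k:]**2).
```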
Thursday, February 28, 2008
Apache Mahout
The Lucene PMC has launched a project, Mahout, to build scalable Apache-licensed machine learning libraries on top of the MapReduce mechanism. It will start by implementing, with Hadoop, the ten machine learning algorithms described in Map-Reduce for Machine Learning on Multicore.