Wednesday, November 11, 2009

The myth about the Internet

Walter Willinger et al. recently published a paper in which the scale-free network model of the preferential attachment type for Internet is said to be a myth, as it is based on fundamentally flawed traceout data. Furthermore, they criticize the currently popular data-fitting approach to network science and argue that it should be replaced by the reverse-engineering approach.

Wednesday, August 12, 2009

Large networks are not modular

A pretty striking finding in the WWW'08 paper from Leskovec etc. is that in nearly every network dataset they examined, there are tight but almost trivial communities at very small scales (up to around 100 nodes), while at larger scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like".

Thursday, July 30, 2009

Spectral Graph Partitioning

There are a number of methods in the family of Spectral Graph Partitioning, including the traditional min-cut and various balanced cut criteria (such as ratio-cut, average-cut, normalized-cut and minmax-cut). Each method uses a different objective function and consequently a different definition of partition (cluster) indicator vector. The following two tutorials on Spectral Clustering both contain a good summary of these methods.
[1] Spectral Clustering, ICML 2004 Tutorial by Chris Ding
[2] A Tutorial on Spectral Clustering by Ulrike von Luxburg

Sunday, June 28, 2009


Scale-Invariant Feature Transform (or SIFT) is one of the most popular algorithms in computer vision to detect and describe local features in images. The algorithm was published by David Lowe in 1999, and it is now a patent of the University of British Columbia.

The SIFT approach, for image feature generation, takes an image and transforms it into a "large collection of local feature vectors". Each of these feature vectors is invariant to any scaling, rotation or translation of the image. This approach shares many features with neuron responses in primate vision. To aid the extraction of these features the SIFT algorithm applies a 4 stage filtering approach: (1) Scale-Space Extrema Detection (2) Keypoint Localistaion (3) Orientation Assignment (4) Keypoint Descriptor.

Speeded Up Robust Features (SURF) is said to have similar performance to SIFT, while at the same time being faster.

Monday, May 25, 2009

Centrality measures in network analysis

There are four measures of centrality that are widely used in network analysis: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Google's PageRank is a variant of the eigenvector centrality measure.

Dirichlet prior for smoothing

Using Dirichlet distribution as the prior for smoothing in statistical language modeling leads to additive smoothing (a.k.a. Lidstone smoothing) that includes Laplace smoothing (i.e., add one) and Jeffreys-Perks smoothing (i.e., add half) (a.k.a. Expected Likelihood Estimation) as special cases. This family of smoothing methods can be regarded as a document dependent extension of linear interpolated smoothing.

It has been shown that Laplace smoothing, though most popular (in textbooks), is often inferior to Lidstone smoothing (using a value less than one) in modeling natural language data, e.g., for text classification tasks (see Athena: Mining-based interactive management of text databases).

Saturday, May 23, 2009


PyMat exposes the MATLAB engine interface allowing Python programs to start, close, and communicate with a MATLAB engine session. In addition, the package allows transferring matrices to and from an MATLAB workspace. These matrices can be specified as NumPy arrays, allowing a blend between the mathematical capabilities of NumPy and those of MATLAB.

Sunday, May 17, 2009

Sunday, May 10, 2009


PyLucene is a Python extension that allows the usage of Lucene's text indexing and searching capabilities from Python.

JPype vs Jython

It is often desirable to access Java libraries (such as Weka and Mallet) in Python code. There are currently two possible approaches: JPype and Jython. The former looks more attractive, as it is achieved not through re-implementing Python (as the latter does), but rather through interfacing at the native level in both Virtual Machines.

Curve Fitting in Matlab

EzyFit is a free toolbox for Matlab that enables quick curve fitting of 1D data using arbitrary (nonlinear) fitting functions. It can work in both interactive and command-line modes.

Converting HTML to text using Python

html2text is a Python script that does a good job in extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text: it produces markdown that would then have to be turned into plain text.

Stanford Named Entity Recognizer

Stanford Named Entity Recognizer is an open source named entity recognition tool implemented in Java. The software is based on the linear chain Conditional Random Field (CRF) sequence model.

Wednesday, May 06, 2009

Anytime Algorithms

Anytime algorithms are algorithms that trade execution time for quality of results. In particular, an anytime algorithm always has a best-so-far answer available, and the quality of the answer improves with execution time. The user may examine this answer at any time and choose to terminate, temporarily suspend, or continue the algorithm’s execution until completion. Using an anytime algorithm can mitigate the need for high performance hardware.

[1] Shlomo Zilberstein: Using Anytime Algorithms in Intelligent Systems. AI Magazine 17(3): 73-83 (1996)
[2] Xiaopeng Xi, Ken Ueno, Eamonn J. Keogh, Dah-Jye Lee: Converting non-parametric distance-based classification to anytime algorithms. Pattern Anal. Appl. 11(3-4): 321-336 (2008)

Wednesday, April 29, 2009

Agile Research Methodology

There is a great similarity between scientific research and software development. I hereby propose Agile Research Methodology that consists of the following 12 best practices, as an analogy to eXtreme Programming.

  1. Planning Game = Planning Game

  2. Test Driven Development = Evaluation/Review Driven

  3. Pair Programming = Joint Work

  4. On-Site Customer = Real-World Problems

  5. Continuous Integration = Continuous Reflection

  6. Refactoring = Rewriting

  7. Small Releases = Frequent Publications

  8. Coding Standards = Benchmark Data

  9. Collective Ownership = Co-authorship

  10. Simple Design = Simple Idea

  11. System Metaphor = Approach Metaphor

  12. 40-Hour Week = 40-Hour Week

Saturday, April 18, 2009

Wednesday, April 08, 2009

Compressed Sensing uses L1 norm

It is interesting that Compressed Sensing also uses the trick of L1 norm optimisation to achieve more efficient computation (than L0 norm) and sparser solution (than L2 norm).

Using LaTeX in any Web page

With jsTeXrender, a small JavaScript program from, using \LaTeX to show mathematical formula in Web pages (inclusing Blogger posts) becomes a piece of cake. Please refer to their documentation for more information.


It works!

Saturday, March 28, 2009

Semantic Web vs. Semantic Interpretation

The article on The Unreasonable Effectiveness of Data also contrasts Semantic Web and Semantic Interpretation. Research on the Semantic Web aims to enable machines to comprehend semantic documents and data so as to achieve software service interoperability. It is fundamentally different with the problem of Semantic Interpretation, i.e., understanding human speech and writing. Roughly speaking, Semantic Web creates precise data, while Semantic Interpretation deals with inherently imprecise data --- natural language. It is important to make a clear distinction between the above two different meanings of 'semantic' which are both widely used in computer science.

Follow the Data

Recently some Google researchers published an article to advocate The Unreasonable Effectiveness of Data. They argue that AI researchers should embrace the complexity and follow the data rather than attempting to create elegant theories.

  • "Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data."

  • "Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail."

For example, due to the availability of Web-scale text data, natural language applications can achieve better performance by simply relying on word occurrence and co-occurrence statistics instead of complex latent factor analysis. The former approach is also more scalable because it only requires online learning that can be easily parallelized.

This clearly echoes the previous posts More Data vs. Better Algorithms and Right Data vs. Better Models.

Wednesday, March 25, 2009

Yahoo's Key Terms Extraction

It seems that the Key Terms feature of Yahoo! Search BOSS is based on the technique used to be called Prisma. It has actually been described in the following SIGIR-2003 paper:
Using terminological feedback for web search refinement: a log-based study.

Saturday, March 21, 2009

The Bayesian approach to ML

To be Bayesian or not to be Bayesian, that is the question. Zoubin Ghahramani recently gave a thought-provoking and humorous talk on this topic.

Wednesday, February 25, 2009

Market Data vs. Real Data

The story about the formula that killed Wall Street illustrates the extreme importance of obtaining real-world first-hand data. It is the availability of CDS market data that makes the correlation between default risks quantitatively measurable using the Gaussian copula model in spite of the scarcity of real historical default data, and thus enables the invention and proliferation of CDOs. However, it is also because that the model is based on the CDS market data rather than the real historical default data, applying it without prudence can lead to catastrophic outcomes as witnessed by the recent global financial crisis. The key to his success is also his undoing. When there are no sufficient data, the so-called wisdom of the crowd is not reliable at all.

Monday, February 23, 2009

Ullman set

The Ullman set is a clever data structure for a set of n integers densely distributed over a range, say from 0 to N-1. It uses O(N) space, and supports construction, destruction, adding, deleting, membership-test as well as cardinality operations in O(1) time. Moreover iterating over its members only takes O(n) time.

Thursday, February 19, 2009

Approximating Binomial Distributions

The Poisson distribution can be derived as a limiting case to the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed. Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05. According to this rule the approximation is excellent if n ≥ 100 and np ≤ 10.
--- Wikipedia

Monday, February 02, 2009

Take the computations to the data

The legendary computer scientist Jim Gray is said to have a conviction that when you have lots of data, you take the computations to the data rather than the data to the computations. Does this imply that (statistical) machine learning that relies on a large amount of data should be built into the database systems? Maybe a database system that directly manages uncertainty, such as Trio, is indeed a future trend.

Sunday, February 01, 2009

Fewer Searches, Higher Price

A recent paper shows empirical evidences that "search engine advertising is most valuable when firms have just a few hard-to-reach customers": lawyers will bid the price of a keyword query higher when there are fewer searches for it because they need a way to find these niche customers. In other words, the strength of search engine advertising is indeed in exploiting the long tail of advertsing which is impossible or difficult for traditional broadcast media.

Sunday, January 25, 2009

Terminology for mathematical statements

The following snippet from Wikipedia describes the correct usage of terminology for mathematical statements (which I often get confused about).

Theorems are often indicated by several other terms: the actual label "theorem" is reserved for the most important results, whereas results which are less important, or distinguished in other ways, are named by different terminology.

  • A proposition is a statement not associated with any particular theorem. This term sometimes connotes a statement with a simple proof, or a basic consequence of a definition that needs to be stated, but is obvious enough to require no proof. The word proposition is sometimes used for the statement part of a theorem.
  • A lemma is a "pre-theorem", a statement that forms part of the proof of a larger theorem. The distinction between theorems and lemmas is rather arbitrary, since one mathematician's major result is another's minor claim. Gauss's lemma and Zorn's lemma, for example, are interesting enough that some authors present the nominal lemma without going on to use it in the proof of a theorem.
  • A corollary is a proposition that follows with little or no proof from one other theorem or definition. That is, proposition B is a corollary of a proposition A if B can readily be deduced from A.
  • A claim is a necessary or independently interesting result that may be part of the proof of another statement. Despite the name, claims must be proved.

There are other terms, less commonly used, which are conventionally attached to proven statements, so that certain theorems are referred to by historical or customary names. For examples:

A few well-known theorems have even more idiosyncratic names. The division algorithm is a theorem expressing the outcome of division in the natural numbers and more general rings. The Banach–Tarski paradox is a theorem in measure theory that is paradoxical in the sense that it contradicts common intuitions about volume in three-dimensional space.

An unproven statement that is believed to be true is called a conjecture (or sometimes a hypothesis, but with a different meaning from the one discussed above). To be considered a conjecture, a statement must usually be proposed publicly, at which point the name of the proponent may be attached to the conjecture, as with Goldbach's conjecture. Other famous conjectures include the Collatz conjecture and the Riemann hypothesis.

Tuesday, January 20, 2009

Omitting vowels for shorthand

Much to my surprise, omitting vowels makes shorthand easier to read, according to the phoenix theory.

Here are some notes about EasyScript speedwriting.

Tuesday, January 13, 2009

The Decline of Text

Prof Marti Hearst recently has an essay on Edge talking about the decline of text. It is certainly the trend that speech/video will become more and more important, but does it mean the discipline of IR is going to be obsolete? I do not think so, because the essence of IR, in my opinion, is the statistical approach to processing and managing large amounts of information expressed in natural language. Therefore no matter the data are in text format or not, the need of IR is still there.

Friday, January 02, 2009

Using CSV files in Python

It is often desirable to store experimental data in Comma Separated Value (CSV) files so that we can easily inspect/process them with a text editor or more advanced data management/analysis software (like Microsoft Excel).

There are a number of ways of reading/writing CSV files in Python.

  • Python's string and file functions are adequate for simple CSV files (e.g., those with only numeric data).

  • Using the standard module csv.

  • Using the numpy functions loadtxt/savetxt.

  • Using the matplotlib.mlab functions such as load/save and csv2rec/rec2csv.