Friday, December 21, 2007


A Windows (32-bit) binary version of svmsgd that I compiled using MinGW with zlib linked statically is available here.

Saturday, December 15, 2007

The Fastest SVM

The quest for the fastest SVM learning algorithm is continuing.

Leon Bottou reported a surprising finding: a classic optimization method, Stochastic Gradient Descent, is amazingly fast for training linear SVMs or CRFs. His program svmsgd runs much faster than SVMperf and LIBLINEAR on very large datasets such as RCV1-v2.
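This is not Bottou's actual code, but the flavor of the approach can be sketched with a Pegasos-style stochastic subgradient update on the regularized hinge loss (toy data; all names here are illustrative):

```python
import numpy as np

def sgd_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Train a linear SVM by stochastic (sub)gradient descent on the
    regularized hinge loss: lam/2 ||w||^2 + mean(max(0, 1 - y * w.x))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 1
    for _ in range(epochs):
        for i in rng.permutation(n):
            eta = 1.0 / (lam * t)        # decreasing step size
            w *= (1 - eta * lam)         # shrinkage from the L2 regularizer
            if y[i] * X[i].dot(w) < 1:   # margin violated: hinge subgradient
                w += eta * y[i] * X[i]
            t += 1
    return w

# toy linearly separable problem: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.3, (50, 2)), rng.normal(-2, 0.3, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = sgd_svm(X, y)
acc = np.mean(np.sign(X.dot(w)) == y)
```

Each update touches a single example, which is what makes the method scale to datasets like RCV1-v2.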

Edward Chang's team has released the code of PSVM, a parallel implementation of SVM that can achieve an almost linear reduction in both memory use and training time.

Thursday, December 13, 2007

Metaweb and KnowItAll

Metaweb Technologies, a start-up company covered in the news articles below, seems to share the ambition of the KnowItAll project: both try to turn the Web into a huge database of structured knowledge.

New York Times: Start-Up Aims for Database to Automate Web Searching
The Economist: Sharing what matters

Search Internationalization and Personalization

Some Web search engines employ different relevance (ranking) algorithms for different countries. Leaving the language issues aside, I wonder if the underlying technology of internationalized search is essentially equivalent to that of personalized search (such as Personalized PageRank) since we can consider the whole nation as a virtual user.
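For instance, Personalized PageRank biases the teleportation distribution toward a user's preferred pages; treating a whole country as a single virtual user would amount to choosing a country-specific teleportation distribution. A minimal sketch on a toy 4-page graph (all numbers hypothetical):

```python
import numpy as np

def personalized_pagerank(P, v, alpha=0.85, iters=200):
    """Power iteration for r = alpha * P r + (1 - alpha) * v, where P is a
    column-stochastic link matrix and v is the teleportation
    (personalization) distribution of the 'virtual user'."""
    r = v.copy()
    for _ in range(iters):
        r = alpha * P.dot(r) + (1 - alpha) * v
    return r

# toy web: column j gives the out-link distribution of page j
P = np.array([[0.0, 0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
v = np.array([1.0, 0.0, 0.0, 0.0])   # this 'user' teleports only to page 0
r = personalized_pagerank(P, v)
```

Different countries would simply plug in different v vectors while sharing the same link graph.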

Monday, December 10, 2007

Transfer Learning

Hal Daume III recently wrote an insightful blogpost on the relationship between Transfer Learning and Domain Adaptation. I think another problem closely related to Transfer Learning is Multi-Task Learning.

The Apex Lab in SJTU recently published a number of papers on this topic and released a dataset.

Tuesday, November 27, 2007

Version Control for Papers

I found that TortoiseSVN is very handy for not only code version control but also paper version control. The whole history of all the related (text or binary) files for a research project can be saved and managed effectively.

Wednesday, November 21, 2007


Chih-Jen Lin, the researcher who developed the well-known LIBSVM, has released LIBLINEAR, a program for large-scale linear classification (e.g., on text data) that supports both logistic regression and L2-loss linear SVM using a trust region Newton method.

Tuesday, November 20, 2007

Scraping Web Pages with Python

Beautiful Soup is a Python HTML/XML parser. The following features make it stand out for screen-scraping: (1) high tolerance of web pages with bad markup; (2) simple methods for data extraction from web pages; (3) automatic conversion of web pages to Unicode/UTF-8 encoding.
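A minimal sketch of features (1) and (2), using the current bs4 incarnation of the library (the API at the time of writing differed slightly, e.g. findAll instead of find_all):

```python
from bs4 import BeautifulSoup

# note the sloppy markup: the <p> tag is never closed
html = ('<html><body><p class="title">Hello'
        '<a href="http://example.com">a link</a></body></html>')
soup = BeautifulSoup(html, "html.parser")

title = soup.find("p", class_="title").get_text()  # text inside the <p>
links = [a["href"] for a in soup.find_all("a")]    # all hyperlink targets
```

Despite the unclosed tag, the parse tree is still navigable with the same simple find/find_all calls.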

Sunday, November 18, 2007

Tanimoto coefficient

The Tanimoto coefficient, like the cosine similarity in the vector space model, is a similarity measure between two vectors. It is also called the extended Jaccard coefficient because it reduces to the standard Jaccard coefficient in the case of binary vectors.
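In symbols, T(a, b) = a·b / (|a|² + |b|² − a·b). A minimal sketch:

```python
def tanimoto(a, b):
    """Tanimoto (extended Jaccard) coefficient:
    a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# on binary vectors it equals the Jaccard coefficient:
# sets {0,1} and {0,2} share 1 of 3 elements -> 1/3
t = tanimoto([1, 1, 0], [1, 0, 1])
```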

Wednesday, November 14, 2007

Web Search Results Clustering based on Related Queries

Although automatic organization of Web search results should be helpful, search engines that offer clustered search results, such as Clusty from Vivisimo, have achieved only limited success so far. The underlying reason may be that (1) the created clusters are not really consistent with users' search interests and (2) the extracted labels for clusters are not really readable or informative from the users' perspective.

I have an idea that may sound silly :) Many search engines return a list of related queries in addition to the search results for a given query. Such related queries are real queries extracted from search logs --- they reflect users' real information needs. By using related queries as search result categories, we may be able to get a user-oriented clustering of search results. This approach is simple to understand and easy to implement. Furthermore, it can be done on the fly, just in time.

For example, a clustering interface for Windows Live Search could be developed in the following way.

  • Given a query q such as 'jaguar'.

  • Get the top 100 search results for q through Live Search SDK.

  • Get the related queries of q, such as 'jaguar animal' and 'jaguar car', by scraping the 'Related searches' column in the search results page directly, or using the ITermSuggestion provider in the Microsoft adCenter Keyword Services Platform.

  • For each related query r_i, create a corresponding cluster c_i with the label r_i, and assign all search results containing the r_i terms to the cluster c_i.

  • Present the clustered search results in a Clusty-like interface.
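The assignment step (the fourth bullet) can be sketched as follows, with hypothetical snippets standing in for real Live Search results; note the deliberately naive substring matching ('car' also matches 'cars'):

```python
def cluster_by_related_queries(results, related_queries):
    """Assign each search result to the cluster of every related query
    whose terms all appear in the result's text (case-insensitive)."""
    clusters = {rq: [] for rq in related_queries}
    for res in results:
        text = res.lower()
        for rq in related_queries:
            if all(term in text for term in rq.lower().split()):
                clusters[rq].append(res)
    return clusters

# hypothetical snippets standing in for real search results
results = [
    "Jaguar cars official site - luxury car maker",
    "The jaguar is the largest animal of the cat family in the Americas",
    "Jaguar animal facts and habitat",
]
clusters = cluster_by_related_queries(results, ["jaguar car", "jaguar animal"])
```

A production version would of course need smarter term matching (stemming, phrase handling) and a fallback cluster for unassigned results.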

Saturday, November 10, 2007


Using microformats within HTML code provides additional formatting and semantic data that can be used by applications. This approach to providing "more intelligent data" on the Web has a lower entry barrier, compared with the Semantic Web.

Friday, November 02, 2007

To be better than Google

The following articles review three possible approaches to outdoing Google, and explain why we still intuitively feel that Google is going to remain the king of search.

  • Better Quality

  • Better Interface

  • Vertical Search

The Race to Beat Google
Search 2.0 - What's Next

Spectral Regression

The recently proposed Spectral Regression approach to dimensionality reduction is simply Laplacian Eigenmap plus Regularized Linear Regression. Just that simple.
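A rough illustration of those two steps (an illustrative numpy sketch with an arbitrary heat-kernel affinity graph, not the authors' code):

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(30, 3)                 # 30 data points in 3-d

# Step 1: Laplacian Eigenmap. Build a (fully connected, heat-kernel)
# affinity graph and solve the generalized eigenproblem W y = lam D y
# via the symmetric form D^{-1/2} W D^{-1/2}.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2)
D = W.sum(1)
S = W / np.sqrt(np.outer(D, D))           # D^{-1/2} W D^{-1/2}
vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
y = vecs[:, -2] / np.sqrt(D)              # skip the trivial top eigenvector

# Step 2: Regularized (ridge) linear regression of the embedding y on X
# yields a linear projection a, so new points embed as x . a.
delta = 0.1
a = np.linalg.solve(X.T @ X + delta * np.eye(3), X.T @ y)
```

The point of the regression step is that it sidesteps the dense eigenproblem on the original features and gives an explicit out-of-sample mapping.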

Thursday, July 26, 2007

Some Papers on Graph-based Text Analysis

Gunes Erkan, Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. J. Artif. Intell. Res. (JAIR) 22: 457-479 (2004)

Oren Kurland, Lillian Lee: PageRank without hyperlinks: structural re-ranking using links induced by language models. SIGIR 2005: 306-313.

Oren Kurland, Lillian Lee: Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models. SIGIR 2006: 83-90.

Popular Science Books on Complex Networks

Duncan J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness, Princeton University Press, 2003.

Duncan J. Watts, Six Degrees: The Science of a Connected Age, New Ed edition, Vintage, 2004.

Albert-Laszlo Barabasi, Linked: How Everything is Connected to Everything Else, Plume, 2003.

Mark Buchanan, Nexus: Small Worlds and the Groundbreaking Theory of Networks, W. W. Norton & Company, 2003.

Monday, July 16, 2007

The secret of getting ahead

"The secret of getting ahead is getting started. The secret of getting started is breaking your complex overwhelming tasks into small manageable tasks, and then starting on the first one." --- Mark Twain

Saturday, June 02, 2007

Where do I end up

"I rarely end up where I was intending to go, but often I end up somewhere that I needed to be".
---Douglas Adams, The Long Dark Tea-Time of the Soul.

Tuesday, May 08, 2007

Why share ...?

There's no limit to what you can accomplish if you don't care who gets the credit. -- anonymous

Monday, May 07, 2007

Enter argmax (argmin) in LaTeX

Here is the method to define a LaTeX \argmax (\argmin) command, which is not built-in.
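For reference, the common amsmath-based definition (presumably what the linked method amounts to) is:

```latex
\usepackage{amsmath}

% \argmax typesets like \max, with the subscript placed
% underneath in display style thanks to the starred form:
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

% usage:
%   \hat{w} = \argmax_{w} \; p(w \mid x)
```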


Tuesday, April 24, 2007

An Explanation of SVD

Simon Funk presented a nice explanation of SVD in a recent KDnuggets interview. He has applied SVD to the Netflix Prize challenge, and has open-sourced his code!

Tuesday, April 17, 2007

Quotes about Paper Reviewing

"Paper which ultimately turns out to be highly regarded is initially rejected."

"Acceptance or rejection is rather random at times."

-- John Langford's blog post An ICML reject.

"Paper acceptance is a random process, at least for relatively senior but not highly senior researchers."

-- Shuicheng Yan's blog post Thinking at the coming of the 60-th paper.

Saturday, March 24, 2007

Estimating Query Hardness

David Carmel, Elad Yom-Tov, Adam Darlow, Dan Pelleg: What Makes A Query Difficult?, SIGIR 2006: 390-397.

Vishwa Vinay, Ingemar J. Cox, Natasa Milic-Frayling, Ken Wood: On Ranking the Effectiveness of Searches, SIGIR 2006: 398-404.

Yun Zhou, Bruce Croft: Query Performance Prediction in Web Search Environments, SIGIR 2007.

Friday, March 23, 2007

Wikipedia XML Corpus

There is a large XML corpus based on Wikipedia available for research.

Sociocentric vs. Egocentric

While sociocentric network/graph analysis is oriented toward the global 'society', egocentric network/graph analysis is centered around an individual node and therefore provides a personalized view of its local neighborhood environment.

Saturday, March 17, 2007

Edge Separator vs. Vertex Separator

An edge separator is a subset of a graph's edges whose removal disconnects the graph. Different weakness measures for the edge separator lead to different ways of graph partitioning or clustering, such as normalized cuts, conductance, edge betweenness, modularity, and relative neighborhoods. However, regardless of the weakness measure used, edge separators sometimes fail to capture the cohesion of graphs, especially in the presence of overlapping clusters. While the existence of a weak edge separator in a graph is sufficient to make the graph noncohesive, it is not a necessary condition.

A vertex separator is a subset of the graph's nodes whose removal leaves the graph disconnected. Note that sparse vertex separators subsume weak edge separators: if the graph has a weak edge separator, then it must also have a sparse vertex separator, but the converse is not true.

--- Ziv Bar-Yossef, Ido Guy, Ronny Lempel, Yoelle S. Maarek and Vladimir Soroka, Cluster Ranking with an Application to Mining Mailbox Networks, Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), pp. 63--74, Hong Kong, China, 2006.

Thursday, March 15, 2007


The Motzkin-Straus theorem establishes a remarkable connection between the maximal/maximum cliques of an unweighted undirected graph G and the local/global solutions of a quadratic programming problem. It has motivated several clique-finding techniques. There are generalizations of this result to edge-weighted and vertex-weighted graphs.
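For the unweighted case the statement is:

```latex
% Motzkin--Straus (1965): for a graph $G$ on $n$ vertices with
% adjacency matrix $A$ and clique number $\omega(G)$,
\max_{x \in \Delta} \; x^{\mathsf{T}} A x \;=\; 1 - \frac{1}{\omega(G)},
\qquad
\Delta = \Bigl\{ x \in \mathbb{R}^{n} : x_i \ge 0,\ \textstyle\sum_i x_i = 1 \Bigr\}.
% A global maximizer is obtained by spreading weight $1/\omega(G)$
% uniformly over the vertices of any maximum clique.
```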

Wednesday, March 07, 2007

10 Challenging Problems in Data Mining Research

At ICDM'05, Qiang Yang and Xindong Wu presented 10 Challenging Problems in Data Mining Research.
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data and High Speed Data Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

I am particularly interested in Problems 2, 4, and 5, which share one common theme: Learning/Mining with Very Large Graphs.

Sunday, March 04, 2007

Free eBook on Matrix Computation

The Matrix Cookbook is a mathematical desktop reference on matrices that is freely available online.

Saturday, February 17, 2007

Normalized Latent Semantic Indexing

Laplacian Eigenmap on the term-document bipartite graph with adjacency matrix W = [0, A; A^T, 0] is actually equivalent to Latent Semantic Indexing (Singular Value Decomposition) on the normalized term-document matrix A_n = D_1^{-1/2} A D_2^{-1/2}. This can be easily derived from the fact that the generalized eigenvector problem (D-W) y = \lambda D y can be transformed into a standard eigenvector problem D^{-1/2} (D-W) D^{-1/2} x = \lambda x, i.e., D^{-1/2} W D^{-1/2} x = (1-\lambda) x, where x = D^{1/2} y. See the paper Normalized Cuts and Image Segmentation, p. 12. Note that we do not use D^{-1} W here because it is not guaranteed to be symmetric. Refer to the book Kernel Methods for Pattern Analysis, section 6.4. It is not yet clear whether this kind of normalization leads to better empirical performance.

To my knowledge, the proof of this connection has been presented in the following two papers, but in an unnecessarily complicated way.
Co-clustering documents and words using Bipartite Spectral Graph Partitioning
Bipartite Graph Partitioning and Data Clustering
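The equivalence is also easy to check numerically: the eigenvalues of the normalized bipartite adjacency matrix D^{-1/2} W D^{-1/2} are exactly the singular values of A_n and their negatives (plus zeros). A small numpy sketch:

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(4, 3) + 0.1           # term-document matrix (positive, so all degrees > 0)
d1, d2 = A.sum(1), A.sum(0)              # term and document degrees
An = A / np.sqrt(np.outer(d1, d2))       # A_n = D_1^{-1/2} A D_2^{-1/2}

# normalized adjacency of the bipartite graph W = [0 A; A^T 0]
M = np.block([[np.zeros((4, 4)), An],
              [An.T, np.zeros((3, 3))]])

sv = np.linalg.svd(An, compute_uv=False)   # singular values, descending
ev = np.linalg.eigvalsh(M)                 # eigenvalues, ascending
top = np.sort(ev)[::-1][:len(sv)]          # the largest eigenvalues
```

The top eigenvalue is 1, corresponding to the trivial eigenvector D^{1/2} 1 of the connected bipartite graph.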

Manifold Learning

Manifold Learning or Nonlinear Dimensionality Reduction has been a hot topic for several years, particularly since the publication of ISOMAP and LLE in Science (December 2000).

A collection of Manifold Learning papers.
A collection of Manifold Learning codes (in MatLab).

John Langford has an insightful blog post: Why Manifold-Based Dimension Reduction Techniques?

Datasets in LIBSVM format

The LIBSVM Data page contains a number of preprocessed and formatted datasets for various machine learning problems. It is very helpful when one wants to try a new machine learning algorithm (not necessarily SVM) without the hassle of preprocessing and formatting.
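For reference, the format is one example per line, `<label> <index>:<value> ...`, with 1-based feature indices and omitted features treated as zero. A minimal parser sketch:

```python
def parse_libsvm_line(line):
    """Parse one line of LIBSVM format: '<label> <idx>:<val> <idx>:<val> ...'
    Indices are 1-based; absent features are implicitly zero (sparse)."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for tok in tokens[1:]:
        idx, val = tok.split(":")
        features[int(idx)] = float(val)
    return label, features

label, feats = parse_libsvm_line("+1 1:0.5 3:1.25 10:1")
```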

Thursday, February 15, 2007

Laplace Prior and Sparsity

Sparse parameter estimation can be achieved in the Bayesian framework by using the Laplace (double exponential) prior distribution. It is known that the MAP estimates using the Laplace prior are the same as those produced by applying the lasso algorithm that minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients.
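The sparsity effect is easiest to see in the orthonormal-design case, where the lasso (equivalently, Laplace-prior MAP) solution is just coordinate-wise soft-thresholding of the least-squares estimates:

```python
import numpy as np

def soft_threshold(z, t):
    """Per-coordinate lasso / Laplace-MAP solution for an orthonormal design:
    shrink toward zero by t, clipping to exactly zero inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

ols = np.array([3.0, 0.5, -2.0, 0.1, 0.0])  # ordinary least-squares estimates
w = soft_threshold(ols, 1.0)                # lasso with penalty t = 1
```

Small coefficients are driven to exactly zero, unlike the Gaussian (ridge) prior, which only shrinks them.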

The Bayesian Binary Regression (BBR) and Bayesian Multinomial Regression (BMR) software packages implement this idea and have shown good performance in text classification.

Challenges in Statistical Machine Learning

John Lafferty and Larry Wasserman have recently published an article talking about the Challenges in Statistical Machine Learning. The following problems are considered current challenges that cut across both statistics and machine learning communities.

  • Sparse Learning in High Dimensions

  • Semi-Supervised Learning

  • Relation between Computation and Risk

  • Structured Prediction

Tuesday, January 30, 2007

e^x and e^ix

As x -> infinity, e^x -> infinity, but in contrast e^ix always stays on the unit circle according to Euler's formula. This clever trick is said to be an important step that finally led to the discovery of Fourier series.
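The contrast is easy to check numerically:

```python
import math, cmath

x = 50.0
real_growth = math.exp(x)     # e^x grows without bound
z = cmath.exp(1j * x)         # e^{ix} stays on the unit circle: |z| = 1
# Euler's formula: e^{ix} = cos x + i sin x
euler = complex(math.cos(x), math.sin(x))
```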

Saturday, January 27, 2007

Rank Correlation Coefficients

To measure the correlation between two rankings rather than two random variables, we can consider using the following non-parametric correlation coefficients.
Spearman's ρ
Kendall's τ
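Both coefficients are easy to compute directly for rankings without ties (a minimal sketch):

```python
def spearman_rho(rx, ry):
    """Spearman's rho for two rankings with no ties:
    1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(rx, ry):
    """Kendall's tau: (concordant pairs - discordant pairs) / (n choose 2)."""
    n = len(rx)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (rx[i] - rx[j]) * (ry[i] - ry[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

rx = [1, 2, 3, 4, 5]   # ranking by system A
ry = [2, 1, 4, 3, 5]   # ranking by system B
```

Spearman's rho is just Pearson correlation on the ranks, while Kendall's tau counts pairwise order agreements, so they can disagree on the same pair of rankings.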

Hellinger metric

The square root of the Jensen–Shannon divergence (a symmetrized version of the Kullback–Leibler divergence) is known to be a legitimate distance metric, closely related to the Hellinger distance. This makes it a good choice for measuring the dissimilarity between two distributions (e.g., two multinomial distributions corresponding to a pair of documents).
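A minimal sketch; the metric properties (symmetry, identity, triangle inequality) can be checked numerically:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (natural log),
    assuming q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence,
    which is a true distance metric between distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
r = [0.3, 0.3, 0.4]
```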