Tuesday, November 27, 2007

Version Control for Papers

I have found TortoiseSVN very handy for version control of papers as well as code. The whole history of all the related files (text or binary) for a research project can be saved and managed effectively.

Wednesday, November 21, 2007


Chih-Jen Lin, the researcher who developed the well-known LIBSVM, has released LIBLINEAR, a program for large-scale linear classification (e.g., on text data) that supports both logistic regression and L2-loss linear SVM using a trust region Newton method.

Tuesday, November 20, 2007

Scraping Web Pages with Python

Beautiful Soup is a Python HTML/XML parser. The following features make it stand out for screen-scraping: (1) high tolerance of web pages with bad markup; (2) simple methods for data extraction from web pages; (3) automatic conversion of web pages to Unicode/UTF-8 encoding.

Sunday, November 18, 2007

Tanimoto coefficient

The Tanimoto coefficient, like the cosine similarity in the vector space model, is a similarity measure between two vectors. It is also called the extended Jaccard coefficient, as it yields the standard Jaccard coefficient in the case of binary vectors.
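To make the definition concrete, here is a minimal sketch using the usual formula T(a, b) = a·b / (|a|² + |b|² − a·b). The function name and the convention for two all-zero vectors are my own assumptions, not part of any library.

```python
def tanimoto(a, b):
    """Tanimoto (extended Jaccard) coefficient of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if denom == 0 else dot / denom


# Binary vectors: 2 features shared, 4 features present in either vector,
# so the result matches the standard Jaccard coefficient 2/4.
print(tanimoto([1, 1, 1, 0], [0, 1, 1, 1]))  # 0.5
```

For binary vectors the dot product counts the shared features and the denominator counts the features present in either vector, which is exactly the Jaccard ratio.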

Wednesday, November 14, 2007

Web Search Results Clustering based on Related Queries

Although automatic organization of Web search results should be helpful, search engines that offer clustered search results, such as Clusty from Vivisimo, have achieved only limited success so far. The underlying reasons may be that (1) the created clusters are not really consistent with users' search interests and (2) the extracted cluster labels are not really readable or informative from the users' perspective.

I have an idea which may sound silly :) Many search engines return a list of related queries in addition to the search results for a given query. Such related queries are real queries extracted from search logs, so they reflect users' real information needs. By using related queries as search result categories, we may be able to get user-oriented clustering of search results. This approach is simple to understand and easy to implement. Furthermore, it can be done on the fly at query time.

For example, a clustering interface for Windows Live Search could be developed in the following way.

  • Start with a query q, such as 'jaguar'.

  • Get the top 100 search results for q through the Live Search SDK.

  • Get the related queries of q, such as 'jaguar animal' and 'jaguar car', by scraping the 'Related searches' column in the search results page directly, or using the ITermSuggestion provider in the Microsoft adCenter Keyword Services Platform.

  • For each related query r_i, create a corresponding cluster c_i with the label r_i, and assign all search results containing the r_i terms to the cluster c_i.

  • Present the clustered search results in a Clusty-like interface.
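The assignment step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the search results and related queries are hard-coded hypothetical data (no Live Search or adCenter call is made), each result is reduced to a `url` and a `text` field, and a result "contains" a related query when every term of the query appears in its text.

```python
import re


def cluster_by_related_queries(results, related_queries):
    """Assign each result to every related query whose terms all occur in it."""
    clusters = {r: [] for r in related_queries}
    unassigned = []
    for result in results:
        words = set(re.findall(r"\w+", result["text"].lower()))
        placed = False
        for r in related_queries:
            if all(term in words for term in r.lower().split()):
                clusters[r].append(result["url"])
                placed = True
        if not placed:
            unassigned.append(result["url"])
    return clusters, unassigned


# Hypothetical data standing in for the search results of q = 'jaguar'.
results = [
    {"url": "http://example.com/1", "text": "Jaguar car dealers and new models"},
    {"url": "http://example.com/2", "text": "The jaguar is a large animal, native to the Americas"},
    {"url": "http://example.com/3", "text": "Jaguar: history of the name"},
]
related = ["jaguar animal", "jaguar car"]

clusters, unassigned = cluster_by_related_queries(results, related)
print(clusters)
print(unassigned)
```

A result may land in several clusters, and results matching no related query are collected separately so that the interface could show them under an "other" label.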

Saturday, November 10, 2007


Using microformats within HTML code provides additional formatting and semantic data that can be used by applications. This approach to providing "more intelligent data" on the Web has a lower barrier to entry than the Semantic Web.
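As a small illustration of how an application might use such data, here is a sketch that reads two hCard properties (`fn` and `org`) from ordinary HTML using only the Python standard library. The HTML snippet and the `HCardParser` class are made up for this example, and the parser is deliberately simplified: it does not check that the properties sit inside a `vcard` element and does not handle nested markup.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    PROPERTIES = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text is expected next

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None


# Hypothetical page fragment marked up with the hCard microformat.
html_snippet = """
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Lab</span>
</div>
"""

parser = HCardParser()
parser.feed(html_snippet)
print(parser.card)
```

The page still renders as normal HTML; the semantic data rides along in the `class` attributes, which is why the entry barrier is so low.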

Friday, November 02, 2007

To be better than Google

The following articles review three possible approaches to outdoing Google, and explain why we still intuitively feel that Google is going to remain the king of search.

  • Better Quality

  • Better Interface

  • Vertical Search

The Race to Beat Google
Search 2.0 - What's Next

Spectral Regression

The recently proposed Spectral Regression approach to dimensionality reduction is simply Laplacian Eigenmap plus Regularized Linear Regression. Just that simple.
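Spelled out, the two steps look roughly like this. The notation is my own choice for this sketch: W is the affinity matrix of the neighborhood graph, D its diagonal degree matrix, X the data matrix with the samples x_i as columns, and α > 0 the regularization parameter.

```latex
% Step 1 (Laplacian Eigenmap): find the embedding y from the graph,
% with L = D - W the graph Laplacian.
L y = \lambda D y

% Step 2 (regularized linear regression): fit a projection a so that
% a^T x_i approximates the embedding coordinate y_i.
a = \arg\min_{a} \sum_{i} \left( a^{\top} x_i - y_i \right)^2 + \alpha \lVert a \rVert^2
  = \left( X X^{\top} + \alpha I \right)^{-1} X y
```

The first step needs only the graph, and the second is an ordinary ridge regression, so no dense eigenproblem on the data matrix itself has to be solved.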