There are four measures of centrality that are widely used in network analysis: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Google's PageRank is a variant of the eigenvector centrality measure.
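As a rough illustration of two of these measures, here is a minimal sketch of degree and closeness centrality for a small undirected, unweighted graph. The adjacency-dict representation, the function names, and the toy star graph are all assumptions made for this example; they do not come from any particular library.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Inverse of the mean shortest-path distance to the reachable nodes."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest-path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical toy graph: a star with hub "a" and leaves "b", "c", "d".
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
deg = degree_centrality(star)
clo = closeness_centrality(star)
print(deg["a"], clo["a"])  # the hub scores highest on both measures
```

Betweenness and eigenvector centrality need more machinery (shortest-path counting and power iteration, respectively) and are left out of this sketch.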
Monday, May 25, 2009
Dirichlet prior for smoothing
Using a Dirichlet distribution as the prior for smoothing in statistical language modeling leads to additive smoothing (a.k.a. Lidstone smoothing), which includes Laplace smoothing (i.e., add-one) and Jeffreys-Perks smoothing (i.e., add-half, a.k.a. Expected Likelihood Estimation) as special cases. This family of smoothing methods can be regarded as a document-dependent extension of linearly interpolated smoothing, because the effective interpolation weight depends on the document length.
It has been shown that Laplace smoothing, though most popular (in textbooks), is often inferior to Lidstone smoothing (using a value less than one) in modeling natural language data, e.g., for text classification tasks (see Athena: Mining-based interactive management of text databases).
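To make the additive family concrete, here is a minimal sketch of the estimate (count + alpha) / (N + alpha * V), which is the posterior mean under a symmetric Dirichlet prior with parameter alpha. The function name, the toy vocabulary, and the token list are assumptions made for illustration only.

```python
from collections import Counter


def additive_smoothing(tokens, vocabulary, alpha=1.0):
    """Return smoothed unigram probabilities over a fixed vocabulary.

    alpha=1.0 gives Laplace (add-one) smoothing, alpha=0.5 gives
    Jeffreys-Perks (add-half), and other positive values give general
    Lidstone smoothing.
    """
    counts = Counter(tokens)
    # Only tokens inside the vocabulary contribute to the normaliser.
    n = sum(counts[w] for w in vocabulary)
    denom = n + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


# Hypothetical toy data: "c" is in the vocabulary but never observed.
vocab = ["a", "b", "c"]
tokens = ["a", "a", "b"]
laplace = additive_smoothing(tokens, vocab, alpha=1.0)
jeffreys_perks = additive_smoothing(tokens, vocab, alpha=0.5)
print(laplace["c"], jeffreys_perks["c"])
```

A smaller alpha moves less probability mass to unseen words, which is one way to read the observation that values below one often work better on natural language data.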
Saturday, May 23, 2009
PyMat
PyMat exposes the MATLAB engine interface, allowing Python programs to start, close, and communicate with a MATLAB engine session. In addition, the package allows transferring matrices to and from a MATLAB workspace. These matrices can be specified as NumPy arrays, allowing a blend between the mathematical capabilities of NumPy and those of MATLAB.
Sunday, May 17, 2009
Global Ranking
Global Ranking looks like a promising direction in the research area of Learning to Rank for Information Retrieval.
[1] Global Ranking Using Continuous Conditional Random Fields
[2] Global Ranking by Exploiting User Clicks
Sunday, May 10, 2009
JPype vs Jython
It is often desirable to access Java libraries (such as Weka and Mallet) in Python code. There are currently two possible approaches: JPype and Jython. The former looks more attractive, as it works not by re-implementing Python (as the latter does), but by interfacing at the native level between the two virtual machines.
Curve Fitting in MATLAB
EzyFit is a free toolbox for MATLAB that enables quick curve fitting of 1D data using arbitrary (nonlinear) fitting functions. It can work in both interactive and command-line modes.
Converting HTML to text using Python
html2text is a Python script that does a good job of extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text: it produces Markdown, which would then have to be turned into plain text.
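If plain text without Markdown is all that is needed, one alternative is a small extractor built on the standard library's html.parser module. The sketch below is not html2text and does not reproduce its behaviour; the class name, helper function, and sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with spaces and whitespace is collapsed. This is
        # crude: inline tags inside a word would introduce a spurious space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<p>Fish &amp; chips<script>var x = 1;</script> cost &pound;5</p>"
plain = html_to_text(sample)
print(plain)
```

This loses all structure (headings, lists, links), so it only suits cases where a flat string of text is enough.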
Stanford Named Entity Recognizer
Stanford Named Entity Recognizer is an open-source named entity recognition tool implemented in Java. The software is based on the linear-chain Conditional Random Field (CRF) sequence model.
Wednesday, May 06, 2009
Anytime Algorithms
Anytime algorithms are algorithms that trade execution time for quality of results. In particular, an anytime algorithm always has a best-so-far answer available, and the quality of the answer improves with execution time. The user may examine this answer at any time and choose to terminate, temporarily suspend, or continue the algorithm’s execution until completion. Using an anytime algorithm can mitigate the need for high-performance hardware.
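A Python generator is one natural way to sketch this behaviour, since it can be paused after every step and always hands back its current best answer. The example below is a toy nearest-value search written for illustration; it is not the method of either reference, and the function names, the step budget, and the data are assumptions.

```python
import itertools


def anytime_nearest(query, points):
    """Yield the best-so-far (point, distance) after each point examined."""
    best, best_dist = None, float("inf")
    for p in points:
        d = abs(p - query)
        if d < best_dist:
            best, best_dist = p, d
        # The caller may stop here, pause, or ask for the next step.
        yield best, best_dist


def run_with_budget(search, max_steps):
    """Advance the search by at most max_steps and return the last answer."""
    answer = None
    for answer in itertools.islice(search, max_steps):
        pass
    return answer


# Hypothetical data: the exact match is only found on the last step.
points = [9.0, 4.0, 6.5, 5.2, 5.0]
early = run_with_budget(anytime_nearest(5.0, points), 2)
full = run_with_budget(anytime_nearest(5.0, points), len(points))
print(early, full)
```

Stopping after two steps still returns a usable (if worse) answer, and the answer can only stay the same or improve as more steps are allowed.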
[1] Shlomo Zilberstein: Using Anytime Algorithms in Intelligent Systems. AI Magazine 17(3): 73-83 (1996)
[2] Xiaopeng Xi, Ken Ueno, Eamonn J. Keogh, Dah-Jye Lee: Converting non-parametric distance-based classification to anytime algorithms. Pattern Anal. Appl. 11(3-4): 321-336 (2008)