Monday, May 25, 2009

Centrality measures in network analysis

There are four measures of centrality that are widely used in network analysis: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Google's PageRank is a variant of the eigenvector centrality measure.
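
As a rough illustration of three of these measures, here is a minimal pure-Python sketch on a small undirected graph. The toy graph, the function names, and the choice of power iteration with a self-loop shift are all assumptions made for this example, not taken from any particular library; betweenness centrality is left out for brevity.

```python
from collections import deque

# Hypothetical toy graph as an adjacency list (undirected, connected).
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}

def degree_centrality(g):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(g)
    return {v: len(nbrs) / (n - 1) for v, nbrs in g.items()}

def closeness_centrality(g):
    """Reciprocal of the mean shortest-path distance to reachable nodes."""
    result = {}
    for source in g:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in g[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

def eigenvector_centrality(g, iterations=200):
    """Power iteration on A + I; the shift keeps the same eigenvectors
    as A but avoids oscillation on bipartite graphs."""
    score = {v: 1.0 for v in g}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in g[v]) for v in g}
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score

degree = degree_centrality(graph)        # degree["a"] is 3/4
closeness = closeness_centrality(graph)  # closeness["a"] is 4/5
eigen = eigenvector_centrality(graph)
```

On this graph, node "a" ranks first under all three measures, while the pendant node "e" ranks last.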

Dirichlet prior for smoothing

Using a Dirichlet distribution as the prior for smoothing in statistical language modeling leads to additive smoothing (a.k.a. Lidstone smoothing), which includes Laplace smoothing (i.e., add one) and Jeffreys-Perks smoothing (i.e., add half, a.k.a. Expected Likelihood Estimation) as special cases. This family of smoothing methods can be regarded as a document-dependent extension of linearly interpolated smoothing.

It has been shown that Laplace smoothing, though the most popular choice (in textbooks), is often inferior to Lidstone smoothing (adding a value less than one) in modeling natural language data, e.g., for text classification tasks (see Athena: Mining-based interactive management of text databases).
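
To make the additive family concrete, here is a small sketch of the estimate (count + alpha) / (N + alpha * V), which is the posterior mean under a symmetric Dirichlet prior with parameter alpha. The function name, the toy tokens, and the vocabulary are made up for illustration.

```python
from collections import Counter

def additive_smoothing(tokens, vocabulary, alpha=1.0):
    """Return smoothed unigram probabilities over the given vocabulary.

    alpha=1.0 is Laplace (add one), alpha=0.5 is Jeffreys-Perks (add half),
    and any other positive alpha is general Lidstone smoothing.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Hypothetical toy data: "d" is in the vocabulary but never observed.
tokens = ["a", "b", "a", "c"]
vocabulary = {"a", "b", "c", "d"}

laplace = additive_smoothing(tokens, vocabulary, alpha=1.0)
lidstone = additive_smoothing(tokens, vocabulary, alpha=0.5)
```

With these four tokens, Laplace smoothing gives the unseen word "d" a probability of 1/8, while alpha = 0.5 gives it 1/12, so a smaller alpha moves less probability mass away from the observed words.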

Saturday, May 23, 2009

PyMat

PyMat exposes the MATLAB engine interface, allowing Python programs to start, close, and communicate with a MATLAB engine session. In addition, the package allows transferring matrices to and from a MATLAB workspace. These matrices can be specified as NumPy arrays, allowing a blend between the mathematical capabilities of NumPy and those of MATLAB.

Sunday, May 17, 2009

Sunday, May 10, 2009

PyLucene

PyLucene is a Python extension that allows the usage of Lucene's text indexing and searching capabilities from Python.

JPype vs Jython

It is often desirable to access Java libraries (such as Weka and Mallet) in Python code. There are currently two possible approaches: JPype and Jython. The former looks more attractive, as it provides access to Java not by re-implementing Python (as the latter does), but by interfacing at the native level in both virtual machines.

Curve Fitting in Matlab

EzyFit is a free toolbox for Matlab that enables quick curve fitting of 1D data using arbitrary (nonlinear) fitting functions. It can work in both interactive and command-line modes.

Converting HTML to text using Python

html2text is a Python script that does a good job of extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text: it produces Markdown, which would then have to be turned into plain text.
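
If plain text is all that is needed, a rough alternative is the standard library's html.parser. The sketch below is not html2text and makes no attempt to match its output; the class and function names are invented for this example. It decodes entities, skips script and style content, and collapses whitespace.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a word split by inline tags
        # (e.g. <b>bo</b>ld) would gain a space; acceptable for a sketch.
        return " ".join(" ".join(self._chunks).split())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = "<p>Fish &amp; chips</p><script>var x = 1;</script><p>caf&eacute;</p>"
plain = html_to_text(sample)
```

Unlike html2text, this drops all structure (links, headings, lists), so it suits cases where only the words matter.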

Stanford Named Entity Recognizer

Stanford Named Entity Recognizer is an open-source named entity recognition tool implemented in Java. The software is based on the linear-chain Conditional Random Field (CRF) sequence model.

Wednesday, May 06, 2009

Anytime Algorithms

Anytime algorithms are algorithms that trade execution time for quality of results. In particular, an anytime algorithm always has a best-so-far answer available, and the quality of the answer improves with execution time. The user may examine this answer at any time and choose to terminate, temporarily suspend, or continue the algorithm’s execution until completion. Using an anytime algorithm can mitigate the need for high-performance hardware.
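
The best-so-far idea maps naturally onto a Python generator. The following is a toy sketch, not the method of either paper cited in this post: a linear nearest-neighbor scan that yields its current best answer after every candidate, plus a hypothetical helper that stops after a fixed step budget (a wall-clock deadline would work the same way).

```python
def anytime_nearest(query, candidates):
    """Yield (best_candidate, best_distance) after each candidate is examined.

    The yielded answer never gets worse as more candidates are scanned.
    """
    best, best_dist = None, float("inf")
    for candidate in candidates:
        dist = abs(candidate - query)
        if dist < best_dist:
            best, best_dist = candidate, dist
        yield best, best_dist

def run_with_budget(steps, budget):
    """Consume at most `budget` steps (budget >= 1) and return the last answer."""
    answer = None
    for i, answer in enumerate(steps, start=1):
        if i >= budget:
            break
    return answer

# Hypothetical data: find the value closest to 5.
candidates = [10, 4, 7, 6, 5.5]

early = run_with_budget(anytime_nearest(5, candidates), budget=2)
final = run_with_budget(anytime_nearest(5, candidates), budget=len(candidates))
```

Stopping after two steps already returns a usable answer (4, at distance 1); running to completion improves it to 5.5, at distance 0.5. Because the generator keeps its state, the caller can also pause and later resume the same scan instead of restarting it.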

[1] Shlomo Zilberstein: Using Anytime Algorithms in Intelligent Systems. AI Magazine 17(3): 73-83 (1996)
[2] Xiaopeng Xi, Ken Ueno, Eamonn J. Keogh, Dah-Jye Lee: Converting non-parametric distance-based classification to anytime algorithms. Pattern Anal. Appl. 11(3-4): 321-336 (2008)