Monday, May 31, 2010

An application of Bloom filters

It is said that Google's BigTable uses Bloom filters to reduce the disk lookups for non-existent rows or columns.

Tuesday, May 04, 2010

A suffix tree implementation with Unicode support

It seems that there is currently no suffix tree implementation with Unicode support publicly available online. So I adapted Thomas Mailund's suffix tree implementation in C with a Python binding and put it here. The changes that I made to the code were mainly to make it support Unicode text and be compatible with new version Python. It also includes an example program all_comsubstr.py that illustrates the extraction of common substrings from two Chinese strings (encoded in UTF-8).

Monday, May 03, 2010

Longest Common Substring

Given two strings, S of length m and T of length n, their longest common substrings can be found in O(m+n) time using a generalised suffix tree, or in O(mn) time through dynamic programming (e.g., the Python code here).