Tuesday, May 04, 2010

A suffix tree implementation with Unicode support

It seems that there is currently no suffix tree implementation with Unicode support publicly available online. So I adapted Thomas Mailund's suffix tree implementation in C with a Python binding and put it here. The changes that I made to the code were mainly to make it support Unicode text and be compatible with new version Python. It also includes an example program all_comsubstr.py that illustrates the extraction of common substrings from two Chinese strings (encoded in UTF-8).

6 comments:

Anonymous said...

Thank you!

Matteo Romanello said...

Thanks very much for your contribution, it's just what I was looking for!
I was wondering if you have any example about how to implement an efficient dictionary search using an underlying suffix tree in python.

Dell Zhang said...

The usage is same as Thomas Mailund's original module.

Anonymous said...

Although this is some years old by now: Under which license is this code released?

Dell Zhang said...

As it is based on Thomas Mailund's suffix tree code, it should bear the same constraints. I do not know which license he has adopted ---probably a non-restrictive one as most 3rd-party modules for Python.

alec said...

Hi Dell Zhang,

Could you make an example of using the SuffixTree to find the longest repeating substring (Repeating 2x, or more generally, n-times)?