Friday, April 21, 2006


MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It was written primarily by Dr William W. Cohen. It comes with a collection of publicly available extraction problems in MinorThird format (about 2 MB).

MinorThird differs from existing NLP and learning toolkits in a number of ways:

  • Unlike many NLP packages (e.g., GATE, Alembic), it combines tools for annotating and visualizing text with state-of-the-art learning methods.

  • Unlike many other learning packages, it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging.

  • Unlike learning packages that are less tightly integrated with text manipulation tools, it makes it possible to track and visualize how text data is transformed into machine learning data.

  • Unlike many packages (including WEKA), it is open-source and available for both commercial and research purposes.

  • Unlike any open-source learning system I know of, it is architected to support active learning and on-line learning, which should facilitate integration of learning methods into agents.
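
To make "on-line learning" concrete, here is a minimal sketch of a learner that is updated one labeled example at a time instead of being trained on a whole batch, which is the property that makes it easy to embed in an agent. This is not MinorThird's API: the class `OnlinePerceptronSketch` and its methods `addExample` and `predict` are hypothetical names of my own, and the perceptron update rule is used only because it is one of the simplest on-line algorithms.

```java
/**
 * Hypothetical sketch of an on-line learner (a binary perceptron).
 * Illustrative only; this is NOT part of MinorThird.
 */
public class OnlinePerceptronSketch {
    private final double[] weights;
    private double bias = 0.0;

    public OnlinePerceptronSketch(int numFeatures) {
        this.weights = new double[numFeatures];
    }

    /** Raw linear score: w . x + b. */
    public double score(double[] x) {
        double s = bias;
        for (int i = 0; i < weights.length; i++) {
            s += weights[i] * x[i];
        }
        return s;
    }

    /** Predicts +1 if the score is positive, otherwise -1. */
    public int predict(double[] x) {
        return score(x) > 0 ? 1 : -1;
    }

    /**
     * Consumes a single labeled example (label must be +1 or -1).
     * The model changes only when the current prediction is wrong.
     * Returns true if an update was made.
     */
    public boolean addExample(double[] x, int label) {
        if (predict(x) == label) {
            return false;
        }
        for (int i = 0; i < weights.length; i++) {
            weights[i] += label * x[i];
        }
        bias += label;
        return true;
    }

    public static void main(String[] args) {
        OnlinePerceptronSketch learner = new OnlinePerceptronSketch(2);
        // Examples arrive one at a time, as they would for an agent.
        double[][] stream = { {1, 0}, {0, 1}, {1, 0}, {0, 1} };
        int[] labels = { 1, -1, 1, -1 };
        for (int t = 0; t < stream.length; t++) {
            boolean updated = learner.addExample(stream[t], labels[t]);
            System.out.println("example " + t + " caused update: " + updated);
        }
    }
}
```

The learner is usable for prediction at any point in the stream, so an agent can interleave `predict` and `addExample` calls without ever retraining from scratch.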