I recently learned the trick of using the defaultdict class for frequency counting and smoothing from Peter Norvig's influential technical article "How to Write a Spelling Corrector".
As its name suggests, defaultdict is like a regular Python dict, except that a default value (in fact, a factory function that produces it) can be specified in advance. For example, the following piece of code uses the standard dict to build a term-frequency hash table.
tf = {}
for t in words:
    tf[t] = tf.get(t, 0) + 1
It can be made simpler and faster by using defaultdict from the collections module.
import collections
# int() returns 0, so every missing term starts with a count of 0
tf = collections.defaultdict(int)
for t in words:
    tf[t] += 1
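The smoothing use mentioned at the start follows the same pattern: change the factory so that unseen terms start from 1 instead of 0. The snippet below is only a minimal sketch of that idea; the function name train and the sample sentence are made up for illustration and are not taken from any particular source.

```python
import collections

def train(words):
    # Hypothetical helper: every term starts from 1 rather than 0,
    # which acts as a simple add-one style smoothing.
    model = collections.defaultdict(lambda: 1)
    for t in words:
        model[t] += 1
    return model

# Made-up sample text for illustration only.
words = "the quick brown fox jumps over the lazy dog the end".split()
model = train(words)

seen = model["the"]           # 3 occurrences plus the initial 1
had_cat = "cat" in model      # membership test does not create the key
unseen = model["cat"]         # indexing returns the default and stores it
```

One thing to keep in mind is that indexing a defaultdict with a missing key inserts that key as a side effect, whereas the "in" test and the get method do not.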
PS: I am glad to see that my colleague Roger Mitton's work on spell-checking is cited in the above article.