Monday, May 25, 2009

Dirichlet prior for smoothing

Using the Dirichlet distribution as the prior for smoothing in statistical language modeling leads to additive smoothing (a.k.a. Lidstone smoothing), which includes Laplace smoothing (i.e., add one) and Jeffreys-Perks smoothing (i.e., add half, a.k.a. Expected Likelihood Estimation) as special cases. This family of smoothing methods can be regarded as a document-dependent extension of linearly interpolated smoothing.
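
As a concrete illustration, here is a minimal Python sketch of additive smoothing viewed as the posterior mean under a symmetric Dirichlet(α) prior; the toy corpus and vocabulary are hypothetical:

```python
from collections import Counter

def additive_smoothing(counts, vocab, alpha=1.0):
    """Estimate P(w) as the posterior mean under a symmetric
    Dirichlet(alpha) prior: (count(w) + alpha) / (N + alpha * |V|).
    alpha = 1.0 gives Laplace, alpha = 0.5 gives Jeffreys-Perks."""
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}

# Hypothetical toy corpus; "dog" is an unseen word
tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"dog"}
counts = Counter(tokens)

laplace = additive_smoothing(counts, vocab, alpha=1.0)        # add one
jeffreys_perks = additive_smoothing(counts, vocab, alpha=0.5) # add half
print(laplace["dog"], jeffreys_perks["dog"])  # unseen word gets nonzero mass
```

Setting α = 1 recovers Laplace smoothing and α = 0.5 recovers Jeffreys-Perks; values below one often work better in practice, as noted below.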

It has been shown that Laplace smoothing, though the most popular variant (at least in textbooks), is often inferior to Lidstone smoothing with an additive constant less than one when modeling natural language data, e.g., for text classification tasks (see Athena: Mining-based interactive management of text databases).

1 comment:

Bob Carpenter said...

Is Laplace smoothing really popular anywhere but in textbooks?

I find the real key to the Dirichlet/multinomial is the conjugacy. That means if the prior parameter is α, the multinomial model params are θ, and the data vector of word counts is x, then both the prior p(θ|α) and the posterior p(θ|α, x) are Dirichlet (with params α and x+α respectively).
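
A minimal numeric sketch of that conjugate update (the prior and counts here are hypothetical); note that the posterior mean reproduces exactly the additive smoothing from the post:

```python
import numpy as np

# Hypothetical symmetric prior and observed word counts
alpha = np.array([0.5, 0.5, 0.5])  # Dirichlet prior parameters
x = np.array([3, 1, 0])            # observed word counts

# Conjugacy: posterior p(theta | alpha, x) is Dirichlet(x + alpha)
posterior = x + alpha

# Posterior mean = (count + alpha) / (N + sum of alphas),
# i.e., additive (Lidstone) smoothing with constant 0.5
posterior_mean = posterior / posterior.sum()
print(posterior_mean)
```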

If you're doing LM-based search, you can then do prediction with the resulting Dirichlet-multinomial, or estimate another Dirichlet posterior from the query and then compare query and doc with KL divergence over the posterior Dirichlets rather than over the maximum a posteriori (MAP) point estimates.
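
For reference, the KL divergence between two Dirichlets has a closed form in terms of log-gamma and digamma functions. A sketch with hypothetical query and document counts (not any particular system's scoring function):

```python
import numpy as np
from scipy.special import gammaln, psi

def kl_dirichlet(a, b):
    """Closed-form KL(Dir(a) || Dir(b))."""
    a0, b0 = a.sum(), b.sum()
    return (gammaln(a0) - gammaln(a).sum()
            - gammaln(b0) + gammaln(b).sum()
            + ((a - b) * (psi(a) - psi(a0))).sum())

# Hypothetical posteriors: shared prior alpha plus query/document counts
alpha = 0.5
query_posterior = np.array([2, 1, 0]) + alpha
doc_posterior = np.array([5, 3, 1]) + alpha
print(kl_dirichlet(query_posterior, doc_posterior))  # smaller = better match
```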