Saturday, March 28, 2009

Follow the Data

Some Google researchers recently published an article, "The Unreasonable Effectiveness of Data", arguing that AI researchers should embrace complexity and follow the data rather than attempt to create elegant theories. Two of their recommendations:


  • "Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data."

  • "Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail."


For example, because Web-scale text data is now available, natural language applications can achieve better performance by relying on simple word occurrence and co-occurrence statistics rather than on complex latent factor analysis. The former approach is also more scalable, because it requires only online learning, which is easy to parallelize.
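To make the scalability point concrete, here is a minimal Python sketch of counting word occurrences and co-occurrences in a single pass over a stream of sentences. It is not taken from the article: the function names (`count_cooccurrences`, `merge`), the whitespace tokenizer, and the choice of "same sentence" as the co-occurrence window are all assumptions made for illustration. Because the counts are plain sums, each shard of the data can be processed independently and the partial results added together afterwards.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences):
    """Single pass over an iterable of sentences.

    Returns (word_counts, pair_counts), where pair_counts counts each
    unordered pair of distinct words once per sentence they share.
    Tokenization by lowercasing and whitespace splitting is an
    illustrative assumption.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_counts.update(tokens)
        # Sort the distinct words so each pair has one canonical order.
        distinct = sorted(set(tokens))
        pair_counts.update(combinations(distinct, 2))
    return word_counts, pair_counts


def merge(left, right):
    """Combine partial counts from two shards by simple addition."""
    return left[0] + right[0], left[1] + right[1]


# Two shards that could be counted on different machines.
shard_1 = ["the cat sat on the mat", "the dog sat"]
shard_2 = ["the cat chased the dog"]

words, pairs = merge(count_cooccurrences(shard_1),
                     count_cooccurrences(shard_2))
```

Merging the per-shard counts gives the same result as counting the whole corpus in one process, which is why this kind of statistic parallelizes so easily.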

This clearly echoes two previous posts, "More Data vs. Better Algorithms" and "Right Data vs. Better Models".
