The story of the formula that killed Wall Street illustrates the extreme importance of obtaining first-hand, real-world data. It was the availability of CDS market data that made the correlation between default risks quantitatively measurable using the Gaussian copula model, despite the scarcity of real historical default data, and thus enabled the invention and proliferation of CDOs. However, precisely because the model is calibrated to CDS market data rather than to real historical default data, applying it without prudence can lead to catastrophic outcomes, as witnessed in the recent global financial crisis. The key to its success was also its undoing. When there are not sufficient data, the so-called wisdom of the crowd is not reliable at all.

## Monday, February 23, 2009

### Ullman set

The Ullman set is a clever data structure for representing a set of n integers densely distributed over a range, say from 0 to N-1. It uses O(N) space and supports construction, destruction, insertion, deletion, membership testing, and cardinality queries in O(1) time. Moreover, iterating over its members takes only O(n) time.
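A minimal sketch in Python (the class name `UllmanSet` and its method names are my own choices, not from any standard library). One array, `dense`, packs the n members contiguously; a second, `sparse`, maps each value to its slot in `dense`, and the two are cross-checked on every membership test so stale entries are harmless. Note that the O(1)-construction bound relies on leaving the arrays uninitialized, which Python lists cannot do, so allocation here actually costs O(N):

```python
class UllmanSet:
    """Set of integers in [0, capacity); all operations O(1), iteration O(n)."""

    def __init__(self, capacity):
        self.n = 0                      # current cardinality
        self.dense = [0] * capacity     # dense[0:n] holds the members
        self.sparse = [0] * capacity    # sparse[x] = index of x in dense

    def __contains__(self, x):
        i = self.sparse[x]
        return i < self.n and self.dense[i] == x

    def add(self, x):
        if x not in self:
            self.dense[self.n] = x
            self.sparse[x] = self.n
            self.n += 1

    def remove(self, x):
        if x in self:
            i = self.sparse[x]
            last = self.dense[self.n - 1]
            self.dense[i] = last        # move the last member into the hole
            self.sparse[last] = i
            self.n -= 1

    def __len__(self):
        return self.n

    def __iter__(self):
        return iter(self.dense[:self.n])
```

Clearing the set is also O(1): just reset `n` to 0, leaving the arrays' contents as garbage that the cross-check in `__contains__` will reject.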

## Thursday, February 19, 2009

### Approximating Binomial Distributions

The Poisson distribution can be derived as a limiting case to the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed. Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05. According to this rule the approximation is excellent if n ≥ 100 and np ≤ 10.

--- Wikipedia
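As a quick numerical check of the rule of thumb above, the following Python sketch (helper names are my own) compares the binomial and Poisson probability mass functions at n = 100, p = 0.05, so np = 5 ≤ 10:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 100, 0.05          # n >= 100 and np = 5 <= 10
lam = n * p
for k in (0, 3, 5, 10):
    print(k, binom_pmf(k, n, p), poisson_pmf(k, lam))
```

Even at the mode k = 5, the two probabilities (about 0.180 vs. 0.175) differ by less than 0.005, consistent with the "excellent approximation" regime.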

## Monday, February 02, 2009

### Take the computations to the data

The legendary computer scientist Jim Gray is said to have held the conviction that when you have lots of data, you take the computations to the data rather than the data to the computations. Does this imply that (statistical) machine learning, which relies on large amounts of data, should be built into database systems? Maybe a database system that directly manages uncertainty, such as Trio, is indeed a future trend.

## Sunday, February 01, 2009

### Fewer Searches, Higher Price

A recent paper shows empirical evidence that "search engine advertising is most valuable when firms have just a few hard-to-reach customers": lawyers bid the price of a keyword higher when there are fewer searches for it, because they need a way to find those niche customers. In other words, the strength of search engine advertising lies precisely in exploiting the long tail of advertising, which is impossible or difficult for traditional broadcast media to serve.