UCSD SCIENTISTS EXPLAIN AND IMPROVE UPON 'ENIGMATIC' PROBABILITY FORMULA
Findings Could Have Implications for Speech Recognition, Machine Learning, Information Retrieval

In the article, Orlitsky and his colleagues unlock some of the secrets of the "Good-Turing estimator," a formula for estimating the probability of elements based on observed data. The formula is named after famed mathematicians I.J. Good and Alan Turing who, during WWII, were among a group of cryptanalysts charged with breaking the Enigma cipher, the code used to encrypt German military communications. The group worked at Bletchley Park, outside of London, and has been credited by some with shortening the war by several years. (Their work also led to the development of the first modern computer, and was documented in a number of books and movies.)
The cryptanalysts were greatly aided by their possession of the Kenngruppenbuch, the German cipher book that contained all possible secret keys to Enigma, which had previously been captured by British Intelligence. They documented the keys used by various U-boat commanders in previously decrypted messages and used this information to estimate the distribution of pages from which commanders picked their secret keys.
The prevailing technique at the time estimated the likelihood of each page by simply using its empirical frequency, the fraction of the time it had been picked in the past. But Good and Turing developed an unintuitive formula that bore little resemblance to conventional estimators. Surprisingly, this Good-Turing estimator outperformed the more intuitive approaches. Following the war, Good published the formula, mentioning that Turing had an "intuitive demonstration" for its power, but not describing what that demonstration entailed.
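To make the contrast concrete, the following is a minimal Python sketch of the two approaches. It uses the basic textbook form of Good-Turing, in which a symbol seen r times among n samples is assigned (r + 1) * N_{r+1} / (n * N_r), where N_r is the number of distinct symbols seen exactly r times, and the symbols never seen share a total mass of N_1 / n. The function names and the sample data are assumptions made for illustration, and this unsmoothed form is not necessarily the exact variant analyzed by the researchers.

```python
from collections import Counter


def empirical_frequency(samples):
    """Estimate each symbol's probability as the fraction of times it was seen."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return {symbol: count / n for symbol, count in Counter(samples).items()}


def good_turing(samples):
    """Basic (unsmoothed) Good-Turing estimate.

    Returns a pair: a dict of per-symbol estimates for the symbols that were
    observed, and the total probability mass reserved for unseen symbols.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    counts = Counter(samples)
    # freq_of_freq[r] is N_r: how many distinct symbols appeared exactly r times.
    freq_of_freq = Counter(counts.values())
    unseen_mass = freq_of_freq.get(1, 0) / n
    estimates = {}
    for symbol, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        estimates[symbol] = (r + 1) * n_r_plus_1 / (n * n_r)
    return estimates, unseen_mass


if __name__ == "__main__":
    pages = list("aabcd")  # hypothetical record of which pages were picked
    print(empirical_frequency(pages))
    print(good_turing(pages))
```

Unlike empirical frequency, the Good-Turing estimate sets aside probability for pages that have not been picked yet. In this unsmoothed form it also assigns zero to the most frequent symbols whenever no symbol has a count one higher, which is why practical implementations usually smooth the N_r values first.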
Since then, Good-Turing has been incorporated into a variety of applications such as information retrieval, spell-checking, and speech recognition software, where it is used to automatically learn the underlying structure of the language. But despite its usefulness, "its performance has remained something of an enigma itself," said Orlitsky, a professor in the Electrical and Computer Engineering department. While some partial explanations have been given as to why Good-Turing may work well, no objective evaluation or results have been established for its optimality. Additionally, scientists observed that while it worked well under many circumstances, at times its performance was lacking.
Now, Orlitsky, Santhanam, and Zhang believe they have unraveled some of the mystery surrounding Good-Turing, and constructed a new estimator that, unlike the historic formula, is reliable under all conditions. Motivated by information-theoretic and machine-learning considerations, they propose a natural measure for the performance of an estimator. Called attenuation, it evaluates the highest possible ratio between the probability assigned to each symbol in a sequence by any distribution, and the corresponding probability assigned by the estimator.
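One plausible way to write this measure down is sketched below. The notation is assumed here for illustration and does not come from the release: q is the estimator, p ranges over all distributions, x_1^n is an observed sequence of length n, R_n is the worst-case ratio at length n, and R^* is the per-symbol attenuation. The published article may define the quantity over a derived representation of the sequence rather than the raw sequence itself.

```latex
R_n(q) = \max_{x_1^n} \frac{\max_{p}\, p(x_1^n)}{q(x_1^n)},
\qquad
R^*(q) = \limsup_{n \to \infty} \bigl( R_n(q) \bigr)^{1/n}
```

Under this reading, a per-symbol attenuation of 1 means that, as sequences grow longer, the estimator loses essentially nothing per symbol compared with the best distribution chosen in hindsight.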
The UCSD researchers show that intuitive estimators, such as empirical frequency, can attenuate the probability of a symbol by an arbitrary amount. They also prove that Good-Turing performs well in general. While it can attenuate the probability of symbols by a factor of 1.39, it never attenuates by a factor of more than 2. Motivated by these observations, they derived an estimator whose attenuation is 1. This means that as the length of any sequence increases, the probability assigned to each symbol by the new estimator is as high as that assigned to it by any distribution.
"While there is a considerable amount of work to be done in simplifying and further improving the new estimator," concluded Orlitsky, "we hope that this new framework will eventually improve language modeling and hence lead to better speech recognition and data mining software."
"Always Good Turing: Asymptotically Optimal Probability Estimation," Science Magazine. http://www.sciencemag.org/
Media Contact:
Doug Ramsey (858) 822-5825 dramsey@ucsd.edu