September 2010

SLMPIME is a Statistical Language Model based Pinyin Input Method Editor (基於統計語言模型的拼音輸入法). It is a tool that converts a pinyin sequence into Chinese sentences using a Hidden Markov Model and the Viterbi algorithm.

More than 90% of Chinese people use a pinyin IME to input Chinese characters, but the conversion quality often falls short of what users need. The reason is that different Chinese characters, and even whole phrases, can share the same pronunciation. For example, the pinyin "wo shi" may be "我是 (I am)" or "臥室 (bedroom)", and "zhong guo ren" may be "中國人 (Chinese people)" or "種果人 (a person who grows fruit)". When a user inputs "wo shi zhong guo ren", the pinyin alone does not tell the IME which combination the user means, although any Chinese speaker knows it means "我是中國人". This is where a Statistical Language Model (SLM) comes in handy. For every pair of phrases, the SLM records the probability that one follows the other; this is called a 2-gram. Similarly, we can compute probabilities for all triples (3-gram) and so on (N-gram). With the N-gram information, we can calculate which combination is the most probable and output it.
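To make the 2-gram idea concrete, here is a minimal Python sketch of a Viterbi-style search over phrase candidates. It is not the SLMPIME code: the names LEXICON, BIGRAM, FLOOR and decode are hypothetical, and the probabilities are made-up numbers chosen only for illustration, not values from a trained model.

```python
import math

# Hypothetical toy lexicon: pinyin syllables -> candidate phrases.
LEXICON = {
    ("wo", "shi"): ["我是", "臥室"],
    ("zhong", "guo", "ren"): ["中國人", "種果人"],
}

# Made-up bigram probabilities P(next | prev); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "我是"): 0.6,
    ("<s>", "臥室"): 0.4,
    ("我是", "中國人"): 0.5,
    ("我是", "種果人"): 0.01,
    ("臥室", "中國人"): 0.01,
    ("臥室", "種果人"): 0.01,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen phrase pairs


def decode(syllables):
    """Return the most probable sentence for a list of pinyin syllables."""
    n = len(syllables)
    # best[i] maps the last phrase ending at syllable i -> (log prob, path).
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(n):
        for prev, (score, path) in best[i].items():
            for j in range(i + 1, n + 1):
                for phrase in LEXICON.get(tuple(syllables[i:j]), []):
                    p = BIGRAM.get((prev, phrase), FLOOR)
                    cand = (score + math.log(p), path + [phrase])
                    # Keep only the best-scoring path per (position, phrase).
                    if phrase not in best[j] or cand[0] > best[j][phrase][0]:
                        best[j][phrase] = cand
    if not best[n]:
        return None  # no phrase sequence covers the whole input
    return "".join(max(best[n].values(), key=lambda t: t[0])[1])


print(decode("wo shi zhong guo ren".split()))
```

With these toy numbers, "我是" followed by "中國人" scores 0.6 × 0.5 = 0.3, far above any other combination, so the sketch selects "我是中國人". A 3-gram model would work the same way, but would condition each phrase on the previous two.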

There is also a second problem: pinyin hyphenation, that is, splitting the input into syllables. Users usually do not type a separator between pinyin syllables, so the IME may split the input incorrectly. For example, "fangan" can be segmented into "fang an" or "fan gan". The former corresponds to "方案" while the latter corresponds to "反感". SLMPIME does not cut the pinyin sequence into syllables arbitrarily. It searches all the syllable paths and builds a "syllable graph". There are more details on the implementation.
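The idea of searching every syllable path can be sketched in Python as follows. This is only an illustration under stated assumptions: SYLLABLES is a tiny hypothetical subset of the real pinyin syllable table, and segmentations is a hypothetical helper that lists the paths rather than building the graph structure SLMPIME actually uses.

```python
from functools import lru_cache

# Tiny hypothetical subset of the pinyin syllable table.
SYLLABLES = frozenset(
    {"fa", "fan", "fang", "a", "an", "ang", "ga", "gan", "gang"}
)
MAX_LEN = 6  # the longest pinyin syllables, such as "zhuang", have 6 letters


def segmentations(text):
    """Return every way to split `text` into syllables from SYLLABLES."""

    @lru_cache(maxsize=None)
    def from_pos(start):
        # All syllable paths covering text[start:], as tuples of syllables.
        if start == len(text):
            return ((),)
        results = []
        for end in range(start + 1, min(len(text), start + MAX_LEN) + 1):
            piece = text[start:end]
            if piece in SYLLABLES:
                for rest in from_pos(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(path) for path in from_pos(0)]


for path in segmentations("fangan"):
    print(" ".join(path))
```

For "fangan" this yields the two paths "fan gan" and "fang an"; splits that begin with "fa" are dropped because the remaining "ngan" cannot be completed. Each surviving path can then be scored by the language model instead of the IME committing to one split up front.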

In fact, SLMPIME is only a model. It is still some distance from being a product-level IME. In other words, I did this project for learning purposes (or just for fun).

Download all the source code and data (slm_based_pinyin_ime.7z)