Introduction
This post introduces the bigram model for word segmentation.
The bigram model conditions each word on the previous one, which makes more sense in terms of context. The maximum-likelihood estimate is:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$$

and the probability of a segmentation $w_1 \dots w_n$ is approximated as $\prod_{i=1}^{n} P(w_i \mid w_{i-1})$.
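As a concrete sketch, the counts can be collected from a pre-segmented corpus. The snippet below is illustrative, not the repository's code; the toy corpus and the names `unigram_counts`, `bigram_counts`, and `p_ml` are assumptions made here for demonstration.

```python
from collections import Counter

# A minimal sketch of estimating bigram probabilities, assuming a toy
# pre-segmented corpus; all names here are illustrative.
corpus = [
    ["赫尔辛基大学", "在", "芬兰"],
    ["大学", "在", "赫尔辛基"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (prev, cur) for sent in corpus for prev, cur in zip(sent, sent[1:])
)

def p_ml(cur, prev):
    """Maximum-likelihood estimate P(cur | prev) = C(prev cur) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(p_ml("芬兰", "在"))  # 0.5 with this toy corpus
```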
Witten-Bell smoothing
This estimate has a problem: a bigram that never occurs in the training data gets probability zero, which zeroes out every segmentation containing it. To solve this, we need smoothing, and interpolation is introduced here.
This smoothing trick combines the bigram and unigram estimates. There are two coefficients, $\lambda$ and $1-\lambda$, weighting the two estimates:

$$P_{WB}(w_i \mid w_{i-1}) = \lambda(w_{i-1})\, P_{ML}(w_i \mid w_{i-1}) + \bigl(1-\lambda(w_{i-1})\bigr)\, P(w_i)$$

Witten-Bell sets the coefficient from the number of distinct word types $T(w_{i-1})$ observed after $w_{i-1}$:

$$1-\lambda(w_{i-1}) = \frac{T(w_{i-1})}{C(w_{i-1}) + T(w_{i-1})}$$

so a history that is followed by many different words leans more heavily on the unigram.
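Continuing the sketch above (and reusing its `unigram_counts`, `bigram_counts`, and `p_ml`), Witten-Bell interpolation might look like this; again an illustration, not the repository's implementation:

```python
# A sketch of Witten-Bell interpolation, building on the counts from the
# previous snippet; followers[prev] corresponds to T(w_{i-1}) in the equation.
followers = {}
for prev, cur in bigram_counts:
    followers.setdefault(prev, set()).add(cur)

total_tokens = sum(unigram_counts.values())

def p_wb(cur, prev):
    """Witten-Bell smoothed P(cur | prev)."""
    p_uni = unigram_counts[cur] / total_tokens
    n = unigram_counts[prev]              # C(prev): tokens of the history
    t = len(followers.get(prev, ()))      # T(prev): distinct types after it
    if n + t == 0:                        # unseen history: fall back to unigram
        return p_uni
    lam = n / (n + t)                     # weight on the bigram estimate
    return lam * p_ml(cur, prev) + (1 - lam) * p_uni
```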
So far the probabilities are calculable. The workflow for the example sentence 赫尔辛基大学在芬兰 ("The University of Helsinki is in Finland") is to lay out every candidate word as an edge in a graph and score those edges with the smoothed probabilities:
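A minimal sketch of this lattice-building step, assuming a small hypothetical dictionary (the real vocabulary would come from the training data):

```python
# Build the word lattice for the example sentence. Single characters are
# kept as a fallback so every position stays reachable; the dictionary and
# MAX_WORD_LEN are assumptions for this sketch.
dictionary = {"赫尔辛基", "赫尔辛基大学", "大学", "在", "芬兰"}
sentence = "赫尔辛基大学在芬兰"
MAX_WORD_LEN = 6

# edges[i] holds every candidate word that starts at character position i.
edges = {i: [] for i in range(len(sentence))}
for i in range(len(sentence)):
    for j in range(i + 1, min(i + MAX_WORD_LEN, len(sentence)) + 1):
        word = sentence[i:j]
        if word in dictionary or len(word) == 1:
            edges[i].append(word)
```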
Finding the best segmentation
The next step is to search for the optimal path in this graph, with the probabilities serving as edge weights. Since the graph is a DAG over character positions, a simple dynamic-programming (Viterbi-style) search finds the highest-scoring path.
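A sketch of that search over the lattice built above, using log probabilities so scores add along a path; the OOV penalty of `-1e9` is one simple choice, not necessarily what the repository does:

```python
import math

# Dynamic-programming search over the lattice, scoring edges with log p_wb.
def segment(sentence, edges):
    n = len(sentence)
    NEG_INF = float("-inf")
    # best[i] = (score of best path covering sentence[:i], backpointer, word)
    best = [(NEG_INF, None, None)] * (n + 1)
    best[0] = (0.0, None, "<s>")
    for i in range(n):
        score_i, _, prev_word = best[i]
        if score_i == NEG_INF:
            continue
        for word in edges[i]:
            p = p_wb(word, prev_word)
            score = score_i + (math.log(p) if p > 0 else -1e9)
            j = i + len(word)
            if score > best[j][0]:
                best[j] = (score, i, word)
    # Walk backpointers to recover the best segmentation.
    words, i = [], n
    while i > 0:
        _, i_prev, word = best[i]
        words.append(word)
        i = i_prev
    return words[::-1]

print(segment(sentence, edges))  # e.g. ['赫尔辛基大学', '在', '芬兰']
```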
In practice, adding regular expressions to recognize times and numbers is a good idea, as it boosts performance on OOV (out-of-vocabulary) words; a sketch follows below. The implementation is in the repository; this post is just complementary to the code.
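One way such a regex pre-pass could work is to mark protected spans before building the lattice; the patterns below are illustrative, not the repository's actual rules:

```python
import re

# Hypothetical patterns for this sketch: tag runs of digits, percentages,
# and clock times as single tokens before lattice construction.
NUM_RE = re.compile(r"\d+(?:\.\d+)?%?")            # 3, 3.14, 80%
TIME_RE = re.compile(r"\d{1,2}:\d{2}(?::\d{2})?")  # 12:30, 12:30:45

def protect_spans(sentence):
    """Return (start, end) spans that should be kept as single words."""
    spans = []
    for pattern in (TIME_RE, NUM_RE):  # times first so "12:30" wins over "12"
        for m in pattern.finditer(sentence):
            if not any(s <= m.start() < e for s, e in spans):
                spans.append((m.start(), m.end()))
    return spans

print(protect_spans("会议在12:30开始，票价80%折扣"))  # [(3, 8), (13, 16)]
```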
Here we can also see some drawbacks of this approach:
- It relies on a large amount of training data.
- An N-gram model usually performs better as N grows, but it also becomes more computationally expensive.
- It still cannot handle unseen words that never appear in the training data. (Regular expressions help somewhat but cannot manage, for example, person names.)
An HMM-based model seems more effective at tackling OOV words.
Reference: Daniel Jurafsky and James H. Martin, *Speech and Language Processing*.