稍有闲暇,找来一篇很早就保存下来的好文--刘未鹏的《数学之美番外篇:平凡而神奇的贝叶斯方法》拿来研读。读到一半,发现其中有一篇Peter Norvig原创、徐宥翻译的《怎样写一个拼写检查器》很有意思。
原理:概率论之朴素贝叶斯
原文链接:http://norvig.com/spell-correct.html
中文链接:http://blog.csdn.net/nirendao/article/details/50640139
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
import re, collections def words(text): return re.findall('[a-z]+', text.lower()) def train(features): model = collections.defaultdict(lambda: 1) for f in features: model[f] += 1 return model NWORDS = train(words(file('big.txt').read())) alphabet = 'abcdefghijklmnopqrstuvwxyz' def edits1(word): n = len(word) return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion def known_edits2(word): return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS) def known(words): return set(w for w in words if w in NWORDS) def correct(word): candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] return max(candidates, key=lambda w: NWORDS[w]) >>> correct('speling') 'spelling' >>> correct('korrecter') 'corrector' |