Counting Word Frequencies in a txt File with Python

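Suppose a text file, word.txt, contains English words separated by spaces or newlines. The task is to count how often each word occurs and save the results to a file.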

The approach here uses a list comprehension and a set to count the word frequencies.
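As a minimal sketch of that idea, the snippet below counts words in a short in-memory string rather than a file. The sample text and the variable names are assumptions made for illustration and are not part of the original script.

```python
# Hypothetical sample text standing in for the contents of the file.
text = "the cat sat on the mat\nthe dog sat"

# split() with no argument splits on any run of whitespace (spaces, newlines).
words = text.split()

# set(words) yields each distinct word once; words.count(w) counts its occurrences.
counts = [(w, words.count(w)) for w in set(words)]

# Sort by descending count, then alphabetically to break ties.
counts.sort(key=lambda t: (-t[1], t[0]))

for word, n in counts:
    print(word, n)
```

Calling words.count(w) once per distinct word rescans the whole list each time, which is fine for small files but becomes slow for large ones.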

Code:

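# Path to the input text file; change it to wherever your file is stored.
filename = 'C:\\Users\\dell\\desktop\\big.txt'

with open(filename) as f:
    s = f.readlines()

# Collect every word. split() with no argument splits on any run of whitespace,
# which avoids the empty strings that split(' ') yields for repeated spaces or blank lines.
words = []
for line in s:
    words.extend(line.split())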
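# Aligning mixed Chinese and English text; see http://bbs.fishc.com/thread-67465-1-1.html (second reply)
# Padding with str.format misaligns columns when Chinese characters are mixed with
# letters or digits, because it counts characters rather than display width.
# alignment() pads a mixed Chinese/English string to a fixed display width.
def alignment(str1, space=8, align='left'):
    # In GB2312 a Chinese character takes 2 bytes and an ASCII character 1,
    # so the byte length approximates the display width.
    length = len(str1.encode('gb2312'))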
    space = space - length if space >= length else 0
    if align in ['left', 'l', 'L', 'Left', 'LEFT']:
        str1 = str1 + ' ' * space
    elif align in ['right', 'r', 'R', 'Right', 'RIGHT']:
        str1 = ' ' * space + str1
    elif align in ['center', 'c', 'C', 'Center', 'CENTER', 'centre']:
        str1 = ' ' * (space // 2) + str1 + ' ' * (space - space // 2)
    return str1


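# geshi ("format") builds one output row from three columns: index, word, frequency.
def geshi(a, b, c):
    return alignment(str(a)) + alignment(str(b), 18) + alignment(str(c)) + '\n'

# Header row: 序号 = index, 词 = word, 频率 = frequency.
w_s = geshi('序号', '词', '频率')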

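# Count each distinct word, then sort by descending frequency and alphabetically within ties.
wordcount = sorted([(w, words.count(w)) for w in set(words)], key=lambda t: (-t[1], t[0]))

# enumerate supplies the rank directly, avoiding a list search for every row.
for i, (w, c) in enumerate(wordcount, start=1):
    w_s += geshi(i, w, c)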

# Output path; on Windows a leading backslash means the root of the current drive.
writefile = '\\ar.txt'
with open(writefile, 'w') as wf:
    wf.write(w_s)
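
For large files, the repeated words.count(w) calls become slow because each call scans the full list. One possible alternative, which the script above does not use, is collections.Counter from the standard library: it tallies all words in a single pass. The sketch below is self-contained; the helper name count_words and the sample lines are made up for illustration.

```python
from collections import Counter


def count_words(lines):
    """Return (word, count) pairs sorted by descending count, then by word."""
    counter = Counter()
    for line in lines:
        # split() handles repeated spaces and trailing newlines.
        counter.update(line.split())
    return sorted(counter.items(), key=lambda t: (-t[1], t[0]))


# Hypothetical sample input; in practice this would be the lines read from the file.
sample = ["a b  a\n", "c a b\n"]
for rank, (word, n) in enumerate(count_words(sample), start=1):
    print(rank, word, n)
```

The sorted result has the same shape as wordcount in the script, so it could be passed to the same formatting loop.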
