
Python Exercise Book: 0004

Posted on 2018-05-28, updated on 2018-05-31, filed under Python Exercise Book

========================

Problem

    Given any plain-text file in English, count how many times each word occurs in it.

Analysis

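Start by reading the file's contents, then split the text into words while handling a few details: letter case (Text and text count as the same word), contractions (I’m counts as a single word), hyphenation (a word broken at the end of a line and continued on the next, such as te-xt), and punctuation (.,?:"). Once the words are separated cleanly, count how often each one occurs.
Most of this work is done with regular expressions.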
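
The normalization steps described above can be sketched on an in-memory string before touching any file. This is a minimal illustration, not a complete tokenizer: the helper name `normalize` and the sample sentence are made up for this example, and only the punctuation marks listed above are handled.

```python
import re

def normalize(text):
    """Return the list of lowercase words in text (hypothetical helper)."""
    text = text.lower()
    # Rejoin a word split by a hyphen at the end of a line: "te-\nxt" -> "text"
    text = re.sub(r'-\s*\n\s*', '', text)
    # Replace the listed punctuation marks with spaces
    text = re.sub(r'[,.!?:"]', ' ', text)
    # Split on any run of whitespace
    return text.split()

sample = 'Text and text. I\'m reading a te-\nxt: "Hello, world!"'
print(normalize(sample))
# ['text', 'and', 'text', "i'm", 'reading', 'a', 'text', 'hello', 'world']
```

The apostrophe is deliberately left out of the punctuation set, so a contraction such as I'm stays a single word.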

Code

Counting word frequencies with a dict

import re
with open("text.txt", "r") as f:
    # Read the whole file and convert it to lowercase
    text = f.read().lower()

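# Replace punctuation with spaces
text = re.sub(r'[,.!?:"]', ' ', text)

# Rejoin words hyphenated across a line break (te-\nxt becomes text),
# then drop any remaining hyphens
text = re.sub(r'-\s*\n\s*', '', text)
text = text.replace('-', '')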

# Store each word's count as the value under that word's key
counts = {}
for word in text.split():
    if word not in counts:
        counts[word] = 1  # first occurrence
    else:
        counts[word] += 1

# Sort by frequency, highest first
result = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Print the results as (word, count) pairs
for pair in result:
    print(pair)
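
As a side note, the counting loop can also be written with `dict.get`, which supplies a default of 0 for words not seen yet. The sketch below is one possible variant rather than part of the original solution; the function name `count_words` and the sample phrase are assumptions made for illustration.

```python
def count_words(words):
    """Count occurrences of each word in an iterable (hypothetical helper)."""
    counts = {}
    for word in words:
        # get() returns 0 for an unseen word, so the first occurrence becomes 1
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("the cat and the hat".split())
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
# [('the', 2), ('cat', 1), ('and', 1), ('hat', 1)]
```

Because `sorted` is stable, words with equal counts keep the order in which they first appeared in the text.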

Counting with Counter from the collections module

import re
from collections import Counter
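with open("text.txt", "r") as f:
    text = f.read().lower()
text = re.sub(r'[,.!?:"]', ' ', text)
# Rejoin words hyphenated across a line break, then drop remaining hyphens
text = re.sub(r'-\s*\n\s*', '', text)
text = text.replace('-', '')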

# Count word frequencies
counts = Counter(text.split())
# Print (word, count) pairs, most frequent first
print(counts.most_common())
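
For a quick feel of how `Counter` behaves, here is a small sketch on an in-memory sentence (the sentence is invented for this example). It shows that `most_common(n)` limits the output to the n most frequent words, and that looking up a word that never appeared returns 0 instead of raising `KeyError`.

```python
from collections import Counter

counts = Counter("the cat and the hat and the bat".split())

# Only the two most frequent words
print(counts.most_common(2))  # [('the', 3), ('and', 2)]

# A missing word has a count of 0
print(counts["dog"])  # 0
```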

