Turning Group Chat Logs into a Word Cloud: Python Learning [001] – 飞翔的菜鸟


2023-10-15 10:10 | Notes | KUN


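0. Yesterday, while going through my hard drive, I found some source code I had downloaded a while ago, written by people far more skilled than me, that turns chat logs into a word cloud. I had barely used Python at the time, so the code just sat there. On a whim I decided to start learning Python, and this seemed like a good place to begin!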

1. First, Python has to import the packages it uses, much like loading libraries in R

import re
import jieba
from collections import Counter
import wordcloud
import matplotlib.pyplot as plt

2. Next, read in the file. Here I use a txt file on the E: drive, which is simply the HTML file exported by the Android app 《微信，留痕》, renamed to .txt

with open('E:/all.txt', 'r', encoding='utf-8') as f:
    text = f.read()

3. Use a regular expression to extract everything that matches

result_list = []
pattern = re.compile(r'发送者：BoJack.*?<p class="content">(.*?)<\/p>', flags=re.DOTALL)
for match in pattern.findall(text):
    result_list.append(match)

# Here I am working with messages from a group chat. The HTML file exported by 《微信，留痕》 has the following format:
<p>发送者：BoJack<span class="time">时间：2023/07/15 16:41:09</span></p>
<hr class="line">
<p class="content">[旺柴][旺柴][旺柴]</p>
</div>
<div class="item">

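The code above extracts the group chat messages whose header reads "发送者：BoJack" (sender: BoJack). If your export is not from a group chat, skip this step.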
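To see what the pattern captures, here is a minimal sketch that runs it against a small sample in the export format shown above. The second sender ("Alice") and the extra messages are made up for illustration; only the first item comes from the original sample.

```python
import re

# First item copied from the export format above; the "Alice" item and the
# last BoJack item are hypothetical, added to show what gets skipped.
sample = '''<div class="item">
<p>发送者：BoJack<span class="time">时间：2023/07/15 16:41:09</span></p>
<hr class="line">
<p class="content">[旺柴][旺柴][旺柴]</p>
</div>
<div class="item">
<p>发送者：Alice<span class="time">时间：2023/07/15 16:42:00</span></p>
<hr class="line">
<p class="content">hello</p>
</div>
<div class="item">
<p>发送者：BoJack<span class="time">时间：2023/07/15 16:43:30</span></p>
<hr class="line">
<p class="content">second message</p>
</div>'''

# re.DOTALL lets "." cross line breaks, so the lazy ".*?" can skip from the
# sender line to the nearest following content paragraph.
pattern = re.compile(r'发送者：BoJack.*?<p class="content">(.*?)</p>', flags=re.DOTALL)
messages = pattern.findall(sample)
print(messages)
```

Since findall already returns a list, the result can be used directly as result_list. Note that the pattern matches on the prefix "发送者：BoJack", so a different sender whose name merely starts with "BoJack" would presumably be picked up as well.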

4. Join the matched content together and clean the text

all_text = ''.join(result_list)
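# Add any other unwanted tokens to this pattern.
# The timestamp alternative comes first so that the literal "2023" does not
# match before it and leave the rest of the timestamp behind.
s = re.compile(r'\d{4}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}|消息|2023|div|怎么|\[|\]|span|Emm|来自')
clean_text = s.sub('', all_text)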
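One detail worth checking in a cleaning pattern like this is the order of the alternatives: Python's re module tries them left to right and takes the first one that matches at a given position. The sketch below uses a shortened, hypothetical pattern and input string to show the difference.

```python
import re

text = '2023/07/15 16:41:09 [旺柴] hello'  # hypothetical input

# Literal "2023" listed first: it wins at the start of the timestamp,
# so only the year is removed and the rest of the timestamp stays.
year_first = re.compile(r'2023|\[|\]|\d{4}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}')

# Timestamp listed first: the whole timestamp is removed in one match.
date_first = re.compile(r'\d{4}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}|2023|\[|\]')

year_first_result = year_first.sub('', text)
date_first_result = date_first.sub('', text)
print(repr(year_first_result))
print(repr(date_first_result))
```

Writing the pattern as a raw string (the r prefix) also keeps sequences such as \d and \[ from being treated as string escapes, which newer Python versions warn about.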

Alternatively, load the unwanted words from a file and filter them out

# Load the stop words (a set makes the membership test below fast)
with open('E:/stopwords.txt', 'r', encoding='utf-8') as ss:
    stopwords = set(ss.read().splitlines())

# Tokenize the cleaned text and keep the tokens that are not stop words
word_L = []
words = jieba.cut(clean_text)
for word in words:
    if word not in stopwords and word != '\n' and len(word) > 1:
        word_L.append(word)
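The filtering step can be tried without jieba or a stop word file. In this minimal sketch, filter_tokens is a hypothetical helper, and the token list stands in for what jieba.cut would yield; the real tokens will differ.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens longer than one character that are not stop words."""
    stopwords = set(stopwords)
    # The length check also drops single-character tokens such as '\n'.
    return [t for t in tokens if len(t) > 1 and t not in stopwords]

# Hypothetical tokens, standing in for the output of jieba.cut
tokens = ['今天', '的', '天气', '\n', '怎么', '今天', 'ok']
kept = filter_tokens(tokens, stopwords=['怎么'])
print(kept)
```

Duplicates are kept on purpose, because the next step counts how often each word occurs.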

5. Count word frequencies

word_count = Counter()
for word in word_L:
    if len(word) >= 2:  # only count words with at least 2 characters
        word_count[word] += 1

6. Take the 100 most frequent words

top100_words = word_count.most_common(100)
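As a small illustration of what most_common returns, here is a sketch with a made-up word list. It also shows that Counter can be built directly from a list, which gives the same counts as incrementing in a loop.

```python
from collections import Counter

# Hypothetical tokens; in the article these come from the filtered list
words = ['天气', '今天', '天气', '吃饭', '今天', '天气']

counts = Counter(words)       # count every word in one call
top2 = counts.most_common(2)  # list of (word, count), highest count first
print(top2)
```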

7. Write the results to a txt file

with open('bboo.txt', 'w', encoding='utf-8-sig') as f:
    for word, count in top100_words:
        f.write(f'{word}: {count}\n')
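The 'utf-8-sig' encoding writes a byte order mark (BOM) at the start of the file, which some Windows programs use to recognize the file as UTF-8. The sketch below writes a couple of made-up counts to a temporary file, checks for the BOM, and reads the lines back; the file name and counts are assumptions for illustration.

```python
import os
import tempfile

top_words = [('天气', 3), ('今天', 2)]  # hypothetical counts

path = os.path.join(tempfile.mkdtemp(), 'top_words.txt')
with open(path, 'w', encoding='utf-8-sig') as f:
    for word, count in top_words:
        f.write(f'{word}: {count}\n')

# The raw bytes start with the UTF-8 BOM: EF BB BF
with open(path, 'rb') as f:
    has_bom = f.read().startswith(b'\xef\xbb\xbf')

# Reading with 'utf-8-sig' strips the BOM again
with open(path, 'r', encoding='utf-8-sig') as f:
    lines = f.read().splitlines()
print(has_bom, lines)
```

When reading such a file back with plain 'utf-8', the BOM would show up as an extra character at the start of the first line, so the matching encoding is the safer choice.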

8. Generate the word cloud

wc = wordcloud.WordCloud(
    width=800, height=600, background_color='white',
    font_path='E:/msyh.ttc'  # use the Microsoft YaHei font
)
wc.generate_from_frequencies(word_count)

9. Plot the word cloud

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Then export the word cloud as an image

wc.to_file('333.png')

Source code references:
https://www.52pojie.cn/thread-1154638-1-1.html
https://www.52pojie.cn/thread-1155758-1-1.html
https://www.52pojie.cn/thread-1791190-1-1.html
