小熊科技 | PyTorch Text: 01. Chatbot Tutorial (Part 2)
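2.1 Create a formatted data file
For convenience, we will create a well-formatted data file in which each line contains a query sentence and a response sentence, separated by a tab character.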
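The following functions make it easier to parse the raw movie_lines.txt data file.
loadLines: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
loadConversations: groups the line fields from loadLines into conversations based on movie_conversations.txt
extractSentencePairs: extracts pairs of sentences from the conversations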
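To make the field layout concrete, the sketch below splits a single raw line into the five fields listed above. The helper name `parse_line`, the constants `SEPARATOR` and `FIELDS`, and the sample line itself are assumptions made for illustration; they are not part of the tutorial code.

```python
# Field separator used in the raw corpus files
SEPARATOR = " +++$+++ "

# Field names, in the order they appear on each line of movie_lines.txt
FIELDS = ["lineID", "characterID", "movieID", "character", "text"]

# Illustrative sample line (assumed for this sketch, not read from disk)
sample_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"


def parse_line(line, fields):
    """Split one raw line on the separator and map each value to its field name."""
    values = line.rstrip("\n").split(SEPARATOR)
    return dict(zip(fields, values))


record = parse_line(sample_line, FIELDS)
print(record["lineID"], record["character"], record["text"])
```

Keying the resulting dictionaries by `lineID` is what later allows a conversation, which only stores a list of line IDs, to be reassembled into its full text.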
import ast

# Split each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines

# Group the line fields from `loadLines` into conversations based on *movie_conversations.txt*
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            # ast.literal_eval parses the list literal without executing arbitrary code, unlike eval
            lineIds = ast.literal_eval(convObj["utteranceIDs"].strip())
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations

# Extract pairs of sentences from the conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):  # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs

Now we call these functions to create the file, which we name formatted_movie_lines.txt.
import codecs
import csv
import os

# Note: `corpus` (the corpus directory path) and `printLines` are not defined in this section;
# they are assumed to come from the earlier part of the tutorial.

# Define the path to the new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize the lines dict, the conversations list, and the field IDs
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load the lines and process the conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write the new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)

Output:
Processing corpus...
Loading conversations...
Writing newly formatted file...
Sample lines from file:
b"Can we make this quick?Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.Please.\n"
b"Not the hacking and gagging and spitting part.Please.\tOkay... then how 'bout we try out some French cuisine.Saturday?Night?\n"
b"You're asking me out.That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
b"Unsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\n'
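Each line of the formatted file holds one query and one response separated by a tab, so it can be read back with the same csv settings used for writing. The following is a minimal sketch under stated assumptions: it writes two of the sample pairs shown above to a temporary file (the file name `formatted_sample.txt` and the variable names are hypothetical) rather than reading the real formatted_movie_lines.txt.

```python
import csv
import os
import tempfile

# Two query/response pairs taken from the sample output above
pairs = [
    ["Gosh, if only we could find Kat a boyfriend...", "Let me see what I can do."],
    ["No, no, it's my fault -- we didn't have a proper introduction ---", "Cameron."],
]

# Hypothetical stand-in for the formatted data file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "formatted_sample.txt")

# Write the pairs with a tab delimiter and "\n" line terminator
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)

# Read the file back: each row is a [query, response] list
with open(path, "r", encoding="utf-8", newline="") as f:
    loaded = [row for row in csv.reader(f, delimiter="\t")]

for query, response in loaded:
    print(f"Q: {query}\nA: {response}\n")
```

Using the csv module on both sides, rather than a bare `str.split("\t")`, keeps reading consistent with writing in case a sentence contains characters that the writer had to quote.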
