小熊科技 | PyTorch Text: 01. Chatbot Tutorial (Part 2)

2.1 Create the formatted data file

For convenience, we'll create a nicely formatted data file in which each line contains a tab-separated query sentence and response sentence pair.
The following functions facilitate parsing of the raw movie_lines.txt data file.
loadLines: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
loadConversations: groups the fields of lines from loadLines into conversations, based on movie_conversations.txt
extractSentencePairs: extracts pairs of sentences from the conversations
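To make the field layout concrete, here is a minimal sketch of how one raw corpus line is split. The sample line is representative of the corpus format (fields separated by the literal token " +++$+++ "), not output produced by this tutorial:

# Illustrative only: a representative raw line from movie_lines.txt.
# Fields are separated by the literal token " +++$+++ ".
raw = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = ["lineID", "characterID", "movieID", "character", "text"]
lineObj = dict(zip(fields, raw.split(" +++$+++ ")))
print(lineObj["lineID"], "->", lineObj["text"])  # L1045 -> They do not!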
# Splits each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines

# Groups fields of lines from `loadLines` into conversations based on movie_conversations.txt
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            lineIds = eval(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations

# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):
            # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
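One caveat: loadConversations parses the utteranceIDs string with Python's built-in eval, which will execute arbitrary expressions. A safer drop-in replacement (a suggestion, not part of the original tutorial) is ast.literal_eval, which only accepts Python literals:

import ast

# Safer alternative to eval() for parsing a string like "['L598485', 'L598486']"
lineIds = ast.literal_eval("['L598485', 'L598486']")
print(lineIds)  # ['L598485', 'L598486']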
Now we'll call these functions and create the file. We'll call it formatted_movie_lines.txt.

# Define path to the new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)

Output:
Processing corpus...
Loading conversations...
Writing newly formatted file...
Sample lines from file:
b"Can we make this quick?Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.Please.\n"
b"Not the hacking and gagging and spitting part.Please.\tOkay... then how 'bout we try out some French cuisine.Saturday?Night?\n"
b"You're asking me out.That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
b"Unsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\n'
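The sample pairs above print as raw byte strings because printLines, defined in Part 1 of this tutorial, reads the file in binary mode. For reference, a minimal sketch consistent with that output (the n=10 default is an assumption):

def printLines(file, n=10):
    # Read in binary mode, so lines print as b"..." byte strings
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)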

