小熊科技 | PyTorch Text: 01. Chatbot Tutorial (Part 2)

2.1 Create the formatted data file

For convenience, we'll create a nicely formatted data file in which each line contains a tab-separated query sentence and response sentence pair.
The following functions facilitate parsing of the raw movie_lines.txt data file.
loadLines: splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
loadConversations: groups the fields of lines from loadLines into conversations, based on movie_conversations.txt
extractSentencePairs: extracts pairs of sentences from the conversations
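To make the field layout concrete, here is a minimal sketch of how one raw corpus line is split. The sample line is representative of the corpus format (fields separated by the literal token " +++$+++ "), not output produced by this tutorial:

# Illustrative only: a representative raw line from movie_lines.txt.
# Fields are separated by the literal token " +++$+++ ".
raw = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = ["lineID", "characterID", "movieID", "character", "text"]
lineObj = dict(zip(fields, raw.split(" +++$+++ ")))
print(lineObj["lineID"], "->", lineObj["text"])  # L1045 -> They do not!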
# Splits each line of the file into a dictionary of fields
def loadLines(fileName, fields):
    lines = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            lineObj = {}
            for i, field in enumerate(fields):
                lineObj[field] = values[i]
            lines[lineObj['lineID']] = lineObj
    return lines

# Groups fields of lines from `loadLines` into conversations based on movie_conversations.txt
def loadConversations(fileName, lines, fields):
    conversations = []
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            values = line.split(" +++$+++ ")
            # Extract fields
            convObj = {}
            for i, field in enumerate(fields):
                convObj[field] = values[i]
            # Convert string to list (convObj["utteranceIDs"] == "['L598485', 'L598486', ...]")
            lineIds = eval(convObj["utteranceIDs"])
            # Reassemble lines
            convObj["lines"] = []
            for lineId in lineIds:
                convObj["lines"].append(lines[lineId])
            conversations.append(convObj)
    return conversations

# Extracts pairs of sentences from conversations
def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations:
        # Iterate over all the lines of the conversation
        for i in range(len(conversation["lines"]) - 1):
            # We ignore the last line (no answer for it)
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            # Filter wrong samples (if one of the lists is empty)
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
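One caveat: loadConversations parses the utteranceIDs string with Python's built-in eval, which will execute arbitrary expressions. A safer drop-in replacement (a suggestion, not part of the original tutorial) is ast.literal_eval, which only accepts Python literals:

import ast

# Safer alternative to eval() for parsing a string like "['L598485', 'L598486']"
lineIds = ast.literal_eval("['L598485', 'L598486']")
print(lineIds)  # ['L598485', 'L598486']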
Now we'll call these functions and create the file. We'll call it formatted_movie_lines.txt.

# Define path to the new file
datafile = os.path.join(corpus, "formatted_movie_lines.txt")

delimiter = '\t'
# Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

# Initialize lines dict, conversations list, and field ids
lines = {}
conversations = []
MOVIE_LINES_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
MOVIE_CONVERSATIONS_FIELDS = ["character1ID", "character2ID", "movieID", "utteranceIDs"]

# Load lines and process conversations
print("\nProcessing corpus...")
lines = loadLines(os.path.join(corpus, "movie_lines.txt"), MOVIE_LINES_FIELDS)
print("\nLoading conversations...")
conversations = loadConversations(os.path.join(corpus, "movie_conversations.txt"),
                                  lines, MOVIE_CONVERSATIONS_FIELDS)

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

# Print a sample of lines
print("\nSample lines from file:")
printLines(datafile)

Output:
Processing corpus...
Loading conversations...
Writing newly formatted file...
Sample lines from file:
b"Can we make this quick?Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.Please.\n"
b"Not the hacking and gagging and spitting part.Please.\tOkay... then how 'bout we try out some French cuisine.Saturday?Night?\n"
b"You're asking me out.That's so cute. What's your name again?\tForget it.\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.My sister.I can't date until she does.\tSeems like she could get a date easy enough...\n"
b'Why?\tUnsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\n'
b"Unsolved mystery.She used to be really popular when she started high school, then it was just like she got sick of it or something.\tThat's a shame.\n"
b'Gosh, if only we could find Kat a boyfriend...\tLet me see what I can do.\n'
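The sample pairs above print as raw byte strings because printLines, defined in Part 1 of this tutorial, reads the file in binary mode. For reference, a minimal sketch consistent with that output (the n=10 default is an assumption):

def printLines(file, n=10):
    # Read in binary mode, so lines print as b"..." byte strings
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)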

