引领先锋 | Deep-Learning-Based Feature Extraction Methods for Text Data: Word2Vec (Part 1) (2)  Author: Dipanjan (DJ) Sarkar  Translated and compiled by: ronghu

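Feature Engineering Strategies
Let's look at some advanced strategies for processing text data and extracting meaningful features from it that can be used in downstream machine learning systems. We start by loading some basic dependencies and settings.

    import pandas as pd
    import numpy as np
    import re
    import nltk
    import matplotlib.pyplot as plt

    pd.options.display.max_colwidth = 200
    # Jupyter/IPython magic; omit when running as a plain script
    %matplotlib inline

Now we will take a few document corpora on which we will perform all of our analyses. For one of them, we reuse the corpus from the previous article. For ease of understanding, the code is repeated here.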

    corpus = ['The sky is blue and beautiful.',
              'Love this blue and beautiful sky!',
              'The quick brown fox jumps over the lazy dog.',
              "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
              'I love green eggs, ham, sausages and bacon!',
              'The brown fox is quick and the blue dog is lazy!',
              'The sky is very blue and the sky is very beautiful today',
              'The dog is lazy but the brown fox is quick!']
    labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

    corpus = np.array(corpus)
    corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
    corpus_df = corpus_df[['Document', 'Category']]

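Our toy corpus consists of documents from a few categories. The other corpus we will use in this article is The King James Version of the Bible, which is freely available from Project Gutenberg through the corpus module in nltk. We will load it in the next section. Before discussing feature engineering, we need to preprocess and normalize this text.
Text Preprocessing
There are many ways to clean and preprocess text data, and they were covered in the previous article. Since the focus of this article is feature engineering, as in the previous article we reuse a simple text preprocessor that removes special characters, extra whitespace, and digits, filters out stopwords, and lowercases the corpus.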

    wpt = nltk.WordPunctTokenizer()
    # requires the NLTK stopword list: nltk.download('stopwords')
    stop_words = nltk.corpus.stopwords.words('english')

    def normalize_document(doc):
        # remove special characters and digits; flags must be passed by keyword,
        # because the fourth positional argument of re.sub is count
        doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
        # lower case and strip surrounding whitespace
        doc = doc.lower()
        doc = doc.strip()
        # tokenize document
        tokens = wpt.tokenize(doc)
        # filter stopwords out of document
        filtered_tokens = [token for token in tokens if token not in stop_words]
        # re-create document from filtered tokens
        doc = ' '.join(filtered_tokens)
        return doc

    normalize_corpus = np.vectorize(normalize_document)

We now have a basic preprocessing pipeline ready; let's first apply it to our toy corpus.
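
Before running it, one detail of the regular-expression step is worth a quick check. In Python's re.sub, the fourth positional argument is count, not flags, so flags such as re.I need to be passed by keyword. The snippet below is a minimal sketch on a made-up string (text and pattern are illustrative and not taken from the article) and uses only the standard library.

```python
import re

# Illustrative inputs, not from the article.
text = "A1 b2 c3 d4"
pattern = r"[^a-z\s]"

# What a positional fourth argument means: it is `count`, not `flags`.
# int(re.I) == 2, so only the first two matches are replaced and
# case-insensitive matching is never switched on.
as_count = re.sub(pattern, "", text, count=int(re.I))

# Passing the same constant by keyword applies it as a flag:
# letters of either case are kept, every digit is removed.
as_flags = re.sub(pattern, "", text, flags=re.I)

print(repr(as_count))
print(repr(as_flags))
```

On short documents such as the toy corpus the two calls usually give the same result, because the accidental count is rarely reached; the difference tends to show up only on long inputs.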

    norm_corpus = normalize_corpus(corpus)
    norm_corpus

    Output
    ------
    array(['sky blue beautiful', 'love blue beautiful sky',
           'quick brown fox jumps lazy dog',
           'kings breakfast sausages ham bacon eggs toast beans',
           'love green eggs ham sausages bacon',
           'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
           'dog lazy brown fox quick'],
          dtype='

Now let's use nltk to load the other corpus, based on The King James Version of the Bible, and preprocess its text.
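
    from nltk.corpus import gutenberg
    from string import punctuation

    # requires the Gutenberg corpus: nltk.download('gutenberg')
    bible = gutenberg.sents('bible-kjv.txt')
    remove_terms = punctuation + '0123456789'

    # lowercase every token and drop punctuation and digit tokens
    norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]
    # join the tokens of each sentence back into one string
    norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]
    # normalize, then discard sentences that became empty
    norm_bible = filter(None, normalize_corpus(norm_bible))
    # keep only sentences with more than two tokens
    norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]

    print('Total lines:', len(bible))
    print('\nSample line:', bible[10])
    print('\nProcessed line:', norm_bible[10])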
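
To see what each step does without downloading the corpus, the same sequence can be sketched on a few hand-written sentences. Everything here is an assumption for illustration: sample_sents mimics the token-list shape returned by gutenberg.sents, STOP_WORDS is a tiny hand-picked set rather than NLTK's English list, and normalize and preprocess are hypothetical helpers, not the article's functions.

```python
import re
from string import punctuation

# Hypothetical stand-in for gutenberg.sents('bible-kjv.txt'):
# each sentence is a list of word and punctuation tokens.
sample_sents = [
    ['1', ':', '1', 'In', 'the', 'beginning', 'God', 'created',
     'the', 'heaven', 'and', 'the', 'earth', '.'],
    ['Genesis'],
    ['And', 'God', 'said', ',', 'Let', 'there', 'be', 'light', ':',
     'and', 'there', 'was', 'light', '.'],
]

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOP_WORDS = {'in', 'the', 'and', 'there', 'be', 'was'}

remove_terms = punctuation + '0123456789'


def normalize(doc):
    """Drop non-letters, lowercase, strip, and filter stopwords."""
    doc = re.sub(r'[^a-zA-Z\s]', '', doc).lower().strip()
    return ' '.join(tok for tok in doc.split() if tok not in STOP_WORDS)


def preprocess(sents):
    """Apply the same sequence of steps as the corpus code, on token lists."""
    # lowercase tokens and drop punctuation and digit tokens
    joined = [' '.join(w.lower() for w in sent if w not in remove_terms)
              for sent in sents]
    normalized = [normalize(doc) for doc in joined]
    # keep only sentences with more than two remaining tokens
    return [doc for doc in normalized if len(doc.split()) > 2]


processed = preprocess(sample_sents)
for line in processed:
    print(line)
```

The one-word sentence is dropped by the final length filter, which is why the processed corpus can have fewer lines than the raw one.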
