Natural Language Processing: Sentiment Analysis Example | Python Technology Exchange and Sharing

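Contents

Chapter 9 Natural Language Processing: Sentiment Analysis Example
9.1 Bag-of-Words (BOW) Model Example
9.2 Sentiment Analysis Example
9.2.1 Loading the Data
9.2.2 Data Preprocessing
9.2.3 Training the Model
9.2.4 Evaluating the Model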

Chapter 9 Natural Language Processing: Sentiment Analysis Example

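In natural language processing, text or words must first be converted into a numeric form that machine learning or deep learning algorithms can consume. Several models perform this conversion, such as the bag-of-words model (BOW) and word2vec.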

9.1 Bag-of-Words (BOW) Model Example

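The BOW model is a document representation commonly used in information retrieval. It ignores word order, grammar, and syntax and treats a document simply as a collection of words, where the occurrence of each word is independent of whether any other word occurs. In other words, a word at any position in the document is assumed to be chosen independently of the document's semantics. For example, take the following three documents:
1. The sun is shining
2. The weather is sweet
3. The sun is shining and the weather is sweet
From these three text documents (for simplicity, one sentence stands for one document) we build a dictionary, or vocabulary. How is the dictionary built? First collect the words that occur, then assign each word an index. The three documents contain 7 distinct words (case-insensitive): the, is, sun, shining, and, weather, sweet.
We then number these 7 words starting from 0, which gives a word-to-index dictionary: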
{'and':0,'is':1,'shining':2,'sun':3,'sweet':4,'the':5,'weather':6}
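Using this dictionary, convert the three documents into feature vectors (each position holds the number of times the word with that index occurs in the document):
The first sentence becomes:
[0 1 1 1 0 1 0]
The second sentence becomes:
[0 1 0 0 1 1 1]
The third sentence becomes:
[1 2 1 1 1 2 1]
0 means the corresponding dictionary word does not occur in the document, 1 means it occurs once, and 2 means it occurs twice. The values in the feature vector are also called raw term frequencies, tf(t,d): the number of times term t occurs in document d.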
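
To make the construction concrete, here is a minimal sketch that builds the same dictionary and count vectors by hand, without any library. It assumes that lowercasing and whitespace splitting are enough to tokenize these toy sentences; the helper name to_vector is only illustrative.

```python
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]

# Lowercase and split on whitespace (sufficient for these toy sentences).
tokenized = [doc.lower().split() for doc in docs]

# Number the distinct words in alphabetical order, starting from 0.
all_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def to_vector(tokens, vocab):
    """Return the raw term frequencies of `tokens`, ordered by vocabulary index."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in sorted(vocab, key=vocab.get)]

vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The printed vectors match the three feature vectors listed above.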
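This is a simple conversion. When there are several documents and some words occur frequently in every one of them, those words usually carry little useful or distinguishing information. How can their weight in the vector be reduced? Inverse document frequency (idf) handles this.
Raw term frequency combined with inverse document frequency is called term frequency-inverse document frequency, or tf-idf for short.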
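How is tf-idf computed? The following formulas make it clear:
$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$
where
$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d,t)}$$
n_d is the total number of documents (3 here), and df(d,t) is the number of documents that contain term t.
Taking the logarithm keeps terms with a low document frequency from receiving excessively large weights, and adding 1 to the denominator prevents division by zero when df(d,t) is 0. Some implementations also add 1 to the numerator, making it 1+n_d, and add 1 to the idf itself: $\text{tf-idf}(t,d) = \text{tf}(t,d) \times \left(\ln\frac{1+n_d}{1+\text{df}(d,t)} + 1\right)$. Scikit-learn uses this variant, with the natural logarithm.
For example, consider the tf-idf value of the word 'the' in the first sentence, that is, the first document (denoted d1), using the first formula with the base-10 logarithm:
$$\text{tf-idf}(\text{'the'},d_1) = \text{tf}(\text{'the'},d_1) \times \text{idf}(\text{'the'},d_1)$$
$$= 1 \times \log_{10}\frac{3}{1+3} = \log_{10} 0.75 \approx -0.125$$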
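
The same arithmetic can be checked in a few lines. This is only a sketch of the first formula above, with the base-10 logarithm assumed; the function names idf and tf_idf are illustrative and are not part of any library.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log10(n_d / (1 + df))."""
    return math.log10(n_docs / (1 + doc_freq))

def tf_idf(tf, n_docs, doc_freq):
    """Raw term frequency multiplied by the inverse document frequency."""
    return tf * idf(n_docs, doc_freq)

# 'the' occurs once in d1 and appears in all 3 of the 3 documents.
value = tf_idf(tf=1, n_docs=3, doc_freq=3)
print(round(value, 3))  # -0.125
```

With this formula a word that occurs in every document gets a non-positive weight, because n_d / (1 + df) is below 1.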

Ready-made implementations exist for all of these calculations. Below we use Scikit-learn to compute them.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)

print(count.vocabulary_)  # vocabulary_ holds the word-to-index dictionary

Output:
{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

print(bag.toarray())
Output:
[[0 1 1 1 0 1 0]
[0 1 0 0 1 1 1]
[1 2 1 1 1 2 1]]

Next, compute the tf-idf of the documents:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf=TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
Output:
[[ 0. 0.43 0.56 0.56 0. 0.43 0. ]
[ 0. 0.43 0. 0. 0.56 0.43 0.56]
[ 0.4 0.48 0.31 0.31 0.31 0.48 0.31]]

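Note: when computing tf-idf, sklearn also normalizes each document vector. TfidfTransformer uses the L2 norm by default.
Following sklearn's formula, $\text{tf-idf}(t,d) = \text{tf}(t,d) \times \left(\ln\frac{1+n_d}{1+\text{df}(d,t)} + 1\right)$, the result above is easy to verify. Take the first sentence as an example.
The unnormalized vector of the first sentence is v = tf-idf(t,d1) = [0, 1, 1.29, 1.29, 0, 1, 0].

$$\text{tf-idf}(t,d_1)_{\text{norm}} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{\sum_i v_i^2}} = \frac{v}{2.31}$$
= [0, 0.43, 0.56, 0.56, 0, 0.43, 0]
This agrees with the computed result above.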
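
The check above can be repeated for all three documents at once. The sketch below reimplements the smoothed idf and the L2 normalization with NumPy; it is a hand-written illustration of the formula, not a copy of the library's internal code, and the count matrix is typed in from the CountVectorizer output.

```python
import numpy as np

# Raw term frequencies, one row per document (same order as the dictionary).
counts = np.array([
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf, natural logarithm
raw = counts * idf                                # unnormalized tf-idf
tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

print(np.round(tfidf, 2))
```

Rounded to two decimals, the rows agree with the TfidfTransformer output shown earlier.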

9.2 Sentiment Analysis Example

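Sentiment analysis, sometimes called opinion mining, is an important branch of natural language processing (NLP). It analyzes the sentiment expressed in reviews, articles, news reports, and similar texts. Understanding these attitudes is valuable because they can guide or inform many later decisions.
Here we use a dataset of movie reviews from the Internet Movie Database (IMDb). It contains 50,000 movie reviews; positive reviews are rated above 6 stars and negative reviews below 5 stars.
Below we use the bag-of-words (BOW) model and the Natural Language Toolkit (NLTK) for Python to process the data. Because the dataset is fairly large, we optimize with stochastic gradient descent and classify with a logistic regression classifier. The steps are as follows: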

9.2.1 Loading the Data

Download the data:
http://ai.stanford.edu/~amaas/data/sentiment

tar -zxf aclImdb_v1.tar.gz

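File structure:
The aclImdb directory contains the directories test and train, among others. Each of train and test has the second-level subdirectories neg and pos. The neg directory holds a large number of txt files with negative reviews, and pos holds a large number of txt files with positive reviews.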

hadoop@master:~/data/nlp_im/aclImdb$ ll
total 1732
-rw-r--r-- 1 hadoop hadoop 903029 Jun 12 2011 imdbEr.txt
-rw-r--r-- 1 hadoop hadoop 845980 Apr 13 2011 imdb.vocab
-rw-r--r-- 1 hadoop hadoop 4037 Jun 26 2011 README
drwxr-xr-x 4 hadoop hadoop 4096 Aug 29 15:16 test/
drwxr-xr-x 5 hadoop hadoop 4096 Aug 29 15:16 train/
hadoop@master:~/data/nlp_im/aclImdb$ cd train/
hadoop@master:~/data/nlp_im/aclImdb/train$ ll
total 66580
-rw-r--r-- 1 hadoop hadoop 21021197 Apr 13 2011 labeledBow.feat
drwxr-xr-x 2 hadoop hadoop 352256 Aug 29 15:18 neg/
drwxr-xr-x 2 hadoop hadoop 352256 Aug 29 15:16 pos/
drwxr-xr-x 2 hadoop hadoop 1409024 Aug 29 15:16 unsup/
-rw-r--r-- 1 hadoop hadoop 41348699 Apr 13 2011 unsupBow.feat
-rw-r--r-- 1 hadoop hadoop 612500 Apr 12 2011 urls_neg.txt
-rw-r--r-- 1 hadoop hadoop 612500 Apr 12 2011 urls_pos.txt
-rw-r--r-- 1 hadoop hadoop 2450000 Apr 12 2011 urls_unsup.txt

Load these files into the DataFrame df and display the loading progress.

import os
import pandas as pd
import pyprind

pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = './aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows)
df.columns = ['review', 'snetiment']  # misspelled in the original; renamed further below

This took a little over 2 minutes to run:
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:42

Shuffle the rows and store the dataset in a CSV file:

import numpy as np
np.random.seed(0)
df=df.reindex(np.random.permutation(df.index))
df.to_csv('./movie_data.csv',index=False)
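
The shuffle matters because the files were loaded one directory at a time, so reviews with the same label sit next to each other, while the later steps read the CSV file sequentially in mini-batches. The sketch below shows on a hypothetical four-row DataFrame (the name toy and its contents are made up for illustration) that reindexing with a permuted index reorders whole rows, so each review keeps its label.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset, sorted by label like the freshly loaded df.
toy = pd.DataFrame({
    "review": ["great", "lovely", "awful", "boring"],
    "sentiment": [1, 1, 0, 0],
})

rng = np.random.RandomState(0)                 # fixed seed for reproducibility
new_order = rng.permutation(toy.index.to_numpy())
shuffled = toy.reindex(new_order)              # rows move as a unit

print(shuffled)
```

pandas also offers `toy.sample(frac=1, random_state=0)`, which shuffles the rows in a single call.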

Inspect the stored data:

df=pd.read_csv('./movie_data.csv')
df.head(4)

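The output lists the first four rows with the columns review and snetiment.

The column name snetiment is a misspelling of sentiment. To correct it, just rename the columns of df: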

df.columns=['review','sentiment']

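9.2.2 Data Preprocessing

1) First use the natural language processing toolkit NLTK to download the stop words that will be filtered out of the reviews.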

import nltk
nltk.download('stopwords')

2) Preprocess the text: filter out stop words, remove HTML markup and superfluous punctuation, and so on.

from nltk.corpus import stopwords
import re

stop = stopwords.words('english')

def tokenizer(text):
    # strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :-) or ;( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters with spaces, then append the emoticons without the nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
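
To see what the tokenizer produces, here is a self-contained sketch that runs without the NLTK download. It assumes a tiny hand-written stop-word set (STOP is a hypothetical stand-in for the NLTK English list) and repeats the same regular expressions.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose.
STOP = {"a", "and", "is", "the", "this"}

def tokenizer(text, stop=STOP):
    text = re.sub(r'<[^>]*>', '', text)                          # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' '.join(emoticons).replace('-', ''))              # ":-)" becomes ":)"
    return [w for w in text.split() if w not in stop]

tokens = tokenizer("<br />This movie is GREAT :-) and fun!")
print(tokens)  # ['movie', 'great', 'fun', ':)']
```

The HTML tag and the punctuation are gone, the text is lowercased, the stop words are dropped, and the emoticon survives as a token of its own.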

3) Define a generator function that reads documents from the CSV file one at a time:

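def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # each line ends with ",<label>\n": drop those three characters for the text
            text, label = line[:-3], int(line[-2])
            yield text, label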
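
The slicing in stream_docs relies on the layout of each CSV line. The sketch below applies it to one made-up line (the review text is invented for illustration) and assumes, as the generator does, that every record occupies exactly one line and ends with a comma, a single-digit label, and a newline.

```python
# A hypothetical line as it might appear in movie_data.csv.
line = '"A wonderful little film.",1\n'

# line[-1] is the newline, line[-2] the label digit, line[-3] the comma.
text, label = line[:-3], int(line[-2])

print(repr(text))  # '"A wonderful little film."'
print(label)       # 1
```

The text keeps its CSV quoting, which is harmless here because the tokenizer later discards non-word characters.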

4) Define a function that fetches one mini-batch of documents at a time:

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
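
One detail of get_minibatch is easy to miss: if the stream runs out before a full batch is collected, the function returns None, None and the partial batch is dropped. A small sketch with a made-up three-item stream shows this.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stream of (text, label) pairs standing in for stream_docs(...).
stream = iter([("good", 1), ("bad", 0), ("fine", 1)])

first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)  # only one document is left

print(first)   # (['good', 'bad'], [1, 0])
print(second)  # (None, None)
```

With 50,000 reviews and batch sizes of 1,000 and 5,000, as used in the following sections, every batch is full, so no data is lost.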

5) Use HashingVectorizer from sklearn to turn the text into feature vectors.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', n_features=2**21,
                         preprocessor=None, tokenizer=tokenizer)
# Recent scikit-learn releases no longer accept loss='log' or n_iter;
# 'log_loss' is the current name of the logistic regression loss.
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path='./movie_data.csv')
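
HashingVectorizer suits streaming data because it keeps no vocabulary: each token is mapped to a column by a hash function, so transform can be called without fitting first. The sketch below uses the default tokenizer rather than the custom one above, purely to stay self-contained.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step is needed: the column of each token comes from a hash function.
vect = HashingVectorizer(n_features=2**21)
X = vect.transform(["The sun is shining", "The weather is sweet"])

print(X.shape)  # (2, 2097152)
print(X.nnz)    # number of stored non-zero entries
```

The result is a sparse matrix, so the 2**21 columns cost little memory. A large n_features keeps the chance low that two different words hash to the same column.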

9.2.3 Training the Model

Train the model on 45 mini-batches of 1,000 documents each:

import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    x_train, y_train = get_minibatch(doc_stream, size=1000)
    if not x_train:
        break
    x_train = vect.transform(x_train)
    clf.partial_fit(x_train, y_train, classes=classes)
    pbar.update()

9.2.4 Evaluating the Model

x_test,y_test=get_minibatch(doc_stream,size=5000)
x_test=vect.transform(x_test)
print('accuracy: %.3f' % clf.score(x_test,y_test))

Test result:
accuracy: 0.879
The result is quite good: the accuracy reaches nearly 88% on the 5,000 held-out reviews.
