This chapter uses the SMS Spam Collection dataset as an example to introduce techniques for detecting spam SMS messages. This section walks through two feature extraction approaches in detail: Word2Vec_1d and Word2Vec_2d.
What distinguishes them from plain word2vec is the normalization step, which proceeds as follows:
1. MinMaxScaler normalization
To map every feature dimension into the range [0, 1], use the MinMaxScaler class. For example, given the following input data:
from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)
The output is:
[[0.5 0. 1. ]
[1. 0.5 0.33333333]
[0. 1. 0. ]]
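Per column, MinMaxScaler computes X_std = (X - X_min) / (X_max - X_min). As a sanity check, the same result can be reproduced by hand with NumPy (a minimal sketch, not part of the book's code):

import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
# Per-column min-max scaling: (x - min) / (max - min)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)  # matches min_max_scaler.fit_transform(X)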
Applied to the word2vec features, the normalization step is shown below. Note that the scaler is fitted on the training set only (fit_transform), and the fitted scaler is then reused on the test set (transform), so no information leaks from the test data:
min_max_scaler = preprocessing.MinMaxScaler()
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = min_max_scaler.fit_transform(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = min_max_scaler.transform(x_test)
2. Complete source code
def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        # Reuse a previously trained Word2Vec model if one is cached on disk.
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    # Average the word vectors of each SMS, then min-max scale;
    # the scaler is fitted on the training set only.
    min_max_scaler = preprocessing.MinMaxScaler()
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)
    return x_train, x_test, y_train, y_test
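A minimal usage sketch (assuming the globals max_features and word2ver_bin and the helpers load_all_files, cleanText, and buildWordVector are defined as in the book's code):

x_train, x_test, y_train, y_test = get_features_by_word2vec_cnn_1d()
# Each SMS is now a single averaged, min-max-scaled vector of length max_features.
print(x_train.shape)  # expected: (num_train_samples, max_features)
print(x_test.shape)   # expected: (num_test_samples, max_features)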
As noted in point 5, Word2Vec has a useful property: the meaning of a sentence or short phrase can be approximated by averaging the Word2Vec vectors of all of its words, for example:

model['good boy'] ≈ (model['good'] + model['boy']) / 2
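The buildWordVector helper used in the 1-D pipeline above implements exactly this averaging. The book ships its own version; the following is a minimal sketch of its plausible shape (the function name and signature come from the calls above; the body is an assumption):

def buildWordVector(model, text, size):
    # Accumulate the Word2Vec vectors of all in-vocabulary words, then average.
    vec = np.zeros((1, size))
    count = 0
    for word in text:
        try:
            vec += model[word].reshape((1, size))
            count += 1
        except KeyError:
            # Skip words that are not in the Word2Vec vocabulary.
            continue
    if count != 0:
        vec /= count
    return vec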
Note in particular that because a CNN with 2-D convolutions can consume the word-vector matrix directly, the 2-D version does not need to average vectors according to this principle. Taking the training set as an example, the code is as follows:
for sms in x_train:
    # Truncate each SMS to at most max_document_length words.
    sms = sms[:max_document_length]
    x_train_vec = np.zeros((max_document_length, max_features))
    for i, w in enumerate(sms):
        try:
            vec = model[w].reshape((1, max_features))
            x_train_vec[i] = vec.copy()
        except KeyError:
            continue
    x_train_vecs.append(x_train_vec)
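Because x_train_vec is pre-allocated with np.zeros, any message shorter than max_document_length is implicitly zero-padded to a fixed-size matrix. A quick standalone check (max_document_length and max_features set to small hypothetical values):

import numpy as np

max_document_length, max_features = 5, 4
m = np.zeros((max_document_length, max_features))
m[0] = np.ones(max_features)  # stand-in for the first word's vector
print(m)  # rows 1..4 remain all-zero: short messages are zero-padded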
Note in particular that the KeyError must be caught: looking up a word that is not in the Word2Vec vocabulary raises a KeyError. Putting it all together, the complete source code is as follows:
def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()
    x_train_vecs = []
    x_test_vecs = []
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)
    # Turn each SMS into a (max_document_length, max_features) matrix of word vectors.
    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            vec = model[w].reshape((1, max_features))
            x_train_vec[i] = vec.copy()
        x_train_vecs.append(x_train_vec)
    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            vec = model[w].reshape((1, max_features))
            x_test_vec[i] = vec.copy()
        x_test_vecs.append(x_test_vec)
    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    # Fit the scaler on the training matrices only, then scale both sets.
    x_train_2d = np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)
    x_train = np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test = np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])
    # Reshape into 4-D tensors (samples, height, width, channels) for the CNN.
    x_train = x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])
    return x_train, x_test, y_train, y_test
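A short usage sketch for the 2-D variant (again assuming the book's globals and helpers are in place). The trailing dimension of 1 is the single input channel expected by 2-D convolution layers:

x_train, x_test, y_train, y_test = get_features_by_word2vec_cnn_2d()
print(x_train.shape)  # expected: (num_train_samples, max_document_length, max_features, 1)
print(x_test.shape)   # expected: (num_test_samples, max_document_length, max_features, 1)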
Building on Word2Vec, Quoc Le and Tomas Mikolov proposed the Doc2Vec training method. Its principle is analogous to Word2Vec's, and it comes in two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW).
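In gensim, the variant is selected with the dm flag of the Doc2Vec constructor. A minimal sketch, using the same old-style gensim parameters (size, iter) and the global max_features as the book's code (which, below, uses dm=0, i.e. DBOW):

from gensim.models.doc2vec import Doc2Vec

# dm=1 trains Distributed Memory (DM); dm=0 trains Distributed Bag of Words (DBOW).
model_dm = Doc2Vec(dm=1, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
model_dbow = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)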
The source code is as follows:
def get_features_by_doc2vec():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    print('y:', len(y_train), len(y_test))
    print('x:', len(x_train), len(x_test))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)
    # Tag every document with a unique identifier before training Doc2Vec.
    x_train = labelizeReviews(x_train, 'TRAIN')
    x_test = labelizeReviews(x_test, 'TEST')
    x = x_train + x_test
    cores = multiprocessing.cpu_count()
    if os.path.exists(doc2ver_bin):
        print("Find cache file %s" % doc2ver_bin)
        model = Doc2Vec.load(doc2ver_bin)
    else:
        model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(doc2ver_bin)
    x_test = getVecs(model, x_test, max_features)
    x_train = getVecs(model, x_train, max_features)
    return x_train, x_test, y_train, y_test
Taking x_train[0] as an example, the initialization code is:

x_train, x_test, y_train, y_test = load_all_files()
At this point, x_train[0] is:
That's my honeymoon outfit. :)
Next comes the cleanText step:

x_train = cleanText(x_train)
x_test = cleanText(x_test)
After processing, the result is:
["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']
Unlike with Word2Vec, every English paragraph processed by Doc2Vec must be marked with a unique identifier, and the paragraphs to be processed are stored in a specially defined data format:
SentimentDocument = namedtuple('SentimentDocument', 'words tags')
Here SentimentDocument can be understood as the name of this format (or of this kind of object). The words field holds the English paragraph as a list of words and punctuation symbols, and the tags field holds the unique identifier just mentioned. The simplest implementation numbers the paragraphs sequentially: training samples are tagged "TRAIN_<n>" and test samples "TEST_<n>":
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(SentimentDocument(v, [label]))
    return labelized
The calling code in this article is:
x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
After this, x_train[0] becomes:
SentimentDocument(words=["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')'], tags=['TRAIN_0'])
The code that trains the model, or loads a cached one, is:
if os.path.exists(doc2ver_bin):
    print("Find cache file %s" % doc2ver_bin)
    model = Doc2Vec.load(doc2ver_bin)
else:
    model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(doc2ver_bin)
Once the model is trained, the documents can be vectorized as follows:
def getVecs(model, corpus, size):
    # Look up each document's trained vector by its unique tag.
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs), dtype='float')
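A brief usage sketch: a single document's trained vector can also be fetched directly by its tag via gensim's docvecs lookup, which is what getVecs does in bulk:

# Fetch the trained vector of the first training document by its tag.
vec = model.docvecs['TRAIN_0']
print(vec.shape)  # expected: (max_features,)

# Vectorize the entire tagged training set at once.
train_vecs = getVecs(model, x_train, max_features)
print(train_vecs.shape)  # expected: (num_train_samples, max_features)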
In this section, x_train[0] after vectorization comes out as:
[ 1.07623920e-01 2.03667670e-01 4.39605879e-04 2.15817243e-02
-1.33999720e-01 -1.04999244e-01 6.59373477e-02 1.42834812e-01
1.93634644e-01 3.39346938e-02 -3.85309849e-03 -1.64910153e-01
-1.58900153e-02 -2.52948906e-02 -3.99647981e-01 -9.12046656e-02
5.02838679e-02 -2.62314398e-02 -2.06790075e-01 1.31346032e-01
-4.74643335e-02 -8.93631503e-02 3.17618906e-01 1.38969775e-02
-2.91506946e-01 -1.80232242e-01 1.07947908e-01 1.12745747e-01
-8.35281014e-02 -1.05828121e-01 -1.48421243e-01 -6.38468117e-02
4.14518379e-02 3.76846604e-02 1.51488677e-01 -8.19508582e-02
-2.27401778e-01 4.36623842e-02 1.62245184e-01 1.52473062e-01
4.37728465e-02 -1.23079844e-01 -3.73415574e-02 -1.03757225e-01
-1.39077678e-01 1.03434525e-01 -1.43260106e-01 -5.22182137e-02
-7.31687844e-02 2.00760439e-01 2.19642222e-02 8.81896541e-02
2.05241382e-01 -1.33090258e-01 1.55719062e-02 2.02201935e-03
1.01647124e-01 -3.05845201e-01 -8.83292854e-02 -1.23624884e-01
7.08331615e-02 4.64668535e-02 -9.01960209e-02 9.96094272e-02
-1.34212002e-01 2.00427771e-01 1.07755877e-01 2.25574061e-01
-1.82992116e-01 1.53573334e-01 -9.32435393e-02 1.56051908e-02
1.82265509e-02 7.87798688e-02 1.61014810e-01 9.62717682e-02
-1.00199223e-01 5.18805161e-02 -1.82170309e-02 -1.01618238e-01
-2.62078028e-02 -8.01115185e-02 7.61420429e-02 1.60436168e-01
2.32044682e-01 1.22496150e-01 3.62544470e-02 -6.78069219e-02
-2.98173614e-02 -8.31498671e-03 -1.16020748e-02 3.31646539e-02
2.66764522e-01 -1.76852301e-01 -2.00362876e-01 7.83127621e-02
4.40042838e-02 1.37215763e-01 -3.20810378e-02 4.33634184e-02
7.93892331e-03 8.94398764e-02 7.15106502e-02 7.87770897e-02
-2.57502228e-01 -1.56798676e-01 1.16182871e-01 4.86833565e-02
4.79952479e-03 9.11491066e-02 -1.40987933e-02 2.22539883e-02
-1.59795076e-01 9.47566405e-02 1.88340500e-01 2.16186821e-01
1.01996697e-01 -5.08516617e-02 -6.35962486e-02 -1.20826885e-01
-2.66686548e-02 1.30622894e-01 -5.55477142e-02 1.86781764e-01
1.94867045e-01 -5.55491969e-02 -1.73223794e-01 -2.23940029e-03
-8.42484012e-02 -9.78929922e-03 1.58954084e-01 2.10609302e-01
-9.30389911e-02 -1.19832724e-01 -2.57326901e-01 -9.47180465e-02
-6.85976595e-02 5.72628602e-02 -7.99224451e-02 1.05779201e-01
3.46482173e-02 -6.65628025e-03 -4.14694101e-03 -1.77285731e-01
1.71367064e-01 -1.04618266e-01 1.80818886e-02 1.00839488e-01
4.69732359e-02 5.89858182e-02 -1.17412247e-01 2.59877682e-01
2.43245989e-01 -1.44933388e-01 2.10463241e-01 2.31560692e-01
-2.60037601e-01 4.47640829e-02 -2.82789797e-01 7.55265653e-02
-2.29167998e-01 2.02734083e-01 -4.64072858e-04 1.37224300e-02
1.67293772e-01 2.41031274e-01 -2.25378182e-02 3.51362005e-02
-1.32802263e-01 1.59649570e-02 4.22540531e-02 1.46264017e-01
7.94619396e-02 -1.92659259e-01 5.14810346e-02 3.52863334e-02
1.29052609e-01 -1.65790990e-01 6.53655455e-02 8.40774626e-02
6.88608587e-02 -1.29266530e-01 -4.67737317e-01 1.19465608e-02
-4.01823148e-02 -6.38013333e-02 2.58218110e-01 -3.32820341e-02
-3.28772753e-01 1.57434553e-01 -3.27924967e-01 -4.27057222e-02
-1.31331399e-01 -1.63260803e-01 5.75413043e-03 -1.01014659e-01
1.76209942e-01 4.11938801e-02 -2.52132833e-01 1.40920551e-02
-1.06663443e-01 -2.49055997e-01 8.67770463e-02 1.78776234e-01
1.24725297e-01 1.30073011e-01 4.45161164e-02 1.60344705e-01
1.23212010e-01 -1.01057090e-01 -9.87102762e-02 -2.37999424e-01
-2.54926801e-01 1.51191801e-01 -1.31282523e-01 -2.35551950e-02
-4.09922339e-02 2.10600153e-01 -5.79826534e-02 3.74017060e-02
-2.94750094e-01 -7.88278356e-02 2.41033360e-01 -1.52092993e-01
2.95348503e-02 3.29553857e-02 -1.94013342e-02 -3.87551337e-01
-2.59788465e-02 -5.78223402e-03 -2.31364612e-02 -3.76425721e-02
-1.05162010e-01 -7.57280588e-02 -2.94224806e-02 9.29094255e-02
5.02580917e-03 1.27323985e-01 1.81948487e-02 2.53445506e-02
5.22816628e-02 1.21363856e-01 6.15054667e-02 1.54508455e-02
-8.20488727e-04 7.56623894e-02 -1.50576439e-02 1.67132884e-01
-2.82725275e-01 1.27364337e-01 1.80122808e-01 -1.31743163e-01
-9.61113051e-02 -6.37162179e-02 7.30723217e-02 1.46646490e-02
-1.33017078e-01 -1.19762242e-01 2.35711131e-02 -2.81341195e-01
-5.68757765e-03 1.85813874e-01 1.10502146e-01 -7.27362931e-02
-1.60796911e-01 5.11522032e-02 -9.55855697e-02 -7.15414509e-02
2.83251163e-02 1.16654523e-01 3.66797764e-03 1.54608116e-01
-2.03318566e-01 6.67307079e-02 8.06226656e-02 -1.19694658e-02
8.96890089e-02 2.01400295e-01 8.02449882e-02 -2.21296884e-02
5.21850772e-02 -1.38125028e-02 8.87114927e-02 1.21549807e-01
-5.28845526e-02 3.75475399e-02 6.20372519e-02 1.29727647e-01
-6.90457448e-02 2.34339647e-02 -4.55942191e-02 4.64116298e-02
-1.33431286e-01 2.55507827e-01 2.16026157e-01 -9.25980732e-02
-4.85622920e-02 -1.45295113e-01 -2.58427318e-02 1.93505753e-02
1.17950164e-01 -1.23775247e-02 1.33392587e-01 -8.28817710e-02
1.36878446e-01 -6.80091828e-02 -1.98093444e-01 5.15850522e-02
1.92994636e-03 2.26874277e-01 1.26609832e-01 -5.96865974e-02
-1.37154952e-01 1.35652214e-01 1.13142699e-01 3.95694794e-03
-2.06833467e-01 -7.06818774e-02 -2.37924159e-02 7.30280858e-03
-2.31933817e-01 -2.13069454e-01 -1.42960489e-01 -7.01301452e-03
8.29812512e-03 6.81945086e-02 1.51407436e-01 7.65770003e-02
1.34944141e-01 1.27678066e-01 -2.07772329e-01 1.72014341e-01
-5.36921099e-02 -7.11955428e-02 4.27525938e-02 -7.01407567e-02
-7.80161470e-03 2.37379566e-01 -9.58834738e-02 -7.06278309e-02
-2.44213790e-02 -6.29713684e-02 4.97949077e-03 -1.93031520e-01
1.07220590e-01 -6.00046758e-03 -1.33072376e-01 -1.13887295e-01
1.19876094e-01 8.42822641e-02 -1.89513087e-01 9.65327695e-02
5.26705459e-02 1.42758876e-01 1.56315282e-01 2.01160442e-02
4.45949882e-02 -2.87032127e-02 8.93688649e-02 2.41289198e-01
1.31013185e-01 4.65613231e-03 -4.40470129e-02 2.39219904e-01
-1.30711459e-02 4.95963730e-02 2.94293970e-01 -5.10329269e-02
-1.80931166e-01 1.07536905e-01 9.01441202e-02 -1.17586985e-01
1.99178141e-02 1.22322226e-02 -1.71743870e-01 -1.57537639e-01
1.04444884e-01 7.31004849e-02 9.70724411e-03 1.06952451e-01
1.65776461e-01 1.47664443e-01 8.90543163e-02 7.31813684e-02
1.05123490e-01 1.22088723e-01 -1.21460930e-02 -1.45071194e-01
-8.42208490e-02 -1.08709313e-01 2.45642308e-02 -6.45151436e-02
4.05842774e-02 -2.05672416e-03 6.51900023e-02 1.91479787e-01
-9.36061218e-02 1.43005680e-02 1.36256188e-01 -5.99846505e-02]
Note: this chapter's notes are not finished here. Because the notes on spam SMS detection in Chapter 8 are extensive, they are split into a series; the next installment is《Web安全之深度学习实战》笔记:第八章 骚扰短信识别(5).