《Web安全之深度学习实战》Notes: Chapter 8, SMS Spam Detection (4)


This chapter uses the SMS Spam Collection dataset as its running example to introduce techniques for detecting spam SMS messages. This section walks through two feature-extraction approaches in detail: Word2Vec_1d and Word2Vec_2d.

Compared with the plain word2vec features covered earlier, the difference lies in the normalization step, which works as follows:

1. MinMaxScaler normalization

To rescale every dimension into the range 0 to 1, use MinMaxScaler. For example, given the following data before scaling:

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X)
print(X_train_minmax)

The output is:

[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
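These numbers follow the column-wise formula X_scaled = (X - X_min) / (X_max - X_min), with the default feature_range of (0, 1). Checking the first column by hand (values 1, 2, 0, so min 0 and max 2) reproduces the output above:

import numpy as np

col = np.array([1., 2., 0.])                         # first column of X
print((col - col.min()) / (col.max() - col.min()))   # [0.5 1.  0. ], matching the first column above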

Compared with the scale() processing used for the plain word2vec features, the normalization flow here is:

min_max_scaler = preprocessing.MinMaxScaler()
x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
x_train = min_max_scaler.fit_transform(x_train)
x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
x_test = min_max_scaler.transform(x_test)
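Note that fit_transform is called only on the training vectors, while the test vectors go through transform: the per-dimension minimum and maximum are learned from the training data and then reused unchanged on the test data. A tiny illustration with made-up numbers (not the book's data):

from sklearn import preprocessing
import numpy as np

scaler = preprocessing.MinMaxScaler()
train = np.array([[0.], [10.]])     # training column: min=0, max=10
test = np.array([[5.], [12.]])      # contains a value above the training max

print(scaler.fit_transform(train))  # [[0.] [1.]]
print(scaler.transform(test))       # [[0.5] [1.2]], scaled with the training statistics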

2. Complete source code

def get_features_by_word2vec_cnn_1d():
    global max_features
    global word2ver_bin
    x_train, x_test, y_train, y_test = load_all_files()

    x_train = cleanText(x_train)
    x_test = cleanText(x_test)

    x = x_train + x_test
    cores = multiprocessing.cpu_count()

    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)

    # Average the word vectors of each SMS into one max_features-dim vector,
    # then rescale every dimension to [0, 1] using statistics from the training set only.
    min_max_scaler = preprocessing.MinMaxScaler()
    x_train = np.concatenate([buildWordVector(model, z, max_features) for z in x_train])
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = np.concatenate([buildWordVector(model, z, max_features) for z in x_test])
    x_test = min_max_scaler.transform(x_test)

    return x_train, x_test, y_train, y_test

As noted in point 5 earlier in this series, Word2Vec has a useful property: the meaning of a sentence or short phrase can be approximated by averaging the Word2Vec vectors of all its words, for example:

model['good boy'] = (model['good'] + model['boy']) / 2
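The buildWordVector helper used in the 1d code above is defined earlier in the book's source and is not repeated here; as a rough sketch of the idea, assuming it simply averages the vectors of the words found in the model's vocabulary (illustrative only, not the book's exact implementation):

import numpy as np

def buildWordVector(model, text, size):
    # Sum the Word2Vec vectors of all in-vocabulary words, then divide by the word count.
    vec = np.zeros((1, size))
    count = 0
    for word in text:
        try:
            vec += model[word].reshape((1, size))
            count += 1
        except KeyError:   # word not in the Word2Vec vocabulary
            continue
    if count != 0:
        vec /= count
    return vec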

Note, however, that because the CNN here uses 2D convolutions and can work on the word vectors directly, this averaging principle is not needed for the 2d features. Taking the training set as an example, the code is as follows:

x_train_vecs = []
for sms in x_train:
    sms = sms[:max_document_length]           # truncate to at most max_document_length words
    x_train_vec = np.zeros((max_document_length, max_features))
    for i, w in enumerate(sms):
        try:
            vec = model[w].reshape((1, max_features))
            x_train_vec[i] = vec.copy()        # row i holds the vector of the i-th word
        except KeyError:                       # word not in the Word2Vec vocabulary
            continue
    x_train_vecs.append(x_train_vec)

Pay special attention to catching the exception here: looking up a word that is not in the Word2Vec vocabulary raises a KeyError, which would otherwise abort the loop. Putting it all together, the complete source code is as follows:

def get_features_by_word2vec_cnn_2d():
    global max_features
    global max_document_length
    global word2ver_bin

    x_train, x_test, y_train, y_test = load_all_files()

    x_train_vecs = []
    x_test_vecs = []

    x_train = cleanText(x_train)
    x_test = cleanText(x_test)

    x = x_train + x_test
    cores = multiprocessing.cpu_count()

    if os.path.exists(word2ver_bin):
        print("Find cache file %s" % word2ver_bin)
        model = gensim.models.Word2Vec.load(word2ver_bin)
    else:
        model = gensim.models.Word2Vec(size=max_features, window=10, min_count=1, iter=60, workers=1)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(word2ver_bin)

    # Turn each SMS into a max_document_length x max_features matrix of word vectors.
    for sms in x_train:
        sms = sms[:max_document_length]
        x_train_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_train_vec[i] = vec.copy()
            except KeyError:
                continue
        x_train_vecs.append(x_train_vec)

    for sms in x_test:
        sms = sms[:max_document_length]
        x_test_vec = np.zeros((max_document_length, max_features))
        for i, w in enumerate(sms):
            try:
                vec = model[w].reshape((1, max_features))
                x_test_vec[i] = vec.copy()
            except KeyError:
                continue
        x_test_vecs.append(x_test_vec)

    # Fit the scaler on the training matrices only, then scale both sets with it.
    min_max_scaler = preprocessing.MinMaxScaler()
    print("fit min_max_scaler")
    x_train_2d = np.concatenate([z for z in x_train_vecs])
    min_max_scaler.fit(x_train_2d)

    x_train = np.concatenate([min_max_scaler.transform(i) for i in x_train_vecs])
    x_test = np.concatenate([min_max_scaler.transform(i) for i in x_test_vecs])

    # Reshape into (samples, height, width, channels) for the 2D CNN.
    x_train = x_train.reshape([-1, max_document_length, max_features, 1])
    x_test = x_test.reshape([-1, max_document_length, max_features, 1])

    return x_train, x_test, y_train, y_test
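As a quick sanity check on the returned features (purely illustrative; max_document_length and max_features are globals configured elsewhere in the book's code):

x_train, x_test, y_train, y_test = get_features_by_word2vec_cnn_2d()

# Each SMS is now a max_document_length x max_features "image" with one channel,
# which is the input layout a 2D convolution layer expects.
print(x_train.shape)   # (n_train_samples, max_document_length, max_features, 1)
print(x_test.shape)    # (n_test_samples,  max_document_length, max_features, 1)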

Building on Word2Vec, Quoc Le and Tomas Mikolov proposed the Doc2Vec training method. As shown in the figure below, its principle is similar to Word2Vec, and it comes in two variants: Distributed Memory (DM) and Distributed Bag of Words (DBOW).

[Figure: Doc2Vec architectures, Distributed Memory (DM) and Distributed Bag of Words (DBOW)]
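In gensim, the dm parameter chooses between the two variants: dm=1 trains the Distributed Memory (DM) model and dm=0 trains the Distributed Bag of Words (DBOW) model. The code below uses DBOW; switching variants is a one-flag change (the call mirrors the book's settings on the older gensim API used throughout this series):

from gensim.models import Doc2Vec

# DBOW, as used in the book's code below
model_dbow = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)

# DM variant: identical call except dm=1
model_dm = Doc2Vec(dm=1, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)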

The complete source code is shown below:

def get_features_by_doc2vec():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    print('y:', len(y_train), len(y_test))
    print('x:', len(x_train), len(x_test))
    x_train = cleanText(x_train)
    x_test = cleanText(x_test)

    # Wrap each SMS in a SentimentDocument with a unique tag ("TRAIN_i" / "TEST_i").
    x_train = labelizeReviews(x_train, 'TRAIN')
    x_test = labelizeReviews(x_test, 'TEST')

    x = x_train + x_test
    cores = multiprocessing.cpu_count()

    if os.path.exists(doc2ver_bin):
        print("Find cache file %s" % doc2ver_bin)
        model = Doc2Vec.load(doc2ver_bin)
    else:
        # dm=0 selects the DBOW variant
        model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
        model.build_vocab(x)
        model.train(x, total_examples=model.corpus_count, epochs=model.iter)
        model.save(doc2ver_bin)

    # Look up the learned document vector for every tagged SMS.
    x_test = getVecs(model, x_test, max_features)
    x_train = getVecs(model, x_train, max_features)

    return x_train, x_test, y_train, y_test

Taking x_train[0] as an example, the initialization code is:

x_train, x_test, y_train, y_test=load_all_files()

At this point, x_train[0] is:

That's my honeymoon outfit. :)

Next comes the cleanText processing:

x_train = cleanText(x_train)
x_test = cleanText(x_test)

The processed result is:

["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')']

Unlike Word2Vec, Doc2Vec requires each text passage it processes to be marked with a unique identifier, and the passages must be stored in a specially defined data format, declared as follows:

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

Here SentimentDocument can be understood as the name of this format (or of this kind of object). The words field holds the text passage as a list of words and punctuation symbols, and tags holds the unique identifier mentioned above. The simplest implementation is to number the passages sequentially: training samples are tagged "TRAIN_<number>" and test samples are tagged "TEST_<number>":

def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(SentimentDocument(v, [label]))
    return labelized

The calling logic in this note is:

x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')

At this point, x_train[0] becomes:

SentimentDocument(words=["that's", 'my', 'honeymoon', 'outfit', '.', ':', ')'], tags=['TRAIN_0'])

The code to train the model (or load it from cache) is:

if os.path.exists(doc2ver_bin):
    print("Find cache file %s" % doc2ver_bin)
    model = Doc2Vec.load(doc2ver_bin)
else:
    model = Doc2Vec(dm=0, size=max_features, negative=5, hs=0, min_count=2, workers=1, iter=60)
    model.build_vocab(x)
    model.train(x, total_examples=model.corpus_count, epochs=model.iter)
    model.save(doc2ver_bin)

Once the model is trained, the documents can be vectorized with the following helper:

def getVecs(model, corpus, size):
    # Look up each document's learned vector by its tag and stack them into one array.
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.array(np.concatenate(vecs), dtype='float')

In this example, the processed x_train[0] is:

[ 1.07623920e-01  2.03667670e-01  4.39605879e-04  2.15817243e-02
 -1.33999720e-01 -1.04999244e-01  6.59373477e-02  1.42834812e-01
  1.93634644e-01  3.39346938e-02 -3.85309849e-03 -1.64910153e-01
 -1.58900153e-02 -2.52948906e-02 -3.99647981e-01 -9.12046656e-02
  5.02838679e-02 -2.62314398e-02 -2.06790075e-01  1.31346032e-01
 -4.74643335e-02 -8.93631503e-02  3.17618906e-01  1.38969775e-02
 -2.91506946e-01 -1.80232242e-01  1.07947908e-01  1.12745747e-01
 -8.35281014e-02 -1.05828121e-01 -1.48421243e-01 -6.38468117e-02
  4.14518379e-02  3.76846604e-02  1.51488677e-01 -8.19508582e-02
 -2.27401778e-01  4.36623842e-02  1.62245184e-01  1.52473062e-01
  4.37728465e-02 -1.23079844e-01 -3.73415574e-02 -1.03757225e-01
 -1.39077678e-01  1.03434525e-01 -1.43260106e-01 -5.22182137e-02
 -7.31687844e-02  2.00760439e-01  2.19642222e-02  8.81896541e-02
  2.05241382e-01 -1.33090258e-01  1.55719062e-02  2.02201935e-03
  1.01647124e-01 -3.05845201e-01 -8.83292854e-02 -1.23624884e-01
  7.08331615e-02  4.64668535e-02 -9.01960209e-02  9.96094272e-02
 -1.34212002e-01  2.00427771e-01  1.07755877e-01  2.25574061e-01
 -1.82992116e-01  1.53573334e-01 -9.32435393e-02  1.56051908e-02
  1.82265509e-02  7.87798688e-02  1.61014810e-01  9.62717682e-02
 -1.00199223e-01  5.18805161e-02 -1.82170309e-02 -1.01618238e-01
 -2.62078028e-02 -8.01115185e-02  7.61420429e-02  1.60436168e-01
  2.32044682e-01  1.22496150e-01  3.62544470e-02 -6.78069219e-02
 -2.98173614e-02 -8.31498671e-03 -1.16020748e-02  3.31646539e-02
  2.66764522e-01 -1.76852301e-01 -2.00362876e-01  7.83127621e-02
  4.40042838e-02  1.37215763e-01 -3.20810378e-02  4.33634184e-02
  7.93892331e-03  8.94398764e-02  7.15106502e-02  7.87770897e-02
 -2.57502228e-01 -1.56798676e-01  1.16182871e-01  4.86833565e-02
  4.79952479e-03  9.11491066e-02 -1.40987933e-02  2.22539883e-02
 -1.59795076e-01  9.47566405e-02  1.88340500e-01  2.16186821e-01
  1.01996697e-01 -5.08516617e-02 -6.35962486e-02 -1.20826885e-01
 -2.66686548e-02  1.30622894e-01 -5.55477142e-02  1.86781764e-01
  1.94867045e-01 -5.55491969e-02 -1.73223794e-01 -2.23940029e-03
 -8.42484012e-02 -9.78929922e-03  1.58954084e-01  2.10609302e-01
 -9.30389911e-02 -1.19832724e-01 -2.57326901e-01 -9.47180465e-02
 -6.85976595e-02  5.72628602e-02 -7.99224451e-02  1.05779201e-01
  3.46482173e-02 -6.65628025e-03 -4.14694101e-03 -1.77285731e-01
  1.71367064e-01 -1.04618266e-01  1.80818886e-02  1.00839488e-01
  4.69732359e-02  5.89858182e-02 -1.17412247e-01  2.59877682e-01
  2.43245989e-01 -1.44933388e-01  2.10463241e-01  2.31560692e-01
 -2.60037601e-01  4.47640829e-02 -2.82789797e-01  7.55265653e-02
 -2.29167998e-01  2.02734083e-01 -4.64072858e-04  1.37224300e-02
  1.67293772e-01  2.41031274e-01 -2.25378182e-02  3.51362005e-02
 -1.32802263e-01  1.59649570e-02  4.22540531e-02  1.46264017e-01
  7.94619396e-02 -1.92659259e-01  5.14810346e-02  3.52863334e-02
  1.29052609e-01 -1.65790990e-01  6.53655455e-02  8.40774626e-02
  6.88608587e-02 -1.29266530e-01 -4.67737317e-01  1.19465608e-02
 -4.01823148e-02 -6.38013333e-02  2.58218110e-01 -3.32820341e-02
 -3.28772753e-01  1.57434553e-01 -3.27924967e-01 -4.27057222e-02
 -1.31331399e-01 -1.63260803e-01  5.75413043e-03 -1.01014659e-01
  1.76209942e-01  4.11938801e-02 -2.52132833e-01  1.40920551e-02
 -1.06663443e-01 -2.49055997e-01  8.67770463e-02  1.78776234e-01
  1.24725297e-01  1.30073011e-01  4.45161164e-02  1.60344705e-01
  1.23212010e-01 -1.01057090e-01 -9.87102762e-02 -2.37999424e-01
 -2.54926801e-01  1.51191801e-01 -1.31282523e-01 -2.35551950e-02
 -4.09922339e-02  2.10600153e-01 -5.79826534e-02  3.74017060e-02
 -2.94750094e-01 -7.88278356e-02  2.41033360e-01 -1.52092993e-01
  2.95348503e-02  3.29553857e-02 -1.94013342e-02 -3.87551337e-01
 -2.59788465e-02 -5.78223402e-03 -2.31364612e-02 -3.76425721e-02
 -1.05162010e-01 -7.57280588e-02 -2.94224806e-02  9.29094255e-02
  5.02580917e-03  1.27323985e-01  1.81948487e-02  2.53445506e-02
  5.22816628e-02  1.21363856e-01  6.15054667e-02  1.54508455e-02
 -8.20488727e-04  7.56623894e-02 -1.50576439e-02  1.67132884e-01
 -2.82725275e-01  1.27364337e-01  1.80122808e-01 -1.31743163e-01
 -9.61113051e-02 -6.37162179e-02  7.30723217e-02  1.46646490e-02
 -1.33017078e-01 -1.19762242e-01  2.35711131e-02 -2.81341195e-01
 -5.68757765e-03  1.85813874e-01  1.10502146e-01 -7.27362931e-02
 -1.60796911e-01  5.11522032e-02 -9.55855697e-02 -7.15414509e-02
  2.83251163e-02  1.16654523e-01  3.66797764e-03  1.54608116e-01
 -2.03318566e-01  6.67307079e-02  8.06226656e-02 -1.19694658e-02
  8.96890089e-02  2.01400295e-01  8.02449882e-02 -2.21296884e-02
  5.21850772e-02 -1.38125028e-02  8.87114927e-02  1.21549807e-01
 -5.28845526e-02  3.75475399e-02  6.20372519e-02  1.29727647e-01
 -6.90457448e-02  2.34339647e-02 -4.55942191e-02  4.64116298e-02
 -1.33431286e-01  2.55507827e-01  2.16026157e-01 -9.25980732e-02
 -4.85622920e-02 -1.45295113e-01 -2.58427318e-02  1.93505753e-02
  1.17950164e-01 -1.23775247e-02  1.33392587e-01 -8.28817710e-02
  1.36878446e-01 -6.80091828e-02 -1.98093444e-01  5.15850522e-02
  1.92994636e-03  2.26874277e-01  1.26609832e-01 -5.96865974e-02
 -1.37154952e-01  1.35652214e-01  1.13142699e-01  3.95694794e-03
 -2.06833467e-01 -7.06818774e-02 -2.37924159e-02  7.30280858e-03
 -2.31933817e-01 -2.13069454e-01 -1.42960489e-01 -7.01301452e-03
  8.29812512e-03  6.81945086e-02  1.51407436e-01  7.65770003e-02
  1.34944141e-01  1.27678066e-01 -2.07772329e-01  1.72014341e-01
 -5.36921099e-02 -7.11955428e-02  4.27525938e-02 -7.01407567e-02
 -7.80161470e-03  2.37379566e-01 -9.58834738e-02 -7.06278309e-02
 -2.44213790e-02 -6.29713684e-02  4.97949077e-03 -1.93031520e-01
  1.07220590e-01 -6.00046758e-03 -1.33072376e-01 -1.13887295e-01
  1.19876094e-01  8.42822641e-02 -1.89513087e-01  9.65327695e-02
  5.26705459e-02  1.42758876e-01  1.56315282e-01  2.01160442e-02
  4.45949882e-02 -2.87032127e-02  8.93688649e-02  2.41289198e-01
  1.31013185e-01  4.65613231e-03 -4.40470129e-02  2.39219904e-01
 -1.30711459e-02  4.95963730e-02  2.94293970e-01 -5.10329269e-02
 -1.80931166e-01  1.07536905e-01  9.01441202e-02 -1.17586985e-01
  1.99178141e-02  1.22322226e-02 -1.71743870e-01 -1.57537639e-01
  1.04444884e-01  7.31004849e-02  9.70724411e-03  1.06952451e-01
  1.65776461e-01  1.47664443e-01  8.90543163e-02  7.31813684e-02
  1.05123490e-01  1.22088723e-01 -1.21460930e-02 -1.45071194e-01
 -8.42208490e-02 -1.08709313e-01  2.45642308e-02 -6.45151436e-02
  4.05842774e-02 -2.05672416e-03  6.51900023e-02  1.91479787e-01
 -9.36061218e-02  1.43005680e-02  1.36256188e-01 -5.99846505e-02]

Note: this chapter's notes are not finished here. The notes for Section 8 on spam SMS detection are long and are split into a series; the next note is titled 《Web安全之深度学习实战》Notes: Chapter 8, SMS Spam Detection (5).

《Web安全之深度学习实战》Notes: Chapter 8, SMS Spam Detection (5), mooyuan's blog (CSDN)

Source: https://blog.csdn.net/mooyuan/article/details/123409746