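This chapter uses the SMS Spam Collection dataset as an example to introduce techniques for detecting spam SMS messages. This section explains in detail how to extract features from spam SMS messages with the bag-of-words TF-IDF model and the N-gram model.
This note continues from the previous post, 《Web安全之深度学习实战》 notes: Chapter 8, Spam SMS Detection (1).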
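1. Principle
Term frequency-inverse document frequency (TF-IDF) is a feature extraction method used in text processing. It is a statistical measure of how important a word is to one document within a document collection or corpus.
A word's importance grows with the number of times it appears in the document, and shrinks as the word becomes more common across the corpus.
Combining TF-IDF with the bag-of-words model can improve detection, so we use TF-IDF to further process the data produced by the bag-of-words model: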
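from sklearn.feature_extraction.text import TfidfTransformer

# Fit the IDF weights on the training data only
transformer = TfidfTransformer(smooth_idf=False)
x_train=transformer.fit_transform(x_train)
x_train=x_train.toarray()
# Apply the same fitted weights to the test data
x_test=transformer.transform(x_test)
x_test=x_test.toarray()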
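To make the weighting concrete, here is a minimal sketch that recomputes by hand what TfidfTransformer(smooth_idf=False) does with its default settings: idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row. The small count matrix is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents x 3 terms (illustration only)
counts = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)          # number of documents containing each term
idf = np.log(n_docs / df) + 1.0        # unsmoothed IDF, as with smooth_idf=False
weighted = counts * idf                # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

# Same computation through scikit-learn
sk = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()

print(np.allclose(manual, sk))
```

The first term appears in every document, so its IDF is ln(1) + 1 = 1, the lowest possible weight, while the two rarer terms get ln(3) + 1.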
2. Source code
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def get_features_by_wordbag_tfidf():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    # Bag-of-words on the training set; binary=True records presence, not counts
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # Reuse the training vocabulary so the test columns line up with the training columns
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        vocabulary=vocabulary,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        binary=True)
    print(vectorizer)
    # The vocabulary is fixed, so transform() is enough; nothing needs to be fitted
    x_test = vectorizer.transform(x_test)
    x_test = x_test.toarray()
    # TF-IDF: fit on the training data only, then apply to the test data
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
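Since load_all_files() is defined in the previous post, the function above cannot be run on its own. The following is a minimal sketch of the same two steps on a toy corpus. The sample messages are invented, and max_features = 400 is an assumption chosen to match the length of the vectors printed in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for load_all_files()
train_texts = [
    "Sorry,in meeting I'll call later",
    "WINNER! Claim your free prize now",
    "Are you coming to the meeting later",
]
test_texts = ["Free prize waiting, call now"]

max_features = 400  # assumed value, not confirmed by the source

# Step 1: binary bag-of-words, fitted on the training texts only
count_vec = CountVectorizer(
    decode_error='ignore',
    strip_accents='ascii',
    max_features=max_features,
    stop_words='english',
    binary=True)
x_train_counts = count_vec.fit_transform(train_texts)
x_test_counts = count_vec.transform(test_texts)  # same columns as training

# Step 2: TF-IDF reweighting, again fitted on the training data only
tfidf = TfidfTransformer(smooth_idf=False)
x_train = tfidf.fit_transform(x_train_counts).toarray()
x_test = tfidf.transform(x_test_counts).toarray()

print(x_train.shape, x_test.shape)
```

This sketch reuses the fitted vectorizer for the test set instead of building a second CountVectorizer from vocabulary_. Both approaches map the test texts onto the same columns.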
Taking train[0] as an example, the following shows how its vector changes at each step.
The raw text of train[0] is:
Sorry,in meeting I'll call later
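Next, scikit-learn's CountVectorizer class turns the text into a bag-of-words vector. Because binary=True is set, each entry records whether a word is present rather than how many times it occurs: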
vectorizer = CountVectorizer(
    decode_error='ignore',
    strip_accents='ascii',
    max_features=max_features,
    stop_words='english',
    max_df=1.0,
    min_df=1,
    binary=True)
x_train = vectorizer.fit_transform(x_train)
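Printing x_train[0] at this point gives the result below, in the form (row, column) value. Note that max_features caps the vocabulary size, so only words that made it into the vocabulary show up as nonzero entries of x_train[0]: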
(0, 308) 1
(0, 211) 1
(0, 194) 1
(0, 180) 1
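The column indices in this output are positions in the vectorizer's vocabulary. As a rough sketch, they can be mapped back to words by inverting vocabulary_. The second message below is invented, and stop_words and max_features are left out so that every token stays visible, which means the indices will not match the ones printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The first message is the sample from this post; the second is made up
docs = ["Sorry,in meeting I'll call later", "Please call me later"]

vectorizer = CountVectorizer(binary=True)
x = vectorizer.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to go the other way
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Column indices of the nonzero entries in row 0
cols = x.toarray()[0].nonzero()[0]
words = sorted(index_to_word[c] for c in cols)
print(words)
```

With the default tokenizer, single-character tokens such as the "I" in "I'll" are dropped, which is why "ll" appears on its own.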
Convert it to a dense array:
x_train=x_train.toarray()
The result is:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
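Picking out the ones in a 400-element vector by eye is error-prone. A small sketch with np.flatnonzero lists the nonzero positions directly. The vector here is rebuilt by hand from the indices printed earlier (180, 194, 211, 308), not produced by the real pipeline.

```python
import numpy as np

# Rebuild a vector like the one above: 400 entries, ones at the printed indices
dense = np.zeros(400, dtype=int)
dense[[180, 194, 211, 308]] = 1

# Positions of all nonzero entries, in ascending order
nonzero_positions = np.flatnonzero(dense)
print(nonzero_positions)
```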
Apply the TF-IDF transform:
transformer = TfidfTransformer(smooth_idf=False)
x_train=transformer.fit_transform(x_train)
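After the transform, x_train[0] is shown below. The column indices and the number of entries differ from the count output above, so the two printouts appear to come from different runs; TfidfTransformer itself only reweights values and leaves column positions unchanged.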
(0, 366) 0.6954286575055049
(0, 238) 0.7185951449321735
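A quick sanity check on these two numbers: TfidfTransformer normalizes each row to unit L2 length by default, so the squares of the nonzero weights should sum to 1. The snippet below only redoes that arithmetic on the values printed above.

```python
import math

# The two nonzero TF-IDF weights printed for x_train[0]
values = [0.6954286575055049, 0.7185951449321735]

# L2 norm of the row; expected to be very close to 1.0
norm = math.sqrt(sum(v * v for v in values))
print(norm)
```

Because binary=True makes every term frequency equal to 1, the ratio between the two weights reflects only their IDF values: the larger weight belongs to the word that occurs in fewer training messages.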
Convert it to a dense array:
x_train=x_train.toarray()
The dense result is:
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.71859514 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.69542866 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
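Compared with the TF-IDF version above, the N-gram feature extraction differs mainly in the CountVectorizer arguments: it adds ngram_range=(3, 3), so features are three-word sequences, and token_pattern=r'\b\w+\b', so single-character tokens are kept.
The code is as follows: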
def get_features_by_ngram():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()
    # ngram_range=(3, 3): each feature is a sequence of three consecutive tokens
    vectorizer = CountVectorizer(
        decode_error='ignore',
        ngram_range=(3, 3),
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        token_pattern=r'\b\w+\b',
        binary=True)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_
    # Reuse the training vocabulary so the test columns line up with the training columns
    vectorizer = CountVectorizer(
        decode_error='ignore',
        ngram_range=(3, 3),
        strip_accents='ascii',
        vocabulary=vocabulary,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        token_pattern=r'\b\w+\b',
        binary=True)
    print(vectorizer)
    # The vocabulary is fixed, so transform() is enough; nothing needs to be fitted
    x_test = vectorizer.transform(x_test)
    x_test = x_test.toarray()
    # TF-IDF: fit on the training data only, then apply to the test data
    transformer = TfidfTransformer(smooth_idf=False)
    x_train = transformer.fit_transform(x_train)
    x_train = x_train.toarray()
    x_test = transformer.transform(x_test)
    x_test = x_test.toarray()
    return x_train, x_test, y_train, y_test
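To see what ngram_range=(3, 3) and token_pattern=r'\b\w+\b' actually produce, the sketch below runs only the analyzer on the sample message from this post. stop_words is deliberately left out here, so the trigrams include common words that the function above would filter out first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same n-gram settings as above, but without stop-word removal
vectorizer = CountVectorizer(ngram_range=(3, 3), token_pattern=r'\b\w+\b')

# build_analyzer() returns the preprocessing + tokenizing + n-gram function
analyze = vectorizer.build_analyzer()
trigrams = analyze("Sorry,in meeting I'll call later")

for gram in trigrams:
    print(gram)
```

The text is lowercased and split into seven tokens (sorry, in, meeting, i, ll, call, later), which yields five overlapping trigrams. The single-character token "i" survives only because of the custom token_pattern; the default pattern requires at least two characters.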
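Note: this chapter is not finished. The notes on Chapter 8, spam SMS detection, are long and have been split into a series. The next post is 《Web安全之深度学习实战》 notes: Chapter 8, Spam SMS Detection (3).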
For the remaining content, see the rest of the 《Web安全之深度学习实战》 note series.