This section uses the SVM algorithm to detect XSS attacks, trained on data collected from the web.
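1. Support Vector Machine
A support vector machine (SVM) is a binary classification model. It looks for a hyperplane that separates the samples, choosing the one that maximizes the margin. Finding that hyperplane reduces to solving a convex quadratic programming problem.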
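For reference, the convex quadratic program mentioned above is usually written in the following soft-margin form. This is the standard textbook formulation, not something taken from the original post. It assumes labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, and slack variables $\xi_i$; the constant $C$ plays the same role as the `C=1` argument passed to `svm.SVC` later on.

```latex
\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

The margin between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert^{2}$ maximizes the margin, while the $C \sum \xi_i$ term penalizes samples that fall inside the margin or on the wrong side of the hyperplane.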
2. Dataset
Build the malicious (black) and benign (white) sample sets: 200,000 malicious samples and 200,000 benign samples.
etl('../data/xss-200000.txt',x,1)
etl('../data/good-xss-200000.txt',x,0)
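3. Feature Vectorization
Four features are extracted from each entry in the dataset files: the URL length, the number of third-party domains in the URL (implemented below as a 0/1 flag for whether the entry contains http:// or https://), the number of sensitive characters, and the number of sensitive keywords.
The source code follows. Compared with the code that accompanies the book, encoding='utf-8' has been added to the open() call; without it, reading the files raises an error.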
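def get_len(url):
    # Feature 1: length of the entry
    return len(url)

def get_url_count(url):
    # Feature 2: 1 if the entry contains http:// or https://, otherwise 0
    if re.search('(http://)|(https://)', url, re.IGNORECASE):
        return 1
    else:
        return 0

def get_evil_char(url):
    # Feature 3: number of sensitive characters (< > , ' " /)
    return len(re.findall("[<>,\'\"/]", url, re.IGNORECASE))

def get_evil_word(url):
    # Feature 4: number of sensitive keyword matches.
    # Note: there is no '|' between (script=) and (%3c), so that
    # alternative only matches the literal text "script=%3c".
    return len(re.findall("(alert)|(script=)(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)",url,re.IGNORECASE))

def etl(filename,data,isxss):
    # Append one feature row per line to data and one label per line
    # to the global list y (1 = XSS, 0 = benign)
    with open(filename, encoding='utf-8') as f:
        for line in f:
            f1=get_len(line)
            f2=get_url_count(line)
            f3=get_evil_char(line)
            f4=get_evil_word(line)
            data.append([f1,f2,f3,f4])
            if isxss:
                y.append(1)
            else:
                y.append(0)
    return data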
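One detail in get_evil_word is worth checking: the pattern has no `|` between `(script=)` and `(%3c)`, so the two groups form a single alternative. The sketch below compares the pattern as written with a hypothetical variant that separates them. The variant and the helper name `count_matches` are assumptions for illustration, not the book's code, and switching to the variant would change the feature values and possibly the metrics reported later.

```python
import re

# Pattern exactly as it appears in get_evil_word.
ORIGINAL = "(alert)|(script=)(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)"
# Hypothetical variant (an assumption): '|' added between (script=) and (%3c).
SEPARATED = "(alert)|(script=)|(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)"

def count_matches(pattern, text):
    """Count non-overlapping, case-insensitive matches of pattern in text."""
    return len(re.findall(pattern, text, re.IGNORECASE))

# Print, for each sample, the count under the original and the separated pattern.
for sample in ["script=", "%3C", "script=%3c"]:
    print(sample, count_matches(ORIGINAL, sample), count_matches(SEPARATED, sample))
```

With the pattern as written, `script=` and `%3C` on their own are not counted as keywords; only the exact sequence `script=%3c` is.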
Printing the first four rows of the dataset x gives:
[23, 0, 2, 0] [106, 0, 7, 5] [16, 0, 2, 0] [24, 0, 2, 0]
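To make the meaning of one such row concrete, here is a minimal sketch that applies the same four rules to a single string. The function name `extract_features` and both sample strings are made up for illustration and do not come from the dataset.

```python
import re

def extract_features(entry):
    """Return [length, URL flag, sensitive chars, sensitive keywords] for one entry."""
    length = len(entry)
    has_url = 1 if re.search('(http://)|(https://)', entry, re.IGNORECASE) else 0
    evil_chars = len(re.findall("[<>,\'\"/]", entry))
    evil_words = len(re.findall(
        "(alert)|(script=)(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)",
        entry, re.IGNORECASE))
    return [length, has_url, evil_chars, evil_words]

# Hypothetical entries, not taken from the data files.
payload = "/search?q=<script>alert(1)</script>"
benign = "/index.php?id=42"

print(extract_features(payload))  # [35, 0, 6, 1]
print(extract_features(benign))   # [16, 0, 1, 0]
```

Note that etl passes each line as read from the file, including its trailing newline, so lengths computed from the data files count that newline as well.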
4. Training
As shown below, the data is split into training and test sets at a ratio of 6:4.
x_train, x_test, y_train, y_test = model_selection.train_test_split(x,y, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
y_pred = clf.predict(x_test)
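The same three calls can be tried on a tiny in-memory dataset before running them on the full files. The sketch below assumes scikit-learn is installed; the ten feature rows and their labels are invented purely for illustration and are not samples from the dataset.

```python
from sklearn import model_selection, svm

# Hypothetical rows in the same layout: [length, URL flag, sensitive chars, sensitive keywords]
x_demo = [[20, 0, 1, 0], [15, 0, 1, 0], [28, 0, 2, 0], [33, 0, 2, 0], [18, 0, 1, 0],
          [106, 0, 7, 5], [88, 0, 9, 3], [120, 1, 12, 4], [95, 0, 8, 2], [70, 0, 6, 3]]
y_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # invented labels: 0 = benign, 1 = XSS

# test_size=0.4 keeps 60% of the rows for training and 40% for testing.
x_tr, x_te, y_tr, y_te = model_selection.train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=0)

# Linear-kernel SVM with the same settings as in the post.
clf_demo = svm.SVC(kernel='linear', C=1).fit(x_tr, y_tr)
pred_demo = clf_demo.predict(x_te)

print(len(x_tr), len(x_te))
print(list(pred_demo), y_te)
```

With ten rows, the split leaves six for training and four for testing, which mirrors the 6:4 ratio used on the real data.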
5. Complete Source Code
import re
import joblib
from sklearn import model_selection
from sklearn import svm
from sklearn import metrics

x = []  # feature rows
y = []  # labels: 1 = XSS, 0 = benign

def get_len(url):
    # Feature 1: length of the entry
    return len(url)

def get_url_count(url):
    # Feature 2: 1 if the entry contains http:// or https://, otherwise 0
    if re.search('(http://)|(https://)', url, re.IGNORECASE):
        return 1
    else:
        return 0

def get_evil_char(url):
    # Feature 3: number of sensitive characters (< > , ' " /)
    return len(re.findall("[<>,\'\"/]", url, re.IGNORECASE))

def get_evil_word(url):
    # Feature 4: number of sensitive keyword matches.
    # Note: there is no '|' between (script=) and (%3c), so that
    # alternative only matches the literal text "script=%3c".
    return len(re.findall("(alert)|(script=)(%3c)|(%3e)|(%20)|(onerror)|(onload)|(eval)|(src=)|(prompt)",url,re.IGNORECASE))

def do_metrics(y_test,y_pred):
    # Print accuracy, confusion matrix, precision, recall and F1
    print("metrics.accuracy_score:")
    print(metrics.accuracy_score(y_test, y_pred))
    print("metrics.confusion_matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
    print("metrics.precision_score:")
    print(metrics.precision_score(y_test, y_pred))
    print("metrics.recall_score:")
    print(metrics.recall_score(y_test, y_pred))
    print("metrics.f1_score:")
    print(metrics.f1_score(y_test, y_pred))

def etl(filename,data,isxss):
    # Append one feature row per line to data and one label per line to y
    with open(filename, encoding='utf-8') as f:
        for line in f:
            f1=get_len(line)
            f2=get_url_count(line)
            f3=get_evil_char(line)
            f4=get_evil_word(line)
            data.append([f1,f2,f3,f4])
            if isxss:
                y.append(1)
            else:
                y.append(0)
    return data

# Load malicious samples (label 1), then benign samples (label 0)
etl('../data/xss-200000.txt',x,1)
etl('../data/good-xss-200000.txt',x,0)

# Split 6:4, train a linear-kernel SVM, evaluate, and save the model
x_train, x_test, y_train, y_test = model_selection.train_test_split(x,y, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
y_pred = clf.predict(x_test)
do_metrics(y_test, y_pred)
joblib.dump(clf,"xss-svm-200000-module.m")
6. Results
metrics.accuracy_score:
0.9979229856257418
metrics.confusion_matrix:
[[54092 73]
[ 53 6446]]
metrics.precision_score:
0.988801963491333
metrics.recall_score:
0.9918448992152639
metrics.f1_score:
0.9903210938700261
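The scores above can be reproduced by hand from the confusion matrix. scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so with labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The short sketch below uses only the four printed counts; the variable names are chosen here for illustration.

```python
# Counts copied from the printed confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 54092, 73
fn, tp = 53, 6446

total = tn + fp + fn + tp                 # number of test samples: 60664
accuracy = (tp + tn) / total              # share of all predictions that are correct
precision = tp / (tp + fp)                # share of predicted XSS that really is XSS
recall = tp / (tp + fn)                   # share of real XSS that was detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(total)
print(accuracy, precision, recall, f1)
```

The four values agree with the accuracy, precision, recall and F1 figures printed by do_metrics. The four cells sum to 60,664 test samples.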
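7. Model Validation
Reload the saved model and predict on the full dataset x. Because x also contains the training samples, this confirms that the persisted model loads and predicts correctly; it is not an independent measure of generalization.
clf1 = joblib.load("xss-svm-200000-module.m")
y_pred1 = clf1.predict(x)
print(metrics.accuracy_score(y, y_pred1))
Output: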
0.9979493333685002
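The save and reload step can also be checked in isolation. The following sketch assumes scikit-learn and joblib are installed, trains a throwaway model on four invented rows, writes it to a temporary directory, and confirms that the reloaded model predicts the same labels. The file name demo-svm.m and the data are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn import svm

# Invented rows in the layout [length, URL flag, sensitive chars, sensitive keywords]
x_demo = [[10, 0, 1, 0], [12, 0, 2, 0], [90, 0, 8, 3], [110, 1, 10, 4]]
y_demo = [0, 0, 1, 1]

clf_demo = svm.SVC(kernel='linear', C=1).fit(x_demo, y_demo)

# Dump to a temporary file and load it back, as the post does with its model file.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo-svm.m")  # hypothetical file name
    joblib.dump(clf_demo, model_path)
    restored = joblib.load(model_path)

same = list(restored.predict(x_demo)) == list(clf_demo.predict(x_demo))
print(same)
```

Files written by joblib.dump are pickles, so a saved model should only be loaded from a trusted source and, ideally, with the same scikit-learn version that produced it.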
Source: https://blog.csdn.net/mooyuan/article/details/122759618