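This section uses the MIST dataset as an example to introduce malware classification techniques. Features are extracted with 2-gram and TF-IDF models, and the classifiers covered are support vector machines (SVM), XGBoost, and the multilayer perceptron (MLP).
Common malware detection methods rely mainly on static file signatures and high-risk dynamic behavior features, and these signatures and rules multiply as the volume of malware grows exponentially. Traditional rule-based detection can no longer cover all malware, so endpoint security vendors have invested heavily in sandboxing and machine learning in the hope of improving detection.
The test data comes from Marco Ramilli's MIST dataset (Malware Instruction Set for Behaviour Analysis). MIST is built by analyzing a large number of malware samples and extracting static file features as well as dynamic program behavior features; the feature extraction process is shown in Figure 14-2.
The code that loads the dataset is shown below: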
def load_files():
    # One label index per malware family, in this fixed order.
    malware_class = ['APT1', 'Crypto', 'Locker', 'Zeus']
    x = []
    y = []
    for i, family in enumerate(malware_class):
        pattern = "../data/malware/MalwareTrainingSets-master/trainingSets/%s/*" % family
        print("Load files from %s index %d" % (pattern, i))
        v = load_files_from_dir(pattern)
        x += v
        # Every file in this directory gets the family index as its label.
        y += [i] * len(v)
    print("Loaded files %d" % len(x))
    return x, y
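Each file is processed as follows:
import glob
import re

def load_files_from_dir(pattern):
    files = glob.glob(pattern)
    result = []
    for file in files:
        # print("Load file %s" % file)
        with open(file) as f:
            lines = f.readlines()
            # Join the whole file into a single string.
            lines_to_line = " ".join(lines)
            # Strip family names from the text (presumably so the label cannot
            # leak into the features). A group with alternation matches whole
            # names; square brackets would match single characters instead.
            lines_to_line = re.sub(r"(APT|Crypto|Locker|Zeus)", ' ', lines_to_line, flags=re.I)
            result.append(lines_to_line)
    return result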
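Stripping family names with a regular expression is easy to get subtly wrong: square brackets define a character class, so a pattern written as `[APT|Crypto|Locker|Zeus]` replaces individual letters rather than whole names. The following is a minimal sketch contrasting the two forms; the sample string is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample line; not taken from the dataset.
sample = "Zeus sample calls CreateProcess and opens port 443"

# Square brackets form a character class: every single character in the set
# (A, P, T, |, C, r, y, p, t, o, ...) is replaced, case-insensitively.
char_class = re.sub(r"[APT|Crypto|Locker|Zeus]", " ", sample, flags=re.I)

# A group with alternation replaces only the whole family names.
alternation = re.sub(r"(APT|Crypto|Locker|Zeus)", " ", sample, flags=re.I)

print(repr(char_class))
print(repr(alternation))
```

With the character class, most of the API name `CreateProcess` is blanked out as well, which removes information the 2-gram features would otherwise use.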
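(1) Ngram-TFIDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

def get_feature_text():
    x, y = load_files()
    max_features = 1000
    # Binary presence of word-level 2-grams, keeping the 1000 most frequent.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        ngram_range=(2, 2),
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        token_pattern=r'\b\w+\b',
        binary=True)
    print(vectorizer)
    x = vectorizer.fit_transform(x)
    # Reweight the 2-gram matrix with TF-IDF.
    transformer = TfidfTransformer(smooth_idf=False)
    x = transformer.fit_transform(x)
    # Important: convert the sparse matrix to a dense array.
    x = x.toarray()
    # Hold out 40% of the samples for testing.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test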
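To make the feature pipeline concrete, here is a minimal pure-Python sketch of binary 2-gram extraction followed by TF-IDF weighting. It follows the formula scikit-learn documents for `smooth_idf=False`, idf = ln(n / df) + 1, with L2 row normalization. The helper names `bigrams` and `tfidf_binary_bigrams` and the toy documents are hypothetical, and stop-word removal and the `max_features` cap are left out for brevity.

```python
import math
import re

def bigrams(text):
    """Return word-level 2-grams; tokens are runs of word characters."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_binary_bigrams(docs):
    """Binary 2-gram presence weighted by idf = ln(n / df) + 1, then L2-normalised."""
    doc_grams = [set(bigrams(d)) for d in docs]
    vocab = sorted(set().union(*doc_grams))
    n = len(docs)
    idf = {}
    for gram in vocab:
        df = sum(gram in grams for grams in doc_grams)  # document frequency
        idf[gram] = math.log(n / df) + 1.0
    rows = []
    for grams in doc_grams:
        row = [idf[g] if g in grams else 0.0 for g in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Toy "behaviour reports"; each token stands in for one observed action.
docs = [
    "load_dll open_file write_file",
    "load_dll open_file connect_socket",
    "create_process connect_socket send_data",
]
vocab, matrix = tfidf_binary_bigrams(docs)
for gram, weight in zip(vocab, matrix[0]):
    print("%-32s %.3f" % (gram, weight))
```

In the first toy document, the 2-gram shared with another document receives a lower weight than the 2-gram unique to it, which is the effect TF-IDF is meant to produce.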
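(2) Ngram-2D
import numpy as np

def get_feature_pe_picture():
    # Load the raw files.
    x, y = load_files()
    max_features = 1024
    # Raw 2-gram counts (binary=False), keeping the 1024 most frequent.
    vectorizer = CountVectorizer(
        decode_error='ignore',
        ngram_range=(2, 2),
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1,
        dtype=np.int64,  # np.int was removed in NumPy 1.24
        token_pattern=r'\b\w+\b',
        binary=False)
    print(vectorizer)
    x = vectorizer.fit_transform(x)
    # Important: convert the sparse matrix to a dense array.
    x = x.toarray()
    x_pic = []
    # Iterate over all samples instead of hard-coding the count (4762).
    for i in range(x.shape[0]):
        # Reshape each 1024-element count vector into a 32x32x1 "image".
        pic = np.reshape(x[i], (32, 32, 1))
        x_pic.append(pic)
        # save_image(pic, i)
    # Randomly split into training and test sets.
    x_train, x_test, y_train, y_test = train_test_split(x_pic, y, test_size=0.4)
    return x_train, x_test, y_train, y_test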
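The reshape step treats each 1024-element count vector as a 32x32 single-channel image, filled row by row. The sketch below shows the same mapping in plain Python so the index arithmetic is visible; `to_image` is a hypothetical helper, and the input is a stand-in vector rather than real 2-gram counts. Note that the reshape only works when the vectorizer actually yields 1024 features, which `max_features` caps but does not guarantee for a small vocabulary.

```python
def to_image(vector, side=32):
    """Reshape a flat vector of side*side values into side x side x 1 nested lists."""
    if len(vector) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(vector)))
    # Row-major order: element r * side + c lands at row r, column c.
    return [[[vector[r * side + c]] for c in range(side)] for r in range(side)]

counts = list(range(1024))  # stand-in for one row of 2-gram counts
image = to_image(counts)
print(len(image), len(image[0]), len(image[0][0]))
```

Adjacent pixels in this image are simply adjacent vocabulary entries, so the spatial neighbourhoods a 2D convolution looks at carry no particular meaning here.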
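(1) XGBoost
import xgboost as xgb
from sklearn.metrics import classification_report

def do_xgboost(x_train, x_test, y_train, y_test):
    # Gradient-boosted trees with default hyperparameters.
    xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
    y_pred = xgb_model.predict(x_test)
    print(classification_report(y_test, y_pred))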
(2) SVM
def do_svm(x_train, x_test, y_train, y_test):
    from sklearn.svm import SVC
    # SVC is imported directly, so it is called without the svm. prefix.
    clf = SVC(kernel='linear', C=1.0)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
(3) MLP
from sklearn import metrics
from sklearn.neural_network import MLPClassifier

def do_mlp(x_train, x_test, y_train, y_test):
    # Two hidden layers with 10 and 4 units, trained with L-BFGS.
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(10, 4),
                        random_state=1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))
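All three classifiers are evaluated with a classification report, and the MLP also prints a confusion matrix. As a rough illustration of what those numbers mean, the sketch below computes a confusion matrix and per-class precision, recall, and F1 by hand. The function names and the toy labels are hypothetical; real runs would use the scikit-learn functions called above.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_scores(matrix):
    """Return (precision, recall, f1, support) for each class."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        support = sum(matrix[k])                   # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores

# Toy labels for the four families (0=APT1, 1=Crypto, 2=Locker, 3=Zeus).
y_true = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 1, 2, 3, 3, 3, 3]
cm = confusion_matrix(y_true, y_pred, 4)
scores = per_class_scores(cm)
for k, (p, r, f, s) in enumerate(scores):
    print("class %d  precision %.2f  recall %.2f  f1 %.2f  support %d" % (k, p, r, f, s))
```

Reading the matrix row by row shows which families are confused with which, something the aggregate accuracy figure hides.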
(4) CNN_2d
import tflearn
from tflearn.data_utils import to_categorical
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression

def do_cnn_2d(trainX, testX, trainY, testY):
    print("text feature and cnn 2d")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=4)
    testY = to_categorical(testY, nb_classes=4)
    # Building convolutional network: two conv/pool/normalization stages,
    # two small fully connected layers, then a 4-way softmax.
    network = input_data(shape=[None, 32, 32, 1], name='input')
    network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
    network = max_pool_2d(network, 2)
    network = local_response_normalization(network)
    network = fully_connected(network, 16, activation='tanh')
    network = dropout(network, 0.1)
    network = fully_connected(network, 16, activation='tanh')
    network = dropout(network, 0.1)
    network = fully_connected(network, 4, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.01,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),
              show_metric=True, run_id="malware")
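Both CNN variants convert the integer family labels to binary vectors before training, because the softmax output layer and the categorical cross-entropy loss expect one row per sample with a single 1 in the position of the true class. The sketch below shows that conversion in plain Python; `one_hot` is a hypothetical stand-in written for illustration, not the library function used above.

```python
def one_hot(labels, nb_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        if not 0 <= label < nb_classes:
            raise ValueError("label %r outside 0..%d" % (label, nb_classes - 1))
        rows.append([1.0 if k == label else 0.0 for k in range(nb_classes)])
    return rows

# Label order follows the family list used when loading the files.
families = ['APT1', 'Crypto', 'Locker', 'Zeus']
encoded = one_hot([0, 3, 1], nb_classes=len(families))
for row in encoded:
    print(row)
```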
(5) CNN_1d
import tensorflow as tf
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge

def do_cnn_1d(trainX, testX, trainY, testY):
    print("text feature and cnn")
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=4)
    testY = to_categorical(testY, nb_classes=4)
    # Building convolutional network
    network = input_data(shape=[None, 1000], name='input')
    # Note: an embedding layer looks up integer indices, whereas the
    # inputs here are TF-IDF floats.
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
    # Three parallel convolutions with window sizes 3, 4 and 5.
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 4, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100, run_id="malware")
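The three `conv_1d` branches use `padding='valid'`, so each one shortens the 1000-step sequence by its window size minus one before the branches are concatenated along the sequence axis. The short sketch below works through that arithmetic; `valid_conv_length` is a hypothetical helper, and a stride of 1 is assumed because the code above does not set one.

```python
def valid_conv_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding ('valid')."""
    return (input_length - kernel_size) // stride + 1

sequence_length = 1000          # matches input_data(shape=[None, 1000])
kernel_sizes = (3, 4, 5)        # the three branches

lengths = [valid_conv_length(sequence_length, k) for k in kernel_sizes]
merged_length = sum(lengths)    # concatenation along axis=1

print(lengths)
print(merged_length)
```

The global max pool that follows then reduces the merged sequence to one value per filter, so the classifier head sees 128 features regardless of these lengths.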
Results on the 1D features are shown below. Judging from the output, the CNN performs poorly.
xgboost
              precision    recall  f1-score   support
           0       0.98      0.94      0.96       113
           1       0.97      0.95      0.96       803
           2       0.96      0.90      0.93       205
           3       0.94      0.98      0.96       784
    accuracy                           0.95      1905
   macro avg       0.96      0.94      0.95      1905
weighted avg       0.96      0.95      0.95      1905
svm
              precision    recall  f1-score   support
           0       0.96      0.92      0.94       113
           1       0.95      0.96      0.95       803
           2       0.91      0.87      0.89       205
           3       0.94      0.94      0.94       784
    accuracy                           0.94      1905
   macro avg       0.94      0.92      0.93      1905
weighted avg       0.94      0.94      0.94      1905
cnn
| Adam | epoch: 001 | loss: 1.09685 - acc: 0.4432 | val_loss: 1.13283 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 002 | loss: 1.09272 - acc: 0.4425 | val_loss: 1.12148 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 003 | loss: 1.11942 - acc: 0.4117 | val_loss: 1.11967 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 004 | loss: 1.12596 - acc: 0.4221 | val_loss: 1.12072 - val_acc: 0.4089 -- iter: 2857/2857
| Adam | epoch: 005 | loss: 1.11561 - acc: 0.4272 | val_loss: 1.12084 - val_acc: 0.4089 -- iter: 2857/2857
The 2D CNN results are shown below; they are not much better.
| Adam | epoch: 001 | loss: 1.23541 - acc: 0.4109 | val_loss: 1.11576 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 002 | loss: 1.16763 - acc: 0.4203 | val_loss: 1.11529 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 003 | loss: 1.12465 - acc: 0.4194 | val_loss: 1.11524 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 004 | loss: 1.11964 - acc: 0.4281 | val_loss: 1.11697 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 005 | loss: 1.11276 - acc: 0.4242 | val_loss: 1.11429 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 006 | loss: 1.11595 - acc: 0.4346 | val_loss: 1.11510 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 007 | loss: 1.10915 - acc: 0.4170 | val_loss: 1.10926 - val_acc: 0.4247 -- iter: 2857/2857
| Adam | epoch: 008 | loss: 1.11696 - acc: 0.4282 | val_loss: 1.10626 - val_acc: 0.4268 -- iter: 2857/2857
| Adam | epoch: 009 | loss: 1.14538 - acc: 0.4108 | val_loss: 1.08093 - val_acc: 0.4268 -- iter: 2857/2857
| Adam | epoch: 010 | loss: 1.09208 - acc: 0.4215 | val_loss: 1.08760 - val_acc: 0.4241 -- iter: 2857/2857
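XGBoost, by contrast, performs well. CNNs generally do well on images, but on this malware classification task both CNN variants perform poorly: validation accuracy stays between roughly 0.41 and 0.43 and barely changes from epoch to epoch.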
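One way to judge how poor the CNN numbers are is to compare them with two trivial baselines: always predicting the largest class, and always outputting the class proportions. The sketch below computes both from the per-class support in the XGBoost report. It assumes the CNN runs were split with similar class proportions, which the logs do not confirm since each run draws its own random split.

```python
import math

# Per-class support taken from the XGBoost report (classes 0 to 3).
support = [113, 803, 205, 784]
total = sum(support)

# Accuracy of always predicting the largest class.
majority_baseline = max(support) / total

# Cross-entropy of always predicting the class proportions
# (the entropy of the class prior, in nats).
priors = [count / total for count in support]
prior_cross_entropy = -sum(p * math.log(p) for p in priors)

print("majority baseline: %.4f" % majority_baseline)
print("prior cross-entropy: %.4f" % prior_cross_entropy)
```

Under that assumption the baselines come out near 0.42 accuracy and 1.14 loss, close to the logged val_acc and val_loss. That would be consistent with networks that have learned little beyond the class proportions, rather than with networks that are merely undertrained.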
Source: https://blog.csdn.net/mooyuan/article/details/123441457