天志,一个没有故事的男童鞋,在深圳南漂,过着苦逼的日子,想着美好的人生,平淡无奇却又想充满惊喜。。。

φ(≧ω≦*)♪

监督学习之knn和svm

2018.02.03

    偷懒了好几天,这两天总算把信息安全的进度赶了赶,在关联算法失效的时候决定用监督学习的算法解决,起初决定采用knn来分类,在后续学习中,无意发现了svm,在测试中发现svm的准确率比最好的knn搞0.1个百分比,故最终采用了svm。下对两种监督学习进行简介。

一、简单的理论介绍

    首先,对监督学习讲解一下,监督学习和无监督学习的区别就是训练数据的label的有无,监督学习需要带有label,而无监督学习不需要label。在这次实践中,我们做的是网络入侵检测,采用的数据集是kddcup的99数据集(1), 拥有42个维度,其中前41个是描述网络行为的特征值,最后一个是类型,即label。

    knn,即k-NearestNeighbor,k最邻近问题,为方便介绍,假设每条数据有两个特征值x和y,一个label,即点的颜色,先将所有数据放在平面直角坐标系中,如下图1.1的红点和蓝点,红点和蓝点所构成的所有点即为训练集,而绿点则是测试点,k最邻近问题的最邻近就是直观的邻近的意思,即离得近,而k指的是找几个离得最近的,如果k=3,那么所选的点即为实线所包含的三个点,若k=5,则为虚线所包含的五个点,而对于预测点分类的预测则是根据所选k个点中最多个数的类别所确定,同样以下图为例,如果k=3,那么预测点的结果将为红色(2个红色,1个蓝色),如果k=5,那么预测点的结果将为蓝色(3个蓝色,2个红色),由此可见,参数k的选取直接影响了预测结果的准确度。


图1.1 knn示例

    svm,即Support Vector Machine,中文翻译也很暴力,支持向量机,一听就给人一种懵逼的感觉,下面做简单分解解释:机,即机器,指的是这个模型是一个机器,此外它的作用是分类,所以可以理解为一个分类用的机器,support vector之后再介绍。同样为了简单介绍采用二维介绍,样本同样是带有颜色label的有x和y两个属性的训练点集合,如下图1.2,svm要做什么呢,它要找一条线,使得把两个类别的点区分开来,那么对于接下来的测试点,看测试点位于哪一侧,就将其归类于该类,那么问题来了,符合这个要求的线有很多条,比如图中的黑线和灰线就是其中的两条,那么什么才是最优解呢,现在就要介绍support vector了,就是两个类别中的点离这条分割线最近的距离,如何才是最优解呢,就是让两个类别的离分割线最近的点,再回到最近的问题,什么才是最优解呢,那就是支持向量离分割线越远越好,因为距离越远,允许容纳的点越多,使得分类的越平均,更加理想。


图1.2 svm示例

    上面所说的是线性可区分的,然而遇到线性不可分的该怎么办呢,svm又引入了kernel,kernel学的没有十分的透彻,简单的理解了一下,就是把平面变成了曲面,即将原平面函数变成凸二次函数,再加上约束条件,就完成了一个通过增维使非线性样点线性可区分的目的,之后用一个平面去切割凸面,则就找到了一个分割面,把测试点带入凸二次函数中,即可得到划分结果。

    有个可视化svm的kernel的视频(2),个人看了表示恍然大悟的感觉

    对于高维的svm,同理二维,继续升维,找到超平面,一刀切下去即可、不过据说还有些更高级的处理方式,学疏才浅,没看明白,就不瞎掰扯了。

二、实践

    在这次实践中,主要用的是numpy和sklearn两个库,numpy用于存放训练集,label集,还有测试集,以及预测结果,sklearn则是提供了众多机器学习模型,因为懒,不想自己写了,所以直接调用了。

    先说knn吧,先建它一个knn,后面的参数是调整k值的(默认是5),之后把原材料,即训练数据和训练标签fit进去,好了,一个knn就搭建好了。

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(trainData,trainLabel)

    之后,把测试数据predict一下,就可以得到预测结果了。

predict = knn.predict(testData)

    此外,还可以知道每个类别的可能性。

predict_proba = knn.predict_proba(testData)

    好了,这样一个knn的搭建、导入训练集、预测的过程就结束了。

    之后再来说svm的,更加简单。

from sklearn import 
svm = svm.SVC()
svm.fit(trainData,trainLabel)

    预测一下testData。

predict = clf.predict(testData)

三、彩(shua)蛋(ping)

    如开头所说,我本来是学习knn的,后来偶然间发现了svm,所以我就也测试了一下,将knn的k从1到49(range(50)呵呵哒)都试了下,结果惊喜的发现,svm比最好的knn的预测结果都准一点点,测试结果如下

1:
knn: Total:494021 Correct: 433614 Error: 60407 Accuracy: 0.8777238214569826
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

2:
knn: Total:494021 Correct: 431819 Error: 62202 Accuracy: 0.8740903726764652
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

3:
knn: Total:494021 Correct: 433837 Error: 60184 Accuracy: 0.8781752192720552
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

4:
knn: Total:494021 Correct: 433838 Error: 60183 Accuracy: 0.878177243477504
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

5:
knn: Total:494021 Correct: 434368 Error: 59653 Accuracy: 0.8792500723653448
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

6:
knn: Total:494021 Correct: 436099 Error: 57922 Accuracy: 0.8827539719971418
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

7:
knn: Total:494021 Correct: 436564 Error: 57457 Accuracy: 0.8836952275308134
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

8:
knn: Total:494021 Correct: 436603 Error: 57418 Accuracy: 0.883774171543315
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

9:
knn: Total:494021 Correct: 436921 Error: 57100 Accuracy: 0.8844178688760195
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

10:
knn: Total:494021 Correct: 436910 Error: 57111 Accuracy: 0.8843956026160831
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

11:
knn: Total:494021 Correct: 437011 Error: 57010 Accuracy: 0.8846000473664075
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

12:
knn: Total:494021 Correct: 436955 Error: 57066 Accuracy: 0.8844866918612772
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

13:
knn: Total:494021 Correct: 436978 Error: 57043 Accuracy: 0.8845332485865985
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

14:
knn: Total:494021 Correct: 437036 Error: 56985 Accuracy: 0.8846506525026264
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

15:
knn: Total:494021 Correct: 437080 Error: 56941 Accuracy: 0.8847397175423717
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

16:
knn: Total:494021 Correct: 436660 Error: 57361 Accuracy: 0.883889551253894
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

17:
knn: Total:494021 Correct: 436829 Error: 57192 Accuracy: 0.8842316419747339
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

18:
knn: Total:494021 Correct: 436522 Error: 57499 Accuracy: 0.8836102109019657
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

19:
knn: Total:494021 Correct: 436659 Error: 57362 Accuracy: 0.8838875270484453
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

20:
knn: Total:494021 Correct: 436492 Error: 57529 Accuracy: 0.883549484738503
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

21:
knn: Total:494021 Correct: 436868 Error: 57153 Accuracy: 0.8843105859872353
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

22:
knn: Total:494021 Correct: 436497 Error: 57524 Accuracy: 0.8835596057657468
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

23:
knn: Total:494021 Correct: 436452 Error: 57569 Accuracy: 0.8834685165205528
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

24:
knn: Total:494021 Correct: 436264 Error: 57757 Accuracy: 0.8830879658961865
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

25:
knn: Total:494021 Correct: 436323 Error: 57698 Accuracy: 0.8832073940176632
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

26:
knn: Total:494021 Correct: 436219 Error: 57802 Accuracy: 0.8829968766509926
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

27:
knn: Total:494021 Correct: 436259 Error: 57762 Accuracy: 0.8830778448689428
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

28:
knn: Total:494021 Correct: 436098 Error: 57923 Accuracy: 0.8827519477916931
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

29:
knn: Total:494021 Correct: 436127 Error: 57894 Accuracy: 0.882810649749707
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

30:
knn: Total:494021 Correct: 435959 Error: 58062 Accuracy: 0.8824705832343159
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

31:
knn: Total:494021 Correct: 436223 Error: 57798 Accuracy: 0.8830049734727876
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

32:
knn: Total:494021 Correct: 436117 Error: 57904 Accuracy: 0.8827904076952194
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

33:
knn: Total:494021 Correct: 436320 Error: 57701 Accuracy: 0.8832013214013169
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

34:
knn: Total:494021 Correct: 436410 Error: 57611 Accuracy: 0.8833834998917051
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

35:
knn: Total:494021 Correct: 436500 Error: 57521 Accuracy: 0.8835656783820931
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

36:
knn: Total:494021 Correct: 436557 Error: 57464 Accuracy: 0.8836810580926722
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

37:
knn: Total:494021 Correct: 436564 Error: 57457 Accuracy: 0.8836952275308134
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

38:
knn: Total:494021 Correct: 435528 Error: 58493 Accuracy: 0.881598150685902
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

39:
knn: Total:494021 Correct: 435561 Error: 58460 Accuracy: 0.881664949465711
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

40:
knn: Total:494021 Correct: 154708 Error: 339313 Accuracy: 0.31316077656617836
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

41:
knn: Total:494021 Correct: 154468 Error: 339553 Accuracy: 0.3126749672584769
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

42:
knn: Total:494021 Correct: 154426 Error: 339595 Accuracy: 0.31258995062962913
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

43:
knn: Total:494021 Correct: 150380 Error: 343641 Accuracy: 0.3044000153839614
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

44:
knn: Total:494021 Correct: 132313 Error: 361708 Accuracy: 0.26782869554128264
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

45:
knn: Total:494021 Correct: 124802 Error: 369219 Accuracy: 0.25262488841567465
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

46:
knn: Total:494021 Correct: 124646 Error: 369375 Accuracy: 0.25230911236566866
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

47:
knn: Total:494021 Correct: 116557 Error: 377464 Accuracy: 0.23593531449067956
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

48:
knn: Total:494021 Correct: 102720 Error: 391301 Accuracy: 0.20792638369623964
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

49:
knn: Total:494021 Correct: 102005 Error: 392016 Accuracy: 0.20647907680037894
svm: Total:494021 Correct: 437527 Error: 56494 Accuracy: 0.8856445373779657

    遂knn卒,svm胜,狭路相逢准者胜。

    特此公告:

        初学者学习记录,不对之处请指正,不喜勿喷。

(1) kddcup99数据集地址

(2) kernel视频链接

发表评论