Use of machine learning for anomaly detection in netflow data
Threshold-based alerting on netflow data has proved ineffective, even for a simple binary classification of flows into DDoS and not DDoS. This brief investigation aims to show that aggregated netflow data exhibits recurring patterns, and that machine learning can identify those patterns automatically.
The data comes from aggregated flows on the core routers of a cloud service provider, collected using nfdump with the -l 1000 parameter, which ignores flows with fewer than 1k packets. The training data was collected over two weeks of observation.
The k-nearest neighbors (kNN) algorithm was chosen because it is easy to implement and works well for classifying samples described by numeric features.
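To make the idea concrete, here is a minimal pure-Python sketch of kNN: a sample gets the label that wins a majority vote among its k closest training points, using Euclidean distance. The function names and the toy data are invented for illustration and are not taken from the notebook.

```python
import math
from collections import Counter


def knn_neighbors(train, query, k=5):
    """Return the labels of the k training points closest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    return [label for _, label in by_distance[:k]]


def knn_predict(train, query, k=5):
    """Classify query by majority vote among its k nearest neighbors."""
    votes = Counter(knn_neighbors(train, query, k))
    return votes.most_common(1)[0][0]


# Toy two-feature data (hypothetical values, not from the real dataset).
toy_train = [
    ((1.0, 1.0), "http"),
    ((1.2, 0.9), "http"),
    ((0.9, 1.1), "http"),
    ((5.0, 5.0), "small-flood"),
    ((5.2, 4.8), "small-flood"),
    ((4.9, 5.1), "small-flood"),
]

print(knn_predict(toy_train, (1.1, 1.0), k=3))
print(knn_predict(toy_train, (5.1, 5.0), k=3))
```

There is no training step beyond storing the labeled points, which is a large part of why the algorithm is easy to implement.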
Considered features were:
- duration
- number of flows
- total number of packets
- total bytes
- packets per second
- bits per second (bps)
- bytes per packet (bpp)
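The last three features are derived from the first ones. The sketch below shows the arithmetic, using the counters of raw sample 4 from later in this post (duration 0.127 s, 2896 packets, 127800 bytes). The formulas are inferred from the sample records, and derived_rates is a hypothetical helper name; nfdump computes these values itself.

```python
def derived_rates(duration_s, packets, total_bytes):
    """Compute pps, bps and bpp from the raw counters of one aggregated record."""
    if duration_s <= 0 or packets <= 0:
        # Avoid division by zero for degenerate records.
        return 0.0, 0.0, 0.0
    pps = packets / duration_s          # packets per second
    bps = 8 * total_bytes / duration_s  # bits per second
    bpp = total_bytes / packets         # bytes per packet
    return pps, bps, bpp


pps, bps, bpp = derived_rates(0.127, 2896, 127800)
print(int(pps), round(bps / 1e6, 1), int(bpp))  # prints: 22803 8.1 44
```

These match the pps, bps (8.1 M) and bpp columns that nfdump printed for that record.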
Supervised classification
Scatter plot of typical, non-problematic flows
This graph shows typical, non-anomalous flows, manually labeled from the top-ten output of nfdump -l 1000 -s dstip/pps/bps

Small outgoing UDP flood
The previously plotted data is overlaid with flows from harmful outgoing floods.

The previously plotted data is overlaid with flows from incoming DDoS attacks.

Applying kNN
Check against test samples
The kNN classifier is checked against manually labeled samples.
Samples are manually labeled from the output of:
nfdump -l 1K -M $PATH_TO_NFCAP_DATA -r $NFCAP_TARGET_FILE -n 1000 -s record/pps/bps -s dstip/pps/bps -s srcip/pps/bps
Only dstip/pps/bps is used as input data.
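nfdump prints large counters with scaled suffixes (for example 1.3 M or 1.8 G in the records of this section), so they have to be converted to plain numbers before being used as features. A minimal sketch follows, assuming decimal multipliers (K = 1000, M = 10^6, and so on), which is consistent with the figures in the samples. parse_scaled and the suffix table are hypothetical and not part of nfdump.

```python
# Assumed decimal multipliers for the suffixes in nfdump's text output.
_SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def parse_scaled(text):
    """Convert strings such as '1.3 M', '332.0 M' or '2896' to a float."""
    parts = text.split()
    value = float(parts[0])
    if len(parts) == 2:
        # A second token is the scale suffix, e.g. 'M' in '1.3 M'.
        value *= _SUFFIXES[parts[1]]
    return value


for raw in ("2896", "1.3 M", "1.8 G"):
    print(raw, "->", parse_scaled(raw))
```

The scaled values are rounded by nfdump, so features rebuilt from them are only approximate; working from unscaled output avoids that loss where it is available.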
raw sample 1:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-01-11 07:12:14.667 4295562.914 any XXX.XXX.XXX.XXX 14045( 0.2) 1.3 M( 0.6) 1.8 G( 1.1) 0 3398 1363
label: strange http flows
5 nearest neighbors are (denormalized): incoming-flood, small-flood, small-flood, small-flood, tivoli
5 nearest neighbors are (normalized): tivoli, tivoli, tivoli, tivoli, tivoli
=================
raw sample 2:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-09 05:18:30.579 387.262 any XXX.XXX.XXX.XXX 13( 0.0) 2.5 M( 2.2) 3.8 G( 4.6) 6499 77.9 M 1497
label: concurrent https sessions
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, tivoli, small-flood, incoming-flood, tivoli
=================
raw sample 3:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-10 01:59:06.179 328.155 any XXX.XXX.XXX.XXX 59( 0.0) 3.1 M( 1.7) 4.7 G( 3.0) 9597 115.1 M 1498
label: https traffic
5 nearest neighbors are (denormalized): tivoli, small-flood, small-flood, incoming-flood, tivoli
5 nearest neighbors are (normalized): tivoli, small-flood, http, tivoli, incoming-flood
=================
raw sample 4:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-11 13:21:24.747 0.127 any XXX.XXX.XXX.XXX 1( 0.0) 2896( 0.0) 127800( 0.0) 22803 8.1 M 44
label: http
5 nearest neighbors are (denormalized): http, http, smtp, http, http
5 nearest neighbors are (normalized): http, http, http, http, http
=================
raw sample 5:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-03 00:04:46.118 313.433 any XXX.XXX.XXX.XXX 65544( 1.0) 958845( 0.5) 1.1 G( 0.6) 3059 27.9 M 1139
label: strange TCP/80
5 nearest neighbors are (denormalized): tivoli, small-flood, tivoli, tivoli, tivoli
5 nearest neighbors are (normalized): tivoli, http, small-flood, tivoli, incoming-flood
=================
raw sample 6:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-03 00:01:26.568 511.240 any XXX.XXX.XXX.XXX 16509( 0.3) 799646( 0.4) 332.0 M( 0.2) 1564 5.2 M 415
label: strange TCP/80
5 nearest neighbors are (denormalized): http, http, tivoli, http, http
5 nearest neighbors are (normalized): tivoli, tivoli, tivoli, small-flood, incoming-flood
=================
raw sample 7:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-02 23:03:23.172 0.015 any XXX.XXX.XXX.XXX 1( 0.0) 1546( 0.0) 2.3 M( 0.0) 103066 1.2 G 1498
label: TCP/80
5 nearest neighbors are (denormalized): http, http, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, smtp, http, http, smtp
=================
raw sample 8:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-03 02:59:50.060 202.630 any XXX.XXX.XXX.XXX 22( 0.0) 1.8 M( 1.2) 2.7 G( 2.3) 8898 106.7 M 1498
label: high http traffic
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, small-flood, small-flood
=================
raw sample 9:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-04 05:09:30.969 15.477 any XXX.XXX.XXX.XXX 10( 0.0) 117404( 0.1) 172.7 M( 0.2) 7585 89.3 M 1471
label: http traffic
5 nearest neighbors are (denormalized): http, http, http, tivoli, http
5 nearest neighbors are (normalized): smtp, http, small-flood, http, small-flood
=================
raw sample 10:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-10 17:14:52.652 242.550 any XXX.XXX.XXX.XXX 12( 0.0) 13.5 M( 4.3) 20.1 G( 8.3) 55529 664.4 M 1495
label: nfs
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, tivoli, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, small-flood, small-flood
=================
raw sample 11:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-07 01:55:01.695 81.326 any XXX.XXX.XXX.XXX 3( 0.0) 1.4 M( 0.8) 2.0 G( 1.6) 16790 195.4 M 1454
label: port 2305
5 nearest neighbors are (denormalized): http, small-flood, small-flood, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 12:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-04 17:51:56.955 0.186 any XXX.XXX.XXX.XXX 1( 0.0) 2867( 0.0) 4.1 M( 0.0) 15413 174.6 M 1416
label: smtp
5 nearest neighbors are (denormalized): http, smtp, smtp, http, http
5 nearest neighbors are (normalized): http, http, http, http, smtp
=================
raw sample 13:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-09 05:08:45.970 372.588 any XXX.XXX.XXX.XXX 169836( 2.4) 2.7 M( 2.3) 1.1 G( 1.3) 7217 23.8 M 412
label: smtp spam
5 nearest neighbors are (denormalized): tivoli, small-flood, tivoli, tivoli, tivoli
5 nearest neighbors are (normalized): tivoli, incoming-flood, small-flood, tivoli, small-flood
=================
raw sample 14:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-10 17:09:50.900 305.083 any XXX.XXX.XXX.XXX 44036( 0.4) 801745( 0.3) 417.9 M( 0.2) 2627 11.0 M 521
label: this should be a massive smtp sending (target MTA has ETRN)
5 nearest neighbors are (denormalized): http, http, tivoli, tivoli, http
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 15:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-10 02:09:57.208 297.278 any XXX.XXX.XXX.XXX 504( 0.0) 1.5 M( 0.9) 2.1 G( 1.5) 4881 55.3 M 1415
label: TCP 7022
5 nearest neighbors are (denormalized): tivoli, small-flood, http, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 16:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-09 17:18:43.460 289.856 any XXX.XXX.XXX.XXX 9( 0.0) 34.8 M(10.5) 2.7 G( 1.1) 120118 75.0 M 78
label: udp 8090
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 17:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-03 00:05:29.324 249.729 any XXX.XXX.XXX.XXX 41228( 0.6) 1.7 M( 0.8) 2.3 G( 1.3) 6793 73.8 M 1358
label: small outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 18:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-13 14:30:05.904 91.027 any XXX.XXX.XXX.XXX 16550( 0.2) 13.3 M( 5.0) 18.2 G( 7.6) 145970 1.6 G 1370
label: outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, tivoli, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 19:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-02-27 08:34:09.931 275.588 any XXX.XXX.XXX.XXX 396858( 4.2) 14.9 M( 8.1) 20.2 G(13.8) 54037 587.4 M 1358
label: outgoing udp flood
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, tivoli, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 20:
Date first seen Duration Proto Dst IP Addr Flows(%) Packets(%) Bytes(%) pps bps bpp
2015-03-13 12:18:01.203 653.575 any XXX.XXX.XXX.XXX 23762( 0.2) 91.3 M(29.0) 125.6 G(41.5) 139753 1.5 G 1374
label: outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): small-flood, small-flood, tivoli, tivoli, tivoli
=================
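The normalized and denormalized neighbor lists above often disagree. One likely reason is that, on raw values, features with a large magnitude such as bps dominate the Euclidean distance. The sketch below illustrates the effect with min-max scaling on invented two-feature data (bpp, bps). Min-max scaling is an assumption here, since the post does not state which normalization was applied, and all names and values are hypothetical.

```python
import math


def min_max_scale(rows):
    """Rescale every column of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # A constant column would give a zero span; fall back to 1.0.
    spans = [(max(c) - low) or 1.0 for c, low in zip(columns, lows)]
    return [
        tuple((v - low) / span for v, low, span in zip(row, lows, spans))
        for row in rows
    ]


def nearest_label(rows, labels, query_index):
    """Label of the row closest (Euclidean) to rows[query_index], excluding itself."""
    query = rows[query_index]
    candidates = [
        (math.dist(row, query), label)
        for i, (row, label) in enumerate(zip(rows, labels))
        if i != query_index
    ]
    return min(candidates)[1]


# Columns: bytes per packet, bits per second (invented values).
rows = [
    (70.0, 61e6),    # small packets, close to the query in bps only
    (1480.0, 90e6),  # large packets, like the query
    (1400.0, 5e6),   # large packets, low rate
    (1450.0, 60e6),  # the query sample
]
labels = ["small-flood", "http", "http", "?"]

print(nearest_label(rows, labels, 3))                 # prints: small-flood
print(nearest_label(min_max_scale(rows), labels, 3))  # prints: http
```

On raw values the query lands next to the flood because only bps matters at that scale; once both columns span the same range, packet size also contributes and the nearest neighbor changes.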
Classification report & Confusion matrix
### Classifier parameters
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform')

### Classification Report
             precision    recall  f1-score   support
          1       0.73      0.55      0.63        85
          2       0.73      0.83      0.78       155
          3       0.79      0.94      0.86        68
          4       0.68      0.58      0.62        59
          5       1.00      0.69      0.82        13
avg / total       0.74      0.74      0.74       380

### Confusion Matrix
smtp(1):           [ 47  37   0   1   0]
http(2):           [ 17 129   1   8   0]
tivoli(3):         [  0   0  64   4   0]
small-flood(4):    [  0  10  15  34   0]
incoming-flood(5): [  0   0   1   3   9]
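The figures in the classification report can be recomputed from the confusion matrix alone, which is a useful sanity check. The sketch below assumes rows are true labels and columns are predicted labels (the row sums match the support column, which supports that reading). per_class_metrics is a hypothetical helper, not the notebook's code.

```python
labels = ["smtp", "http", "tivoli", "small-flood", "incoming-flood"]
matrix = [
    [47, 37, 0, 1, 0],
    [17, 129, 1, 8, 0],
    [0, 0, 64, 4, 0],
    [0, 10, 15, 34, 0],
    [0, 0, 1, 3, 9],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) per class; rows are true labels."""
    n = len(matrix)
    results = []
    for i in range(n):
        true_pos = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        support = sum(matrix[i])                         # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


for name, (p, r, f, s) in zip(labels, per_class_metrics(matrix)):
    print(f"{name:15s} {p:.2f} {r:.2f} {f:.2f} {s}")

# Overall accuracy: correctly classified samples over all samples.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))
print(f"accuracy {accuracy:.2f}")  # prints: accuracy 0.74
```

Read this way, the matrix shows where the classifier struggles: 37 of the 85 smtp samples are predicted as http, and 15 of the 59 small-flood samples are predicted as tivoli.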
Conclusions
Despite the limited training dataset, kNN proved effective at classifying aggregated netflow data, with an average precision and recall of 0.74, as shown in the Classification Report and Confusion Matrix.
Even samples of a type that has no label in the training set are assigned to logically similar labels.
The IPython notebook with source code and updated graphs and data can be found on GitHub.