Use of machine learning for anomaly detection in netflow data



Threshold-based alerting on netflow data has proven ineffective, even for the binary classification of flows as DDoS or not DDoS. This brief investigation aims to show that common patterns emerge from aggregated netflow data, and that machine learning can identify those patterns automatically.

The data comes from aggregated flows on the core routers of a cloud service provider, collected using nfdump with the -l 1000 parameter, which ignores flows with fewer than 1k packets. The training data was collected over two weeks of observation.

The k-nearest neighbors (kNN) algorithm was chosen because it is easy to implement and effective at classifying samples described by numeric features.
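As a reminder of what the algorithm does, here is a minimal standard-library sketch of kNN classification: a sample takes the majority label of its k closest training points under Euclidean distance. The function name `knn_predict` and the toy feature values are illustrative assumptions, not code or data from the notebook.

```python
import math
from collections import Counter

def knn_predict(train, labels, sample, k=5):
    """Return (predicted label, labels of the k nearest neighbours)."""
    # Sort training indices by Euclidean distance to the sample.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], sample))
    nearest = [labels[i] for i in order[:k]]
    # Majority vote; ties go to the label seen first among the neighbours.
    return Counter(nearest).most_common(1)[0][0], nearest

# Toy, made-up points: (packets per second, bytes per packet).
train = [(100, 1400), (120, 1450), (90, 1380),
         (50000, 60), (52000, 64), (48000, 70)]
labels = ["http", "http", "http",
          "small-flood", "small-flood", "small-flood"]

predicted, neighbours = knn_predict(train, labels, (51000, 62), k=3)
print(predicted, neighbours)
```

There is no training step beyond storing the labeled points, which is why the method is so easy to put in place; the cost is paid at query time, when every stored point is compared with the sample.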

The features considered were:

  • duration
  • number of flows
  • total number of packets
  • total bytes
  • packets per second
  • bits per second
  • bytes per packet
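These features map directly onto the columns of an aggregated nfdump row (Duration, Flows, Packets, Bytes, pps, bps, bpp). The sketch below shows one way to turn such a row into a numeric feature vector. The names `parse_row`, `ROW` and `SCALE` are hypothetical, and the K/M/G suffixes are assumed to be powers of 1000; this is not the parsing code from the notebook.

```python
import re

# Assumption: nfdump scales its K/M/G suffixes by powers of 1000.
SCALE = {None: 1.0, "K": 1e3, "M": 1e6, "G": 1e9}
NUM = r"([\d.]+)(?:\s([KMG]))?"      # e.g. "2896" or "8.1 M"
PCT = r"\(\s*[\d.]+\)"               # e.g. "( 0.0)" or "(10.5)"
ROW = re.compile(
    r"^\S+\s+\S+\s+"                 # date and time first seen
    r"([\d.]+)\s+\S+\s+\S+\s+"       # duration, then proto and IP (skipped)
    + NUM + PCT + r"\s+"             # flows
    + NUM + PCT + r"\s+"             # packets
    + NUM + PCT + r"\s+"             # bytes
    + NUM + r"\s+" + NUM + r"\s+" + NUM + r"\s*$"   # pps, bps, bpp
)
FEATURES = ("duration", "flows", "packets", "bytes", "pps", "bps", "bpp")

def parse_row(line):
    """Turn one aggregated nfdump row into a dict of the seven features."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError("unrecognized row: %r" % line)
    g = m.groups()
    values = [float(g[0])]
    # The remaining groups come in (number, suffix) pairs.
    for i in range(1, len(g), 2):
        values.append(float(g[i]) * SCALE[g[i + 1]])
    return dict(zip(FEATURES, values))

line = ("2015-03-11 13:21:24.747     0.127 any     XXX.XXX.XXX.XXX"
        "        1( 0.0)     2896( 0.0)   127800( 0.0)    22803    8.1 M    44")
print(parse_row(line))
```

Values printed with a suffix are already rounded by nfdump (for example `8.1 M`), so the parsed numbers are approximations of the underlying counters.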

Supervised classification

Scatter plot of typical, non-problematic flows

This graph shows typical, non-anomalous flows, manually labeled from the top-ten output of nfdump -l 1000 -s dstip/pps/bps

/galleries/netflow/output_8_0.png

Small outgoing UDP flood

The previously plotted data is overlaid with flows from harmful outgoing floods.

/galleries/netflow/output_11_0.png

The previously plotted data is overlaid with flows from incoming DDoS attacks.

/galleries/netflow/output_13_0.png

Applying kNN

Check against test samples

Check kNN against manually labeled samples.

Samples are manually labeled from output of:

nfdump -l 1K -M $PATH_TO_NFCAP_DATA -r $NFCAP_TARGET_FILE -n 1000 -s record/pps/bps -s dstip/pps/bps -s srcip/pps/bps

Only dstip/pps/bps is used as input data.
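The output below lists the five nearest neighbors of each sample twice: once on raw ("denormalized") features and once on normalized ones. The scaling method is not stated in this post, so the sketch below assumes min-max scaling to [0, 1] and uses made-up values and labels. It only illustrates why the two lists can differ: without scaling, the feature with the largest numeric range (here bits per second) dominates the Euclidean distance.

```python
import math

def min_max_scaler(rows):
    """Build a function that rescales each feature to [0, 1] over `rows`."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    def scale(row):
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(row, lows, highs))
    return scale

def nearest_labels(train, labels, sample, k):
    """Labels of the k training points closest to `sample`."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], sample))
    return [labels[i] for i in order[:k]]

# Hypothetical training points: (duration in seconds, bits per second).
train = [(300.0, 50e6), (310.0, 90e6), (0.1, 60e6), (0.2, 100e6)]
labels = ["long-a", "long-b", "short-a", "short-b"]
sample = (305.0, 62e6)

raw = nearest_labels(train, labels, sample, k=2)
scale = min_max_scaler(train)
scaled = nearest_labels([scale(r) for r in train], labels, scale(sample), k=2)
print("denormalized:", raw)
print("normalized:  ", scaled)
```

On raw values the 305-second sample lands next to a 0.1-second flow because their bit rates are close; after scaling, duration carries comparable weight and both neighbors are long-lived flows.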

raw sample 1:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-01-11 07:12:14.667 4295562.914 any        XXX.XXX.XXX.XXX    14045( 0.2)    1.3 M( 0.6)    1.8 G( 1.1)        0     3398  1363
label:  strange http flows
5 nearest neighbors are (denormalized): incoming-flood, small-flood, small-flood, small-flood, tivoli
5 nearest neighbors are (normalized): tivoli, tivoli, tivoli, tivoli, tivoli
=================
raw sample 2:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-09 05:18:30.579   387.262 any       XXX.XXX.XXX.XXX       13( 0.0)    2.5 M( 2.2)    3.8 G( 4.6)     6499   77.9 M  1497
label:  concurrent https sessions
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, tivoli, small-flood, incoming-flood, tivoli
=================
raw sample 3:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-10 01:59:06.179   328.155 any       XXX.XXX.XXX.XXX       59( 0.0)    3.1 M( 1.7)    4.7 G( 3.0)     9597  115.1 M  1498
label:  https traffic
5 nearest neighbors are (denormalized): tivoli, small-flood, small-flood, incoming-flood, tivoli
5 nearest neighbors are (normalized): tivoli, small-flood, http, tivoli, incoming-flood
=================
raw sample 4:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-11 13:21:24.747     0.127 any     XXX.XXX.XXX.XXX        1( 0.0)     2896( 0.0)   127800( 0.0)    22803    8.1 M    44
label:  http
5 nearest neighbors are (denormalized): http, http, smtp, http, http
5 nearest neighbors are (normalized): http, http, http, http, http
=================
raw sample 5:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-03 00:04:46.118   313.433 any       XXX.XXX.XXX.XXX    65544( 1.0)   958845( 0.5)    1.1 G( 0.6)     3059   27.9 M  1139
label:  strange TCP/80
5 nearest neighbors are (denormalized): tivoli, small-flood, tivoli, tivoli, tivoli
5 nearest neighbors are (normalized): tivoli, http, small-flood, tivoli, incoming-flood
=================
raw sample 6:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-03 00:01:26.568   511.240 any      XXX.XXX.XXX.XXX    16509( 0.3)   799646( 0.4)  332.0 M( 0.2)     1564    5.2 M   415
label:  strange TCP/80
5 nearest neighbors are (denormalized): http, http, tivoli, http, http
5 nearest neighbors are (normalized): tivoli, tivoli, tivoli, small-flood, incoming-flood
=================
raw sample 7:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-02 23:03:23.172     0.015 any     XXX.XXX.XXX.XXX        1( 0.0)     1546( 0.0)    2.3 M( 0.0)   103066    1.2 G  1498
label:  TCP/80
5 nearest neighbors are (denormalized): http, http, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, smtp, http, http, smtp
=================
raw sample 8:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-03 02:59:50.060   202.630 any       XXX.XXX.XXX.XXX       22( 0.0)    1.8 M( 1.2)    2.7 G( 2.3)     8898  106.7 M  1498
label:  high http traffic
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, small-flood, small-flood
=================
raw sample 9:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-04 05:09:30.969    15.477 any         XXX.XXX.XXX.XXX       10( 0.0)   117404( 0.1)  172.7 M( 0.2)     7585   89.3 M  1471
label:  http traffic
5 nearest neighbors are (denormalized): http, http, http, tivoli, http
5 nearest neighbors are (normalized): smtp, http, small-flood, http, small-flood
=================
raw sample 10:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-10 17:14:52.652   242.550 any        XXX.XXX.XXX.XXX       12( 0.0)   13.5 M( 4.3)   20.1 G( 8.3)    55529  664.4 M  1495
label:  nfs
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, tivoli, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, small-flood, small-flood
=================
raw sample 11:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-07 01:55:01.695    81.326 any       XXX.XXX.XXX.XXX        3( 0.0)    1.4 M( 0.8)    2.0 G( 1.6)    16790  195.4 M  1454
label:  port 2305
5 nearest neighbors are (denormalized): http, small-flood, small-flood, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 12:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-04 17:51:56.955     0.186 any        XXX.XXX.XXX.XXX        1( 0.0)     2867( 0.0)    4.1 M( 0.0)    15413  174.6 M  1416
label:  smtp
5 nearest neighbors are (denormalized): http, smtp, smtp, http, http
5 nearest neighbors are (normalized): http, http, http, http, smtp
=================
raw sample 13:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-09 05:08:45.970   372.588 any       XXX.XXX.XXX.XXX   169836( 2.4)    2.7 M( 2.3)    1.1 G( 1.3)     7217   23.8 M   412
label:  smtp spam
5 nearest neighbors are (denormalized): tivoli, small-flood, tivoli, tivoli, tivoli
5 nearest neighbors are (normalized): tivoli, incoming-flood, small-flood, tivoli, small-flood
=================
raw sample 14:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-10 17:09:50.900   305.083 any      XXX.XXX.XXX.XXX    44036( 0.4)   801745( 0.3)  417.9 M( 0.2)     2627   11.0 M   521
label:  this should be a massive smtp sending (target MTA has ETRN)
5 nearest neighbors are (denormalized): http, http, tivoli, tivoli, http
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 15:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-10 02:09:57.208   297.278 any       XXX.XXX.XXX.XXX      504( 0.0)    1.5 M( 0.9)    2.1 G( 1.5)     4881   55.3 M  1415
label:  TCP 7022
5 nearest neighbors are (denormalized): tivoli, small-flood, http, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 16:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-09 17:18:43.460   289.856 any        XXX.XXX.XXX.XXX        9( 0.0)   34.8 M(10.5)    2.7 G( 1.1)   120118   75.0 M    78
label:  udp 8090
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 17:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-03 00:05:29.324   249.729 any       XXX.XXX.XXX.XXX    41228( 0.6)    1.7 M( 0.8)    2.3 G( 1.3)     6793   73.8 M  1358
label:  small outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 18:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-13 14:30:05.904    91.027 any       XXX.XXX.XXX.XXX    16550( 0.2)   13.3 M( 5.0)   18.2 G( 7.6)   145970    1.6 G  1370
label:  outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, tivoli, small-flood, tivoli
5 nearest neighbors are (normalized): small-flood, small-flood, small-flood, small-flood, small-flood
=================
raw sample 19:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-02-27 08:34:09.931   275.588 any     XXX.XXX.XXX.XXX   396858( 4.2)   14.9 M( 8.1)   20.2 G(13.8)    54037  587.4 M  1358
label:  outgoing udp flood
5 nearest neighbors are (denormalized): incoming-flood, tivoli, tivoli, tivoli, small-flood
5 nearest neighbors are (normalized): http, tivoli, small-flood, tivoli, incoming-flood
=================
raw sample 20:
Date first seen          Duration Proto       Dst IP Addr    Flows(%)     Packets(%)       Bytes(%)         pps      bps   bpp
2015-03-13 12:18:01.203   653.575 any       XXX.XXX.XXX.XXX    23762( 0.2)   91.3 M(29.0)  125.6 G(41.5)   139753    1.5 G  1374
label:  outgoing udp flood
5 nearest neighbors are (denormalized): small-flood, small-flood, small-flood, small-flood, small-flood
5 nearest neighbors are (normalized): small-flood, small-flood, tivoli, tivoli, tivoli
=================

Classification report & Confusion matrix

### Classifier parameters

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

### Classification Report

             precision    recall  f1-score   support

          1       0.73      0.55      0.63        85
          2       0.73      0.83      0.78       155
          3       0.79      0.94      0.86        68
          4       0.68      0.58      0.62        59
          5       1.00      0.69      0.82        13

avg / total       0.74      0.74      0.74       380

### Confusion Matrix

                smtp(1): [  47   37    0    1    0]
                http(2): [  17  129    1    8    0]
              tivoli(3): [   0    0   64    4    0]
         small-flood(4): [   0   10   15   34    0]
      incoming-flood(5): [   0    0    1    3    9]
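
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this by hand, reading rows as true labels and columns as predicted labels (the row sums match the support column of the report). The helper name `per_class_metrics` is illustrative and not taken from the notebook.

```python
# Confusion matrix from the report: rows are true labels, columns predictions.
names = ["smtp", "http", "tivoli", "small-flood", "incoming-flood"]
matrix = [
    [47, 37, 0, 1, 0],
    [17, 129, 1, 8, 0],
    [0, 0, 64, 4, 0],
    [0, 10, 15, 34, 0],
    [0, 0, 1, 3, 9],
]

def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1, support), one per class."""
    out = []
    for i, row in enumerate(matrix):
        tp = row[i]
        support = sum(row)                      # samples truly in class i
        predicted = sum(r[i] for r in matrix)   # samples predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        out.append((precision, recall, f1, support))
    return out

metrics = per_class_metrics(matrix)
for name, (p, r, f, s) in zip(names, metrics):
    print("%15s  %.2f  %.2f  %.2f  %4d" % (name, p, r, f, s))

total = sum(s for _, _, _, s in metrics)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
print("accuracy: %.2f over %d samples" % (accuracy, total))
```

Reading the matrix this way also shows where the errors concentrate: 37 of the 85 smtp samples are predicted as http, and 15 of the 59 small-flood samples are predicted as tivoli.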

Conclusions

Despite the limited training dataset, kNN proved effective at classifying aggregated netflow data, as the classification report and confusion matrix show: precision, recall and F1-score all average 0.74 over 380 samples.

Even unknown samples, whose type has no matching training label, are assigned to logically similar labels.

The IPython notebook with source code and updated graphs and data can be found on GitHub.