Friday, May 30, 2008

Network Flow: TopN

Questions about how to generate TopN output from argus data pop up on the argus mailing list from time to time, but they are often too vague to be given a complete answer.

I'm going to discuss TopN this time. TopN is a technique widely used in many industries. What is it for?

TopN retrieves the first N records from a data set, grouped by a certain object and ordered by one of its properties. Since I'm talking about Network Flow, I'll use it for the example.

Data: Network Flow Record
Object: Protocol, Network, IP(host), Port, etc
Object Property: Packet Count, Byte Count, etc

Bear in mind that I'm avoiding flow terminology in favor of layman's terms so that this example is easy to understand.

If you want to use the TopN technique to generate information from network flow data, first you need to know what you are looking for. Let's go with a simple one -

I want to find out the Top 5 IPs ordered by Total Packet Count

Total Packets Per IP(host) = (packets sent + packets received) Per IP(host)
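
To make the formula concrete, here is a rough Python sketch of the arithmetic. It is only an illustration, not how the argus clients work internally. The flow tuples, the addresses and the `top_n_by_packets` helper are all made up for this example.

```python
from collections import Counter

# Hypothetical flow records (not argus output):
# (source, destination, packets source->destination, packets destination->source)
flows = [
    ("10.0.0.1", "10.0.0.2", 10, 7),
    ("10.0.0.1", "10.0.0.3", 4, 1),
    ("10.0.0.2", "10.0.0.3", 2, 2),
]

def top_n_by_packets(flows, n):
    """Return the n hosts with the highest total (sent + received) packet count."""
    totals = Counter()
    for src, dst, pkts_out, pkts_in in flows:
        # Every packet in a flow is sent by one endpoint and received by the
        # other, so both hosts are credited with the flow's full packet count.
        totals[src] += pkts_out + pkts_in
        totals[dst] += pkts_out + pkts_in
    # most_common(n) sorts by count, highest first, and keeps the first n.
    return totals.most_common(n)

for host, packets in top_n_by_packets(flows, 2):
    print(host, packets)
```

The three steps are the same as in any TopN: group by the object (host), add up the property (packets), then sort and keep the first N.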

Now you run the argus client commands to parse the data and generate exactly that result, which looks like this -

shell>racluster -M rmon -m saddr -nr testing.arg3 -w - | \
rasort -m pkts -w - | \
ra -L0 -N 5 -s saddr pkts

SrcAddr TotPkts
172.16.1.108 993
193.231.236.41 824
211.185.125.124 178
172.16.1.103 56
211.180.229.190 36

The command above generates the Top 5 IPs ordered by Packet Count. Don't ask me about the command line; it looks complicated for now, but that's not my point here. Look at the output instead. Host 172.16.1.108 sent or received 993 packets, followed by 193.231.236.41 and so forth.

Now if you want to find the Top 5 IPs ordered by Byte Count, you can just run -

shell>racluster -M rmon -m saddr -nr testing.arg3 -w - | \
rasort -m bytes -w - | \
ra -L0 -N 5 -s saddr bytes

SrcAddr TotBytes
172.16.1.108 599949
193.231.236.41 579050
211.185.125.124 18901
172.16.1.103 4964
216.168.224.69 3458

If you want to use TopN, you should first draft the TopN output you are looking for. I have seen questions like this -

1. Which is the most active network?
2. Who is the most active sender?
3. Who is the most active receiver (got DDoS?)

Or worse,

How can I find out the top talkers?

These kinds of questions are too loose. You should at least specify the property, such as the most active sender ordered by packet count, or the most active network ordered by byte count, and so forth. Bear in mind that packet count and byte count do not go hand in hand: one host can send many small packets and never make the TopN by byte count. In the two outputs above, 211.180.229.190 is fifth by packets (36) but is missing from the Top 5 by bytes, where 216.168.224.69 takes fifth place instead.
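
To see how the two orderings can disagree, here is a small Python sketch. The hosts and numbers are invented for illustration, and `top_n` is a hypothetical helper, not an argus function.

```python
# Hypothetical per-host totals: host -> (packets, bytes).
totals = {
    "10.0.0.1": (900, 54000),    # many small packets
    "10.0.0.2": (300, 420000),   # fewer, larger packets
    "10.0.0.3": (50, 70000),
}

def top_n(totals, prop, n):
    """Rank hosts by the chosen property ('pkts' or 'bytes') and keep the first n."""
    index = {"pkts": 0, "bytes": 1}[prop]
    ranked = sorted(totals, key=lambda host: totals[host][index], reverse=True)
    return ranked[:n]

print(top_n(totals, "pkts", 2))
print(top_n(totals, "bytes", 2))
```

Here 10.0.0.1 leads the list by packets but does not appear in the Top 2 by bytes at all, which is why the property has to be part of the question.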

With this idea in mind, you can build a set of TopN lists that give you a good picture of network activity and help solve different issues.

Next time I will introduce the Traffic Matrix, stay tuned!

Enjoy (;])
