Sunday, December 30, 2007

Packets -> Flows -> Session

This is my last post before reaching the 500th-post milestone, so I will try my best to write a great one. I want to keep this post simple and clear while still explaining things in detail. If you are a network flow analysis guru, you can skip this post because I consider it introductory, but it may help others understand more about network flows; it took me a fair amount of time to learn how to make use of network flow data myself. My approach will be similar to my previous post here, but the topic is totally different. The "Not So Upcoming" Argus 3 will be the main weapon discussed here. Let's walk through it now.

Network Packets

For this, I need some network packets. I logged the network traffic with tcpdump while I was downloading wireshark. Here's how I did it -

shell>sudo tcpdump -s 0 -nni lnc0 -w http-download.pcap


After I finished downloading wireshark, I terminated tcpdump and got an initial view of the pcap file with capinfos.

shell>capinfos http-download.pcap
File name: http-download.pcap
File type: Wireshark/tcpdump/... - libpcap
Number of packets: 19782
File size: 1981512 bytes
Data size: 18047455 bytes
Capture duration: 405.100833 seconds
Start time: Thu Dec 20 23:43:22 2007
End time: Thu Dec 20 23:50:07 2007
Data rate: 44550.53 bytes/s
Data rate: 356404.20 bits/s
Average packet size: 912.32 bytes

For a single file download of approximately 20MB, the capture contains 19782 packets. It is painful to look at every single packet if the packet contents are not what matters. What if I don't care about the payload but want a connection summary, such as how many packets one host sent to another, how many bytes were transferred in the connection, and how long the connection lasted? Packet-centric analysis doesn't fit well here, so let me introduce you to network flow analysis. But before that, let's have fun with packets -

------------------------------------------------------------
Scenario:

Host A(Client) - 192.168.0.102
Host B(Server) - 128.121.50.122

Host A downloads the wireshark source from Host B
-------------------------------------------------------------

To get the count of how many packets have been sent by Host A to Host B -

shell>tcpdump -ttttnnr http-download.pcap \
ip src 192.168.0.102 | wc -l

reading from file http-download.pcap, link-type EN10MB (Ethernet)
7801

To get the count of how many packets have been sent by Host B to Host A -

shell>tcpdump -ttttnnr http-download.pcap \
ip src 128.121.50.122 | wc -l

reading from file http-download.pcap, link-type EN10MB (Ethernet)
11981

What if you want to know how many bytes have been sent by Host A to Host B, and vice versa? It would be exhausting to look into those packets and count by hand. This is where network flows kick in.
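For comparison, per-sender packet and byte counts can also be scripted straight from the pcap file. The following is a minimal Python sketch, not a replacement for a real flow tool: it assumes a classic libpcap file (microsecond timestamps), Ethernet link type and untagged IPv4 frames, and the function name bytes_by_source is made up for this illustration.

```python
import socket
import struct
from collections import Counter


def bytes_by_source(stream):
    """Return {src_ip: (packets, bytes)} for IPv4-over-Ethernet frames.

    `stream` is a binary file object positioned at the start of a
    classic libpcap file. Bytes are counted as on-the-wire frame
    lengths (orig_len), so Ethernet and IP headers are included.
    """
    header = stream.read(24)              # pcap global header
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":      # written on a little-endian host
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":    # written on a big-endian host
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    packets, octets = Counter(), Counter()
    while True:
        record = stream.read(16)          # per-packet record header
        if len(record) < 16:
            break
        _sec, _usec, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        frame = stream.read(incl_len)
        # 14-byte Ethernet header; EtherType 0x0800 means IPv4.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        # The IPv4 source address sits at offset 12 of the IP header.
        src = socket.inet_ntoa(frame[26:30])
        packets[src] += 1
        octets[src] += orig_len
    return {ip: (packets[ip], octets[ip]) for ip in packets}
```

With the capture from this post, one would call it as bytes_by_source(open("http-download.pcap", "rb")) and read off the entries for the two hosts; the packet counts should line up with the tcpdump counts above. It works, but it is a lot of code for a question a flow tool answers with one field.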

Network Flows

A network flow is a really different beast. To give you an idea of what a flow is, I define it as -

A flow is a packet or a sequence of packets belonging to a certain network session (conversation) between two hosts, delimited by the settings of the flow generation tool. In short, it summarizes network traffic by metering or accounting certain attributes of the network session.
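One way to picture this definition is as a table keyed by the addresses, ports and protocol of a conversation, with counters updated for every packet. The Python sketch below is a toy model only, not how Argus is implemented; summarize_flows, the tuple layout and the packet lengths are assumptions made for illustration.

```python
def summarize_flows(packets):
    """Group (src, sport, dst, dport, proto, length) packets into
    bidirectional flow records with per-direction counters."""
    flows = {}
    for src, sport, dst, dport, proto, length in packets:
        forward = (src, sport, dst, dport, proto)
        reverse = (dst, dport, src, sport, proto)
        if reverse in flows:
            # Reply direction of a flow that has already been seen.
            record = flows[reverse]
            record["dpkts"] += 1
            record["dbytes"] += length
        else:
            # The first packet seen decides which side is the "source".
            record = flows.setdefault(
                forward, {"spkts": 0, "sbytes": 0, "dpkts": 0, "dbytes": 0}
            )
            record["spkts"] += 1
            record["sbytes"] += length
    return flows


# Three made-up packets between the two hosts of this post.
packets = [
    ("192.168.0.102", 51371, "128.121.50.122", 80, "tcp", 74),
    ("128.121.50.122", 80, "192.168.0.102", 51371, "tcp", 74),
    ("192.168.0.102", 51371, "128.121.50.122", 80, "tcp", 66),
]
for key, record in summarize_flows(packets).items():
    print(key, record)
```

All three packets collapse into a single record: two packets and 140 bytes from the source side, one packet and 74 bytes from the destination side.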

To understand flows better, let's convert the packet data (pcap) to argus format flow data -

shell>argus -mAJZRU 512 -r http-download.pcap \
-w http-download.arg3


I run argus with the options -mAJZRU 512 so that it generates as much data as possible for each flow record. I won't explain each option here since you can find them in the man page or with argus -h.

Now I can examine/parse http-download.arg3 with the argus client tools for further flow processing. To keep it easy to read, I use ra here as it is the most basic argus flow data processing tool. I choose which fields to print with the -s option (start time|src address|src port|direction|dst address|dst port|src packets|dst packets) -

shell>ra -L0 -nnr http-download.arg3 \
-s stime saddr sport dir daddr dport spkts dpkts - ip
StartTime SrcAddr Sport Dir DstAddr Dport SrcPkts DstPkts
23:43:22.024899 192.168.0.102.51371 -> 128.121.50.122.80 1165 1800
23:44:22.068631 192.168.0.102.51371 -> 128.121.50.122.80 1186 1807
23:45:22.101391 192.168.0.102.51371 -> 128.121.50.122.80 1246 1919
23:46:22.117747 192.168.0.102.51371 -> 128.121.50.122.80 1125 1751
23:47:22.171437 192.168.0.102.51371 -> 128.121.50.122.80 1160 1759
23:48:22.209375 192.168.0.102.51371 -> 128.121.50.122.80 1080 1664
23:49:22.186030 192.168.0.102.51371 -> 128.121.50.122.80 839 1281


There are 7 flow records in total here for just a single network session. Why?

If you read the argus configuration manual page, it mentions -

ARGUS_FLOW_STATUS_INTERVAL
Argus will periodically report on a flow’s activity every ARGUS_FLOW_STATUS_INTERVAL seconds, as long as there is new activity on the flow. This is so that you can get a view into the activity of very long lived flows. The default is 60 seconds, but this number may be too low or too high depending on your uses.

The default value is 60 seconds, but argus does support a minimum value of 1. This is very useful for doing measurements in a controlled experimental environment where the number of flows is small.
Command line equivalent -S

ARGUS_FLOW_STATUS_INTERVAL=60

To make this easier to see, I print only the start time field -

shell>ra -nr http-download.arg3 -s stime - ip
23:43:22.024899
23:44:22.068631
23:45:22.101391
23:46:22.117747
23:47:22.171437
23:48:22.209375
23:49:22.186030

With the default setting, you may notice that each flow record is bounded at 1 minute. That's actually what I tried to explain above -

A flow is a packet or a sequence of packets belonging to a certain network session (conversation) between two hosts, delimited by the settings of the flow generation tool.
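The effect of the status interval can be imitated in a few lines of Python. This is a simplified model rather than Argus's actual logic: it assumes a single flow whose packets arrive once per second for 405 seconds (synthetic timestamps, not taken from the capture), and split_into_status_records is a hypothetical helper.

```python
def split_into_status_records(timestamps, interval):
    """Bucket the packet timestamps of one flow into status records.

    Returns a list of (record_start, packet_count) tuples, one per
    interval in which the flow saw at least one packet.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = {}
    for ts in timestamps:
        slot = int((ts - start) // interval)
        counts[slot] = counts.get(slot, 0) + 1
    return [(start + slot * interval, n) for slot, n in sorted(counts.items())]


# Synthetic flow: one packet per second from t=0 to t=405.
timestamps = list(range(0, 406))

print(len(split_into_status_records(timestamps, 60)))   # 7 records
print(len(split_into_status_records(timestamps, 480)))  # 1 record
```

A session of roughly 405 seconds crosses six 60-second boundaries, which is why seven records show up with the default interval.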

If the network session lasts longer than 1 minute (a long-lived flow), argus generates another flow record (with the same attributes/label) that actually belongs to the same network session. Of course you can tune this with the -S option in argus. Let's try -

shell>argus -S 480 -mAJZRU 512 -r http-download.pcap \
-w http-download-480.arg3


I set 480 seconds here, which is 8 minutes, because the network session duration falls within that range. Now we read it again with ra -

shell>ra -L0 -nr http-download-480.arg3 \
-s stime saddr sport dir daddr dport spkts dpkts - ip
StartTime SrcAddr Sport Dir DstAddr Dport SrcPkts DstPkts
23:43:22.024899 192.168.0.102.51371 -> 128.121.50.122.80 7801 11981

However, in a real-world deployment this is not the right way to construct a network session from multiple flows, especially if your network is complex (provides various network services) and busy (heavy network traffic), because the chosen value is really arbitrary. You can't easily tell that multiple flows belong to the same network session, as there will be many other flow records in between. Another issue: what if the network session lasts longer than 480 seconds (8 minutes)? That's where racluster (another argus client tool) comes to the rescue.

Network Session

From part of the racluster man page -

Racluster reads argus data from an argus-data source, and clusters/merges the records based on the flow key criteria specified either on the command line, or in a racluster configuration file, and outputs a valid argus-stream. This tool is primarily used for data mining, data management and report generation.
The default action is to merge status records from the same flow and argus probe, providing in some cases huge data reduction with limited loss of flow information.
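The default merge described in the excerpt can be pictured as grouping status records by flow key and adding up their counters. The Python sketch below is a rough model, not racluster's implementation: it parses the seven ra records exactly as they are printed earlier in this post (address and port joined by a dot), and parse_record and merge_records are hypothetical helper names.

```python
RA_OUTPUT = """\
23:43:22.024899 192.168.0.102.51371 -> 128.121.50.122.80 1165 1800
23:44:22.068631 192.168.0.102.51371 -> 128.121.50.122.80 1186 1807
23:45:22.101391 192.168.0.102.51371 -> 128.121.50.122.80 1246 1919
23:46:22.117747 192.168.0.102.51371 -> 128.121.50.122.80 1125 1751
23:47:22.171437 192.168.0.102.51371 -> 128.121.50.122.80 1160 1759
23:48:22.209375 192.168.0.102.51371 -> 128.121.50.122.80 1080 1664
23:49:22.186030 192.168.0.102.51371 -> 128.121.50.122.80 839 1281
"""


def parse_record(line):
    """Split one printed record into a flow key and its counters."""
    stime, src, _direction, dst, spkts, dpkts = line.split()
    saddr, _, sport = src.rpartition(".")   # the last dot separates the port
    daddr, _, dport = dst.rpartition(".")
    return {
        "stime": stime,
        "key": (saddr, sport, daddr, dport),
        "spkts": int(spkts),
        "dpkts": int(dpkts),
    }


def merge_records(lines):
    """Merge status records that share a flow key into one session each."""
    sessions = {}
    for line in lines:
        if not line.strip():
            continue
        record = parse_record(line)
        merged = sessions.get(record["key"])
        if merged is None:
            sessions[record["key"]] = record
        else:
            # Same-day HH:MM:SS.ffffff strings sort chronologically.
            merged["stime"] = min(merged["stime"], record["stime"])
            merged["spkts"] += record["spkts"]
            merged["dpkts"] += record["dpkts"]
    return list(sessions.values())


sessions = merge_records(RA_OUTPUT.splitlines())
for s in sessions:
    print(s["stime"], s["key"], s["spkts"], s["dpkts"])
```

The seven records merge into one session starting at 23:43:22.024899 with 7801 source packets and 11981 destination packets. Because records are grouped by key, other flows interleaved between them would not get in the way.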

Racluster is easy to use but hard to master. Here's the simple usage to construct a network session from multiple network flow records.

shell>racluster -L0 -nr http-download.arg3 \
-s stime saddr sport dir daddr dport spkts dpkts
StartTime SrcAddr Sport Dir DstAddr Dport SrcPkts DstPkts
23:43:22.024899 192.168.0.102.51371 -> 128.121.50.122.80 7801 11981

It really is that simple. To explain this network session -

Start Time - 23:43:22.024899
Source Address - 192.168.0.102
Source Port - 51371
Destination Address - 128.121.50.122
Destination Port - 80
Source Packets - 7801
Destination Packets - 11981

Start Time is the time when the network session started; the others are pretty self-explanatory except Source Packets and Destination Packets. Source Packets counts how many packets were sent by the Source Address, and Destination Packets counts how many packets were sent by the Destination Address. To generate a summary of this network session, you can run -

shell>racluster -L0 -nr http-download.arg3 \
-s dur pkts bytes

Dur TotPkts TotBytes
405.100830 19782 18047455

The duration of this network session is approximately 405 seconds, the total number of packets is 19782, and the total number of bytes is 18047455. Yes, this is where network flow analysis can be useful - traffic accounting - but I won't explain it further here since that is another topic.
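The same summary supports simple traffic-accounting arithmetic. As a small Python sketch that uses only the three values printed by racluster above (the variable names are chosen for this illustration):

```python
# Values taken from the racluster summary (Dur, TotPkts, TotBytes).
duration = 405.100830     # seconds
total_pkts = 19782
total_bytes = 18047455

bytes_per_sec = total_bytes / duration
bits_per_sec = bytes_per_sec * 8
avg_pkt_size = total_bytes / total_pkts

print(f"{bytes_per_sec:.2f} bytes/s")
print(f"{bits_per_sec:.2f} bits/s")
print(f"{avg_pkt_size:.2f} bytes/packet")
```

The results should come out close to the data rate and average packet size that capinfos reported at the start of this post, which is a quick sanity check that the flow data and the packet data agree.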

Maybe I should make this post's title sound more interesting with "Network Flow Demystified". There are other network flow topics that I haven't mentioned here, such as Cisco NetFlow, the unidirectional vs bidirectional model, other interesting flow metrics provided by argus and so forth. I hope to close the gap in coming posts.

Enjoy (;])

5 comments:

e0n said...

Great writeup!! It's nice to see some highlights of argus client tools other than just "ra". I think the client suite is heavily under-utilized.

Pablo said...

Your post is great! I am studying traffic modeling and I found your information very useful. I have a question: the flow definition in your example is "bi-directional", isn't it? Is there a way to make argus generate only uni-directional flows?
The definition of flows sometimes confuses me; can you help with that?
Greetings!

C.S.Lee said...

hi pablo,

The definition applies to both unidirectional and bidirectional flows. I am going to write a post about the difference between them, with diagrams and a clean explanation, so people can understand it better.

You can actually ask argus to generate unidirectional flows, and I will show you how in the new post.

Stay tuned and thanks!

Pablo said...

Thx Lee. I'll be waiting for that post.
We are developing a perl script that generates flows by reading directly from a pcap file and writes txt files with the flows generated in a time slot of duration T. The input parameters are: pcap file name, slot time (secs, duration T), outname (txt), mode (generate uni/biflows), flow duration (secs).
Every single txt file has a list of flow information, ready for Matlab processing:
Start/end time of the flow, IPsrc, srcPrt, dstIP, dstPrt, proto, #pkts, volume(bytes)

C.S.Lee said...

hi pablo,

Here's the tip -

If you are using argus, you can use its client - rastream - to achieve half of the things you have mentioned, except the matlab part.

Cheers ;]