New Directions in Traffic Measurement and Accounting
Computer Science and Engineering Department
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0114
ABSTRACT

Accurate network traffic measurement is required for accounting, bandwidth provisioning and detecting DoS attacks. These applications see the traffic as a collection of flows they need to measure. As link speeds and the number of flows increase, keeping a counter for each flow is too expensive (using SRAM) or slow (using DRAM). The current state-of-the-art methods (Cisco's sampled NetFlow) which log periodically sampled packets are slow, inaccurate and resource-intensive. Previous work showed that at different granularities a small number of "heavy hitters" accounts for a large share of traffic. Our paper introduces a paradigm shift for measurement by concentrating only on large flows — those above some threshold such as 0.1% of the link capacity.

We propose two novel and scalable algorithms for identifying the large flows: sample and hold and multistage filters, which take a constant number of memory references per packet and use a small amount of memory. If M is the available memory, we show analytically that the errors of our new algorithms are proportional to 1/M; by contrast, the error of an algorithm based on classical sampling is proportional to 1/sqrt(M), thus providing much less accuracy for the same amount of memory. We also describe further optimizations such as early removal and conservative update that further improve the accuracy of our algorithms, as measured on real traffic traces, by an order of magnitude. Our schemes allow a new form of accounting called threshold accounting in which only flows above a threshold are charged by usage while the rest are charged a fixed fee. Threshold accounting generalizes usage-based and duration based pricing.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Networks]: Network Operations—traffic measurement, identifying large flows

Keywords

Network traffic measurement, usage based accounting, scalability, on-line algorithms, identifying large flows

INTRODUCTION

    If we're keeping per-flow state, we have a scaling problem, and we'll be tracking millions of ants to track a few elephants. — Van Jacobson, End-to-end Research meeting, June 2000.

Measuring and monitoring network traffic is required to manage today's complex Internet backbones [9, 4]. Such measurement information is essential for short-term monitoring (e.g., detecting hot spots and denial-of-service attacks), longer term traffic engineering (e.g., rerouting traffic and upgrading selected links), and accounting (e.g., to support usage based pricing).

The standard approach advocated by the Real-Time Flow Measurement (RTFM) Working Group of the IETF is to instrument routers to add flow meters at either all or selected input links. Today's routers offer tools such as NetFlow that give flow level information about traffic.

The main problem with the flow measurement approach is its lack of scalability. Measurements on MCI traces as early as 1997 showed over 250,000 concurrent flows. More recent measurements using a variety of traces show the number of flows between end host pairs in a one hour period to be as high as 1.7 million (Fix-West) and 0.8 million (MCI). Even with aggregation, the number of flows in 1 hour in the Fix-West trace used by [?] was as large as 0.5 million.

It can be feasible for flow measurement devices to keep up with the increases in the number of flows (with or without aggregation) only if they use the cheapest memories: DRAMs. Updating per-packet counters in DRAM is already impossible with today's line speeds; further, the gap between DRAM speeds (improving 7-9% per year) and link speeds (improving 100% per year) is only increasing. Cisco NetFlow, which keeps its flow counters in DRAM, solves this problem by sampling: only sampled packets result in updates. But NetFlow sampling has problems of its own (as we show later) since it affects measurement accuracy.

Despite the large number of flows, a common observation found in many measurement studies (e.g., [9, 8]) is that a small percentage of flows accounts for a large percentage of the traffic.
[?] shows that 9% of the flows between AS pairs account for 90% of the byte traffic between all AS pairs.

For many applications, knowledge of these large flows is probably sufficient. [8, 17] suggest achieving scalable differentiated services by providing selective treatment only to a small number of large flows. [?] underlines the importance of knowledge of "heavy hitters" for decisions about network upgrades and peering. [?] proposes a usage sensitive billing scheme that relies on exact knowledge of the traffic of large flows but only samples of the traffic of small flows.

We conclude that it is infeasible to accurately measure all flows on high speed links, but many applications can benefit from accurately measuring only the few large flows. One can easily keep counters for a few large flows using a small amount of fast memory (SRAM). However, how does the device know which flows to track? If one keeps state for all flows to identify the few large flows, our purpose is defeated. Thus a reasonable goal is to devise an algorithm that identifies large flows using memory that is only a small constant larger than is needed to describe the large flows in the first place. This is the central question addressed by this paper. We present two algorithms that provably identify large flows using such a small amount of state. Further, our algorithms use only a few memory references, making them suitable for use in high speed routers.

Problem definition

A flow is generically defined by an optional pattern (which defines which packets we will focus on) and an identifier (values for a set of specified header fields). We can also generalize by allowing the identifier to be a function of the header field values (e.g., using prefixes instead of addresses based on a mapping using route tables). Flow definitions vary with applications: for example for a traffic matrix one could use a wildcard pattern and identifiers defined by distinct source and destination network numbers. On the other hand, for identifying TCP denial of service attacks one could use a pattern that focuses on TCP packets and use the destination IP address as a flow identifier.

Large flows are defined as those that send more than a given threshold (say 0.1% of the link capacity) during a given measurement interval (1 second, 1 minute or even 1 hour). The technical report [?] gives alternative definitions and algorithms based on defining large flows via leaky bucket descriptors.

An ideal algorithm reports, at the end of the measurement interval, the flow IDs and sizes of all flows that exceeded the threshold. A less ideal algorithm can fail in three ways: it can omit some large flows, it can wrongly add some small flows to the report, and it can give an inaccurate estimate of the traffic of some large flows. We call the large flows that evade detection false negatives, and the small flows that are wrongly included false positives.

The minimum amount of memory required by an ideal algorithm is the inverse of the threshold; for example, there can be at most 1000 flows that use more than 0.1% of the link. We will measure the performance of an algorithm by four metrics: first, its memory compared to that of an ideal algorithm; second, the algorithm's probability of false negatives; third, the algorithm's probability of false positives; and fourth, the expected error in traffic estimates.

Motivation

Our algorithms for identifying large flows can potentially be used to solve many problems. Since different applications define flows by different header fields, we need a separate instance of our algorithms for each of them. Applications we envisage include:

• Scalable Threshold Accounting: The two poles of pricing for network traffic are usage based (e.g., a price per byte for each flow) or duration based (e.g., a fixed price based on duration). While usage-based pricing [13, 20] has been shown to improve overall utility, usage based pricing in its most complete form is not scalable because we cannot track all flows at high speeds. We suggest, instead, a scheme where we measure all aggregates that are above z% of the link; such traffic is subject to usage based pricing, while the remaining traffic is subject to duration based pricing. By varying z from 0 to 100, we can move from usage based pricing to duration based pricing. More importantly, for reasonably small values of z (say 1%) threshold accounting may offer a compromise between the two that is scalable and yet offers almost the same utility as usage based pricing. [?] offers experimental evidence based on the INDEX experiment that such threshold pricing could be attractive to both users and ISPs.1

• Real-time Traffic Monitoring: Many ISPs monitor backbones for hot-spots in order to identify large traffic aggregates that can be rerouted (using MPLS tunnels or routes through optical switches) to reduce congestion. Also, ISPs may consider sudden increases in the traffic sent to certain destinations (the victims) to indicate an ongoing attack. [?] proposes a mechanism that reacts as soon as attacks are detected, but does not give a mechanism to detect ongoing attacks. For both traffic monitoring and attack detection, it may suffice to focus on large flows.

• Scalable Queue Management: At a smaller time scale, scheduling mechanisms seeking to approximate max-min fairness need to detect and penalize flows sending above their fair rate. Keeping per flow state only for these flows [10, 17] can improve fairness with limited state. We do not address this application further, except to note that our techniques may be useful for such problems. For example, [?] uses classical sampling techniques to estimate the sending rates of large flows. Given that our algorithms have better accuracy than classical sampling, it may be possible to provide increased fairness for the same amount of memory by applying our algorithms.

The rest of the paper is organized as follows. We describe related work in Section 2, describe our main ideas in Section 3, and provide a theoretical analysis in Section 4. We theoretically compare our algorithms with NetFlow in Section 5. After showing how to dimension our algorithms in Section 6, we describe experimental evaluation on traces in Section 7. We end with implementation issues in Section 8 and conclusions in Section 9.

1 Besides [?], a brief reference to a similar idea can be found in [?]. However, neither paper proposes a fast mechanism to implement the idea.
RELATED WORK

The primary tool used for flow level measurement by IP backbone operators is Cisco NetFlow. NetFlow keeps per flow state in a large, slow DRAM. Basic NetFlow has two problems: i) Processing Overhead: updating the DRAM slows down the forwarding rate; ii) Collection Overhead: the amount of data generated by NetFlow can overwhelm the collection server or its network connection. For example [?] reports loss rates of up to 90% using basic NetFlow.

The processing overhead can be alleviated using sampling: per-flow counters are incremented only for sampled packets. We show later that sampling introduces considerable inaccuracy in the estimate; this is not a problem for measurements over long periods (errors average out) and if applications do not need exact data. However, we will show that sampling does not work well for applications that require true lower bounds on customer traffic (e.g., it may be infeasible to charge customers based on estimates that are larger than actual usage) and for applications that require accurate data at small time scales (e.g., billing systems that charge higher during congested periods).

The data collection overhead can be alleviated by having the router aggregate flows (e.g., by source and destination AS numbers) as directed by a manager. However, [?] shows that even the number of aggregated flows is very large. For example, collecting packet headers for Code Red traffic on a class A network produced 0.5 Gbytes per hour of compressed NetFlow data and aggregation reduced this data only by a factor of 4. Techniques described in [?] can be used to reduce the collection overhead at the cost of further errors. However, it can considerably simplify router processing to only keep track of heavy-hitters (as in our paper) if that is what the application needs.

Many papers address the problem of mapping the traffic of large IP networks. [?] deals with correlating measurements taken at various points to find spatial traffic distributions; the techniques in our paper can be used to complement their methods. [?] describes a mechanism for identifying packet trajectories in the backbone, that is not focused towards estimating the traffic between various networks.

Bloom filters and stochastic fair blue use similar but different techniques to our parallel multistage filters to compute very different metrics (set membership and drop probability). Gibbons and Matias consider synopsis data structures that use small amounts of memory to approximately summarize large databases. They define counting samples that are similar to our sample and hold algorithm. However, we compute a different metric, need to take into account packet lengths and have to size memory in a different way. Fang et al. look at efficient ways of answering iceberg queries, or counting the number of appearances of popular items in a database. Their multi-stage algorithm is similar to multistage filters that we propose. However, they use sampling as a front end before the filter and use multiple passes. Thus their final algorithms and analyses are very different from ours. For instance, their analysis is limited to Zipf distributions while our analysis holds for all flow distributions.

OUR SOLUTION

Because our algorithms use an amount of memory that is a constant factor larger than the (relatively small) number of large flows, our algorithms can be implemented using on-chip or off-chip SRAM to store flow state. We assume that at each packet arrival we can afford to look up a flow ID in the SRAM, update the counter(s) in the entry or allocate a new entry if there is no entry associated with the current flow.

The biggest problem is to identify the large flows. Two approaches suggest themselves. First, when a packet arrives with a flow ID not in the flow memory, we could make place for the new flow by evicting the flow with the smallest measured traffic (i.e., smallest counter). While this works well on traces, it is possible to provide counter examples where a large flow is not measured because it keeps being expelled from the flow memory before its counter becomes large enough, even using an LRU replacement policy as in [?].

A second approach is to use classical random sampling. Random sampling (similar to sampled NetFlow except using a smaller amount of SRAM) provably identifies large flows. We show, however, in Table 1 that random sampling introduces a very high relative error in the measurement estimate that is proportional to 1/sqrt(M), where M is the amount of SRAM used by the device. Thus one needs very high amounts of memory to reduce the inaccuracy to acceptable levels.

The two most important contributions of this paper are two new algorithms for identifying large flows: Sample and Hold (Section 3.1) and Multistage Filters (Section 3.2). Their performance is very similar, the main advantage of sample and hold being implementation simplicity, and the main advantage of multistage filters being higher accuracy. In contrast to random sampling, the relative errors of our two new algorithms scale with 1/M, where M is the amount of SRAM. This allows our algorithms to provide much more accurate estimates than random sampling using the same amount of memory. In Section 3.3 we present improvements that further increase the accuracy of these algorithms on traces (Section 7). We start by describing the main ideas behind these schemes.

Sample and hold

Base Idea: The simplest way to identify large flows is through sampling but with the following twist. As with ordinary sampling, we sample each packet with a probability. If a packet is sampled and the flow it belongs to has no entry in the flow memory, a new entry is created. However, after an entry is created for a flow, unlike in sampled NetFlow, we update the entry for every subsequent packet belonging to the flow as shown in Figure 1.

[Figure 1: The leftmost packet with flow label F1 arrives first at the router. After an entry is created for a flow (solid line) the counter is updated for all its packets (dotted lines).]

[Figure 2: Sampled NetFlow counts only sampled packets; sample and hold counts all packets after entry creation.]

Thus once a flow is sampled a corresponding counter is held in a hash table in flow memory till the end of the measurement interval. While this clearly requires processing (looking up the flow entry and updating a counter) for every packet (unlike Sampled NetFlow), we will show that the reduced memory requirements allow the flow memory to be in SRAM instead of DRAM. This in turn allows the per-packet processing to scale with line speeds.

Let p be the probability with which we sample a byte. Thus the sampling probability for a packet of size s is ps = 1 - (1-p)^s. This can be looked up in a precomputed table or approximated by ps = p * s. Choosing a high enough value for p guarantees that flows above the threshold are very likely to be detected. Increasing p unduly can cause too many false positives (small flows filling up the flow memory). The advantage of this scheme is that it is easy to implement and yet gives accurate measurements with very high probability.
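To make the per-packet logic concrete, here is a minimal Python sketch of sample and hold as described above; the class name, the dictionary standing in for the SRAM flow memory, and the report step are illustrative assumptions rather than the paper's implementation.

    import random

    class SampleAndHold:
        """Sketch of sample and hold: sample bytes with probability p; once a
        flow has an entry, count every subsequent byte of that flow."""

        def __init__(self, byte_sampling_prob):
            self.p = byte_sampling_prob
            self.flow_memory = {}  # flow ID -> bytes counted (would live in SRAM)

        def process_packet(self, flow_id, size):
            if flow_id in self.flow_memory:
                self.flow_memory[flow_id] += size   # hold: count every packet
                return
            # sample: create an entry with probability ps = 1 - (1 - p)^size
            ps = 1.0 - (1.0 - self.p) ** size
            if random.random() < ps:
                self.flow_memory[flow_id] = size

        def report(self, threshold):
            """Flows whose counted traffic reached the threshold this interval."""
            return {f: c for f, c in self.flow_memory.items() if c >= threshold}

The only per-packet work is one flow memory lookup and, for sampled or already-tracked flows, one counter update, which is what makes an SRAM implementation plausible.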
Preliminary Analysis: The following example illustrates the method and analysis. Suppose we wish to measure the traffic sent by flows that take over 1% of the link capacity in a measurement interval. There are at most 100 such flows. Instead of making our flow memory have just 100 locations, we will allow oversampling by a factor of 100 and keep 10,000 locations. We wish to sample each byte with probability p such that the average number of samples is 10,000. Thus if C bytes can be transmitted in the measurement interval, p = 10,000/C.

For the error analysis, consider a flow F that takes 1% of the traffic. Thus F sends more than C/100 bytes. Since we are randomly sampling each byte with probability 10,000/C, the probability that F will not be in the flow memory at the end of the measurement interval (false negative) is (1 - 10,000/C)^(C/100), which is very close to e^-100. Note that the factor of 100 in the exponent is the oversampling factor. Better still, the probability that flow F is in the flow memory after sending 5% of its traffic is, similarly, 1 - e^-5, which is greater than 99%. Thus with 99% probability the reported traffic for flow F will be at most 5% below the actual amount sent by F.

The analysis can be generalized to arbitrary threshold values; the memory needs scale inversely with the threshold percentage and directly with the oversampling factor. Notice also that the analysis assumes that there is always space to place a sample flow not already in the memory. Setting p = 10,000/C ensures only that the average number of flows sampled is no more than 10,000. However, the distribution of the number of samples is binomial with a small standard deviation (square root of the mean). Thus, adding a few standard deviations to the memory estimate (e.g., a total memory size of 10,300) makes it extremely unlikely that the flow memory will ever overflow.
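A quick numeric check of the probabilities used in this example (only the oversampling factor of 100 and the 5% point come from the text; the script is purely illustrative):

    import math

    C = 100_000_000          # assume 100 Mbytes can be sent in the interval
    p = 10_000 / C           # byte sampling probability from the example

    # Chance that a flow sending C/100 bytes never gets an entry (false negative):
    print((1 - p) ** (C / 100), math.exp(-100))     # both roughly 4e-44

    # Chance that the flow already has an entry after sending 5% of its bytes:
    print(1 - (1 - p) ** (0.05 * C / 100))          # ~ 0.993, i.e. above 99%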
Compared to Sampled NetFlow our idea has three significant differences shown in Figure 2. Most importantly, we sample only to decide whether to add a flow to the memory; from that point on, we update the flow memory with every byte the flow sends. As shown in Section 5 this will make our results much more accurate. Second, our sampling technique avoids packet size biases unlike NetFlow which samples every x packets. Third, our technique reduces the extra resource overhead (router processing, router memory, network bandwidth) for sending large reports with many records to a management station.

Multistage filters

Base Idea: The basic multistage filter is shown in Figure 3. The building blocks are hash stages that operate in parallel. First, consider how the filter operates with only one stage. A stage is a table of counters which is indexed by a hash function computed on a packet flow ID; all counters in the table are initialized to 0 at the start of a measurement interval. When a packet comes in, a hash on its flow ID is computed and the size of the packet is added to the corresponding counter. Since all packets belonging to the same flow hash to the same counter, if a flow F sends more than threshold T, F's counter will exceed the threshold. If we add to the flow memory all packets that hash to counters of T or more, we are guaranteed to identify all the large flows (no false negatives).

[Figure 3: In a parallel multistage filter, a packet with a flow ID F is hashed using hash function h1 into a Stage 1 table, h2 into a Stage 2 table, etc. Each table entry contains a counter that is incremented by the packet size. If all the hashed counters are above the threshold (shown bolded), F is passed to the flow memory for individual observation.]

Unfortunately, since the number of counters we can afford is significantly smaller than the number of flows, many flows will map to the same counter. This can cause false positives in two ways: first, small flows can map to counters that hold large flows and get added to flow memory; second, several small flows can hash to the same counter and add up to a number larger than the threshold.

To reduce this large number of false positives, we use multiple stages. Each stage (Figure 3) uses an independent hash function. Only the packets that map to counters of T or more at all stages get added to the flow memory. For example, in Figure 3, if a packet with a flow ID F arrives that hashes to counters 3, 1, and 7 respectively at the three stages, F will pass the filter (counters that are over the threshold are shown darkened). On the other hand, a flow G that hashes to counters 7, 5, and 4 will not pass the filter because the second stage counter is not over the threshold. Effectively, the multiple stages attenuate the probability of false positives exponentially in the number of stages. This is shown by the following simple analysis.

Preliminary Analysis: Assume a 100 Mbytes/s link2, with 100,000 flows and we want to identify the flows above 1% of the link during a one second measurement interval. Assume each stage has 1,000 buckets and a threshold of 1 Mbyte. Let's see what the probability is for a flow sending 100 Kbytes to pass the filter. For this flow to pass one stage, the other flows need to add up to 1 Mbyte - 100 Kbytes = 900 Kbytes. There are at most 99,900/900 = 111 such buckets out of the 1,000 at each stage. Therefore, the probability of passing one stage is at most 11.1%. With 4 independent stages, the probability that a certain flow no larger than 100 Kbytes passes all 4 stages is the product of the individual stage probabilities, which is at most 1.52 * 10^-4.

2 To simplify computation, in our examples we assume that 1 Mbyte = 1,000,000 bytes and 1 Kbyte = 1,000 bytes.

Based on this analysis, we can dimension the flow memory so that it is large enough to accommodate all flows that pass the filter. The expected number of flows below 100 Kbytes passing the filter is at most 100,000 * 1.52 * 10^-4 < 16. There can be at most 999 flows above 100 Kbytes, so the number of entries we expect to accommodate all flows is at most 1,015. Section 4 has a rigorous theorem that proves a stronger bound (for this example 122 entries) that holds for any distribution of flow sizes. Note the potential scalability of the scheme. If the number of flows increases to 1 million, we simply add a fifth hash stage to get the same effect. Thus to handle 100,000 flows requires roughly 4,000 counters and a flow memory of approximately 100 memory locations, while to handle 1 million flows requires roughly 5,000 counters and the same size of flow memory.

The number of memory accesses per packet for a multistage filter is one read and one write per stage. If the number of stages is small, this is feasible even at high speeds by doing parallel memory accesses to each stage in a chip implementation.3 While multistage filters are more complex than sample and hold, they have two important advantages. They reduce the probability of false negatives to 0 and decrease the probability of false positives, thereby reducing the size of the required flow memory.

3 We describe details of a preliminary OC-192 chip implementation of multistage filters in Section 8.
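The following Python sketch mirrors the per-packet operation of the basic parallel multistage filter described above; the hash construction, class layout and the initial value given to a newly admitted flow are our own illustrative choices, not the paper's code.

    import random

    class MultistageFilter:
        """Sketch of a parallel multistage filter with d stages of b counters.
        A flow is added to flow memory only when the counters it hashes to at
        all stages have reached the threshold T."""

        def __init__(self, depth, buckets, threshold, seed=0):
            self.d, self.b, self.T = depth, buckets, threshold
            rng = random.Random(seed)
            self.seeds = [rng.getrandbits(64) for _ in range(depth)]  # per-stage hash seeds
            self.counters = [[0] * buckets for _ in range(depth)]
            self.flow_memory = {}

        def _bucket(self, stage, flow_id):
            return hash((self.seeds[stage], flow_id)) % self.b

        def process_packet(self, flow_id, size):
            # Every packet updates the stage counters (the shielding optimization
            # of Section 3.3 would skip this for flows already in flow memory).
            over = True
            for stage in range(self.d):
                i = self._bucket(stage, flow_id)
                self.counters[stage][i] += size
                if self.counters[stage][i] < self.T:
                    over = False
            if flow_id in self.flow_memory:
                self.flow_memory[flow_id] += size
            elif over:
                self.flow_memory[flow_id] = size   # start observing this flow individually

        def reset_interval(self):
            # Stage counters are reinitialized at the start of each interval.
            self.counters = [[0] * self.b for _ in range(self.d)]

In hardware the d counter updates would be parallel SRAM accesses, as noted above; the dictionary here merely stands in for the flow memory.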
The serial multistage filter

We briefly present a variant of the multistage filter called a serial multistage filter. Instead of using multiple stages in parallel, we can place them serially after each other, each stage seeing only the packets that passed the previous stage.

Let d be the number of stages (the depth of the serial filter). We set a threshold of T/d for all the stages. Thus for a flow that sends T bytes, by the time the last packet is sent, the counters the flow hashes to at all d stages reach T/d, so the packet will pass to the flow memory. As with parallel filters, we have no false negatives. As with parallel filters, small flows can pass the filter only if they keep hashing to counters made large by other flows.

The analytical evaluation of serial filters is more complicated than for parallel filters. On one hand the early stages shield later stages from much of the traffic, and this contributes to stronger filtering. On the other hand the threshold used by stages is smaller (by a factor of d) and this contributes to weaker filtering. Since, as shown in Section 7, parallel filters perform better than serial filters on traces of actual traffic, the main focus in this paper will be on parallel filters.

Improvements to the basic algorithms

The improvements to our algorithms presented in this section further increase the accuracy of the measurements and reduce the memory requirements. Some of the improvements apply to both algorithms, some apply only to one of them.

There are a number of basic optimizations that exploit the fact that large flows often last for more than one measurement interval.

Preserving entries: Erasing the flow memory after each interval implies that the bytes of a large flow that were sent before the flow was allocated an entry are not counted. By preserving entries of large flows across measurement intervals and only reinitializing stage counters, all long lived large flows are measured nearly exactly. To distinguish between a large flow that was identified late and a small flow that was identified by error, a conservative solution is to preserve the entries of not only the flows for which we count at least T bytes in the current interval, but also all the flows that were added in the current interval (since they may be large flows that entered late).

Early removal: Sample and hold has a larger rate of false positives than multistage filters. If we keep for one more interval all the flows that obtained a new entry, many small flows will keep their entries for two intervals. We can improve the situation by selectively removing some of the flow entries created in the current interval. The new rule for preserving entries is as follows. We define an early removal threshold R that is less than the threshold T. At the end of the measurement interval, we keep all entries whose counter is at least T and all entries that have been added during the current interval and whose counter is at least R.

Shielding: Consider large, long lived flows that go through the filter each measurement interval. Each measurement interval, the counters they hash to exceed the threshold. With shielding, traffic belonging to flows that have an entry in flow memory no longer passes through the filter (the counters in the filter are not incremented for packets with an entry), thereby reducing false positives. If we shield the filter from a large flow, many of the counters it hashes to will not reach the threshold after the first interval. This reduces the probability that a random small flow will pass the filter by hashing to counters that are large because of other flows.
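One way to picture how preserving entries and early removal combine at the end of an interval is the sketch below; the function name, the bookkeeping of which entries are new, and the choice to restart preserved counters at zero are assumptions made for illustration.

    def end_of_interval(flow_memory, new_this_interval, T, R):
        """Entries that survive into the next interval under 'preserving entries'
        with an early removal threshold R < T.

        flow_memory       : dict flow ID -> bytes counted this interval
        new_this_interval : set of flow IDs whose entries were created this interval
        """
        kept = {}
        for flow, counted in flow_memory.items():
            if counted >= T:
                kept[flow] = 0   # large flow: keep the entry, restart its counter
            elif flow in new_this_interval and counted >= R:
                kept[flow] = 0   # added late and already above R: may be large, keep it
            # all other entries are removed (early removal of likely false positives)
        return kept

With shielding enabled, packets of the flows kept here would additionally bypass the stage counters during the next interval.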
Conservative update of counters

We now describe an important optimization for multistage filters that improves performance by an order of magnitude. Conservative update reduces the number of false positives of multistage filters by two subtle changes to the rules for updating counters. In essence, we endeavour to increment counters as little as possible (thereby reducing false positives by preventing small flows from passing the filter) while still avoiding false negatives (i.e., we need to ensure that all flows that reach the threshold still pass the filter).

[Figure 4: Conservative update: without conservative update (left) all counters are increased by the size of the incoming packet; with conservative update (right) no counter is increased to more than the size of the smallest counter plus the size of the packet.]

The first change (Figure 4) applies only to parallel filters and only for packets that don't pass the filter. As usual, an arriving packet of a flow F is hashed to a counter at each stage. We update the smallest of the counters normally (by adding the size of the packet). However, the other counters are set to the maximum of their old value and the new value of the smallest counter. Since the amount of traffic sent by the current flow is at most the new value of the smallest counter, this change cannot introduce a false negative for the flow the packet belongs to. Since we never decrement counters, other large flows that might hash to the same counters are not prevented from passing the filter.

The second change is very simple and applies to both parallel and serial filters. When a packet passes the filter and it obtains an entry in the flow memory, no counters should be updated. This will leave the counters below the threshold. Other flows with smaller packets that hash to these counters will get less "help" in passing the filter.
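A sketch of conservative update, written as a variant of the MultistageFilter sketch given earlier; how the two rules are combined into one code path (and the shielding of flows that already have entries) is our reading of the description above, not code from the paper.

    class ConservativeMultistageFilter(MultistageFilter):
        """Adds conservative update to the multistage filter sketch above."""

        def process_packet(self, flow_id, size):
            if flow_id in self.flow_memory:
                self.flow_memory[flow_id] += size   # tracked flows bypass the filter
                return
            buckets = [self._bucket(s, flow_id) for s in range(self.d)]
            values = [self.counters[s][buckets[s]] for s in range(self.d)]
            new_min = min(values) + size    # smallest counter updated normally
            if new_min >= self.T:
                # The packet passes the filter: give the flow an entry and, per the
                # second rule, leave the counters untouched (they stay below T).
                self.flow_memory[flow_id] = size
                return
            for s in range(self.d):
                # First rule: no counter is raised above the new value of the
                # smallest counter; counters are never decremented.
                if self.counters[s][buckets[s]] < new_min:
                    self.counters[s][buckets[s]] = new_min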
ANALYTICAL EVALUATION OF OUR ALGORITHMS

In this section we analytically evaluate our algorithms. We focus on two important questions:

• How good are the results? We use two distinct measures of the quality of the results: how many of the large flows are identified, and how accurately is their traffic estimated?

• What are the resources required by the algorithm? The key resource measure is the size of flow memory needed. A second resource measure is the number of memory references required.

In Section 4.1 we analyze our sample and hold algorithm, and in Section 4.2 we analyze multistage filters. We first analyze the basic algorithms and then examine the effect of some of the improvements presented in Section 3.3. In the next section (Section 5) we use the results of this section to analytically compare our algorithms with sampled NetFlow.

Example: We will use the following running example to give numeric instances. Assume a 100 Mbyte/s link with 100,000 flows. We want to measure all flows whose traffic is more than 1% (1 Mbyte) of link capacity in a one second measurement interval.
Sample and hold

We first define some notation we use in this section.

• p the probability for sampling a byte;
• s the size of a flow (in bytes);
• T the threshold for large flows;
• C the capacity of the link – the number of bytes that can be sent during the entire measurement interval;
• O the oversampling factor defined by p = O · 1/T;
• c the number of bytes actually counted for a flow.

The quality of results for sample and hold

The first measure of the quality of the results is the probability that a flow at the threshold is not identified. As presented in Section 3.1 the probability that a flow of size T is not identified is (1-p)^T ≈ e^-O. An oversampling factor of 20 results in a probability of missing flows at the threshold of 2 * 10^-9.

Example: For our example, p must be 1 in 50,000 bytes for an oversampling of 20. With an average packet size of 500 bytes this is roughly 1 in 100 packets.

The second measure of the quality of the results is the difference between the size of a flow s and our estimate. The number of bytes that go by before the first one gets sampled has a geometric probability distribution4: it is x with a probability5 (1-p)^x * p. Therefore E[s-c] = 1/p and SD[s-c] = sqrt(1-p)/p. The best estimate for s is c + 1/p and its standard deviation is sqrt(1-p)/p. If we choose to use c as an estimate for s then the error will be larger, but we never overestimate the size of the flow. In this case, the deviation from the actual value of s is sqrt(E[(s-c)^2]) = sqrt(2-p)/p. Based on this value we can also compute the relative error of a flow of size T, which is sqrt(2-p)/(pT) = sqrt(2-p)/O.

4 We ignore for simplicity that the bytes before the first sampled byte that are in the same packet with it are also counted. Therefore the actual algorithm will be more accurate than our model.
5 Since we focus on large flows, we ignore for simplicity the correction factor we need to apply to account for the case when the flow goes undetected (i.e. x is actually bound by the size of the flow s, but we ignore this).

Example: For our example, with an oversampling factor O of 20, the relative error for a flow at the threshold is 7%.
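Plugging the running example's numbers into these formulas gives a quick sanity check (the constants come from the text above; the snippet is only illustrative):

    from math import exp, sqrt

    T = 1_000_000        # threshold: 1 Mbyte
    O = 20               # oversampling factor
    p = O / T            # byte sampling probability: 1 in 50,000 bytes

    print(exp(-O))           # probability of missing a flow at the threshold, ~ 2e-9
    print(sqrt(2 - p) / O)   # relative error for a flow at the threshold, ~ 0.07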
The memory requirements for sample and hold

The size of the flow memory is determined by the number of flows identified. The actual number of sampled packets is an upper bound on the number of entries needed in the flow memory because new entries are created only for sampled packets. Assuming that the link is constantly busy, by the linearity of expectation, the expected number of sampled bytes is p · C = O · C/T.

Example: Using an oversampling of 20 requires 2,000 entries on average.

The number of sampled bytes can exceed this value. Since the number of sampled bytes has a binomial distribution, we can use the normal curve to bound with high probability the number of bytes sampled during the measurement interval. Therefore with probability 99% the actual number will be at most 2.33 standard deviations above the expected value; similarly, with probability 99.9% it will be at most 3.08 standard deviations above the expected value. The standard deviation of the number of sampled packets is sqrt(Cp(1-p)).

Example: For an oversampling of 20 and an overflow probability of 0.1% we need at most 2,147 entries.
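The expected-entries bound and its standard deviation for the running example can be checked directly (illustrative only):

    C = 100_000_000      # bytes in a one second interval on a 100 Mbyte/s link
    T = 1_000_000
    O = 20
    p = O / T

    expected_entries = p * C                   # = O * C / T = 2,000 entries on average
    std_dev = (C * p * (1 - p)) ** 0.5         # ~ 45 entries
    print(expected_entries, std_dev)

Adding a few of these standard deviations to the expected 2,000 entries yields the high-probability sizing discussed above.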
The effect of preserving entries

We preserve entries across measurement intervals to improve accuracy. The probability of missing a large flow decreases because we cannot miss it if we keep its entry from the prior interval. Accuracy increases because we know the exact size of the flows whose entries we keep. To quantify these improvements we need to know the ratio of long lived flows among the large ones.

The cost of this improvement in accuracy is an increase in the size of the flow memory. We need enough memory to hold the samples from both measurement intervals6. Therefore the expected number of entries is bounded by 2O · C/T. To bound with high probability the number of entries we use the normal curve and the standard deviation of the number of sampled packets during the 2 intervals, which is sqrt(2Cp(1-p)).

Example: For an oversampling of 20 and acceptable probability of overflow equal to 0.1%, the flow memory has to have 4,207 entries to preserve entries.

6 We actually also keep the older entries that are above the threshold. Since we are performing a worst case analysis we assume that there is no flow above the threshold, because if there were, many of its packets would be sampled, decreasing the number of entries required.

The effect of early removal

The effect of early removal on the proportion of false negatives depends on whether or not the entries removed early are reported. Since we believe it is more realistic that implementations will not report these entries, we will use this assumption in our analysis. Let R < T be the early removal threshold. A flow at the threshold is not reported unless one of its first T - R bytes is sampled. Therefore the probability of missing the flow is approximately e^(-O(T-R)/T). If we use an early removal threshold of R = 0.2 * T, this increases the probability of missing a large flow from 2 * 10^-9 to 1.1 * 10^-7 with an oversampling of 20.

Early removal reduces the size of the memory required by limiting the number of entries that are preserved from the previous measurement interval. Since there can be at most C/R flows sending R bytes, the number of entries that we keep is at most C/R, which can be smaller than OC/T, the bound on the expected number of sampled packets. The expected number of entries we need is C/R + OC/T.

To bound with high probability the number of entries we use the normal curve. If R ≥ T/O the standard deviation is given only by the randomness of the packets sampled in one interval and is sqrt(Cp(1-p)).

Example: An oversampling of 20 and R = 0.2T with overflow probability 0.1% requires 2,647 memory entries.
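A one-line check of how early removal changes the miss probability in this example (values from the text; the snippet is illustrative):

    from math import exp

    O = 20
    print(exp(-O))            # ~ 2.1e-9 : miss probability without early removal
    print(exp(-O * 0.8))      # ~ 1.1e-7 : miss probability with R = 0.2 T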
Multistage filters

In this section, we analyze parallel multistage filters. We only present the main results. The proofs and supporting lemmas are in [?]. We first define some new notation:

• b the number of buckets in a stage;
• d the depth of the filter (the number of stages);
• n the number of active flows;
• k the stage strength is the ratio of the threshold and the average size of a counter, k = Tb/C, where C denotes the channel capacity as before. Intuitively, this is the factor we inflate each stage memory beyond the minimum of C/T.

Example: To illustrate our results numerically, we will assume that we solve the measurement example described in Section 4 with a 4 stage filter, with 1000 buckets at each stage. The stage strength k is 10 because each stage memory has 10 times more buckets than the maximum number of flows (i.e., 100) that can cross the specified threshold of 1%.

The quality of results for multistage filters

As discussed in Section 3.2, multistage filters have no false negatives. The error of the traffic estimates for large flows is bounded by the threshold T since no flow can send T bytes without being entered into the flow memory. The stronger the filter, the less likely it is that the flow will be entered into the flow memory much before it reaches T. We first state an upper bound for the probability of a small flow passing the filter described in Section 3.2.

Lemma 1. Assuming the hash functions used by different stages are independent, the probability of a flow of size s < T(1 - 1/k) passing a parallel multistage filter is at most

    ps ≤ ( T / (k(T - s)) )^d

The proof of this bound formalizes the preliminary analysis of multistage filters from Section 3.2. Note that the bound makes no assumption about the distribution of flow sizes, and thus applies for all flow distributions. The bound is tight in the sense that it is almost exact for a distribution that has (C - s)/(T - s) flows of size (T - s) that send all their packets before the flow of size s. However, for realistic traffic mixes (e.g., if flow sizes follow a Zipf distribution), this is a very conservative bound.

Based on this lemma we obtain a lower bound for the expected error for a large flow.

Theorem 2. The expected number of bytes of a large flow that go undetected by a multistage filter is bound from below by

    E[s - c] ≥ T ( 1 - d / (k(d - 1)) ) - ymax

where ymax is the maximum size of a packet.

This bound suggests that we can significantly improve the accuracy of the estimates by adding a correction factor to the bytes actually counted. The down side to adding a correction factor is that we can overestimate some flow sizes; this may be a problem for accounting applications.
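Evaluating the Lemma 1 bound on the running example (s = 100 Kbytes, T = 1 Mbyte, k = 10, d = 4) reproduces the per-flow pass probability from the preliminary analysis of Section 3.2; the snippet is only an illustration:

    T, k, d = 1_000_000, 10, 4
    s = 100_000
    print((T / (k * (T - s))) ** d)    # ~ 1.52e-4, the per-flow pass probability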
The memory requirements for multistage filters

We can dimension the flow memory based on bounds on the number of flows that pass the filter. Based on Lemma 1 we can compute a bound on the total number of flows expected to pass the filter.

Theorem 3. The expected number of flows passing a parallel multistage filter is bound by

    E[npass] ≤ max( b/(k - 1), n (n/(kn - b))^d ) + n (n/(kn - b))^d

Example: Theorem 3 gives a bound of 121.2 flows. Using 3 stages would have resulted in a bound of 200.6 and using 5 would give 112.1. Note that when the first term dominates the max, there is not much gain in adding more stages.

In [?] we have also derived a high probability bound on the number of flows passing the filter.

Example: The probability that more than 185 flows pass the filter is at most 0.1%. Thus by increasing the flow memory from the expected size of 122 to 185 we can make overflow of the flow memory extremely improbable.

The effect of preserving entries and shielding

Preserving entries affects the accuracy of the results the same way as for sample and hold: long lived large flows have their traffic counted exactly after their first interval above the threshold. As with sample and hold, preserving entries basically doubles all the bounds for memory usage.

Shielding has a strong effect on filter performance, since it reduces the traffic presented to the filter. Reducing the traffic α times increases the stage strength to k · α, which can be substituted in Theorems 2 and 3.
COMPARING MEASUREMENT METHODS

In this section we analytically compare the performance of three traffic measurement algorithms: our two new algorithms (sample and hold and multistage filters) and Sampled NetFlow. First, in Section 5.1, we compare the algorithms at the core of traffic measurement devices. For the core comparison, we assume that each of the algorithms is given the same amount of high speed memory and we compare their accuracy and number of memory accesses. This allows a fundamental analytical comparison of the effectiveness of each algorithm in identifying heavy-hitters.

However, in practice, it may be unfair to compare Sampled NetFlow with our algorithms using the same amount of memory. This is because Sampled NetFlow can afford to use a large amount of DRAM (because it does not process every packet) while our algorithms cannot (because they process every packet and hence need to store per flow entries in SRAM). Thus we perform a second comparison in Section 5.2 of complete traffic measurement devices. In this second comparison, we allow Sampled NetFlow to use more memory than our algorithms. The comparisons are based on the algorithm analysis in Section 4 and an analysis of NetFlow taken from [?].

[Table 1: Comparison of the core algorithms: sample and hold provides most accurate results while pure sampling has very few memory accesses.]

Comparison of the core algorithms

In this section we compare sample and hold, multistage filters and ordinary sampling (used by NetFlow) under the assumption that they are all constrained to using M memory entries. We focus on the accuracy of the measurement of a flow (defined as the standard deviation of an estimate over the actual size of the flow) whose traffic is zC (for flows of 1% of the link capacity we would use z = 0.01).

The bound on the expected number of entries is the same for sample and hold and for sampling and is pC. By making this equal to M we can solve for p. By substituting in the formulae we have for the accuracy of the estimates and after eliminating some terms that become insignificant (as p decreases and as the link capacity goes up) we obtain the results shown in Table 1.

For multistage filters, we use a simplified version of the result from Theorem 3: E[npass] ≈ b/k + n/k^d. We increase the number of stages used by the multistage filter logarithmically as the number of flows increases so that a single small flow is expected to pass the filter7 and the strength of the stages is 10. At this point we estimate the memory usage to be M = b/k + 1 + rbd = C/T + 1 + r 10 C/T log10(n), where r depends on the implementation and reflects the relative cost of a counter and an entry in the flow memory. From here we obtain T which will be the maximum error of our estimate of flows of size zC. From here, the result from Table 1 is immediate.

7 Configuring the filter such that a small number of small flows pass would have resulted in smaller memory and fewer memory accesses (because we would need fewer stages), but it would have complicated the formulae.

The term Mz that appears in all formulae in the first row of the table is exactly equal to the oversampling we defined in the case of sample and hold. It expresses how many times we are willing to allocate over the theoretical minimum memory to obtain better accuracy. We can see that the error of our algorithms decreases inversely proportional to this term while the error of sampling is proportional to the inverse of its square root.

The second line of Table 1 gives the number of memory locations accessed per packet by each algorithm. Since sample and hold performs a lookup for every packet8, its per packet processing is 1. Multistage filters add to the one flow memory lookup an extra access to one counter per stage, and the number of stages increases as the logarithm of the number of flows. Finally, for ordinary sampling one in x packets gets sampled, so the average per packet processing is 1/x.

8 We equate a lookup in the flow memory to a single memory access. This is true if we use a content addressable memory. Lookups without hardware support require a few more memory accesses to resolve hash collisions.

Table 1 provides a fundamental comparison of our new algorithms with ordinary sampling as used in Sampled NetFlow. The first line shows that the relative error of our algorithms scales with 1/M, which is much better than the 1/sqrt(M) scaling of ordinary sampling. However, the second line shows that this improvement comes at the cost of requiring at least one memory access per packet for our algorithms. While this allows us to implement the new algorithms using SRAM, the smaller number of memory accesses (< 1) per packet allows Sampled NetFlow to use DRAM. This is true as long as x is larger than the ratio of a DRAM memory access to an SRAM memory access. However, even a DRAM implementation of Sampled NetFlow has some problems which we turn to in our second comparison.
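To visualize the gap between the 1/M and 1/sqrt(M) scalings discussed above (the constants here are arbitrary; only the scaling behaviour is taken from the text):

    for M in (1_000, 10_000, 100_000):
        print(M, 1.0 / M, 1.0 / M ** 0.5)
    # Ten times more memory cuts the error of our algorithms tenfold,
    # but cuts the error of ordinary sampling only by about 3.2x.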
Comparing Measurement Devices

Table 1 implies that increasing DRAM memory size M to infinity can reduce the relative error of Sampled NetFlow to zero. But this assumes that by increasing memory one can increase the sampling rate so that x becomes arbitrarily close to 1. If x = 1, there would be no error since every packet is logged. But x must at least be as large as the ratio of DRAM speed (currently around 60 ns) to SRAM speed (currently around 5 ns); thus Sampled NetFlow will always have a minimum error corresponding to this value of x even when given unlimited DRAM.

With this insight, we now compare the performance of our algorithms and NetFlow in Table 2 without limiting NetFlow memory. Thus Table 2 takes into account the underlying technologies (i.e., the potential use of DRAM over SRAM) and one optimization (i.e., preserving entries) for both our algorithms.

We consider the task of estimating the size of all the flows above a fraction z of the link capacity over a measurement interval of t seconds. In order to make the comparison possible we change somewhat the way NetFlow operates: we assume that it reports the traffic data for each flow after each measurement interval, like our algorithms do. The four characteristics of the traffic measurement algorithms presented in the table are: the percentage of large flows known to be measured exactly, the relative error of the estimate of a large flow, the upper bound on the memory size and the number of memory accesses per packet.

Note that the table does not contain the actual memory used but a bound. For example the number of entries used by NetFlow is bounded by the number of active flows and the number of DRAM memory lookups that it can perform during a measurement interval (which doesn't change as the link capacity grows). Our measurements in Section 7 show that for all three algorithms the actual memory usage is much smaller than the bounds, especially for multistage filters. Memory is measured in entries, not bytes. We assume that a flow memory entry is equivalent to 10 of the counters used by the filter because the flow ID is typically much larger than the counter. Note that the number of memory accesses required per packet does not necessarily translate to the time spent on the packet because memory accesses can be pipelined or performed in parallel.

We make simplifying assumptions about technology evolution. As link speeds increase, so must the electronics. Therefore we assume that SRAM speeds keep pace with link capacities. We also assume that the speed of DRAM does not improve significantly ([?] states that DRAM speeds improve only at 9% per year while clock rates improve at 40% per year).

We assume the following configurations for the three algorithms. Our algorithms preserve entries. For multistage filters we introduce a new parameter expressing how many times larger a flow of interest is than the threshold of the filter, u = zC/T. Since the speed gap between the DRAM used by sampled NetFlow and the link speeds increases as link speeds increase, NetFlow has to decrease its sampling rate proportionally with the increase in capacity9 to provide the smallest possible error. For the NetFlow error calculations we also assume that the size of the packets of large flows is 1500 bytes.

9 If the capacity of the link is x times OC-3, then one in x packets gets sampled. We assume based on [?] that NetFlow can handle packets no smaller than 40 bytes at OC-3 speeds.

[Table 2: Comparison of traffic measurement devices.]

Besides the differences (Table 1) that stem from the core algorithms, we see new differences in Table 2. The first big difference (Row 1 of Table 2) is that unlike NetFlow, our algorithms provide exact measures for long-lived large flows by preserving entries. More precisely, by preserving entries our algorithms will exactly measure traffic for all (or almost all in the case of sample and hold) of the large flows that were large in the previous interval. Given that our measurements show that most large flows are long lived, this is a big advantage.

Of course, one could get the same advantage by using an SRAM flow memory that preserves large flows across measurement intervals in Sampled NetFlow as well. However, that would require the router to root through its DRAM flow memory before the end of the interval to find the large flows, a large processing load. One can also argue that if one can afford an SRAM flow memory, it is quite easy to do Sample and Hold.

The second big difference (Row 2 of Table 2) is that we can make our algorithms arbitrarily accurate at the cost of increases in the amount of memory used10 while sampled NetFlow can do so only by increasing the measurement interval.

10 Of course, technology and cost impose limitations on the amount of available SRAM but the current limits for on and off-chip SRAM are high enough for our algorithms.

The third row of Table 2 compares the memory used by the algorithms. The extra factor of 2 for sample and hold and multistage filters arises from preserving entries. Note that the number of entries used by Sampled NetFlow is bounded by both the number n of active flows and the number of memory accesses that can be made in t seconds. Finally, the fourth row of Table 2 is identical to the second row of Table 1.

Table 2 demonstrates that our algorithms have two advantages over NetFlow: i) they provide exact values for long-lived large flows (row 1) and ii) they provide much better accuracy even for small measurement intervals (row 2). Besides these advantages, our algorithms also have three more advantages not shown in Table 2. These are iii) provable lower bounds on traffic, iv) reduced resource consumption for collection, and v) faster detection of new large flows. We now examine advantages iii) through v) in more detail.
iii) Provable Lower Bounds: A possible disadvantage of Sampled NetFlow is that the NetFlow estimate is not an actual lower bound on the flow size. Thus a customer may be charged for more than the customer sends. While one can make the average overcharged amount arbitrarily low (using large measurement intervals or other methods from [?]), there may be philosophical objections to overcharging. Our algorithms do not have this problem.

iv) Reduced Resource Consumption: Clearly, while Sampled NetFlow can increase DRAM to improve accuracy, the router has more entries at the end of the measurement interval. These records have to be processed, potentially aggregated, and transmitted over the network to the management station. If the router extracts the heavy hitters from the log, then router processing is large; if not, the bandwidth consumed and processing at the management station is large. By using fewer entries, our algorithms avoid these resource (e.g., memory, transmission bandwidth, and router CPU cycles) bottlenecks.

v) Faster detection of long-lived flows: In a security or DoS application, it may be useful to quickly detect a large increase in traffic to a server. Our algorithms can use small measurement intervals and detect large flows soon after they start. By contrast, Sampled NetFlow can be much slower.

    usage = entriesused / flowmemsize
    if (usage > target)
        threshold = threshold * (usage/target)^adjustup
    else if (threshold did not increase for 3 intervals)
        threshold = threshold * (usage/target)^adjustdown

[Figure 5: Dynamic threshold adaptation to achieve target memory usage.]

The usage variable does not contain the number of entries used over the last measurement interval, but an average of the last 3 intervals.

Based on the measurements presented in [?], we use a value of 3 for adjustup, 1 for adjustdown in the case of sample and hold and 0.5 for multistage filters, and 90% for target. [?] has a more detailed discussion of the threshold adaptation algorithm and the heuristics used to decide
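A runnable restatement of the Figure 5 pseudocode; the function signature and the way the "did not increase for 3 intervals" condition is tracked are our own assumptions, and the default parameter values are the ones quoted in the text for sample and hold.

    def adapt_threshold(threshold, usage, intervals_since_increase,
                        target=0.90, adjustup=3.0, adjustdown=1.0):
        """One adaptation step following Figure 5. 'usage' is entriesused/flowmemsize
        (the text averages it over the last 3 intervals)."""
        if usage > target:
            return threshold * (usage / target) ** adjustup, 0
        if intervals_since_increase >= 3:
            return threshold * (usage / target) ** adjustdown, intervals_since_increase + 1
        return threshold, intervals_since_increase + 1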