Efficient Decision Tree Construction on Streaming Data
Ruoming Jin
Gagan Agrawal
Department of Computer and Information
Department of Computer and Information
Sciences
Sciences
Ohio State University, Columbus OH 43210
Ohio State University, Columbus OH 43210
jinr@cis.ohio-state.edu
agrawal@cis.ohio-state.edu
ABSTRACT
egorical attributes only. We present a numerical interval pruning
Decision tree construction is a well studied problem in data min-
(NIP) approach which significantly reduces the processing time for
ing. Recently, there has been much interest in mining streaming
numerical attributes, without any loss of accuracy. Our experimen-
data. Domingos and Hulten have presented a one-pass algorithm
tal results show an average of 39% reduction in execution times.
for decision tree construction. Their work uses Hoeffding inequal-
Using Smaller Samples Size for the Same Probabilistic Bound:
ity to achieve a probabilistic bound on the accuracy of the tree con-
Domingos and Hulten use Hoeffding’s bound [19] to achieve a
structed.
probabilistic bound. Hoeffding’s result relates the sample size, the
In this paper, we revisit this problem. We make the following two
desired level of accuracy, and the probability of meeting this level
contributions: 1) We present a numerical interval pruning (NIP) ap-
of accuracy, and is applicable independent of the distribution of in-
proach for efficiently processing numerical attributes. Our results
put data. In this paper, we show how we can use the properties
show an average of 39% reduction in execution times. 2) We ex-
of the gain function entropy (and gini) to reduce the sample size
ploit the properties of the gain function entropy (and gini) to reduce
required to obtain the same probabilistic bound. Again, this result
the sample size required for obtaining a given bound on the accu-
is independent of the distribution of input data. Our experimental
racy. Our experimental results show a 37% reduction in the number
results show that the number of samples required is reduced by an
of data instances required.
average of 37%.
Overall, the two new techniques introduced here significantly
Overall, these two contributions increase the efficiency of pro-
improve the efficiency of decision tree construction on streaming
cessing streaming data, where a real-time constraint may exist on
data.
the processing times, and only limited memory may be available.
Our work also has important implications for analysis of streaming
1.
INTRODUCTION
data beyond decision tree construction. We will be exploring these
further in our future work.
Decision tree construction is an important data mining problem.
The rest of the paper is organized as follows. Section 2 gives
Over the last decade, decision tree construction over disk-resident
background information on the decision tree construction problem,
datasets has received considerable attention [11, 13, 25, 27]. More
and reviews the gain function entropy. The problems and issues in
recently, the database community has focused on a new model of
processing of streaming data are discussed in Section 3. Our new
data processing, in which data arrives in the form of continuous
technique for efficient handling of numerical attributes is presented
streams [3, 4, 9, 12, 14, 16, 23, 30]. The key issue in mining on
in Section 4. The new sampling method is described in Section 5.
streaming data is that only one pass is allowed over the entire data.
We evaluate our techniques in Section 6. We compare our work
Moreover, there is a real-time constraint, i.e. the processing time
with related research efforts in Section 7 and conclude in Section 8.
is limited by the rate of arrival of instances in the data stream, and
the memory available to store any summary information may be
bounded. For most data mining problems, a one pass algorithm
2.
DECISION TREE CONSTRUCTION
cannot be very accurate. The existing algorithms typically achieve
This section provides background information on the decision
either a deterministic bound on the accuracy [17], or a probabilistic
tree construction problem.
bound [10]. Data mining algorithms developed for streaming data
also serve as a useful basis for creating approximate, but scalable,
2.1
Decision Tree Classifier
implementations for very large and disk-resident datasets.
Assume there is a data set ¢¡¤£¦¥¨§¦©¥©©¥! , where ¥"#¡%$
Domingos and Hulten have addressed the problem of decision
&
&
&
©(0)21
35476
¡C$
©©
)
§
'
. '98¨@BA
'
'ED
is the data associated
tree construction on streaming data [10, 21]. Their algorithm guar-
with the instance and ( is the class label. Each 'GF is called a field
antees a probabilistic bound on the accuracy of the decision tree
&
3
¡I3
4P
3
or an attribute of the data instance.
§
8@HA
D
is the
that is constructed. In this paper, we revisit the problem of decision
domain of data instances and 3 F is the domain of the attribute 'QF .
tree construction on streaming data. We make the following two
The domain of an attribute can either be a categorical set, such as
contributions:
£RTS¦UV©WX`YESa©bGScXdXdeTf!
6
, or a numerical set, such as
.
is the
gh
hipicq
Efficient Processing of Numerical Attributes: One of the chal-
domain of class labels. In this paper, our discussion will assume
lenges in processing of numerical attributes is that the total number
that there are only two distinct class labels, though our work can be
of candidate split points is very large, which can cause high compu-
easily extended to the general case.
tational and memory overhead for determining the best split point.
The classification problem is to find a computable function rts
&
The work presented by Domingos and Hulten is evaluated for cat-
3vuwx6
, such that for any instance ¥ extracted from the same distri-
&
¥¨
bution as
,
'
£¢
will give an as accurate as possible prediction
2.3
Evaluating Split Conditions
r
¡
of ¥¨ ( . Decision tree classifiers are frequently used for achieving the
Selecting the attribute and the split condition that maximizes in-
above functionality. A decision tree classifier is typically a binary
formation gain is the key step in decision tree construction. The
tree, where every non-leaf node ¥ is associated with a predicate .
¤
scalable decision tree construction algorithms proposed in the lit-
A predicate partitions the set of data instances associated with node
erature [11, 13, 25, 27] take a number of approaches towards per-
based upon the value of a particular attribute "
"
'
. If '
belongs to a
forming this step efficiently.
categorical domain,
is a subset predicate, for example,
¡
¥
RTYES
¤
¤
The major issues that need to be addressed are, what information
1
£RTS¦UV©WXdYVSp
if
"
"
'
. If '
belongs to a numerical domain,
is a
¤
is required for evaluating different candidate split conditions, and
range predicate, for example,
¡
¥RcYES
if
"
¦¥¨§
'
. Here, §
is
¤
i
i
how can this information be stored and processed efficiently. In the
called the cutting or the split point.
initial work on decision tree construction for disk-resident datasets,
Building a decision tree classifier generally includes two stages,
the training dataset is separated into attribute lists [25, 27]. For a
a growing stage and a pruning stage. The tree growing stage in-
particular attribute, the attribute list maintains the record-identifier
volves recursively partitioning the dataset, till the records associ-
and the value of that attribute for the training record. Moreover, for
ated with a leaf node either all have the same class label, or their
efficiently choosing the best split point for a numerical attribute,
cardinality is below a threshold. In partitioning any node, a number
the attribute lists for such attributes is kept sorted.
of different candidate splitting conditions are evaluated. The best
A significantly different approach is taken as part of the Rain-
splitting condition is typically chosen to maximize a gain function,
Forest approach by Gehrke et al. [13]. Here, a new data structure
which are based upon impurity measurements such as gini or en-
called an AVC (Attribute-Value, Classlabel) group is used. An AVC
tropy. The pruning stage eliminates some nodes to reduce the tree
group for a decision tree node comprises AVC sets for all attributes.
size. This paper will focus only on the tree growing stage.
For a given attribute and a node being processed, the AVC set sim-
There are two commonly used metrics to evaluate a decision
ply records the count of occurrence of each class label for each
tree classifier, inaccuracy and tree size. Inaccuracy is defined as
distinct value the attribute can take. Thus, it is essentially a class
R
¡
©!RcecW
¥
¡
¥¨
(
¢
¢
, where ¥ is a random instance from the
Br¡
histogram. The size of the AVC set for a given node and attribute is
underlying distribution. Tree size is defined as the total number of
proportional to the product of the number of distinct values of the
nodes in the tree, and measures the conciseness of the classifier.
attribute and the number of distinct class labels.
2.2
Entropy Function
3.
STREAMING DATA PROBLEM
An impurity function gives a measurement of the impurity in the
dataset. Originally proposed in the information theory literature,
In this section, we focus on the problem of decision tree con-
entropy has become one of the most popular impurity functions.
struction on streaming data. We give a template of the algorithm,
Suppose, we are looking at a training dataset
. Let § and be
which will be used as the basis for our presentation in the next three
¤
¤
the proportion of instances with class labels
and , respectively.
sections. Moreover, we describe how sampling is used to achieve a
h
Clearly, §
¡
.
probabilistic bound on the quality of the tree that is constructed.
¤
¤
h
Entropy function is defined as
3.1
Algorithm Template
E¥Rce
b
¡
4
!#"%$
4
&!'"($
§
§
¢
¤
¤
¤
¤
¤
¡
4
)!'"($
4
)!'"($
§
§
§
§
¢
¢
StreamTree Stream
¤
¤
h
¤
h
¤
Y
`ba
global c8dfege¡dfhXhgiqpsrutvewtxeyp9y
(
Now, suppose we split the node using a split predicate
and cre-
local hXIehXIeI
ate two subsets 10 and 12 , which are the left and right subsets,
local c8tesiq
respectively. Let 0 denote the fraction of the data instances in
yde¡efygde¡e8
¤
that are associated with 10 . Then, the gain associated with split-
h
IiYdfhXhjiqpklya6
(
ting using the predicate
is defined as
while not
and empty
YmegnoRiqpxYkyua
Yyuara
iss`t
uegiqYvaq
¡
©
0
2
h(zgzw{}|
3%4
3
¢
hXIeuxwy
pYmdXhXhjiqp9ira6
if ~hXIe
lly
h
zgh
¡
5E¥Rce
b
4
8E¥Rce
b
4
8E¥Rce
b
0
0
2
2
IiYm~hXIet
nox#ep9iraq
¢
¢6¢
¢6¢6¢
zgh
{}zj|
z
{
{
¤
6 7¤
¤
7¤
¤
if hXIeIt
i
p
i}hq
wwhg
i
hj
Further, let §90 be the proportion of instances with the class label
dfewndhg%eIYm~hXIepklya6
¤
if
zgh
z
within 10 , and let
§
92
be the proportion of instances with the
hXIeIt
eghgtxu
n¦#e
Yva
h
¤
z
z
{
|
{
z
z
{
t
e
i
tw6i
hgi}hfuewivi}e6e
i
iq
¥
RTe
b
class label
within
2
. Because
¢
is a constant, we
§
h
h
¤
YhXIe
pk~hXIe
asxhXIet
w6dfe
i}eIYqaq
can treat
0
§
90
§
92
4
3
as a function of three variables,
,
, and
.
¤
¤
¤
dfewndhg%eIYm~hXIepklya6
§
h
¡
0
©
§
90
©
§
92
¡
@E¥Rce
b
IRYrYhXIe
prhXIe
a6pjyuaq
4
3
3
¢
¢
7¤
¤
¤
¤
while eg~hjtu
ndewndhgdXpxYy¦pgyua
4
4
!'"%$
4
!'"($
0
§
90
§
90
§
90
§
90
¢
¢6¢
uegiqYhjeIpXya6
¤
¤
¤
h
¤
h
¤
h
0
4
§
92
4
&!'"($
§
92A
§
92
4
!#"%$
§
92
IRYhjeIpqyuaq
¢
¢
¢6¢
h
¤
¤
¤
h
¤
h
¤
For a given attribute
"
"
'
, let 3
'
¢
denote the best gain possible
using this attribute for splitting the node. If we have
attributes,
B
we are interested in determining , such that
C
Figure 1: StreamTree Algorithm
"
3
'
¢ED
FEGIH
3
'QFI¢
§
6S
T
T
T
S
"
D
VUXW
U
F
IPRQ
Q
We first list the issues in analyzing streaming data. The total size
If more than one attribute satisfies this condition, a pre-defined rule
of the data is typically much larger than the available memory. It
(such as randomization) can be used to select one of these.
is not possible to store and re-read all data from memory. Thus,
(
a single pass algorithm is required, which also needs to meet the
3%4
associated with a potential split point
for a numerical attribute
real-time constraint, i.e. the computing time for each item should
"
§
'
. If
and
are the fractions of data instances with class labels
¤
¤
be less than the interval between arrival times for two consecutive
1 and 2, respectively,
§
and are the estimates computed using
¤
¤
items.
the sample. Similarly, we have the definitions for 0 ©
§
90
, §92 , 4
3
,
¤
¤
¤
A key property that is required for analysis of streaming data is
and E¥ RTe
b
¢
.
¤
that the data instances arriving follow an underlying distribution.
We have,
It implies that if we collect a specific interval of streaming data,
¡
0
©
§
90
©
§
92
¡
¥
RTe
b
4
3
3
¢
¢
we can view them as a random sample taken from the underlying
¤
¤
¤
¤
distribution. It may be possible for a decision tree construction
algorithm to adjust the tree to changes in the distribution of data
!'"($
!'"($
0
§
90
§
90
§
90
§
90
¢
¢6¢
¤
h
¤
¤
h
¤
h
¤
instances in the stream [21], but we do not consider this possibility
here.
!'"($
!'"($
0
§
92
§
92
§
92
§
92
¢
¢
¢6¢
h
¤
¤
¤
h
¤
h
¤
Figure 1 presents a high-level algorithm for decision tree con-
4
4
struction on streaming data. This algorithm forms the basis for our
The value of 3
serves as the estimate of of 3
. Note that we do
E¥Rce
b
¢
presentation in the rest of the paper. The algorithm is based upon
not need to compute
, since we are only interested in
¤
two queues,
and
.
stands for active queue and denotes
the relative values of the gain values associated with different split
¡¢
¡¢
the set of decision tree nodes that we are currently working on ex-
points.
panding.
is the set of decision tree nodes that have not yet been
Now, we consider the procedure to find the best split point using
"
split, but are not currently being processed. This distinction is made
the above estimate of gains. Let 3
'
¢
be the estimate of the best
"
because actively processing each node requires additional memory.
gain that we could get from the attribute '
. Assuming there are B
"
For example, we may need to store the counts associated with each
attributes, we will use the attribute '
, such that
distinct value of each attribute. Therefore, the set AQ is constructed
"
3
'
¢
FEGIH
3
'
¢D§¦
F
from the set
by including as many nodes as possible, till suffi-
§
6S
T
T
T
S
"
D
UXW
U
F
IPRQ
Q
cient memory is available. The algorithm is initiated by putting the
root of the decision tree in the set
.
where ¦ is a small positive number. The above condition (called the
"
The input to the algorithm is a stream of data instances, denoted
statistical test) is used to infer that '
is likely to satisfy the original
by
. We successively obtain a data instance ¥ from this stream.
test for choosing the best attribute, which is
£
ecUQS
We determine the current decision tree node (denoted by
) that
"
3
'
¢&D
FEGH
3
'
¢
F
§
6S
T
T
T
S
"
eTUQS
this data instance belongs to. If
belongs to the set
, then
D
UjW
U
F
fPQ
Q
¡¤
the data instance is added to the set of samples available to process
To describe our confidence of above statistical inference, a param-
eTUQS
node. We then check if
satisfies the stop condition (which
eter
is used.
is the probability that the original test holds if the
can only be applied statistically). If so, ecUQS is removed from
.
¨
¨
¡¤
statistical test holds, and should be as close to 1 as possible. ¦ can
Otherwise, we check if we now have sufficient information to split
be viewed as a function of
and sample size
, i.e.
ecUQS
. If so, the node is split, removed from
, and its two child
¨
©
¥©
¡¢
nodes are added to the set
. The algorithm terminates when both
¡
©
¦
¢
r
¡ ¨
©
¥©
and
are empty.
¡¤
The algorithm, as presented here, is only different from the work
Domingos and Hulten use the Hoeffding bound [19] to construct
by Domingos and Hulten [10] in not assuming that all nodes at one
this function. The specific formula they use is
level of the tree can be processed simultaneously. The memory re-
!
quirements for processing a set of nodes is one of the issues we are
¢6¢
¡
h h
¨
¦
optimizing in our work. If the memory requirements for processing
4
©
¥©
a given node in the tree are reduced, more nodes can be fit into the
set
, and therefore, it is more likely that a given data instance
where
is the spread of the gain function. In this context, where
¡¤
can be used towards partitioning a node.
there are two classes and entropy is used as the impurity function,
¡
Besides the memory requirements, this algorithm also exposes a
. In Section 5, we will describe an alternative approach,
number of other issues in decision tree construction on streaming
which reduces the required sample size.
data. As we can see, one crucial problem here is to decide when
Based upon the probabilistic bound on the splitting condition for
we have sufficient samples available to make the splitting decision.
each node, Domingos and Hulten derive the following result on
Another question is, what information needs to be stored from the
the quality of the resulting decision tree. This result is based on
sample collected in order to make the splitting conditions. Simply
the measurement of intensional disagreement. The intensional dis-
#"
#"
"
§
storing all samples is one, but not the only possibility. Computa-
agreement
between two decision trees
and
is the
!
#"
§
tionally, yet another issue is how we efficiently examine all possible
probability that the path of an example through
will differ
splitting conditions associated with a node. Particularly, the num-
from its path through #" .
ber of distinct values associated with a numerical attribute can be
"&%
very large, and can make it computationally demanding to choose
THEOREM 1. If
is the tree produced by the algorithm
$
#"('
the best split point.
for streaming data with desired accuracy level
,
is the tree
¨
produced by batch processing an infinite training sequence, and
3.2
Using Sampling
is the leaf probability, i.e., the probability that a given example
¤
reaches a leaf node at any given level of the decision tree, then
Here, we review the problem of selecting splitting predicate based
upon a sample. Our discussion assumes the use of entropy as the
'
"
"&%
© #"
¥
¢
g
!
)$
q
h
¨
y¤
gain function, though the approach can be applied to other func-
"
© #"0'
"
©
#"0'
"
%
"
%
tions such as gini.
where
¢
is the expected value of
¢
g
!
)$
q
!
)$
Let
be a sample taken from the dataset . We focus on the gain
taken over an infinite training sequence.
¥
4.
A NEW ALGORITHM FOR HANDLING
cess the detailed information in the pruned interval to get best split
NUMERICAL ATTRIBUTES
point.
In this section, we present our numerical interval pruning ap-
The advantage of this approach is that we do not need to process
proach for making decision tree construction on streaming data
detailed information associated with a pruned interval. This results
more memory and computation efficient.
in a significant reduction in the execution time, but no loss of ac-
curacy. Further, as we will argue towards the end of this section,
we may not even store the detailed information associated with a
4.1
Problems and Our Approach
pruned interval. This further reduces the memory requirements,
One of the key problems in decision tree construction on stream-
but can result in a small loss of accuracy.
ing data is that the memory and computational cost of storing and
processing the information required to obtain the best split gain
can be very high. For categorical attributes, the number of distinct
NIP-Classifier Node
, Stream
Y
`ba
values is typically small, and therefore, the class histogram does
while not zwh
{qzj|
z
{
{
i
p
i}hq
wyhj
i
hj
Y
ga
¡
not require much memory. Similarly, searching for the best split
* Get Some Samples from Stream
*
`
¢
predicate is not expensive if number of candidate split conditions is
£
h
n¦#e¥¤
Y`
t
u%ewiqYqaraq
relatively small.
£
h
h
h%zjz¨§
{}z
¡x
i}e
n
#
¦
u
iqY©¤
aq
However, for numerical attributes with a large number of distinct
h
{}z
h(zgz¨§
{}z
¡x
i}e
¦
uhgw
e
¦
u
iqY©¤
aq
h
h{
|
h
{
values, both memory and computational costs can be very high.
¡x
i}e
bewi
eg
f
hgdXn
i
hjsY¤a6
¡
* Find the best gain *
Many of the existing approaches for scalable, but multi-pass, de-
¢
cision tree construction require a preprocessing phase in which at-
Find Best Gain(ClassHist)
u
~
{
Concise ClassHist
tribute lists for numerical attributes are sorted [25, 27]. Preprocess-
u¦
udXt
vuRYmu
p
a6
¡
* Split *
ing of data, in comparison, is not an option with streaming datasets,
¢
£
h
{qz
{
h
z
hI{
and sorting during execution can be very expensive. Domingos and
if
i
i
i
w
'p
Ve
i
Y
uia
£
{
Hulten have described and evaluated their one-pass algorithm fo-
i
lhXIeIY
a6
h
qdXe
cusing only on categorical attributes [10]. It is claimed that numer-
¡
* Pruning *¢
ical attributes can be processed by allowing predicates of the form
{
Concise ClassHist
udXt
vuY
uip
aq
$
“ "
"
"
'
'
F
”, for each distinct value ' F . This implies a very high
memory and computational overhead for determining the best split
point for a numerical attribute.
We have developed a Numerical Interval Pruning (NIP) approach
for addressing these problems. The basis of our approach is to
Figure 2: NIP Algorithm for Numerical Attributes Handling
partition the range of a numerical attribute into intervals, and then
use statistical tests to prune these intervals. At any given time, an
The main challenge in the algorithm is to effectively but correctly
interval is either pruned or intact. An interval is pruned if it does
prune the intervals. Over-pruning is a situation occurring when an
not appear likely to include the split point. An intact interval is an
interval does not appear likely to include the split point after we
interval that has not been pruned. In our current work, we have
have analyzed a small sample, but could include the split point after
used equal-width intervals, i.e. the range of a numerical attribute is
more information is made available. Under-pruning means that an
divided into intervals of equal width.
interval does not appear likely to include the split point but has
In Section 2.3, we had discussed how we can either store sam-
not yet been pruned. We refer to over-pruning and under-pruning
ples, or create class histograms to have sufficient information to
together as false pruning.
determine the best split condition. In the numerical interval prun-
The pseudo-code for our Numerical Interval Pruning (NIP) algo-
ing approach, we instead maintain the following sets for each node
rithm is presented in Figure 2. Here, after collecting some samples,
that is being processed.
we use small class histograms, concise class histograms, and the
Small Class Histograms: This is primarily comprised of class his-
detailed information from intact intervals and get an estimate of
tograms for all categorical attributes. The number of distinct ele-
the best (highest) gain. This is denoted as 3 . Then, by using 3 ,
ments for a categorical attribute is not very large, and therefore, the
we unprune intervals that look promising to contain the best gain,
3
size of the class histogram for each attribute is quite small. In ad-
based upon the current sample set. The best gain
can come from
!
3
dition, we also add the class histogram for numerical attributes for
or a newly unpruned intervals. Then, by performing a statisti-
which the number of distinct values is below a threshold.
cal test, we check if we can now split this node. If not, we need
to collect more samples. Before that, however, we check if some
Concise Class Histograms: The range of numerical attributes which
additional intervals can be pruned.
have a large number of distinct elements in the dataset is divided
The rest of this section presents more technical details, the cor-
into intervals. For each interval of a numerical attribute, the con-
rectness, and the computational and memory cost optimizations.
cise class histogram records the number of occurrences of instances
with each class label whose value of the numerical attribute is within
4.2
Technical Details
that interval.
In this subsection, we discuss the details of how pruning and
Detailed Information: The detailed information for an interval can
testing for false pruning are performed.
be in one of the two formats, depending upon what is efficient. The
For our discussion here, we initially assume that we have pro-
first format is class histogram for the samples which are within the
cessed the entire dataset, and are trying to prune the space of po-
interval. When the number of samples is large and the number of
tential split points from the range of a numerical attribute. Assume
distinct values of a numerical attribute is relatively small, this for-
that 3 is the best (highest) gain possible from any splitting condi-
©
mat is more efficient. The second format is to simply maintain the
tion associated
with a node of the decision tree. Suppose
"
"
©"
§
'
'
¢
g
set of samples with each class label. It is not necessary to pro-
is the
interval on the numerical attribute 3 . Let the class distri-
C
$#
Y
bution of the interval be
contain the split point only if " D
3
, and can be pruned if this is
C
§
4
&
not the case.
¡
©
©¦
"
"
"
"
¢
$
¡
As described above, this method is useful only if we have already
F
where
©
¥
£¢
¥
(
"
is the number of training records with the
scanned the entire dataset, know the value of the best gain 3 as well
h
¢
(
class label
that fall into the interval , and
is the total number of
as the gains at the interval boundaries. However, we perform this
C
class labels in the training data.
pruning after the sampling step, i.e., by examining only a fraction
We want to determine if any point within this interval can provide
of the training records. As we had described earlier in this section,
a higher gain than 3 . For this, we need to determine the highest
we use a sample and compute small and concise class histograms.
possible value of gain from any point within this interval. For the
Thus, for numerical attributes with a large number of distinct val-
boundary point "
'
, we define the cumulative distribution function
ues, we are processing the class frequencies for only the intervals,
§
and only using the sample. As defined above, 3 is the best gain pos-
4
&
¤¦¥¨§
¡
©©q
¥¨§
¥¨§
¢
sible using the entire dataset and all possible split conditions. Now,
let 3 be the best gain computed using the current sample set. Fur-
where,
!
ther, let 31@ be the best gain noted after using small class histograms,
"
§
W
concise class histograms and the detailed information from the in-
©
¡
©
©
¥
¥
(
§
¥
F
h
tact intervals. This is the value first computed by our algorithm.
§
F
We have,
Here,
©
§
¥
is the number of training records with the class label
¥
3
3
!
such that their value of the numerical attribute under consideration
is less than "
"
©"
§
This follows simply from the fact that 3
is computed from among
'
. Similarly, for the boundary point '
, we have
a subset of the split points that 3 would be computed from.
§
§
!
4
4
&
¤
§
¡
©©q
¥
§
§
¥
"
¥
"
¢
Using the values of gain at interval boundaries, and using the
Lemma 1, we can estimate the upper bound on the gain possible
b
Now, consider any point
between
"
"
©"
§
'
and '
. Let the cumu-
from a split point within a given interval . We denote this value as
C
lative distribution function be
Y
Y
"
, and is an estimate (using the sample) of the value
"
that could
!
§
4
&
¤¦ 2¡
©¨©y
be computed using the entire dataset.
¢
We focus on the function
Clearly, the following property holds.
¡
CY
"
3
!
!
¢
p©
¥
"¢
¥
7(T©8
§
V¥
# l¥
§$%
¥
¥
¢
h
'&
Y
If we have computed
"
3
and
using the entire dataset, we can prune
If we do not have any further information about the class distri-
)
an interval
if
. However, using sampling, we can only
C
!
i
bution of the training records in the interval, all points b satisfying
have estimate of this function. Suppose, we consider
the above constraint need to be considered in determining the high-
¡
Y
"
3
©
est gain possible from within the interval
"
"
©"
§
!
'
'
¢
. Formally, we
!
!
!
g
define the set ©
of possible internal points as all values within the
Based upon our discussion in Section 3.2, if
D
¦
!
, then with
!
©
interval
"
"
©"
§
'
'
¢
that satisfy the constraint .
)
g
&
probability
, we have
, where, ¦ ,
, and the sample size
¨
!
i
¨
The number of points in the set ©
can be very large, making it
are related through a function. Thus, pruning using the condition
computationally demanding to determine the highest possible gain
)
¦
!
gives us a probabilistic bound on the accuracy of pruning,
!
from within the interval. However, we are helped by a well-known
i.e., we could still prune incorrectly, but the probability is low and
mathematical result. To state this result, we define a set
, com-
¥
bounded.
prising of corner points within the interval. Formally,
Further, since we do not even have the value of 3 , we use 3 .
!
¥
3
&
Since 3
, we have
¡
£
)(
(
21
©
¤324(
¥
(
S
)(
¥
¥
"
"
©"
§
!
'
'
¥
©
g
q
10
&
C
r
~C
&
YE"
)
¡
BA
YE"
)
3
¢
¦
3
¢
¦
!
!
!
F
F
F
F
!
¢
p©
¥
"¢1¥7(T©
¡
¡
2
§
2
§$%
¥
¥
¢
#5
¢
h
Thus, after the first step, we perform statistical test using the con-
4
Y
)
"
¦
It is easy to see that the set
has
points. Now, the follow-
dition 3
for all of the pruned intervals . We call
!
C
¥
ing result allows us to compute the highest possible gain from the
an interval over-pruned if it is pruned using the statistical estimates
interval very efficiently.
described above from a small sample, but after the more data in-
stances are available, it turns out not to be the case. Note that to be
LEMMA 1. Let
be a concave gain function. Let ©
be the set
able to unprune intervals, we must not discard the detailed infor-
r
of possible internal points of interval , and
be the set of corner
mation associated with an interval that may be marked as pruned.
C
¥
points. Then
Thus, the algorithm presented so far only reduces the computa-
tional costs, but not the memory costs.
&
&
¤
¥
¤¦2
FEGIH
¢6¢
FEGIH
¢6¢
6
2
Br¡
Br¡
Next, we briefly discuss how we test for under-pruning. The
P
87
P
89
6
Y
algorithm simply computes
"
for intact intervals to see if the fol-
!
The above lemma is derived from a general mathematical theo-
lowing condition holds:
rem (see Magasarian [24]) and was also previously used by Gehrke
YE"
)
3
¢
¦
et al. in the BOAT approach [11]. Note that the gain functions
!
!
entropy and gini are concave functions.
If it is true, it means that this interval is not likely to have the best
By recording the frequency of intervals, computing gains at the
split point and can be pruned.
interval boundaries, and applying this lemma, we can compute the
By combining 3
with the detailed information from the over-
Y
upper bound " of the gain possible from the interval. If we already
pruned intervals, we get a new best gain 3 . We know that this is the
!
know the best gain 3 , we can compare Y " and 3 . The interval can
best gain possible from looking at the current sample size. This is
because if the pruned intervals could not achieve the gain 3
¦
,
mean
0 and the variance (or the square of the standard deviation)
they cannot achieve the gain
£
3
¦
either.
.
!
Suppose, we decide to partition a node using a sample
and the
¥
current best gain 3 . This gain is identical to the best gain that an
THEOREM 3. (Multivariate Delta Method) Let 3
§¦©¦©3¦¥
be
!
algorithm not performing any pruning would achieve by using the
3
¡v3
©¨©3
3
¡
a random sample. Let
"
§B"
"
. Further, let
"
¢
F
6
same sample set, as we formulate through the following theorem.
6#e¢¡
3
©3
¡
©¨
3
3
©3
©¦©3
"
"
"
"
"
§
"`
"
¥
§
and
¢
F
F
F
. Let
be the mean of
©
&
¡
©¦©
and let
§
§
§
§
¢
. For a given function 3 with continuous
6
T
HEOREM 2. The best gain 3
computed using our numerical
!
first partial derivatives, we have
interval pruning approach is the same as the one computed by an
algorithm that uses full class histograms, provided the two algo-
&
3
§©¦©
3
w
¤
©
¤£
3
¢
3
§
¢
¢
6
Hi
rithms use the same sample set.
where,
Proof:This follows from our discussion above.
&
&
Thus, the numerical interval pruning approach we have presented
3
§
¢
3
§
¢
£
¡
¨
"
F
"
does not limit accuracy in any way, as compared to any other algo-
§
§
F
rithm that uses samples.
Proof:See the reference [5], for example.
4.3
Computational and Memory Costs and Op-
Below, we show the application of the above result on the gain
timization
function entropy. This could similarly be applied on the gain func-
Initially, let us focus on the computational costs associated with
tion gini, but we do not present the details here.
the algorithm here. As a new data instance is received we check if
In applying the above result on the entropy function, we consider
it is associated with a node that is being processed. If so, a number
the following. The function 3 is a function of three measurements,
of update operations are performed. The processing time, however,
0
, §90 , and §92 . The three values or measurements are indepen-
¤
¤
¤
is a constant. The dominant computational part is when we want
6#e¢¡
©b
¡xb
dent of each other, i.e. the covariance
'
¢
is 0 if '
.
to determine the best split condition. Unlike in a batch algorithm,
this step may have to repeated several times, till we have a suffi-
¤
LEMMA 2. Let
be the sample size of
,
be the normal
¥
cient statistical confidence to perform the split. This step requires
distribution. Then, for the entropy function 3 , we have
processing the small and concise histograms, and the detailed in-
¥
¡
0
©
§
90
©
§
92
w
¤
©
£
¥
formation associated with intact intervals. Our experiments have
3
3
¢
3
'
£¢
¢
¤
¤
¤
determined that the main cost is associated with the processing of
where,
detailed information. Thus, pruning of intervals is crucial for re-
ducing this cost.
3
£
¡
0
0
¥
¢
¢
In the algorithm presented here, unpruning intervals is a require-
0
q¤
h
¤
¤
ment for provably achieving the same accuracy as in an algorithm
that does not do any pruning. Therefore, we need to maintain and
3
3
§
90
§
90
§
92
§
92
¢
¢
¢
¢
continue to update the detailed information associated with pruned
§
90
q¤
h
¤
§
92
6¤
h
¤
¤
¤
intervals. However, the probability of over-pruning can be shown
to be very small. Therefore, we can modify our original algorithm
Proof:The proof follows from the application of the multivari-
to not store the detailed information associated with pruned inter-
ate delta result (presented above), and the observation that the first
vals. This optimization has two benefits. First, the memory require-
derivatives for entropy are continuous functions (details are omitted
ments are reduced significantly. Second, we can further save on the
here).
computational costs by not having to update detailed information
Next, we focus on the following problem. Assume there is a
associated with a pruned interval.
b
3
©
¡
¢
point
belonging to the attribute
F
. We need to determine
C
)
$
b
if ¥
¥
3
3
or 3
3
, using just the sample
. Because
also
¥
5.
A NEW SAMPLING APPROACH
satisfies the Lemma 2, and ' and b are independent, we have
This section introduces a new approach for choosing the sam-
w
¤
G©£
3
3
¢
ple size. As compared to the Hoeffding inequality [19] based ap-
proach used by Domingos and Hulten [10], our method allows the
Therefore,
same probabilistic accuracy bound to be achieved using signifi-
w
¤
©
£
£
cantly smaller sample sizes.
¥
¥
¥
3
3
3
3
¢
¢
5.1
Exploiting Gain Functions
This leads to the following lemma.
As we have mentioned previously, the one-pass decision tree
construction algorithm by Domingos and Hulten uses Hoeffding
LEMMA 3. Let
inequality to relate the bound on the accuracy
¦
, the probability
,
£
£
¥
¨
¡
¥
%
and the sample size
. Hoeffding bound based result is indepen-
¦
©
¥
©
dent of the distribution of the data instances in the dataset. Here,
we derive another approach, which is still independent of the dis-
where % is the
¢
th percentile of the standard normal distri-
h
¨
tribution of the data instances, but uses properties of gain functions
bution. If ¥
¥
¥
3
3
D
§¦
, then with probability , we have 3
D
3
.
¨
like entropy and gini.
¥
If ¥
¥
¥
3
3
¦
, then with probability
, we have 3
D
3
.
¨
We use the following theorem, also known as the multivariant
delta result [5]. Here, the symbol
'
£¢
denotes the expected value
Proof:The above lemma follows from the application of well known
6#e¢¡
©b
of a variable ' ,
'
¢
denotes
results on simultaneous statistical inference [20].
the covariance of the two vari-
ables
©
¤£
'
and b , and ¤
¢
is the normal distribution with the
We call the above test the Normal test.
Hi
5.2
Sample Size Problem
and stores samples to evaluate candidate split conditions.
Once a desired level of accuracy
is chosen, the key issue with
ClassHist-H uses Hoeffding bound and creates full class his-
¨
the performance of a one-pass algorithm is the sample size selec-
tograms. NIP-H and NIP-N use numerical interval pruning, with
tion problem, i.e. how large a sample is needed to find the best split
Hoeffding bound and the normal distribution of entropy function,
point with the probability
. Specifically, we are interested in the
respectively. The version of NIP that we implemented and eval-
¨
sample size that could separate ¥¡
uated creates intervals after 10,000 samples have been read for
3
and 3
, where '¤£ and '¤¥ are
¢
the points that maximize the gain of split function for the top two
a node, performs interval pruning, and then deletes the samples.
3
3
attributes
Thus, unpruning is not an option here, and therefore, the accuracy
£
and
¥
.
Let ¥¡ l
¡
can be lower than an approach that uses full class histograms. Our
3
3
¦
. Thus, by normal distribution, the required
sample size is
implementation used a memory bound of 60 MB for all four ver-
sions. Consistent with what was reported for the implementation
£
£
%
¥
of Domingos and Hulten, we performed attribute pruning, i.e., did
¤
¡
¥
¦
not further consider an attribute that appeared to show poor gains
The required sample size from Hoeffding bound is
after some samples were analyzed.
Figure 3 shows the average number of nodes in the decision tree
!
¢6¢
¤
¡
generated using functions 1, 6, and 7, and using noise levels of
h
x h
¨
¦
0%, 2%, 4%, 6%, 8%, and 10%, respectively. This number does
Comparing the above two equations, we have the following re-
not change in any significant way over the four different versions
sult.
we experimented with. As expected, the size of the decision tree
increases with the level of noise in data.
THEOREM 4. The sample size required using the normal test
One interesting question is, what inaccuracy may be introduced
will always be less or equal to the sample size required for the
by our version of NIP-H, since it does not have the option of un-
Hoeffding test, i.e.,
pruning. Figure 4 shows the increase in inaccuracy for NIP-H,
¤
¥
¤
¥
as compared to the average of inaccuracy from Sample-H and
ClassHist-H. As can be seen from the figure, there is no sig-
Proof:This follows from comparing the two equations above.
nificant chance in inaccuracy. Note that whenever a different set
of data instances are used to split a node, the computed inaccuracy
6.
EXPERIMENTAL RESULTS
value can be different. Similarly, Figure 5 shows the increase in
inaccuracy for NIP-N, as compared to the average of inaccuracy
In this section, we report on a series of experiments designed
from Sample-H and ClassHist-H. Again, there is no signifi-
to evaluate the efficacy and performance of our new techniques.
cant change, and the average value of the difference is very close to
Particularly, we are interested in evaluating 1) the advantages of
zero.
using Numerical Interval Pruning (NIP), and 2) the advantages of
Figures 6, 7, and 8 show the execution times for decision tree
using normal distribution of the estimate of entropy function, as
construction with the four versions and different levels of noise, for
compared to Hoeffding’s bound.
functions 1, 6, and 7, respectively. Through-out, we will focus on
The datasets we used for our experiments were generated using a
comparing the performance of NIP-N and NIP-H with the better
tool described by Agrawal et al. [1]. There were two reasons for us-
one between Sample-H and ClassHist-H, which we denote
ing these datasets. First, these datasets have been widely used for
by existing.
evaluating a number of existing efforts on scalable decision con-
Initially, we focus on functions 6 and 7. For function 6, the exe-
struction [13, 11, 25, 27]. Second, the only real datasets that we are
cution times of NIP-H are between 40% and 70% of the execution
aware of are quite small in size, and therefore, were not suitable
time of existing. Moreover, NIP-N further reduces the exe-
for our experiments. The datasets we generated had 10 million
cution time by between 7% and 80%. For function 7, the execu-
training records, each with 6 numerical attributes and 3 categorical
tion times with NIP-H are between 35% and 75% of existing.
attributes. We used the functions 1, 6, and 7 for our experiments.
NIP-N further reduces the execution times by between 3% and
For each of these functions, we generated separate datasets with
65%. Results are relatively mixed from using the function 1. Our
0%, 2%, 4%, 6%, 8%, and 10% noise.
best version NIP-N is significantly better in 3 of the 6 cases, but
In growing the decision tree, our implementation did not expand
quite comparable (i.e. either marginally better or worse) in other
a node any further if one of the following conditions were true:
3 cases. This is because with this function, the decision to choose
95% or greater fraction of the training records had the same class
the best split condition can usually be made by examining only a
label, the depth of the node was 12, or less than 1% of all training
small number of samples. Therefore, the use of numerical interval
records were associated with the node. For each node, we start
pruning does not give better results in many cases.
evaluation for finding split condition after at least 10,000 records
We next compare these four version using two metrics we con-
associated with the node had been read. Further, we reevaluated
sider important. These metrics are, total instances read (TIR), and
each node every time after another 5,000 records had been read.
instances actively processed (IAP). TIR is the number of samples
The range of each numerical attribute was divided into 500 equal
or data instances that are read before the decision tree converges.
sized
W
¤¦
intervals. The value of
used in our experiments was
¨
h
When a sample is read, it cannot always be used as part of the al-
. All our experiments were conducted on a 700 MHz Intel
hi
gorithm. This is because it may be assigned to a node that does
Pentium III machine, with 1 GB of SDRAM and a 18 GB disk with
not need to be expanded any further, or is not being processed cur-
15000 rpm Ultra 160 SCSI drive.
rently because of memory considerations. Therefore, we measure
The results from experiments designed to evaluate the NIP ap-
IAP as the number of data instances that were used for evaluating
proach and the benefits of using normal distribution of the estimate
candidate split conditions. Figures 9, 10, and 11 show TIR for the
of entropy function are reported together. We created 4 different
four versions and for functions 1, 6, and 7, respectively. The use
versions, all based upon the basic StreamTree algorithm presented
of class histograms results in high memory requirements, which
in Figure 1. Sample-H is the version that uses Hoeffding bound,
800
10
10
F1
NIP-H-F1
NIP-N-F1
700
F6
8
NIP-H-F6
8
NIP-N-F6
F7
NIP-H-F7
NIP-N-F7
600
6
6
500
4
4
400
2
2
300
0
0
Number of Nodes 200
-2
-2
100
Relative Inaccuracy (%)
Relative Inaccuracy (%)
-4
-4
0
0
0.02
0.04
0.06
0.08
0.1
0
0.02
0.04
0.06
0.08
0.1
0
0.02
0.04
0.06
0.08
0.1
Noise Factor
Noise Factor
Noise Factor
Figure 3: Concept Size
Figure 4: Inaccuracy with NIP
Figure 5: Inaccuracy with Normal
500
600
1000
ClassHist-H
ClassHist-H
ClassHist-H
450
Sample-H
Sample-H
Add New Comment