Chapter 1
A SURVEY OF QUERY AUDITING TECHNIQUES
FOR DATA PRIVACY
Shubha U. Nabar
Stanford University
sunabar@cs.stanford.edu
Krishnaram Kenthapadi
Microsoft Search Labs
krishnaram.kenthapadi@microsoft.com
Nina Mishra
University of Virginia
nmishra@cs.virginia.edu
Rajeev Motwani
Stanford University
rajeev@cs.stanford.edu
Introduction
This chapter is a survey of query auditing techniques for detecting and
preventing disclosures in a database containing private data. Informally,
auditing is the process of examining past actions to check whether they
were in conformance with official policies. In the context of database
systems with specific data disclosure policies, auditing is the process of
examining queries that were answered in the past to determine whether
answers to these queries could have been used by an individual to ascer-
tain confidential information forbidden by the disclosure policies. Tech-
niques used for detecting disclosures could potentially also be used or
extended to prevent disclosures, and so in addition to the retroactive au-
diting mentioned above, researchers have also studied an online variant
2
of the auditing problem wherein the task of an online auditor is to deny
queries that could potentially cause a breach of privacy.
Other common approaches to tackling the disclosure prevention prob-
lem include adding noise to the the data or otherwise perturbing the
query results supplied to the user. However statisticians are generally
averse to potential biases introduced by adding noise. One commonly
stated reason is that the data collection process is already prone to biases
and imperfections due to factors such as too few respondents, the cost of
gathering data, and inaccurate answers provided by respondents. Since
important decisions are made based on this data, they prefer to receive
answers without additional noise. It is in this context that query re-
striction techniques become relevant in disclosure prevention. The work
on offline (or retroactive) auditing has also similarly focused on the case
where answers supplied to users are exact.
The main focus of this chapter will be on statistical databases with a
single private attribute that only permit aggregate queries such as sum,
max, min or median over this private attribute. An instructive example is
a company database with employee salary as a private attribute. Or a set
of medical records with a boolean private attribute indicating whether or
not a patient was HIV-positive. We will first review the most commonly
used notion of disclosure in the statistical database literature called full
disclosure and then review algorithms and hardness results for offline
auditing that have been developed for different classes of queries under
this definition.
A natural question to ask is whether offline auditors could directly
be used as online auditors as well. The answer to the question, as we
shall see, is no due to the fact that query denials can leak information.
Researchers have proposed the paradigm of simulatability to surmount
this problem, and developed simulatable auditors for different classes of
queries to prevent full disclosure. We will review some of them.
The notion of full disclosure is not entirely satisfactory as a measure
of disclosure, so we will next present a recently proposed measure called
partial disclosure as well as simulatable online auditors that have been
proposed for different classes of queries under this definition. We will
conclude the chapter with a brief survey of results in another auditing
scenario where the information to be protected is an arbitrary view of the
database; and finally end with a discussion of the limitations of present
day auditing techniques.
A Survey of Query Auditing Techniques for Data Privacy
3
1.
Auditing Aggregate Queries
Most work on aggregate queries has focused on the case of a single
numerical private attribute that is either real valued (from a bounded or
unbounded range) or boolean. Additionally, most auditing algorithms
developed are for queries of only one kind, with hardness results for
auditing combinations of queries. Before proceeding further, we will
formalize some of the terminology used in the remainder of this section.
Let X = {x1, . . . , xn} be the set of private attribute values of n indi-
viduals in a database. An aggregate query q = (Q, f ) specifies a subset
of the records Q ⊆ {1, . . . , n} and a function f such as sum, max, min or
median. The result, f (Q), is f applied to the subset {xi | i ∈ Q}. We
call Q the query set of q.
1.1
Offline Auditing
We now survey some of the results in the offline auditing literature.
1.1.1
Full Disclosure.
Given the set of private values X
and a set of aggregate queries Q = {q1, . . . , qt} posed over this data set
that were correspondingly answered {a1, . . . , at}, the goal of an offline
auditor is to determine if an individual’s private value can be deduced.
Traditionally, the definition of disclosure that has been used is the notion
of full disclosure defined below.
Definition 1.1 (Full Disclosure) An element xi ∈ X is fully dis-
closed by a query set Q if it can be uniquely determined, i.e., in all
possible data sets X consistent with the answers a1, . . . , at, to queries
q1, . . . , qt, xi is the same.
As a simple example, if the query set consisted of a single query asking
for the sum of the salaries of all the female employees in the company,
and Alice was the only female employee in the company, then the answer
to this query uniquely determines Alice’s salary.
In general the answers to many different queries can be stitched to-
gether by a user to uniquely determine an individual’s private value.
The goal of the auditor then is to prevent such a full disclosure.
1.1.2
Examples of Offline Auditors.
As one example of
such an auditor, consider a set of sum queries posed over X, the ele-
ments of which are real-valued from an unbounded range. To determine
if the answers to these queries can be used to uniquely deduce some
private value, the auditor essentially needs to solve a system of linear
equations. It maintains a matrix where the rows correspond to queries
4
and the columns to private values. Each query is represented by a vector
of 1s and 0s, indexing the private elements that were in the sum query.
The matrix of query vectors is diagonalized via a series of elementary
row operations and column interchanges. If the resulting matrix has a
row with only one 1 and n − 1 0s, then some element is uniquely deter-
mined. Since only a linearly independent set of query vectors need to be
examined, the matrix is of size at most n × n, and the diagonalization
can be carried out in time O(n3). Since finding a maximal set of linearly
independent query vectors requires O(n2|Q|) time, sum queries can be
audited in polynomial time.
Theorem 1.2 Let X ∈ Rn be a data set of private values. There is an
algorithm to determine if an xi ∈ X is fully disclosed by a set of sum
queries Q and corresponding answers A that runs in time O(n3 +n2|Q|).
Besides sum queries, offline auditors for exact determination of full
disclosure also exist for combinations of max and min queries, median
queries and average queries over real-valued data. Unfortunately, no
significant progress has been made in auditing arbitrary combinations
of aggregate queries. For example, the following hardness result has been
proved via a reduction from set partition.
Theorem 1.3 There is no polynomial time full-disclosure auditing al-
gorithm for sum and max queries unless P=NP.
The auditing problem has also been examined when the private at-
tribute is boolean. Surprisingly, full-disclosure auditing of sum queries
over boolean data is coNP-hard. There exists an efficient polynomial
time algorithm, however, in the special case where the queries are 1-
dimensional, i.e., for some ordering of the elements in X, the query set
for each query involves a consecutive sequence of xi’s. Considering such
restrictions of the general auditing problem is useful in practice, since in
reality, users would rarely be able to pose queries over arbitrary subsets
of the data. Rather, they would use conditions over some attribute or
combinations of attributes to select specific records in the data set to
aggregate. For example, a realistic query would ask for the total num-
ber of HIV-positive people in a particular age group. The set of queries
asking for the total number of HIV-positive people in various age groups
would form a set of 1-dimensional sum queries over a boolean private at-
tribute. Such assumptions about the structure of queries can yield even
more efficient auditors. For example, the sum auditor over real-valued
data can be made to run in linear time over 1-dimensional sum queries.
A Survey of Query Auditing Techniques for Data Privacy
5
1.2
Online Auditing
In recent years, researchers have also become interested in the on-
line auditing problem as a means of preventing data disclosure. Given
a sequence of queries, q1, . . . , qt−1 that have already been posed, corre-
sponding answers a1, . . . , at−1 that have already been supplied, and a
new query qt, the task of an online auditor is to determine if the new
query should be answered as such, or denied in order to prevent a pri-
vacy breach. Here each of the previous answers ai, is itself either the
true answer fi(Qi) to query qi, or a “denial”.
The earliest online auditors prevented disclosures by restricting the
size and overlap of queries that could be answered. For the case of sum
queries, for instance, it was shown that for queries with query sets of
exactly k elements, each pair of query sets overlapping in at most r
elements, any data set can be compromised in (2k − (l + 1))/r queries
by an attacker who knows l values a priori. For fixed k, r and l, if the
auditor denies answers to query (2k − (l + 1))/r and on, then the data
set is definitely not compromised, i.e., no private value can be uniquely
determined. Such an auditing scheme is rather limited: if k = n/c for
some constant c and r = 1, then after only a constant number of distinct
queries, the auditor would have to deny all further queries since there
are only about c queries where no two overlap in more than one element.
This motivated a search for auditors that could provide greater utility.
The next natural question is whether offline auditors can directly solve
the online auditing problem. Whenever a new query is posed, the online
auditor checks to see if the answer to this query in combination with all
previous query responses can be used to uniquely determine a private
value. If so, the query is denied, else it is answered exactly. While it
would seem that such an approach should work, in actuality it does not
as we demonstrate next.
Example where Denials Leak: Suppose that the underlying data
set is real-valued and that a query is denied only if some value is fully
disclosed. Suppose that the attacker poses the first query sum(x1, x2, x3)
and the auditor answers 15. Suppose also that the attacker then poses
a second query max(x1, x2, x3) and the auditor denies the answer. The
denial tells the attacker that if the true answer to the second query
were given then some value could be uniquely determined. Note that
max(x1, x2, x3) < 5 since then the sum could not be 15. Further, if
max(x1, x2, x3) > 5 then the query would not have been denied since no
value could be uniquely determined. Consequently, max(x1, x2, x3) = 5
and the attacker learns that x1 = x2 = x3 = 5 — a privacy breach of all
6
three entries. The issue here is that denials reduce the space of possible
consistent solutions, and we have not explicitly accounted for this.
In this example only a few values were compromised. However, it is
possible to construct examples where a large fraction of private values
can be uniquely determined. Intuitively, denials that depend on the
answer to the current query leak information because users can ask why
a query was denied, and the reason is in the data. If the decision to
answer or deny a query depends on the actual data, it reduces the set
of possible consistent solutions for the underlying data.
Another naive solution to the leakage problem is to deny whenever the
offline algorithm does, and to also randomly deny queries that would nor-
mally be answered. While this solution seems appealing, it has its own
problems. Most importantly, although it may be that denials leak less
information, leakage is not generally prevented. Furthermore, the au-
diting algorithm would need to remember which queries were randomly
denied, since otherwise an attacker could repeatedly pose the same query
until it was answered. A difficulty then arises in determining whether
two queries are equivalent. The computational hardness of this problem
depends on the query language, and may be intractable, or even unde-
cidable. As a work around to this problem, the simulation paradigm
(used vastly in cryptography) was proposed and is described next.
1.2.1
Simulatable Auditing.
The idea for simulatable au-
diting came from the following observation: Query denials have the
potential to leak information if in choosing to deny, the auditor uses
information that is unavailable to the attacker (the answer to the newly
posed query). A successful attacker capitalizes on this leakage to in-
fer private values. The requirement of a simulatable auditor then, is
that the attacker should be able to simulate or mimic the auditors de-
cisions to answer or deny a query. In such a scenario, because the at-
tacker can equivalently determine for himself when his queries will be
denied, denials provably do not leak information. More formally, let
Q = {q1, . . . , qt} be any sequence of queries and A = {a1, . . . , at} be
their corresponding answers. Here each ai is either the exact answer
fi(Qi) to query qi on the data set X, or a denial.
Definition 1.4 (Online Auditor) An online auditor B is a function
of Q, A and X that returns as output either an exact answer to qt or a
denial.
Definition 1.5 (Simulatable Auditor) An online auditor B is sim-
ulatable, if there exists another auditor B′ that is a function of only Q
and A \ at and whose output on qt is always equal to that of B.
A Survey of Query Auditing Techniques for Data Privacy
7
An attractive property of simulatable auditors is that the auditor’s
response to denied queries does not convey any new information to the
attacker (beyond what is already known given the answers to the pre-
vious queries). Hence denied queries need not be taken into account in
future decisions that the auditor makes.
Note that the auditor that restricted the size and overlap of queries
was simulatable since it never actually looked at the answers to queries
in choosing to deny. As another example of a simulatable auditor, the
sum auditor over real-valued data from Section 1.1.2 is also simulatable
since all that is examined in making the decision to deny or answer
is the matrix of query vectors and never the actual answers to any of
the queries, let alone the answer to the current query. In contrast to
the query-size-and-overlap-restricting auditor, this auditor has also been
shown to provide fairly high utility for large data sets — in a sequence
of random sum queries over a data set, the first denial can be expected
to occur only after a linear number of queries.
A more general sufficient condition for ensuring simulatability is that
in making its decision, with each new query, the auditor should deter-
mine if there is any possible data set, consistent with all past responses,
in which the answer to the current query would cause some element to be
fully disclosed. If so, the query should be denied, else it can be answered.
Since this is a condition that an attacker could check for himself and pre-
dict denials, denials leak no information. Using this idea, simulatable
online auditors have been constructed for max and min queries.
In the example from the previous section, the query q1 = sum{x1, x2, x3}
would be answered, since no matter the answer, no element from the data
set could be uniquely pinned down. The second query q2 = max{x1, x2, x3}
would always be denied, since there is a possible answer to this query,
consistent with the answer to q1 that would cause a private value to
be uniquely determined. Note that if the actual answer to q2 had been
greater than 1 f
3 1(Q1), q2 would in reality have been safe to answer, and
thus we lose some utility due to the requirement of simulatability.
1.2.2
Partial Disclosure.
The notion of full disclosure as a
measure of privacy breach has certain shortcomings. Even if a private
value cannot be uniquely determined, it might still be determined to
lie in a tiny interval, or even in a large interval with a heavily skewed
distribution — and some might consider this to be sufficient disclosure.
Researchers proposed a new definition of privacy to mitigate this issue
by modeling the change in an attacker’s confidence about the values
of private data points. In this definition, it is assumed that the data is
drawn from some distribution D on (−∞, ∞)n that is known to both the
8
attacker and the auditor. See Section 1.2.3 for some discussion about
this assumption.
Let Q = {q1, . . . , qt} be a sequence of queries on the data set X and
let A = {a1, . . . , at} be the corresponding answers. Here each ai is either
the true answer to query qi on X or a denial. We allow the auditor to
be randomized, i.e., it’s decision to answer or deny a query need not be
deterministic.
Definition 1.6 (Randomized Auditor) A randomized auditor is a
randomized function of Q, A, X and D that returns as output either an
exact answer to qt on X or a denial.
We say that the sequence of queries and corresponding answers is λ-
safe for an element xi and an interval I ⊆ (−∞, ∞) if the attacker’s con-
fidence that xi ∈ I does not change significantly upon seeing the queries
and answers. Consider for example a private value such as salary: if a
sequence of queries and answers does not change an attacker’s confidence
about a private individual’s salary, then the sequence is safe.
Definition 1.7 (λ-safe) The sequence of queries and answers, q1, . . . , qt,
a1, . . . , at is said to be λ-safe with respect to a data element xi and an
interval I ⊆ (−∞, ∞) if the following Boolean predicate evaluates to 1:
Safeλ,i,I(q1, . . . , qt, a1, . . . , at) =
1 if 1/(1 + λ) ≤ PrD(xi∈I|q1,...,qt,a1,...,at) ≤ (1 + λ)
PrD(xi∈I)
0 otherwise
Partial disclosure is defined in terms of the following predicate that
evaluates to 1 if and only if q1, . . . , qt, a1, . . . , at is λ-safe for all entries
and all intervals1:
AllSafeλ(q1, . . . , qt, a1, . . . , at) =
(1.1)
1 if Safeλ,i,I(q1,...,qt,a1,...,at)=1, for every i∈[n] and
every interval I
0 otherwise
We now turn to the privacy definition. Consider the following (λ, T )-
privacy game between an attacker and an auditor, where in each round
t (for up to T rounds):
1 The attacker (adaptively) poses a query qt = (Qt, ft).
1In reality, the privacy definition only considers all intervals that have a significant prior
probability mass.
A Survey of Query Auditing Techniques for Data Privacy
9
2 The auditor determines whether qt should be answered. The au-
ditor responds with at = ft(Qt) if qt is allowed and with at =
“denied” otherwise.
3 The attacker wins if AllSafeλ(q1, . . . , qt, a1, . . . , at) = 0.2
Definition 1.8 (Private Randomized Auditor) An auditor is (λ, δ, T )-
private if for any attacker A
Pr[A wins the (λ, T )-privacy game] ≤ δ .
Here the probability is taken over the distribution D that the data comes
from and the coin tosses of the auditor and the attacker.
Since here too, one would like to ensure that denials leak no infor-
mation, the condition of simulatability is imposed on auditors that are
designed. Consider Q, A and X as before. Then,
Definition 1.9 (Simulatable Randomized Auditor) A randomized
auditor B is simulatable, if there exists another auditor B′ that is a ran-
domized function of only Q, A \ at and D such that the output of B′ on
qt is computationally indistinguishable from that of B.
1.2.3
Discussion on Privacy Definition.
Note that the
above definition of privacy makes the assumption that the distribution
from which the data is drawn is known to the attacker. In reality it need
not be. In this scenario, the predicate AllSafe needs to be evaluated with
respect to the attacker’s prior distribution, since compromise occurs only
if there is a substantial change in his beliefs. However, if the attacker’s
distribution can be arbitrarily far from the true data distribution, there is
not much that the auditor can release without causing partial disclosure
of some private value, since it is required to release exact answers if at
all. For example, consider a database that contains height as a private
attribute, and consider an attacker whose prior belief is that all men
are less than a foot tall. If by querying the data, the attacker suddenly
learns that this is not true and there is substantial change in his posterior
distribution, the privacy breach would be massive. In reality, his prior
beliefs are so far off the mark, that there is no aggregate query about
the heights that the auditor can truthfully answer without compromising
privacy, not even the average height of all people in the database.
Instead the data distribution that we assume the auditor and the at-
tacker share is supposed to represent such common sense facts and it
2Hereafter, we will refer to the predicates without mentioning the queries and answers for
the sake of clarity.
10
repeat
Generate
Do q , ..., q
Yes, often
1
t
Deny
q , ..., q
1
t
consistent answer
a , ..., a
, a ’
1
t−1
t
a’ to q
t
t
cause a privacy
a , ..., a
1
t−1
No, most of the time
breach?
Answer
Figure 1.1.
Skeleton of a simulatable private randomized auditor
allows for more useful information to be released. There are many cir-
cumstances where such an assumption is realistic. For example, distri-
butions of attributes such as age or salary may be known from previous
data releases or even published by the auditor itself.
1.2.4
A General Approach for Constructing Private Ran-
domized Auditors.
A query is thus safe to answer if doing so is not
likely to cause a significant change in the attacker’s confidence that an xi
lies in any interval. Also, the decision to deny must be simulatable. We
now describe a general approach that could be used to construct such
simulatable randomized auditors. Figure 1.1 gives a high level picture.
The basic idea is to have the auditor generate random data sets (of
n private values) consistent with answers to past queries. The data sets
are generated according to the distribution D conditioned on the past
answers. The auditor then checks to see if answering the new query
on these random data sets causes a significant change in the attacker’s
confidence about any xi. If the answer is ‘no’ for a sizable fraction of the
generated data sets, the query is safe to answer. Since the true answer to
the query is never looked at in this process, the auditors are simulatable
and denials provably do not leak information.
The left circle in Figure 1.1 thus represents the process of generating a
possible answer a′t to the new query according to D conditioned on past
answers, and the right circle represents the evaluation of the predicate
AllSafe (Equation 1.1) that checks to see whether privacy is violated
for any xi and any interval I if a′t were revealed in conjunction with all
previous answers. For each new query this procedure is repeated many
times, and the decision to deny is based on the fraction of sampled con-
sistent answers that cause a privacy breach. By repeating often enough
and choosing an appropriate cut-off for denials, it can be shown using
Add New Comment