In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, Philadelphia.
Parsing the Wall Street Journal using a Lexical-Functional Grammar and
Discriminative Estimation Techniques
Stefan Riezler, Palo Alto Research Center, Palo Alto, CA 94304, riezler@parc.com
Tracy H. King, Palo Alto Research Center, Palo Alto, CA 94304, thking@parc.com
Ronald M. Kaplan, Palo Alto Research Center, Palo Alto, CA 94304, kaplan@parc.com
Richard Crouch, Palo Alto Research Center, Palo Alto, CA 94304, crouch@parc.com
John T. Maxwell III, Palo Alto Research Center, Palo Alto, CA 94304, maxwell@parc.com
Mark Johnson, Brown University, Providence, RI 02912, mj@cs.brown.edu
Abstract

We present a stochastic parsing system consisting of a Lexical-Functional Grammar (LFG), a constraint-based parser and a stochastic disambiguation model. We report on the results of applying this system to parsing the UPenn Wall Street Journal (WSJ) treebank. The model combines full and partial parsing techniques to reach full grammar coverage on unseen data. The treebank annotations are used to provide partially labeled data for discriminative statistical estimation using exponential models. Disambiguation performance is evaluated by measuring matches of predicate-argument relations on two distinct test sets. On a gold standard of manually annotated f-structures for a subset of the WSJ treebank, this evaluation reaches 79% F-score. An evaluation on a gold standard of dependency relations for Brown corpus data achieves 76% F-score.

1 Introduction

Statistical parsing using combined systems of hand-coded linguistically fine-grained grammars and stochastic disambiguation components has seen considerable progress in recent years. However, such attempts have so far been confined to a relatively small scale for various reasons. Firstly, the rudimentary character of functional annotations in standard treebanks has hindered the direct use of such data for statistical estimation of linguistically fine-grained statistical parsing systems. Rather, parameter estimation for such models had to resort to unsupervised techniques (Bouma et al., 2000; Riezler et al., 2000), or training corpora tailored to the specific grammars had to be created by parsing and manual disambiguation, resulting in relatively small training sets of around 1,000 sentences (Johnson et al., 1999). Furthermore, the effort involved in coding broad-coverage grammars by hand has often led to the specialization of grammars to relatively small domains, thus sacrificing grammar coverage (i.e. the percentage of sentences for which at least one analysis is found) on free text. The approach presented in this paper is a first attempt to scale up stochastic parsing systems based on linguistically fine-grained hand-coded grammars to the UPenn Wall Street Journal (henceforth WSJ) treebank (Marcus et al., 1994).

The problem of grammar coverage, i.e. the fact that not all sentences receive an analysis, is tackled in our approach by an extension of a full-fledged Lexical-Functional Grammar (LFG) and a constraint-based parser with partial parsing techniques. In the absence of a complete parse, a so-called “FRAGMENT grammar” allows the input to be analyzed as a sequence of well-formed chunks. The set of fragment parses is then chosen on the basis of a fewest-chunk method. With this combination of full and partial parsing techniques we achieve 100% grammar coverage on unseen data.

[Figure 1: FRAGMENT c-/f-structure for The golden share was scheduled to expire at the beginning of. The c-structure is a FRAGMENTS node dominating an S[fin] chunk and a TOKEN chunk for "of"; the f-structure has PRED 'schedule<NULL, [132:expire]>[11:share]', with SUBJ 'share' and XCOMP 'expire'.]

Another goal of this work is the best possible exploitation of the WSJ treebank for discriminative estimation of an exponential model on LFG parses. We define discriminative or conditional criteria with respect to the set of grammar parses consistent with the treebank annotations. Such data can be gathered by applying labels and brackets taken from the treebank annotation to the parser input. The rudimentary treebank annotations are thus used to provide partially labeled data for discriminative estimation of a probability model on linguistically fine-grained parses.

Concerning empirical evaluation of disambiguation performance, we feel that an evaluation measuring matches of predicate-argument relations is more appropriate for assessing the quality of our LFG-based system than the standard measure of matching labeled bracketing on section 23 of the WSJ treebank. The first evaluation we present measures matches of predicate-argument relations in LFG f-structures (henceforth the LFG annotation scheme) to a gold standard of manually annotated f-structures for a representative subset of the WSJ treebank. The evaluation measure counts the number of predicate-argument relations in the f-structure of the parse selected by the stochastic model that match those in the gold standard annotation. Our parser plus stochastic disambiguator achieves 79% F-score under this evaluation regime.

Furthermore, we employ another metric which maps predicate-argument relations in LFG f-structures to the dependency relations (henceforth the DR annotation scheme) proposed by Carroll et al. (1999). Evaluation with this metric measures the matches of dependency relations to Carroll et al.’s gold standard corpus. For a direct comparison of our results with Carroll et al.’s system, we computed an F-score that does not distinguish different types of dependency relations. Under this measure we obtain 76% F-score.

This paper is organized as follows. Section 2 describes the Lexical-Functional Grammar, the constraint-based parser, and the robustness techniques employed in this work. In section 3 we present the details of the exponential model on LFG parses and the discriminative statistical estimation technique. Experimental results are reported in section 4. A discussion of results is in section 5.

2 Robust Parsing using LFG

2.1 A Broad-Coverage LFG

The grammar used for this project was developed in the ParGram project (Butt et al., 1999). It uses LFG as a formalism, producing c(onstituent)-structures (trees) and f(unctional)-structures (attribute value matrices) as output. The c-structures encode constituency. F-structures encode predicate-argument relations and other grammatical information, e.g., number, tense. The XLE parser (Maxwell and Kaplan, 1993) was used to produce packed representations, specifying all possible grammar analyses of the input.
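
To make the notion of an f-structure as an attribute value matrix more concrete, the following is a minimal sketch that encodes a simplified version of the f-structure in Fig. 1 as nested Python dictionaries. The dictionary encoding, the selection of attributes, and the helper name `relations` are illustrative assumptions; they are not the representation used by XLE. The point of the sketch is that one f-structure (here the SUBJ 'share') can be the value of several attributes at once, and that predicate-argument relations can be read off by walking the grammatical functions.

```python
# Simplified f-structure for the S chunk of Fig. 1, written as nested dicts.
# Attribute names follow the figure; the dict encoding itself is an assumption.
share = {"PRED": "share", "CASE": "nom", "NUM": "sg", "PERS": 3,
         "ADJUNCT": [{"PRED": "golden", "ATYPE": "attributive"}]}
beginning = {"PRED": "beginning", "CASE": "acc", "NUM": "sg", "PERS": 3}
expire = {"PRED": "expire", "SUBJ": share, "PASSIVE": "-",
          "ADJUNCT": [{"PRED": "at", "OBJ": beginning}]}
schedule = {"PRED": "schedule", "SUBJ": share, "XCOMP": expire,
            "PASSIVE": "+", "TNS-ASP": {"MOOD": "indicative", "TENSE": "past"}}

# Hypothetical subset of grammatical functions, enough for this example.
GRAMMATICAL_FUNCTIONS = ("SUBJ", "OBJ", "XCOMP", "ADJUNCT")


def relations(fs, seen=None):
    """Collect (function, head PRED, dependent PRED) triples.

    Each f-structure is expanded only once, so shared values such as the
    SUBJ of 'schedule' and 'expire' do not lead to duplicated subtrees.
    """
    seen = set() if seen is None else seen
    if id(fs) in seen:
        return []
    seen.add(id(fs))
    triples = []
    for gf in GRAMMATICAL_FUNCTIONS:
        if gf not in fs:
            continue
        value = fs[gf]
        dependents = value if isinstance(value, list) else [value]
        for dep in dependents:
            triples.append((gf.lower(), fs["PRED"], dep["PRED"]))
            triples.extend(relations(dep, seen))
    return triples


if __name__ == "__main__":
    for triple in relations(schedule):
        print(triple)
```

In this encoding the shared subject is literally the same object under both predicates, which mirrors the index [11:share] appearing twice in the figure.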
The grammar has 314 rules with regular expression right-hand sides which compile into a collection of finite-state machines with a total of 8,759 states and 19,695 arcs. The grammar uses several lexicons and two guessers: one guesser for words recognized by the morphological analyzer but not in the lexicons and one for those not recognized. As such, most nouns, adjectives, and adverbs have no explicit lexical entry. The main verb lexicon contains 9,652 verb stems and 23,525 subcategorization frame-verb stem entries; there are also lexicons for adjectives and nouns with subcategorization frames and for closed class items.

For estimation purposes using the WSJ treebank, the grammar was modified to parse part of speech tags and labeled bracketing. A stripped down version of the WSJ treebank was created that used only those POS tags and labeled brackets relevant for determining grammatical relations. The WSJ labeled brackets are given LFG lexical entries which constrain both the c-structure and the f-structure of the parse. For example, the WSJ’s ADJP-PRD label must correspond to an AP in the c-structure and an XCOMP in the f-structure. In this version of the corpus, all WSJ labels with -SBJ are retained and are restricted to phrases corresponding to SUBJ in the LFG grammar; in addition, it contains NP under VP (OBJ and OBJth in the LFG grammar), all -LGS tags (OBL-AG), all -PRD tags (XCOMP), VP under VP (XCOMP), SBAR- (COMP), and verb POS tags under VP (V in the c-structure). For example, our labeled bracketing of wsj_1305.mrg is [NP-SBJ His credibility] is/VBZ also [PP-PRD on the line] in the investment community.

Some mismatches between the WSJ labeled bracketing and the LFG grammar remain. These often arise when a given constituent fills a grammatical role in more than one clause. For example, in wsj_1303.mrg Japan’s Daiwa Securities Co. named Masahiro Dozen president., the noun phrase Masahiro Dozen is labeled as an NP-SBJ. However, the LFG grammar treats it as the OBJ of the matrix clause. As a result, the labeled bracketed version of this sentence does not receive a full parse, even though its unlabeled, string-only counterpart is well-formed. Some other bracketing mismatches remain, usually the result of adjunct attachment. Such mismatches occur in part because, besides minor modifications to match the bracketing for special constructions, e.g., negated infinitives, the grammar was not altered to mirror the idiosyncrasies of the WSJ bracketing.

2.2 Robustness Techniques

To increase robustness, the standard grammar has been augmented with a FRAGMENT grammar. This grammar parses the sentence as well-formed chunks specified by the grammar, in particular as Ss, NPs, PPs, and VPs. These chunks have both c-structures and f-structures corresponding to them. Any token that cannot be parsed as one of these chunks is parsed as a TOKEN chunk. The TOKENs are also recorded in the c- and f-structures. The grammar has a fewest-chunk method for determining the correct parse. For example, if a string can be parsed as two NPs and a VP or as one NP and an S, the NP-S option is chosen. A sample FRAGMENT c-structure and f-structure are shown in Fig. 1 for wsj_0231.mrg (The golden share was scheduled to expire at the beginning of), an incomplete sentence; the parser builds one S chunk and then one TOKEN for the stranded preposition.

A final capability of XLE that increases coverage of the standard-plus-fragment grammar is a SKIMMING technique. Skimming is used to avoid timeouts and memory problems. When the amount of time or memory spent on a sentence exceeds a threshold, XLE goes into skimming mode for the constituents whose processing has not been completed. When XLE skims these remaining constituents, it does a bounded amount of work per subtree. This guarantees that XLE finishes processing a sentence in a polynomial amount of time. In parsing section 23, 7.2% of the sentences were skimmed; 26.1% of these resulted in full parses, while 73.9% were FRAGMENT parses.

The grammar coverage achieved 100% of section 23 as unseen unlabeled data: 74.7% as full parses, 25.3% FRAGMENT and/or SKIMMED parses.

3 Discriminative Statistical Estimation from Partially Labeled Data

3.1 Exponential Models on LFG Parses

We employed the well-known family of exponential models for stochastic disambiguation. In this paper
we are concerned with conditional exponential models of the form:

\[ p_\lambda(x \mid y) = Z_\lambda(y)^{-1} \, e^{\lambda \cdot f(x)} \]

where $X(y)$ is the set of parses for sentence $y$, $Z_\lambda(y) = \sum_{x \in X(y)} e^{\lambda \cdot f(x)}$ is a normalizing constant, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ is a vector of log-parameters, $f = (f_1, \ldots, f_n)$ is a vector of property-functions $f_i : X \to \mathbb{R}$ for $i = 1, \ldots, n$ on the set of parses $X$, and $\lambda \cdot f(x)$ is the vector dot product $\sum_{i=1}^{n} \lambda_i f_i(x)$.

In our experiments, we used around 1000 complex property-functions comprising information about c-structure, f-structure, and lexical elements in parses, similar to the properties used in Johnson et al. (1999). For example, there are property functions for c-structure nodes and c-structure subtrees, indicating attachment preferences. High versus low attachment is indicated by property functions counting the number of recursively embedded phrases. Other property functions are designed to refer to f-structure attributes, which correspond to grammatical functions in LFG, or to atomic attribute-value pairs in f-structures. More complex property functions are designed to indicate, for example, the branching behaviour of c-structures and the (non)-parallelism of coordinations on both c-structure and f-structure levels. Furthermore, properties referring to lexical elements based on an auxiliary distribution approach as presented in Riezler et al. (2000) are included in the model. Here tuples of head words, argument words, and grammatical relations are extracted from the training sections of the WSJ, and fed into a finite mixture model for clustering grammatical relations. The clustering model itself is then used to yield smoothed probabilities as values for property functions on head-argument-relation tuples of LFG parses.

3.2 Discriminative Estimation

Discriminative estimation techniques have recently received great attention in the statistical machine learning community and have already been applied to statistical parsing (Johnson et al., 1999; Collins, 2000; Collins and Duffy, 2001). In discriminative estimation, only the conditional relation of an analysis given an example is considered relevant, whereas in maximum likelihood estimation the joint probability of the training data to best describe observations is maximized. Since the discriminative task is kept in mind during estimation, discriminative methods can yield improved performance. In our case, discriminative criteria cannot be defined directly with respect to “correct labels” or “gold standard” parses since the WSJ annotations are not sufficient to disambiguate the more complex LFG parses. However, instead of retreating to unsupervised estimation techniques or creating small LFG treebanks by hand, we use the labeled bracketing of the WSJ training sections to guide discriminative estimation. That is, discriminative criteria are defined with respect to the set of parses consistent with the WSJ annotations.[1]

[1] An earlier approach using partially labeled data for estimating stochastic parsers is Pereira and Schabes’s (1992) work on training PCFG from partially bracketed data. Their approach differs from the one we use here in that Pereira and Schabes take an EM-based approach maximizing the joint likelihood of the parses and strings of their training data, while we maximize the conditional likelihood of the sets of parses given the corresponding strings in a discriminative estimation setting.

The objective function in our approach, denoted by $P(\lambda)$, is the joint of the negative log-likelihood $-L(\lambda)$ and a Gaussian regularization term $-G(\lambda)$ on the parameters $\lambda$. Let $\{(y_j, z_j)\}_{j=1}^{m}$ be a set of training data, consisting of pairs of sentences $y$ and partial annotations $z$, let $X(y, z)$ be the set of parses for sentence $y$ consistent with annotation $z$, and let $X(y)$ be the set of all parses produced by the grammar for sentence $y$. Furthermore, let $p[f]$ denote the expectation of function $f$ under distribution $p$. Then $P(\lambda)$ can be defined for a conditional exponential model $p_\lambda(z \mid y)$ as:

\begin{align*}
P(\lambda) &= -L(\lambda) - G(\lambda) \\
&= -\log \prod_{j=1}^{m} p_\lambda(z_j \mid y_j) + \sum_{i=1}^{n} \frac{\lambda_i^2}{2\sigma_i^2} \\
&= -\sum_{j=1}^{m} \log \frac{\sum_{x \in X(y_j, z_j)} e^{\lambda \cdot f(x)}}{\sum_{x \in X(y_j)} e^{\lambda \cdot f(x)}} + \sum_{i=1}^{n} \frac{\lambda_i^2}{2\sigma_i^2} \\
&= -\sum_{j=1}^{m} \log \sum_{x \in X(y_j, z_j)} e^{\lambda \cdot f(x)} + \sum_{j=1}^{m} \log \sum_{x \in X(y_j)} e^{\lambda \cdot f(x)} + \sum_{i=1}^{n} \frac{\lambda_i^2}{2\sigma_i^2}
\end{align*}

Intuitively, the goal of estimation is to find model parameters which make the two expectations in the last
equation equal, i.e. which adjust the model parameters to put all the weight on the parses consistent with the annotations, modulo a penalty term from the Gaussian prior for too large or too small weights.

Since a closed form solution for such parameters is not available, numerical optimization methods have to be used. In our experiments, we applied a conjugate gradient routine, yielding a fast converging optimization algorithm where at each iteration the negative log-likelihood $P(\lambda)$ and the gradient vector have to be evaluated.[2] For our task the gradient takes the form:

\[ \nabla P(\lambda) = \left\langle \frac{\partial P(\lambda)}{\partial \lambda_1}, \frac{\partial P(\lambda)}{\partial \lambda_2}, \ldots, \frac{\partial P(\lambda)}{\partial \lambda_n} \right\rangle, \text{ and} \]

\[ \frac{\partial P(\lambda)}{\partial \lambda_i} = -\sum_{j=1}^{m} \left( \sum_{x \in X(y_j, z_j)} \frac{e^{\lambda \cdot f(x)} f_i(x)}{\sum_{x \in X(y_j, z_j)} e^{\lambda \cdot f(x)}} - \sum_{x \in X(y_j)} \frac{e^{\lambda \cdot f(x)} f_i(x)}{\sum_{x \in X(y_j)} e^{\lambda \cdot f(x)}} \right) + \frac{\lambda_i}{\sigma_i^2} \]

[2] An alternative numerical method would be a combination of iterative scaling techniques with a conditional EM algorithm (Jebara and Pentland, 1998). However, it has been shown experimentally that conjugate gradient techniques can outperform iterative scaling techniques by far in running time (Minka, 2001).

The derivatives in the gradient vector intuitively are again just a difference of two expectations

\[ -\sum_{j=1}^{m} p_\lambda[f_i \mid y_j, z_j] + \sum_{j=1}^{m} p_\lambda[f_i \mid y_j] + \frac{\lambda_i}{\sigma_i^2} \]

Note also that this expression shares many common terms with the likelihood function, suggesting an efficient implementation of the optimization routine.

4 Experimental Evaluation

4.1 Training

The basic training data for our experiments are sections 02-21 of the WSJ treebank. As a first step, all sections were parsed, and the packed parse forests unpacked and stored. For discriminative estimation, this data set was restricted to sentences which receive a full parse (in contrast to a FRAGMENT or SKIMMED parse) for both their partially labeled and their unlabeled variant. Furthermore, only sentences which received at most 1,000 parses were used. From this set, sentences of which a discriminative learner cannot possibly take advantage, i.e. sentences where the set of parses assigned to the partially labeled string was not a proper subset of the parses assigned to the unlabeled string, were removed. These successive selection steps resulted in a final training set consisting of 10,000 sentences, each with parses for partially labeled and unlabeled versions. Altogether there were 150,000 parses for partially labeled input and 500,000 for unlabeled input.

For estimation, a simple property selection procedure was applied to the full set of around 1000 properties. This procedure is based on a frequency cutoff on instantiations of properties for the parses in the labeled training set. The result of this procedure is a reduction of the property vector to about half its size. Furthermore, a held-out data set was created from section 24 of the WSJ treebank for experimental selection of the variance parameter of the prior distribution. This set consists of 120 sentences which received only full parses, out of which the most plausible one was selected manually.

4.2 Testing

Two different sets of test data were used: (i) 700 sentences randomly extracted from section 23 of the WSJ treebank and given gold-standard f-structure annotations according to our LFG scheme, and (ii) 500 sentences from the Brown corpus given gold standard annotations by Carroll et al. (1999) according to their dependency relations (DR) scheme.[3]

[3] Both corpora are available online. The WSJ f-structure bank at www.parc.com/istl/groups/nltt/fsbank/, and Carroll et al.’s corpus at www.cogs.susx.ac.uk/lab/nlp/carroll/greval.html.

Annotating the WSJ test set was bootstrapped by parsing the test sentences using the LFG grammar and also checking for consistency with the Penn Treebank annotation. Starting from the (sometimes fragmentary) parser analyses and the Treebank annotations, gold standard parses were created by manual corrections and extensions of the LFG parses. Manual corrections were necessary in about half of the cases. The average sentence length of the WSJ f-structure bank is 19.8 words; the average number of predicate-argument relations in the gold-standard f-structures is 31.2.

Performance on the LFG-annotated WSJ test set
was measured using both the LFG and DR metrics, thanks to an f-structure-to-DR annotation mapping. Performance on the DR-annotated Brown test set was only measured using the DR metric.

The LFG evaluation metric is based on the comparison of full f-structures, represented as triples relation(predicate, argument). The predicate-argument relations of the f-structure for one parse of the sentence Meridian will pay a premium of $30.5 million to assume $2 billion in deposits. are shown in Fig. 2.

number($:9, billion:17)         number($:24, million:4)
detform(premium:3, a)           mood(pay:0, indicative)
tense(pay:0, fut)               adjunct(million:4, ’30.5’:28)
adjunct(premium:3, of:23)       adjunct(billion:17, ’2’:19)
adjunct($:9, in:11)             adjunct(pay:0, assume:7)
obj(pay:0, premium:3)           stmttype(pay:0, decl)
subj(pay:0, ’Meridian’:5)       obj(assume:7, $:9)
obj(of:23, $:24)                subj(assume:7, pro:8)
obj(in:11, deposit:12)          prontype(pro:8, null)
stmttype(assume:7, purpose)

Figure 2: LFG predicate-argument relation representation

The DR annotation for our example sentence, obtained via a mapping from f-structures to Carroll et al.’s annotation scheme, is shown in Fig. 3.

(aux pay will)                  (subj pay Meridian )
(detmod premium a)              (mod million 30.5)
(mod $ million)                 (mod of premium $)
(dobj pay premium )             (mod billion 2)
(mod $ billion)                 (mod in $ deposit)
(dobj assume $ )                (mod to pay assume)

Figure 3: Mapping to Carroll et al.’s dependency-relation representation

Superficially, the LFG and DR representations are very similar. One difference between the annotation schemes is that the LFG representation in general specifies more relation tuples than the DR representation. Also, multiple occurrences of the same lexical item are indicated explicitly in the LFG representation but not in the DR representation. The main conceptual difference between the two annotation schemes is the fact that the DR scheme crucially refers to phrase-structure properties and word order as well as to grammatical relations in the definition of dependency relations, whereas the LFG scheme abstracts away from serialization and phrase-structure. Facts like this can make a correct mapping of LFG f-structures to DR relations problematic. Indeed, we believe that we still underestimate by a few points because of DR mapping difficulties.[4]

[4] See Carroll et al. (1999) for more detail on the DR annotation scheme, and see Crouch et al. (2002) for more detail on the differences between the DR and the LFG annotation schemes, as well as on the difficulties of the mapping from LFG f-structures to DR annotations.

4.3 Results

In our evaluation, we report F-scores for both types of annotation, LFG and DR, and for three types of parse selection, (i) lower bound: random choice of a parse from the set of analyses (averaged over 10 runs), (ii) upper bound: selection of the parse with the best F-score according to the annotation scheme used, and (iii) stochastic: the parse selected by the stochastic disambiguator. The error reduction row lists the reduction in error rate relative to the upper and lower bounds obtained by the stochastic disambiguation model. F-score is defined as

\[ \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]

Table 1 gives results for 700 examples randomly selected from section 23 of the WSJ treebank, using both LFG and DR measures.

Table 1: Disambiguation results for 700 randomly selected examples from section 23 of the WSJ treebank using LFG and DR measures.

                   LFG    DR
upper bound        84.1   80.7
stochastic         78.6   73.0
lower bound        75.5   68.8
error reduction    36     35

The effect of the quality of the parses on disambiguation performance can be illustrated by breaking down the F-scores according to whether the parser yields full parses, FRAGMENT, SKIMMED, or SKIMMED+FRAGMENT parses for the test sentences. The percentages of test examples which belong to the respective classes of quality are listed in the first row of Table 2. F-scores broken down according to classes of parse quality are recorded in the following rows. The first column shows F-scores for all
parses in the test set, as in Table 1. The second column shows the best F-scores when restricting attention to examples which receive only full parses. The third column reports F-scores for examples which receive only non-full parses, i.e. FRAGMENT or SKIMMED parses or SKIMMED+FRAGMENT parses. Columns 4-6 break down non-full parses according to examples which receive only FRAGMENT, only SKIMMED, or only SKIMMED+FRAGMENT parses.

Results of the evaluation on Carroll et al.’s Brown test set are given in Table 3. Evaluation results for the DR measure applied to the Brown corpus test set broken down according to parse-quality are shown in Table 2.

In Table 3 we show the DR measure along with an evaluation measure which facilitates a direct comparison of our results to those of Carroll et al. (1999). Following Carroll et al. (1999), we count a dependency relation as correct if the gold standard has a relation with the same governor and dependent but perhaps with a different relation-type. This dependency-only (DO) measure thus does not reflect mismatches between arguments and modifiers in a small number of cases. Note that since for the evaluation on the Brown corpus, no held-out data were available to adjust the variance parameter of a Bayesian model, we used a plain maximum-likelihood model for disambiguation on this test set.

Table 3: Disambiguation results on 500 Brown corpus examples using DO measure and DR measures.

                         DO     DR
Carroll et al. (1999)    75.1   -
upper bound              82.0   80.0
stochastic               76.1   74.0
lower bound              73.3   71.7
error reduction          32     33

5 Discussion

We have presented a first attempt at scaling up a stochastic parsing system combining a hand-coded linguistically fine-grained grammar and a stochastic disambiguation model to the WSJ treebank. Full grammar coverage is achieved by combining specialized constraint-based parsing techniques for LFG grammars with partial parsing techniques. Furthermore, a maximal exploitation of treebank annotations for estimating a distribution on fine-grained LFG parses is achieved by letting grammar analyses which are consistent with the WSJ labeled bracketing define a gold standard set for discriminative estimation. The combined system trained on WSJ data achieves full grammar coverage and disambiguation performance of 79% F-score on WSJ data, and 76% F-score on the Brown corpus test set.

While disambiguation performance of around 79% F-score on WSJ data seems promising, from one perspective it only offers a 3% absolute improvement over a lower bound random baseline. We think that the high lower bound measure highlights an important aspect of symbolic constraint-based grammars (in contrast to treebank grammars): the symbolic grammar already significantly restricts/disambiguates the range of possible analyses, giving the disambiguator a much narrower window in which to operate. As such, it is more appropriate to assess the disambiguator in terms of reduction in error rate (36% relative to the upper bound) than in terms of absolute F-score. Both the DR and LFG annotations broadly agree in their measure of error reduction.

The lower reduction in error rate relative to the upper bound for DR evaluation on the Brown corpus can be attributed to a corpus effect that has also been observed by Gildea (2001) for training and testing PCFGs on the WSJ and Brown corpora.[5]

[5] Gildea reports a decrease from 86.1%/86.6% recall/precision on labeled bracketing to 80.3%/81% when going from training and testing on the WSJ to training on the WSJ and testing on the Brown corpus.

Breaking down results according to parse quality shows that irrespective of evaluation measure and corpus, around 4% overall performance is lost due to non-full parses, i.e. FRAGMENT, or SKIMMED, or SKIMMED+FRAGMENT parses.

Due to the lack of standard evaluation measures and gold standards for predicate-argument matching, a comparison of our results to other stochastic parsing systems is difficult. To our knowledge, so far the only direct point of comparison is the parser of Carroll et al. (1999) which is also evaluated on Carroll et al.’s test corpus. They report an F-score
of 75.1% for a DO evaluation that ignores predicate labels, counting only dependencies. Under this measure, our system achieves 76.1% F-score.

Table 2: LFG F-scores for the 700 WSJ test examples and DR F-scores for the 500 Brown test examples broken down according to parse quality.

WSJ-LFG          all    full   non-full   fragments   skimmed   skimmed+fragments
% of test set    100    74.7   25.3       20.4        1.4       3.4
upper bound      84.1   88.5   73.4       76.7        70.3      61.3
stochastic       78.6   82.5   69.0       72.4        66.6      56.2
lower bound      75.5   78.4   67.7       71.0        63.0      55.9

Brown-DR         all    full   non-full   fragments   skimmed   skimmed+fragments
% of test set    100    79.6   20.4       20.0        2.0       1.6
upper bound      80.0   84.5   65.4       65.4        56.0      53.5
stochastic       74.0   77.9   61.5       61.5        52.8      50.0
lower bound      71.1   74.8   59.2       59.1        51.2      48.9

References

Gosse Bouma, Gertjan van Noord, and Robert Malouf. 2000. Alpino: Wide-coverage computational analysis of Dutch. In Proceedings of Computational Linguistics in the Netherlands, Amsterdam, Netherlands.

Miriam Butt, Tracy King, Maria-Eugenia Niño, and Frédérique Segond. 1999. A Grammar Writer’s Cookbook. Number 95 in CSLI Lecture Notes. CSLI Publications, Stanford, CA.

John Carroll, Guido Minnen, and Ted Briscoe. 1999. Corpus annotation for parser evaluation. In Proceedings of the EACL workshop on Linguistically Interpreted Corpora (LINC), Bergen, Norway.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14 (NIPS’01), Vancouver.

Michael Collins. 2000. Discriminative reranking for natural language processing. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML’00), Stanford, CA.

Richard Crouch, Ronald M. Kaplan, Tracy H. King, and Stefan Riezler. 2002. A comparison of evaluation metrics for a broad-coverage stochastic parser. In Proceedings of the “Beyond PARSEVAL” Workshop at the 3rd International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain.

Dan Gildea. 2001. Corpus variation and parser performance. In Proceedings of 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), Pittsburgh, PA.

Tony Jebara and Alex Pentland. 1998. Maximum conditional likelihood via bound maximization and the CEM algorithm. In Advances in Neural Information Processing Systems 11 (NIPS’98).

Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99), College Park, MD.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn treebank: Annotating predicate argument structure. In ARPA Human Language Technology Workshop.

John Maxwell and Ron Kaplan. 1993. The interface between phrasal and functional constraints. Computational Linguistics, 19(4):571–589.

Thomas Minka. 2001. Algorithms for maximum-likelihood logistic regression. Department of Statistics, Carnegie Mellon University.

Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL’92), Newark, Delaware.

Stefan Riezler, Detlef Prescher, Jonas Kuhn, and Mark Johnson. 2000. Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL’00), Hong Kong.
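
The discriminative objective of Section 3.2 (the negative conditional log-likelihood of the annotation-consistent parses plus a Gaussian penalty) and its gradient (a difference of two feature expectations plus the penalty derivative) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the list-of-feature-vectors encoding of parses, and the single shared variance `sigma2` are all hypothetical choices made for the example.

```python
import math


def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def expectation(features, scores):
    """Expected feature vector under the exponential model restricted to these parses."""
    log_z = log_sum_exp(scores)
    probs = [math.exp(s - log_z) for s in scores]
    n = len(features[0])
    return [sum(p * f[i] for p, f in zip(probs, features)) for i in range(n)]


def objective_and_gradient(lam, data, sigma2):
    """Return (P(lambda), gradient) for partially labeled training data.

    data is a list of (all_parses, consistent_parses) pairs. Each parse is a
    feature vector f(x); consistent_parses are the parses in X(y, z), a subset
    of all_parses, which plays the role of X(y). sigma2 is one variance shared
    by all parameters (an assumption made to keep the sketch short).
    """
    def score(f):
        return sum(l * fi for l, fi in zip(lam, f))

    value = sum(l * l / (2.0 * sigma2) for l in lam)   # Gaussian penalty
    grad = [l / sigma2 for l in lam]
    for all_parses, consistent in data:
        s_all = [score(f) for f in all_parses]
        s_con = [score(f) for f in consistent]
        # - log sum over X(y, z) + log sum over X(y)
        value += log_sum_exp(s_all) - log_sum_exp(s_con)
        e_all = expectation(all_parses, s_all)          # p[f | y]
        e_con = expectation(consistent, s_con)          # p[f | y, z]
        grad = [g + a - c for g, a, c in zip(grad, e_all, e_con)]
    return value, grad


if __name__ == "__main__":
    # One toy sentence with three parses, two of them consistent with the annotation.
    toy = [([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])]
    print(objective_and_gradient([0.0, 0.0], toy, sigma2=10.0))
```

The pair returned by `objective_and_gradient` is what a gradient-based optimizer such as a conjugate gradient routine needs at each iteration; the two log-sum terms and the two expectations share the same parse scores, which is the kind of reuse the text alludes to.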
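
The evaluation measures of Section 4.3 can be illustrated with a short sketch. The F-score function follows the definition given in the text, applied to relation triples. The error-reduction function is an assumption: it computes the share of the gap between lower and upper bound that the stochastic model closes, which is one reading that reproduces the error reduction figures of Table 1 from the other three rows. Function names and the tuple encoding of relations are hypothetical.

```python
from collections import Counter


def f_score(predicted, gold):
    """F-score over relation triples, counting each triple at most as often as it occurs in gold."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    matches = sum((pred_counts & gold_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(pred_counts.values())
    recall = matches / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


def error_reduction(stochastic, lower, upper):
    """Percentage of the lower-to-upper-bound gap closed by the stochastic model (assumed formula)."""
    return 100.0 * (stochastic - lower) / (upper - lower)


if __name__ == "__main__":
    # A few relations taken from Fig. 2, with one deliberately wrong attachment.
    gold = [("subj", "pay", "Meridian"), ("obj", "pay", "premium"),
            ("adjunct", "pay", "assume"), ("obj", "assume", "$")]
    predicted = [("subj", "pay", "Meridian"), ("obj", "pay", "premium"),
                 ("adjunct", "premium", "assume")]
    print(f_score(predicted, gold))
    # Table 1, LFG column: stochastic 78.6, lower bound 75.5, upper bound 84.1.
    print(round(error_reduction(78.6, 75.5, 84.1)))
```

Using a multiset intersection rather than a set keeps repeated relations from being credited more often than they appear in the gold standard.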