Discourse Deixis and Coreference: Evidence from AnCora
Marta Recasens
CLiC - Centre de Llenguatge i Computació
Department of Linguistics
University of Barcelona
08007 Barcelona, Spain
mrecasens@ub.edu

Abstract

Few empirical studies have been conducted on discourse deixis, and no such study exists for Catalan or Spanish. This paper presents an empirical analysis of 200 000 words from the AnCora corpora annotated with discourse deixis. It returns to and tests assumptions previously made, laying out the linguistic problems we still need to account for. To this end, proposals are put forward with regard to (i) the detection of abstract anaphors, and (ii) the way their antecedents should be understood, drawing on the theory of underspecification. The quantitative and qualitative corpus analysis sheds light on ways of improving the performance of coreference resolution systems by shifting the focus from the delimitation of antecedents to the detection of abstract anaphors.

1 Introduction

Natural Language Processing (NLP) work dealing with discourse deixis (e.g. (1)) has been scarce to date in comparison with the considerable amount of effort devoted to the automatic resolution of pronominal individual anaphora. It is probably the relative ease of identifying term-denoting NPs as well as their relatedness to named entities that accounts for this greater attention. Following Webber’s (1988) terminology, discourse deixis is taken here to mean NPs that refer to a previous discourse segment.¹ The discourse segment is referred to as an abstract antecedent, and the NP as an abstract anaphor. The abstract antecedent can be either the referent situation(s) or circumstances expressed by the stretch of text (1), or the proposition itself (“wording”) as a linguistic object (2).

(1) Al voltant d’aquesta passarel·la està previst crear un espai lliure amb vistes cap al riu Cardener. Això implica una modificació del traçat del carrer Sant Antoni. (C)²
‘Around this walkway it is planned to create a free area facing the River Cardener. This implies a modification of the route of Sant Antoni Street.’

(2) “L’euro té potencial per a una apreciació, basada en el creixement i l’estabilitat de preus interna,” van declarar ahir els ministres d’Economia i Finances. La declaració institucional va ser emesa pel Consell de l’Euro. (C)
‘“The euro has potential for appreciation, based on the internal growth and stability of prices,” declared yesterday the ministers of Economy and Finance. The institutional declaration was expressed by the Euro Council.’

Only small datasets (Byron and Allen, 1998; Eckert and Strube, 2000; Navarretta and Olsen, 2008) have been annotated with discourse deixis, and these are limited to a few anaphoric expressions. Byron (2002) emphasized that demonstrative pronouns referring to clauses or larger stretches of text abound in natural discourse, and the corpus-based study of the use of demonstrative NPs in Portuguese and French conducted by Vieira et al. (2002) also pointed out the limitation of systems restricted to anaphors with a nominal antecedent. Later annotation efforts such as OntoNotes (Pradhan et al., 2007) and ARRAU (Poesio and Artstein, 2008) have tried to overcome these limitations.

This paper presents a corpus study of discourse deixis in Catalan and Spanish. With the future goal of building a coreference resolution system for these two languages, the AnCora corpora (already annotated from the morphological to the semantic levels) are being enriched with

¹ Although from a semantico-logical point of view, discourse deixis overlaps but is not the same as reference to abstract objects, I will use both interchangeably, thus treating events as abstract objects. Abstract reference is also called situation reference (Fraurud, 1992).
² The language appears indicated at the end of each example: (S) for Spanish, (C) for Catalan.
Johansson, C. (Ed.)
73
Proceedings of the Second Workshop
on Anaphora Resolution (2008)
coreference relations. The annotation includes all NPs, which implies that those whose linguistic antecedent is a discourse segment are also encoded. I focus on written texts (newspaper articles), unlike earlier work on dialogs. In addition, not only pronouns but also full NPs are annotated. Since Catalan and Spanish are pro-drop languages, zero pronouns are also considered. This empirical analysis of the 200 000 words that are available at present (100 000 for each language) returns to and tests assumptions previously made. It thus lays out the linguistic problems we still need to account for.

In order to make sense of the real data, proposals are put forward with regard to (i) the detection of abstract anaphors (distinguishing between nominalizations and labels in Francis’ (1994) terms), and (ii) the way their antecedents should be understood, drawing on the theory of underspecification (Poesio et al., 2006). These ideas have a twofold effect. On the one hand, they suggest that it is feasible for a coreference resolution system to automatically detect references to abstract objects, thus improving the overall performance of the system. On the other hand, they argue for the likely failure of delimiting abstract antecedents on the basis of exact boundaries.

The paper is organised as follows. Section 2 outlines previous work on discourse deixis in the field of NLP: from early theoretical accounts to more recent corpus-based approaches. The AnCora corpora are described in Section 3, where the coreference coding scheme and the reliability study are also presented. The annotated data prompts a revision of previous assumptions in Section 4 on the basis of both a quantitative and qualitative corpus analysis. Section 5 tries to make sense of problematic issues discussed in Section 4 by borrowing from other linguistic accounts. Finally, Section 6 summarizes the conclusions and outlines future work.

2 Previous work

Reference to abstract objects first came onto the scene in NLP when systems began to be developed to resolve pronominal anaphora. Soon it was realized that neuter pronouns such as it, and especially demonstratives like this and that, often referred to linguistic units other than NPs.³ The need to account for these pronouns required going beyond the NP level, and discourse model theories entered the scene. Webber (1988) introduced the term “discourse segments” to refer to the clausal mention in (1). Karttunen (1976) had previously talked of entities introduced by NPs and referred back to in the discourse as “discourse entities.” Webber’s term was meant as a complement to Karttunen’s, claiming that discourse segments have their own mental reality apart from the discourse entities they contain.

These first approaches were rather theoretical. Although they used some real examples, these were selected according to what they wanted to prove and no systematic empirical study was conducted. Four ideas underlie Webber’s (1988) seminal work, and these recur in subsequent works. I limit myself to mentioning them here (in her own words). Section 4 returns to this point to compare the assumptions with the empirical data from AnCora.

1. Preference for demonstratives: “Subsequent reference to a sequence of clauses is most often done via deictic pronouns.”
2. Referent coercion: “Once the speaker has referred to it [discourse segment] via this/that, it must now have the status of a discourse entity since it can be referenced via the anaphoric pronoun it.”
3. Required presence: “The demonstratum being something [explicitly] present in the shared context.”
4. Ambiguity:⁴ “All pointing is ambiguous . . . The listener’s choice depends on what is compatible with the meaning of the rest of the sentence.”

Around the turn of the century, collections of real data became available, and they have made it possible to compare early theoretical claims with naturally occurring data. Table 1 presents some of the corpora where discourse-deictic NPs (in some cases only pronouns or only demonstratives) have been annotated. These corpora were developed either with a view to developing and testing algorithms or to extracting quantitative figures about linguistic phenomena. Work in progress includes the OntoNotes coreference

³ Notice that not all instances of these pronouns are referential in English. There are no counterparts in Catalan and Spanish to the English dummy-it constructions, and expletive uses are restricted to a very few constructions.
⁴ Although I will argue for “non-specification” in Section 4.4, the term “ambiguity” is kept here in fidelity to Webber’s original words.
74
ISSN 1736-6305 Vol. 2
http://dspace.utlib.ee/dspace
/handle/10062/7129
System/Study                  Corpus                    Anaphor                 Antecedent
                                                                                NP      Clause
Byron and Allen (1998)        English dialogs,          pers.pr.                75%     7%
(PHORA)                       383 pronouns              dem.pr.                 25%     35%
Eckert and Strube (2000)      English dialogs           pers.&dem.pr.           45%     23%
Navarretta and Olsen (2008)   Danish texts (60K)        pers.&dem.pr.           26%     29%
(DAD)                         Italian texts (55K)       (zero) pers.&dem.pr.    85%     10%
Vieira et al. (2002)          Portuguese, 50 dem.NPs    dem.full NP             62%     38%
                              French, 50 dem.NPs        dem.full NP             68%     32%
Botley (2006)                 English (300K):           this                        56%
                              spoken discourse,         that                        32%
                              news,                     these                       10%
                              literature                those                       2%

Table 1: Corpora annotated with discourse deixis
corpus (Pradhan et al., 2007) (although only the heads of VPs are considered as antecedents), and the ARRAU corpus (Poesio and Artstein, 2008), where all clauses are presented as potential antecedents for the coders to decide.

Annotating this information, however, is still an open problem, since “it is not completely clear the extent to which humans agree on the interpretation of such expressions” (Poesio and Artstein, 2008). The largest existing corpora annotated with coreference information (for the MUC and ACE campaigns) all restrict themselves to encoding NPs whose antecedent is also an NP. No corpus-based study exists for Catalan or Spanish.

3 Coreference annotation in AnCora

The AnCora corpora – Annotated Corpora for Catalan and Spanish⁵ (Taulé et al., 2008) consist of two 500 000-word corpora for Catalan (AnCora-Ca) and Spanish (AnCora-Es), mainly newspaper and newswire articles. Both corpora are annotated at different levels of linguistic description: morphological (PoS and lemmas), syntactic (constituents and functions), and semantic (argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses). They are being enriched with coreference annotation: 100 000 words for each language are available at present.

The Catalan subset contains 31 079 NPs (10 975 coreferent); the Spanish subset 29 179 NPs (10 499 coreferent). In terms of figures similar to the ones reported in former works (Table 1), it emerges that 42% (AnCora-Ca) and 86% (AnCora-Es) of neuter personal pronouns, and 59% (AnCora-Ca) and 57% (AnCora-Es) of neuter demonstrative pronouns have a clausal antecedent.⁶ The Catalan and Spanish neuter pronouns are the equivalent forms of English it, this, and that. The correspondence, however, is not one-to-one, as the range of uses of the Romance forms is much more restricted than that of the English ones. This factor, together with differences in the way each corpus has been annotated, probably accounts for the differences with Table 1.

Finally, an interesting ratio not provided by former work is the ratio of discourse-deictic NPs to the total number of coreferent NPs:⁷ 3% in Catalan, and 4% in Spanish. Discourse-deictic NPs thus represent a small group in comparison with coreference links between NPs. The fact that reference to abstract objects is more typical of dialogues than newspaper texts contributes to these low figures. However, although discourse deixis accounts for less than 5% of all coreference links, successfully detecting this percentage could result in a statistically significant improvement in the overall performance of a coreference resolution system by reducing the number of false positive links.

⁵ Available from http://clic.ub.edu/ancora
⁶ If relative frequencies are computed including zero pronouns, as done for Italian in (Navarretta and Olsen, 2008), then we obtain that 5% (AnCora-Ca) and 4% (AnCora-Es) of pronouns have a clausal antecedent.
⁷ The NP count includes pronouns as well as definite and demonstrative NPs, since these are the forms that can be abstract anaphors.
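
The argument that closes Section 3, namely that flagging even a small share of discourse-deictic NPs can help by removing false positive links, can be illustrated with a small calculation. This is only a sketch under stated assumptions: the link counts are hypothetical and are not taken from AnCora or from any system evaluation, and `link_precision` is a name introduced here for illustration.

```python
def link_precision(true_links: int, false_links: int) -> float:
    """Precision of proposed coreference links: correct / all proposed."""
    proposed = true_links + false_links
    if proposed == 0:
        raise ValueError("no links proposed")
    return true_links / proposed


# HYPOTHETICAL counts, for illustration only (not AnCora results).
correct = 800        # proposed links that are right
wrong = 200          # proposed links that are wrong
wrong_deictic = 40   # wrong links whose anaphor is in fact discourse deictic

# Before: discourse-deictic NPs are wrongly chained to an NP antecedent.
before = link_precision(correct, wrong)
# After: those NPs are detected and kept out of NP-NP chains.
after = link_precision(correct, wrong - wrong_deictic)

print(f"precision before: {before:.3f}, after: {after:.3f}")
```

With these made-up counts the wrongly linked discourse-deictic NPs are 4% of all proposed links, and removing them raises link precision. Whether such a gain is statistically significant on real data would have to be tested.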
3.1 Coding scheme

The coreference annotation follows a two-step process: (i) an automatic stage, (ii) a manual one. Only markables corresponding to NPs are automatically encoded with XML tags thanks to the morphosyntactic annotations. Discourse segments are marked at the manual stage when they are needed to mark up a link. The coding guidelines (Recasens et al., 2007) distinguish between identity and discourse deixis relations depending on the type of antecedent: the former have an NP as antecedent, the latter a discourse segment (including at least one clause).

Discourse deixis relations are further split into “segment” (3) and “textual scene” (4) to differentiate those antecedents that fall within the sentence unit from those that go beyond. Segmental discourse deixis takes an attribute specifying the semantic type of the reference: event-token, event-type, or proposition.

(3) Un pirata informático consiguió robar los datos de 485.000 tarjetas de crédito . . . El robo fue descubierto . . . (S)
‘A hacker managed to steal the data of 485,000 credit cards . . . The robbery was discovered . . . ’

(4) Latinoamérica concluyó hoy su participación en la “Bolsa de Turismo” de Berlín con un balance preliminar un tanto pesimista porque *0* no tuvo la cantidad de visitantes esperada. La competencia de Asia, los altos precios de los pasajes y la relación dólar-marco alemán, fueron los obstáculos señalados por varios países para impedirles lograr sus objetivos. La escasa presencia de interesados provocó que en algunos puestos el material no se distribuyera por completo. . . . Fernández se mostró optimista con respecto a que la situación mejore. (S)
‘Latin America ended today its participation in the tourism stock market of Berlin with a preliminary balance rather pessimistic since (it) did not have the expected number of visitors. The competition by Asia, the high prices of the tickets, and the relation dollar–German mark, were the obstacles pointed out by several countries that impeded them to achieve their aims. The scarce presence of interested people caused some stalls not to have all their material distributed. . . . Fernández showed herself optimistic with respect to the improvement of the situation.’

3.2 Reliability study

The coding scheme of AnCora was tested in a reliability study involving eight participants (six undergraduates and two graduates of linguistics, all of them native Spanish speakers), who annotated the same two texts independently. Given the high cost – both in time and money – of conducting such experiments,⁸ this small-scale study was meant as a first approximation to the quality of the scheme. Although high agreement scores (α=.85 and α=.90) were obtained for the coreferent vs. non-coreferent distinction, the four instances likely to be annotated as discourse deixis turned out to be a major source of disagreement. Annotators largely agreed on the NPs chosen as abstract anaphors, but they often disagreed on the extension of abstract antecedents, although the discourse segments usually overlapped. These results are in line with the conclusions reported by Artstein and Poesio (2006) from a similar experiment on dialogues.

4 Evidence from AnCora

A quantitative and qualitative analysis of the 200 000 words coreferentially annotated from the AnCora corpora offers the chance to revisit Webber’s (1988) assumptions (Section 2) by commenting on those examples that raised the most questions among annotators, thus taking a bottom-up perspective. Throughout the discussion, linguistic problems that have not been accounted for become apparent.

4.1 Preference for demonstratives

Webber (1988) states that there is a preference for using the demonstratives this and that rather than the pronoun it to refer to a previous discourse segment. To test whether this preference also holds for Catalan and Spanish, discourse-deictic NPs were extracted and sorted by morphological form. Figures for absolute and relative frequencies are presented in Table 2. Given that the antecedents of discourse deixis are usually not longer than one sentence, I focus on this group. As far as pronouns are concerned, Catalan makes slightly greater use of demonstratives (15.04%) than of personal pronouns (13.16%). No preference for demonstratives, however, is observed in Spanish, where personal pronouns (13.64%) are used about twice as often as demonstratives (6.17%). With regard to full NPs, these are the forms that participate most frequently in discourse deixis, both in Catalan and Spanish (50%). This high percentage calls

⁸ For this study, the annotation of two texts required 10 hours per coder.
                                      AnCora-Ca                          AnCora-Es
                            ≤ 1 sentence    > 1 sentence       ≤ 1 sentence    > 1 sentence
Coreferent NP                 #       %       #      %           #       %       #      %
Full NP
  Definite                   80   30.08       4   1.50          78   25.32      13   4.22
  Demonstrative              47   17.67       9   3.38          52   16.88       3   0.97
  Possessiveᵃ                 –       –       –      –          15    4.87       0   0
Pronoun
  Personal (neuter)          35   13.16       1   0.38          42   13.64       1   0.32
  Zero                       30   11.28       1   0.38          11    3.57       0   0
  Relative                   17    6.39       0   0             71   23.05       1   0.32
  Demonstrative (neuter)     40   15.04       2   0.75          19    6.17       2   0.65
Total                       249   93.61      17   6.39         288   93.51      20   6.49

ᵃ Given that possessive determiners are always preceded by the definite article in Catalan, possessive full NPs are included in the definite group.

Table 2: Distribution of discourse deixis in AnCora
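
The relative frequencies in Table 2 can be recomputed from the absolute counts, which is a quick way to check a transcription of the table. The following is a minimal sketch: the counts are copied from the rows of Table 2, each percentage is taken over all discourse-deictic NPs of the language (both sentence-distance groups together), and the names `TABLE2` and `relative_frequencies` are introduced here for illustration.

```python
# Absolute counts from the rows of Table 2:
# form -> (antecedent <= 1 sentence away, antecedent > 1 sentence away)
TABLE2 = {
    "AnCora-Ca": {
        "Definite NP": (80, 4),
        "Demonstrative NP": (47, 9),
        "Personal pronoun (neuter)": (35, 1),
        "Zero pronoun": (30, 1),
        "Relative pronoun": (17, 0),
        "Demonstrative pronoun (neuter)": (40, 2),
    },
    "AnCora-Es": {
        "Definite NP": (78, 13),
        "Demonstrative NP": (52, 3),
        "Possessive NP": (15, 0),
        "Personal pronoun (neuter)": (42, 1),
        "Zero pronoun": (11, 0),
        "Relative pronoun": (71, 1),
        "Demonstrative pronoun (neuter)": (19, 2),
    },
}


def relative_frequencies(counts):
    """Return (percentages per form, near total, far total, grand total)."""
    near_total = sum(near for near, _ in counts.values())
    far_total = sum(far for _, far in counts.values())
    grand_total = near_total + far_total
    percentages = {
        form: (round(100 * near / grand_total, 2),
               round(100 * far / grand_total, 2))
        for form, (near, far) in counts.items()
    }
    return percentages, near_total, far_total, grand_total


for corpus, counts in TABLE2.items():
    pct, near_total, far_total, grand_total = relative_frequencies(counts)
    print(corpus, "totals:", near_total, far_total, grand_total)
    for form, (near_pct, far_pct) in pct.items():
        print(f"  {form:32s} {near_pct:6.2f} {far_pct:6.2f}")
```

Recomputing the column totals from the rows, rather than copying them, makes any slip in a transcribed total visible immediately.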
thus for their inclusion in a coreference resolution system.

4.2 Referent coercion

The assumption that a discourse segment turns into a discourse entity when it is referred to by a demonstrative (Webber, 1988) suggests that the sequence

    segment ... this ... it

is the prototypical one. Such a pattern, however, needs to be extended to allow for full NPs, which broadens the range of possible patterns. AnCora includes instances of:

• segment ... full NP ... full NP (3)
• segment ... full NP ... segment (5)⁹

(5) “El movimiento de las arenas hace difícil saber dónde están enterradas las minas. No es una cuestión de mapas el saber dónde están *0* y cuál es el estado de las minas”, añadió *0* ... retirar estas minas, de las que no se sabe la situación exacta. (S)
‘“The movement of the sand makes it difficult to know where the mines are buried. The knowledge of where (they) are and what is their state is not a matter of maps”, (he) added ... removing these mines, of which the exact situation is unknown.’

• full NP ... segment ... NP (6)

(6) Dos arqueòlegs nord-americans acaben de muntar un gran enrenou amb una nova teoria ... Els primers pobladors del continent americà podrien haver estat habitants de la península Ibèrica que fa 18.000 anys van travessar l’Atlàntic. Aquesta és la provocativa teoria de dos arqueòlegs nord-americans... (C)
‘Two North American archaeologists have just caused quite a commotion with a new theory ... The first inhabitants of the American continent could have been inhabitants of the Iberian Peninsula that crossed the Atlantic 18,000 years ago. This is the provocative theory of two North American archaeologists ...’

A usual way of elaborating on a previous NP, providing additional information, is by using a clausal mention in a subsequent reference.

4.3 Required presence

Although Webber (1988) claims that a discourse-deictic pronoun must point to something which explicitly appears in the discourse, the fact that text comprehension is highly constructive accounts for counterexamples in which the antecedent cannot be easily recovered from the preceding context.

(7) Para el presidente, “es evidente” que el PSOE no llega a nuevos sectores de la población por lo que debe hacerse “un gran esfuerzo de cambio, de mensajes claros y de valores

⁹ The zero pronoun is marked with *0* and with the corresponding pronoun in brackets in the English translation.
que permitan que ese mensaje sea asumido por los nuevos componentes de una sociedad española que ha cambiado mucho”. (S)
‘For the president, “it is evident” that the PSOE does not arrive to new sectors of the population, so that there is the need for “a big effort of change, of clear messages and of values that allow this message to be assumed by the new components of a Spanish society that has changed a lot.”’

These cases resemble “bridging” (Clark, 1977) in that the reader has to carry out a process of inference to arrive at the antecedent, since this does not appear explicitly. Clark’s original idea of bridging needs to be extended in two ways: (i) it involves not only NP antecedents (her house ... the door) but also discourse segments, and (ii) both definite and demonstrative NPs are possible bridging anaphors.

4.4 Non-specificity

Webber (1988) points out the ambiguous nature of discourse deixis, especially with respect to the extension of the antecedent (already highlighted by the reliability study, Section 3.2). From her point of view, different extensions of a discourse segment might imply different referents. Hence, the use of the term “ambiguity.” More accurately, however, the point at issue is the non-specific nature of the antecedent (8). Webber proposes the “right frontier” as a cue, according to which only discourse segments on the current right frontier of the discourse tree can yield referents for abstract anaphors. The problem lies in choosing which segment on the right frontier.¹⁰ This non-specificity also applies to full NPs, especially those of the kind la situación ‘the situation’ in (4).

(8) Agassi insistió que *0* puede ser mejor jugador para volver a tener un gran año, aunque *0* no le garantice los triunfos que *0* tuvo en 1999. (S)
‘Agassi stressed that (he) could be a better player to have a great year again, although (it) does not guarantee him the victories that he had in 1999.’

5 Making sense of the data

It follows from the discussion in the previous section that, from a computational perspective, the automatic resolution of discourse deixis can profit from insights on:

• Detecting the kind of NPs that can be abstract anaphors.
• Accounting for the inherent non-specificity of abstract antecedents.

I now turn my attention to these two issues. First, I suggest linguistic cues for detecting abstract anaphors by focusing on nominalizations and labels in Francis’ (1994) terms. Second, I draw on the theory of underspecification (Poesio et al., 2006) to account for the continuum of specificity on which antecedents of abstract anaphors seem to lie.

5.1 Detecting abstract anaphors

Although both pronouns and full NPs can be abstract anaphors, the former amount to no more than a reduced set (neuter, relative and zero pronouns) while the latter constitute an infinite set, thus posing greater difficulties. An analysis of the Catalan and Spanish forms observed in discourse deixis suggests that three specific groups of nouns are potential candidates to be abstract anaphors. Table 3 illustrates absolute and relative frequencies as well as examples (from AnCora-Es) of each group.

• Nominalizations
  - Deverbal nouns (e.g. exportación ‘exportation’) can be detected by the presence of a nominalizing affix (Spanish -ción, -miento, -cia, -aje, etc.).
  - Verbal forms converted into nouns (e.g. apoyo ‘support’) can be identified by extracting from the morphological parser those pairs of tokens that can be either a noun or a verb. Only a limited set of verbal forms can undergo such conversion: first-person indicative, first- or third-person subjunctive, and past participle.

• “Cousins”
  They are non-deverbal abstract nouns denoting things that are conceptually event-like. E.g. éxito ‘success’ (no verb such as the English succeed exists in Spanish).

• Labels
  The term is borrowed from Francis (1994): labels¹¹ are nominal groups that function as pro-forms used “to encapsulate or package a stretch of (written) discourse.” They

¹⁰ “When there is more than one [discourse segment], . . . I will have nothing to say here about how the choice between them is made.” (Webber, 1991)
¹¹ Also known as anaphoric nouns or shell nouns (Schmid, 2000).
        Nominalizing        Noun /          Cousins             Labels
        affix               Verb                                Neutral        Evaluative
#       39 (34%)            32 (28%)        7 (6%)              13 (12%)       23 (20%)
e.g.    concentración       acuerdo         éxito               asunto         actitud
        ‘concentration’,    ‘agreement’,    ‘success’,          ‘issue’,       ‘attitude’,
        pensamiento         visita          desventaja          caso           dificultades
        ‘thought’,          ‘visit’,        ‘disadvantage’,     ‘case’,        ‘difficulties’,
        entrenamiento       accidente       presencia           cuestión       objetivo
        ‘training’          ‘accident’      ‘presence’          ‘matter’       ‘objective’

Table 3: Typology of abstract anaphors (from AnCora-Es)
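
The two nominalization cues described in Section 5.1 (a nominalizing affix, and a form that can be either a noun or a verb) can be sketched in a few lines. This is a minimal illustration, not the procedure used for AnCora: the suffix list contains only the four Spanish affixes named in the text (the source marks the list as open-ended), and `TOY_POS_LEXICON` is a hypothetical stand-in for the output of a morphological parser.

```python
import unicodedata

# The four Spanish nominalizing suffixes named in Section 5.1; the paper's
# "etc." means a real system would need a longer list.
NOMINALIZING_SUFFIXES = ("ción", "miento", "cia", "aje")

# HYPOTHETICAL stand-in for a morphological parser: word form -> possible PoS.
TOY_POS_LEXICON = {
    "apoyo": {"NOUN", "VERB"},
    "acuerdo": {"NOUN", "VERB"},
    "visita": {"NOUN", "VERB"},
    "éxito": {"NOUN"},
    "asunto": {"NOUN"},
}


def _normalize(word: str) -> str:
    """Lowercase and compose accents so 'ó' compares equal however it is encoded."""
    return unicodedata.normalize("NFC", word).lower()


def has_nominalizing_affix(noun: str) -> bool:
    """Cue 1: the head noun ends in one of the listed nominalizing suffixes."""
    return _normalize(noun).endswith(NOMINALIZING_SUFFIXES)


def is_noun_verb_form(noun: str, pos_lexicon: dict) -> bool:
    """Cue 2: the parser allows both a noun and a verb reading for the form."""
    return {"NOUN", "VERB"} <= pos_lexicon.get(_normalize(noun), set())


for head in ("exportación", "pensamiento", "apoyo", "éxito", "presencia"):
    print(head,
          has_nominalizing_affix(head),
          is_noun_verb_form(head, TOY_POS_LEXICON))
```

A purely suffix-based test over-generates: it also fires on presencia, which Table 3 lists under cousins. For candidate detection this is harmless, since both groups are potential abstract anaphors, but it means the suffix cue alone cannot reproduce the typology.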
have both a naming and encapsulating function, and are extremely common in the press, summarising the preceding co-text. Two criteria can be used to recognize a label: (i) the head noun is non-specific, and (ii) it requires lexical realization in its co-text. Depending on the semantics, they fall into two main groups:

  - Neutral labels (e.g. situación ‘situation’): they simply build a package from a stretch of discourse.
  - Evaluative labels (e.g. razón ‘reason’): they inform the reader how a chunk of discourse is to be interpreted, thus adding a positive/negative evaluation to the “package.”

  A list of labels can be extracted by mining the annotated corpus. Most of the labels we obtain from AnCora-Es and AnCora-Ca coincide with those reported in Francis (1994), and the rest show semantic similarities. Although Francis (1994) points out that modification is a device for adding extra meaning to labels, modified labels were a minority in AnCora (18%), which supports the view that they are pro-forms.

5.2 Underspecified abstract antecedents

The end of Section 4 argued for the non-specificity of abstract antecedents, which becomes evident as soon as one attempts to delimit their exact boundaries. This point is supported by Botley (2006), who reports that “indirect anaphora definitely poses difficulties for corpus-based linguistics, in that almost 30% of abstract reference cases analysed were hard to classify straightforwardly. This is because antecedents lack clear surface linguistic boundaries.” Francis (1994) also comments on the same: “Labels do not necessarily refer to a clearly delimited or identifiable stretch of discourse. It is the shift in direction signalled by the label and its immediate environment which is of crucial importance for the development of the discourse.”

Both the reliability study and the corpus analysis seem to suggest that rather than a dichotomy, specificity constitutes a continuum, extending along the range of boundaries that the antecedent can take, from specific boundaries (1) to fuzzy ones (8). I believe that the theory that best accounts for this reality is that of underspecification, which has been proposed for some lexical ambiguities like homonymy and polysemy. Poesio et al. (2006) have extended it to anaphora: “With certain types of ambiguity the ambiguous expressions may be left unresolved in the right context.” They argue that the final interpretation of some (mereological) pronouns in dialogues is not fully specified, but only “good enough” for the listener’s purposes. They give (9) as an example, where it is not clear whether the pronoun that refers to the orange juice which has been loaded into the tanker car, or the tanker car itself, or indeed whether that matters. This leads them to formulate the Justified Sloppiness Hypothesis.

(9) so then we’ll . . . we’ll be in a position to load the orange juice into the tanker car . . . and send that off

According to this hypothesis, there are cases when an anaphor has two potential antecedents x and y, which are elements of an underlying mereological structure with summum x⊕y. There is still a fourth interpretation that can be derived from such a summum: z ⊑ (x⊕y), which is a p-underspecified interpretation in which the anaphor is interpreted as denoting an element
z included in the summum. What is crucial is
6 Conclusion
that all four possible interpretations are equiv-
This paper takes an empirical approach to dis-
alent for the purposes of the plan.
course deixis in Catalan and Spanish, opening
Following this line, I take the account further
the field to these two languages. Emphasis was
by adapting the Justified Sloppiness Hypothesis
put on the need for complementing theoreti-
to cover discourse deixis for both pronouns and
cal accounts based on a limited set of exam-
full NPs. Instead of four we have three possible
ples. The coreference annotation in the AnCora
interpretations:
corpora includes discourse segments and thus
(i) x is the largest/maximal discourse segment.
offers a significant amount of data on the ba-
sis of which we can approach this topic from
(ii) y is the shortest/minimal discourse seg-
a bottom-up perspective and from the point
ment.
of view of coreference resolution. Such an ap-
(iii) y z x, in which z is a p-underspecified in-
proach lays bare the complexities of abstract
terpretation denoting a discourse segment
anaphora and shows the shortcomings of former
whose extension lies between the minimal
theoretical claims.
y and the maximal x.
The corpus study presented here differs from
Again, the three interpretations are equivalent for the purposes of communication. It might be possible to establish a mapping between anaphor form and specificity of antecedent, e.g. complex NPs usually specify the antecedent whereas labels tend to leave their antecedent p-underspecified.
Underspecification provides the theoretical framework for which we do not have to consider instances of discourse deixis with a non-well delimited discourse segment as linguistically incorrect, but as wholly legitimate references whose role in the discourse does not require that they be fully specified. Therefore, I conclude that annotation efforts should not assume that anaphoric expressions referring to an abstract object always have a clearly identifiable antecedent.
Concerning the automatic resolution of discourse deixis, the fact that only 5% of all coreference links involve discourse deixis has a twofold effect: on the one hand, automatically delimiting exact abstract antecedents will not be very helpful; on the other hand, successfully detecting discourse-deictic NPs can stop them from being included in a wrong chain, and thus have beneficial effects on the overall performance of the coreference resolution system. With the help of a morphological parser and the extracted list of labels, if a nominalization or a label is encountered by the system whose head does not match any previous NP but matches a previous verbal form, then it is more likely to be discourse deictic than non-coreferent or coreferent with an NP.
former work in several aspects. Apart from obtaining data for two languages not studied in this respect, it is not limited to pronouns or demonstrative NPs, but includes both pronouns (personal, demonstrative, relative and zero) and full NPs (definite, demonstrative and possessive). It deals with written discourse, unlike most existing work on dialogues. Finally, the annotated dataset is much larger than those used so far.
The coding scheme is in accordance with real data, thus covering discourse deixis as it occurs in the two Romance languages under analysis. Two datasets of 100 000 words each served to return to four assumptions made by Webber (1988) and point out the problematic issues with the help of real examples. On the basis of the annotated corpora, the resolution of discourse deixis was divided into two different tasks: detecting the units that are likely to be abstract anaphors, and delimiting the boundaries of abstract antecedents. Given that discourse-deictic NPs are found to represent less than 5% of all coreferent NPs, I stressed that the main focus of attention should be the detection of abstract anaphors rather than the delimitation of the exact boundaries of their antecedents.
My main claim was that abstract anaphors conform to certain criteria: they are either one of a specific set of pronouns or full NPs in the form of nominalizations, cousins, or labels. A set of labels was extracted by mining the corpus, which largely overlapped with those reported by Francis (1994). As for abstract antecedents, the non-specificity that makes its delimitation so difficult was accounted for with the semantic theory of underspecification according to Poe-
sio et al. (2006). Hence, I proposed that they lie on a continuum, from fully specified to p-underspecified, and that it is legitimate for them to be left underspecified.
From a computational perspective, the data discussed and suggested insights open new avenues for the automatic resolution of discourse deixis and coreference resolution by extension. Computational approaches should bear in mind that not all references require to be fully specified for successful communication, and so annotation efforts must not insist on setting fixed boundaries in every case. Whereas it is possible for a coreference resolution system to detect abstract anaphors with the help of a morphological parser and an extracted list of labels, there is no point in trying to delimit the exact antecedent when it is underspecified. Detection alone can result in a statistically significant improvement on the overall performance of the system by reducing the number of false positive links.
Acknowledgements
Thanks to all the annotators who participated in the reliability study: Irene Carbó, Sandra García, Iago González, Esther López, Jesús Martínez, Laura Muñoz, and especially to Isabel Briz and Montse Nofre for annotating the AnCora corpora.
This work was supported by FPU-2006-08 grant from the Spanish Ministry of Education and Science, and Lang2World (TIN2006-15265-C06-06) – subproject of Text-Mess.
References
Ron Artstein and Massimo Poesio. 2006. Identifying reference to abstract objects in dialogue. In Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue (BRANDIAL 2006), pages 56–63.
Simon Botley. 2006. Indirect anaphora: Testing the limits of corpus-based linguistics. International Journal of Corpus Linguistics, 11(1):73–112.
Donna K. Byron and James F. Allen. 1998. Resolving demonstrative anaphora in the TRAINS93 corpus. In Proceedings of the 2nd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 1998), pages 68–81.
Donna K. Byron. 2002. Resolving pronominal reference to abstract entities. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 80–87.
Herbert H. Clark. 1977. Bridging. In P.N. Johnson-Laird and P.C. Wason, editors, Thinking: Readings in Cognitive Science. Cambridge University Press, Cambridge.
Miriam Eckert and Michael Strube. 2000. Dialogue acts, synchronising units and anaphora resolution. Journal of Semantics, 17(1):51–89.
Gill Francis. 1994. Labelling discourse: an aspect of nominal-group lexical cohesion. In M. Coulthard, editor, Advances in Written Text Analysis, pages 83–101. Routledge, London.
Kari Fraurud. 1992. Processing Noun Phrases in Discourse. Ph.D. thesis, Department of Linguistics, Stockholm University.
Lauri Karttunen. 1976. Discourse referents. In J. McCawley, editor, Syntax and Semantics, volume 7, pages 363–385. Academic Press, New York.
Costanza Navarretta and Sussi Olsen. 2008. Annotating abstract pronominal anaphora in the DAD project. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Massimo Poesio and Ron Artstein. 2008. Anaphoric annotation in the ARRAU corpus. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
Massimo Poesio, Patrick Sturt, Ron Artstein, and Ruth Filik. 2006. Underspecification and anaphora: Theoretical issues and preliminary evidence. Discourse Processes: a multidisciplinary journal, 42:157–175.
Sameer S. Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of the 1st IEEE International Conference on Semantic Computing (ICSC 2007), pages 446–453.
Marta Recasens, M. Antònia Martí, and Mariona Taulé. 2007. Text as scene: Discourse deixis and bridging relations. Procesamiento del Lenguaje Natural, 39:205–212.
Hans-Jörg Schmid. 2000. English Abstract Nouns as Conceptual Shells. From Corpus to Cognition. Mouton de Gruyter, Berlin.
Mariona Taulé, M. Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco.
Renata Vieira, Susanne Salmon-Alt, Caroline Gasperin, Emmanuel Schang, and Gabriel Othero. 2002. Coreference and anaphoric relations of demonstrative noun phrases in a multilingual corpus. In Proceedings of the 4th Discourse Anaphora and Anaphor Resolution Colloquium.
Bonnie L. Webber. 1988. Discourse deixis: reference to discourse segments. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pages 113–122.
Bonnie L. Webber. 1991. Structure and ostension in the interpretation of discourse deixis. Language and Cognitive Processes, 6(2):107–135.
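The detection step can be illustrated with a minimal sketch of the heuristic stated in the paper: an NP headed by a nominalization or a label is flagged as a likely abstract anaphor when its head matches no previous NP head but does match a previous verbal form. Everything named in the code is an assumption made for illustration: the lookup table stands in for the morphological parser, the label set stands in for the list mined from the corpus, and the English lemmas are placeholders for Catalan or Spanish ones. It is not the system evaluated in this paper.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a morphological parser: maps a noun lemma
# to the verb lemma it derives from (illustrative entries only).
NOUN_TO_VERB = {
    "declaration": "declare",
    "modification": "modify",
    "decision": "decide",
}

# Hypothetical stand-in for the list of labels extracted from the corpus.
LABELS = {"decision", "problem", "fact"}


@dataclass
class DiscourseState:
    """Lemmas observed so far in the text being processed."""
    np_heads: set = field(default_factory=set)
    verb_lemmas: set = field(default_factory=set)


def is_likely_discourse_deictic(head: str, state: DiscourseState) -> bool:
    """Return True if the NP head looks like an abstract anaphor."""
    is_candidate = head in NOUN_TO_VERB or head in LABELS
    if not is_candidate:
        return False  # neither a nominalization nor a label
    if head in state.np_heads:
        return False  # a matching NP antecedent is available
    verb = NOUN_TO_VERB.get(head)
    # Labels with no verbal base are left undecided by this rule.
    return verb is not None and verb in state.verb_lemmas


state = DiscourseState(np_heads={"minister", "euro"}, verb_lemmas={"declare"})
print(is_likely_discourse_deictic("declaration", state))  # True
```

A coreference resolver could call such a check before chain building and withhold flagged NPs from NP-to-NP linking, which is how detection alone would reduce false positive links.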