The VENEX corpus of anaphora and deixis in spoken and written Italian

Massimo Poesio†, Rodolfo Delmonte§, Antonella Bristot§, Luminita Chiran§, and Sara Tonelli§
† University of Essex
§ Università di Venezia

Abstract

The VENEX corpus is a corpus of Italian annotated with information about anaphora and deixis, created in a joint project between the Università di Venezia and the University of Essex. The corpus includes both texts (articles from a financial newspaper) and dialogues (an Italian version of the MapTask corpus). The annotation scheme is an almost complete implementation of the scheme proposed in MATE, and the markup scheme is the simplified form of standoff adopted in the MMAX annotation tool.

1. INTRODUCTION

The MATE ‘meta-scheme’ for anaphora annotation (Poesio et al., 1999) is one of the annotation schemes developed as part of the MATE project (McKelvie et al., 2001). The MATE proposals have served as the basis for a number of annotation projects, especially the development of the GNOME corpus (Poesio, 2000; Poesio et al., 2004b), as well as for the development of tools for anaphora annotation, such as MMAX (Müller and Strube, 2003). However, not all aspects of the original recommendations have been tested. Aspects never tried before include the recommendations for marking anaphoric reference to landmarks in MapTask-style situations (Anderson et al., 1991), and for dealing with anaphoric elements that are either unrealized or incorporated in other elements, as happens, e.g., in Italian. From the point of view of annotation technology, the GNOME annotation did not adopt one of the central aspects of the MATE proposals, standoff; in addition, in MATE it was recommended to do anaphoric annotation off the output of a parser, whereas in GNOME markables were identified by hand.

All of these aspects of the proposals have been tested during the creation of the VENEX corpus, a joint project between the Università di Venezia and the University of Essex. The corpus includes both texts and dialogues. The annotation scheme is an almost complete implementation of the scheme proposed in MATE, and the markup scheme is the simplified form of standoff adopted in the MMAX annotation tool. The work on VENEX has led to a number of developments of the original proposals, as well as to a re-examination of a number of their aspects. In this paper, we consider some of the issues raised by this work, and discuss how they were addressed.

2. THE MATE ANNOTATION SCHEME FOR ‘COREFERENCE’

We summarize in this section the most distinctive features of the MATE scheme. A complete description of the MATE scheme is available from the MATE project pages at http://mate.nis.sdu.dk/.

2.1. The Markup Scheme

The core aspect of the MATE proposals is the scheme for marking up anaphoric relations in XML. As in the MUC scheme (MUCCS) (Hirschman, 1998), it is assumed that annotation of anaphora is best separated into two steps: first the markables (the text constituents that realize semantic objects that may enter into anaphoric relations) are agreed upon, then anaphoric relations between them are marked. The main difference from the MUC scheme is that whereas in MUCCS anaphoric relations are annotated using attributes on the markables, in the MATE scheme, following the recommendations of the Text Encoding Initiative (Burnard and Sperberg-McQueen, 2002) and of Bruneseaux and Romary (1998), the distinction between these two steps of annotation is mirrored by a distinction between two XML elements: de, used to indicate the markables, and link, used to mark information about these relations. link elements are structured elements, containing one or more anchor elements. The link element specifies the anaphoric expression (using XML’s HREF mechanism) and the relation between the anaphoric expression and its antecedent, whereas the anchor element specifies the antecedent.

(1) coref.xml

    <de ID="de_01">we</de>’re gonna take
    <de ID="de_07"> the engine E3 </de>
    and shove <de ID="de_08"> it </de> over
    to <de ID="de_02">Corning</de>,
    hook <de ID="de_09"> it </de> up to
    <de ID="de_03">the tanker car</de>...
    <link href="coref.xml#id(de_07)"
          type="ident">
      <anchor href="coref.xml#id(de_08)"/>
    </link>
    <link href="coref.xml#id(de_08)"
          type="ident">
      <anchor href="coref.xml#id(de_09)"/>
    </link>

The design of the MATE workbench was strongly inspired by the concept of standoff annotation introduced for the MapTask. The main principle of standoff annotation is that each level of annotation (for example, syntactic annotation, dialogue act annotation, and anaphoric annotation) should be stored independently; in this way, annotators working on one level need not be concerned about the other levels of annotation, and can start immediately without having to wait for other annotation tasks to be completed. The separate levels of annotation are synchronized via a base file, to which the separate levels point using the HREF mechanism of XML. For the coreference scheme, as well, it was proposed that link elements should be kept in separate files pointing at the file in which the de elements were indicated.

2.2. Instantiations of the Meta-Scheme

One of the most important assumptions behind the design of the MATE proposals for anaphoric annotation is the belief that given the variety of phenomena that go under the name of anaphora, and the variety of possible applications, there can be no such thing as a general-purpose scheme for anaphoric annotation. Instead, it was shown how the basic mechanisms discussed above could be used to implement different types of anaphoric annotation, including some of the most popular schemes for ‘coreference annotation’, such as MUCCS, Passonneau’s DRAMA scheme (1997), and the scheme used for annotation of references to landmarks in the MapTask corpus.

The Core Scheme. In the most basic type of coreference scheme, such as MUCCS, only anaphoric relations between NPs are considered, and only identity relations. Schemes of this type can be implemented by allowing for only one anaphoric relation, IDENT. The remaining differences between the schemes then have mostly to do with the instructions to annotators: for example, which types of anaphoric relations are to be considered as cases of ‘identity’ (see van Deemter and Kibble (2000) for some problems with the choices made in MUC).

Extended Relations. DRAMA extends such schemes with ways of annotating associative relations. References of this type can be annotated in the MATE markup scheme using additional relations, as in (2).

(2) a.

    F: Alors donc / vous avez / ici /
       LES MODELES DE FUSEES /
    M: Oui
    F: Et vous allez essayer de vous
       mettre d’accord sur un classement
       /hein classer
       LES FUSEES QUI ONT BIEN VOLÉ ou
       QUI ONT MOINS BIEN VOLÉ

    b.

    F: Alors donc / vous avez / ici /
       <de ID="de_88"> les modèles de fusées </de>
    M: Oui
    F: Et vous allez essayer de vous mettre d’accord
       sur un classement /hein classer
       <de ID="de_89"> les fusées qui ont
       bien volé </de>
       ou <de ID="de_90"> qui ont
       moins bien volé </de>

    <link href="coref.xml#id(de_89)"
          type="subset">
      <anchor href="coref.xml#id(de_88)"/>
    </link>
    <link href="coref.xml#id(de_90)"
          type="subset">
      <anchor href="coref.xml#id(de_88)"/>
    </link>

It was pointed out, however, that the results of Poesio and Vieira (1998) indicated that this type of annotation could be highly unreliable.

References to the Visual Situation. A special universe element was suggested for MapTask-style annotations of references to visible objects. The universe element contains one ue element for each object in the visual scene; including such elements in an annotation makes it possible to use link elements to annotate references to such objects. Cases in which the participants in a conversation have different visual situations, as in the MapTask dialogues, can be handled by having separate universes, one for each participant in the conversation. In addition, a WHO-BELIEVES attribute of link elements was proposed to represent situations in which only one participant believes that a particular anaphoric relation holds, as in the following example, where only the Follower believes that the ‘gold mine’ refers to the same object as the ‘diamond mine’.

(3) a.

    GIVER: Do_you have diamond_mine.
    FOLLOWER: Yes I’ve got a gold_mine.
    GIVER: Ah. S--.
    FOLLOWER: ....
    GIVER: You don’t have diamond_mine though.
    FOLLOWER: No. It’s a gold_mine according to
    this one.
    Presumably that’s the same.
    GIVER: Well I’ve got a gold_mine as well
    you see. (MT)

    b. coref.xml:

    <universe ID="common">
    <ue ID="ue2"> gold mine </ue>
    ....
    </universe>
    <universe ID="GIVER_universe"
    modifies="common">
    <ue ID="ue1"> diamond mine </ue>
    ...
    </universe>
    <universe ID="FOLLOWER_universe"
    modifies="common">
    ....
    </universe>

    GIVER: Do_you have
    <de ID="de_20"> diamond_mine. </de>
    FOLLOWER: Yes I’ve got
    <de ID="de_21"> a gold_mine. </de>
    GIVER: Ah. S--.
    FOLLOWER: ....
    GIVER: You don’t have
    <de ID="de_22"> diamond_mine </de>
    though.
    FOLLOWER: No.
    It’s <de ID="de_23"> a gold_mine</de>
    according to this one.
    Presumably <de ID="de_24"> that’s </de>
    the same.
    GIVER: Well I’ve got
    <de ID="de_25"> a gold_mine </de>
    as well you see.

    <link href="coref.xml#id(de_20)"
          type="ident"
          who-believes="G">
      <anchor href="coref.xml#id(ue1)"/>
    </link>
    <link href="coref.xml#id(de_21)"
          type="ident"
          who-believes="F">
      <anchor href="coref.xml#id(ue2)"/>
    </link>
    <link href="coref.xml#id(de_21)"
          type="ident"
          who-believes="F">
      <anchor href="coref.xml#id(de_20)"/>
    </link>

2.3. Instructions for Identifying Markables

One of the novel aspects of the MATE instructions was the concern for markable identification in languages other than English. One such issue was how to deal with incorporated clitics and empty subjects. The suggestion contained in the MATE guidelines was to use a separate element, seg, to turn verbs into non-nominal markables, as in the following example:

(4) coref.xml:

    A: Dov’è <de ID="de_157">Gianni?</de>
    [Where is Gianni?]
    B: <seg type="pred" ID="seg_158">è
    andato a mangiare </seg>
    [_ went to have lunch]

    <link href="coref.xml#id(seg_158)"
          type="ident">
      <anchor href="coref.xml#id(de_157)"/>
    </link>

The seg element was also meant to be used in more ambitious schemes as a general mechanism for specifying non-nominal markables, e.g., to indicate the antecedents of discourse deixis, or for ellipsis.[1]

[1] A second range of issues considered in the MATE scheme had to do with dialogue phenomena, such as non-contiguous elements; we will not consider these issues here.

3. AN INSTANCE OF THE MATE META-SCHEME: GNOME

Ideas from the MATE ‘scheme’ have been adopted and tested both in annotation projects, such as the development of the GNOME corpus, and by the developers of annotation tools. The GNOME corpus was developed to study discourse properties claimed to affect the way discourse entities are realized, including definiteness (Poesio, 2004) and salience, particularly as formalized in Centering theory (Poesio et al., 2004b) and Grosz and Sidner’s theory of the attentional state (Poesio and Di Eugenio, 2001). These studies were in part motivated by work on natural language generation, and fed into a series of papers studying sentence planning (Poesio, 2000; Henschel et al., 2000; Cheng et al., 2001) and text planning (Karamanis, 2003; Kibble and Power, 2003). The corpus is also being used to study anaphora resolution (Poesio and Alexandrov-Kabadjov, 2004), with a special focus on the resolution of bridging references (Poesio, 2003; Poesio et al., 2004a). This work led both to the development of a detailed coding manual for the parts of the MATE proposals incorporated in the GNOME scheme, and to further developments. In this section we briefly discuss how the MATE scheme was used and further developed in GNOME, particularly as far as the annotation of bridging references and deixis is concerned. For further details about the GNOME corpus and for the complete annotation manual, see http://hcrc.ed.ac.uk/~gnome.

3.1. Genres

The GNOME corpus includes texts from three domains. The museum subcorpus consists of descriptions of museum objects, generally with an associated picture, and brief texts about the artists that produced them. The pharmaceutical subcorpus is a selection of leaflets providing the patients with legally mandatory information about their medicine.

3.2. Markup scheme

Several layers of information were annotated, including layout in the case of text and rhetorical structure in the case of tutorial dialogues, sentences and potential utterances, noun phrases, a variety of attributes of the objects denoted by noun phrases,[2] and anaphoric relations. We concentrate here on anaphoric information, and refer the reader to the manual available from http://hcrc.ed.ac.uk/~gnome for the other types of annotation.

[2] E.g., whether an NP denoted generically or not; whether it denoted an animate or inanimate entity, as well as other ontological properties; and whether it denoted a discourse entity, a quantifier, or a predicate. In the case of a discourse entity, we also annotated whether it denoted an atom, a set, or a mass term; and whether it denoted uniquely.

The parts of the GNOME annotation scheme that have to do with anaphora are based on
the ‘Core’ and the ‘Extended Relations’ instantiations of the MATE meta-scheme. The markup scheme for markables and anaphoric relations adopted in GNOME follows very closely that proposed in MATE, except that the de element was renamed ne, and the link element was renamed ante. More substantial differences are the decision not to use standoff, and the introduction of new elements necessary for the study of salience, such as elements that could be used to investigate the notion of UTTERANCE used in Centering (Poesio et al., 2004b).

Although standoff is a clear improvement over including all annotation levels in a single file, our own experiences during the creation of the GNOME corpus being further proof of this, it is only really possible when tools are available both to create the annotation and, crucially, later to ‘knit back’ the separate levels when needed. As neither the MATE workbench nor any other tools based on standoff were available by the time the GNOME annotation started,[3] in GNOME we didn’t use standoff, but integrated all levels of annotation in one file; an Emacs minor mode extending SGML-mode, GNOME-mode, was developed.[4]

[3] In the end lack of time prevented the inclusion of a tool for anaphoric annotation in the released workbench.

[4] GNOME-mode provides some support for introducing new elements, marking regions, and attribute editing, as well as anaphoric annotation.

The main new aspect of the markup scheme, especially as far as our studies of salience were concerned, is the inclusion of elements used to annotate potential utterances in the sense of Centering (Grosz et al., 1995). In order not to prejudge the answer to the question of which text constituents are best viewed as utterances, we used a ‘generic’ element called unit to mark up finite and non-finite clauses, but also parentheticals and elements of bulleted lists.

3.3. Bridging References and Deixis

Apart from the relation of identity, in GNOME we were concerned with bridging references and deictic reference, hence the annotation scheme incorporated aspects of the ‘Extended Relations’ and the ‘MapTask’ instantiations of the MATE meta-scheme.

One of our aims was to continue the work on bridging references annotation and interpretation in (Poesio and Vieira, 1998; Vieira and Poesio, 2000), which showed that marking up bridging references is quite hard. In addition, work such as (Sidner, 1979; Strube and Hahn, 1999) suggested that indirect realization can play a crucial role in maintaining the CB. After testing a few types of associative reference (Hawkins, 1978), we decided to annotate only three non-identity relations, as well as identity. These relations are a subset of those proposed in the ‘extended relations’ version of the MATE scheme: set membership (ELEMENT), subset (SUBSET), and ‘generalized possession’ (POSS), which includes both part-of relations and ownership relations.

In our preliminary attempts at annotating deictic references we used a technique similar to the ‘Universe’ scheme developed in MATE. However, we quickly realized that, first of all, ‘real’ pictures cannot be decomposed into ‘objects’ as easily as the maps used in the MapTask, hence asking the annotators to identify specific objects as the referents of deictic references was quite hard. Secondly, none of the studies we intended to carry out actually required this identification; all that was needed, e.g., to study the use of demonstratives, was to know whether a reference was deictic or not. As a result, we used a boolean DEICTIC attribute.

3.4. Coder manual

Perhaps the most important aspects of the GNOME annotation are the development of detailed instructions for annotators (see http://hcrc.ed.ac.uk/~gnome) and the reliability experiments testing several aspects of the scheme, particularly bridging references.

The identification of sentences, units and markables was done entirely by hand, without encountering particular problems. All attributes of sentences, units and nes in the final version of the scheme, including DEIX, can be annotated reliably. In order to achieve reliability on anaphoric annotation, the range of anaphoric phenomena considered was restricted in many ways. Apart from marking a limited number of associative relations, the annotators only marked relations between objects realized by noun phrases and not, for example, anaphoric references to actions, events or propositions implicitly introduced by clauses or sentences. We also gave strict instructions to our annotators concerning how much to mark. We found rather good agreement on identity relations. In our most recent analysis (two annotators looking at the anaphoric relations between 200 NPs) we observed no real disagreements; 79.4% of these relations were marked up by both annotators; 12.8% by only one of them; and in 7.7% of the cases, one of the annotators marked up a closer antecedent than the other. Concerning associative references, limiting the relations did limit the disagreements among annotators (only 4.8% of the relations are actually marked differently) but only 22% of bridging references were marked in the same way by both annotators; 73.17% of relations are marked by only one or the other annotator. Reaching agreement on this information involved several discussions between annotators and more than one pass over the corpus (Poesio, 2000).

4. VENEX: GOALS, MARKUP, AND ANNOTATION SCHEME

The general goals of the VENEX annotation were to produce a resource that could be used to study both pronominal and full NP anaphora in Italian, both from an anaphora resolution and from an anaphora generation perspective. More specific goals included conducting for Italian a study of the effect of Centering on pronominal anaphora analogous to (Poesio et al., 2004b), also taking into account the work of (Di Eugenio, 1998); and a study of definite description use like (Poesio and Vieira, 1998), but looking also at deictic references to the visual situation in dialogues. The ultimate goal is to use the corpus to test anaphora resolution algorithms for Italian.

From a linguistic point of view, the most substantial difference between the VENEX annotation and the annotation effort in GNOME is the inclusion of dialogue data and the need to consider a variety of forms of anaphoric reference not present in English. This made it necessary to consider issues addressed in the MATE guidelines but not relevant for GNOME. The VENEX annotation scheme incorporates aspects of all three instantiations of the MATE meta-scheme (core, extended relations, and references to the visual situation) as well as the suggestions for dealing with clitics, zero anaphora, and misunderstandings.

The VENEX annotation also goes beyond GNOME in that more modern annotation technology is used, in two respects: markables are identified automatically as far as possible, and data are stored in a standoff format, using a modern annotation tool (MMAX).

4.1. The Data

The VENEX annotation effort builds on the results of two separate corpus-annotation initiatives: SI-TAL (Montemagni, 2000), concerned with the creation of a corpus of written Italian text from financial newspapers (Il Sole 24 Ore) comparable to the Wall Street Journal corpus; and IPAR (Bristot et al., 2000), a continuation of the previous API and AVIP projects, whose result was a collection of spoken task-oriented dialogues of speakers performing the MapTask (Anderson et al., 1991). The VENEX corpus consists of 30 SI-TAL newspaper articles and 6 IPAR dialogues.

4.2. The Annotation Tool

As the choice of the annotation tool played an important role in the design of the markup scheme, we will discuss this first. Having observed the difficulties annotators had during GNOME, we wanted to use a proper annotation tool, possibly one based on standoff technology. No annotation tool implementing the MATE or GNOME schemes as described exists; but in the years after the development of the MATE guidelines, tools supporting XML standoff annotation for coreference have appeared, including MMAX from EML (Müller and Strube, 2003) and the Annotator from ILSP. Although the format used for storing anaphoric information by these tools is not entirely satisfactory, the files they produce can be easily converted into MATE format.

The tool used in VENEX, MMAX, is based on a simplified standoff format without href references to the base file. Three main files are maintained for each annotated file in the corpus: a base file containing the words, a file specifying how the text is broken up (into paragraphs and sentences in the case of written text, into turns in the case of dialogues), and a file identifying markables. A special .anno file records the names of the three files, which have to be kept in the same directory. The word level and markable level files for one of the files in the VENEX corpus, napoli-05, are shown in Fig. 1.

Word files contain one word element per token, with a unique ID. Text files for written text consist of a text element containing one or more paragraph elements, in turn containing one or more sentence elements with a unique id and a span indicating the words belonging to the sentence. Turn files for dialogue consist of a turns element including one or more turn elements which, in addition to id and span, contain an optional speaker attribute. Examples of both types of files are given below.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE words SYSTEM "text.dtd">
    <text id="1">
    <paragraph id="paragraph_1">
    <sentence id="sentence_1" span="word_1..word_13"/>
    </paragraph>
    ....
    </text>

    <?xml version="1.0" encoding="iso-8859-1"?>
    <turns>
    <turn id="turn0" span="word_1..word_19"
    speaker="g001"/>
    .....
    </turns>

The most important file for our purposes is the markable file, which contains a markables element
containing one or more markable elements. In addition to id and span attributes, markables have two special attributes used to store anaphoric information: the MEMBER attribute, used to indicate membership in a coreference chain (a coreference equivalence class), and the POINTER attribute, used to mark up to one associative anaphoric relation for each anaphoric expression. Any number of additional attributes for markable elements can be specified by users via a COREFERENCE SCHEME.

4.3. Adapting the MATE Markup Scheme For Use with MMAX

The MATE / GNOME markup scheme had to be adapted in a number of ways to be usable with MMAX. The first issue was that MMAX doesn’t support link elements (anaphoric information is stored with markables), so in the VENEX annotation we had to use markable attributes to represent information that in MATE and GNOME-style markup schemes would have been encoded as part of the link elements. We used a separate attribute to specify the type of associative relation expressed by the POINTER attribute, and a SPACE attribute to encode the information stored in the WHO-BELIEVES attribute of links (see below). In addition, only one MEMBER and one POINTER attribute can be specified for each markable. This latter limitation wasn’t much of a problem, given that the annotation instructions used in VENEX are derived from those developed for GNOME and also attempt to limit annotators to marking at most one identity and one bridging relation for each anaphoric expression. The separation of attributes of links proved, however, a problem, as annotators often forget to annotate one or the other.

A second problem is that the version of MMAX we used (0.94) only allows for one type of markable, meaning that unit elements could not be annotated, and instead of using separate ne and seg elements for nominal and non-nominal markables, a single markable had to be used (see below).[5] A boolean attribute SOGGETTO VUOTO was used; SOGGETTO VUOTO = “none” for normal ne markables, whereas SOGGETTO VUOTO = “true” for markable elements that do the job of seg elements, marking verbal elements that contain incorporated clitics or whose subject is dropped.

[5] The version of MMAX currently being developed will allow for multiple markables.

The complete list of user-defined markable attributes currently being annotated is in Table 1. Two attributes, ANAPHORIC SPACE and BRIDGING SPACE, are used to realize the WHO-BELIEVES attribute of the link element in the MATE scheme. For the moment, deictic function has been implemented with a simple boolean attribute, as done in GNOME.

    Attribute          Brief explanation
    np form            Type of NP (the-np, etc.)
    soggetto vuoto     Boolean: "true" for seg elements
    is anaphoric       Whether the head predicate of the NP has already been used to describe the entity
    anaphoric space    Which discourse models contain the identity relation
    is bridging        Type of associative relation (part-of, element, set, attribute)
    bridging space     Which discourse models contain the bridging relation
    function           Whether the NP refers deictically

Table 1: Markable attributes

5. VENEX: ANNOTATION METHODOLOGY

5.1. Parsing

Whereas in the GNOME annotation annotators had to add markables by hand, the VENEX annotation is more like the type of annotation originally envisaged in MATE, whereby markables have largely been identified automatically, and then corrected by hand.

Both the written and spoken corpora were tokenized, POS-tagged and parsed using a suite of tools developed at the Università di Venezia, including the IMMORTALE POS-tagger (Delmonte and Piana, 1999) and the GETA-RUN parser (Delmonte, 2002). A morphological analyzer extracts a number of features, including agreement features; GETA-RUN builds a complete constituent and functional structure. The output of the parser is then corrected semi-automatically using separate annotation tools.

A series of scripts converts the (different) formats used in SI-TAL and in IPAR into the MMAX format, identifies the markables, and automatically computes the type of the NPs.

5.2. Annotation procedure

A new and detailed coding manual was produced for the project. The instructions are mostly derived from those developed for GNOME; we briefly summarize here the main differences.

Markable correction. The initial list of markables has to be corrected by hand. Although many details of this correction task are likely to be specific to the particular parser used, a few problems will probably have to be considered by other similar annotations as well.
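
As a rough illustration of the standoff layout described in Sections 4.2 and 4.3, the following Python sketch resolves markable spans of the form word_a..word_b against a word file and groups markables into coreference chains by their MEMBER value. It is a minimal sketch under stated assumptions, not the MMAX file format itself: the words root element, the lowercase attribute name member, the set_1 value, and the sample tokens are all hypothetical; only the id/span attributes and the word_1..word_13 span notation come from the examples above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature word file: one word element per token, each with a unique id.
WORDS_XML = """<words>
  <word id="word_1">Gianni</word>
  <word id="word_2">mangia</word>
  <word id="word_3">una</word>
  <word id="word_4">mela</word>
  <word id="word_5">.</word>
  <word id="word_6">Lui</word>
  <word id="word_7">ride</word>
</words>"""

# Hypothetical markable file; 'member' stands in for the MEMBER attribute (assumed spelling).
MARKABLES_XML = """<markables>
  <markable id="markable_1" span="word_1" member="set_1"/>
  <markable id="markable_2" span="word_3..word_4"/>
  <markable id="markable_3" span="word_6" member="set_1"/>
</markables>"""


def word_index(word_id):
    """Return the numeric part of an id such as 'word_13'."""
    return int(word_id.rsplit("_", 1)[1])


def resolve_span(span, words):
    """Expand 'word_a..word_b' (or a single 'word_a') into the covered tokens."""
    first, _, last = span.partition("..")
    start = word_index(first)
    end = word_index(last) if last else start
    return [words[f"word_{i}"] for i in range(start, end + 1)]


def coreference_chains(words_xml, markables_xml):
    """Group the text of markables by their (assumed) 'member' attribute."""
    words = {w.get("id"): w.text for w in ET.fromstring(words_xml)}
    chains = defaultdict(list)
    for markable in ET.fromstring(markables_xml):
        member = markable.get("member")
        if member is not None:  # markables without 'member' belong to no chain
            tokens = resolve_span(markable.get("span"), words)
            chains[member].append(" ".join(tokens))
    return dict(chains)


chains = coreference_chains(WORDS_XML, MARKABLES_XML)
print(chains)
```

Because the markable file only stores word identifiers, the base text can be re-tokenized or corrected without touching the anaphoric annotation, provided the identifiers stay stable; this is the practical benefit of keeping the levels separate.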

One obvious problem is posed by incorporated clitics and empty subjects; the parser cannot identify those, so in these cases the annotators have to mark an element of the verbal complex as a markable. (We saw above that whereas the MATE scheme uses a separate seg element for this purpose, in VENEX the same type of markable is used in both cases, but with an attribute specifying the type of markable.) The parser also misses a few nominal markables, especially pronouns in possessive position and nominals in certain types of coordinated constructions.

Misunderstandings
The MapTask part of the VENEX corpus contains numerous examples like (3), where the differences between the Giver and Follower maps lead to one participant believing that two objects are anaphorically related, while the other participant either is not aware of this or doesn't believe this to be the case. We found that after a few iterations of training, our annotators were able to handle these cases properly (a more formal evaluation is underway; we hope to report the results at the meeting). Again, the only problems were caused by the fact that these attributes had to be added to markables, which sometimes led to annotators forgetting to set them. (This was only required in case the default, that an anaphoric relation was in the common ground of both participants, didn't hold.)

5.3. State of the Annotation
The entire corpus has been annotated; we are currently running new reliability studies, and will then revise the annotation. We expect the work to be completed in the Summer, and the corpus to be made available at the end of the year or early next year.

6. DISCUSSION
We briefly discussed the annotation scheme and methods used to create the VENEX corpus. This experience has prompted a reconsideration of the original MATE recommendations for anaphoric annotation. We discuss a few issues directly related to the question of using XML for this type of annotation.

LINK elements
Our experience with VENEX suggests that having a separate link element would be very useful; in fact, we had not originally considered two of the most beneficial aspects that would derive from this. First of all, separate link elements can be used to mark general semantic relations, not just anaphoric relations (for more complex types of semantic annotation). Secondly, and perhaps most importantly, grouping all attributes relevant to links into a single element makes it harder for annotators to forget to fill in aspects of the annotation. Unfortunately, at the moment there is no tool that can be used to create this type of annotation directly.

Anaphoric relations
One aspect of the markup scheme that needs revision is the placement of the relation. One problem we observed in GNOME is that often the ambiguity is not simply between two possible antecedents each of which stands in the same relation to the anaphoric expression, but between two antecedents which stand in different relations. In the pharmaceutical texts, for example, it is often unclear whether a particular mention of the medicine under consideration refers to the generic product, or to the particular instance that the user has in their hands. In this case, we would want annotators to mark the anaphoric expression as IDENT with one object, and ELEMENT of the other (ELEMENT is also used in GNOME for relations between instances and types), as follows, but this is not possible in either the original MATE scheme or in the GNOME markup scheme:

(5)
<ante current="ne1">
<anchor ID="ne2" rel="ident"/>
<anchor ID="ne3" rel="element"/>
</ante>

Ambiguity
Offering annotators the opportunity to annotate anaphoric ambiguity is essential, especially for annotations used to study linguistic phenomena, but raises serious theoretical and practical problems. A coreference chain containing such links becomes a coreference (directed) graph, in which each of the paths across the graph is a potential interpretation. While having multiple paths is not a problem as far as evaluating the results of an anaphoric resolver is concerned (any path in the graph counts as a valid solution), it is a serious problem both for scripts attempting to ensure consistency (e.g., that all references to the same object are marked as either generic or non-generic; this is of course impossible when one of the possible antecedents is generic while the other isn't) and for annotation tools (the problem is of course worsened when the tool only uses a single attribute to indicate membership in a coreference chain).

Revision
A second difficult problem is caused by cases, common in the MapTask dialogues, in which after a while a participant realizes that their previous belief that an object was identical to another object is mistaken. In these cases, the participant is arguably revising their previous beliefs; it is not clear then what should be done with the annotation of the original anaphoric information.

7. References
A. H. Anderson, M. Bader, E.G. Bard, E. Boyle,
G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. McAllister, J. Miller, C. Sotillo, H. Thompson, and R. Weinert. 1991. The HCRC Map Task corpus. Language and Speech, 34(4):351–366.
A. Bristot, L. Chiran, and R. Delmonte. 2000. Verso un'annotazione XML di dialoghi spontanei per l'analisi sintattico-semantica. In XI Giornate di Studio GFS, Multimodalità e Multimedialità nella comunicazione, pages 42–50, Padova.
F. Bruneseaux and L. Romary. 1998. Documents préparatoires pour le codage de dialogues multimodaux suivant les directives de la TEI.
L. Burnard and C. M. Sperberg-McQueen. 2002. TEI lite: An introduction to text encoding for interchange. http://www.tei-c.org/Lite.
H. Cheng, M. Poesio, R. Henschel, and C. Mellish. 2001. Corpus-based NP modifier generation. In Proc. of the Second NAACL, Pittsburgh.
R. Delmonte and E. Piana. 1999. Tag disambiguation in Italian. In Proc. of the ATALA Workshop on Treebanks, pages 41–49.
R. Delmonte. 2002. GETARUN: A parser equipped with Quantifier Raising and Anaphoric Binding based on LFG. In Proc. of LFG-02, pages 130–153.
B. Di Eugenio. 1998. Centering in Italian. In M. A. Walker, A. K. Joshi, and E. F. Prince, editors, Centering Theory in Discourse, pages 115–138. Oxford.
B. J. Grosz, A. K. Joshi, and S. Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):202–225.
J. A. Hawkins. 1978. Definiteness and Indefiniteness. Croom Helm, London.
R. Henschel, H. Cheng, and M. Poesio. 2000. Pronominalization revisited. In Proc. of 18th COLING, Saarbruecken, August.
L. Hirschman. 1998. MUC-7 coreference task definition, version 3.0. In N. Chinchor, editor, Proc. of the 7th Message Understanding Conference.
N. Karamanis. 2003. Entity coherence for descriptive text structuring. Ph.D. thesis, Univ. of Edinburgh.
R. Kibble and R. Power. 2003. Optimising referential coherence in text generation. Submitted.
D. McKelvie, A. Isard, A. Mengel, M. B. Moeller, M. Grosse, and M. Klein. 2001. The MATE workbench - an annotation tool for XML corpora. Speech Communication, 33(1-2):97–112.
L. Montemagni. 2000. The Italian syntactic-semantic treebank: Architecture, annotation, tools and evaluation. In Proc. of LINC, pages 18–27.
C. Müller and M. Strube. 2003. Multi-level annotation in MMAX. In Proc. of the 4th SIGDIAL, pages 198–207.
R. Passonneau. 1997. Instructions for applying discourse reference annotation for multiple applications (DRAMA). Unpublished manuscript, December.
M. Poesio and M. Alexandrov-Kabadjov. 2004. A general-purpose, off the shelf anaphoric resolver. In Proc. of LREC, Lisbon, May.
M. Poesio and B. Di Eugenio. 2001. Discourse structure and anaphoric accessibility. In Ivana Kruijff-Korbayová and Mark Steedman, editors, Proc. of the ESSLLI 2001 Workshop on Information Structure, Discourse Structure and Discourse Semantics.
M. Poesio and R. Vieira. 1998. A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183–216, June.
M. Poesio, F. Bruneseaux, and L. Romary. 1999. The MATE meta-scheme for coreference in dialogues in multiple languages. In M. Walker, editor, Proc. of the ACL Workshop on Standards and Tools for Discourse Tagging, pages 65–74.
M. Poesio, R. Mehta, A. Maroudas, and J. Hitzeman. 2004a. Learning to solve bridging references. Submitted.
M. Poesio, R. Stevenson, B. Di Eugenio, and J. M. Hitzeman. 2004b. Centering: A parametric theory and its instantiations. Computational Linguistics. To appear.
M. Poesio. 2000. Annotating a corpus to develop and evaluate discourse entity realization algorithms: issues and preliminary results. In Proc. of the 2nd LREC, pages 211–218, Athens, May.
M. Poesio. 2003. Associative descriptions and salience. In Proc. of the EACL Workshop on Computational Treatments of Anaphora, Budapest.
M. Poesio. 2004. An empirical investigation of definiteness. In S. Kepser, editor, Proc. of the International Conference on Linguistic Evidence, Tübingen, January. University of Tübingen, SFB 441.
C. L. Sidner. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. Ph.D. thesis, MIT.
M. Strube and U. Hahn. 1999. Functional centering. Computational Linguistics, 25(3):309–344.
K. van Deemter and R. Kibble. 2000. On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4):629–637. Squib.
R. Vieira and M. Poesio. 2000. An empirically-based system for processing definite descriptions. Computational Linguistics, 26(4), December.

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
<word id="word_1">G001</word>
<word id="word_2">Sara</word>
<word id="word_3">,</word>
<word id="word_4">allora</word>
<word id="word_5">c</word>
<word id="word_6">hai</word>
<word id="word_7">sulla</word>
<word id="word_8">tua</word>
<word id="word_9">sinistra</word>
<word id="word_10">-</word>
<word id="word_11">una</word>
<word id="word_12">figura</word>
....
</words>

<?xml version="1.0"?>
<markables>
<markable id="markable_1" span="word_25" type="verbalspec" np_form="clitic"/>
...
<markable id="markable_297" span="word_11..word_18" type="verbalspec"
np_form="none" function="deictic" soggetto_vuoto="none" member="set_1"
is_anaphoric="none" is_bridging="none" anaphoric_space="ana_both"
bridging_space="bridging_both"/>

Figure 1: Standoff annotation in MMAX: Words and Markables
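
The standoff layout in Figure 1 can be read back with a few lines of code. The following is a minimal Python sketch, not the MMAX implementation: it assumes that a span takes only the two forms visible in the figure (a single word id, or first..last), and that word ids are numbered consecutively. The words are those shown in Figure 1; the two markables (markable_a, markable_b) and their spans are hypothetical, since the figure elides most of the markables file.

```python
import xml.etree.ElementTree as ET

# Words as shown in Figure 1 (XML declaration and DOCTYPE omitted for brevity).
WORDS_XML = """<words>
<word id="word_1">G001</word>
<word id="word_2">Sara</word>
<word id="word_3">,</word>
<word id="word_4">allora</word>
<word id="word_5">c</word>
<word id="word_6">hai</word>
<word id="word_7">sulla</word>
<word id="word_8">tua</word>
<word id="word_9">sinistra</word>
<word id="word_10">-</word>
<word id="word_11">una</word>
<word id="word_12">figura</word>
</words>"""

# Hypothetical markables: ids and spans are invented for illustration.
MARKABLES_XML = """<markables>
<markable id="markable_a" span="word_2"/>
<markable id="markable_b" span="word_11..word_12"/>
</markables>"""


def load_words(xml_text):
    """Map each word id to its token text."""
    root = ET.fromstring(xml_text)
    return {w.get("id"): w.text for w in root.iter("word")}


def word_number(word_id):
    """Extract the numeric part of an id such as 'word_11'."""
    return int(word_id.rsplit("_", 1)[1])


def resolve_span(span, words):
    """Return the tokens covered by a span of the form 'word_N' or 'word_N..word_M'."""
    first, _, last = span.partition("..")
    last = last or first  # a single-word span has no '..' part
    return [words[f"word_{i}"]
            for i in range(word_number(first), word_number(last) + 1)]


def resolve_markables(xml_text, words):
    """Map each markable id to the list of tokens its span points at."""
    root = ET.fromstring(xml_text)
    return {m.get("id"): resolve_span(m.get("span"), words)
            for m in root.iter("markable")}


words = load_words(WORDS_XML)
resolved = resolve_markables(MARKABLES_XML, words)
print(resolved)
```

Because the markables file only stores word ids, the base text never has to be modified when attributes such as soggetto_vuoto or anaphoric_space are added or revised.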

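The Ambiguity paragraph of Section 6 observes that a coreference chain with ambiguous links becomes a directed graph in which each path is a potential interpretation. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions rather than part of the VENEX tooling: links are given as (anaphor, antecedent, relation) triples, ne1, ne2 and ne3 mirror example (5), and ne0 is a hypothetical earlier mention added so that one path has more than one step.

```python
def interpretations(links, start):
    """Enumerate every path from `start` back through its possible antecedents.

    `links` is a list of (anaphor, antecedent, relation) triples.
    Each returned path is a list of such triples, read from `start` backwards.
    """
    graph = {}
    for anaphor, antecedent, rel in links:
        graph.setdefault(anaphor, []).append((antecedent, rel))

    def walk(node, visited):
        # Skip antecedents already on the current path to guard against cycles.
        outgoing = [(a, r) for a, r in graph.get(node, []) if a not in visited]
        if not outgoing:
            return [[]]  # end of a chain: one (empty) continuation
        paths = []
        for ante, rel in outgoing:
            for tail in walk(ante, visited | {ante}):
                paths.append([(node, rel, ante)] + tail)
        return paths

    return walk(start, {start})


# ne1 is ambiguous between IDENT with ne2 and ELEMENT of ne3, as in example (5);
# the link from ne2 to ne0 is hypothetical.
links = [
    ("ne1", "ne2", "ident"),
    ("ne1", "ne3", "element"),
    ("ne2", "ne0", "ident"),
]

paths = interpretations(links, "ne1")
for path in paths:
    print(path)
```

Under this representation a resolver's answer can be accepted if it matches any one of the enumerated paths, whereas a consistency script has to check every path separately, which is the difficulty the paragraph describes.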