
A. Tanács, D. Csendes, V. Vincze, Ch. Fellbaum, P. Vossen (Eds.)<br />

GWC 2008<br />

The Fourth Global WordNet Conference,<br />

Szeged, Hungary, January 22-25, 2008<br />

Proceedings


Volume editors<br />

Attila Tanács (HU)<br />

University of Szeged, Department of Informatics<br />

H-6720 Szeged, Árpád tér 2.<br />

e-mail: tanacs@inf.u-szeged.hu<br />

Dóra Csendes (HU)<br />

University of Szeged, Department of Informatics<br />

H-6720 Szeged, Árpád tér 2.<br />

e-mail: dcsendes@inf.u-szeged.hu<br />

Veronika Vincze (HU)<br />

University of Szeged, Research Group on Artificial Intelligence<br />

H-6720 Szeged, Árpád tér 2.<br />

e-mail: vinczev@inf.u-szeged.hu<br />

Christiane Fellbaum (USA)<br />

Princeton University, Department of Psychology<br />

Princeton, New Jersey 08544<br />

e-mail: fellbaum@princeton.edu<br />

Piek Vossen (NL)<br />

Irion Technologies BV<br />

Herensingel 168, Weesp, 1382 VV<br />

e-mail: vossen@irion.nl<br />

Copyright information<br />

ISBN 978-963-482-854-9<br />

© University of Szeged, Department of Informatics, 2007<br />

This work is subject to copyright. All rights are reserved, whether the whole or part of<br />

the material is concerned, specifically the rights of reprinting, recitation, translation,<br />

re-use of illustrations, reproduction in any form and storage in data banks.<br />

Typesetting<br />

Camera-ready by Attila Tanács, Dóra Csendes and Veronika Vincze from source files<br />

provided by authors.<br />

Data conversion by Attila Tanács.<br />

Printed at Juhász Press Ltd.<br />

H-6771 Szeged, Makai út. 4.


Preface<br />

We are very pleased to hold the Fourth Global WordNet Conference in Szeged,<br />

Hungary, following our tradition of alternating the meeting locations between<br />

different parts of the world.<br />

The program includes 45 paper presentations and demos, two invited talks (Hitoshi<br />

Isahara, Adam Kilgarriff) and two topical panels. We received fewer submissions<br />

than in previous years; rather than reflecting a decrease in WordNet-related research,<br />

this probably indicates increased "competition" with the many other conferences and<br />

workshops on lexical resources, computational linguistics and Natural Language<br />

Processing where work on WordNets is increasingly featured.<br />

We are excited about several new WordNets whose creation is reported here:<br />

Croatian, Polish and South African languages. The language of the host country is<br />

highlighted with several papers on Hungarian WordNet.<br />

We counted participants from 26 countries in Europe, Asia, Africa, the Near East<br />

and the US. Among the authors are many old WordNetters as well as new colleagues,<br />

some from countries as far away as Oman and South Africa.<br />

The presentations cover a wide range of topics, including manual and automatic<br />

WordNet construction for general and specific domains, lexicography, software tools,<br />

ontology, linguistics, applications and evaluation. As in previous meetings, we expect<br />

lively discussions and exchanges that plant the seeds for new ideas and future<br />

collaborations.<br />

Our thanks go to the Programme Committee, who provided thoughtful and fair<br />

reviews in a timely fashion.<br />

Christiane Fellbaum, Piek Vossen (for the Global WordNet Organization)<br />

János Csirik, Dóra Csendes (for the Local Organizers)<br />

November, 2007


Organisation<br />

The Fourth Global WordNet Conference is organised by the University of Szeged,<br />

Department of Informatics in co-operation with the Global WordNet Association.<br />

The conference home page can be found at http://www.inf.u-szeged.hu/gwc2008.<br />

Programme Committee<br />

Eneko Agirre (San Sebastian, Spain), Zoltan Alexin (Szeged, Hungary), Antonietta<br />

Alonge (Perugia, Italy), Pushpak Bhattacharyya (Mumbai, India), Bill Black<br />

(Manchester, UK), Jordan Boyd-Graber (Princeton, US), Nicoletta Calzolari (Pisa,<br />

Italy), Key-Sun Choi (Seoul, Korea), Salvador Climent (Barcelona, Spain), Dan<br />

Cristea (Iasi, Romania), Janos Csirik (Szeged, Hungary), Andras Csomai (Szeged,<br />

Hungary), Tomas Erjavec (Ljubljana, Slovenia), Christiane Fellbaum (Princeton, US),<br />

Julio Gonzalo (Madrid, Spain), Ales Horak (Brno, Czech Republic), Chu-Ren Huang<br />

(Taipei, Republic of China), Hitoshi Isahara (Kyoto, Japan), Neemi Kahusk (Tartu,<br />

Estonia), Kyoko Kanzaki (Kyoto, Japan), Adam Kilgarriff (Brighton, UK), Claudia<br />

Kunze (Tuebingen, Germany), Birte Loenneker (Berkeley, US/Hamburg, Germany),<br />

Bernado Magnini (Trento, Italy), Palmira Marrafa (Lisbon, Portugal), Rada Mihalcea<br />

(Texas, US), Adam Pease (San Francisco, US), Karel Pala (Brno, Czech Republic),<br />

Ted Pedersen (Minneapolis, US), Bolette Pedersen (Copenhagen, Denmark),<br />

Emanuele Pianta (Trento, Italy), Eli Pociello (San Sebastian, Spain), German Rigau<br />

(San Sebastian, Spain), Horacio Rodriguez (Barcelona, Spain), Virach<br />

Sornlertlamvanich (Pathumthani, Thailand), Sofia Stamou (Patras, Greece), Dan Tufis<br />

(Bucarest, Romania), Tony Veale (Dublin, Ireland), Kadri Vider (Tartu, Estonia),<br />

Piek Vossen (Amsterdam, Netherlands)<br />

Organisation Committee<br />

János Csirik (Chair)<br />

Dóra Csendes (Secretary)<br />

Attila Tanács, Dóra Csendes, Veronika Vincze (Proceedings)<br />

Veronika Vincze, Attila Almási, Róbert Ormándi (Helpers)<br />

Christiane Fellbaum, Piek Vossen (Co-organisers)


Table of Contents<br />

Papers<br />

Consistent Annotation of EuroWordNet with the Top Concept Ontology ................... 3<br />

Javier Álvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Antoni Oliver,<br />

German Rigau<br />

SemanticNet: a WordNet-based Tool for the Navigation of Semantic Information... 21<br />

Manuela Angioni, Roberto Demontis, Massimo Deriu, Franco Tuveri<br />

Verification of Valency Frame Structures by Means of Automatic Context<br />

Clustering in RussNet................................................................................................. 35<br />

Irina V.Azarova, Anna S. Marina, Anna A. Sinopalnikova<br />

Some Issues in the Construction of a Russian WordNet Grid .................................... 44<br />

Valentina Balkova, Andrey Sukhonogov, Sergey Yablonsky<br />

A Comparison of Feature Norms and WordNet ......................................................... 56<br />

Eduard Barbu, Massimo Poesio<br />

Enhancing WordNets with Morphological Relations: A Case Study from Czech,<br />

English and Zulu......................................................................................................... 74<br />

Sonja Bosch, Christiane Fellbaum, Karel Pala<br />

On the Categorization of Cause and Effect in WordNet............................................. 91<br />

Cristina Butnariu, Tony Veale<br />

Evaluation of Synset Assignment to Bi-lingual Dictionary...................................... 101<br />

Thatsanee Charoenporn, Virach Sornlertlamvanich, Chumpol Mokarat, Hitoshi<br />

Isahara, Hammam Riza, Purev Jaimai<br />

Using and Extending WordNet to Support Question-Answering ............................. 111<br />

Peter Clark, Christiane Fellbaum, Jerry Hobbs<br />

Using GermaNet as a Semantic Resource for the Extraction of Thematic<br />

Structures: Methods and Issues................................................................................. 120<br />

Irene Cramer, Marc Finthammer<br />

On the Utility of Automatically Generated WordNets ............................................. 147<br />

Gerard de Melo, Gerhard Weikum



Words, Concepts and Relations in the Construction of Polish WordNet.................. 162<br />

Magdalena Derwojedowa, Maciej Piasecki, Stanisław Szpakowicz, Magdalena<br />

Zawisławska, Bartosz Broda<br />

Exploring and Navigating: Tools for GermaNet ...................................................... 178<br />

Marc Finthammer, Irene Cramer<br />

Using Multilingual Resources for Building SloWNet Faster ................................... 185<br />

Darja Fišer<br />

The Global WordNet Grid Software Design ............................................................ 194<br />

Aleš Horák, Karel Pala, Adam Rambousek<br />

The Development of a Complex-Structured Lexicon based on WordNet ................ 200<br />

Aleš Horák, Piek Vossen, Adam Rambousek<br />

WordNet-anchored Comparison of Chinese-Japanese Kanji Word.......................... 209<br />

Chu-Ren Huang, Chiyo Hotani, Tzu-Yi Kuo, I-Li Su, Shu-Kai Hsieh<br />

Paranymy: Enriching Ontological Knowledge in WordNets.................................... 220<br />

Chu-Ren Huang, Pei-Yi Hsiao, I-Li Su, Xiu-Ling Ke<br />

Proposing Methods of Improving Word Sense Disambiguation for Estonian.......... 229<br />

Kadri Kerner<br />

Morpho-semantic Relations in WordNet – a Case Study for two Slavic Languages 239<br />

Svetla Koeva, Cvetana Krstev, Duško Vitas<br />

Language Independent and Language Dependent Innovations in the Hungarian<br />

WordNet ................................................................................................................... 254<br />

Judit Kuti, Károly Varasdi, Ágnes Gyarmati, Péter Vajda<br />

Introducing the African Languages WordNet........................................................... 269<br />

Jurie le Roux, Koliswa Moropa, Sonja Bosch, Christiane Fellbaum<br />

Towards an Integrated OWL Model for Domain-Specific and General Language<br />

WordNets.................................................................................................................. 281<br />

Harald Lüngen, Claudia Kunze, Lothar Lemnitzer, Angelika Storrer<br />

The Possible Effects of Persian Light Verb Constructions on Persian WordNet ..... 297<br />

Niloofar Mansoory, Mahmood Bijankhan<br />

Towards a Morphodynamic WordNet of the Lexical Meaning ................................ 304<br />

Nazaire Mbame



Methods and Results of the Hungarian WordNet Project......................................... 311<br />

Márton Miháltz, Csaba Hatvani, Judit Kuti, György Szarvas, János Csirik,<br />

Gábor Prószéky, Tamás Váradi<br />

Synset Based Multilingual Dictionary: Insights, Applications and Challenges........ 321<br />

Rajat Kumar Mohanty, Pushpak Bhattacharyya, Shraddha Kalele, Prabhakar<br />

Pandey, Aditya Sharma, Mitesh Kopra<br />

Estonian WordNet: Nowadays.................................................................................. 334<br />

Heili Orav, Kadri Vider, Neeme Kahusk, Sirli Parm<br />

Event Hierarchies in DanNet .................................................................................... 339<br />

Bolette Sandford Pedersen, Sanni Nimb<br />

Building Croatian WordNet...................................................................................... 349<br />

Ida Raffaelli, Marko Tadić, Božo Bekavac, Željko Agić<br />

Towards Automatic Evaluation of WordNet Synsets ............................................... 360<br />

J. Ramanand, Pushpak Bhattacharyya<br />

Lexical Enrichment of a Human Anatomy Ontology using WordNet...................... 375<br />

Nils Reiter, Paul Buitelaar<br />

Arabic WordNet: Current State and Future Extensions............................................ 387<br />

Horacio Rodríguez, David Farwell, Javi Farreres, Manuel Bertran, Musa<br />

Alkhalifa, M. Antonia Martí, William Black, Sabri Elkateb, James Kirk, Adam<br />

Pease, Piek Vossen, Christiane Fellbaum<br />

Building a WordNet for Persian Verbs..................................................................... 406<br />

Masoud Rouhizadeh, Mehrnoush Shamsfard, Mahsa A. Yarmohammadi<br />

Developing FarsNet: A Lexical Ontology for Persian.............................................. 413<br />

Mehrnoush Shamsfard<br />

KUI: Self-organizing Multi-lingual WordNet Construction Tool ............................ 419<br />

Virach Sornlertlamvanich, Thatsanee Charoenporn, Kergrit Robkop, Hitoshi<br />

Isahara<br />

Extraction of Selectional Preferences for French using a Mapping from<br />

EuroWordNet to the Suggested Upper Merged Ontology ........................................ 428<br />

Dennis Spohr<br />

Romanian WordNet: Current State, New Applications and Prospects ..................... 441<br />

Dan Tufiş, Radu Ion, Luigi Bozianu, Alexandru Ceauşu, Dan Ştefănescu<br />

Enriching WordNet with Folk Knowledge and Stereotypes..................................... 453<br />

Tony Veale, Yanfen Hao



Comparing WordNet Relations to Lexical Functions............................................... 462<br />

Veronika Vincze, Attila Almási, Dóra Szauter<br />

KYOTO: A System for Mining, Structuring, and Distributing Knowledge Across<br />

Languages and Cultures............................................................................................ 474<br />

Piek Vossen, Eneko Agirre, Nicoletta Calzolari, Christiane Fellbaum, Shu-Kai<br />

Hsieh, Chu-Ren Huang, Hitoshi Isahara, Kyoko Kanzaki, Andrea Marchetti,<br />

Monica Monachini, Federico Neri, Remo Raffaelli, German Rigau, Maurizio<br />

Tesconi, Joop VanGent<br />

The Cornetto Database: Architecture and Alignment Issues of Combining Lexical<br />

Units, Synsets and an Ontology................................................................................ 485<br />

Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Hetty van Zutphen<br />

CWN-Viz : Semantic Relation Visualization in Chinese WordNet.......................... 506<br />

Ming-Wei Xu, Jia-Fei Hong, Shu-Kai Hsieh, Chu-Ren Huang<br />

Using WordNet in Extracting the Final Answer from Retrieved Documents in a<br />

Question Answering System..................................................................................... 520<br />

Mahsa A. Yarmohammadi, Mehrnoush Shamsfard, Mahshid A. Yarmohammadi,<br />

Masoud Rouhizadeh<br />

Towards the Construction of a Comprehensive Arabic WordNet ............................ 531<br />

Hamza Zidoum<br />

Author List .............................................................................................................. 545


Papers


Consistent Annotation of EuroWordNet<br />

with the Top Concept Ontology<br />

Javier Álvez 1, Jordi Atserias 2, Jordi Carrera 3, Salvador Climent 3,<br />

Antoni Oliver 3, and German Rigau 1<br />

1 Basque Country University<br />

2 Web Research Group - Universitat Pompeu Fabra<br />

3 Open University of Catalonia<br />

jibalgij@si.ehu.es, jordi.atserias@upf.edu, jcarrerav@uoc.edu, scliment@uoc.edu,<br />

aoliverg@uoc.edu, german.rigau@ehu.es<br />

Abstract. This paper presents the complete and consistent annotation of the<br />

nominal part of EuroWordNet (EWN). The annotation has been carried out<br />

using the semantic features defined in the EWN Top Concept Ontology. Until<br />

now, only an initial core set of 1024 synsets, the so-called Base Concepts, had been<br />

ontologized in this way.<br />

1 Introduction<br />

Componential semantics has a long tradition in Linguistics since the work of<br />

poststructuralists such as Hjelmslev in the thirties [cf. [1]] or [2] among generativists. There is<br />

common agreement that this kind of lexical-semantic information can be extremely<br />

valuable for making complex linguistic decisions. Nevertheless, according to [1],<br />

componential analysis cannot actually be achieved, for three main reasons (the first<br />

being the most important): (1) the vocabulary of a language is too large, (2) each<br />

word needs several features for its semantics to be adequately represented, and (3)<br />

semantic features should be organized in several levels.<br />

Our work provides a good solution to these problems, since 65,989 noun concepts<br />

from WordNet 1.6 (WN16) [3], corresponding to 116,364 noun lexemes (variants),<br />

have been consistently annotated with an average of 6.47 features per synset, with<br />

those features organized in a multilevel hierarchy. It might therefore allow<br />

componential semantics to be tested and applied in real-world situations, probably for<br />

the first time, thus contributing to a wide number of NLP tasks involving semantic<br />

processing: Word Sense Disambiguation, Syntactic Parsing using selectional<br />

restrictions, Semantic Parsing or Reasoning.<br />

Despite its wide scope, the work presented here is envisaged to be the first stage of<br />

an incremental and iterative process, as we do not assume that the current version of<br />

the EWN Top Concept Ontology (TCO) covers the optimal set of features for the<br />

aforementioned tasks. Currently, a second phase has started within the framework of



the KNOW Project 1 in which the first version of the enriched lexicon is being used to<br />

label a corpus. We plan to use later this annotation for abstracting the semantic<br />

properties of verbs occurring in the corpus. This will lead, presumably, to a<br />

reformulation of the TCO, through addition, deletion or reorganisation of features.<br />

This paper is organized as follows. After a brief summary of the state of the art<br />

(section §2), we present our methodology for annotating the nominal part of EWN<br />

(section §3). Then, we present a qualitative analysis through some relevant<br />

examples (section §4). Section §5 summarizes a quantitative analysis and, finally,<br />

section §6 provides some concluding remarks.<br />

2 Previous Work and State of the Art<br />

2.1 The EuroWordNet Top Ontology<br />

The EWN TCO was not primarily designed to be used as a repository of lexical<br />

semantic information, but for clustering, comparing and exchanging concepts across<br />

languages in the EWN Project. Nevertheless, most of its semantic features (e.g.<br />

Human, Object, Instrument, etc.) have a long tradition in theoretical lexical semantics<br />

and have been postulated as semantic components of meanings. We will only describe<br />

here some of its major characteristics (see [4] for further details).<br />

The EWN TCO (Fig. 1) consists of 63 features and is primarily organized<br />

following [5]. Correspondingly, its root level is structured in three disjoint types of<br />

entities:<br />

- 1stOrderEntity (physical things, e.g.: vehicle, animal, substance, object)<br />

- 2ndOrderEntity (situations, e.g.: happen, be, begin, cause, continue, occur)<br />

- 3rdOrderEntity (unobservable entities e.g.: idea, information, theory, plan)<br />

1stOrderEntities are further distinguished in terms of four main ways of<br />

conceptualizing or classifying concrete entities:<br />

- Form: as an amorphous substance or as an object with a fixed shape<br />

(Substance or Object)<br />

- Composition: as a group of self-contained wholes or as a necessary part of a<br />

whole, hence the subdivisions Group and Part.<br />

- Origin: the way in which an entity has come about (Artifact or Natural).<br />

- Function: the typical activity or action that is associated with an entity<br />

(Comestible, Furniture, Instrument, etc.)<br />

These main features are then further subdivided. These classes are comparable to<br />

the Qualia roles as described in [6] and are based on empirical findings raised during<br />

the development of the EWN project, when the classification of the Base Concepts<br />

(BCs) was undertaken. Concepts can be classified in terms of any combination of<br />

these four roles. As such, these top concepts function more as features than as<br />

ontological classes.<br />

1 KNOW. Developing large-scale multilingual technologies for language understanding. Ministerio de Educación y Ciencia. TIN2006-15049-C03-02.



Although the main classes are intended for cross-classification, most of the<br />

subdivisions are disjoint classes: a concept cannot be both an Object and a Substance,<br />

or Natural and Artifact. As explained below, feature disjunction will play an<br />

important role in our methodology.<br />

2ndOrderEntities can be lexicalized as nouns and verbs (as well as adjectives and<br />

adverbs) denoting static or dynamic situations, such as birth, live, life, love, die and<br />

death. All 2ndOrderEntities are classified using two different classification schemes:<br />

- SituationType: the event-structure in terms of which a situation can be<br />

characterized as a conceptual unit over time<br />

- SituationComponent: the most salient semantic component(s) that<br />

characterize(s) a situation<br />

SituationType represents a basic classification in terms of the event-structure (in<br />

the formal tradition) or the predicate-inherent Aktionsart properties of nouns and<br />

verbs, as described for instance in [7]. SituationTypes can be Static or Dynamic,<br />

further subdivided in Property and Relation on the one side and UnboundedEvent and<br />

BoundedEvent on the other.<br />

SituationComponents (e.g. Location, Existence, Cause, Mental, Purpose) emerged<br />

empirically when selecting verbal and deverbal Base Concepts in EWN. They<br />

resemble the cognitive components that play a role in the conceptual structure of<br />

events, as described in [8] and others. In fact, much in the same way as Function did<br />

for 1stOrderEntities, they are good candidates for encoding important semantic<br />

properties of words denoting situations.<br />

Typically, SituationType represents disjoint features that cannot be combined,<br />

whereas it is possible to assign any range or combination of SituationComponents to a<br />

word meaning. Each 2ndOrderEntity meaning can thus be classified in terms of an<br />

obligatory but unique SituationType and any number of SituationComponents.<br />
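This combination rule lends itself to a simple mechanical check. The following is a minimal sketch (the helper is hypothetical, not from the paper; feature names are copied from the lists in this section, and the intermediate Dynamic/Static nodes are omitted):

```python
# Hypothetical sketch: checking the combination rule for 2ndOrderEntity
# annotations -- exactly one SituationType, any number of SituationComponents.
SITUATION_TYPES = {"BoundedEvent", "UnboundedEvent", "Property", "Relation"}
SITUATION_COMPONENTS = {
    "Cause", "Agentive", "Phenomenal", "Stimulating", "Communication",
    "Condition", "Existence", "Experience", "Location", "Manner", "Mental",
    "Modal", "Physical", "Possession", "Purpose", "Quantity", "Social",
    "Time", "Usage",
}

def is_valid_2nd_order(features):
    """True iff the feature set has exactly one SituationType and
    only SituationComponents besides it."""
    feats = set(features)
    types = feats & SITUATION_TYPES
    rest = feats - SITUATION_TYPES
    return len(types) == 1 and rest <= SITUATION_COMPONENTS

# e.g. "die": a BoundedEvent involving Existence
print(is_valid_2nd_order({"BoundedEvent", "Existence"}))       # True
print(is_valid_2nd_order({"BoundedEvent", "UnboundedEvent"}))  # False
```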

Finally, 3rdOrderEntity was not further subdivided, since there appeared to be a<br />

limited number of BCs of this kind in EWN.<br />

The TCO has been redesigned twice, first by the EAGLES expert group [9] and<br />

then by [10]. EAGLES expanded the original ontology by adding 74 concepts while<br />

the latter made it more flexible, allowing, for instance, cross-classification of<br />

features between the three orders of entities.<br />

Moreover, the Global Wordnet Association [11] recently distributed a taxonomy<br />

consisting of 71 so-called Base Types which can be seen as semantic primitives or<br />

taxonomic tops playing a key role in large-scale semantic networks. The Base Types<br />

have been derived by refining the original set of BCs. They are connected to both<br />

EWN synsets and TCO features, and represent an important synthesis effort in order<br />

to achieve a more elegant and economic modelling of the TCO.



1stOrderEntity<br />

- Form: Object; Substance (Gas, Liquid, Solid)<br />

- Composition: Group; Part<br />

- Origin: Artifact; Natural (Living: Animal, Creature, Human, Plant)<br />

- Function: Building, Comestible, Container, Covering, Furniture, Garment, Instrument, Occupation, Place, Representation (ImageRepresentation, LanguageRepresentation, MoneyRepresentation), Software, Vehicle<br />

2ndOrderEntity<br />

- SituationType: Dynamic (BoundedEvent, UnboundedEvent); Static (Property, Relation)<br />

- SituationComponent: Cause (Agentive, Phenomenal, Stimulating), Communication, Condition, Existence, Experience, Location, Manner, Mental, Modal, Physical, Possession, Purpose, Quantity, Social, Time, Usage<br />

3rdOrderEntity<br />

Fig. 1. The EWN Top Concept Ontology



2.2 Ontological information in the Multilingual Central Repository<br />

In the framework of the UE-funded MEANING project [12] a Multilingual Central<br />

Repository 2 (MCR) was designed and implemented in order to act as a multilingual<br />

interface for integrating and distributing lexical-semantic knowledge [13]. The MCR<br />

follows the model proposed by the EuroWordNet project (EWN) [14], i.e. a<br />

multilingual lexical database with WordNets for several languages. It includes<br />

WordNets for English, Spanish, Italian, Catalan and Basque.<br />

The EWN architecture includes the Inter-Lingual-Index (ILI), which is a list of<br />

records that interconnect synsets across WordNets. Using the ILI, it is possible to go<br />

from word meanings in one language or particular WordNet to their equivalents in<br />

other languages or WordNets. The current version of the MCR uses the set of<br />

Princeton WordNet 1.6 synsets as ILI.<br />

In the MCR, the ILI is connected to three separate ontologies: the EWN TCO<br />

(described above), the Domain Ontology (DO) [15] and the Suggested Upper Merged<br />

Ontology (SUMO) [16]. The DO is a hierarchy of 165 domain labels, which are<br />

knowledge structures grouping meanings in terms of topics or scripts, e.g. Transport,<br />

Sports, Medicine, Gastronomy. SUMO incorporates previous ontologies and insights<br />

by Sowa, Peirce, Russell and Norvig, and others and, compared to the EWN TCO, is<br />

much larger and deeper. The WN-SUMO mapping [17] assigns only one SUMO<br />

category to every WN16 synset (SUMO being a large formal ontology), while the<br />

EWN TCO, as explained above, assigns a combination of a smaller number of<br />

categories. This makes the TCO much more suitable than SUMO for<br />

implementing componential semantics. While all of the ILI is connected to the DO<br />

and to SUMO, only 1024 ILI-Records were connected to the TCO, i.e. those that<br />

were selected as BCs in the EWN project.<br />

2.3 Lexical Semantics for Robust NLP<br />

Some NLP systems, such as knowledge-based Machine Translation systems, usually<br />

include some kind of decision making (e.g. transfer module, PP-attachment) using<br />

lexical semantic features such as Human, Animate, Event, Path, Manner, etc. [18]. Their<br />

use, however, is restricted to demo systems, e.g. [19] or, in real-world systems, to a<br />

limited number of lexical entries and/or to a very small number of semantic<br />

features, due to the difficulty of annotating a comprehensive lexicon with an<br />

exhaustive set of features.<br />

However, WordNets are large, freely available lexical resources widely used<br />

by the NLP community. Currently, they serve a wide range of tasks involving some<br />

degree of semantic processing. In most of these tasks, WordNets are used to<br />

generalize or abstract a set of synsets to a subsuming one by following the WordNet<br />

hierarchy up. The main problem is finding the right level of generalization; that is,<br />

finding the concept which optimally subsumes a given set of concepts; but it could be<br />

the case that the class which would optimally capture the generalization is not lexical,<br />

but abstract, thus having to be represented through features. It can also be the case<br />

2 http://adimen.si.ehu.es/cgi-bin/wei5/public/wei.consult.perl



that WordNet simply is not the kind of taxonomy required, a fact which can be due to<br />

several reasons: incompleteness, incorrect structuring, or perhaps that its structuring<br />

should be arranged differently for a particular NLP task.<br />

Bearing these drawbacks in mind, some authors have turned to using the ontologies<br />

mapped onto WordNet to determine new sets of classes. For instance, [20] and [21]<br />

have already used the MCR including SUMO, DO, WN16 Semantic<br />

(Lexicographer’s) Files and a preliminary rough expansion of the TCO for Word<br />

Sense Disambiguation.<br />

For many tasks, using a feature-annotated lexicon seems more<br />

appropriate than using the WordNet tree-structure, since (i) the WordNet hierarchy is<br />

not consistently structured [22] and (ii) a feature-annotated lexicon makes it possible<br />

to make predictions based on measures of similarity even for words that, being sparsely<br />

distributed in WordNet, can only be generalized by reaching common hypernyms at<br />

levels too high in the hierarchy. Besides, a multiple-feature design naturally<br />

depicts semantically complex concepts, such as so-called dot-objects [6], e.g.,<br />

intrinsically polysemic words such as "letter", since a letter is something that can both<br />

be destroyed and carry information (as in "I burnt your love letter"). These aspects of<br />

meaning can be easily coded using the EWN TCO, as shown in (1):<br />
(1) “letter”: FUNCTION: LanguageRepresentation<br />

FORM: Object<br />

In this direction, [23] uses a lexicon augmented with EWN TCO features both to<br />

implement selectional restrictions to limit the search space when parsing and to<br />

perform type-coercion in a dialogue system.<br />

3 Methodology<br />

Our methodology for annotating the ILI with the TCO 3 is based on the common<br />

assumption that hyponymy corresponds to feature set inclusion [24, p. 8] and on the<br />

observation that, since WordNets are taken to be crucially structured by hyponymy,<br />

"(…) by augmenting important hierarchy nodes with basic semantic features, it is<br />

possible to create a rich semantic lexicon in a consistent and cost-effective way after<br />

inheriting these features through the hyponymy relations" [9, pp. 204-205].<br />

Nevertheless, performing such an operation is not straightforward, as WordNets are<br />

not consistently structured by hyponymy [22]. Moreover, WordNets allow multiple<br />

inheritance. For our methodology, these are both drawbacks to overcome and<br />

situations to take advantage of. As noted above, within the EWN project, a limited set of<br />

lexical base concepts 4 (the BCs) was annotated with TCO features. Despite being<br />

3 We use WN16 since the ILI is drawn up on this version of WordNet.<br />

4 Base Concepts (BCs) should not be confused with Basic Level Concepts (BLCs) as defined<br />

by [26], but in future work BCs can be taken as a starting set for defining BLCs. Since<br />

BLCs are supposed to be richer in distinctive features and the most psychologically salient<br />

lexical categories, they can also be relevant for advanced NLP tasks.



largely general in meaning, this set did not cover all of the upper level nodes in the<br />

WordNets. This was clearly a drawback for expanding features down all of WN1.6,<br />

thus the first step of our work consisted of annotating the gaps up the hierarchy, from<br />

the BCs to the unique beginners. This was done semiautomatically: given that every<br />

synset in WN1.6 originally belongs to a so-called Semantic File (a flat list of 45<br />

lexicographer files), those synsets were assigned a TCO feature via a table of<br />

expected equivalence between TCO nodes and Semantic Files.<br />

This made WN1.6 ready to be fully populated with at least one feature per<br />

synset. Nevertheless, in many cases, synsets got more than one feature, for one or<br />

more of the following reasons:<br />

- They are BCs, so they were manually annotated with more than one<br />

feature<br />

- In addition to their own manual annotation, they inherit features from<br />

one or more hypernyms<br />

- They inherit features from different hypernyms, located either at<br />
different levels in a single line of the hierarchy or on separate branches through<br />
multiple inheritance<br />

An initial rough expansion was the first ground for revision and inspection,<br />

following the strategy defined in [25]. The task lasted about three years and<br />
involved several re-expansion cycles.<br />

The manual work was based on TCO feature incompatibilities. It relied on<br />
automatically detecting the co-occurrence, within a synset, of pairs of incompatible features.<br />

The axiomatic incompatibilities are the following:<br />

- 1stOrderEntity - 2ndOrderEntity<br />

- 1stOrderEntity - 3rdOrderEntity<br />

- 3rdOrderEntity - 2ndOrderEntity [except for SituationComponent]<br />

- 3rdOrderEntity - Mental 5<br />

- Object - Substance<br />

- Gas - Liquid - Solid<br />

- Artifact - Natural<br />

- Animal - Creature - Human - Plant<br />

- Dynamic - Static<br />

- BoundedEvent - UnboundedEvent<br />

- Property - Relation<br />

- Physical - Mental<br />

- Agentive - Phenomenal - Stimulating<br />
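A minimal sketch of this detection step encodes the incompatibilities above as exclusion groups and reports co-occurring pairs. The SituationComponent exception and the Mental/3rdOrderEntity pair are handled separately in the actual procedure and are omitted here:

```python
from itertools import combinations

# Axiomatic incompatibilities from the paper, stored as exclusion groups:
# no two features drawn from the same group may co-occur on one synset.
INCOMPATIBLE_GROUPS = [
    {"1stOrderEntity", "2ndOrderEntity", "3rdOrderEntity"},
    {"Object", "Substance"},
    {"Gas", "Liquid", "Solid"},
    {"Artifact", "Natural"},
    {"Animal", "Creature", "Human", "Plant"},
    {"Dynamic", "Static"},
    {"BoundedEvent", "UnboundedEvent"},
    {"Property", "Relation"},
    {"Physical", "Mental"},
    {"Agentive", "Phenomenal", "Stimulating"},
]

def conflicts(features):
    """Return the pairs of incompatible features co-occurring on one synset."""
    found = []
    for group in INCOMPATIBLE_GROUPS:
        hits = sorted(set(features) & group)
        found.extend(combinations(hits, 2))
    return found

print(conflicts({"Artifact", "Natural", "Object"}))  # [('Artifact', 'Natural')]
```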

The first rough expansion described above caused the following number of feature<br />

conflicts:<br />

5 The incompatibility between 3rdOrderEntity and Mental, and the compatibility between<br />
3rdOrderEntity and SituationComponent, are explained below.


10 Javier Álvez et al.<br />

- 214 feature conflicts in 49 synsets caused by incompatible hand annotation<br />

- 2247 feature conflicts in 743 synsets caused by hand annotation<br />

incompatible with inherited features<br />

- 225,447 feature conflicts in 26,166 synsets caused by incompatibility<br />

between inherited features<br />

The first type of conflict usually indicates synsets that raised ontological doubts for<br />
annotators within the EWN project (e.g. is “skin” an object or a substance?). The third<br />
type usually reveals errors in the WordNet structure (i.e. ISA overloading [22]). The<br />
second type might be caused by either or both reasons.<br />

The task consisted of manually checking feature incompatibilities in order to (i)<br />
add or delete ontological features, and (ii) set inheritance blockage points. A<br />
blockage point is an annotation in WN1.6 which breaks the ISA relation between two<br />
synsets, so that no information can be passed through it by inheritance.<br />
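The expansion with blockage points can be sketched as follows, anticipating the Bandung/Java case discussed in the paper's examples; the function and variable names are our own:

```python
# Sketch of top-down feature expansion with blockage points. `hypernyms`
# maps a synset to its list of hypernyms; `blocked` holds (hyponym, hypernym)
# ISA links through which no feature may be inherited.
def inherited_features(synset, hypernyms, manual, blocked, seen=None):
    seen = set() if seen is None else seen
    feats = set(manual.get(synset, ()))
    for hyper in hypernyms.get(synset, ()):
        if (synset, hyper) in blocked or hyper in seen:
            continue  # blockage point (or already visited): stop inheritance here
        seen.add(hyper)
        feats |= inherited_features(hyper, hypernyms, manual, blocked, seen)
    return feats

hypernyms = {"Bandung_1": ["Java_1", "city_1"], "Java_1": ["island_1"]}
manual = {"island_1": {"Natural"}, "city_1": {"Artifact"}}

# Before blocking: Bandung_1 inherits the incompatible pair Artifact/Natural.
print(sorted(inherited_features("Bandung_1", hypernyms, manual, set())))
# ['Artifact', 'Natural']

# After blocking the wrong ISA link Bandung_1 -> Java_1, only Artifact remains.
print(sorted(inherited_features("Bandung_1", hypernyms, manual,
                                {("Bandung_1", "Java_1")})))
# ['Artifact']
```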

When a case of feature incompatibility occurred, the synset involved, together with<br />

its structural surroundings (hypernyms, hyponyms), was analyzed. If the problem was<br />

due to a WN1.6 subsumption error, the corresponding link was blocked and the synsets<br />
below the blockage point were annotated with new TCO features.<br />

Changes in the annotation were made and blockage points were set until all<br />

conflicts were resolved. Then a second re-expansion of TCO features was launched<br />

which resulted in a new (smaller) number of conflicts. Following this iterative and<br />

incremental approach, inheritance was re-calculated and the resulting data was<br />

re-examined several times. Although such hand-checking is extremely complex and<br />

laborious, and despite the large number of conflicts to solve, the task ended up being<br />

feasible because working on the topmost origin of a feature conflict fixes<br />

many levels of hyponyms. For instance, leaf_1, “the main organ of photosynthesis<br />

and transpiration in higher plants”, is a synset that subcategorizes 66 kinds of leaves.<br />

It was originally categorized as Substance but, being in that sense a bounded entity, it<br />
clearly could not be assigned that TCO label. Therefore, fixing this case<br />

resulted in fixing as many as 66 conflicts downward with a single action.<br />

The task was carried out using application interfaces, which allowed access to<br />

the synsets and their glosses in three languages at the same time: English, Spanish and<br />

Catalan. The information that was relied on in order to make decisions was of the<br />

following kinds:<br />

- Relational information regarding every synset and neighboring ones; i.e. the<br />

WN1.6 structure<br />

- The nature of the feature conflict (any of the three types of incompatibility<br />
mentioned above)<br />

- Synsets' glosses as provided by EWN<br />

- Glosses, descriptions and examples of the TCO features as provided in [4]<br />

- Usual word-substitution tests that acknowledge hyponymy, as in [27, pp. 88-<br />

92]<br />

The task finished when a re-expansion of properties finally produced no new<br />

conflicts. Then, two final steps were applied. First, as the TCO is itself a hierarchy,<br />

for every synset, its resulting annotation was expanded up-feature; e.g. if a synset



bore the feature Animal, it was also labelled Living, Natural, Origin and<br />
1stOrderEntity. Second, the whole noun hierarchy was then checked for consistency<br />

using several formal Theorem Provers like Vampire [28] and E-prover [29]. This step<br />

resulted in a number of new conflicts which were finally fixed.<br />
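The up-feature step can be sketched as an ancestor closure over the TCO hierarchy; the parent table below reproduces only the chain mentioned in the text (Animal, Living, Natural, Origin, 1stOrderEntity):

```python
# Sketch of the final "up-feature" step: since the TCO is itself a hierarchy,
# the feature set of every synset is closed under TCO ancestors.
TCO_PARENT = {
    "Animal": "Living",
    "Living": "Natural",
    "Natural": "Origin",
    "Origin": "1stOrderEntity",
}

def up_expand(features):
    """Add every TCO ancestor of every feature borne by a synset."""
    closed = set(features)
    stack = list(features)
    while stack:
        parent = TCO_PARENT.get(stack.pop())
        if parent and parent not in closed:
            closed.add(parent)
            stack.append(parent)
    return closed

print(sorted(up_expand({"Animal"})))
# ['1stOrderEntity', 'Animal', 'Living', 'Natural', 'Origin']
```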

This methodology has led to the detection of many more inconsistencies in WordNet,<br />
and much deeper in the hierarchy, than previous approaches (e.g. [30]).<br />

This procedure can be seen as a shallow ontologization of WN1.6. That is,<br />

blocked links are reassigned to the TCO. This constitutes a pragmatic solution to the<br />
difficulty of fully ontologizing WordNets. In this sense, our<br />
work is probably the second to ontologize the whole of WordNet, after the mapping<br />
to SUMO [17]. However, our coding (i) is multiple (SUMO links every synset to<br />
only one label of the ontology) and (ii) is more workable, since it uses a simpler and<br />
more intuitive TCO.<br />

Regarding the completeness of the work, the possibility that some areas of the<br />
WordNet hierarchy have remained unexamined cannot be entirely excluded,<br />
although a very large number of changes have been introduced (more than 13,000<br />
manual interventions). Moreover, it should be noted that, when removing links or<br />
features to fix errors, all hyponymy lines affected by the action were re-examined<br />
and re-annotated in order not to lose information.<br />

4 Examples and qualitative discussion<br />

In this section, some examples of our methodology at work are presented. Hereinafter,<br />
noun synsets are represented by one of their variants enclosed in curly brackets, and<br />
TCO features by their names in italics, capitalized and enclosed in square brackets.<br />
Inherited features are marked ‘+’, while manually assigned features are marked ‘=’.<br />
Indentations stand for ISA relations. The symbol ‘x’, as in ‘-x-’ or ‘-x->’, means that the<br />
relation has been blocked.<br />

4.1 Bandung is not Java but a part of it<br />

A simple but very typical case is the following, in which the conflict results from<br />

multiple inheritance and the incorrect use of hyponymy instead of meronomy in<br />

WN1.6:<br />

{Bandung_1 6 [Artifact+ Natural+]}<br />

---> {Java_1 [Natural+]}<br />

---> {island_1 [Natural+]}<br />

---> {city_1 [Artifact=]}<br />

Clearly, Bandung is a city, but it is not Java (though it is part of Java). This case<br />
is revealed by the incompatibility between Natural and Artifact. It is fixed by<br />

blocking the subsumption link between Bandung_1 and Java_1:<br />

6 A city on the island of Java.



{Bandung_1 [Artifact+]}<br />

-x-> {Java_1 [Natural+]}<br />

---> {island_1 [Natural+]}<br />

---> {city_1 [Artifact=]}<br />

4.2 A drug is a substance<br />

This case is less straightforward but equally representative of malfunctions in the<br />
WN1.6 hierarchy. In WN1.6, {artifact_1} is both glossed as “a man-made object” and<br />
a hyponym of {physical_object_1}. Thus, in EWN it was annotated with the TCO<br />

feature [Object], which stands for bounded physical things. Nevertheless, its hyponym<br />

{drug_1} subsumes substances; therefore, it was annotated in EWN as [Substance]. It<br />

seems clear that the WN1.6 builders wanted to capture the fact that drugs are artificial<br />

compounds (although there indeed exist natural drugs 7 ). But this fact, which is<br />

represented by the ISA relation between {drug_1} and {artifact_1}, is not consistent<br />
with conceptualising {artifact_1} as a physical, bounded object. In our work, feature<br />

expansion revealed the contradiction, since TCO features [Object] and [Substance]<br />

are incompatible:<br />

{artifact_1 [Object=]}<br />

--- {article_2 [Object+]}<br />

--- {antiquity_3 [Object+]}<br />

--- {... [Object+]}<br />

--- {drug_1 [Substance= Object+]}<br />

--- {aborticide_1 [Substance= Object+]}<br />

--- {anesthetic_1 [Substance= Object+]}<br />

--- {... [Substance= Object+]}<br />

In this case, there were two possible solutions: either to underspecify {artifact_1}<br />
for Object and Substance, thus allowing it to subsume both kinds of entities, or to<br />
block the subsumption relation between {drug_1} and {artifact_1}. We chose the<br />
latter solution because {artifact_1} mainly subsumes hundreds of physical objects in<br />
WN1.6. Moreover, this solution is consistent with the glosses and respects the<br />
statement of {artifact_1} as a hyponym of {physical_object_1}. Therefore, it seems<br />
better to treat {drug_1} as an exception than to change the whole structure:<br />

{artifact_1 [Object=]}<br />

--- {article_2 [Object+]}<br />

--- {antiquity_3 [Object+]}<br />

--- {... [Object+]}<br />

-x- {drug_1 [Substance=]}<br />

--- {aborticide_1 [Substance+]}<br />

--- {anesthetic_1 [Substance+]}<br />

--- {... [Substance+]}<br />

7 This fact prevents {drug_1} from being labelled [Artifact]; only some of its hyponyms can<br />
be labelled so.



If we conceptualize the annotation with the TCO not just as simple feature<br />

labelling but as connecting WN1.6 to an upper flat abstract ontology, this solution is<br />

equivalent to chopping off the {drug_1} subtree and linking it to the [Substance] node of<br />

the TCO:<br />

[1stOrderEntity]<br />

--- [Form]<br />

--- [Object]<br />

--- {artifact_1}<br />

--- {article_2}<br />

--- {antiquity_3}<br />

--- {...}<br />

--- [Substance]<br />

--- {drug_1}<br />

--- {aborticide_1}<br />

--- {anesthetic_1}<br />

--- {...}<br />

This vision was termed the shallow ontologization of WordNet in [25].<br />
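Seen this way, a blockage amounts to detaching a subtree from its WordNet hypernym and re-attaching it under a TCO node. A hedged sketch, with data structures and names of our own invention:

```python
# Sketch of "shallow ontologization": when an ISA link is blocked, the
# hyponym subtree is detached from its WordNet hypernym and hung under a
# TCO node instead. The tables below are illustrative, built from the
# drug/artifact case in the text.
tco_children = {"Object": ["artifact_1"], "Substance": []}
wn_hypernym = {"drug_1": "artifact_1", "aborticide_1": "drug_1"}

def reattach(synset, tco_node):
    """Block the synset's WordNet ISA link and hang it under a TCO node."""
    blocked_link = (synset, wn_hypernym.pop(synset, None))
    tco_children.setdefault(tco_node, []).append(synset)
    return blocked_link

print(reattach("drug_1", "Substance"))  # ('drug_1', 'artifact_1')
print(tco_children["Substance"])        # ['drug_1']
```

The synset's own hyponyms (here {aborticide_1}) stay attached below it, so the whole subtree moves with a single operation.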

4.3 The Statue of Liberty<br />

In this section, a complete case is described showing how one single feature<br />

conflicting in the bottom of the hierarchy reveals a chain of inconsistencies up to the<br />

upper levels of the taxonomy, thus resulting in hundreds of wrongly classified<br />

synsets. We also show how our methodology is applied to solve these problems.<br />

One conflict between first-order and second-order features, originally arising<br />
in {Statue_of_Liberty_1}, climbs up to {creation_2} and reveals the considerable confusion<br />
in WN1.6 regarding art, artistic genres, works of art and art performances (the last<br />
being events). Fixing this involved blockages and feature underspecification<br />
throughout the hierarchy. In the end, one synset, {creation_2}, had to be underspecified,<br />
as it would need a disjunction of properties to be properly represented: it can be<br />
either an object or an event.<br />

In order to facilitate the explanation, synsets are represented by a single intuitive<br />

word, the 3rdOrderEntity feature is more intuitively represented as [Concept], and<br />
only the most relevant synsets and features are shown.



Fig. 2. The case of the Statue. Initial situation.



As a starting point, there were four BCs manually annotated in EWN: {artifact} as<br />

[Object], {abstraction} as [Concept], {attribute} as [Property] and {sculpture} as<br />

[ImageRepresentation]. Figure 2 shows the clear-cut result of a direct expansion of<br />

properties by feature inheritance.<br />

As a result of this process, several striking annotations can be noticed at first<br />
sight, for instance: (1) {musical composition}, {dance} and {impressionism} as<br />

[Object]; (2) {sculpture} as [Property], and (3) {Statue_of_Liberty} as [Concept].<br />

Notice that we became aware of this situation by inspecting the incompatibility<br />
of the TCO features inherited by {Statue of Liberty}. Due to multiple inheritance,<br />
the popular monument was taken to be an artifact, hence an object, but at the same<br />
time a kind of {art}, like e.g. {dance}, which is clearly an event, while<br />
{impressionism} is nothing but a concept. Moreover, {Statue of Liberty} appeared to<br />
be an abstraction, a [Concept], just like the geometric notion of a {plane}. Lastly, the<br />
statue also inherited [Property]. So, applying full inheritance of<br />
ontological properties in WN1.6 resulted in multiple incompatible features eventually<br />
colliding at {Statue_of_Liberty}.<br />

The analysis of the situation led to the blockage of the following hierarchy paths, as<br />
shown in Figure 3:<br />

- Between {artifact} and {creation}<br />

- Between {art} and {dance} (but not between {art} and {genre})<br />

- Between {plastic_art} and {sculpture}<br />

- Between {three_dimensional_figure} and {sculpture}<br />

Moreover, {creation} was underspecified by assigning the topmost neutral feature<br />
[Top], and [Property] was deleted from {attribute}, since it is better represented by<br />
{attribute}'s hyponym {property}, while the rest of the hyponyms considered here (lines,<br />
planes, etc.) are, according to their glosses and relations, concepts.<br />

The reasons behind these changes were the following:<br />

(1) Although, intuitively, one might say that a creation is an artifact (for<br />

creations are made by men), according to the glosses and hyponyms one can<br />

realize that the synset {artifact} subsumes objects, while {creation}<br />

subsumes both objects and activities brought about by men (e.g. a “musical<br />

composition”). Therefore, {creation} cannot inherit first-order features,<br />
since they are incompatible with second-order ones. Consequently,<br />
{creation} was here labeled [Top], thus allowing its hyponyms to be<br />
further specified as entities or events, since neither its gloss (“something that<br />
has been brought into existence by someone”) nor the heterogeneity<br />
of its hyponyms allowed a choice to be made. In a more flexible version of the<br />
TCO, such as that proposed by [10], [Origin] features could also be attributed to<br />
second- and third-order entities. This would allow [Artifact] to be assigned to synsets<br />
like {creation}. We intend to evolve towards such a TCO in the future.<br />

(2) Although, intuitively, one might say that dance is a kind of art, according to<br />

the glosses and other hyponyms one realizes that {art} refers to the concept<br />
(like e.g. {impressionism}) while {dance} refers to an activity. Therefore,<br />
while “art” and “impressionism” are considered ideas, “dance” is<br />
an activity.



(3) Although, intuitively, one might say that sculpture is a plastic art, according<br />

to the glosses and other hyponyms one can realize that, as regards the senses<br />

given, {sculpture} refers to physical objects, while {plastic_art} refers to the<br />

abstract concept — a type of {art}, such as {impressionism}.<br />

(4) Although, intuitively, one might say that a sculpture is a three dimensional<br />

figure, according to the glosses and other hyponyms it is realized that,<br />

{three_dimensional_figure} refers to the shape (the same as one-dimensional<br />

lines or two-dimensional planes, that is, abstract shapes). Therefore, in this<br />

sense, “sculptures” are objects while “figures” or “shapes” are geometrical<br />

abstractions.<br />

The final result, as can be seen in Figure 3, is a new, quite reasonable labelling of<br />
the set of concepts, implicitly involving a reorganisation of the WN1.6 hierarchy. It is<br />
easy to see how this limited set of decisions (four blockages, one feature deletion<br />
and a few feature relabellings) subsequently affects hundreds of synsets. For instance,<br />

{creation} and {sculpture} relate to 713 and 28 hyponyms respectively.<br />

4.4 Notes for further discussion<br />

While carrying out the work, many interesting facts have been<br />
discovered about two objects of study: the structure of the noun hierarchy of WN1.6<br />
and the nature of the EWN TCO features, as well as the mapping between the two.<br />
These facts are going to be studied further, taking into consideration at least the<br />
following issues:<br />

- To what extent the noun hierarchy problems correspond to those<br />
described in [22], or whether there are other kinds of facts distorting the WordNet<br />
structure<br />
- Typical doubts or mistakes in the annotation of BCs with the TCO carried out in<br />
the EWN Project<br />

- Problems related to lack of clear definition of either synsets in WordNet or<br />

features in the TCO<br />

For instance, a very common malpractice in EWN when annotating BCs with the<br />

TCO was the double coding of non-physical entities both as 3rdOrderEntity<br />
and Mental. Mental is a subfeature of 2ndOrderEntity and, since 2ndOrderEntity<br />
and 3rdOrderEntity are explicitly declared incompatible, Mental and<br />
3rdOrderEntity cannot coexist. Therefore, Mental had to be deleted, since what the<br />
encoder was intuitively doing in these cases was stating twice that the synset stands<br />
for a mental or conceptual entity. In a future enhanced TCO, following [10], it would<br />
be better to allow Origin, Form, Composition and Function features to be applied to<br />
situations and concepts, instead of the current classification based on 3rdOrderEntity<br />
and Mental being disjoint. This would allow Concept to be cross-classified, for<br />
instance classifying “Neverland” as both Concept and Place in order to indicate that it<br />
is an imaginary location, or underspecifying “creation” by classifying it simply as<br />
Artifact.



Fig. 3. The case of the Statue. Final result.



5 Quantitative analysis<br />

Summarizing, the whole process provided a complete and consistent annotation of the<br />
nominal part of WN1.6, which consists of 65,989 noun synsets with 116,364<br />
variants or senses. All 227,908 initial incompatibilities were solved by manually<br />
adding or removing 13,613 TCO features and establishing 359 blockage points. The<br />
final resource now has 207,911 synset-feature pairs without expansion and 427,460<br />
synset-feature pairs with consistent feature inheritance.<br />

6 Conclusions and further work<br />

We have presented the full annotation of the nouns in the EuroWordNet (EWN)<br />
Interlingual Index (ILI) with the semantic features constituting the EWN Top<br />
Concept Ontology (TCO). This goal has been achieved by following a methodology<br />
based on an iterative and incremental expansion of the initial labelling through the<br />
hierarchy while setting inheritance blockage points. Since this labelling has been<br />
applied to the ILI, it can also be used to populate any other WordNet linked to it through<br />
a simple porting process.<br />

This resource 8 is intended to be useful for a large number of semantic NLP tasks<br />

and for testing, for the first time, componential analysis in real environments.<br />

Moreover, those mistakes encountered in WordNet noun hierarchy (i.e. false ISA<br />

relations), which are signalled by more than 350 blocking annotations, provide an<br />

interesting resource which deserves future attention.<br />

Further work will focus on the annotation of a corpus oriented to the acquisition of<br />

selectional preferences. This, compared to state-of-the-art synset-generalisation<br />

semantic annotation, will result in a qualitative evaluation of the resource and in<br />

gaining knowledge for designing an enhanced version of the Top Concept Ontology<br />

more suitable for semantically-based NLP.<br />

References<br />

1. Simone, R.: Fondamenti di Linguistica. Laterza & Figli, Bari-Roma. Trad. Esp.: Ariel, 1993<br />

(1990)<br />

2. Katz, J.J., Fodor, J. A.: The Structure of a Semantic Theory. J. Language 39, 170-210 (1963)<br />

3. Fellbaum C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge<br />

MA (1998)<br />

4. Alonge, A., Bertagna, F., Bloksma, L., Climent, S., Peters, W., Rodríguez, H., Roventini, A.,<br />

Vossen, P.: The Top-Down Strategy for Building EuroWordNet: Vocabulary Coverage, Base<br />

Concepts and Top Ontology. In: Vossen, P. (ed.) EuroWordNet: A Multilingual Database<br />

with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht (1998)<br />

5. Lyons, J.: Semantics. Cambridge University Press, Cambridge, UK (1977)<br />

6. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge, MA (1995)<br />

7. Vendler, Z.: Linguistics in philosophy. Cornell University Press, Ithaca, N.Y. (1967)<br />

8 http://lpg.uoc.edu/files/wei-topontology.2.2.rar



8. Talmy, L.: Lexicalization patterns: Semantic structure in lexical forms. In: Shopen (ed.)<br />

Language typology and syntactic description: Grammatical categories and the lexicon. Vol.<br />

3, pp. 57–149. Cambridge University Press. Cambridge, UK (1985)<br />

9. Sanfilippo, A., Calzolari, N., Ananiadou, S. et al.: Preliminary Recommendations on Lexical<br />

Semantic Encoding. Final Report. EAGLES LE3-4244 (1999)<br />

10. Vossen, P.: Tuning Document-Based Hierarchies with Generative Principles. In: GL'2001<br />

First International Workshop on Generative Approaches to the Lexicon. Geneva (2001)<br />

11. The Global WordNet Association web site. Last accessed 04.06.2007.<br />

http://www.globalwordnet.org/gwa/gwa_base_concepts.htm<br />

12. Rigau, G., Magnini, B., Agirre, E., Vossen, P., Carroll, J.: Meaning: A roadmap to<br />

knowledge technologies. In: Proceedings of COLING'2002 Workshop on A Roadmap for<br />

Computational Linguistics. Taipei, Taiwan (2002)<br />

13. Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., Vossen, P.: The<br />

MEANING multilingual central repository. In: Proceedings of the Second International<br />

Global WordNet Conference (<strong>GWC</strong>'04). Brno, Czech Republic, January 2004. ISBN 80-<br />

210-3302-9 (2004)<br />

14. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks.<br />

Kluwer Academic Publishers (1998)<br />

15. Magnini, B., Cavaglià, G.: Integrating subject field codes into wordnet. In: Proceedings of<br />

the Second International Conference on Language Resources and Evaluation LREC'2000.<br />

Athens, Greece (2000)<br />

16. Niles, I., Pease, A.: Towards a standard upper ontology. In: Proceedings of the 2nd<br />

International Conference on Formal Ontology in Information Systems (FOIS-2001) (2001)<br />

17. Niles, I., Pease, A.: Linking Lexicons and Ontologies: Mapping WordNet to the Suggested<br />
Upper Merged Ontology. In: Proceedings of the 2003 International Conference on<br />

Information and Knowledge Engineering. Las Vegas, USA (2003)<br />

18. Hutchins, J.: A new era in machine translation research. J. Aslib Proceedings 47 (1) (1995)<br />

19. Nasr, A., Rambow, O., Palmer, M., Rosenzweig, J.: Enriching Lexical Transfer With Cross-<br />

Linguistic Semantic Features (or How to Do Interlingua without Interlingua). In:<br />

Proceedings of the 2nd International Workshop on Interlingua. San Diego, California (1997)<br />

20. Atserias, J., Padró, L., Rigau, G.: An Integrated Approach to Word Sense Disambiguation.<br />

Proceedings of the RANLP 2005. Borovets, Bulgaria (2005)<br />

21. Villarejo, L., Màrquez, L., Rigau, G.: Exploring the construction of semantic class<br />

classifiers for WSD. J. Procesamiento del Lenguaje Natural 35, 195-202. Granada, Spain<br />

(2005)<br />

22. Guarino, N.: Some Ontological Principles for Designing Upper Level Lexical Resources.<br />

In: Proceedings of the 1st International Conference on Language Resources and Evaluation.<br />

Granada (1998)<br />

23. Dzikovska, M. O., Swift, M. D., Allen, J. F.: Customizing meaning: building<br />

domain-specific semantic representations from a generic lexicon. Kluwer Academic<br />

Publishers (2003)<br />

24. Cruse, D.A.: Hyponymy and Its Varieties. In: Green, R., Bean, C.A., Myaeng, S. H. (eds.)<br />

The Semantics of Relationships: An Interdisciplinary Perspective, Information Science and<br />

Knowledge Management. Springer Verlag (2002)<br />

25. Atserias, J., Climent, S., Rigau, G.: Towards the MEANING Top Ontology: Sources of<br />

Ontological Meaning. In: Proceedings of the LREC 2004. Lisbon (2004)<br />

26. Rosch, E., Mervis, C.B.: Family resemblances: Studies in the internal structure of<br />

categories. J. Cognitive Psychology 7, 573-605 (1975)<br />

27. Cruse, D. A.: Lexical Semantics. Cambridge University Press, NY (1986)<br />

28. Riazanov, A., Voronkov, A.: The Design and Implementation of Vampire. J. AI<br />

Communications 15(2). IOS Press (2002)



29. Schulz, S.: E - A Brainiac Theorem Prover. J. AI Communications 15(2/3). IOS<br />

Press (2002)<br />

30. Martin, Ph.: Correction and Extension of WordNet 1.7. In: Proceedings of the 11th<br />

International Conference on Conceptual Structures. LNAI 2746, pp. 160-173. Springer<br />

Verlag, Dresden, Germany (2003)


SemanticNet: a WordNet-based Tool for the Navigation<br />

of Semantic Information<br />

Manuela Angioni, Roberto Demontis, Massimo Deriu, and Franco Tuveri<br />

CRS4 - Center for Advanced Studies, Research and Development in Sardinia, Polaris -<br />

Edificio 1, 09010 Pula (CA), Italy<br />

{angioni, demontis, deriu, tuveri}@crs4.it<br />

Abstract. The main aim of the DART search engine is to index and retrieve<br />

information both in a generic and in a specific context, whether or not documents can be<br />
mapped onto ontologies, vocabularies and thesauri. To achieve this goal, a<br />

semantic analysis process on structured and unstructured parts of documents is<br />

performed. While the unstructured parts need a linguistic analysis and a<br />

semantic interpretation performed by means of Natural Language Processing<br />

(NLP) techniques, the structured parts need a specific parser. In this paper we<br />

illustrate how semantic keys are extracted from documents starting from<br />

WordNet and used by an automatic tool in order to define a new semantic net<br />

called SemanticNet build enriching the WordNet semantic net with new nodes,<br />

links and attributes. Formulating the query through the search engine, the user<br />

can move through the SemanticNet and extracts the concepts which really<br />

interest him, limiting the search field and obtaining a more specific result by<br />

means of a dedicated tool called 3DUI4SemanticNet.<br />

Keywords: Semantic net, Ontologies, NLP, 3D User Interface.<br />

1 Introduction<br />

The main aim of the DART ([1] and [2]) project is to realize a distributed architecture<br />

for a semantic search engine, presenting the user with relevant resources in reply to a<br />
query about a specific domain of interest. In this paper we present concepts and<br />

solutions related to the semantic aspects and to the geo-referencing features, designed<br />

to support the user in the information retrieval and to supply position based<br />

information strictly related to a specified area.<br />

In order to reach this goal, a prototype able to enrich the WordNet [3] semantic net<br />

by means of new concepts often related to specific knowledge domains has been<br />

realized as part of the DART project [4]. This aspect of the problem brings us to<br />
distinguish between a specific and a generic context, both in indexing and in retrieval<br />
of information, depending on whether or not documents can be mapped onto ontologies,<br />
vocabularies and thesauri.<br />

The paper deals with two main aspects. The first is the definition of a semantic net,<br />

called SemanticNet, and the definition of a 3D user interface, called<br />

3DUI4SemanticNet, that allows users to navigate the concepts through their relations. The<br />
second highlights the use of ontologies or structure descriptors, as in the case of



dynamic XML documents generated by Web services, RDF and OWL documents,<br />

and the possibility of building specialized Semantic Nets based on a specific context of<br />
use.<br />

Section 2.1 describes how the system behaves in the generic context, while section<br />
2.2 does so for specific contexts. Section 3 describes how the SemanticNet is obtained<br />

from WordNet and enriched by means of a certified source of documents. Section 4<br />

proposes a use case related to a GIS (Geographical Information System) specific<br />

domain. Finally, section 5 describes the 3DUI4SemanticNet, a 3D navigation tool<br />

developed in order to give users a friendly interface to information contained in the<br />

SemanticNet.<br />

2 Generic and Specific Contexts<br />

Our main goal is to index and retrieve information both in a generic and in a specific<br />
context, whether or not documents can be mapped onto ontologies, vocabularies and<br />
thesauri. To achieve this goal, a semantic analysis process on structured and<br />

unstructured parts of documents is performed.<br />

The first step of this process is to identify structured and unstructured parts of a<br />

document. The structured portions are identified by ontologies or structure<br />

descriptors, as in the case of dynamic XML documents generated by Web services,<br />

RDF and OWL documents, if they are known by the system.<br />

In the second step the semantic analysis is performed. The unstructured parts need<br />

a linguistic analysis and a semantic interpretation performed by means of Natural<br />

Language Processing (NLP) techniques, while the structured parts need a specific<br />

parser. Finally, through syntactic and semantic analysis, the system extracts the<br />

semantic keys that represent the document or a user query and identifies them by a<br />

descriptor stored in a Distributed Hash Table (DHT) [1]. These keys are defined<br />

starting from the semantic net of WordNet, a lexical database for the English<br />

language. The system then defines a new semantic net, called<br />

SemanticNet, derived from WordNet [5] by adding new nodes, links and attributes.<br />
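To make the descriptor storage concrete, the following sketch (ours, not the DART implementation; all names and the node layout are illustrative) shows how a document descriptor keyed by a semantic key could be placed in and retrieved from a toy DHT by hashing the key onto a fixed set of nodes:

```python
import hashlib

# Minimal in-memory stand-in for a DHT: semantic keys are hashed onto a
# ring of nodes; each node stores the descriptors for the keys it owns.
class TinyDHT:
    def __init__(self, n_nodes=4):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _owner(self, key):
        # Hash the key and map it to one of the nodes.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, semantic_key, descriptor):
        node = self.nodes[self._owner(semantic_key)]
        node.setdefault(semantic_key, []).append(descriptor)

    def get(self, semantic_key):
        return self.nodes[self._owner(semantic_key)].get(semantic_key, [])

dht = TinyDHT()
# A WordNet synset ID used as the semantic key for a document descriptor.
dht.put("synset:02045461", {"doc": "doc-17", "category": "Animals"})
assert dht.get("synset:02045461")[0]["doc"] == "doc-17"
```

A real DHT distributes the nodes over the network; here they are plain dictionaries, which is enough to show the key-to-owner routing.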

Web resources managed by the system are naturally heterogeneous from the point of<br />

view of their contents. Classifying web resources is necessary in<br />

order to identify the meaning of keywords by means of their context of use. The same<br />

keywords are the references the system uses in both the indexing and searching phases.<br />

2.1 The Generic Context<br />

In a generic context the system does not know a formal or terminological ontology<br />

describing the specific semantic domain, so the concepts defined in the structured<br />

part of the document by such an ontology are evaluated from the<br />

WordNet point of view. The conceptual mapping could reduce the quality of<br />

information, but its value is that it allows better navigation between the concepts<br />

defined in two ontologies, using a shared semantics. In general, the more accurate the<br />

conceptual mapping, the better the system's response. The module can


SemanticNet: a WordNet-based Tool for the Navigation of Semantic… 23<br />

also extract the specialized semantic keys from the structured part of a document and<br />

return the generic semantic keys mapped to them by means of the conceptual<br />

mapping.<br />

Unstructured parts need a linguistic analysis and a semantic interpretation to be<br />

performed by means of NLP techniques. The main tools involved are:<br />

• the Syntactic Analyzer and Disambiguator, a module for the syntactic analysis,<br />

integrated with the Link Grammar parser [6], a highly lexical, context-free<br />

formalism. This module identifies the syntactic structure of sentences and<br />

resolves the role ambiguities of terms in natural language;<br />

• the Semantic Analyzer and Disambiguator, a module that analyzes each<br />

sentence, identifying roles, meanings of terms and semantic relations, in order to<br />

extract part-of-speech information and the synonymy and hypernymy relations<br />

from the WordNet semantic net. It also evaluates the terms contained in the<br />

document by means of a density function based on synonym and<br />

hypernym frequency [7];<br />

• the Classifier, a module that classifies documents automatically. As proposed in<br />

WordNet Domains ([8] and [9]), a lexical resource representing domain<br />

associations between terms, the module applies a classification algorithm based<br />

on the Dewey Decimal Classification (DDC) and associates a set of categories<br />

and a weight to each document.<br />

The analysis of structured parts, followed by the linguistic analysis and the semantic<br />

interpretation of unstructured parts, produces three types of semantic keys:<br />

• a synset ID identifying a particular sense of a WordNet term;<br />

• a category name given by a possible classification;<br />

• a key composed of a word and a document category, used when the word is not<br />

included in the WordNet vocabulary.<br />

Finally, all semantic keys are used to index the document, whereas in the search<br />

phase they are used to retrieve document descriptors from the SemanticNet<br />

through the concept of semantic vicinity.<br />
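The three key types listed above can be sketched as follows. The lexicon here is a toy stand-in for WordNet (the synset IDs are merely illustrative), and the function name is ours:

```python
# Hypothetical toy lexicon standing in for WordNet; the real system uses
# WordNet synset IDs and WordNet Domains categories.
WORDNET = {"tiger": "02045461", "lion": "02129165"}

def semantic_keys(words, category):
    """Return the three key types for a classified document."""
    keys = [("category", category)]                        # type 2: category name
    for w in words:
        if w in WORDNET:
            keys.append(("synset", WORDNET[w]))            # type 1: synset ID
        else:
            keys.append(("word+category", (w, category)))  # type 3: out-of-vocabulary
    return keys

keys = semantic_keys(["tiger", "superpredator"], "Animals")
assert ("synset", "02045461") in keys
assert ("word+category", ("superpredator", "Animals")) in keys
```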

2.2 The Specific Context<br />

In a specific context the system adopts a formal or terminological ontology describing<br />

the specific semantic domain. The semantic keys are the identifiers of concepts in the<br />

specialized ontology and the classification is performed using the taxonomy defined<br />

in it.<br />

Different analyses are performed on the structured and unstructured parts of the<br />

document, identified as in the generic context. Moreover, two types of indexing are<br />

performed: one with a set of generic semantic keys and the other with a set of<br />

specialized semantic keys.<br />

The module extracts all of the specialized semantic keys each time a structured part<br />

related to a specific context is identified in a document. Otherwise the system


24 Manuela Angioni, Roberto Demontis, Massimo Deriu, and Franco Tuveri<br />

performs the same analysis as in the general context and adds the result to the set of<br />

generic semantic keys. The extracted keys then have to be mapped into specialized<br />

semantic keys, so a conceptual mapping between the specific domain<br />

ontology and the WordNet ontology is needed in the design phase of the<br />

module. However, the conceptual map is not always exhaustive: each time a concept<br />

cannot be mapped into a synset ID, it is mapped into a new unique code.<br />

Structured parts of a document with explicit semantics (XML, RDF, OWL,<br />

etc.) and defined by a known formal or terminological ontology are parsed by the<br />

system. The results are concepts of the ontology, so the system has to know their<br />

mapping onto concepts of the WordNet ontology. The conceptual mapping of a<br />

specialized concept returns a WordNet concept or its textual representation as a code<br />

or a set of words. Through the conceptual mapping the relation 'SAME-AS' has been<br />

defined; it connects WordNet synset IDs with concepts of the specialized ontology.<br />

This new relation gives us the possibility to build a specialized Semantic Net (sSN)<br />

complying with the following properties:<br />

1. Every sSN node is a concept in the specialized ontology.<br />

2. Every sSN node has at least one sSN relation with another sSN node or it is<br />

related with a node in the SemanticNet by means of the 'SAME-AS' relation.<br />

3. An sSN relation has to be consistent with the corresponding relation in the SN.<br />

An example is the 'broader' relation described in SKOS [10], which<br />

identifies a more general concept in meaning: it can be<br />

mapped to 'IS-A' if the mapping is consistent with the 'IS-A' relation<br />

derived from WordNet's hyponymy relation.<br />
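Property 2 above can be checked mechanically. The sketch below is ours, with illustrative data (the GEMET-style identifiers and the synset ID are hypothetical): a two-node sSN where one node reaches the SemanticNet through 'SAME-AS' and the other through an sSN relation.

```python
# Toy specialized Semantic Net: nodes, sSN relations, and SAME-AS links
# into the generic SemanticNet. All identifiers are made up for illustration.
ssn_nodes = {"gemet:topography", "gemet:photogrammetry"}
ssn_relations = [("gemet:photogrammetry", "IS-A", "gemet:topography")]
same_as = {"gemet:topography": "synset:06422547"}  # hypothetical synset ID

def satisfies_property_2(node):
    """Property 2: the node takes part in an sSN relation or has a SAME-AS link."""
    in_relation = any(node in (src, tgt) for src, _, tgt in ssn_relations)
    return in_relation or node in same_as

assert all(satisfies_property_2(n) for n in ssn_nodes)
```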

A generic semantic key extracted from the unstructured parts of the document can<br />

identify, in this way, a specialized semantic key if a mapping between the two<br />

concepts exists.<br />

Once the semantic keys are extracted from the text, the system performs the<br />

classification related to the generic and the specific context. To index a document, the<br />

system needs its classification and two sets of semantic keys: one related to the generic<br />

context and the other to the specialized context. In the search phase the system<br />

retrieves documents by means of a specialized semantic key, for example a URI [11],<br />

or a generic semantic key, together with the keys within a semantic vicinity of it in the sSN or SN.<br />

An example of a specific-context module can be found in Section 4, where the specific<br />

ontology is a terminological ontology for the GIS domain.<br />

3 Building the SemanticNet<br />

In the context of DART, we think that the answer to a user query can be given by<br />

providing the user with several kinds of results, not always related in the standard way<br />

of today's search engines. To achieve this goal, we extend the semantic net<br />

of WordNet by identifying valid and well-founded conceptual relations and links<br />

contained in documents, in order to build a data structure, composed of concepts and<br />

correlations between concepts and information, that allows users to navigate in the



concepts through relations. Formulating the query, the user can move through the net<br />

and extract the concepts which really interest him, narrowing the search field and<br />

obtaining a more specific result. The enriched semantic net can also be used directly<br />

by the system without the user being aware of it. In fact, the system receives and<br />

processes queries by means of the SemanticNet, extracts from the query the concepts<br />

and their relations, then shows the user a result set related to the new concepts<br />

found as well as to the found categories.<br />

The automatic creation of a conceptual knowledge map from documents coming<br />

from the Web is a very relevant problem, and a very hard one because of the difficulty<br />

of distinguishing documents with valid contents from those with invalid ones. We therefore realized<br />

the importance of being able to access a multidisciplinary structure of documents, so a<br />

great number of documents included in Wikipedia [12] was used to extract new<br />

knowledge and to define a new semantic net, enriching WordNet with new terms, their<br />

classification and new associative relations.<br />

In fact WordNet, as a semantic net, is too limited with respect to the vocabulary of<br />

the web: WordNet contains about 150,000 words organized in over 115,000 synsets,<br />

whereas Wikipedia contains about 1,900,000 encyclopedic entries; the number<br />

of connections between words related by topic is limited; and several word senses are<br />

not included in WordNet. These are only some of the reasons that convinced us to<br />

enrich the WordNet semantic net, as emphasized in [13], where the authors identify this<br />

and five other weaknesses in the constitution of the WordNet semantic net.<br />

We chose Wikipedia, the free-content encyclopedia, over other solutions<br />

such as language-specific thesauri or online encyclopedias available only in a<br />

specific language. A conceptual map built from Wikipedia pages allows a user to<br />

associate a concept with others through the relations that an author points<br />

out. The use of Wikipedia guarantees, with reasonable certainty, that such a conceptual<br />

connection is valid because it is produced by someone who, at least theoretically, has<br />

the skills or the experience to justify it. Moreover, the rules and reviewer controls<br />

in place guarantee the reliability, objectivity and correctness of the inserted topics.<br />

The reviewers also check the conformity of added or modified entries.<br />

What we are most interested in are terms and their classification, in order to build<br />

an enriched semantic net, called “SemanticNet”, to be used in the search phase in<br />

the general context, while in the specific context we build a specialized Semantic Net<br />

(sSN). The reason for this is that mental associations of places, events, persons<br />

and things vary with the cultural background and personal history of each user. In fact,<br />

the ability to associate one concept with another differs from person to person. The<br />

SemanticNet is definitely not exhaustive: it is limited by the dictionary of WordNet,<br />

by the contents included in Wikipedia and by the accuracy of the information given<br />

by the system.<br />

Starting from the information contained in Wikipedia about a term of WordNet, the<br />

system is capable of enriching the SemanticNet by adding new nodes, links and<br />

attributes, such as IS-A or PART-OF relations. Moreover, the system is able to<br />

classify the textual contents of web resources, indexed through the Classifier, which uses<br />

WordNet Domains and applies a density function ([1], [2]) based on the count<br />

of the synsets related to each term of the document. In this way, it is able to<br />

retrieve the most frequently used senses by extracting the synonymy relations given<br />

by the use of similar terms in the document. Through the categorization of the



document, it can associate the most correct meaning to the term and can assign a<br />

weight to each category related to the content.<br />
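The density-style disambiguation described above can be approximated as follows. The sense-to-domain table is toy data and the weighting scheme (a term's vote split evenly across its candidate senses) is our simplification of the density function of [1] and [2]:

```python
from collections import Counter

# Toy sense inventory: each term maps to a list of senses, each sense to its
# WordNet Domains categories. Illustrative data, not the real resources.
SENSE_DOMAINS = {
    "tiger": [["Animals"], ["Person"]],  # two senses, as in the paper's example
    "lion": [["Animals"]],
    "river": [["Geography"]],
}

def classify(terms):
    """Accumulate a weight per domain; ambiguous terms split their vote."""
    weights = Counter()
    for t in terms:
        senses = SENSE_DOMAINS.get(t, [])
        for domains in senses:
            for d in domains:
                weights[d] += 1.0 / len(senses)
    return weights

w = classify(["tiger", "lion", "river"])
assert w["Animals"] > w["Person"]  # context disambiguates tiger -> Animals
```

With this scheme the isolated term tiger is a tie between Animals and Person; the co-occurring lion is what tips the balance, which mirrors the role of similar terms in the document described above.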

In fact, each term in WordNet may have more than one meaning, each corresponding to a<br />

Wikipedia page. We therefore need to extract the specific meaning described in the<br />

Wikipedia page content, in order to build a conceptual map where the nodes are the<br />

senses (a WordNet synset or a term+category pair) and the links are given both by the<br />

WordNet semantic-lexical relations and by the conceptual associations built by<br />

the Wikipedia authors. For example, the term tiger in WordNet corresponds to the<br />

Wikipedia page having http://en.wikipedia.org/wiki/Tiger as its URL and the same title<br />

as the term, as shown in Figure 1.<br />

Such a conceptual map allows the user to move over an information structure that<br />

connects several topics through the semantic-lexical relations extracted from<br />

WordNet, but also through the new associations made by the conceptual connections<br />

inserted by Wikipedia users and extracted by the system.<br />

From each synset a new node of the conceptual map is defined, containing<br />

information such as the synsetID and the categories with their weights. Through the<br />

information extracted from Wikipedia it is possible to build nodes having a<br />

conceptual proximity to the starting node and to define, through these relations,<br />

a data structure linking all nodes of the SemanticNet. Each node is then uniquely<br />

identified by a semantic key, by the term referring to the Wikipedia page and by the<br />

extracted categories.<br />
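A possible shape for such a node, inferred from this description, is sketched below. The field names are ours, not taken from the DART sources, and the example values come from the tiger example of Section 3.2:

```python
from dataclasses import dataclass, field

# Illustrative node structure for the SemanticNet, inferred from the text:
# a semantic key, the Wikipedia page term, weighted categories, and relations.
@dataclass
class SNNode:
    semantic_key: str  # synset ID, or term+category when outside WordNet
    term: str          # title of the related Wikipedia page
    categories: dict = field(default_factory=dict)  # category -> weight
    relations: list = field(default_factory=list)   # (relation, target_key)

tiger = SNNode("02045461", "Tiger", {"Animals": 0.9})
tiger.relations.append(("COMMONSENSE", "lion@Animals"))
assert tiger.categories["Animals"] > 0.5
```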

3.1 The Data Structure<br />

The text of about 60,000 Wikipedia articles corresponding to a WordNet term,<br />

independently of their content, has been analyzed in order to extract new relations<br />

and concepts, and it has been classified using the same convention used for the<br />

mapping of WordNet terms in WordNet Domains, assigning to each category a<br />

weight of relevance.<br />

Only the categories having the greatest importance are taken into consideration. The<br />

success rate in assigning categories has been evaluated at about 90%.<br />

Measurements were made by choosing a set of categories and manually analyzing the<br />

correctness of the resources classified under each category. Through the content<br />

classification of Wikipedia pages, the system assigns to their titles a WordNet<br />

synsetID, if one exists.<br />

By analyzing the content of the Wikipedia pages, a new relation named<br />

“COMMONSENSE” is defined, which delineates the SemanticNet together with the<br />

semantic relations “IS-A” and “PART-OF” given by WordNet.<br />

The COMMONSENSE relation is a connection between a term and the links<br />

defined by the author of the Wikipedia page related to that term. These links are<br />

associations that the author identified and highlighted. Sometimes this relation<br />

closes, with a direct connection, a chain of semantic relations between concepts of the<br />

WordNet semantic net. In such situations we consider these direct links valid because<br />

someone, in the Wikipedia resource, certified a relation between them. In fact, the<br />

vicinity between concepts is justified by the logic expressed by the author in the



Wikipedia article. The importance of these relations emerges each time the system<br />

gives back results to a query, providing the user with concepts that are related in WordNet<br />

or in the conceptual map derived from Wikipedia.<br />
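The extraction of COMMONSENSE edges from the links an author placed in a page can be sketched as follows. The markup snippet uses simplified MediaWiki-style link syntax, and the function is illustrative, not the actual parsing code:

```python
import re

# Derive COMMONSENSE edges from the wiki links in a page's text.
page_text = "The [[tiger]] is an apex predator, like the [[lion]], found near [[river]]s."

def commonsense_edges(source_sense, wikitext):
    """Each [[link]] an author inserted becomes a COMMONSENSE edge."""
    targets = re.findall(r"\[\[([^\]|]+)", wikitext)
    return [(source_sense, "COMMONSENSE", t) for t in targets if t != source_sense]

edges = commonsense_edges("tiger", page_text)
assert ("tiger", "COMMONSENSE", "lion") in edges
assert ("tiger", "COMMONSENSE", "river") in edges
```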

Fig. 1: The node of the SemanticNet which describes the concept “Tiger”<br />

In the SemanticNet, relations are defined as directed edges of a graph, so the<br />

inverse relations are also a usable feature of the net. An example is the<br />

hyponym/hypernym relation, labeled in the SemanticNet as an “IS-A” relation with a<br />

direction. The concept map is still under development and improvement. It<br />

consists of 25,010 nodes, each corresponding to a “sense” of WordNet and related



to a page of Wikipedia. Starting from these nodes, 371,281 relations were extracted;<br />

part of them are also included in WordNet, but they are mainly new relations<br />

extracted from Wikipedia.<br />

In particular, 306,911 COMMONSENSE relations were identified, of which 48,834 are<br />

tagged as IS-A_WN and 15,536 as PART-OF_WN, i.e. relations identified by the<br />

system but also included in WordNet. Sometimes it is also possible to extract IS-A<br />

and PART-OF relations from Wikipedia pages as new semantic relations not<br />

contained in WordNet, although for now they are not added to our structure.<br />

Terms not contained in WordNet but existing in Wikipedia become new nodes of the<br />

augmented semantic net. Terms whose meanings in Wikipedia differ from those<br />

existing in WordNet will be added as new nodes in the SemanticNet,<br />

but they will not have a synsetID given by WordNet.<br />

In Fig. 1 a portion of the SemanticNet is shown, starting from the node tiger. The<br />

text contained in the Wikipedia page corresponding to the term is analyzed and<br />

classified under the category Animals. In this way the system is able to exclude one of<br />

the two senses included in WordNet, the one related to a person<br />

(synsetID: 10012196), and to consider only the sense related to the category<br />

Animals (synsetID: 02045461). The net is thus enriched with other nodes having a<br />

conceptual proximity to the starting node tiger, and the new relations extracted from<br />

the page itself, as well as the relations included in WordNet, can be associated with this<br />

specific node by the system. In particular, the green arrow denotes a<br />

COMMONSENSE relation between tiger and the concepts the Wikipedia authors<br />

have pointed out by means of links. Some of them are terms not contained in WordNet at all<br />

(superpredator); others are included in WordNet but not directly related to the<br />

term itself in the vocabulary (lion). Other terms are included in WordNet and<br />

already directly related to the term tiger (panthera, tigress) by PART-OF,<br />

HAS-MEMBER or IS-A relations (blue and red arrows).<br />

Each term shown in the figure and related to the central node is itself a node,<br />

shown as a word for simplicity of representation. It is really a specific sense of<br />

the term, related to the node tiger as animal, and is characterized by a synsetID<br />

(or by the term+category pair if it is not included in WordNet), by a specific category<br />

and by a description that unequivocally identifies it.<br />

For example, if the user is interested in information about the place where the<br />

greatest percentage of tigers lives, but does not know or cannot remember its<br />

name, by navigating the SemanticNet he can find a COMMONSENSE relation<br />

between the term tiger and the term Indian subcontinent. He can then refine the query,<br />

limiting the search field to resources related to these two concepts.<br />

Some concepts, such as lake, river or swimmer, do not at first seem related to<br />

the concept tiger. But if we consider that tigers are strong swimmers and are<br />

often found bathing in ponds, lakes and rivers, the relations pointed out turn out<br />

to be relevant to the concept itself and useful for users interested in this specific aspect of<br />

the animal.



3.2 Evaluation of the Information<br />

Concerning the information added to the SemanticNet, we have to distinguish between<br />

information gathered from WordNet and information extracted from Wikipedia<br />

documents. Properties of the information coming from WordNet are maintained in the<br />

structure used in the SemanticNet. We then add new terms by means of a<br />

classification phase over Wikipedia documents, in order to correctly identify the information<br />

added to the concept map.<br />

As we said, the COMMONSENSE relations added to the SN are the links contained in<br />

the Wikipedia pages related to a particular sense of a term. The main question is to<br />

identify which categories are associated with that specific sense. We conducted our<br />

tests on sets of documents extracted from a total set of 47,639 documents,<br />

evaluating only five categories: Plants, Medicine, Animals, Geography and Chemistry.<br />

In the evaluation we only consider whether or not the classified document belongs to the<br />

specified category. The Classifier assigns a number of possible categories, each with an<br />

associated weight, to each document. We selected the best result for each document by<br />

means of a minimum weight threshold. In this way all terms added to the SemanticNet<br />

are always related in the correct way to others.<br />

Fig. 2: Classified documents for each category<br />

In Fig. 3 the measures of the Classifier for the five categories are shown. Results were<br />

validated by hand, verifying all documents identified by the Classifier for each<br />

category.<br />

Fig. 3: Measures of the classification



4 An Example of Specific Context<br />

Our example of a specific context refers to a GIS context. In order to perform a<br />

coherent analysis of documents, two relevant problems have to be solved. The first is<br />

the choice of the taxonomy and thesaurus describing the specific semantic domain:<br />

the multilingual 2004 GEMET (GEneral Multilingual Environmental Thesaurus,<br />

version 1.0) [14] was chosen. GEMET consists of 5,298 terms hierarchically ordered<br />

with 109 top terms, 40 themes, 35 groups and 3 super groups.<br />

The second is the problem of mapping GEMET concepts onto WordNet concepts. For<br />

example, the GEMET term “topography” has the right meaning in the WordNet<br />

semantic net, while the term “photogrammetry” does not appear at all. In such cases,<br />

and generally in all specialized contexts or specific domains, we need to define for<br />

this kind of term a new semantic key, which is the identifier used in the GEMET thesaurus.<br />

In order to generate the conceptual mapping, a semi-automatic procedure was<br />

implemented. This procedure is very similar to the evaluation of Wikipedia pages<br />

for enriching the SemanticNet, and it is based on two properties that GEMET<br />

and WordNet share. First, both have a hierarchical structure: the<br />

GEMET relations “narrower” and “broader” are similar to the “hyponym” and<br />

“hypernym” relations of WordNet, respectively, and this similarity is also used to build the<br />

“IS-A” relation in the sSN. Second, both GEMET and WordNet provide textual<br />

descriptions of concepts.<br />

Starting from the top terms of the GEMET thesaurus, the semantic keys are extracted<br />

from the textual description of each concept, and its semantic vicinity is calculated<br />

against the semantic keys extracted from the textual descriptions of the WordNet<br />

concepts found using the term related to the GEMET concept. Whether the<br />

“narrower/broader” concepts of the GEMET concept map onto hyponyms/hypernyms<br />

of the candidate WordNet concepts is also evaluated. The<br />

final evaluation of results was performed by a supervisor. A similar approach can be<br />

found in [15]. If the GEMET concept is not found in WordNet, the term related to the<br />

GEMET concept is used as the mapped semantic key.<br />
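The semantic-vicinity computation between textual descriptions can be approximated with a simple keyword-overlap measure. The glosses below are abridged, illustrative text, and the Jaccard overlap stands in for whatever vicinity measure the system actually uses:

```python
# Pick the WordNet candidate sense whose gloss is closest to the GEMET
# concept's description. Jaccard overlap is our stand-in vicinity measure.
def vicinity(desc_a, desc_b):
    a, b = set(desc_a.lower().split()), set(desc_b.lower().split())
    return len(a & b) / len(a | b)

gemet_desc = "the configuration of a surface including its relief"
wn_candidates = {  # abridged, illustrative glosses for two candidate senses
    "synset:topography.n.01": "the configuration of a surface and its relief",
    "synset:topography.n.02": "precise detailed study of the surface features of a region",
}
best = max(wn_candidates, key=lambda s: vicinity(gemet_desc, wn_candidates[s]))
assert best == "synset:topography.n.01"
```

A real pipeline would also score the narrower/broader versus hyponym/hypernym agreement described above before handing the candidate pair to the supervisor.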

Such a mapping makes it possible to classify the documents analysed in the<br />

generic context with the GEMET taxonomy and to generate the sSN, which contains<br />

the “IS-A” relations, derived from the “narrower” and “broader” GEMET relations,<br />

and the “SAME-AS” relation, described in Section 2.2, which is derived from the<br />

conceptual mapping and connects sSN nodes with SN nodes. An example is<br />

shown in Figure 4.<br />

5 3D User Interface for SemanticNet<br />

As described before, the SemanticNet allows users to navigate among concepts through<br />

relations. This feature entails the need to overcome the limits of traditional user<br />

interfaces, especially in terms of effectiveness and usability. For this reason, in parallel<br />

with the development of the SemanticNet we investigated new UI paradigms and<br />

models, focusing on 3D visualization to allow the user to change the point of view,<br />

improving the perception and understanding of contents [16]. The main idea is to



Fig. 4: A segment of the specialized SemanticNet relative to the concept “Tiger”.<br />

improve the user experience during the searching and extraction of concepts.<br />

An essential requirement for this goal is to guarantee better usability in<br />

information navigation through concrete representations and simplicity [17]. We developed<br />

an alternative tool (3DUI4SemanticNet) for browsing Web resources by<br />

means of the concept map. Formulating the query through the search engine, the<br />

user can move through the SemanticNet and extract the concepts which really interest<br />

him, narrowing the search field and obtaining a more specific result.<br />

This tool works according to the user's preferences and current context,<br />

providing a 3D interactive scene that represents the selected portion of the SemanticNet.<br />

Moreover, it provides a three-dimensional view that guarantees better usability in



terms of information navigation, and it can also provide different layouts for<br />

different cases ([18] and [19]). For example, a user can navigate through the net,<br />

moving from one term to another looking for relations and additional information. In<br />

order to keep the user in a Web context, the scene is described by an X3D document<br />

[20][21]. This choice has been driven by the features of this language, especially<br />

because it is a standard based on XML and is the ISO open standard for 3D content<br />

delivery on the web, supported by a large community of users. The X3D runtime<br />

environment is the scene graph, a directed acyclic graph containing the<br />

objects, represented as nodes, and the object relationships in the 3D world. The basic<br />

structure of X3D documents is very similar to that of any other XML document. In our<br />

application this structure is built starting from a description of the relations between<br />

terms, provided by a GraphML [18] file that represents the<br />

structural properties of the SemanticNet.<br />

Like X3D, GraphML is based on XML and, unlike many other file formats for<br />

graphs, it does not use a custom syntax. Its main idea is to describe the structural<br />

properties of a graph, with a flexible extension mechanism to add application-specific<br />

data; from our point of view, this fits the need to describe a portion of the<br />

SemanticNet starting from the individual nodes. The following figure shows the X3D<br />

model built from the GraphML provided for a part of the SemanticNet defined for the<br />

term “tiger”, disambiguated and assigned to the “Animals” WordNet Domains<br />

category.<br />
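The GraphML-to-X3D step can be sketched as follows. The markup and the layout (nodes placed along a line) are simplified; the real scene uses richer geometry, and every element name below other than the GraphML/X3D tags themselves is illustrative:

```python
import xml.etree.ElementTree as ET

# Turn each GraphML node into a labelled Transform in an X3D-style scene graph.
graphml = """<graphml><graph>
  <node id="tiger"/><node id="lion"/>
  <edge source="tiger" target="lion"/>
</graph></graphml>"""

def to_x3d(graphml_text):
    scene = ET.Element("Scene")
    graph = ET.fromstring(graphml_text).find("graph")
    for i, node in enumerate(graph.findall("node")):
        # Simplified layout: nodes spaced along the x axis.
        t = ET.SubElement(scene, "Transform", translation=f"{2 * i} 0 0")
        shape = ET.SubElement(t, "Shape")
        ET.SubElement(shape, "Text", string=node.get("id"))
    return scene

scene = to_x3d(graphml)
assert len(scene.findall("Transform")) == 2
assert scene.find("Transform/Shape/Text").get("string") == "tiger"
```

Edges would likewise be rendered as line geometry between the node positions; they are omitted here to keep the sketch short.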

Fig. 5: X3D representation for the term “tiger” described by GraphML



6 Conclusions and Future Work<br />

In this paper we have described the SemanticNet as part of the DART project, the<br />

Distributed Agent-Based Retrieval Toolkit, currently under development. We have<br />

focused our efforts on providing a user-friendly tool able to reach and filter<br />

relevant information by means of a conceptual map based on WordNet. One of the<br />

main aspects of this work has been the correct classification of resources, which has been<br />

very important in order to enrich the WordNet semantic net with new contents<br />

extracted from Wikipedia pages and with concepts coming from the GEMET<br />

thesaurus.<br />

The SemanticNet is structured as a highly connected directed graph: each vertex is<br />

a node of the net and the edges are the relations between nodes. Each element, vertex<br />

or edge, is labeled in order to give the user better usability in information<br />

navigation, even through a dedicated 3D tool.<br />

Future work will include the improvement of the SemanticNet, by extracting new<br />

nodes and relations, and the measurement of user preferences in navigating the<br />

net, in order to give a weight to the most used paths between nodes.<br />

Moreover, the structure of the nodes, as defined in the net, allows access to the glosses<br />

given by WordNet and to Wikipedia contents. The geographic context gives the user<br />

further filtering elements in the search for Web contents. In order to make it easier to<br />

implement specialized-context modules, their conceptual mapping<br />

and the definition of the specialized semantic net, future work will describe the<br />

SemanticNet with a simple formalism like SKOS. The system could thereby become<br />

more flexible in the indexing and searching of web resources.<br />

References<br />

1. Angioni, M. et al.: DART: The Distributed Agent-Based Retrieval Toolkit. In: Proc.<br />

of CEA 07, pp. 425–433. Gold Coast – Australia (2007)<br />

2. Angioni, M. et al.: User Oriented Information Retrieval in a Collaborative and<br />

Context Aware Search Engine. J. WSEAS Transactions on Computer Research,<br />

ISSN: 1991-8755, 2(1), 79–86 (2007)<br />

3. Miller, G. et al.: WordNet: An Electronic Lexical Database. Bradford Books (1998)<br />

4. Angioni, M., Demontis, R., Tuveri, F.: Enriching WordNet to Index and Retrieve<br />

Semantic Information. In: 2nd International Conference on Metadata and Semantics<br />

Research, 11–12 October 2007, Ionian Academy, Corfu, Greece (2007)<br />

5. Wordnet in RDFS and OWL,<br />

http://www.w3.org/2001/sw/BestPractices/WNET/wordnet-sw-20040713.html<br />

6. Sleator, D.D., Temperley, D.: Parsing English with a Link Grammar. In: Third<br />

International Workshop on Parsing Technologies (1993)<br />

7. Scott, S., Matwin, S.: Text Classification using WordNet Hypernyms. In:<br />

COLING/ACL Workshop on Usage of WordNet in Natural Language Processing<br />

Systems, Montreal (1998)<br />

8. Magnini, B., Strapparava, C., Pezzulo, G., Gliozzo, A.: The Role of Domain<br />

Information in Word Sense Disambiguation. J. Natural Language Engineering,<br />

special issue on Word Sense Disambiguation, 8(4), 359-373. Cambridge University<br />

Press (2002)



9. Magnini, B., Strapparava, C.: User Modelling for News Web Sites with Word Sense<br />

Based Techniques. J. User Modeling and User-Adapted Interaction 14(2), 239–257<br />

(2004)<br />

10. SKOS, Simple Knowledge Organisation Systems, http://www.w3.org/2004/02/skos/<br />

11. Uniform Resource Identifier, http://gbiv.com/protocols/uri/rfc/rfc3986.html<br />

12. Wikipedia, http://www.wikipedia.org/<br />

13. Harabagiu, S., Miller, G., Moldovan, D.: WordNet 2 - A Morphologically and<br />

Semantically Enhanced resource. In: Workshop SIGLEX'99: Standardizing Lexical<br />

Resources (1999)<br />

14. GEneral Multilingual Environmental Thesaurus – GEMET<br />

http://www.eionet.europa.eu/gemet/<br />

15. Mata, E. J. et al.: Semantic disambiguation of thesaurus as a mechanism to facilitate<br />

multilingual and thematic interoperability of Geographical Information Catalogues.<br />

In: Proceedings 5th AGILE Conference, Universitat de les Illes Balears, pp. 61–66<br />

(2002)<br />

16. Biström, J., Cogliati, A., Rouhiainen, K.: Post- WIMP User Interface Model for 3D<br />

Web Applications. Helsinki University of Technology Telecommunications Software<br />

and Multimedia Laboratory (2005)<br />

17. Houston, B., Jacobson, Z.: A Simple 3D Visual Text Retrieval Interface. In TRO-<br />

MP-050 - Multimedia Visualization of Massive Military Datasets. Workshop<br />

Proceedings (2002)<br />

18. GraphML Working Group: The GraphML file format.<br />

http://graphml.graphdrawing.org/<br />

19. Web 3D Consortium - Overnet. http://www.web3d.org<br />

20. Bonnel, N., Cotarmanac’h, A., Morin, A.: Meaning Metaphor for Visualizing Search<br />

Results. In: International Conference on Information Visualisation, IEEE Computer<br />

Society, pp. 467–472 (2005)<br />

21. Wiza, W., Walczak, K., Cellary, W.: AVE - Method for 3D Visualization of Search<br />

Results. In: 3rd International Conference on Web Engineering ICWE, Oviedo –<br />

Spain. Springer Verlag (2003)


Verification of Valency Frame Structures by Means of<br />

Automatic Context Clustering in RussNet<br />

Irina V. Azarova 1, Anna S. Marina 2, and Anna A. Sinopalnikova 3<br />

1 Department of Applied Linguistics, St-Petersburg State University, Universitetskaya nab.<br />

11, 199034 St-Petersburg, Russia.<br />

2 Department of Lexicography, Institute of Linguistic Studies, Tuchkov pereulok 9, 199053<br />

Saint-Petersburg, Russia.<br />

3 Brno University of Technology, Bozetechova 2, 61266 Brno, Czech Republic<br />

ivazarova@gmail.com, a_s_marina@rambler.ru, sino@fit.vutbr.cz<br />

Abstract. A central point of the RussNet technique is the specification of valency frames for synsets. Parameters of valency frames are employed for differentiating word meanings and synsets both in the thesaurus construction procedure and in automatic text analysis for word sense disambiguation. The valency description is computed on the basis of statistically stable context features in the text corpus: morphological, syntactic, and semantic. The paper discusses the automatic classification of verb contexts with unambiguous morphological annotation, the goal being to differentiate semantic types of verbs. The procedure relies on morphological tag distributions within a context window for verbs from different semantic trees of RussNet. The optimal width of the distribution window, an appropriate tag set, and clustering results are discussed. The procedure may be helpful at various stages of analysis, especially for valency frame verification in a semantic tree.<br />

1 Introduction<br />

The computer thesaurus RussNet1 developed at the Department of Applied Mathematical Linguistics of Saint-Petersburg State University inherited the main principles of the WordNet construction method [1]. RussNet is based on a corpus of modern texts (dated from 1985 up to the present) comprising 21 million words, the major part of which (60%) are articles on various topics from newspapers and magazines, covering the thematic diversity of common Russian [2].<br />

RussNet was not translated from the WordNet prototype; its construction involves some additional structural components oriented to its usage in automatic text analysis [3].<br />

The basic node in RussNet – the synset – may include several members (words or<br />

multiword expressions), which are ordered by their frequency of appearance in the<br />

corpus contexts in the particular sense described by the synset. This frequency is<br />

1 http://www.phil.pu.ru/depts/12/RN



measured in ipm (instances per million). In order to fix the frequency distribution of a polysemous word we use manual marking up of meanings (WMs) in the contexts of a random corpus sample. We investigated the necessary size of such a sample and found that a random sample of 100–150 contexts represents the same distribution of WMs as the whole set of contexts in the corpus. The possible error of the portion frequencies hardly exceeds 1% (which may, however, be crucial for scarce WMs). Our results coincide with previous investigations mentioned in [4], in which the marking up of WMs in samples was compared.<br />

The next issue for investigation was whether word meaning frequencies follow some particular type of distribution, since WM frequencies may be so similar that even the least distortion would be undesirable. We found that the most common distribution of WMs in the corpus (for polysemous words) is rather specific: in approximately 80% of cases such a word has a distinct first meaning, which is the most frequent and usually occurs in 50–70% of the contexts in the corpus. There are so-called low-frequency WMs (occurring in 1–3% of contexts), which cannot realistically be ordered according to their frequencies. The other WMs fill the range between the first and the low-frequency ones with decreasing frequencies.<br />
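This bookkeeping over a marked-up sample can be sketched in a few lines (the function name and the 3% cut-off for "low-frequency" WMs follow the text; everything else is our own illustration):

```python
from collections import Counter

def wm_distribution(marked_contexts, low_threshold=0.03):
    """Order word meanings (WMs) by their share in a marked-up sample.

    marked_contexts: list of sense labels, one per corpus context.
    Returns (list of (sense, share) ordered by decreasing share,
             set of low-frequency senses below low_threshold).
    """
    counts = Counter(marked_contexts)
    total = len(marked_contexts)
    shares = sorted(((s, n / total) for s, n in counts.items()),
                    key=lambda p: p[1], reverse=True)
    low = {s for s, share in shares if share < low_threshold}
    return shares, low

# A toy sample with a distinct first meaning, as in ~80% of polysemous words.
sample = ["go.1"] * 120 + ["go.2"] * 50 + ["go.3"] * 25 + ["go.4"] * 5
shares, low = wm_distribution(sample)
```

With this sample the first meaning covers 60% of contexts and only "go.4" falls into the low-frequency band.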

The procedure of marking up WMs in a corpus sample requires calculating the frequencies of grammatical context markers and listing the semantic classes of words in the suggested valency positions, in order to substantiate decisions concerning word sense differentiation. Stable context markers are included in valency frames, expanding the standard WordNet structure. Valency frame specification is described in detail below.<br />

2 Valency Frame Specification<br />

The idea of extending a computer lexicon with syntagmatic information is rather common in various WordNet dictionaries as well as traditional ones [5], [6], [7].<br />

In our project we compile the valency frame description on the basis of stable features in the marked-up contexts from random corpus samples [3]. The general parameters of the valency frame are:<br />

• its active or passive attribute, which reflects the syntactic position of the described synset as a head (dominant) or a daughter (subordinate) element;<br />

• the attribute of the syntactic construction – predicative or attributive – in which the valency frame primarily occurs.<br />

The valency frame may include several valencies with the following parameters:<br />

• its obligatory or optional attribute, which correlates with the frequency of valency occurrence in the corpus: prevailing (66% ≤ f ≤ 100%), rather stable (35% ≤ f ≤ 65%), less stable (f < 35%);<br />


Verification of Valency Frame Structures by Means of Automatic… 37<br />

• morphological and syntactic features, i.e. the frequent surface expression (e.g. a particular preposition for nouns, the aspect form for an infinitive, an adverb from a particular group, etc.).<br />
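The frequency bands for the obligatory/optional attribute can be encoded directly; a minimal sketch (the thresholds are those given above, the function name is ours):

```python
def valency_status(f):
    """Classify a valency by its occurrence frequency f (0..1) in the corpus,
    using the bands given in the text."""
    if f >= 0.66:
        return "prevailing"      # effectively obligatory
    if f >= 0.35:
        return "rather stable"   # optional but frequent
    return "less stable"         # optional
```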

The described structure is used for automatic disambiguation [3]: the parse of a phrase or sentence is mapped against the valency frames of the words in the construction. If a parse structure matches the obligatory valency frames, it is considered verified; otherwise, optional valency frames may indicate the preferred analysis.<br />
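The matching step can be sketched as follows, assuming each sense carries a frame listing obligatory and optional valencies as surface-marker labels (all names here are illustrative, not RussNet's actual inventory):

```python
def choose_sense(parsed_markers, frames):
    """Pick the sense whose valency frame best matches a parsed context.

    parsed_markers: set of surface markers found by the parser.
    frames: dict sense -> {"obligatory": set, "optional": set}.
    A sense whose obligatory valencies are all present is verified and
    strongly preferred; otherwise optional matches break the tie.
    """
    best, best_score = None, -1
    for sense, frame in frames.items():
        if frame["obligatory"] <= parsed_markers:
            # verified: all obligatory valencies are filled
            score = 100 + len(frame["obligatory"]) + len(frame["optional"] & parsed_markers)
        else:
            score = len(frame["optional"] & parsed_markers)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hypothetical frames for two senses of a verb.
frames = {
    "serve.1": {"obligatory": {"N_acc"}, "optional": {"N_dat"}},
    "serve.2": {"obligatory": {"prep_v", "N_loc"}, "optional": set()},
}
```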

An open question is whether valency frame parameters are inherited in troponymic verbal trees, and if so, in what form and for which parameters. We carried out preliminary research based on three semantic verb trees in RussNet [8] and, though we did not receive unequivocal confirmation of the inheritance scheme, it may be stated that the context features of most verbs from a particular semantic tree share a lot of statistically stable parameters.<br />

In order to prove this stability we investigated the automatic clustering of verb<br />

contexts as representatives of WMs in RussNet. This research is presented below.<br />

3 Motivation: How to Use Morphological Markup of the Text for Sense Disambiguation<br />

As a starting point for our research we chose the approach of [4]. In that work a disambiguation procedure for the verb serve was described, using three distributions: main POS tags, additional markers (e.g. punctuation marks, prepositions, etc.), and lexical items. The investigation demonstrated that POS tags and some other features make it possible to differentiate meanings of this polysemous verb reliably (80–83%). The results depend on the width of the analysis window and on a substantial amount of contexts in the training set.<br />

Comparing their approach with similar ones, the authors drew the following conclusions:<br />

1. initial processing of the text (e.g. into syntactically connected fragments) does not affect the results crucially;<br />

2. it was unachievable to differentiate low-frequency WMs, because it was hardly possible to compile a quality training set for them;<br />

3. it was easier to differentiate homonyms (or contrasting WMs) than similar WMs;<br />

4. a huge training set improves the results, but not to the same extent as the processing time for preparing it increases.<br />

3.1 Preliminary Results: the POS Tag Distribution for Different Semantic Verb<br />

Groups<br />

The research cited above may be valid only for languages with a fixed word order and hardly applicable to languages with freer word order. To begin with, we decided to carry out a pilot check, comparing the distributions of verbs from two groups: verbs of movement (идти ‘to go’, пойти ‘to start walking’, выйти ‘to go out’, вернуться ‘to return’, ходить ‘to walk’) and communication verbs (сказать ‘to say’, говорить ‘to speak’, спросить ‘to ask’, ответить ‘to answer’, просить ‘to



ask for’). We selected 200 contexts from the corpus per verb in its first meaning. The width of the analysis window was chosen as [-6…+6], with the zero position being the key verb form and the other positions occupied by neighbouring words and punctuation marks.<br />

Fig. 1. A correlated distribution of frequencies for the preposition tag<br />

in the contexts of travel verbs.<br />

We compared 3 distributions for: (1) punctuation marks; (2) main POS: nouns, adjectives, verbs, adverbs; (3) closed classes and syntactic POS: pronouns & numerals, conjunctions, prepositions, particles. After the distributions were calculated, some tags appeared too scarce (< 5% of occurrences) and were eliminated from further investigation.<br />

Fig. 2. An uncorrelated distribution of frequencies for the adjective tag<br />

in the contexts of communication verbs.<br />

The distribution for a tag was presented as a graph defined over the analysis window i = [-6…+6], fr_i showing the overall occurrence of the described tag in the i-th position over all training contexts for a particular verb. The average for the group was calculated. The graphs show the correlation of tag distributions in groups, or its absence. It appeared that some tags have a specific distribution throughout the whole window: nouns, verbs, pronouns, commas, quotation marks, colons and dashes for communication verbs; nouns, adverbs, pronouns, conjunctions, prepositions and commas for travel verbs. Some tags have a particular distribution only in the right



part of the context (e.g. conjunctions for communication verbs) or only in the left context (e.g. prepositions for communication verbs). Fig. 1 shows the correlated distribution of prepositions in contexts of verbs of movement; Fig. 2 shows the uncorrelated distribution of adjectives for verbs of communication.<br />

The distribution data showed that it is feasible to use them for automatically differentiating some groups of verbs; however, the width of the distribution analysis window and the appropriate tag set should be examined in detail.<br />

4 The Optimal Width of the Distribution Window<br />

The research reported in this paper involves 51 frequent Russian verbs from 21 semantic groups taken from [9]. Each verb was represented by 200 contexts chosen at random from our corpus and unambiguously marked up with morphological tags. At first we chose the maximal window of [-10…+10] positions and a tag set consisting of the POS marker plus a case specification for substantives and the aspect value for verbs (e.g. Nnom, Aloc, Vperf, etc.). Punctuation marks are represented by a single tag PM. Tag distributions were calculated for all positions of the maximum window over the contexts of each verb, so each position was represented by a vector of tag frequencies. If some tag does not occur in the i-th position, its frequency is zero.<br />
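Building these per-position tag-frequency vectors might look like this (a sketch; the tag names follow the examples in the text, the context encoding is our own assumption):

```python
from collections import Counter

def position_vectors(contexts, width=10):
    """Build a tag-frequency vector for every window position.

    contexts: list of tagged contexts; each context is a dict mapping a
    relative position (-width..width, 0 = the key verb form) to a tag.
    Returns {position: Counter(tag -> frequency)}; absent tags count as zero.
    """
    vectors = {i: Counter() for i in range(-width, width + 1) if i != 0}
    for ctx in contexts:
        for pos, tag in ctx.items():
            if pos in vectors:
                vectors[pos][tag] += 1
    return vectors

# Two toy contexts of one verb, tagged with the TS described in the text.
contexts = [
    {-1: "Nnom", 1: "PM", 2: "Vperf"},
    {-1: "Nnom", 1: "Ngen"},
]
vecs = position_vectors(contexts, width=2)
```

Because `Counter` returns 0 for missing keys, tags that never occur in a position automatically get zero frequency, as the text requires.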

Distributions were compared according to the vector model [10], [11] using cosine similarity. For the i-th position in the window, the similarity of distributions a and b is equal to:<br />

sim(a_i, b_i) = Σ_{j=1..N} a_ij × b_ij / ( √(Σ_{j=1..N} a_ij²) × √(Σ_{j=1..N} b_ij²) )    (1)<br />

In Tab. 1 a fragment of the positional similarity matrix for verbs is shown; similarity is measured in per cent.<br />

Table 1. A fragment of the positional similarity matrix (%).<br />

Verb1           | Verb2                | all | -10 | -5 | -2 | -1 |  1 |  2 |  7 |  8<br />
брать 'to take' | мочь 'to be able'    |  81 |  95 | 93 | 88 | 57 | 14 | 85 | 94 | 96<br />
                | хотеть 'to wish'     |  84 |  97 | 94 | 90 | 63 | 39 | 88 | 94 | 96<br />
                | идти 'to go'         |  88 |  96 | 89 | 93 | 90 | 58 | 91 | 92 | 97<br />
                | иметь 'to have'      |  91 |  93 | 95 | 94 | 87 | 83 | 85 | 91 | 94<br />
                | казаться 'to appear' |  82 |  93 | 84 | 89 | 86 | 45 | 65 | 92 | 89<br />
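Formula (1) is plain cosine similarity over the tag-frequency vectors of one window position; a direct transcription (variable and function names are ours):

```python
from math import sqrt

def cosine_position_sim(a_i, b_i):
    """Cosine similarity (formula 1) between two tag-frequency vectors
    a_i, b_i for the same window position, given as dicts tag -> frequency."""
    tags = set(a_i) | set(b_i)
    dot = sum(a_i.get(t, 0) * b_i.get(t, 0) for t in tags)
    norm_a = sqrt(sum(v * v for v in a_i.values()))
    norm_b = sqrt(sum(v * v for v in b_i.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical distributions give similarity 1, distributions with no shared tags give 0; multiplying by 100 yields the percentages shown in Tab. 1.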



It is easily seen that the “distant” positions of the distributions look very similar,<br />

thus they are non-specific for any verb group.<br />

Fig. 3. The stemma of verb clustering in the [-10]th position of the distribution window.<br />

In order to visualise the results of similarity measurement we used automatic cluster analysis [12]. We represent the results of clustering as stemmas (see Fig. 3–5), which reflect the grouping in reverse order: the leaves are the “closest” verbs, and the whole verb group is shown at the root node.<br />

Fig. 4. The stemma of verb clustering in the [+1]st position of the distribution window.



It is possible to assess the clustering quality so as to choose the optimal window width (and other parameters). We consider a clustering interpretable if verbs from the same semantic tree are put together in one cluster, and inexplicable if many representatives of different semantic groups are mixed in one cluster. If there were no morphological correlations among the contexts of different verbs with similar or related meanings, we would never have received any explicable groupings.<br />

The results of [4] show that a range of positions in the window has a cumulative effect. So we compared verb distributions per position and per position range, and received the expected result: there was no single position in the window [-10…+10] that was sufficient by itself to produce reliable clustering. Figs. 3, 4 and 5 show stemmas for clustering in the [-10] position, the [+1] position and the best range [-3,+5]. A grey background marks verbs from the same semantic group; the numbers at the top nodes show the step of clustering and the average similarity.<br />
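A naive average-link agglomeration over a precomputed similarity matrix is enough to reproduce such a merge history (this is our own minimal stand-in, not the actual method of [12]):

```python
def agglomerate(items, sim):
    """Naive average-link agglomerative clustering.

    items: list of labels; sim: dict with sim[(x, y)] = similarity, x != y.
    Returns the merge history as (step, cluster_a, cluster_b) tuples,
    i.e. the stemma read from the leaves towards the root.
    """
    def s(x, y):
        return sim.get((x, y), sim.get((y, x), 0.0))

    clusters = [frozenset([it]) for it in items]
    history, step = [], 0
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise similarity between the two clusters
                avg = (sum(s(x, y) for x in clusters[i] for y in clusters[j])
                       / (len(clusters[i]) * len(clusters[j])))
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        step += 1
        history.append((step, clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Toy similarities for three verbs; the closest pair merges first.
sims = {("брать", "взять"): 0.95, ("брать", "идти"): 0.60, ("взять", "идти"): 0.55}
history = agglomerate(["брать", "взять", "идти"], sims)
```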

5 An Optimal Tag Set for Distribution Capture<br />

The tag set (TS) is another key point of the distribution description. The first variant of the tag set showed interpretable results; however, it was important to find out to what extent the tag set may bias the clustering. We tried 3 tag sets: the 1st TS was described above; the 2nd TS was simple POS tagging without specification of grammatical category values (e.g. N, A, Adv, V, Pron, etc.); the 3rd TS was a kind of POS tag generalisation: all substantives were united, with only their case specification retained (Nom, Gen, Dat, Acc, Abl, Loc, plus V, Adv, etc.). The comparison of clustering parameters shows that clustering with the 2nd TS produces a flatter structure without elaboration of the inner structure of the groups, while clustering with the 3rd TS creates a more detailed structure than the 1st TS.<br />
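The three tag sets amount to three projections of a full morphological description; a rough sketch (our own encoding, not the project's actual tagger output):

```python
def project_tag(pos, case=None, aspect=None, ts=1):
    """Project a morphological description onto one of the three tag sets.

    ts=1: POS plus case for substantives and aspect for verbs (Nnom, Vperf, ...);
    ts=2: bare POS (N, A, V, ...);
    ts=3: substantives collapse to their case alone (Nom, Gen, ...), others keep POS.
    """
    substantive = pos in ("N", "A", "Pron", "Num")
    if ts == 1:
        if substantive and case:
            return pos + case.lower()
        if pos == "V" and aspect:
            return pos + aspect.lower()
        return pos
    if ts == 2:
        return pos
    if ts == 3:
        return case if substantive and case else pos
    raise ValueError("ts must be 1, 2 or 3")
```

The same token thus yields Nnom under TS1, N under TS2 and Nom under TS3, which is exactly where the flatter (TS2) versus more detailed (TS3) cluster structures come from.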

Fig. 5. The stemma of verb clustering in the [-3,+5] position range<br />

of the distribution window with the 1st TS



6 The Structure of Clusters<br />

Clustering the verbs with the help of the morphological tagging of contexts, the distribution window [-3,+5] and the 1st or 3rd TS allows us to differentiate 13 groups, uniting 70% of the verbs. Some groups are very close to the WordNet classification into semantic trees: verbs of communication, cognition, stative, motion, emotion, possession, contact, modal, creation, perception. Moreover, we tried to use the united morphological distributions as cluster centres and received an interesting structure for further cluster structuring. For example, communication and cognition verbs are very close and are united at a very early stage of clustering, and the same holds for possession and contact verbs. It is interesting that aspectual verb pairs are united in one step, though we reported in our paper [8] that they very often have different meanings.<br />

1. сказать ‘to say’, ответить ‘to answer’, спросить ‘to ask’;<br />

2. понимать ‘to understand, imperf.’, знать ‘to know’, понять ‘to understand,<br />

perf.’, помнить ‘to remember’, думать ‘to think’;<br />

3. сидеть ‘to sit’, лежать ‘to lie’, стоять ‘to stand’;<br />

4. взять ‘to take, perf.’, брать ‘to take, imperf.’, получить ‘to receive’, иметь ‘to<br />

have’;<br />

5. идти ‘to go’, ехать ‘to ride/drive’, пойти ‘to start walking’;<br />

6. ненавидеть ‘to hate’, любить ‘to love’, чувствовать ‘to feel’;<br />

7. бросить ‘to throw’, послать ‘to send’;<br />

8. мочь ‘can/to be able’, успеть ‘to manage/succeed’, хотеть ‘want/to wish’;<br />

9. делать ‘to do, imperf.’, сделать ‘to do, perf.’;<br />

10. видеть ‘to see, imperf.’, увидеть ‘to see, perf.’;<br />

11. жить ‘to live’, работать ‘to work’;<br />

12. дать ‘to give, perf.’, давать ‘to give, imperf.’;<br />

13. остаться ‘to stay’, оказаться ‘to be found’.<br />

7 Conclusion and Future Work<br />

The reported research shows that verbs from different semantic groups can very probably be differentiated with the help of morphological tagging. This procedure may be helpful at the preliminary stages of corpus context processing, at the stage of valency frame verification for a semantic tree, and during automatic text processing. For these different purposes it is essential to formulate a similarity measure between a particular context and the cluster patterns.<br />

In order to understand why some verbs were not clustered according to their semantic type, we examined these cases and discovered that the failure was primarily connected with the random character of the training set. It was supposed that, due to the above-mentioned prevalence of first WMs in the corpus contexts, we would receive a representation of each semantic class with low noise. But in cases of contrasting meanings, when verb meanings behave rather as homonyms than as similar WMs, we received a mixed distribution. After refining some of the training sets, we saw appropriate clustering results.



Another prospect for this approach is the hypothesis that the other main POS may be processed in the same manner. We are now investigating similar distributions for nouns.<br />

References<br />

1. Fellbaum, C. (ed.): WordNet. An Electronic Lexical Database. The MIT Press (1998)<br />

2. Azarova, I.V., Sinopalnikova, A.A.: Adjectives in RussNet. In: Proceedings of the Second<br />

Global WordNet Conference, pp. 251–258. Brno, Czech Republic (2004)<br />

3. Azarova, I.V., Ivanov, V.L., Ovchinnikova, E.A., Sinopalnikova, A.A.: RussNet as a<br />

Semantic Component of the Text Analyser for Russian. In: Proceedings of the Third<br />

International WordNet Conference, pp. 19–27. Brno, Czech Republic (2006)<br />

4. Leacock, C., Chodorow, M.: Combining Local Context and WordNet Similarity for Word<br />

Sense Identification. In: C. Fellbaum (ed.) WordNet: An Electronic Lexical Database, pp.<br />

265–283. MIT Press (1998)<br />

5. Stranakova-Lopatkova, M., Zabokrtsky, Z.: Valency Dictionary of Czech Verbs: Complex<br />

Tectogrammatical Annotation. In: Proceedings of LREC-2002, pp. 949–956. Las Palmas,<br />

Spain (2002)<br />

6. Agirre, E., Martinez, D.: Integrating Selectional Preferences in WordNet. In: Proceedings of the GWC-2002. Mysore, India (2002)<br />

7. Bentivogli, L., Pianta, E.: Extending WordNet with Syntagmatic Information. In: Proceeding<br />

of the 2nd Global WordNet Conference, pp. 47–53. Brno, Czech Republic (2004)<br />

8. Azarova, I.V., Ivanov, V.L., Ovchinnikova, E.A.: RussNet Valency Frame Inheritance in<br />

Automatic Text Processing. In: Proceedings of the Dialog-2005, pp. 18–25. Moscow (2005)<br />

9. Babenko, L.G. (ed.): Ideographic dictionary of Russian verbs. “AST-PRESS”, Moscow<br />

(1999)<br />

10. Voorhees, E.M.: Using WordNet for Text Retrieval. In: C. Fellbaum (ed.) WordNet: An<br />

Electronic Lexical Database, pp. 285–303. MIT Press (1998)<br />

11. Pantel, P., Lin, D.: Word-for-Word Glossing with Contextually Similar Words. In: Human<br />

Language Technology Conference of the North American Chapter of the Association for<br />

Computational Linguistics (2003)<br />

12. Alexeev, A.A., Kuznetsova, E.L.: EVM i problema tekstologii drevneslavjanskikh tekstov.<br />

In: Linguisticheskije zadachi i obrabotka dannykh na EVM. Moscow (1987)


Some Issues in the Construction<br />

of a Russian WordNet Grid<br />

Valentina Balkova 2 , Andrey Sukhonogov 1 , and Sergey Yablonsky 1,2<br />

1 Petersburg Transport University, Information Systems Department, Moscow av., 9,<br />

St.-Petersburg, 190031, Russia<br />

2 Russicon Company, Kazanskaya str., 56, ap. 2, 190000, Russia<br />

v_balk@front.ru, asukhonogov@rambler.ru, serge_yablonsky@hotmail.com<br />

Abstract. This paper deals with the development of the Russian WordNet Grid. It describes the usage of Russian and English-Russian lexical language resources and software to produce the Russian WordNet Grid, and the design of an XML markup of the grid resources. Relevant aspects of the DTD/XML format and related technologies are surveyed.<br />

1 Introduction<br />

The Semantic Web aims to add a machine-tractable layer to complement the existing web of natural language hypertext. In order to realise this vision, the creation of semantic annotation, the linking of web pages to ontologies, and the creation, evolution and interrelation of ontologies must become automatic or semi-automatic processes.<br />

Computational lexicons (CL) provide machine-understandable word knowledge, which is important for turning the WWW into a machine-understandable knowledge base, the Semantic Web. CLs supply an explicit representation of word meaning, with word content accessible to computational agents. Word meaning in a CL is linked to word syntax and morphology and has multilingual lexical links.<br />

Computational lexicons are key components of HLT and usually have the following typology:<br />

• monolingual vs. multilingual;<br />

• general purpose vs. domain (application) specific;<br />

• content type (morpho-syntactic, semantic, mixed, terminological).<br />

Today the following types of CL are designed:<br />

• network based (hierarchy/taxonomy ─ WordNet [1, 2, 10]; heterarchy ─ EuroWordNet [3]);<br />

• frame based (Mikrokosmos, FrameNet);<br />

• hybrid (SIMPLE).<br />

The application of WordNet to different tasks on the Semantic Web requires a representation of WordNet in RDF and/or OWL [4-6]. There are several conversions available (from WordNet’s Prolog format to RDF/OWL), which differ in design choices and scope. It is expected that the demand for WordNet in RDF/OWL will



grow in the coming years, along with the growing number of Semantic Web<br />

applications.<br />

The WordNet Task Force of the W3C’s Semantic Web Best Practices Working Group aims at providing a standard conversion of WordNet. Two main motivations support the development of a standard conversion:<br />

• development through the W3C Working Group process results in a peer-reviewed conversion based on the consensus of the participating experts; the resulting standard provides application developers with a resource that has the desired level of quality for most common purposes;<br />

• a standard improves interoperability between applications and multilingual lexical data [10].<br />

Semi-automatic integration and enrichment of large-scale multilingual lexicons like WordNet is used in many computer applications. Linking concepts across the many lexicons belonging to the WordNet family started with the Interlingual Index (ILI). Unfortunately, no version of the ILI can be considered a standard, and the various lexicons often exploit different versions of WordNet as the ILI.<br />

At the 3rd GWA Conference in Korea the idea was launched to start building a WordNet Grid around a set of Common Base Concepts expressed in terms of WordNet synsets and SUMO definitions (http://www.globalwordnet.org/gwa/gwa_grid.htm). The first version of the Grid was planned to be built around the set of 4689 Common Base Concepts. Since then, only three languages with essentially different numbers of synsets and different WordNet versions have been placed in the Grid mappings (English – 4689 synsets with a WN 2.0 mapping, Spanish – 15556 synsets with a WN 1.6 mapping, and Catalan – 12942 synsets with a WN 1.6 mapping). But there is as yet no official format for the Global WordNet Grid; so far there are only 3 files in the specified format. As an alternative, another possible solution is to use the DTD from the Arabic WordNet: http://www.globalwordnet.org/AWN/DataSpec.html.<br />

This paper deals with the development of the Russian WordNet Grid. It describes the usage of Russian and English-Russian lexical language resources and software to produce the WordNet Grid for the Russian language (4600 synsets with a WN 2.0 mapping), and the design of an XML/RDF/OWL markup of the grid resources. Relevant aspects of the DTD/XML/RDF/OWL formats and related technologies are surveyed.<br />

2 Conceptual model<br />

The three core concepts in WordNet are the synset, the word sense and the word. Words are the basic lexical units, while a word sense is a specific sense in which a specific word is used. Synsets group word senses with a synonymous meaning, such as {car, auto, automobile, machine, motorcar} or {car, railcar, railway car, railroad car}. There are four disjoint types of synset, containing exclusively nouns, verbs, adjectives or adverbs; there is one specific subtype of adjective, namely the adjective satellite. Furthermore, WordNet defines seventeen relations, of which:<br />

• ten hold between synsets (hyponymy, entailment, similarity, member meronymy, substance meronymy, part meronymy, classification, cause, verb grouping, attribute);<br />



• five between word senses (derivational relatedness, antonymy, see also, participle,<br />

pertains to);<br />

• “gloss” (between a synset and a sentence);<br />

• “frame” (between a synset and a verb construction pattern).<br />
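The conceptual model above can be captured in a few lines; a minimal sketch (class and field names are ours, not from any official WordNet schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Word:
    lemma: str                     # the basic lexical unit

@dataclass(frozen=True)
class WordSense:
    word: Word
    sense_no: int                  # a specific sense of a specific word

@dataclass
class Synset:
    pos: str                       # "n", "v", "a", "r" (or "s" for adjective satellites)
    senses: tuple                  # WordSense members grouped as synonyms
    gloss: str = ""                # the "gloss" relation target
    relations: list = field(default_factory=list)  # (relation name, target Synset)

vehicle_syn = Synset("n", (WordSense(Word("vehicle"), 1),), "a conveyance")
car_syn = Synset("n", (WordSense(Word("car"), 1), WordSense(Word("auto"), 1)),
                 "a motor vehicle with four wheels")
car_syn.relations.append(("hyponymOf", vehicle_syn))
```

Synset-level relations (hyponymy, meronymy, ...) attach to `Synset`, while sense-level relations (antonymy, participle, ...) would attach to `WordSense` in the same way.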

Table 1 summarizes the set of relations in the different WordNet realizations, where S stands for any synset, N for a noun synset, V for a verb synset, A for an adjective synset, R for an adverb synset, WS for any word sense, NS for a noun sense, VS for a verb sense, AS for an adjective sense, and RS for an adverb sense.<br />

3 XML structure of the Russian WordNet Grid<br />

Several Russian lexical resources were used for the development of the Russian WordNet Grid and the test version of the English-Russian WordNet [7]. We have ported the original English and Russian WordNets into XML using the DTD for the XML structure from http://www.globalwordnet.org/gwa/gwa_grid.htm and the DTD from the Arabic WordNet: http://www.globalwordnet.org/AWN/DataSpec.html.<br />

The standard DTD for the Russian grid XML structure and the English/Russian XML format for the Grid are shown in Figs. 1 and 2. The grid of English and Russian local WordNets is realized as a virtual repository of XML databases accessible through web services. Basic services are devoted to the management of the current versions of the Princeton and Russian WordNets.<br />

Unfortunately, no version of the grid can be considered a standard, because the various grids exploit different versions of WordNet, have different numbers of entries, and there are no mappings of the multilingual grids onto new versions of WordNet.<br />
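Since no official grid DTD is fixed, the element and attribute names below are illustrative only; a sketch of emitting one grid entry with the standard library:

```python
import xml.etree.ElementTree as ET

def synset_to_xml(synset_id, pos, literals, gloss):
    """Serialize one grid entry as XML. Element/attribute names here are
    invented for illustration, not the (still unofficial) Grid DTD."""
    syn = ET.Element("SYNSET", id=synset_id, pos=pos)
    synonym = ET.SubElement(syn, "SYNONYM")
    for lit in literals:
        ET.SubElement(synonym, "LITERAL").text = lit
    ET.SubElement(syn, "DEF").text = gloss
    return ET.tostring(syn, encoding="unicode")

xml = synset_to_xml("rwn-00001-n", "n", ["машина", "автомобиль"],
                    "колёсное транспортное средство")
```

Swapping in the element names of whichever DTD is finally adopted (the GWA grid DTD or the Arabic WordNet one) only changes the string constants, not the structure of the converter.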

4 RDF/OWL structure of the Russian WordNet Grid<br />

The WordNet Task Force [6] developed a new approach to WordNet RDF conversion, building on three previous WordNet conversions [2-5]. The W3C WordNet project is still in the process of being completed at the level of schema and data (http://www.w3.org/2001/sw/BestPractices/WNET/wnconversion.html).<br />

We have ported the original English and Russian WordNet Grid into RDF (Resource Description Framework) and OWL (Web Ontology Language). All specific Russian WordNet classes/properties (Table 2) are defined in a separate namespace, rwn (Princeton WordNet uses the wn namespace).<br />

There are still open issues: how to support different versions of WordNet in XML/RDF/OWL, how to define the relationships between them, and how to integrate WordNet with sources in other languages. Although the TF did not focus on solving this problem, as it is out of scope, we have tried to take it into account in our design, e.g. by making Words separate entities with their own URIs. This allows them to be referenced directly and related to structures representing words in other RDF/OWL sources.
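To keep the sketch dependency-free, here is a hand-rolled N-Triples emitter illustrating the separate rwn namespace idea (the namespace URIs and property names are invented placeholders, not the project's actual vocabulary):

```python
RWN = "http://example.org/rwn#"   # placeholder for the rwn namespace
WN = "http://example.org/wn#"     # placeholder for the Princeton wn namespace

def triple(s, p, o, literal=False):
    """Format one N-Triples line; o is a URI unless literal=True."""
    obj = '"%s"' % o if literal else "<%s>" % o
    return "<%s> <%s> %s ." % (s, p, obj)

def synset_triples(rwn_id, wn_id, words):
    """Link a Russian synset to its Princeton counterpart and list its words.
    Property names (sameConceptAs, containsWord) are illustrative only."""
    lines = [triple(RWN + rwn_id, RWN + "sameConceptAs", WN + wn_id)]
    for w in words:
        lines.append(triple(RWN + rwn_id, RWN + "containsWord", w, literal=True))
    return "\n".join(lines)

nt = synset_triples("synset-n-1", "synset-car-noun-1", ["машина", "автомобиль"])
```

Because every word and synset gets its own URI, a triple in any other RDF/OWL source can point at it directly, which is exactly the referencing ability discussed above.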


Table 1. Relations between synsets in the different WordNet realizations (Russian WordNet, Princeton WordNet and EuroWordNet)<br />

N   | Relation    | POS     | Russian WordNet      | Princeton WordNet | EuroWordNet<br />
1.  | Hyponymy    | N->N    | hasHyponym           | ~  | HAS_HYPONYM<br />
    |             | N->N    | hyponymOf            | @  | HAS_HYPERONYM<br />
2.  | Troponymy   | V->V    | troponymOf           | @  | HAS_HYPERONYM<br />
    |             | V->V    | hasTroponym          | ~  | HAS_HYPONYM<br />
3.  | Meronymy    | N->N    | hasMeronym           |    | HAS_MERONYM<br />
    |             | N->N    | hasMemberMeronym     | #m | HAS_MERO_MEMBER<br />
    |             | N->N    | hasSubstanceMeronym  | #s | HAS_MERO_PORTION<br />
    |             | N->N    | hasPartMeronym       | #p | HAS_MERO_PART<br />
    |             | N->N    | meronymOf            |    | HAS_HOLONYM<br />
    |             | N->N    | memberMeronymOf      | %m | HAS_HOLO_MEMBER<br />
    |             | N->N    | substanceMeronymOf   | %s | HAS_HOLO_PORTION<br />
    |             | N->N    | partMeronymOf        | %p | HAS_HOLO_PART<br />
4.  | Attribute   | N->A    | attribute            | =  | xpos_hyponym<br />
    |             | A->N    | valueOf              | =  | xpos_hyponym<br />
5.  | Derivation  | S<->S   | relatedForm          | +  |<br />
6.  | DomainLabel | S->S    | domainCategory       | ;c |<br />
    |             | S->S    | domainCategoryMember | -c |<br />
    |             | S->S    | domainRegion         | ;r |<br />
    |             | S->S    | domainRegionMember   | -r |<br />
    |             | S->S    | domainUsage          | ;u |<br />
    |             | S->S    | domainUsageMember    | -u |<br />
7.  | Antonymy    | S<->S   | nearAntonym          |    | NEAR_ANTONYM<br />
    |             | WS<->WS | antonym              | !  | ANTONYM<br />
8.  | VerbGroup   | V<->V   | sameGroupAs          | $  |<br />
9.  | Entailment  | V->V    | isSubeventOf         | *  | IS_SUBEVENT_OF<br />
    |             | V->V    | hasSubevent          |    | HAS_SUBEVENT<br />
10. | Causation   | V->V    | causes               | >  | CAUSES<br />
    |             | V->V    | isCausedBy           |    | IS_CAUSED_BY<br />
11. | AlsoSee     | WS<->WS | seeAlso              | ^  |<br />
12. | Derived     | WS->WS  | isDerivedFrom        | \  | IS_DERIVED_FROM<br />
    |             | WS->WS  | hasDerived           |    | HAS_DERIVED<br />
13. | SimilarTo   | A<->A   | similarTo            | &  |<br />
14. | Participle  | WS->WS  | participleOf         | <  |<br />
    |             | WS->WS  | hasParticiple        |    |<br />


Fig. 1. DTD for the Russian grid


Fig. 2. [English-]Russian grid XML markup

Table 3. Specific Russian WordNet classes/properties

№    Class       Property                                       Comments
1.   Word        &wnr;vowelPosition (&rdfs;Literal)             Position of the stress for every lemma in Russian WordNet.
2.   Word        &wnr;paradigmID (&xsd;nonNegativeInteger)      Lemma's paradigm number. One lemma in general has many paradigms.
3.   WordSense   &wnr;glossaryWord (&rdfs;Literal)              Russian WordNet has glossaries for every word.
4.   WordSense   &wnr;senseNumber (&xsd;nonNegativeInteger)
5.   WordSense   &wnr;synsetPosition (&xsd;nonNegativeInteger)
6.   WordSense   &wnr;styleMark (&rdfs;Literal)
7.   WordSense   &wnr;isDominant (&rdfs;Literal)                Dominant property.
8.   WordSense   &wnr;hasIdiom (#WordSense/#Idiom)
9.   Idiom       &wnr;idiom (&rdfs;Literal)
10.  Idiom       &wnr;idiomDefinition (&rdfs;Literal)


Table 4. Equivalent classes

№    W3C RDFS          Russian WordNet OWL (equivalentClass)
1.   SynSet            Synset
2.   NounSynSet        Noun
3.   VerbSynSet        Verb
4.   AdjectiveSynSet   Adjective
5.   AdverbSynSet      Adverb

Table 5. Russian WordNet OWL

№    Class/property        Data type
1.   Synset                owl:Class
2.   index                 owl:ObjectProperty (#Synset/&rdfs;Literal)
3.   glossaryEntry         owl:ObjectProperty (#Synset/&rdfs;Literal)
4.   exampleSentences      owl:ObjectProperty (#Synset/&rdfs;Literal)
5.   hyponymOf             owl:TransitiveProperty (#Synset/#Synset)
6.   hasHyponym            owl:TransitiveProperty (#Synset/#Synset)
7.   nearAntonym           owl:SymmetricProperty (#Synset/#Synset)
8.   seeAlso               owl:SymmetricProperty (#WordSense/#WordSense)
9.   relatedForm           owl:ObjectProperty (#Synset/#Synset)
10.  Noun                  owl:Class
11.  Verb                  owl:Class
12.  Adjective             owl:Class
13.  Adverb                owl:Class
14.  AdjectiveSatellite    owl:Class
15.  meronymOf             owl:ObjectProperty (#Noun/#Noun)
16.  hasMeronym            owl:ObjectProperty (#Noun/#Noun)
17.  memberMeronymOf       owl:ObjectProperty (#Noun/#Noun)
18.  hasMemberMeronym      owl:ObjectProperty (#Noun/#Noun)
19.  substanceMeronymOf    owl:ObjectProperty (#Noun/#Noun)
20.  hasSubstanceMeronym   owl:ObjectProperty (#Noun/#Noun)
21.  partMeronymOf         owl:ObjectProperty (#Noun/#Noun)
22.  hasPartMeronym        owl:ObjectProperty (#Noun/#Noun)
23.  isCausedBy            owl:ObjectProperty (#Verb/#Verb)
24.  causes                owl:ObjectProperty (#Verb/#Verb)
25.  sameGroupAs           owl:SymmetricProperty (#Verb/#Verb)
26.  isDerivedFrom         owl:ObjectProperty (#WordSense/#WordSense)
27.  hasDerived            owl:ObjectProperty (#WordSense/#WordSense)
28.  isSubeventOf          owl:TransitiveProperty (#Verb/#Verb)
29.  hasSubevent           owl:TransitiveProperty (#Verb/#Verb)
30.  similarTo             owl:SymmetricProperty (#Adjective/#Adjective)
31.  attribute             owl:ObjectProperty (#Noun/#Adjective)
32.  valueOf               owl:ObjectProperty (#Adjective/#Noun)
33.  domainUsage           owl:ObjectProperty (#Synset/#Synset)
34.  domainUsageMember     owl:ObjectProperty (#Synset/#Synset)
35.  domainCategory        owl:ObjectProperty (#Synset/#Synset)
36.  domainCategoryMember  owl:ObjectProperty (#Synset/#Synset)
37.  domainRegion          owl:ObjectProperty (#Synset/#Synset)
38.  domainRegionMember    owl:ObjectProperty (#Synset/#Synset)
39.  WordSense             owl:Class
40.  inSynSet              owl:ObjectProperty (#WordSense/#Synset)
41.  containsWordSense     owl:ObjectProperty (#Synset/#WordSense)
42.  Word                  owl:Class
43.  senseOf               owl:ObjectProperty (#WordSense/#Word)
44.  hasSense              owl:ObjectProperty (#Word/#WordSense)
45.  frequency             owl:ObjectProperty (#WordSense/&xsd;double)
46.  lemma                 owl:ObjectProperty (#Word/&rdfs;Literal)
47.  senseKey              owl:ObjectProperty (#WordSense/&rdfs;Literal)
48.  participleOf          owl:ObjectProperty (#WordSense/#WordSense)
49.  hasParticiple         owl:ObjectProperty (#WordSense/#WordSense)
50.  antonym               owl:SymmetricProperty (#WordSense/#WordSense)
51.  TopOntology           owl:Class
52.  hasItem               owl:ObjectProperty (#TopOntology/#Synset)
53.  index                 owl:ObjectProperty (#TopOntology/&rdfs;Literal)
54.  name                  owl:ObjectProperty (#TopOntology/&rdfs;Literal)
55.  broaderItem           owl:ObjectProperty (#TopOntology/#TopOntology)
56.  narrowerItem          owl:ObjectProperty (#TopOntology/#TopOntology)
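Several properties in Table 5 are declared owl:TransitiveProperty. What that declaration licenses can be sketched by hand: a reasoner may add a hyponymOf link from A to C whenever links from A to B and B to C are asserted. The synset names below are invented for illustration.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # If a -> b and b -> d hold, transitivity licenses a -> d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Invented hyponymOf assertions forming a chain of synsets.
hyponym_of = {
    ("sparrow-n-1", "bird-n-1"),
    ("bird-n-1", "animal-n-1"),
    ("animal-n-1", "organism-n-1"),
}
inferred = transitive_closure(hyponym_of)
```

After closure, ("sparrow-n-1", "organism-n-1") is entailed even though it was never asserted; an OWL reasoner would derive the same triples from the owl:TransitiveProperty declaration.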

5 Developing and Managing the WordNet Semantic Web Models<br />

For managing WordNet Semantic Web models we use the Multilingual WordNet Editor [8] together with XMLSpy 2007 and Oracle 10g/11g, which provide the XML/RDF/OWL support needed for the data modeling and editing of XML/RDF/OWL WordNet models.

Fig. 3. XML/RDF/OWL WordNet in an Oracle 11g database

Oracle Database 11g incorporates native RDF/RDFS/OWL support, enabling WordNet applications to benefit from a scalable, secure, integrated, and efficient platform for semantic data management. Ontological datasets containing hundreds of millions of data items and relationships can be stored in groups of three, or "triples", using the RDF data model. Oracle Database 11g enables such repositories to scale into the billions of triples, thereby meeting the needs of the most demanding WordNet applications. Managing semantic data models within Oracle Database 11g offers significant benefits over file-based or specialty-database approaches:

• Low Cost of Ownership: Semantic applications can be combined with other<br />

applications and deployed on a corporate level with data stored centrally, lowering<br />

ownership costs. Beyond the advantage of central data storage and query, service<br />

oriented architectures (SOA) eliminate the need to install and maintain client-side<br />

software on the desktop and store and manage data separately, outside of the<br />

corporate database.<br />

• Low Risk: RDF and OWL models can be integrated directly into the corporate<br />

DBMS, along with existing organizational data, XML and spatial information, and<br />

text documents. This results in integrated, scalable, secure high-performance<br />

WordNet applications that could be deployed on any server platform (UNIX,<br />

Linux, or Windows).<br />

• Performance and Security: For mission-critical semantic data models Oracle<br />

provides the security, scalability, and performance of the industry’s leading<br />

database, to manage multi-terabyte RDF datasets and server communities ranging<br />

from tens to tens of thousands of users.<br />

• Open Architecture: The leading semantic software tool vendors have announced<br />

support for the Oracle Database 11g RDF/OWL data model. In addition, plug-in<br />

support is now available from the leading open source tools.<br />

• Native inference using OWL and RDFS semantics, as well as user-defined rules.

• Querying of RDF/OWL data and ontologies using SPARQL-like graph patterns embedded in SQL.

• Ontology-assisted querying of enterprise (relational) data stores.

• Loading of, and DML access to, semantic data.

Based on a graph data model, RDF triples are persisted, indexed, and queried like other object-relational data types. The Oracle database's capabilities for managing semantics expressed in RDF and OWL ensure that WordNet developers benefit from its scalability when deploying high-performance enterprise applications.
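The "SPARQL-like graph patterns" mentioned above can be illustrated in miniature without Oracle: a graph pattern is a triple containing variables, and answering a query means joining the bindings that each pattern produces over the triple set. The triples, names, and query below are invented for illustration.

```python
# Minimal sketch of SPARQL-like graph-pattern matching over RDF-style triples.
# Variables start with "?"; all data is invented for illustration.
triples = {
    ("synset-dog-n-1", "rdf:type", "NounSynSet"),
    ("synset-dog-n-1", "hyponymOf", "synset-canine-n-1"),
    ("synset-cat-n-1", "rdf:type", "NounSynSet"),
    ("synset-cat-n-1", "hyponymOf", "synset-feline-n-1"),
}

def match(pattern, binding=None):
    """Yield variable bindings for a single triple pattern."""
    binding = binding or {}
    for triple in triples:
        b = dict(binding)
        ok = True
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if b.setdefault(pat, val) != val:  # conflicting binding
                    ok = False
                    break
            elif pat != val:                        # constant mismatch
                ok = False
                break
        if ok:
            yield b

def query(patterns):
    """Join several triple patterns, SPARQL-style."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings

# "Every noun synset together with its hypernym":
rows = query([("?s", "rdf:type", "NounSynSet"), ("?s", "hyponymOf", "?h")])
```

Oracle embeds this kind of pattern in SQL (and rdflib, Jena, and similar toolkits expose it as real SPARQL); the join-of-bindings logic is the same.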

References<br />

1. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. Bradford Books (1998)

2. Miller, G. et al.: Five Papers on WordNet. CSL-Report, Vol. 43. Princeton. ftp://ftp.cogsci.priceton.edu/pub/wordnet/5papers.ps (1990)

3. Vossen, P.: EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht (1998)

4. Brickley, D.: Message to RDF Interest Group: "WordNet in RDF/XML: 50,000+ RDF class vocabulary". http://lists.w3.org/Archives/Public/www-rdfinterest/1999Dec/0002.html. See also http://xmlns.com/2001/08/wordnet/

5. Decker, S., Melnik, S.: WordNet RDF representation. http://www.semanticweb.org/library/

6. WordNet OWL Ontology: http://www2.unine.ch/imi/page11291_en.html

7. http://www.dcc.uchile.cl/~agraves/wordnet

8. RDF/OWL Representation of WordNet. W3C Working Draft, 19 June 2006. http://www.w3.org/TR/wordnet-rdf/#figure1 (2006)

9. Balkova, V., Suhonogov, A., Yablonsky, S. A.: Russian WordNet. From UML-notation to Internet/Intranet Database Implementation. In: Proceedings of the Second International WordNet Conference (GWC 2004), pp. 31–38. Brno (2004)

10. Bertagna, F., Calzolari, N., Monachini, M., Soria, C., Hsieh, S.K., Huang, C.R., Marchetti, A., Tesconi, M.: Exploring Interoperability of Language Resources: the Case of Cross-lingual Semi-automatic Enrichment of Wordnets. In: IWIC 2007, pp. 146–158 (2007)


A Comparison of Feature Norms and WordNet<br />

Eduard Barbu and Massimo Poesio<br />

Center for Mind/Brain Sciences, Rovereto, Trento, Italy<br />

eduard.barbu@email.unitn.it<br />

poesio@dit.unitn.it<br />

Abstract. Concepts are the most important objects of study in cognitive science, the building blocks of any theory of mind. Most theories of the conceptual organization of semantic memory, including the classical theory of concepts, assume a featural representation of concepts. The importance of features in contemporary theories of semantic memory confronts researchers with the hard problem of finding psychologically relevant concept descriptions. It is believed that the most reliable method for achieving this goal is to ask human subjects in controlled experiments to produce features for a set of concepts. The purpose of this paper is to compare the featural descriptions of concepts in two psychological feature norms with the featural descriptions of the same concepts in WordNet. To perform this comparison we mapped the concepts in the two feature norms onto Princeton WordNet 2.1 and then automatically extracted the potential features from a suitable semantic neighborhood of the respective concepts.

Keywords: concepts, featural representation, feature norms, semantic memory,<br />

WordNet, comparison<br />

1 Introduction<br />

Concepts are the most important objects of study in cognitive science, the building blocks of a theory of mind. The debate over the nature and the structure of concepts is as old as philosophical reflection itself. In contemporary cognitive science there are three main tenets about the nature of concepts: concepts are mental representations, cognitive abilities, or Fregean senses.

Because this paper focuses on the structure and not on the nature of concepts, it is essential for the purpose of the subsequent discussion to assume that concepts are mental representations and not cognitive abilities of some sort. We also assume that humans can access and report the content of these representations. Moreover, we will concentrate not on the structure of concepts in general but on the structure of those concepts lexicalized in the English language.

Any discussion of the structure of concepts should begin with the oldest and, for some, still appealing theory of concepts, called the classical theory. It has its roots in the work of philosophers like Plato and Aristotle, and until the second half of the 20th century this theory was practically unchallenged. According to the modern formulation of the classical theory there are two types of lexical concepts: primitive and complex. The complex concepts have a definitional structure composed of other concepts that specify their necessary and sufficient conditions. If we keep redefining the constituents entering into the descriptions of the complex concepts, we eventually define all the concepts using the finite stock of primitive concepts. From now on we will call the description of a complex concept a featural description, and the components of the description features. In this paper we will use the terms feature, property and relation interchangeably when there is no possibility of confusion. If we take as an example the well-known concept "bachelor", its featural description contains the features "is a man" and "is unmarried".

Of course, the stock of concepts that have a featural description in terms of necessary and sufficient conditions is much bigger, and many of them are not as controversial as "bachelor". We can consider countless examples from mathematics: prime number, even number, vector space, equilateral triangle, etc.

According to the classical theory, when we classify an object in the world we check what the features of the object to be classified are, and then we assign the object to the category that uniquely fits its description. For example, when we classify a particular dog we verify that the perceptual features we extract from the interaction with that particular exemplar (presumably features like "has four legs", "has a head", etc.) match our mental description of the concept dog.

During the second half of the 20th century the classical theory of concepts came under heavy attack from many quarters: philosophy, psychology, and the newly born cognitive science. The main reproach from the psychological perspective has been that the classical model predicts neither typicality effects nor category fuzziness. Typicality effects refer to the fact that people tend to rank the members of natural categories according to how good an example they are of the respective category. For instance, a sparrow is considered a more typical example of a bird than a chicken. This phenomenon cannot be explained by the classical theory of concepts because, according to it, all members of a category have equal status. Category fuzziness, on the other hand, refers to the fact that some categories have indeterminate boundaries. For example, both answers to the question "Is a carpet a piece of furniture?" seem to be inadequate [1]. Again this is a problem for the classical theory of concepts because it does not allow for category indeterminacy.

In psychology the first theories proposed as alternatives to the classical theory were the twin theories "prototype theory" and "exemplar theory". These theories succeed where the classical theory failed, namely in explaining both category fuzziness and typicality effects. But to achieve this they had to reject one major tenet of the classical theory, namely that concepts have necessary and sufficient conditions. What is important for us is that they did not question the other main classical contention: that concepts (in the case of prototype theory) and the individuals that define a concept (in the case of exemplar theories) are featural representations.

The importance of features in contemporary theories of semantic memory confronts researchers with the hard problem of specifying psychologically relevant concept descriptions. It is believed that the most reliable method for achieving this goal is to ask human subjects in controlled experiments to produce features for a set of concepts. The purpose of this paper is to compare the featural descriptions of concepts in the psychological feature norms with the featural descriptions of concepts in WordNet. To make this comparison we mapped the concepts in the two feature norms onto Princeton WordNet 2.1 and then automatically extracted the potential features from a suitable semantic neighborhood of the respective concepts. To assess the quality of the proposed automatic procedure we manually compared 20 concept descriptions found in the feature norms with their descriptions in WordNet.

The rest of the paper is organized as follows. In the first part we introduce the two feature norms and compare them quantitatively and qualitatively. The second part of the paper presents the algorithm for extracting the potential features from Princeton WordNet. The third part gives a quantitative and qualitative comparison between each of the feature norms and the features extracted automatically. We conclude the paper by presenting some related work and drawing conclusions.

2 Feature Norms<br />

The empirical question confronting a researcher who uses featural concept descriptions to test hypotheses about semantic memory organization is how to derive a set of features that approximates the mental representation of concepts.

In a paper about semantic memory impairment, Farah and McClelland [2] showed how a modality-specific semantic memory system could account for category deficits after brain damage. They implemented a neural network model of semantic memory based on the hypothesis that functional and visual features have different distributions for living and for non-living things. The proportion of visual versus functional features for living and non-living things was estimated on the basis of a set of dictionary definitions. Their approach has been criticized on the grounds that features extracted from dictionary definitions do not provide a good model of human mental representation.

A better alternative for feature generation is to ask people in controlled psychological experiments to make explicit the content of their semantic memory. In a celebrated series of experiments in the 1970s, Rosch and Mervis [3] asked their subjects to produce features for twenty members of six basic-level categories. Subsequently they asked the subjects to rank the respective members according to how good an example they are of the respective categories. For instance, the subjects were asked to rank the concepts chair, piano and clock according to how representative they are of the category furniture. One major finding of their study was that the typicality of a concept is highly correlated with its total cue validity. That is, the most typical items are those that have many features in common with other members of the category and few features in common with members outside the category. Subsequent research replicated the results of Rosch and Mervis, but nowadays it is acknowledged that besides cue validity there are other factors that determine typicality [4].

Following Rosch, other researchers [5, 6] built feature norms and used them in investigations of semantic memory. The norms became the empirical material for constructing computational theories of information encoding, storage and retrieval in semantic memory. Following the line of research that started with Rosch and Mervis, the norms are also used to examine the relation between semantic representations and prototypicality.

To our knowledge only two feature norms are publicly available. The first was built by Garrard and his colleagues [7]. The norm was produced by asking 20 people to provide featural descriptions for a set of 64 concepts denoting living and nonliving things. McRae and his collaborators [8] acquired the second feature norm, the largest norm to date, by asking 725 subjects to list features for 541 living and nonliving basic-level concepts. From now on, we will refer to these feature norms as the Garrard database and the McRae database respectively.

The methodology for building the norms differs in some details from one researcher to the other. For example, unlike Rosch and Mervis, neither Garrard nor McRae imposed time limits on their subjects for the feature listing task.

In Garrard's experiment each stimulus (the concept for which the subjects should provide a description) was presented on a separate page, and the task of the subjects was to fill in the fields present on the page. The fields classified the type of features that the subject should provide: classification features (under the Category heading), descriptive features (under the is field), parts (under the has field) and abilities (under the can field).

In McRae's experiment the stimuli were shown on empty pages. In a task description session the experimenters hinted to the subjects the nature of the descriptions they were expected to provide. As we will see later, these methodological differences can only partially account for the dissimilarities in concept description between the Garrard and McRae databases.

To get a feeling for what kind of features are listed in the experiments, we present the partial description of the concept "apple" as registered in the Garrard database: Apple = {"is a fruit", "has pips", "has skin", "is round", "has stalk", "has flesh", "has core", "is red", "is green", "is sweet", "has leaves", "is juicy", "is coloured", "is sour", "has white flesh", "is small", "is edible", "can be cooked", "can fall", "can be picked", "can ripen", "can rot"}.

In addition to the featural descriptions of concepts, the databases contain a wealth of interesting information. We will mention only three fields that are particularly important. Dominance is a field indicating the number of subjects that listed a certain feature. It reflects the "weight" of a certain feature in the mental representation of a concept: the higher the dominance of a specific feature, the greater the importance of the respective feature. Distinctiveness reflects the percentage of members of a category for which a specific feature is listed. It is a measure of how good individual features are at distinguishing between categories. For example, "has trunk" is a highly distinctive feature for the category elephant because it helps distinguish the members of this class from other animals that are not members, whereas "has tail" is a weakly distinctive feature because elephants share this feature with other animals.

The third field, the most significant from our point of view, gives a classification of the feature types in the databases. Unfortunately the two databases use different feature classifications. The Garrard database has a relatively simple but nevertheless controversial feature classification. The features are classified as categorizing, sensory, functional or encyclopedic. The categorizing features taxonomically classify the stimulus (e.g. a lion "is an animal"); the sensory features are those grounded in a sensory modality (e.g. the bus "is coloured" or the apple "is sour"); the functional features describe an activity or the use someone makes of an item (monkeys "can run", a brush "can apply paint"); and the encyclopedic features are those that cannot be classified as superordinate, sensory or functional. Sometimes the way Garrard and colleagues make use of this classification is puzzling. For example, they classify some features that denote parts of sophisticated modern apparatus as sensory (for example the rotor or the controls of a helicopter). Even if one can argue that we "see" these parts, their identification as parts is largely based on knowledge of the structure and the functions of a modern vehicle.
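The dominance and distinctiveness fields described above can be computed directly from a norm. The toy norm below is invented for illustration (real norms have hundreds of concepts and many subjects), and distinctiveness is taken over the whole toy set rather than a single category.

```python
# Toy feature norm: concept -> {feature: number of subjects who listed it}.
# All counts are invented for illustration.
norm = {
    "elephant": {"has trunk": 18, "has tail": 9, "is an animal": 15},
    "dog":      {"has tail": 14, "is an animal": 17, "can bark": 16},
    "cat":      {"has tail": 12, "is an animal": 16},
}

def dominance(concept, feature):
    """Number of subjects who listed the feature for the concept."""
    return norm[concept].get(feature, 0)

def distinctiveness(feature):
    """Fraction of concepts for which the feature is listed.

    A lower fraction means the feature picks out fewer concepts and
    therefore distinguishes its bearers better ("has trunk" beats "has tail").
    """
    listed = sum(1 for c in norm if feature in norm[c])
    return listed / len(norm)
```

With these definitions, "has tail" is listed for every toy concept (fraction 1.0) while "has trunk" is listed for one of three, matching the elephant example in the text.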

More interesting is the classification employed by McRae and colleagues for the features in their database. They use a taxonomic classification, a slightly modified version of the Wu and Barsalou taxonomy [9], which is derived from studies of human perception. Among the principles that Wu and Barsalou considered when constructing the taxonomy were the introspective experience of the subjects when they generate feature norms, the modality-specific regions of the brain, and the frame theory of Fillmore and others.

Their taxonomy has two levels; at the coarsest level the features are classified as taxonomic properties, entity properties, situational properties or introspective properties. The taxonomic properties are those that taxonomically classify an entity, the entity properties denote general properties of an entity, the situational properties are characteristic of situations, and the introspective properties are properties of the subject's mental states.

At the next level each of the mentioned categories of properties is further subdivided. The modified Wu and Barsalou taxonomy used by McRae has 27 categories at the second level. To make things clear, let us consider a partial description of the concept accordion: {"a musical instrument" (Taxonomic::Superordinate), "has keys" (Entity::External Component), "produces music" (Entity::Entity Behaviour)}.

In the above description three features of the concept accordion are listed. The first feature states that the accordion is a musical instrument; according to the Wu and Barsalou taxonomy it is classified at the first level as a taxonomic feature and at the second level as a superordinate feature (Taxonomic::Superordinate). The other two features, "has keys" and "produces music", are classified as "Entity::External Component" and "Entity::Entity Behaviour" respectively. The Entity::External Component properties denote features that are external components of the object being described, whereas Entity::Entity Behaviour features denote activities that are part of the behavior of the object under description.
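The two-level labeling just illustrated for "accordion" is easy to represent and aggregate over; the coarse/fine pairs below reproduce that example, and the counting helper is a hypothetical illustration, not part of the McRae database tooling.

```python
# The accordion example, with each feature carrying a (coarse, fine) label
# in the style of the modified Wu-Barsalou taxonomy.
accordion = {
    "a musical instrument": ("Taxonomic", "Superordinate"),
    "has keys": ("Entity", "External Component"),
    "produces music": ("Entity", "Entity Behaviour"),
}

def coarse_counts(classified):
    """Count features per first-level (coarse) taxonomy category."""
    counts = {}
    for coarse, _fine in classified.values():
        counts[coarse] = counts.get(coarse, 0) + 1
    return counts
```

Aggregations like this are what make per-feature-type comparisons (as in Table 3 below) possible: each concept-feature pair contributes one count to its relation type.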

Before making the comparison between WordNet and these databases we want to<br />

compare the databases with each other (Table 1).


Table 1. A quantitative evaluation of the Garrard and McRae databases

                           Garrard Database   McRae Database
Concepts                   62¹                541
Feature-concept pairs      1657               7275
Average number of F/C      26.7               13.4

The first row of Table 1 lists the number of concepts in each database; the second row gives the number of concept-feature pairs in each of the two databases; and the last row lists the average number of features per concept for each database. Observe that in the Garrard database the average number of features per concept is twice as large as in the McRae database. Perhaps Garrard's strategy of providing prompt fields for the feature production task paid off.

For the qualitative comparison of the databases we semi-automatically mapped them onto each other. First we identified the common concepts in the two databases, and then we semi-automatically mapped the concept-feature pairs. In most cases the mapping between concept-feature pairs is one to one, but in some cases the mapping is one to many or many to one.
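The overall procedure the paper relies on (map each norm concept onto a WordNet synset, then translate the synset's semantic neighborhood into candidate features) can be sketched as follows. The synset inventory, relation names, and mapping below are invented toy stand-ins; in the paper this information comes from Princeton WordNet 2.1.

```python
# Toy stand-ins for the concept-to-synset mapping and for each synset's
# semantic neighborhood. In the paper both come from Princeton WordNet 2.1.
concept_to_synset = {"apple": "apple.n.01"}
neighborhood = {
    "apple.n.01": {
        "hypernyms": ["edible_fruit"],
        "part_meronyms": ["peel", "core", "seed"],
    }
}

def potential_features(concept):
    """Translate a concept's synset neighborhood into norm-style feature strings."""
    synset = concept_to_synset[concept]
    info = neighborhood[synset]
    feats = [f"is a {h}" for h in info.get("hypernyms", [])]
    feats += [f"has {m}" for m in info.get("part_meronyms", [])]
    return feats
```

The resulting strings ("is a ...", "has ...") are directly comparable to the norm entries such as the Garrard "apple" description quoted earlier.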

The mapping between the McRae and Garrard databases yielded the results presented in Tables 2 and 3. From Table 2 it can be seen that we found a set of 50 concepts common to both databases (Mapped Concepts). For this common set of concepts we list the number of concept-feature pairs present in each database ("Garrard CF Pairs" and "McRae CF Pairs" respectively). Finally, the "Common Mapped Pairs" field gives the number of concept-feature pairs the databases have in common.

Table 2. The mapping between the Garrard and McRae databases

Mapped Concepts        50
Garrard CF pairs       1326
McRae CF pairs         765
Common Mapped Pairs    430

Table 3. A per feature type comparison between the Garrard and McRae databases<br />

Relation classification      CFPM   CFPG   CFPG / CFPM<br />
Made Of                      32     27     0.84<br />
Superordinate                67     54     0.80<br />
External Component           171    129    0.75<br />
Entity Behaviour             91     60     0.65<br />
External Surface Property    102    64     0.62<br />
Internal Component           21     13     0.61<br />
Internal Surface Property    18     11     0.61<br />

¹ Garrard published only 62 of the 64 concepts for which he collected featural descriptions.


62 Eduard Barbu and Massimo Poesio<br />

As one can see from Table 2, 56% of the concept-feature pairs listed in the McRae<br />
database are present in the Garrard database, but only 32% of the concept-feature<br />
pairs in the Garrard database are also present in the McRae database. The problem is<br />
how to make sense of these differences. The second finding, namely that 68% of the<br />
concept-feature pairs in the Garrard database are not in the McRae database, can be<br />
explained by the methodological difference between the authors: Garrard's subjects<br />
only had to fill in the fields already on the page, whereas McRae's subjects had to<br />
produce the features with no help. More problematic is how to interpret the first<br />
number: 44% of the concept-feature pairs in the McRae database are not in the<br />
Garrard database. This fact poses serious problems for computational theories of<br />
semantic memory based on feature norms, but we will not address the problem in this<br />
paper.<br />

Table 3 compares the databases using the Wu and Barsalou taxonomy. The first<br />
column lists some salient relation types in the Wu and Barsalou taxonomy, omitting<br />
the first level of classification. For the set of common concepts in the two databases,<br />
the second column gives the number of concept-feature pairs classified with a certain<br />
relation type in the McRae database (CFPM). Thus we find that 32 concept-feature<br />
pairs were classified as instances of the “Made Of” relation type, 67 as exemplifying<br />
the Superordinate relation type, and so on. The third column gives the same statistic<br />
for the concept-feature pairs that are in the intersection between the McRae and<br />
Garrard databases (CFPG). For example, of the 32 concept-feature pairs classified as<br />
instances of the “Made Of” relation type in the McRae database, 27 have been mapped<br />
onto the Garrard database. The last column gives the ratio of the previous two<br />
columns. We eliminated those relation types that classified fewer than 11<br />
concept-feature pairs or had a ratio in the last column lower than 0.51.<br />

One can see that the feature types successfully mapped from the McRae database to<br />
the Garrard database are parts (“Made Of”, “External Component”, “Internal<br />
Component”), taxonomic features (Superordinate), the features classified under<br />
“Entity Behaviour”, and the features that denote external and internal surface<br />
properties.<br />

3 WordNet Feature Extraction<br />

The procedure for building feature norms is time consuming: for example,<br />
McRae and his colleagues started their feature collection in the 1990s.<br />
Hoping to find an automatic procedure for producing featural concept<br />
descriptions, we want to see how the feature norms compare with WordNet. WordNet<br />
is a resource built on psycholinguistic principles that aims to be a model<br />
of human semantic memory. The feature norms, as we showed before, are built<br />
with the computational modeling of semantic memory in mind. Therefore one<br />
would expect to find in WordNet many of the features produced by the<br />
subjects in the psychological experiments.<br />

To automatically compare the concept descriptions in the two databases with the<br />
concept descriptions in WordNet, we mapped the concepts in the databases onto<br />
WordNet concepts. The mapping procedure has two steps: the first is fully<br />
automatic and the second is manual.<br />

In the automatic step we try to guess the most likely assignment between the<br />
words that were offered as stimuli in the databases and the corresponding<br />
WordNet synsets. First we generate all synsets that contain the stimulus words in the<br />
databases, together with their hyperonyms up to the root of the WordNet tree. Then,<br />
from the Category field in the Garrard database and from the Superordinate property<br />
types in the McRae database, we generate the classification of the database concepts.<br />
Afterwards we intersect the words that classify the stimuli in the databases with the<br />
hyperonyms in WordNet. If the intersection is not empty and no two senses of a word<br />
have the same hyperonym in WordNet, we can automatically find the synset<br />
corresponding to the stimulus. For example, the word apple, present in both<br />
databases, is classified in both of them as a fruit. There are two senses of the word<br />
apple in WordNet: the first (apple#1) refers to the apple as a fruit and the second<br />
(apple#2) refers to the apple as a tree. One of the hyperonym synsets of apple#1 has<br />
the word fruit among its members. Therefore we find that apple in both databases<br />
should be mapped onto the first sense of apple in WordNet (apple#1).<br />

In the second step we manually map the stimuli words that could not be mapped<br />

automatically and we also briefly recheck the accuracy of automatic mapping.<br />
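As a rough illustration, the sense-selection heuristic can be sketched in a few lines of Python. The toy synset inventory below is invented for the example (the identifiers and entries merely mimic the two WordNet senses of apple); the real procedure runs over the full WordNet hypernym hierarchy.<br />

```python
# Toy stand-in for WordNet: synset id -> member words and direct hypernyms.
# All identifiers and entries here are invented for illustration only.
TOY_WORDNET = {
    "apple#1": {"members": {"apple"}, "hypernyms": ["edible_fruit#1"]},
    "apple#2": {"members": {"apple", "orchard apple tree"}, "hypernyms": ["fruit_tree#1"]},
    "edible_fruit#1": {"members": {"edible fruit", "fruit"}, "hypernyms": ["produce#1"]},
    "fruit_tree#1": {"members": {"fruit tree"}, "hypernyms": []},
    "produce#1": {"members": {"produce", "green goods"}, "hypernyms": []},
}

def hypernym_closure(synset_id):
    """All hypernyms of a synset, up to the root of the toy hierarchy."""
    closure, stack = set(), list(TOY_WORDNET[synset_id]["hypernyms"])
    while stack:
        s = stack.pop()
        if s not in closure:
            closure.add(s)
            stack.extend(TOY_WORDNET[s]["hypernyms"])
    return closure

def map_stimulus(word, category):
    """Select the sense whose hypernym chain contains the database category;
    succeed only if exactly one sense qualifies."""
    matches = []
    for sid, data in TOY_WORDNET.items():
        if word not in data["members"]:
            continue
        words_above = {w for h in hypernym_closure(sid) for w in TOY_WORDNET[h]["members"]}
        if category in words_above:
            matches.append(sid)
    return matches[0] if len(matches) == 1 else None

print(map_stimulus("apple", "fruit"))  # -> apple#1
```

When both senses (or neither) match the category, the sketch returns None and the stimulus falls through to the manual step.<br />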

Before presenting the algorithm for WordNet feature extraction we give some<br />

useful term definitions: projection set, semantic neighborhood and WordNet feature.<br />

Definition 1 (Projection Set). The set of synsets that represent the mappings of the<br />

concepts in the databases onto WordNet is called the projection set.<br />

We have two projection sets, one for each database: the Garrard projection set and the<br />
McRae projection set respectively. When we use the term projection set without<br />
qualification we refer to both projection sets.<br />

Definition 2 (Semantic Neighborhood). The semantic neighborhood of a synset s<br />
is a graph G = (N, R), where N is a finite set of nodes representing WordNet synsets<br />
and R is a set of relations linking the nodes.<br />

The algorithm for feature extraction considers only two semantic relations in R:<br />

hyperonymy and meronymy. We chose the hyperonymy relation because it is a<br />
transitive inheritance relation: along it, a concept inherits all<br />
the featural descriptions of its superordinates. We included the meronymy relation<br />

because the parts are among the most salient feature types produced by the subjects in<br />

the feature generation task.<br />

Definition 3 (WordNet Feature). A WordNet feature of a concept is any word in<br />

the synsets of its semantic neighborhood and any noun, adjective or verb in the<br />

glosses of the synsets of its semantic neighborhood.<br />

The feature extraction for the synsets in the projection set is performed from the<br />

semantic neighborhood of each synset.



Considering any noun, adjective or verb among the potential features of a concept<br />
seems to overestimate the number of real features present in WordNet. Remember,<br />
however, that we want to see which features in the databases are also present in<br />
WordNet. Therefore the generation of a reasonable number of “false” features does not<br />
affect the comparison at all, because the set of real features is a subset of the<br />
generated WordNet features.<br />

The algorithm for the extraction of WordNet features for the concepts represented<br />

by the synsets in the projection set has three steps. In the first step we generate the<br />

semantic neighborhoods of each synset in the projection set. In the second step we<br />

part-of-speech tag and lemmatize all the glosses of the synsets from the semantic<br />
neighborhood. The part-of-speech tagging and the lemmatization are performed with<br />
TreeTagger, a language-independent part-of-speech tagger developed by the Institute<br />
for Computational Linguistics of the University of Stuttgart. The tagger uses an<br />
English parameter file trained on the Penn Treebank. In the third step we extract all the<br />

WordNet features and eliminate possible duplicate features. Figure 1 shows a part of<br />

the semantic neighborhood of the synset apple. A node of the graph is labeled with its<br />

corresponding synset; the synset is followed by its gloss. The edges of the graph are<br />

labeled with the semantic relations in the above-mentioned R set.<br />

Running the algorithm for the toy example above we obtain the following potential<br />

features for the concept apple: fruit, red, yellow, green, skin, sweet, tart, crisp,<br />

whitish, flesh, edible, reproductive, body, seed, plant, vegetable, grow, market, peel.<br />
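The three steps can be sketched roughly as follows. This is a simplification: a regular-expression tokenizer and a small hand-written stopword list stand in for TreeTagger's tagging and lemmatization, and the neighborhood is a hand-coded version of the Figure 1 fragment.<br />

```python
import re

# Hand-coded fragment of the semantic neighborhood of apple (after Fig. 1):
# synset member words -> gloss.
NEIGHBORHOOD = {
    "apple": "fruit with red or yellow or green skin and sweet to tart crisp whitish flesh",
    "edible fruit": "edible reproductive body of a seed plant especially one having sweet flesh",
    "produce green goods": "fresh fruits and vegetable grown for the market",
    "peel skin": "the rind of a fruit or vegetable",
}

# Crude stand-in for POS filtering: drop closed-class words instead of
# keeping only TreeTagger-identified nouns, adjectives and verbs.
STOPWORDS = {"with", "or", "and", "to", "of", "a", "the", "one", "having", "especially"}

def wordnet_features(neighborhood):
    features = set()
    for synset_words, gloss in neighborhood.items():
        features.update(synset_words.split())             # step 1/3: words of the synsets
        for tok in re.findall(r"[a-z]+", gloss.lower()):  # step 2: gloss tokens
            if tok not in STOPWORDS:
                features.add(tok)                         # duplicates removed by the set
    return features

feats = wordnet_features(NEIGHBORHOOD)
print(sorted(feats))
```

Without real lemmatization the sketch keeps surface forms such as "fruits", which TreeTagger would reduce to "fruit".<br />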

The above algorithm allows us to make a global comparison between the database<br />

features and WordNet features. To perform a much finer comparison we will classify<br />

the synsets in the projection set and then compare the database features and WordNet<br />

features per category.<br />

To find a suitable classification, first we generate the WordNet tree along the<br />

hyperonym relation starting from the synsets in the projection set. We treat the synsets<br />

in the projection set as the objects to be classified and any category subsuming the<br />

synsets in the projection set as a potential classifier. There are many potential<br />

classifications one can find but we would like to find a classification that forms a<br />

partition of the objects to be classified. We also want the resulting clear-cut<br />
categories to be basic-level categories.<br />


A Comparison of Feature Norms and WordNet 65<br />

Synset: apple<br />
Gloss: fruit with red or yellow or green skin and sweet to tart crisp whitish flesh<br />
|- hyperonym -> Synset: edible fruit<br />
|    Gloss: edible reproductive body of a seed plant especially one having sweet flesh<br />
|    |- hyperonym -> Synset: produce, green goods, green groceries<br />
|         Gloss: fresh fruits and vegetable grown for the market<br />
|- meronym -> Synset: peel, skin<br />
     Gloss: the rind of a fruit or vegetable<br />

Fig. 1. The semantic neighborhood of the synset apple<br />

In Figure 2 we see part of the classification tree, whose leaves are synsets from the<br />
projection set. The problem one confronts is where the tree should be cut to form a<br />
good partition. Should we cut the tree at the node “musical instrument” and classify<br />
with its label all the leaves that fall under it, or should we cut the tree at the nodes<br />
“free reed instrument” and “woodwind” and classify with these two labels the leaves<br />
that fall under them?<br />

musical instrument<br />
|- woodwind<br />
|    |- flute<br />
|- free reed instrument<br />
     |- harmonica<br />
     |- accordion<br />

Fig. 2. A part of the classification tree<br />



Because we want to produce basic-level categories, cutting the tree at the node<br />
“musical instrument” seems the obvious solution. We explored the possibility of<br />
finding an automatic resolution of the problem. Ideally an algorithm should take as<br />
input the hyperonymic tree and produce as output a good partition of it.<br />
The algorithm we tested cuts the tree at the nodes that give the smallest possible<br />
generalization. A node of the hyperonymic tree gives the smallest possible<br />
generalization if it dominates at least two synsets from the projection set. After we<br />

collect all the categories satisfying the above condition we retain only those that form<br />

a partition of the objects to classify. For example, applying the algorithm to the toy<br />
example in Figure 2, one cuts the tree at the nodes “free reed instrument” or “musical<br />
instrument”. Observe that the tree cannot be cut at the node woodwind, because this<br />
node dominates only one leaf and therefore gives no useful generalization.<br />
We then observe that the category “musical instrument” dominates the category<br />
“free reed instrument” and that only the category “musical instrument” gives<br />
us a partition of the objects to classify. Unfortunately this straightforward method<br />
does not produce satisfying results, because it generates many artificial categories.<br />
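The heuristic can be sketched on the Figure 2 toy tree. For brevity this sketch only tests single-category cuts; the full method also has to consider combinations of candidate categories.<br />

```python
# Toy hyperonymic tree of Fig. 2, stored as child -> parent.
TREE = {
    "woodwind": "musical instrument",
    "free reed instrument": "musical instrument",
    "flute": "woodwind",
    "harmonica": "free reed instrument",
    "accordion": "free reed instrument",
}
LEAVES = {"flute", "harmonica", "accordion"}  # synsets from the projection set

def ancestors(node):
    """Walk from a node up to the root."""
    while node in TREE:
        node = TREE[node]
        yield node

def dominated(node):
    """Projection-set leaves falling under a candidate category."""
    return {leaf for leaf in LEAVES if node in ancestors(leaf)}

# Candidates: nodes dominating at least two projection-set synsets
# ("woodwind" is excluded: it dominates a single leaf).
candidates = {n for leaf in LEAVES for n in ancestors(leaf) if len(dominated(n)) >= 2}

# Retain the candidates that by themselves partition all the leaves.
cuts = [n for n in candidates if dominated(n) == LEAVES]
print(cuts)  # -> ['musical instrument']
```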

The other automatic approach we considered was to use the classifications already<br />
present in the two databases: the Category field for the Garrard database and the<br />
Superordinate relation type for the McRae database. One can argue that the categories<br />
thus obtained are basic-level, because the subjects of the psychological experiments<br />
produced them. This method, however, leaves unclassified the word stimuli for which<br />
the subjects in the McRae experiment did not produce categories. We chose to take a<br />

middle path. Starting from the categories produced by the subjects in each of the two<br />

experiments and inspecting the classification tree we came up with a much better<br />

category set. The partition we obtained came with the cost of not being able to cover<br />

the whole space. For the Garrard database the following categories form a partition of<br />

50 concepts: {“implement”, “bird”, “mammal”, “fruit”, “container”, “vehicle”,<br />

“reptile”}. One can see that the class of animals is split into reptiles, mammals and<br />
birds, and then we have the partitions of tools (implement), fruits and vehicles. For<br />
the McRae database the partition has 16 categories and covers 345 concepts:<br />
{“clothing”, “implement”, “fruit”, “furniture”, “mammal”, “plant”, “appliance”,<br />
“weapon”, “container”, “musical instrument”, “building”, “vehicle”, “fish”,<br />
“reptile”, “insect”, “bird”}.<br />

4 Results and discussion<br />

For each of the two databases we performed a global comparison with WordNet, a<br />
comparison by feature type, and then a per-category comparison using the category<br />
partitions presented in the final part of Section 3. To make the automatic comparison<br />
between feature norms and WordNet possible we had to make two simplifying<br />
assumptions. In both the Garrard and McRae databases, “has legs” and “has four<br />
legs”, for example, are considered to be distinct features. We neglect the cardinality<br />
and collapse these features into one: “has legs”. We also considered that, when a<br />
feature expresses a two-place relation and the relation is not explicitly defined in<br />
WordNet (e.g. meronymy or hyperonymy), the presence of the arguments of the relation in<br />



WordNet is sufficient for deciding that the relation linking the arguments in<br />
WordNet is the same relation expressed by the database feature. For example, if we<br />
want to decide whether the feature “used for cooking” for the concept “pot” exists in<br />
WordNet, and we find the word cooking in the semantic neighborhood of the concept<br />
pot, then we assume that the relation holding between pot and cooking is the<br />
functional relation “used for”. For most features in the databases this is true, but<br />
there are some cases where our second assumption is false. Table 4 shows the<br />
proportion of the concept-feature pairs in the databases one can find in WordNet.<br />

Table 4. A global comparison between feature norms and WordNet<br />

Database   CF pairs in database   CF pairs in WordNet   Percent in WordNet<br />
McRae      6925                   2108                  30%<br />
Garrard    1537                   342                   22%<br />

The “CF pairs in database” column lists the number of concept-feature pairs in<br />

each database whereas the “CF pairs in WordNet” column gives the number of<br />

concept-feature pairs in the intersection between each database and WordNet. The last<br />

column shows the percent of the features in the databases estimated to be in WordNet.<br />

One can see that the percentage of concept-feature pairs in the McRae database also<br />
found in WordNet is higher than the percentage of features in the Garrard database<br />
that are in WordNet (30% vs. 22%).<br />
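The two simplifying assumptions can be sketched as follows. This is a rough illustration: the word lists are invented stand-ins (the real comparison works on lemmatized, POS-tagged features), and `pot_neighborhood` is a hypothetical set of words from the semantic neighborhood of pot.<br />

```python
# Invented word lists for the sketch; not the actual resources used.
NUMBER_WORDS = {"two", "three", "four", "five", "six", "eight"}
FUNCTION_WORDS = {"has", "used", "for", "a", "an", "the", "is"}

def normalize(feature):
    """Assumption 1: drop cardinality, so 'has four legs' collapses to 'has legs'."""
    return " ".join(t for t in feature.split() if t not in NUMBER_WORDS)

def present_in_wordnet(feature, neighborhood_words):
    """Assumption 2: the bare presence of the feature's content words in the
    semantic neighborhood counts as evidence that the relation holds."""
    content = [t for t in normalize(feature).split() if t not in FUNCTION_WORDS]
    return bool(content) and all(w in neighborhood_words for w in content)

pot_neighborhood = {"vessel", "cooking", "metal", "container"}  # hypothetical
print(normalize("has four legs"))                                # -> has legs
print(present_in_wordnet("used for cooking", pot_neighborhood))  # -> True
```

As the paper notes, assumption 2 occasionally counts a pair as matched when the neighborhood word actually stands in a different relation to the concept.<br />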

In the next two tables we see which feature types are better represented in WordNet.<br />
Tables 5 and 6 list in the first column the feature types, in the second column the<br />
number of typed concept-feature pairs in the respective database, in the third column<br />
the number of concept-feature pairs in the intersection between WordNet and the<br />
databases for each feature type, and in the last column the percentage of<br />
concept-feature pairs found in WordNet for each feature type.<br />

Table 5. Per feature type comparison between Garrard database and WordNet<br />

Feature Type    CF pairs in database   CF pairs in WordNet   Percent in WordNet<br />
Categorizing    115                    83                    72%<br />
Sensory         737                    190                   25%<br />
Encyclopedic    241                    26                    11%<br />
Functional      444                    43                    10%<br />



Table 6. Per feature type comparison between McRae database and WordNet<br />

Relation Type                CF pairs in database   CF pairs in WordNet   Percent in WordNet<br />
Superordinate                588                    470                   80%<br />
External Component           926                    442                   48%<br />
Internal Component           168                    64                    38%<br />
Origin                       59                     16                    27%<br />
Contingency                  91                     24                    26%<br />
External Surface Property    1175                   306                   26%<br />
Made Of                      471                    122                   26%<br />
Function                     1098                   281                   25%<br />
Participant                  183                    44                    24%<br />
Internal Surface Property    179                    40                    22%<br />
Location                     455                    84                    18%<br />
Associated Entity            153                    22                    14%<br />
Systemic Property            293                    38                    13%<br />
Entity Behavior              495                    63                    13%<br />
Action                       184                    20                    11%<br />
Evaluation                   105                    0                     0%<br />

Looking at Tables 5 and 6 we see which feature types are better represented in<br />
WordNet and which are lacking. Table 5 gives the comparison for the Garrard<br />
database, the features being classified according to the categorization employed by<br />
Garrard and colleagues. As one would expect, the feature type best covered by<br />
WordNet is the categorizing type: 72% of the categorizing features produced by the<br />
subjects in the Garrard experiment are found in WordNet. All other feature types are<br />
not so well represented, second place being taken by the sensory features with 25%.<br />
As we argued in Section 2, the Garrard classification is very crude and therefore not<br />
very informative.<br />
Much more interesting is the comparison for the McRae database (in Table 6 we do<br />
not list all the feature types in the Wu and Barsalou taxonomy; we omit the feature<br />
types that classify fewer than 50 concept-feature pairs).<br />

Meeting our expectations, the best feature type in terms of coverage is the<br />
superordinate type (80%); it is the only feature type with coverage over 50% in<br />
WordNet. The relations that denote parts and correspond to various types of<br />
meronymy are relatively well represented, occupying positions 2, 3 and 7.<br />

The features classified under the “External Surface Property” label occupy the fifth<br />
place. The high position in the table of the external surface properties can be<br />
explained by the fact that the definitions of many concepts denoting concrete objects<br />
list properties of their external surfaces (e.g. shape, color). For example, the<br />
definition of the concept apple contains the attributes red, green and yellow, all of<br />
them external surface properties according to the Wu and Barsalou taxonomy.<br />

The last feature type, labeled “Evaluation”, has no representation in WordNet. The<br />
features typed as evaluation reflect subjective assessments of objects or situations<br />
(for example, the evaluation that a bag is useful, that a blouse is pretty or that a<br />
shark is dangerous).<br />

A comparison between the databases and WordNet using the category partitions<br />
discussed above is given in Table 7 (we show only the top seven categories of the<br />
McRae partition):<br />

Table 7. A per category comparison between feature norms and WordNet<br />

Garrard category   Percent Overlap WordNet   McRae category       Percent Overlap WordNet<br />
Fruit              37%                       Fish                 52%<br />
Bird               34%                       Fruit                43%<br />
Implement          25%                       Vehicle              43%<br />
Container          21%                       Bird                 42%<br />
Mammal             20%                       Plant                37%<br />
Vehicle            20%                       Musical instrument   36%<br />
Reptile            17%                       Weapon               33%<br />

Columns 1 and 3 give the categories of the two partitions; columns 2 and 4 give the<br />
percentage of the concept-feature pairs of each category that are present in WordNet.<br />
The best-represented categories in WordNet for the Garrard database are fruits and<br />
birds, whereas for the McRae database the best-represented categories are fish, fruit,<br />
vehicle and birds.<br />

To assess the accuracy of our automatic procedure we performed a manual<br />
comparison between 20 WordNet concept descriptions and each of the two<br />
corresponding database descriptions. The 20-concept set contains the 10 concepts that<br />
our algorithm says have the highest overlap with the databases and the 10 concepts<br />
with the lowest overlap.<br />

A manual mapping between the database concept descriptions and the WordNet<br />
concept descriptions revealed that the number of concept-feature pairs common to the<br />
databases and WordNet is larger than the estimate given by our algorithm. There are<br />
three reasons for this. The first is that some features present in WordNet are<br />
expressed with words different from those used to register the same features in the<br />
two databases. For example, in the McRae database one of the features of the concept<br />
“anchor” is “found on boats”. The definition of the concept “anchor” in WordNet<br />
contains a semantically close word: vessel (in WordNet, vessel is a hyperonym of<br />
boat). If the words in the WordNet glosses had been semantically disambiguated, our<br />
algorithm would have exploited this information and improved the automatic<br />
estimate. However, even a WSD of the gloss words would not completely solve our<br />
problem, because it is a notorious fact that WordNet makes very fine sense<br />
discriminations, and many features that are near-synonyms of the words in the<br />
glosses would not be found.<br />

The second reason for the inaccuracy of our automatic procedure is related to a<br />
general problem of feature norms. It is assumed, for methodological simplicity, that<br />
the features listed in the feature production task are independent. However, this<br />
assumption is known to be false. One of the most important relations linking features<br />
is entailment. For example, the trolley features “used for carrying things” and “used<br />
for moving things” are related by entailment: if someone carries things with a trolley,<br />
he always moves them. The entailment relation also holds between some features in<br />
the feature norms and some features in WordNet. The functional feature of the<br />
concept “anchor”, “used for holding the boats still”, is logically equivalent to the<br />
feature “prevents a vessel from moving” found in the WordNet gloss of the concept<br />
anchor.<br />

The third reason why the automatic comparison fails to reveal the true overlap<br />
between the databases and WordNet is the incompleteness of WordNet. Among the<br />
most salient features that human subjects produce when describing concrete objects<br />
are the parts of those objects, but many concepts from the projection set lack<br />
meronyms in PWN 2.1. We think that a manual comparison with a complete WordNet<br />
would show an overlap of approximately 40% with the McRae database and 30%<br />
with the Garrard database.<br />

The comparison between feature norms and WordNet reveals some potential<br />
improvements for future WordNet versions. To find which feature types are lacking,<br />
one needs to inspect Table 6 and evaluate any feature type except the Superordinate<br />
type and the feature types related to parts. We will briefly discuss three feature types<br />
present in feature norms but lacking or underrepresented in WordNet: the evaluation,<br />
associated entity and function feature types.<br />

As we argued above, even though the evaluation features are an important part of the<br />
semantic representation of some concepts, they are totally missing from WordNet.<br />
We do not think that every possible subjective evaluation should find a place in<br />
WordNet, only the most salient ones. For example, the evaluation that sharks are<br />
generally considered dangerous, or that hyenas are seen as ugly, should be part of the<br />
WordNet entries for shark and hyena respectively.<br />

Another interesting feature type under-represented in WordNet is the associated<br />
entity type. As many of the concepts presented as stimuli in feature generation tasks<br />
denote concrete objects, the mental representation of these concepts includes<br />
knowledge of the entities we normally associate with these objects in the situations in<br />
which we typically encounter them. For example, we associate an anchor with the<br />
chains or ropes it is attached to, an apple with the worms that may infest it, or even<br />
bagpipes with Scotland.<br />

The function or role that an entity serves for an agent is an important part of that<br />
entity's meaning. We use keys to lock or open doors, we empty and fill baskets, we<br />
use trolleys for transporting things, and garages are used for storing cars. Many of<br />
these important functional features are lacking from WordNet (only 281 of the 1098<br />
function features in the McRae database are present in WordNet). If WordNet is to be<br />
a model of human semantic memory, it should rethink its structure to accommodate<br />
the feature types present in feature norms.<br />



5 Related work<br />

We are not aware of other work comparing feature norm concept descriptions with<br />
WordNet concept descriptions. However, much effort has been dedicated to concept<br />
extraction from the web or from corpora, and in some cases there have been attempts<br />
to compare the extracted concept descriptions with feature norms. Some of this work<br />
has sought to extract information about attributes such as parts and qualities [10, 11].<br />

Almuhareb and Poesio developed supervised and unsupervised methods for feature<br />

extraction from the Web based on ideas from Guarino [12] and Pustejovsky [13]<br />

among others, and showed that focusing on extracting ‘attributes’ and ‘values’ leads<br />

to concept descriptions that are more effective from a clustering perspective – e.g., to<br />

distinguish animals from tools or vehicles – than purely distributional descriptions.<br />

They extracted candidate attributes using constructions inspired by [14] such as “the<br />

X of the car is…” and then removed false positives using a statistical classifier.<br />

Recently, Poesio and collaborators [15] evaluated concept descriptions automatically<br />
extracted from one of the biggest corpora in existence against three feature norms:<br />
the two presented in this paper plus a feature norm produced by Vinson and<br />
Vigliocco. They made an in-depth comparison between the three feature norms,<br />
including the computation of statistical correlations between the feature norm<br />
concept descriptions and the corpus concept descriptions.<br />

More generally our work is connected with the ontology learning effort in the<br />

natural language processing and semantic web community and with various work in<br />

psychology that tries to understand the human conceptual system using empirical<br />

methods.<br />

6 Conclusions and further work<br />

The comparison between the concept descriptions in the Garrard and McRae<br />
databases, and between those database descriptions and WordNet, revealed some<br />
interesting results. First, we saw that 56% of the concept-feature pairs listed in the<br />
McRae database are present in the Garrard database, and 32% of the concept-feature<br />
pairs in the Garrard database are present in the McRae database. We also found, using<br />
an automatic procedure, that 30% of the concept-feature pairs in the McRae database<br />
are found in WordNet and 22% of the concept-feature pairs in the Garrard database<br />
are present in WordNet. We argued that an ideal comparison between the two<br />
databases and WordNet would reveal a bigger overlap, comparable with the overlap<br />
between the two psychological databases.<br />

Using the Wu and Barsalou taxonomy and the manual comparison of 20 concept<br />
descriptions in the databases and WordNet, we showed that WordNet descriptions<br />
lack or under-represent important feature types present in the feature norms, such as<br />
the evaluation, associated entity and function feature types. We firmly believe that<br />
any future improvement of WordNet should take the feature norms into<br />
consideration.<br />
We also stressed the fact that the features in the feature norms are not independent.<br />
We would like to find an automatic method for learning the structure that ties the<br />
features together.<br />



A weak point of our automatic WordNet feature extraction algorithm is that it does<br />
not find the relation between the concept to be described and the potential WordNet<br />
features extracted from the glosses. Taking this observation into account, we are<br />
exploring a better procedure for feature extraction, one that exploits a parser to find<br />
the correct relation between the focal concept and the concepts found in the glosses.<br />
We hope to produce in the near future a graphical tool that will help researchers<br />
working with feature norms to easily extract WordNet concept descriptions.<br />

Acknowledgments<br />

We would like to thank Professor Lawrence Barsalou of Emory University for<br />
providing us with the paper discussing the Wu and Barsalou taxonomy. We are also<br />
indebted to our colleagues Marco Baroni and Brian Murphy for stimulating<br />
discussions.<br />

References<br />

1. Medin, D.: Concepts and Conceptual structure. J. American Psychologist 44, 1469–1481<br />

(1989)<br />

2. Farah, M. J., McClelland, J. L.: A computational model of semantic memory impairment:<br />

Modality- specificity and emergent category-specificity. J. Journal of Experimental<br />

Psychology: General 120, 339–357 (1991)<br />

3. Rosch, E., Mervis, C. B.: Family resemblances: Studies in the internal structure of categories.<br />

J. Cognitive Psychology 7, 573–605 (1975)<br />

4. Barsalou, L. W.: Ideals, central tendency, and frequency of instantiation as determinants of<br />

graded structure in categories. J. Journal of Experimental Psychology: Learning, Memory,<br />

and Cognition 11, 629–654 (1985)<br />

5. Ashcraft, M. H.: Property norms for typical and atypical items from 17 categories: A<br />

description and discussion. J. Memory & Cognition 6, 227–232 (1978)<br />

6. Moss, H. E., Tyler, L. K., Devlin, J. T.: The emergence of category-specific deficits in a<br />

distributed semantic system. In: Forde, E. M. E., Humphreys, G. W. (eds.) Category-specificity<br />
in brain and mind, pp. 115–147. Psychology Press, East Sussex, UK (2002)<br />

7. Garrard, P., Lambon Ralph, M. A., Hodges, J. R., Patterson, K.: Prototypicality,<br />

distinctiveness, and intercorrelation: Analyses of the semantic attributes of living and<br />

nonliving concepts. J. Cognitive Neuropsychology 18, 125–174 (2001)<br />

8. McRae, K., Cree, G. S., Seidenberg, M. S., McNorgan, C.: Semantic feature production<br />

norms for a large set of living and nonliving things. J. Behavior Research Methods 37, 547–<br />

559 (2005)<br />

9. Wu, L.-L., Barsalou, L. W.: Grounding Concepts in Perceptual Simulation: Evidence from<br />

Property Generation. In press.<br />

10. Almuhareb, A., Poesio, M.: Finding Attributes in the Web Using a Parser. In: Proceedings<br />

of Corpus Linguistics, Birmingham (2005)<br />

11. Cimiano, P., Wenderoth, J.: Automatically Learning Qualia Structures from the Web. In:<br />

Proceedings of the ACL Workshop on Deep Lexical Acquisition, pp. 28–37. Ann Arbor,<br />
USA (2005)


A Comparison of Feature Norms and WordNet 73<br />

12. Guarino, N.: Concepts, attributes and arbitrary relations: some linguistic and ontological<br />

criteria for structuring knowledge bases. J. Data and Knowledge Engineering 8, 249–261<br />

(1992)<br />

13. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge/London (1995)<br />

14. Hearst, M. A.: Automated Discovery of WordNet Relations. In: Fellbaum, C. (ed.)<br />

WordNet: An Electronic Lexical Database. MIT Press (1998)<br />

15. Poesio, M., Baroni, M., Murphy, B., Barbu, E., Lombardi, L., Almuhareb, A., Vinson, D. P.,<br />

Vigliocco, G.: Speaker generated and corpus generated concept features. Presented at the<br />

conference Concept Types and Frames, Düsseldorf (2007)


Enhancing WordNets with Morphological Relations:<br />

A Case Study from Czech, English and Zulu<br />

Sonja Bosch 1 ,<br />

Christiane Fellbaum 2 , and Karel Pala 3<br />

1 University of South Africa, Pretoria, South Africa,<br />

boschse@unisa.ac.za<br />

2 Department of Psychology,<br />

Princeton University, USA<br />

fellbaum@princeton.edu<br />

3 Faculty of Informatics, Masaryk University, Brno,<br />

Czech Republic<br />

pala@fi.muni.cz<br />

Abstract. WordNets are most useful when their network is dense, i.e., when a<br />

given word or synset is connected to many other words and synsets with lexical<br />

and conceptual relations. More links mean more semantic information and<br />

thus better discrimination of individual word senses. In the paper we discuss<br />

one kind of cross-POS relation for English, Czech and Bantu WordNets. Many<br />

languages have rules whereby new words are derived regularly and productively<br />

from existing words via morphological processes. The morphologically<br />

unmarked base words and the derived words, which share a semantic core with<br />

the base words, can be interlinked and integrated into WordNets, where they<br />

typically form "derivational nests", or subnets. We describe efforts to capture<br />

the morphological and semantic regularities of derivational processes in English,<br />

Czech and Bantu to compare the linguistic mechanisms and to exploit<br />

them for suitable computational processing and WordNet construction. While<br />

some work has been done for English and Czech already, WordNets for Bantu<br />

languages are still in their infancy ([2], [16]) and we propose to explore ways in<br />

which Bantu can benefit from existing work.<br />

1 Introduction: Inflectional and Derivational Morphology<br />

Many languages possess rules of word formation, whereby new words are formed<br />

from a base word by means of affixes. The derived words differ from the base words<br />

not only formally but also semantically, though the meanings of base and derivative<br />

words are closely related. These processes, referred to as morphology, fall into two<br />

major categories. Inflectional morphology, also called grammatical morphology, is<br />

concerned with affixes that have purely grammatical function. Thus, most Indo-<br />

European languages have (or once had) verbal morphology to mark person, number,<br />

tense and aspect as well as noun morphology to indicate categories like gender, number<br />

and case. Czech exploits what can be called a 'cumulation' of functions, i.e., one



inflectional suffix conveys as a rule several grammatical categories; for nouns, adjectives,<br />

pronouns (as well as numerals) the categories expressed by the affixes are gender,<br />

number and case. While Czech is a richly inflected language, English has developed<br />

characteristics of an analytic language where grammatical functions are assumed<br />

by free morphemes; for example, future tense, unlike past and present, is marked by<br />

will. As in Czech, a single morpheme can have several grammatical functions; -s<br />

marks both plural nouns and present tense third person verbs. Bantu languages are<br />

agglutinative and use affixes to express a variety of grammatical relations and meanings.<br />

These morphemes 'glue' onto stems or roots. The morphemes are not polysemous,<br />
as one of the principles that characterises agglutinating languages is the one-to-one<br />
mapping of form and meaning [11], and each morpheme therefore conveys one<br />
grammatical category or distinct lexical meaning.<br />

Importantly, the inflected word belongs to the same form class (i.e., represents the<br />

same part of speech) as the base. By contrast, derivational morphology often yields<br />

words from a different form class. For example, the English verb soften is derived<br />

from the adjective soft by means of the suffix -en. Both inflectional and derivational<br />

morphology encompass regular and productive rules that are an important part of<br />

speakers' grammar. Given a new (or nonce) word like wug, even young children effortlessly<br />

produce the (inflected) plural form wugs [1]. Speakers avail themselves of<br />

the rules of derivational morphology to form and interpret tens of thousands of words.<br />

A third productive mechanism to derive new words from existing ones is compounding.<br />

Examples are English flowerpot, bittersweet, and dry-clean. In Czech, compounding<br />
is a regular word-derivation procedure, but it is considered rather marginal<br />
and not very productive. Examples: česko+slovenský (Czecho-Slovak) or bratro+vrah<br />
(murderer of one's brother).<br />

In Bantu, compounding is also a productive and regular way of creating new<br />

words and it has its own rules. Examples are:<br />

Northern Sotho<br />

sekêpê (ship) + môya (air): sêkêpemôya (airship)<br />

Zulu<br />

abantu (people) + inyoni (bird): abantunyoni (astronauts)<br />

umkhumbi (boat) + ingwenya (crocodile): umkhumbingwenya (submarine)<br />

Venda<br />

ngowa (mushroom/s) + mpengo (madman): ngowampengo (inedible mushroom/s)<br />

In the remainder of this paper, we focus on derivational morphology. We ask how<br />

we can exploit its regularity to populate WordNets and to characterize both formal<br />

and semantic relations. We explore and formulate derivational rules (D-rules) allowing<br />

us to generate automatically as many word forms as possible in the three languages<br />

we focus on (English, Czech and Bantu) and to assign meaning to the output<br />

of these rules. Formulating D-rules would bypass the task of compiling and<br />

maintaining large lists of base forms (stems) and would allow us to generate automatically<br />

the core of the word stock of the individual languages.
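The shape of such a D-rule can be sketched in code. This is a minimal, illustrative sketch only: the dataclass fields, rule names and example stems are assumptions for exposition, not the actual rule format used for Czech, English or Bantu.

```python
# Hypothetical sketch of a suffix-based D-rule. The representation
# (suffix, POS change, semantic label) and the example rule are
# illustrative assumptions, not the paper's actual rule inventory.
from dataclasses import dataclass

@dataclass
class DRule:
    suffix: str       # derivational suffix to append
    base_pos: str     # POS of the base word
    derived_pos: str  # POS of the derived word
    relation: str     # semantic label of the D-relation

# English verb -> agentive noun in -er (teach -> teacher)
AGENTIVE = DRule(suffix="er", base_pos="v", derived_pos="n", relation="Agent")

def apply_rule(stem: str, rule: DRule) -> str:
    """Generate one candidate derived form; real rules must also handle
    stem alternations, and the output still needs corpus checking."""
    if stem.endswith("e"):   # bake -> baker, not *bakeer
        stem = stem[:-1]
    return stem + rule.suffix

print(apply_rule("teach", AGENTIVE))  # teacher
print(apply_rule("bake", AGENTIVE))   # baker
```

Each rule thus pairs a formal operation (affixation) with a semantic label, which is exactly what allows meaning to be assigned to the generated forms.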



When trying to write the formal D-rules that allow us to generate new words automatically<br />

we meet the problem of over- and undergeneration of derived forms. That<br />

is, the D-rules could either produce forms that are possible but not actually occurring<br />

forms (in corpora or dictionaries), or the rules could fail to generate all attested forms.<br />

To avoid errors as well as undergeneration, one currently relies primarily on the manual<br />

checking of the output, but we are developing procedures that can semiautomatize<br />

this process by comparing the output of the D-rules to corpora or dictionaries.<br />

Addressing the overgeneration problem requires re-inspection of the D-rules<br />

and correcting those that generate ill-formed strings.<br />
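The corpus comparison just mentioned can be sketched as a simple filter. The word lists below are tiny illustrative stand-ins, not real corpus data.

```python
# Sketch of the semi-automatic check described above: split D-rule
# output into attested forms and suspected overgenerations. The
# lexicon is a stand-in for a corpus- or dictionary-derived word list.
def split_by_attestation(candidates, lexicon):
    """Return (attested forms, forms flagged for manual inspection)."""
    attested = [w for w in candidates if w in lexicon]
    suspect = [w for w in candidates if w not in lexicon]
    return attested, suspect

corpus_lexicon = {"teacher", "runner", "dancer"}
ok, suspect = split_by_attestation(["teacher", "faller", "runner"], corpus_lexicon)
print(ok)       # ['teacher', 'runner']
print(suspect)  # ['faller'] -- flagged for manual checking
```

Undergeneration is the converse check: attested derived forms in the corpus that no D-rule produces point to missing or over-restrictive rules.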

2 Derivational morphology<br />

Derivational affixes form new words with meanings that are related to, but distinct<br />

from the base to which they are attached. In this way they differ from inflectional<br />

affixes, which add grammatical specifications to a base word. Like inflectional morphology,<br />

derivational morphology tends to be regular and productive, i.e., speakers<br />

use the rules to form and understand words that they may never have encountered.<br />

This shows that derivational morphemes are associated with meanings.<br />

However, the meanings may be polysemous, as in English and Czech, and speakers<br />

have to rely on the meaning of the base word or on world knowledge to understand<br />

the derived word.<br />

In comparison to Czech and English, D-affixes in Bantu only acquire meaning by<br />

virtue of their connection with other morphemes (for example agent, result of an action,<br />

instrument of an action etc.) and cannot always be assigned an independent semantic<br />

value. This poses a challenge for the definition of “lexical unit” or “word”,<br />

which must be met when one constructs a WordNet.<br />

2.1 Derivational relations in Czech<br />

We discuss the two main mechanisms of Czech derivational morphology, suffixation<br />

and prefixation. We classify the suffixes and prefixes semantically.<br />

2.1.1 The Czech morphological analyzer<br />

The basic and most productive derivational relations expressed by affixes or, more<br />

precisely, the rules describing them were formulated and integrated into a Czech morphological<br />

analyzer, Ajka, resulting in a D-version. Ajka is an automatic tool that is<br />

based on the formal description of the Czech inflection paradigms [21] and that was<br />
developed at the NLP Centre at the Faculty of Informatics, Masaryk University Brno.<br />
Ajka's list of stems comprises approx. 400 000 items and up to 1600 inflectional paradigms,<br />
and it is able to generate approx. 6 million Czech word forms. It can be used for<br />
lemmatization and tagging, as a module for a syntactic analyzer, and in other NLP applications.



A version of Ajka for derivational morphology (D-Ajka) can generate new word<br />

forms derived from the stems using rules capturing suffix and prefix derivations. A<br />

Web Derivational Interface makes it possible to further explore the semantic nature of<br />

the selected noun derivational suffixes as well as verb prefixes and establish a set of<br />

the semantic labels associated with the individual D-relations. For verbs, the work<br />

focused on exploring the derivational relations between selected prefixes and corresponding<br />

Czech verb stems or basic non-derived verbs for one verb semantic class<br />

(verbs of motion).<br />

Using the analyzer Ajka and the D-interface allowed the addition of selected noun<br />

and verb D-relations to the Czech WordNet and its enrichment with approx. 31 000 new<br />

Czech synsets, using the DebVisdic editor and browser (see Fig. 1 screenshots).<br />

2.1.2 The Czech data<br />

The starting Czech data include 126 000 noun stems and 22 noun suffixes, 42 745<br />

verb stems (or basic verb forms) and 14 verb prefixes. There are also alternations<br />

(infixes) in stems that are not considered here.<br />

The complete inventory of the main noun suffixes is much larger (approx. 120)<br />

and the same holds for the set of verb prefixes (approx. 240); here we consider only<br />
the primary prefixes (14, of which we treat 4). The higher number of prefixes in Czech<br />

follows from the fact that for each primary prefix there are about 15 secondary (double)<br />

ones.<br />

In Czech grammars [10] we can find the following main types (presently 14) of the<br />

derivational processes exploiting suffixes and prefixes:<br />

1. mutation: noun -> noun derivation, e.g. ryba -> ryb-ník (fish -> pond); the semantic<br />
relation expresses location, holding between an object and its typical location,<br />
2. transposition (a relation between different POS): noun -> adjective<br />
derivation, e.g. den -> den-ní (day -> daily); semantically the relation expresses a property,<br />
3. agentive relation (between different POS): verb -> noun, e.g. myslit -><br />
mysli-tel (think -> thinker); semantically the relation holds between an action and its<br />
agent,<br />
4. patient relation: verb -> noun, e.g. trestat -> trestanec (punish -> convict); semantically<br />
it expresses a relation between an action and the object (person) impacted<br />
by it,<br />
5. instrument (means) relation: verb -> noun, e.g. držet -> držák (hold -> holder);<br />
semantically it expresses a tool (means) used in performing an action,<br />
6. action relation (between different POS): verb -> noun, e.g. učit -> učen-í<br />
(teach -> teaching); the derived nouns are usually characterized as deverbatives, and<br />
semantically both members of the relation denote an action (process),<br />
7. property-verbadj relation (between different POS): verb -> adjective,<br />
e.g. vypracovat -> vypracova-ný (work out -> worked out); the derived adjectives<br />
are usually labelled as de-adjectives, and semantically it is a relation between an action<br />
and its property,<br />
8. property-adjadv relation (between different POS): adjective -> adverb,<br />
e.g. rychlý -> rychl-e (quick -> quickly); semantically we can speak about a property,<br />
9. property-adjnoun relation (between different POS): adjective -> noun, e.g.<br />
rychlý -> rychl-ost (fast -> speed); semantically the relation expresses a property in<br />
both cases,<br />
10. gender-change relation: noun -> noun, e.g. inženýr -> inženýr-ka (engineer -><br />
female engineer); semantically the only difference is in the sex of the persons denoted by<br />
these nouns,<br />
11. diminutive relation: noun -> noun -> noun, e.g. dům -> dom-ek -> dom-eček<br />
(house -> small house -> very little house, or a house toward which the speaker has an<br />
emotional attitude); in Czech the diminutive relation can be binary or ternary,<br />
12. augmentative relation: noun -> noun, e.g. dub -> dub-isko (oak tree -> huge,<br />
strong oak tree); semantically it expresses different emotional attitudes toward a person<br />
or object,<br />
13. possessive relation (between different POS): noun -> adjective, e.g. otec -><br />
otcův (father -> father's); semantically it is a relation between an object (person) and<br />
its possession,<br />
14. the last D-relation exploits prefixes; in fact, it represents a whole complex of D-relations<br />
holding between verbs only, i.e. verb -> verb, e.g. nést -> od-nést (carry -><br />
carry away), tancovat -> dotancovat (dance -> finish dancing). We will say more<br />
about them below.<br />

The 25 selected suffixes in Table 1 express a number of semantic relations, particularly<br />

Action (deverbative nouns), Property, Possessive, Agentive, Instrument,<br />

Location, Gender Change and Diminutive. The Result and Augmentative relations are not<br />
included in Table 1.



Table 1. Selected D-relations with suffixes implemented in Czech WordNet<br />

Label        Parts of speech   Meaning      No of literals   Suffixes<br />
deriv-na     noun -> adj       Property        641           -í<br />
deriv-pos    noun -> adj       Possessive     4037           -ův, -in<br />
deriv-an     adj -> noun       Property       1930           -ost<br />
deriv-aad    adj -> adverb     Property       1416           -e, -ě<br />
deriv-dvrb   verb -> noun      Action         5041           -í, -ní<br />
deriv-ag     verb -> noun      Agentive        186           -tel, -ík, -ák, -ec<br />
deriv-instr  verb -> noun      Instrument      150           -tko, -ík<br />
deriv-loc    verb -> noun      Location        340           -iště, -isko<br />
deriv-ger    verb -> adj       Property       1951           -ící, -ající, -ející<br />
deriv-pas    verb -> adj       Passive        9801           -en, -it<br />
deriv-g      noun -> noun      Gender         2695           -ka<br />
deriv-dem    noun -> noun      Diminutive     3695           -ek, -eček, -ička, -uška<br />
Total                                        31429<br />

The abbreviated labels used in Czech WordNet can be seen in Tables 1 and 2.
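The Table 1 inventory lends itself to a simple machine-readable encoding. The sketch below copies a subset of the rows from Table 1; the dictionary representation itself is an illustrative assumption, not the actual Czech WordNet storage format.

```python
# A subset of Table 1 as a lookup from D-relation label to
# (POS change, meaning, suffixes). Data copied from Table 1; the
# encoding is illustrative, not the Czech WordNet's actual format.
czech_suffix_relations = {
    "deriv-pos":   ("noun -> adj",  "Possessive", ("-ův", "-in")),
    "deriv-an":    ("adj -> noun",  "Property",   ("-ost",)),
    "deriv-dvrb":  ("verb -> noun", "Action",     ("-í", "-ní")),
    "deriv-ag":    ("verb -> noun", "Agentive",   ("-tel", "-ík", "-ák", "-ec")),
    "deriv-instr": ("verb -> noun", "Instrument", ("-tko", "-ík")),
    "deriv-loc":   ("verb -> noun", "Location",   ("-iště", "-isko")),
}

def relations_for_suffix(suffix):
    """Invert the table: which D-relations can a given suffix mark?
    (-ík, for instance, is ambiguous between Agentive and Instrument.)"""
    return sorted(label for label, (_, _, sfx) in czech_suffix_relations.items()
                  if suffix in sfx)

print(relations_for_suffix("-ík"))  # ['deriv-ag', 'deriv-instr']
```

The inversion makes the one-to-many mapping between suffixes and semantic relations, discussed for both Czech and English below, directly inspectable.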



2.1.3 Prefixes<br />

The core of the primary 14 prefixes contains the following ones: do- (to), na- (on, at),<br />

nad- (above, up), od- (from, away), pro- (for, because), při- (by, at), pře- (over), roz-<br />

(over), s-/se- (with, by), u- (at, near), v-/ve- (in, up), vy- (out, off), z-/ze- (of, off), za-<br />

(over, behind). The English equivalents are all phrasal verbs (though English does<br />

have verbal prefixes like out and over), reflecting the difference between an inflectional<br />

and an analytic language; English, unlike Czech, is undergoing a change from<br />

the former to the latter.<br />

Prefix D-relations hold only among verbs, typically between a stem or basic form<br />
and the respective prefix. It can be seen that the semantics of the prefix D-relations<br />
differs from that of the suffix-based ones, because they hold between verbs, which usually denote<br />
actions, processes, events and states.<br />

Table 2 shows the analysis of Czech prefixes, indicates the semantic nature of the<br />

D-relations, and shows the number of literals generated by the individual D-relations:<br />

The 4 selected prefixes in Table 2 denote a number of semantic relations such as<br />

location, time, intensity of action, various kinds of motion (see below), iterativity<br />

(repeated motion or action in general) and some others. It is obvious that they differ<br />

significantly from suffix based D-relations since they hold only between verbs. In the<br />

following we will show how they combine with the selected verbs of motion. Presently,<br />
we have explored the following 4 prefix D-relations:<br />

Table 2. D-relations with prefixes implemented in Czech WordNet<br />

Label                Parts of speech   Meaning                         No of literals   Prefix<br />
deriv-act-t          verb -> verb      finishing motion                    173          do- (to, at)<br />
deriv-act-t-iter     verb -> verb      finishing motion, iterative          24          do-<br />
deriv-mot-from       verb -> verb      motion from                         187          od- (from, off)<br />
deriv-mot-from-iter  verb -> verb      motion from, iterative               25          od-<br />
deriv-oblig          verb -> verb      obligation                            2          od-<br />
deriv-mot-over       verb -> verb      motion over a place                 207          pře- (over)<br />
deriv-mot-over-it    verb -> verb      motion over a place, iterative       21          pře-<br />
deriv-mot-to         verb -> verb      motion to a place                   171          při- (to, at)<br />
deriv-mot-to-iter    verb -> verb      motion to a place, iterative         18          při-<br />
deriv-add            verb -> verb      additivity                            3          při-<br />
Total                                                                      743<br />

Note that the D-relation Iterative is a subset of the verbs of motion, thus we do not<br />

count iterative verbs here as a new group. We also deal only with verbs of motion that<br />

have one argument, i.e. the moving Agent (jít, walk/go). Verbs of motion with two<br />

arguments like nést (carry) are not included here though they represent quite a large<br />

number of the motion verbs. They are also not pure motion verbs but cross over into<br />

contact and transfer ("I bring you flowers").<br />

2.1.4 Semantic classes of verbs and prefixes<br />

The relation between semantic classes of verbs and verb prefixes should be mentioned<br />

here because in Czech WordNet we adduce for each verb the semantic class it<br />

belongs to.<br />

The approaches to the semantic classes of verbs, particularly Levin’s classification<br />

of English verbs [12] and its extension by Palmer ([18]), are based on argument alternations<br />

whose nature is mostly syntactic. For instance, verbs that show a transitive-inchoative<br />

alternation (like break) not only share this particular syntactic behavior but<br />

are semantically similar in that they denote changes of state or location.<br />

Levin's list of the most frequent English verbs falls into over 50 classes (most of<br />

them with several subclasses); Palmer's VerbNet project has extended this work to<br />

395 classes. These verb classes have been translated and adapted for the Czech language.<br />

Presently, we work with approximately 100 semantic verb classes in the VerbaLex<br />

database of Czech valency frames containing approx. 12 000 verbs.



In this approach to the verb classification in Czech we exploit the verb valency<br />

frames that contain semantic roles. It appears that the verb classes established using<br />

semantic roles can be well compared with the classes obtained by the alternations,<br />

however, according to our results the classes obtained by means of the semantic roles<br />

appear to be semantically more consistent.<br />

The third approach is based on the meanings of prefixes. The function of prefixes<br />

in Czech is to classify verbs, yielding rather small and even more consistent semantic<br />

classes of verbs. Using prefixes as sorting criteria we obtain classes that are visibly<br />

closer to the real lexical data due to the fact that the prefixes are well established formal<br />

means. For example, let’s take prefix do- (it corresponds to the English preposition<br />

to or at) and apply it to the larger group of verbs of motion (approx. 1200). The<br />

result is a group containing 173 Czech verbs denoting finishing motion. The verb<br />

classes based on prefix criteria will be examined more thoroughly in future research.<br />
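The prefix-based classification just described can be sketched as a filter over a verb list: a verb joins the class when its prefixed form is attested. The verb list and attested set below are tiny illustrative stand-ins, not real corpus data.

```python
# Sketch of prefix-based verb classification: collect the verbs whose
# prefix-derived form is attested. Verb list and attested set are
# illustrative assumptions, not the actual Czech lexical data.
def prefix_class(prefix, verbs, attested_forms):
    """Return the verbs whose prefix-derived form is attested."""
    return [v for v in verbs if prefix + v in attested_forms]

motion_verbs = ["jít", "běžet", "letět"]
attested = {"dojít", "doletět"}  # stand-in for corpus-attested forms
print(prefix_class("do", motion_verbs, attested))  # ['jít', 'letět']
```

Because the prefixes are well-established formal markers, the resulting classes track the lexical data more closely than purely alternation-based groupings.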

2.2 Derivational relations in the English WordNet<br />

Many traditional paper dictionaries include derivational word forms but list them as<br />

run-ons without any information on their meaning, relying on the user's knowledge of<br />

morphological rules. Dorr and Habash ([9]), recognizing the importance of<br />

morphology-based lexical nests for NLP, created "CatVar," a large-scale database of<br />

categorial variations of English lexemes. CatVar relates lexemes belonging to<br />

different syntactic categories (part of speech) and sharing a stem, such as hunger (n.),<br />

hunger (v.) and hungry (adj.). CatVar is a valuable resource containing some 100,000<br />

unique English word forms; however, no information is given on the words'<br />

meanings.<br />

2.2.1 Morphosemantic relations<br />

Miller and Fellbaum ([15]) describe the addition of "morphosemantic links" to<br />

WordNet ([14], [6]), which connect words (synset members) that are similar in<br />

meaning and where one word is derived from the other by means of a morphological<br />

affix. For example, the verb direct (defined in WordNet as "guide the actors in plays<br />

and films") is linked to the noun director (glossed as "someone who supervises the<br />

actors and directs the action in the production of a show"). Another link was created<br />

for the verb-noun pair direct/director, meaning "be in charge of" and "someone who<br />

controls resources and expenditures," respectively. Most of these links connect words<br />

from different classes (noun-verb, noun-adjective, verb-adjective), though there are<br />

also noun-noun pairs like gang-gangster. English has many such affixes and<br />

associated meaning-change rules (Marchand, 1969).<br />
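A morphosemantic link of this kind can be modelled as a record connecting two sense-specific words. The tuple encoding below is an illustrative assumption (WordNet's own storage differs); the gloss fragments are abridged from the text.

```python
# A minimal in-memory sketch of morphosemantic links; the tuple
# encoding is an illustrative assumption, not WordNet's actual format.
# (source word, source POS, target word, target POS)
morphosemantic_links = [
    ("direct", "v", "director", "n"),  # "guide the actors" / "someone who supervises ..."
    ("direct", "v", "director", "n"),  # "be in charge of" / "someone who controls ..."
    ("gang",   "n", "gangster", "n"),  # a noun-noun link
]

def links_from(word, pos, links):
    """All links whose source matches word/pos; a polysemous pair such
    as direct/director contributes one link per sense pairing."""
    return [l for l in links if l[0] == word and l[1] == pos]

print(len(links_from("direct", "v", morphosemantic_links)))  # 2
```

Note that the links are sense-to-sense, not string-to-string: the two direct/director entries are distinct links precisely because each connects a different verb sense to a different noun sense.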

2.2.2 Adding semantics to the morphosemantic links<br />

When the morphosemantic links were added to WordNet, their semantic nature was<br />

not made explicit, as it was assumed — following conventional wisdom — that the



meanings of the affixes are highly regular and that there is a one-to-one mapping<br />

between the affix forms and their meanings. But ambitious NLP tasks and automatic<br />

reasoning require explicit knowledge of the semantics of the links. Fellbaum,<br />

Osherson and Clark ([7]) describe on-going efforts to label noun-verb pairs with<br />

semantic "roles" such as Agent (direct-director) and Result (produce-product). The<br />

assumption was that there was a one-to-one mapping between affixes and meanings.<br />

Fellbaum et al. extracted all noun-verb pairs with derivational links from WordNet<br />

and grouped them into classes based on the affix. They manually inspected each affix<br />

class expecting to find only a limited number of exceptions in each class. Instead,<br />

they found that the affixes in each class were polysemous, i.e., a given affix yields<br />

nouns that bear different semantic relations to their base verbs.<br />

Table 3 shows Fellbaum et al.'s [7] semantic classification of -er noun and verb<br />

pairs, with the number of pairs given in the right-hand column.<br />

Table 3: Distribution of -er verb-noun pair relations in English<br />

Agent 2,584<br />

Instrument 482<br />

Inanimate agent/Cause 302<br />

Event 224<br />

Result 97<br />

Undergoer 62<br />

Body part 49<br />

Purpose 57<br />

Vehicle 36<br />

Location 36<br />
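The distribution in Table 3 can be held as a simple mapping, which also makes the dominance of the default Agent reading easy to quantify. The counts are those reported by Fellbaum et al. [7]; the dictionary encoding is just an illustration.

```python
# Table 3 as a mapping from semantic relation to number of -er
# verb-noun pairs; counts copied from Table 3.
er_relations = {
    "Agent": 2584, "Instrument": 482, "Inanimate agent/Cause": 302,
    "Event": 224, "Result": 97, "Undergoer": 62, "Purpose": 57,
    "Body part": 49, "Vehicle": 36, "Location": 36,
}

total = sum(er_relations.values())
default = max(er_relations, key=er_relations.get)
print(default, round(er_relations[default] / total, 2))  # Agent 0.66
```

Roughly two thirds of the -er pairs are agentive, consistent with Agent being the default reading of the suffix while the remaining relations show its polysemy.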

Examination of other morphological patterns showed that polysemy of affixes is<br />

widespread. Thus, nouns derived from verbs by -ion suffixation exhibit regular<br />

polysemy between Event and Result readings (the exam lasted two hours / the exam<br />
was lying on his desk; [20]).<br />

Fellbaum et al. [7] also found one-to-many mappings for semantic patterns and affixes:<br />

a semantic category can be expressed by means of several distinct affixes,<br />

though there seems to be a default semantics associated with a given affix. Thus,<br />

while many -er nouns denote Events, event nouns are regularly derived from verbs<br />

via -ment suffixation (bomb-bombardment, punish-punishment, etc.). Patterns are<br />

partly predictable from the thematic structure of the verb. Thus, nouns derived from



unergative verbs (intransitives whose subject is an Agent) are Agents, and the pattern<br />

is productive: runner, dancer, singer, speaker, sleeper, etc. Nouns derived from unaccusative<br />

verbs (intransitives whose subject is a Patient/Undergoer) are Patients:<br />

breaker (wave), streamer (banner), etc. This pattern is far from productive: *faller,<br />

?<br />

arriver,<br />

?<br />

leaver, etc. Many verbs have both transitive (causative) and intransitive<br />

readings (cf. [12]):<br />

(1) a. The cook roasted the chicken<br />

b. The chicken was roasting<br />

2.2.3 How many semantic relations are there?<br />

For many such verbs, there are two corresponding readings of the derived nouns:<br />

both the host in (1a) and the chicken in (1b) can be referred to as a roaster. Other<br />

examples of Agent and Patient nouns derived from the transitive and intransitive<br />

readings of verbs are (best)seller, (fast) developer, broiler. But the pattern is not<br />

productive, as nouns like cracker, stopper, and freezer show.<br />

For virtually all -er pairs that we examined, the default agentive reading of the<br />

noun is always possible, though it is not always lexicalized. Thus a person who plants<br />

trees etc. could well be referred to as a planter, but under this reading the noun seems<br />

infrequent enough not to deserve an entry in most lexicons. Speakers easily generate<br />

and process ad-hoc nouns like planter (gardener), but only in its (non-default) location<br />

reading ("pot") is the noun part of the lexicon, as its meaning cannot be guessed<br />

from its structure.<br />

We focused here on the -er class. But we note that the suffixes for Czech discussed<br />

earlier have close English correspondences.<br />

The semantic relations that were identified by Fellbaum et al. are doubtless somewhat<br />

subjective. Other classifiers might well come up with more coarse-grained or<br />

finer distinctions. Nevertheless, it is encouraging to see that this classification overlaps<br />

largely with that for Czech suffixes, which was arrived at independently. In addition,<br />

the English relations are a subset of those identified by Clark and Clark ([3]),<br />

who examined the large number of English noun-verb pairs related by zero-affix<br />

morphology, i.e., homographic pairs of semantically related verbs and nouns (roof,<br />

lunch, Xerox, etc.). This is the largest productive verb-noun class in English, and<br />

Clark and Clark's relations include not only Agent, Location, Instrument and Body<br />

Part, but also Meals, Elements, and Proper Names.<br />

2.2.4 Related work<br />

In the context of the EuroWordNet project ([23]), Peters ([19], n.d.) manually<br />

established noun-verb and adjective-verb pairs that were both morphologically and<br />

semantically related. Of the relations that Peters considered, the following match the<br />

ones we identified: Agent, Instrument, Location, Patient, Cause. (Peters's<br />

methodology differed from that of Fellbaum et al., who proceeded from the<br />

previously classified morphosemantic links and assumed a default semantic relation



for pairs with a given affix. Peters selected pairs of word forms that were both<br />

morphologically related and where at least one member had only a single sense in<br />

WordNet. These were then manually disambiguated and semantically classified,<br />

regardless of regular morphosemantic patterns.)<br />

2.3 Derivational relations in Bantu<br />

Derivational morphology in Bantu constitutes a combination of morphemes, which<br />

may either produce a new word in a different word category or may leave the word<br />

category (class membership) unchanged. Firstly, types of derivation that produce another<br />

word class include nouns, verbs, adverbs and ideophones derived from other<br />

word categories. The derivation process of nouns from verbs (deverbatives) is the<br />

most productive, and is therefore singled out in this discussion. The Bantu language<br />

Zulu is used for illustrative purposes.<br />

When nouns are derived from verb roots, a noun prefix as well as a deverbative<br />

suffix is required, as illustrated in the following examples of nouns formed from the<br />

verb root -fund- 'learn':<br />

u-m(u)-fund-i 'student' (in Czech the corresponding root is uč-)<br />

i-m-fund-o 'education' (in Czech the corresponding root is uč-e-n-í)<br />

i-si-fund-o 'lesson' (no appropriate equivalent in Czech).<br />

The deverbative suffixes in the above example are -i and -o. Such nouns may have<br />

more than one suffix if the deverbative noun is derived from a verb root that has been<br />

extended, e.g.<br />

u-m(u)-fund-is-i 'teacher'<br />

(in Czech we have uč-i-t-el (teach-er))<br />

The suffix -is- is a causative extension which changes the meaning of -fund-<br />

"learn" to "cause to learn" i.e. "teach". (Compare with English, where causatives are<br />

usually not morphologically derived, with very few exceptions like rise-raise and<br />

fall-fell; in most cases, causatives and non-causatives are different morphemes: kill-die,<br />

show-see, etc.). The last suffix -i is the deverbative suffix.<br />

The following are general rules for the formation of nouns from verb stems; however,<br />

not every verb can be treated in this way (cf. [5]):


86 Sonja Bosch, Christiane Fellbaum, and Karel Pala<br />

Personal deverbatives<br />

Table 4: D-Relations in Zulu<br />

Prefix of personal class (i.e. noun class 1/2, 7/8 or 9/10) + verb root + suffix -i:<br />

umu/aba (class 1/2; personal class only; most common):<br />

fund (learn) + -i → umfundi "student"<br />

hamb (go, walk) + -i → umhambi "traveller"<br />

theng (buy) + -i → umthengi "customer"<br />

shumayel (preach) + -i → umshumayeli "preacher"<br />

isi/izi (class 7/8; personal as well as impersonal class):<br />

eb (steal) + -i → isebi "thief"<br />

thul (be silent) + -i → isithuli "a mute"<br />

gijim (run) + -i → isigijimi "runner, messenger"<br />

in/izin (class 9/10; personal as well as impersonal class):<br />

bong (praise) + -i → imbongi "royal praiser"<br />

Impersonal deverbatives<br />

Prefix of impersonal class (i.e. noun class 3/4, 5/6, 7/8, 9/10 or 11) + verb root + suffix -o:<br />

umu/imi (class 3/4; impersonal class only): buz (ask) + -o → umbuzo "question" (result)<br />

i(li)/ama (class 5/6; personal as well as impersonal class): ceb (devise, contrive) + -o → icebo "plan, scheme" (result)<br />

isi/izi (class 7/8; personal as well as impersonal class): aphul (break) + -o → isaphulo "rupture" (result)<br />

in/izin (class 9/10; personal as well as impersonal class): phuc (shave) + -o → impuco "razor" (instrument)<br />

u(lu) (class 11; impersonal class only): thand (love) + -o → uthando "love" (abstract)



Impersonal deverbatives indicate the following semantic relations:<br />

a) Instrument of the action signified by the verb<br />

b) Result of an action is conveyed<br />

c) Abstract idea conveyed by the verb<br />

As can be seen from the class prefixes of the impersonal deverbatives above, there is<br />

overlap in the semantic content of the classes (i.e. personal and impersonal), which<br />

makes the choice of the correct class prefix rather unpredictable.<br />

Exceptions to the general rule also occur, e.g. the impersonal noun umsebenzi (umsebenz-i)<br />

“work” is derived from the verb root -sebenz- (work), but uses the “personal”<br />

suffix -i.<br />
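The prefix + root + suffix template described above can be made concrete in code. The following sketch is ours, not the authors': the function name is invented, and only the umu-/um- alternation is modelled, while the rest of Zulu morphophonology is deliberately left out.

```python
def zulu_deverbative(prefix, root, suffix):
    """Naively compose a Zulu deverbative noun: class prefix + verb root + suffix.

    Simplification: only one sound rule is modelled, the reduction of the
    class-1 prefix umu- to um- before polysyllabic stems. Other processes
    (vowel coalescence as in isi- + eb- > isebi, nasal assimilation in
    class 9 as in in- + bong- > imbongi) are not handled here.
    """
    stem = root + suffix
    stem_syllables = sum(ch in "aeiou" for ch in stem)
    if prefix == "umu" and stem_syllables > 1:
        prefix = "um"  # umu- surfaces as um- before polysyllabic stems
    return prefix + root + suffix

# Examples from the paper (personal deverbatives in -i):
print(zulu_deverbative("umu", "fund", "i"))    # umfundi 'student'
print(zulu_deverbative("umu", "fundis", "i"))  # umfundisi 'teacher'
print(zulu_deverbative("isi", "thul", "i"))    # isithuli 'a mute'
```

The point of the sketch is only that the derivation is compositional: a D-relation can be stored as (root, class prefix, suffix) triples rather than as unanalysed word pairs.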

Secondly, derivations that produce a derived form of the same word class include<br />

diminutives, feminine gender, augmentatives and locatives, as illustrated in the following<br />

table:<br />

Table 5: Same word class derivations in Zulu<br />

isitsha (dish) + -ana (diminutive) → isitshana (small dish)<br />

intaba (mountain) + -kazi (augmentative) → intabakazi (big mountain)<br />

imvu (sheep) + -kazi (feminine gender) → imvukazi (ewe)<br />

ikhaya (home) + e- (locative prefix) → ekhaya (at home)<br />

indlu (house) + e- (locative prefix) + -ini (locative suffix) → endlini (in the house)<br />

Although locativised nouns such as ekhaya (at home) may also be used to function<br />

as adverbs, they continue to exhibit certain characteristics of regular nouns, for instance<br />

functioning as subjects and objects and in the process triggering agreement.<br />

3 Similarities and Differences for English, Czech and Bantu<br />

A comparison of the D-relations in three languages indicates that Czech and a Bantu<br />

language such as Zulu are in a certain respect formally closer than Czech and English.<br />

This is due to the rich system of affixes in both languages, though they are not exploited<br />

in the same way in Czech and Zulu. The similarity consists in highly developed prefixation<br />

and suffixation; in Zulu both are used in a way that is typical for agglutinative<br />

languages, in particular for noun prefixes. In Czech prefixation is typical mostly for<br />

verbs and deverbatives which are, in fact, verbs as well.<br />

English also has verbal prefixes (e.g. out- prefixes to intransitive verbs and makes<br />

them transitive: I outran the bear) but makes regular use of separate particles to form<br />

phrasal verbs (look up/down/away, etc.).<br />

What all three languages share is the small number of semantic relations expressed<br />

by morphemes that create new words. The analyses of Czech, English and<br />

Zulu presented here allow us to predict that these D-relations are likely to be universal.<br />

All three languages use morphological processes to regularly and productively



derive such semantic categories as Agent, Instrument, Location, Gender, Diminutiveness,<br />

Augmentation, Result as well as others.<br />

4 D-relations in WordNet among literals (screenshots of Czech and<br />

Princeton WordNets)<br />

The screenshot below indicates how D-relations are visually represented in Czech<br />

[17] and English WordNet using the browser and editor DebVisdic. The example<br />

shows the verb tancovat:1/tančit:1 – dance:1 in Czech WordNet and PWN 2.0. (We<br />

cannot show the verb dance in PWN 3.0 where the respective D-relations are more<br />

complete since it has not been converted yet for browsing in DebVisdic.)<br />

Fig. 1: D-relations in Czech and English WordNet<br />

5 Conclusions<br />

We present an analysis of some basic and highly regular D-relations in English,<br />

Czech and Bantu. It is possible to enrich both the Czech and English WordNets considerably<br />

with derivational nests (subnets), and this kind of enrichment makes these<br />

resources more suitable for applications involving searching. Finally, we<br />

tried to show how the Czech and English experience can be applied in building<br />

WordNets for Bantu languages.



Another motivation for our work comes from the hypothesis that the derivational<br />

relations and derivational subnets reflect basic cognitive structures expressed in natural<br />

language. Such structures should be explored also in terms of ontological work.<br />

We hope that the work reported here will stimulate similar work in other languages<br />

and allow insights into their morphological processes as well as facilitate the computational<br />

representation and treatment of crosslinguistic morphological processes and<br />

relations.<br />

References<br />

1. Berko Gleason, J. (1958). The Child's Learning of English Morphology. Word 14:150-<br />

77.<br />

2. Bosch, S., Fellbaum, C., Pala, K., and Vossen, P. (2007). African Languages WordNet: Laying<br />

the Foundations. Presented at the 12th International Conference of the African Association<br />

for Lexicography (AFRILEX), Soshanguve, South Africa.<br />

3. Clark, E. and Clark, H. (1979). When nouns surface as verbs. Language 55, 767-811.<br />

4. Clark, P., Harrison, P., Thompson, J., Murray, W., Hobbs, J., and Fellbaum, C. (2007). On<br />

the Role of Lexical and World Knowledge in RTE3. ACL-PASCAL Workshop on Textual<br />

Entailment and Paraphrases, June 2007, Prague, CZ.<br />

5. Doke, Clement M. (1973). Textbook of Zulu Grammar. Johannesburg: Longman Southern<br />

Africa.<br />

6. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT<br />

Press.<br />

7. Fellbaum, C., Osherson, A., and Clark, P.E. (2007). Adding Semantics to WordNet's<br />

"Morphosemantic" Links. In: Proceedings of the Third Language and Technology<br />

Conference, Poznan, Poland, October 5-7, 2007.<br />

8. Fillmore, C. (1968). The Case for Case. In: Bach, E., and R. Harms (Eds.) Universals in<br />

linguistic theory. NY: Holt.<br />

9. Habash, N. and Dorr, B. (2003). A Categorial Variation Database for English. Proceedings<br />

of the North American Association for Computational Linguistics, Edmonton, Canada, pp.<br />

96-102, 2003.<br />

10. Karlík, P. et al. (1995). Příruční mluvnice češtiny (Everyday Czech Grammar),<br />

Nakladatelství Lidové Noviny, Prague, pp. 229, 310.<br />

11. Kosch, I.M. (2006). Topics in Morphology in the African Language Context. Pretoria:<br />

Unisa Press.<br />

12. Levin, B. (1993). English Verb Classes and Alternations. Chicago, IL: University of<br />

Chicago Press.<br />

13. Marchand, H. (1969). The categories and types of present-day English word formation.<br />

Munich: Beck.<br />

14. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the<br />

ACM. 38.11:39-41.<br />

15. Miller, G. A. and Fellbaum, C. (2003). Morphosemantic links in WordNet. Traitement<br />

automatique de langue, 44.2:69-80.<br />

16. Moropa, K., Bosch, S., and Fellbaum, C. (2007). Introducing the African Languages<br />

WordNet. Presented at the 14th International Conference of the African Language Association<br />

of Southern Africa, Nelson Mandela Metropolitan University, Port Elizabeth, South Africa.<br />

17. Pala, K. and Hlaváčková, D. (2007). Derivational Relations in Czech WordNet. In:<br />

Proceedings of the Workshop on Balto-Slavonic NLP, ACL, Prague, 75-81.



18. Palmer, M., Rosenzweig, J., Dang, H. T. et al. (1998). Investigating regular sense extensions<br />

based on intersective Levin classes. In Coling/ACL-98, 36th Association of Computational<br />

Linguistics Conference, Montreal, vol. 1, pp. 293-300.<br />

19. Peters, W. (n.d.) The English WordNet, EWN Deliverable D032D033. University of<br />

Sheffield, England.<br />

20. Pustejovsky, J. (1995). The Generative Lexicon. Cambridge, MA: MIT Press.<br />

21. Sedláček, R. and Smrž, P. (2001). A New Czech Morphological Analyser Ajka. Proceedings<br />

of the 4th International Conference on Text, Speech and Dialogue, Springer Verlag, Berlin,<br />

pp. 100-107.<br />

22. WordNet, a lexical database for the English language. (2006). Available at: http://wordnet.princeton.edu/ [Accessed on 10 September 2007].<br />

23. Vossen, P. (Ed.) (1998). EuroWordNet. Dordrecht, Holland: Kluwer.


On the Categorization of Cause and Effect in WordNet<br />

Cristina Butnariu and Tony Veale<br />

School of Computer Science and Informatics,<br />

University College Dublin, Dublin 4, Ireland<br />

{Ioana.Butnariu, Tony.Veale}@ucd.ie<br />

Abstract. The task of detecting causal connections in text would benefit greatly<br />

from a comprehensive representation of Cause and Effect in WordNet, since<br />

previous studies show that semantic abstractions play an important role in the<br />

linguistic detection of semantic relations, in particular the cause-effect relation.<br />

Based on these studies on causality, and on our own general intuitions about<br />

causality, we propose a cover-set of different WordNet categories to represent<br />

the ontological classes of Cause and Effect. We also propose a corpus-based<br />

approach to the population of these categories, whereby candidate words and<br />

senses are identified in a large corpus (such as the Google N-gram corpus)<br />

using specific syntagmatic patterns. We describe experiments using the Cause-<br />

Effect dataset from the 2007 SemEval workshop to evaluate the most effective<br />

combinations of WordNet categories and corpus data. Ultimately, we propose<br />

extending the WordNet category of Causal-Agent with the word-senses<br />

identified by this experimental exploration.<br />

Keywords: semantic relations, WN categorization, cause, effect, causality,<br />

syntagmatic patterns.<br />

1 Introduction<br />

Causality plays a fundamental role in textual inference, not just because it is intrinsic<br />

to notions of cause and effect, but also because it is central to the meaning of artifacts,<br />

agents, products (whether physical or abstract) and even natural phenomena. Artifacts<br />

possess a purpose, or telicity, that is causally defined, while agents are often defined<br />

by the products that they cause to exist, and natural phenomena like storms and other<br />

acts of God are typically conceptualized as intentional processes. Since each of these<br />

notions – agents, artifacts, products and natural phenomena – is explicitly<br />

represented and richly specialized in a lexical ontology like WordNet [4], one can ask<br />

whether the concepts of Cause and Effect can and should be as richly represented in<br />

WordNet. Of course, since these concepts correspond to the nouns “cause” and<br />

“effect”, they clearly are represented in WordNet. Indeed, WordNet represents<br />

different nuances of these concepts, distinguishing between cause-as-agent (or<br />

{causal-agent}) and cause-as-reason (or {cause, reason, grounds}), and effect-as-outcome<br />

and effect-as-symptom.<br />

Nonetheless, these attempts at ontologizing causality are simultaneously too<br />

coarse-grained – insofar as they admit of too many specializations that are not



meaningfully represented as causes or effects – and too under-developed – insofar as<br />

they are little more than ontological place-holders that have few meaningful<br />

specializations. For instance, because WordNet defines the concept Causal-agent as a<br />

hypernym of Person, concepts like Victim, Martyr and Casualty will be seen<br />

indirectly as agents of their own state, even when this view is counter to their true<br />

meaning (these concepts are clearly better defined as causal-patients, though WordNet<br />

lacks such a concept). Likewise, WordNet categorizes antacids and other medicinally<br />

helpful substances as causal agents, but denies this classification to<br />

unhelpful substances such as poisons and allergens, as well as to harmful weather<br />

phenomena (such as storms and earthquakes) that are readily conceptualized as major<br />

causes by humans. Similarly, WordNet 2.1 only provides four possible specializations<br />

of the symptom meaning of Effect when any number of other WordNet concepts can,<br />

in the right circumstance, be seen as symptoms. Indeed, only 30% of the concepts<br />

whose WordNet 2.1 gloss contains the phrase “that causes” are categorized as causal<br />

agents in WordNet, even though all of these concepts are valid examples of causal agency.<br />

WordNet would clearly benefit then from considerable house-cleaning under its<br />

categories of Cause (and Causal-Agent) and Effect. In this paper, we consider the<br />

effectiveness of WordNet in recognizing and capturing cause and effect relationships,<br />

by focusing on the cause-effect relation in the recent SemEval semantic-relations task<br />

(see [7]). While virtually all entrants in this task adopted a supervised machine-learning<br />

approach to the problem of detecting relations such as cause-effect between<br />

noun-pairs, we consider here how well WordNet, without training, can perform on<br />

this task when its basic causal repertoire is augmented with causally-indicative<br />

syntagmatic cues from a large corpus. In section 2 we briefly describe past-work on<br />

this topic, before presenting a purely WordNet-based approach to cause and effect in<br />

section 3. Causality is a highly contextual notion: a dinner plate is an effect (product)<br />

in the context of its construction, and a cause of pain when used as a projectile in the<br />

context of a domestic argument (see [12]). WordNet cannot hope to anticipate or<br />

reflect all of these contexts, but the language used in context-specific corpora may<br />

well reflect these causal nuances. In section 4 then, we present a corpus-based<br />

approach to identifying possible causes and effects in terms of lexico-syntactic<br />

patterns. Section 5 then presents an empirical evaluation of this corpus/WordNet<br />

combination. The paper concludes with some closing remarks in section 6.<br />

2 Past Work<br />

There have been many attempts in the computational linguistic communities to define<br />

and understand the Causality relation. Nastase in [11] defines causality as a general<br />

class of relations that describe how two occurrences influence each other. Further, she<br />

proposes the following sub-relations of causality: cause, effect, purpose, entailment,<br />

enablement, detraction and prevention. She states that semantic relations can be<br />

expressed in different syntactic forms, at different syntactic levels. Hearst [8] states<br />

that “certain lexico-syntactic patterns unambiguously indicate certain semantic<br />

relations”. The key issue then is to discover the most efficient patterns that indicate a<br />

certain semantic relation. These patterns can be either manually specified by linguists



or discovered automatically from corpora. For instance, the subject-verb-object<br />

lexico-syntactic pattern (where subject and object are noun-phrases) was used in [3] to<br />

detect causal relations in text, and from these patterns, automatically construct<br />

Bayesian networks for causal inference.<br />
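Pattern-based detection of this kind is straightforward to sketch with regular expressions. The cue patterns below are illustrative stand-ins of our own, not the inventories used by Girju or by [3]:

```python
import re

# Illustrative lexico-syntactic cues for Cause-Effect(NP1, NP2);
# these three patterns are stand-ins, not the cited inventories.
CAUSAL_PATTERNS = [
    re.compile(r"(?P<cause>\w+) causes (?P<effect>\w+)"),
    re.compile(r"(?P<effect>\w+) (?:is|are) caused by (?P<cause>\w+)"),
    re.compile(r"(?P<cause>\w+)-induced (?P<effect>\w+)"),
]

def find_causal_pairs(sentence):
    """Return (cause, effect) pairs matched by any cue pattern."""
    pairs = []
    for pattern in CAUSAL_PATTERNS:
        for m in pattern.finditer(sentence.lower()):
            pairs.append((m.group("cause"), m.group("effect")))
    return pairs

print(find_causal_pairs("Smoking causes cancer."))
print(find_causal_pairs("a drug-induced headache"))
```

A realistic system would of course work over parsed noun-phrases rather than single tokens, which is precisely where the ambiguity discussed in the text arises.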

Girju proposes in [5] a classification of lexical patterns for mining instances of the<br />

causality relation from corpora, and describes a semi-automatic method to discover<br />

new patterns. She uses a general pattern in combination with<br />

WordNet to impose semantic restrictions on NP1 (the Cause category) and NP2 (the<br />

Effect category). She defines the classes of Cause and Effect in WordNet terms as a<br />

patchwork of different synsets/categories. For Effect, she proposes a cover-set<br />

comprising the following synsets: {human_action, human_activity, act},<br />

{phenomenon}, {state}, {psychological_feature} and {event}. However, she observes<br />

that the Cause class is harder to define in such terms of WordNet categories, since the<br />

notion of causality is frequently entwined with, and difficult to separate from, that of<br />

metonymy (e.g., does the poison cause death or the poisoner, or both? The gun or the<br />

gun-man?). She thus relies entirely on the intuitions already encoded in WordNet<br />

under the category of Causal-Agent. Girju then ranks the output patterns into five<br />

categories, according to their degree of ambiguity. She reports a precision of 68%<br />

when applying these patterns to a terrorism corpus.<br />

The SemEval-2007 task 4 (see [7]) concerned itself with the classification of<br />

semantic relations between pairs of words in a given context. Seven semantic<br />

relations were proposed and a training dataset for each semantic relation (comprising<br />

positive and negative examples, the latter in the form of near misses) was collected<br />

from the web and classified by two human judges. The relation that interests us here<br />

is the Cause-Effect relation, which the task authors define as follows: "Cause-<br />

Effect(X,Y) is true for a sentence S if X and Y appear close in the syntactic structure<br />

of S and the situation described in S entails that X is the cause of Y." There are some<br />

restrictions imposed on X and Y: "X and Y can be a nominal denoting an event, state,<br />

activity or an entity, as a metonymic expression of an occurrence.” The data-set for<br />

this relation comprises 220 noun pairs (with WordNet sense-tags and associated<br />

context fragments), of which 114 pairs are positive exemplars and 106 are negative<br />

"near-miss" exemplars.<br />

3 Defining Cause and Effect in WordNet terms<br />

Following Girju, we should intuitively expect a variety of high-level WordNet<br />

abstractions to encompass a range of concepts that play an enabling role in achieving<br />

certain ends, and thus to contribute to the cover-set that defines the class of Causes.<br />

Recall that Girju limits the definition of Cause to the WordNet category<br />

{causal_agent}, a snapshot of which is presented in Figure 1.



[Figure: fragment of the {causal_agent} taxonomy, with descendants including agent, lethal_agent, biological_agent, cause_of_death and relaxer]<br />

Fig. 1. The figure shows a fragment of the taxonomy for the lexical concept<br />

{causal_agent} in WordNet.<br />

In contrast, we broaden the cover-set of Causes to include the following WordNet<br />

categories and their descendants: {causal_agent}, {psychological_feature},<br />

{attribute}, {substance} (insofar as many are biological causal-agents),<br />

{phenomenon}, {communication} (insofar as they can drive agents to action),<br />

{natural_action} and {organic_process}. In turn, the class of Effects should<br />

include: {psychological_feature}, {attribute}, {physical_process}, {phenomenon},<br />

{natural_action}, {possession} and {organic_process}. The two cover-sets are similar<br />

because causes and effects typically interact as part of complex causal chains, so the<br />

causes of one effect are often themselves the effects of prior causes.<br />
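Operationally, membership in a cover-set is a hypernym-closure test: does any of a noun's top senses lie at or below one of the cover-set synsets? The sketch below substitutes a tiny hand-built taxonomy for WordNet, so the links shown are illustrative, not actual WordNet structure:

```python
# Tiny stand-in for the WordNet hypernym hierarchy (illustrative links only).
HYPERNYM = {
    "virus": "causal_agent",
    "storm": "phenomenon",
    "anger": "psychological_feature",
    "causal_agent": "entity",
    "phenomenon": "entity",
    "psychological_feature": "entity",
}

# Cover-set for the class of Causes, as proposed in the text.
CAUSE_COVER_SET = {
    "causal_agent", "psychological_feature", "attribute", "substance",
    "phenomenon", "communication", "natural_action", "organic_process",
}

def falls_under(noun, cover_set, taxonomy=HYPERNYM):
    """Walk the hypernym chain of `noun`; True if it meets the cover-set."""
    node = noun
    while node is not None:
        if node in cover_set:
            return True
        node = taxonomy.get(node)
    return False

print(falls_under("storm", CAUSE_COVER_SET))   # True: storm -> phenomenon
print(falls_under("entity", CAUSE_COVER_SET))  # False
```

With a real WordNet back-end the same test would be run over the hypernym closure of each of the noun's top two synsets.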

It is worth considering how well these WordNet-based cover-sets correspond to the<br />

exemplars of the SemEval dataset. Figure 2 reveals the coverage obtained for both the<br />

positive and negative exemplars by each WordNet category in the class of Causes.<br />

Note how the category Causal-Agent offers very little coverage for the positive<br />

exemplars (i.e., most of the actual causes in that data-set are not categorized as causal-agents<br />

in WordNet), and actually offers higher coverage for the negative exemplars<br />

(making it more likely to contribute to a classification error in the case of a near-miss).



Fig. 2. The coverage (%) offered by different WordNet categories for SemEval positive and<br />

negative exemplars of the Cause class.<br />

Figure 3 presents a comparable analysis for the WordNet categories that comprise<br />

the cover-set for the class of Effects. Note how the category {psychological_feature}<br />

looms large as both a Cause and an Effect in the SemEval data-set.<br />

Fig. 3. The coverage (%) offered by different WordNet categories for SemEval positive and<br />

negative exemplars of the Effect class.



4 Defining Cause and Effect in Syntagmatic terms<br />

Girju in [5] notes that certain lexico-syntactic patterns are indicative of causal<br />

relations in text, but that some patterns are more ambiguous than others. For instance,<br />

the patterns "NP2-causing NP1" and "NP1-caused NP2" are explicit and largely<br />

unambiguous cues to the interpretation of NP1 as a cause and NP2 as an effect. In<br />

contrast, Girju notes that "NP2-inducing NP1" and "NP2-generated NP1" are equally<br />

explicit but potentially more ambiguous patterns for identifying cause and effect in<br />

text. Nonetheless, the pattern "NP-induced NP" does occur quite frequently in large<br />

corpora, and does designate causes with high accuracy and low ambiguity. However,<br />

this triple of "NP-induced/inducing NP" produces a sparse space of associations<br />

between different causes and effects, so it is more productive to consider each noun-phrase<br />

in isolation.<br />

Thus, we look for the patterns "Noun-inducing" and "Noun-causing" in a large<br />

corpus to identify those nouns that can denote effects, as in the phrase "headache-inducing".<br />

Our corpus is the set of Google N-grams (see [1]), from which the above<br />

pairings can easily be mined. Similarly, we mine the patterns "Noun-induced" and<br />

"Noun-caused" from these n-grams to identify a large set of nouns that can denote<br />

causes, as in "caffeine-induced". In addition, we look to the patterns "-induced Noun"<br />

and "-caused Noun" to identify a further collection of possible effect nouns, and the<br />

patterns "-inducing Noun" and "-causing Noun" to identify further cause nouns. In this<br />

way, we obtain 3,500+ nouns as denoting potential causes, and 4,200+ nouns as<br />

denoting potential effects. Table 1 presents the top-ranked (by frequency) causes and<br />

effects in this data, as well as the top-ranked causality pairs (i.e., cause associated<br />

with specific effect).<br />

Table 1. Top-ranked (by frequency) cause-effect pairs, as well as<br />

isolated causes and isolated effects.<br />

CAUSE-EFFECT pairs CAUSE nouns EFFECT nouns<br />

(organism, disease) drug apoptosis<br />

(laser, fluorescence) stress disease<br />

(noise, hearing) radiation cancer<br />

(chemical, cancer) exercise changes<br />

(agent, cancer) self cell<br />

(exercise, asthma) laser increase<br />

(collagen, arthritis) human activation<br />

(bacteria, disease) acid asthma<br />

(pregnancy, hypertension) light inhibition<br />

(human, climate) virus odor<br />
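The mining step described above reduces to a scan over n-gram records for the affixal patterns. A minimal sketch, in which the record format (string, frequency) and the two-pattern inventory are our simplifying assumptions:

```python
import re
from collections import Counter

# "Noun-induced/-caused" marks the noun as a cause;
# "Noun-inducing/-causing" marks it as an effect.
CAUSE_MARK = re.compile(r"^([a-z]+)-(?:induced|caused)$")
EFFECT_MARK = re.compile(r"^([a-z]+)-(?:inducing|causing)$")

def mine_ngrams(entries):
    """Count candidate cause/effect nouns from (ngram, frequency) records."""
    causes, effects = Counter(), Counter()
    for ngram, freq in entries:
        for token in ngram.lower().split():
            cause = CAUSE_MARK.match(token)
            effect = EFFECT_MARK.match(token)
            if cause:
                causes[cause.group(1)] += freq
            elif effect:
                effects[effect.group(1)] += freq
    return causes, effects

# Hypothetical records standing in for Google N-gram entries:
sample = [("drug-induced liver failure", 120), ("headache-inducing noise", 40)]
causes, effects = mine_ngrams(sample)
print(causes.most_common(1))   # [('drug', 120)]
print(effects.most_common(1))  # [('headache', 40)]
```

Note that this sketch does not reproduce the "-induced Noun" leakage discussed below; it only handles the tighter "Noun-induced/-inducing" micro-contexts.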

Because the Google N-grams corpus is not sense-tagged, we can only guess at the<br />

senses of the nouns in Table 1. However, if we assume that each noun is used in one<br />

of its two most frequent senses, then we can assign these nouns to various WordNet<br />

categories, as we did for the SemEval nouns in Figures 2 and 3. Following this<br />

heuristic assignment of senses, Figure 4 presents the distribution of cause nouns to<br />

different WordNet cause categories.



Fig. 4. The distribution of corpus-mined cause nouns to WordNet categories.<br />

A comparable distribution for effect nouns is displayed in Figure 5.<br />

Fig. 5. The distribution of corpus-mined effect nouns to WordNet categories.<br />

Because some noun senses belong to multiple categories, and because we use the<br />

two most frequent senses of each noun, the sum total of distributions in Figures 2 to 5<br />

may exceed 100%. Note also that certain patterns are noisier than others. While<br />

"Noun-inducing" is a tight and rather unambiguous micro-context in which to<br />

recognize Noun as an effect, "-induced Noun" is more prone to leakage. For instance,<br />

"drug-induced liver failure" yields "drug" as an unambiguous cause, but mistakenly<br />

suggests "liver" as an effect. Given that "Noun-induced" is a more frequent pattern<br />

than "Noun-inducing", the set of nouns designated as effects is noisier than the set of<br />

nouns designated as causes. For this reason, the Other category in Figure 5 is more<br />

populous than the Other category in Figure 4. The most frequently misclassified<br />

nouns in the Effect class are: protein, liver, gene, lung, acute, platelet, insulin,<br />

diabetic, skin, calcium, rat, cytotoxicity, genes, immune, and bone.



5 Empirical results<br />

We can test the approaches of sections 3 and 4 in a variety of guises and combinations:<br />

The WordNet-only approach (as described in section 3): a word pair can<br />

be classified as a Cause-Effect pairing if and only if any of the two most frequent<br />

senses of X fall under a synset in the Cause cover-set and any of the two most<br />

frequent senses of Y fall under a synset in the Effect cover-set.<br />

The Corpus-only approach (as described in section 4): a word pair can be<br />

classified as a Cause-Effect pairing if and only if X is found in the set of nouns that<br />

have been identified as cause nouns (e.g., because the pattern "X-induced" was found<br />

in the corpus) and Y is found in the set of effect nouns (e.g., because the pattern "Y-inducing"<br />

or "-induced Y" was found in the corpus). In our experiments we test two<br />

different sets of corpus-mining patterns: a minimal set based on just two causation<br />

verbs, induce and cause, and an extended set comprising variations of the verbs<br />

induce, cause, power, fuel, activate, enable, control and operate.<br />

The Hybrid approach (WordNet used in combination with corpus-derived data):<br />

a word pair can be classified as a Cause-Effect pairing if any of the two most<br />

frequent senses of X fall under a synset in the Cause cover-set and a synonym of one of<br />

these two senses (i.e., any word from the same two synsets) is found in the set of<br />

corpus-derived cause nouns, and if any of the two most frequent senses of Y fall<br />

under a synset in the Effect cover-set and a synonym of one of these two senses of Y (or<br />

Y itself) is found in the set of effect nouns. The hybrid approach is thus a logical<br />

conjunction of the WordNet and corpus approaches, but one that includes synonyms<br />

of the words X and Y, so the corpus-data of the latter is effectively smoothed and<br />

made less sparse.<br />
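The three decision rules can be written down directly. In this sketch the cover-set test, the corpus noun sets and the synonym lookup are passed in as stand-ins for their WordNet- and corpus-backed counterparts:

```python
def wordnet_only(x, y, cause_cover, effect_cover, falls_under):
    # Approach A: both members must fall under the respective cover-sets.
    return falls_under(x, cause_cover) and falls_under(y, effect_cover)

def corpus_only(x, y, cause_nouns, effect_nouns):
    # Approaches B/D: both members must appear in the mined noun sets.
    return x in cause_nouns and y in effect_nouns

def hybrid(x, y, cause_cover, effect_cover, falls_under,
           cause_nouns, effect_nouns, synonyms):
    # Approaches C/E: cover-set test, with corpus evidence smoothed
    # over the synonyms of x and y.
    corpus_cause = any(w in cause_nouns for w in synonyms(x) | {x})
    corpus_effect = any(w in effect_nouns for w in synonyms(y) | {y})
    return (falls_under(x, cause_cover) and corpus_cause
            and falls_under(y, effect_cover) and corpus_effect)

# Toy stand-ins (hypothetical data, for illustration only):
falls = lambda w, cover: w in cover
syns = lambda w: {"virus": {"microbe"}}.get(w, set())
print(hybrid("virus", "disease", {"virus"}, {"disease"}, falls,
             {"microbe"}, {"disease"}, syns))  # True, via synonym "microbe"
```

The toy example shows the smoothing effect: "virus" itself was never mined as a cause noun, but its synonym "microbe" was, so the hybrid rule still fires.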

Table 2 presents empirical results for each of these approaches on the SemEval<br />

cause-effect data-set and the All-true baseline which always guesses “true” (and<br />

thereby maximizes recall). Interestingly, the WordNet-only approach has the best<br />

overall performance (F-score), which accords with the observations of the SemEval<br />

organizers: the statistics show that WordNet plays an important role in the task of<br />

relation classification.<br />

Table 2. Empirical results for cause-effect in SemEval data-set, where F = 2*P*R / (P+R).<br />

A. WordNet-only approach: P = 61.3, R = 85, F = 71.3 (220 pairs)<br />

B. Corpus-only approach, using {induce, cause} patterns: P = 54, R = 60, F = 62.3 (220 pairs)<br />

C. Hybrid A+B approach: P = 63.5, R = 70, F = 66.8 (220 pairs)<br />

D. Corpus-only approach, using {induce, cause, power, fuel, activate, enable, control, operate} patterns: P = 51.6, R = 83, F = 63.6 (220 pairs)<br />

E. Hybrid A+D approach: P = 60, R = 85, F = 70.3 (220 pairs)<br />

All-true baseline: P = 51.8, R = 100, F = 68.2 (220 pairs)
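For reference, the F-score in the caption is the harmonic mean of precision and recall; plugging in the rounded P and R of the WordNet-only row reproduces the reported figure up to rounding:

```python
def f_score(p, r):
    """F = 2*P*R / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# WordNet-only row: P = 61.3, R = 85.
print(round(f_score(61.3, 85.0), 1))  # 71.2 (71.3 is reported, presumably from unrounded P/R)
```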



5.1 Analysis of Results<br />

As the corpus yields a somewhat sparse and noisy data set of candidate cause and<br />

effect nouns, the corpus approach (B) that uses just cause and induce as causal<br />

markers achieves only 60% recall, with a low precision of 54%. The WordNet<br />

contribution in the Hybrid A+B approach boosts recall by 10% while also increasing<br />

precision. Recall is improved since the sparse corpus data is extrapolated by the use of<br />

WordNet synonyms; precision is also improved somewhat over that of the WordNet-only<br />

approach (A) and the simple corpus approach (B) because WordNet’s category<br />

restrictions help to filter out some noisy and misclassified effect nouns. Nonetheless,<br />

there is need for more corpus data to increase the recall of the hybrid approach even<br />

further. In the second corpus approach (D), recall is boosted by using patterns based<br />

on a broader list of causative verbs (see [9]) to identify cause and effect nouns:<br />

{induce, cause, power, fuel, activate, enable, control, operate}. Note that when<br />

WordNet Cause and Effect categories of (A) are used to filter noisy classifications in<br />

the hybrid approaches, this imposes a WordNet-based ceiling of 85% (i.e., the recall<br />

of A) on the recall of the hybrid approaches: the tradeoff results in a lower precision<br />

but a better F-measure overall.<br />

Each approach in Table 2 (WordNet-alone, corpus-alone, and the combination of<br />

both) is unsupervised and does not avail of the WN sense information provided for<br />

nouns in the SemEval data-set. Our best F-measure is 71.3% and is comparable with<br />

the 72% F-measure obtained by the best performing system in the corresponding<br />

SemEval category (i.e., category A, in which competing systems do not avail of<br />

WordNet sense tags). The relatively low precision is largely explained by the fact that<br />

SemEval's negative examples are near misses rather than random examples of non-causal<br />

relationships. Our recorded precision is thus a lower bound for what one might<br />

expect on random word-pairings drawn from a real text.<br />
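The corpus side of the approaches compared above rests on lexical patterns of the form "X causes Y". The sketch below is our deliberately naive illustration of that harvesting step, not the authors' implementation: the verb list mirrors approach B, and single-token matching over raw text is an assumption of this sketch.<br />

```python
import re

# Surface patterns for approach B; approach D would extend this tuple with
# power, fuel, activate, enable, control, and operate forms.
CAUSAL_VERBS = ("causes", "caused", "induces", "induced")

def extract_cause_effect(text):
    """Harvest candidate (cause, effect) noun pairs via 'X <verb> Y' patterns."""
    pairs = set()
    lowered = text.lower()
    for verb in CAUSAL_VERBS:
        for match in re.finditer(r"\b(\w+) %s (\w+)\b" % verb, lowered):
            pairs.add((match.group(1), match.group(2)))
    return pairs

print(sorted(extract_cause_effect("Smoking causes cancer. Stress induced insomnia.")))
```

Pairs harvested this way are sparse and noisy, which is precisely why the hybrid approaches filter them through WordNet's Cause and Effect categories.<br />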

6 Concluding Remarks<br />

In this paper we presented three unsupervised approaches to the classification of<br />

causal-relations among noun-pairs: a corpus-based approach, an ontological<br />

WordNet-based approach, and a combination of both. The results achieved by these<br />

approaches on the SemEval dataset are encouraging, especially given the fact that<br />

these approaches do not apply machine-learning techniques to a training data-set. The<br />

WordNet categories which form the substance of the ontological approach, and which<br />

also contribute substantially to the combined approach, are hand-picked based on<br />

human intuitions about causality. However, a machine-learning approach to<br />

identifying these categories automatically is a topic of current research. As reflected<br />

in the superior performance of the WordNet-only approach, WordNet does have the<br />

capability to accurately represent high-level abstractions like Cause and Effect, and to<br />

do so in a non-trivial way that spans large numbers of more specific concepts.<br />

Nonetheless, our results also bear out our initial observation that the WordNet<br />

category of Causal-agent is very weakly represented and in serious need of reorganization,<br />

at least if it is to properly serve its intended purpose. In the SemEval


100 Cristina Butnariu and Tony Veale<br />

data analyzed here, the {causal_agent} category covers only 2% of the Cause<br />

instances in the positive exemplar set, and just 8% of the negative "near-miss"<br />

exemplars. Extension to this WordNet category can clearly be performed using<br />

intuition-guided ontological-engineering as well as corpus-based discovery. Based on<br />

our results then, we might ask which WordNet concepts should be included under the<br />

newly organized umbrella term of Causal-Agent, and under a new category, Causal-Patient?<br />

We suggest the word senses that satisfy approach E will make excellent<br />

candidates to populate these categories.<br />

We next plan to extend the general approach described here to other classes of<br />

semantic relation, such as Content-Container, Part-Whole and Tool-Purpose, since<br />

these too combine a strong ontological dimension to their meaning with a strong<br />

usage-based (i.e., corpus-based) dimension. Overall, our results confirm that WordNet<br />

has a significantly useful role to play in the detection of semantic relations in text, but<br />

detection would be more efficient if WordNet could provide more insightful<br />

ontological classifications of the concepts underlying these relations. These<br />

ontological insights will come from using the existing structures of WordNet to<br />

hypothesize about, and filter, large quantities of relevant usage data in a corpus.<br />

References<br />

1. Brants, T., Franz, A.: Web 1t 5-gram version 1. Linguistic Data Consortium (2006)<br />

2. Butnariu, C., Veale, T.: A hybrid model for detecting semantic relations between noun pairs<br />

in text. In: Proceedings of SemEval 2007, the 4th International Workshop on Semantic<br />

Evaluations. ACL 2007 (2007)<br />

3. Cole, S., Royal, M., Valorta, M., Huhns, M., Bowles, J.: A Lightweight Tool for<br />

Automatically Extracting Causal Relationships from Text. In: Proceedings of IEEE (2006)<br />

4. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA<br />

(1998)<br />

5. Girju, R.: Text Mining for Semantic Relations. PhD. Dissertation, University of Texas at<br />

Dallas (2002)<br />

6. Girju, R., Moldovan, M.: Text mining for causal relations. In: Proceedings of the FLAIRS<br />

Conference, pp. 360–364 (2002)<br />

7. Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P.: SemEval 2007 Task 04:<br />

Classification of Semantic Relations between Nominals. In: Proceedings of SemEval 2007,<br />

the 4th International Workshop on Semantic Evaluations. ACL 2007 (2007)<br />

8. Hearst, M.: Automated Discovery of WordNet Relations. In: WordNet: An Electronic<br />

Lexical Database and Some of its Applications. MIT Press (1998)<br />

9. Khoo, C., Kornfilt, J., Oddy, R., Myaeng, S.H.: Automatic extraction of cause-effect<br />

information from newspaper text without knowledge-based inferencing. J. Literary &<br />

Linguistic Computing, 13(4), 177–186 (1998)<br />

10. Lewis, D.: Evaluating text categorization. In: Proceedings of the Speech and Natural<br />

Language Workshop, pp. 312–318. Asilomar (1991)<br />

11. Nastase, V.: Semantic Relations Across Syntactic Levels. PhD Dissertation, University of<br />

Ottawa (2003)<br />

12. Veale, T., Hao, Y.: A context-sensitive framework for lexical ontologies. The Knowledge<br />

Engineering Review Journal. Cambridge University Press (in press) (2006)


Evaluation of Synset Assignment<br />

to Bi-lingual Dictionary<br />

Thatsanee Charoenporn 1 , Virach Sornlertlamvanich 1 , Chumpol Mokarat 1 ,<br />

Hitoshi Isahara 2 , Hammam Riza 3 , and Purev Jaimai 4<br />

1 Thai Computational Linguistics Lab., NICT Asia Research Center,<br />

Thailand Science Park, Pathumthani, Thailand<br />

{thatsanee, virach, chumpol}@tcllab.org<br />

2 National Institute of Information and Communications Technology,<br />

3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan<br />

isahara@nict.go.jp<br />

3 IPTEKNET, Agency for the Assessment and Application of Technology,<br />

Jakarta Pusat 10340, Indonesia<br />

hammam@iptek.net.id<br />

4 Center for Research on Language Processing, National University of Mongolia,<br />

Ulaanbaatar, Mongolia<br />

purev@num.edu.mn<br />

Abstract. This paper describes automatic WordNet synset assignment to<br />

existing bi-lingual dictionaries of languages having limited lexical information.<br />

Generally, a term in a bi-lingual dictionary is provided with very limited<br />

information such as part-of-speech, a set of synonyms, and a set of English<br />

equivalents. This type of dictionary is comparatively reliable and can be found<br />

in an electronic form from various publishers. In this paper, we propose an<br />

algorithm for applying a set of criteria to assign a synset with an appropriate<br />

degree of confidence to the existing bi-lingual dictionary. We show the<br />

efficiency of nominating synset candidates by using the most common<br />

lexical information. The algorithm is evaluated through its implementation for<br />

Thai-English, Indonesian-English, and Mongolian-English bi-lingual<br />

dictionaries. The experiment also shows the effectiveness of using the same<br />

type of dictionary from different sources.<br />

Keywords: synset assignment<br />

1 Introduction<br />

The Princeton WordNet (PWN) [1] is one of the most semantically rich English<br />

lexical databases that are widely used as a lexical knowledge resource in many<br />

research and development topics. The database is divided by part of speech into noun,<br />

verb, adjective and adverb, organized into sets of synonyms, called synsets, each of<br />

which represents a "meaning" of the word entry. PWN has been successfully used in<br />

many applications, e.g., word sense disambiguation, information retrieval, text<br />

summarization, text categorization, and so on. Inspired by this success, many


102 Thatsanee Charoenporn et al.<br />

languages attempt to develop their own WordNets using PWN as a model, for<br />

example 1 , BalkaNet (Balkans languages), DanNet (Danish), EuroWordNet (European<br />

languages such as Spanish, Italian, German, French, English), Russnet (Russian),<br />

Hindi WordNet, Arabic WordNet, Chinese WordNet, Korean WordNet and so on.<br />

Though WordNet was already used as a starting resource for developing many<br />

language WordNets, the construction of a WordNet for a language can vary<br />

according to the availability of language resources. Some were developed from<br />

scratch, and some were developed from the combination of various existing lexical<br />

resources. Spanish and Catalan Wordnets [2], for instance, are automatically<br />

constructed using hyponym relation, a monolingual dictionary, a bilingual dictionary<br />

and taxonomy [3]. Italian WordNet [4] is semi-automatically constructed from<br />

definitions in a monolingual dictionary, a bilingual dictionary, and WordNet glosses.<br />

Hungarian WordNet uses a bilingual dictionary, a monolingual explanatory<br />

dictionary, and a Hungarian thesaurus in its construction [5].<br />

This paper presents a new method to facilitate WordNet construction by using<br />

existing resources having only English equivalents and lexical synonyms. Our<br />

proposed criteria and algorithm for application are evaluated by implementing them<br />

for Asian languages, which exhibit quite different linguistic phenomena in terms of<br />

grammar and word units.<br />

To evaluate our criteria and algorithm, we use the PWN version 2.1 containing<br />

207,010 senses classified into adjective, adverb, verb, and noun. The basic building<br />

block is a “synset” which is essentially a context-sensitive grouping of synonyms<br />

which are linked by various types of relation such as hyponymy, hypernymy,<br />

meronymy, antonymy, attribute, and modification. Our approach is designed to<br />

assign a synset to a lexical entry by considering its English equivalents and lexical<br />

synonyms. The degree of reliability of the assignment is defined in terms of<br />

confidence score (CS) based on our assumption of the membership of the English<br />

equivalent in the synset. A dictionary from a different source is also a reliable source<br />

to increase the accuracy of the assignment because it can improve the completeness of<br />

the list of English equivalents and lexical synonyms.<br />

The rest of this paper is organized as follows: Section 2 describes our criteria for<br />

synset assignment. Section 3 provides the results of the experiments and error analysis<br />

on Thai, Indonesian, and Mongolian. Section 4 evaluates the accuracy of the<br />

assignment result, and the effectiveness of the complementary use of a dictionary from<br />

different sources. Section 5 concludes our work.<br />

2 Synset Assignment<br />

A set of synonyms determines the meaning of a concept. When the resources for a<br />

language are limited, an English equivalent word in a bi-lingual dictionary<br />

is a crucial key to finding an appropriate synset for the entry word in question. The<br />

synset assignment criteria described in this section rely on the information of<br />

1 A list of WordNets in the world and their information is provided at<br />

http://www.globalwordnet.org/gwa/wordnet_table.htm


Evaluation of Synset Assignment to Bi-lingual Dictionary 103<br />

English equivalent and synonym of a lexical entry, which is most commonly encoded<br />

in a bi-lingual dictionary.<br />

Synset Assignment Criteria<br />

Applying the nature of WordNet which introduces a set of synonyms to define the<br />

concept, we set up four criteria for assigning a synset to a lexical entry. The<br />

confidence score (CS) is introduced to annotate the likelihood of the assignment. The<br />

highest score, CS=4, is assigned to a synset that evidently includes more than one<br />

English equivalent of the lexical entry in question. On the contrary, the lowest score,<br />

CS=1, is assigned to any synset that contains only one of the English equivalents of<br />

the lexical entry in question when multiple English equivalents exist.<br />

The details of assignment criteria are: L i denotes the lexical entry, E j denotes the<br />

English equivalent, S k denotes the synset, and ∈ denotes the member of a set.<br />

Case 1: Accept the synset that includes more than one English equivalent with a<br />

confidence score of 4.<br />

Fig. 1 illustrates a lexical entry L 0 with two English equivalents, E 0 and E 1 .<br />

Both E 0 and E 1 are included in the synset S 1 . The criterion implies that S 1 is the<br />

synset for L 0 , since L 0 can be defined by the greater set of synonyms in S 1 .<br />

Therefore the relatively high confidence score, CS=4, is assigned for this synset to the<br />

lexical entry.<br />

[Figure: L 0 has equivalents E 0 and E 1 ; E 0 ∈ S 0 , S 1 and E 1 ∈ S 1 , S 2 ]<br />

Fig. 1. Synset assignment with CS=4<br />

Example:<br />

L 0 :<br />

E 0 : aim<br />

E 1 : target<br />

S 0 : purpose, intent, intention, aim, design<br />

S 1 : aim, object, objective, target<br />

S 2 : aim<br />

In the above example, the synset, S 1 , is assigned to the lexical entry, L 0 , with CS=4.<br />

Case 2: Accept the synset that includes more than one English equivalent of the<br />

synonym of the lexical entry in question with a confidence score of 3.<br />

If Case 1 fails in finding a synset that includes more than one English equivalent,<br />

the English equivalent of a synonym of the lexical entry is picked up to investigate.



Fig. 2 shows an English equivalent of a lexical entry L 0 and its synonym L 1 in a<br />

synset S 1 . In this case the synset S 1 is assigned to both L 0 and L 1 with CS=3. The<br />

score in this case is lower than the one assigned in Case 1 because the synonym of the<br />

English equivalent of the lexical entry is indirectly implied from the English<br />

equivalent of the synonym of the lexical entry. The indirectly retrieved English<br />

equivalent may therefore be distorted.<br />

[Figure: L 0 → E 0 ∈ S 0 , S 1 ; synonym L 1 → E 1 ∈ S 1 , S 2 ]<br />

Fig. 2. Synset assignment with CS=3<br />

Example:<br />

L 0 : L 1 :<br />

E 0 : stare E 1 : gaze<br />

S 0 : gaze, stare S 1 : stare<br />

In the above example, the synset, S 0 , is assigned to the lexical entry, L 0 , with CS=3.<br />

Case 3: Accept the only synset that includes only one English equivalent with a<br />

confidence score of 2.<br />

[Figure: L 0 has a single equivalent E 0 ∈ S 0 ]<br />

Fig. 3. Synset assignment with CS=2<br />

Fig. 3 shows the assignment of CS=2 when there is only one English equivalent and<br />

no synonym of the lexical entry. Though there is no additional English equivalent to<br />

increase the reliability of the assignment, at the same time there is no synonym of the<br />

lexical entry to distort the relation. In this case, the only English equivalent shows a<br />

uniqueness in the translation that can maintain a degree of confidence.<br />

Example:<br />

L 0 :<br />

E 0 : obstetrician<br />

S 0 : obstetrician, accoucheur<br />

In the above example, the synset, S 0 , is assigned to the lexical entry, L 0 , with CS=2.<br />

Case 4: Accept more than one synset that includes each of the English equivalents<br />

with a confidence score of 1.



Case 4 is the most relaxed rule to provide some relation information between the<br />

lexical entry and a synset. Fig. 4 shows the assignment of CS=1 to any relations that<br />

do not meet the previous criteria but the synsets include one of the English<br />

equivalents of the lexical entry.<br />

[Figure: L 0 → E 0 ∈ S 0 , S 1 ; E 1 ∈ S 2 ]<br />

Example:<br />

L 0 :<br />

E 0 : hole<br />

E 1 : canal<br />

S 0 : hole, hollow<br />

S 1 : hole, trap, cakehole, maw, yap, gop<br />

S 2 : canal, duct, epithelial duct, channel<br />

Fig. 4. Synset assignment with CS=1<br />

In the above example, each synset, S 0 , S 1, and S 2 is assigned to lexical entry L 0 , with<br />

CS=1.<br />
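The four cases above can be summarized operationally. The sketch below is our illustrative reading of the criteria over toy data structures, not the authors' code; in particular, scoring each synset independently and the Case 3 test (a single equivalent and no synonyms) are simplifying assumptions.<br />

```python
def assign_synsets(equivalents, synonym_equivalents, synsets):
    """Assign a confidence score (CS) per synset, following Cases 1-4.

    equivalents         -- English equivalents of the lexical entry
    synonym_equivalents -- English equivalents of its synonyms
    synsets             -- {synset_id: list of member words}
    """
    eqs, syn_eqs = set(equivalents), set(synonym_equivalents)
    scores = {}
    for sid, members in synsets.items():
        members = set(members)
        if len(eqs & members) > 1:                  # Case 1: shares >1 equivalent
            scores[sid] = 4
        elif eqs & members and syn_eqs & members:   # Case 2: shared via a synonym
            scores[sid] = 3
        elif eqs & members and len(eqs) == 1 and not syn_eqs:   # Case 3: sole equivalent
            scores[sid] = 2
        elif eqs & members:                         # Case 4: one of several equivalents
            scores[sid] = 1
    return scores

# The Case 1 example: L0 has English equivalents "aim" and "target".
synsets = {
    "S0": ["purpose", "intent", "intention", "aim", "design"],
    "S1": ["aim", "object", "objective", "target"],
    "S2": ["aim"],
}
print(assign_synsets(["aim", "target"], [], synsets))
```

On this toy input, S 1 receives CS=4 while S 0 and S 2 each receive CS=1, matching the paper's Case 1 and Case 4 readings.<br />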

3 Experiment Results<br />

We applied the synset assignment criteria to a Thai-English dictionary (MMT<br />

dictionary) [6] with synsets from WordNet 2.1. To compare the ratio of assignment<br />

for the Thai-English dictionary, we also investigated the synset assignment of Indonesian-<br />

English and Mongolian-English dictionaries.<br />

In our experiment, only 24,457 of the 207,010 synsets, i.e. 12% of the total, could<br />

be assigned to Thai lexical entries.<br />

Table 1 shows the success rate in assigning synsets to the Thai-English dictionary.<br />

About 24% of Thai lexical entries are found with English equivalents that meet<br />

one of our criteria.<br />

Going through the list of unmapped lexical entries, we can classify the errors into<br />

three groups:<br />

1. Compound<br />

The English equivalent is given as a compound, especially in cases where<br />

there is no appropriate translation to represent exactly the same sense. For<br />

example,



L: E: retail shop<br />

L: E: pull sharply<br />

2. Phrase<br />

Some particular words culturally used in one language may not be simply<br />

translated into one single word sense in English. In this case, we found it<br />

explained in a phrase. For example,<br />

L:<br />

E: small pavilion for monks to sit on to chant<br />

L:<br />

E: bouquet worn over the ear<br />

3. Word form<br />

Inflected forms, e.g., plural or past participle, are used to express an appropriate<br />

sense of a lexical entry. This can be found in non-inflected languages such as<br />

Thai and most Asian languages. For example,<br />

L: E: grieved<br />

The above English expressions cause an error in finding an appropriate synset.<br />

Table 1. Synset assignment to Thai-English dictionary<br />

            WordNet (synset)           TE Dict (entry)<br />

            total      assigned        total      assigned<br />

Noun        145,103    18,353 (13%)    43,072     11,867 (28%)<br />

Verb        24,884     1,333 (5%)      17,669     2,298 (13%)<br />

Adjective   31,302     4,034 (13%)     18,448     3,722 (20%)<br />

Adverb      5,721      737 (13%)       3,008      1,519 (51%)<br />

Total       207,010    24,457 (12%)    82,197     19,406 (24%)<br />
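The headline percentages in Table 1 follow directly from the raw counts; recomputing them (our arithmetic, not the authors'):<br />

```python
# Coverage rates from Table 1: assigned WordNet 2.1 synsets over all synsets,
# and assigned Thai-English dictionary entries over all entries.
wn_total, wn_assigned = 207010, 24457
te_total, te_assigned = 82197, 19406

synset_coverage = 100.0 * wn_assigned / wn_total
entry_coverage = 100.0 * te_assigned / te_total
print(round(synset_coverage), round(entry_coverage))  # -> 12 24
```

The same ratios reproduce the per-part-of-speech percentages in the table as well.<br />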

We applied the same algorithm to Indonesian-English and Mongolian-English [7]<br />

dictionaries to investigate how it works with other languages in terms of the selection<br />

of English equivalents. The difference in unit of concept is basically understood to<br />

affect the assignment of English equivalents in bi-lingual dictionaries. In Table 2, the<br />

size of the Indonesian-English dictionary is about half that of the Thai-English<br />

dictionary. The success rates of assignment to the lexical entry are the same, but the<br />

rate of synset assignment of the Indonesian-English dictionary is lower than that of<br />

the Thai-English dictionary. This is because the total number of lexical entries is<br />

about in the half that of the Thai-English dictionary.<br />

A Mongolian-English dictionary is also evaluated. Table 3 shows the result of<br />

synset assignment.<br />

These experiments show the effectiveness of using English equivalents and<br />

synonym information from limited resources in assigning WordNet synsets.



Table 2. Synset assignment to Indonesian-English dictionary<br />

            WordNet (synset)           IE Dict (entry)<br />

            total      assigned        total      assigned<br />

Noun        145,103    4,955 (3%)      20,839     2,710 (13%)<br />

Verb        24,884     7,841 (32%)     15,214     4,243 (28%)<br />

Adjective   31,302     3,722 (12%)     4,837      2,463 (51%)<br />

Adverb      5,721      381 (7%)        414        285 (69%)<br />

Total       207,010    16,899 (8%)     41,304     9,701 (24%)<br />

Table 3. Synset assignment to Mongolian-English dictionary<br />

            WordNet (synset)           ME Dict (entry)<br />

            total      assigned        total      assigned<br />

Noun        145,103    268 (0.18%)     168        125 (74.40%)<br />

Verb        24,884     240 (0.96%)     193        139 (72.02%)<br />

Adjective   31,302     211 (0.67%)     232        129 (55.60%)<br />

Adverb      5,721      35 (0.61%)      42         17 (40.48%)<br />

Total       207,010    754 (0.36%)     635        410 (64.57%)<br />

4 Evaluations<br />

In the evaluation of our approach for synset assignment, we randomly selected 1,044<br />

synsets from the result of synset assignment to the Thai-English dictionary (MMT<br />

dictionary) for manual checking. The random set covers all types of part-of-speech<br />

and degrees of confidence score (CS) to confirm the approach in all possible<br />

situations. According to the supposition of our algorithm that the set of English<br />

equivalents of a word entry and its synonyms are significant information to relate to a<br />

synset of WordNet, the assignment accuracy should correspond to the degree of CS.<br />

It took about three years to develop the Balkan WordNet on PWN 2.0 [8], [9].<br />

Therefore, we randomly picked up some synsets that resulted from our synset<br />

assignment algorithm. The results were manually checked and the details of synsets to<br />

be used to evaluate our algorithm are shown in Table 4.



Table 5 shows the accuracy of synset assignment by part of speech and CS. A small<br />

set of adverb synsets is 100% correctly assigned irrespective of its CS, though the total<br />

number of adverbs in the evaluation may simply be too small. The algorithm shows a better result<br />

of 48.7% on average for noun synset assignment and 43.2% on average across all parts of<br />

speech.<br />

With the better information of English equivalents marked with CS=4, the<br />

assignment accuracy is as high as 80.0% and decreases along with the CS<br />

value. This confirms that the accuracy of synset assignment strongly relies on the<br />

number of English equivalents in the synset. The indirect information of English<br />

equivalents of the synonym of the word entry is also helpful, yielding 60.7% accuracy<br />

in synset assignment for the group with CS=3. The others are quite low, but the English<br />

equivalents are still somewhat useful for providing candidates for expert revision.<br />

Table 4. Random set of synset assignment<br />

            CS=4    CS=3    CS=2    CS=1    Total<br />

Noun        7       479     64      272     822<br />

Verb        -       44      75      29      148<br />

Adjective   1       25      -       32      58<br />

Adverb      7       4       4       1       16<br />

Total       15      552     143     334     1,044<br />

Table 5. Accuracy of synset assignment<br />

            CS=4         CS=3          CS=2         CS=1         Total<br />

Noun        5 (71.4%)    306 (63.9%)   34 (53.1%)   55 (20.2%)   400 (48.7%)<br />

Verb        -            23 (52.3%)    6 (8.0%)     4 (13.8%)    33 (22.3%)<br />

Adjective   -            2 (8.0%)      -            -            2 (3.4%)<br />

Adverb      7 (100%)     4 (100%)      4 (100%)     1 (100%)     16 (100%)<br />

Total       12 (80.0%)   335 (60.7%)   44 (30.8%)   60 (18%)     451 (43.2%)<br />
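The accuracies in Table 5 are simply correct assignments divided by sampled assignments per confidence score; a quick cross-check against the totals of Tables 4 and 5 (our arithmetic, not the authors'):<br />

```python
# Sampled assignments per CS (totals row of Table 4) and manually verified
# correct assignments (totals row of Table 5).
sampled = {4: 15, 3: 552, 2: 143, 1: 334}
correct = {4: 12, 3: 335, 2: 44, 1: 60}

for cs in (4, 3, 2, 1):
    accuracy = 100.0 * correct[cs] / sampled[cs]
    print("CS=%d: %.1f%%" % (cs, accuracy))

overall = 100.0 * sum(correct.values()) / sum(sampled.values())
print("overall: %.1f%%" % overall)  # -> overall: 43.2%
```

The monotone decrease from CS=4 to CS=1 is the numerical basis for the claim that accuracy tracks the confidence score.<br />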

Table 6. Additional correct synset assignment by other dictionary (LEXiTRON)<br />

            CS=4    CS=3    CS=2    CS=1    Total<br />

Noun        -       2       22      29      53<br />

Verb        2       -       6       4       12<br />

Adjective   -       -       -       -       -<br />

Adverb      -       -       -       -       -<br />

Total       2       2       28      33      65<br />

To examine the effectiveness of English equivalent and synonym information from<br />

a different source, we consulted another Thai-English dictionary (LEXiTRON) [10].<br />

Table 6 shows the improvement of the assignment by the increased number of correct



assignment in each type. We gain more corrections for nouns and verbs, but not for adjectives.<br />

Verbs and adjectives are ambiguously defined in the Thai lexicon, and the number of<br />

remaining adjectives is too small; the result should therefore improve regardless of<br />

the type.<br />

Table 7. Improved correct synset assignment by additional bi-lingual dictionary (LEXiTRON)<br />

            CS=4         CS=3          CS=2         CS=1         Total<br />

Total       14 (93.3%)   337 (61.1%)   72 (50.3%)   93 (27.8%)   516 (49.4%)<br />

Table 7 shows the total improvement of the assignment accuracy when we<br />

integrated English equivalent and synonym information from a different source. The<br />

accuracy for synsets marked with CS=4 is improved from 80.0% to 93.3% and the<br />

average accuracy is also significantly improved from 43.2% to 49.4%. All types of<br />

synset are significantly improved if a bi-lingual dictionary from a different source is<br />

available.<br />

5 Conclusion<br />

Our synset assignment criteria were effectively applied to languages having only<br />

English equivalents and lexical synonyms. Confidence scores were shown to be<br />

efficiently assigned to indicate the degree of reliability of the assignment, which<br />

later was a key value in the revision process. Languages in Asia are significantly<br />

different from the English language in terms of grammar and lexical word units. The<br />

differences prevent us from finding the target synset by following just the English<br />

equivalent. Synonyms of the lexical entry and an additional dictionary from different<br />

sources can be complementarily used to improve the accuracy in the assignment.<br />

Applying the same criteria to other Asian languages also yielded a satisfactory result.<br />

Following the same process that we implemented for the Thai language, we are<br />

expecting acceptable results for Indonesian, Mongolian, and other languages.<br />

References<br />

1. Fellbaum, C. (ed.).: WordNet: An Electronic Lexical Database. MIT Press, Cambridge,<br />

Mass (1998)<br />

2. Spanish and Catalan WordNets, http://www.lsi.upc.edu/~nlp/<br />

3. Atserias, J., Clement, S., Farreres, X., Rigau, G., Rodríguez, H.: Combining Multiple<br />

Methods for the Automatic Construction of Multilingual WordNets. In: Proceedings of the<br />

International Conference on Recent Advances in Natural Language, Bulgaria. (1997)<br />

4. Magnini, B., Strapparava, C., Ciravegna, F., Pianta, E.: A Project for the Construction of an<br />

Italian Lexical Knowledge Base in the Framework of WordNet. IRST Technical Report #<br />

9406-15 (1994)<br />

5. Proszeky, G., Mihaltz, M.: Semi-Automatic Development of the Hungarian WordNet. In:<br />

Proceedings of the LREC 2002, Spain. (2002)



6. CICC.: Thai Basic Dictionary. Technical Report, Japan. (1995)<br />

7. Hangin, G., Krueger, J. R., Buell, P.D., Rozycki, W.V., Service, R.G.: A modern<br />

Mongolian-English dictionary. Indiana University, Research Institute for Inner Asian<br />

Studies (1986)<br />

8. Tufiş, D. (ed.).: Special Issue on the BalkaNet Project, Romanian Journal of Information<br />

Science and Technology, vol. 7, no. 1-2. (2004)<br />

9. Barbu, E., Mititelu, V. B.: Automatic Building of Wordnets. In: Proceedings of RANLP,<br />

Bulgaria (2005)<br />

10. NECTEC: LEXiTRON: Thai-English Dictionary, http://lexitron.nectec.or.th/


Using and Extending WordNet<br />

to Support Question-Answering<br />

Peter Clark 1 , Christiane Fellbaum 2 , and Jerry Hobbs 3<br />

1 Boeing Phantom Works, Seattle (USA)<br />

2 Princeton University, Princeton (USA)<br />

3 USC/ISI, Marina del Rey (USA)<br />

peter.e.clark@boeing.com, fellbaum@clarity.princeton.edu, hobbs@isi.edu<br />

Abstract. Over the last few years there has been increased research in<br />

automated question-answering from text, including questions whose answer is<br />

implied, rather than explicitly stated, in the text. WordNet has played a central<br />

role in many such systems (e.g., 21 of the 26 teams in the recent PASCAL<br />

RTE3 challenge used WordNet), and thus WordNet is being increasingly<br />

stretched to play more semantic tasks in applications. As part of our current<br />

research, we are exploring some of the new demands which question-answering<br />

places on WordNet, and how it might be further extended to meet them. In this<br />

paper, we present some of these new requirements, and some of the extensions<br />

that we are currently making to WordNet in response.<br />

Keywords: WordNet, question answering, textual entailment, world knowledge<br />

1 Introduction<br />

Advanced question-answering is more than simply fact retrieval; typically, much of<br />

the knowledge that an author wishes to convey is never explicitly stated in text (by<br />

one estimate the ratio of explicit:implicit knowledge is 1:8, [1]). Rather, the reader<br />

fills in the missing pieces using his/her background knowledge, creating a "mental<br />

model" of the scenario the text is describing, allowing him/her to go beyond facts<br />

explicitly stated. For example, given:<br />

"A soldier was killed in the gun battle"<br />

a reader would infer that, plausibly, the soldier was shot, even though this fact is never<br />

explicitly stated.<br />

A key requirement for this task is access to a large body of world knowledge.<br />

However, machines are currently poorly equipped in this regard, and developing such<br />

resources is challenging. Typically, manual acquisition of knowledge is too slow,<br />

while automatic acquisition is too messy. However, WordNet [2,3] presents one<br />

avenue for making inroads into this problem: It already has broad coverage, multiple<br />

lexico-semantic connections, and significant knowledge encoded (albeit informally)<br />

in its glosses; it can thus be viewed as on the path to becoming an extensively


112 Peter Clark, Christiane Fellbaum, and Jerry Hobbs<br />

leveragable resource for reasoning. Our goal is to explore this perspective, and to<br />

accelerate WordNet along this path. The result we are aiming for is a significantly<br />

enhanced WordNet better able to support applications needing extensive semantic<br />

knowledge.<br />

2 Semantic Requirements on WordNet<br />

To assess WordNet's strengths and limitations for supporting textual question-answering,<br />

we have been working with the task of "recognizing textual entailment"<br />

(RTE) [4,5], namely deciding whether a hypothesis sentence, H, follows from an<br />

initial text T. For example, from:<br />

(1.T) Satomi Mitarai bled to death.<br />

the following hypotheses plausibly follow:<br />

(1.H1) Satomi Mitarai died.<br />

(1.H2) Mitarai lost blood.<br />

Similarly, from:<br />

(2.T) Hanssen, who sold FBI secrets to the Russians, could face the death<br />

penalty.<br />

it plausibly follows that:<br />

(2.H1) The FBI had secrets.<br />

(2.H2) Hanssen received money from the Russians.<br />

(2.H3) Hanssen might be executed.<br />

(2.H4) The Russians bought secrets from Hanssen.<br />

Our methodology has been to define a test suite of such sentences, analyze the<br />

types of knowledge required to determine if the entailment holds or not, and then<br />

determine the extent to which WordNet can provide this knowledge already and<br />

where the gaps are. For these gaps, we are exploring ways in which they can be<br />

partially filled in.<br />

The test suite we developed contains 244 T-H entailment pairs (122 of which are<br />

positive entailments) such as those shown above. The pairs are grammatically fairly<br />

simple, and were deliberately authored to focus on the need for lexico-semantic<br />

knowledge rather than advanced linguistic processing. Determining entailment is very<br />

challenging in many cases. Each positive entailment pair was analyzed to identify the<br />

knowledge required to answer them. For example, for the pair:<br />

(3.T) Iran purchased plans for a nuclear reactor from A.Q.Khan.<br />

(3.H) The Iranians bought plans for building a nuclear reactor.


Using and Extending WordNet to Support Question-Answering 113<br />

the computer needs to know:<br />

"Iranian" is a person from Iran (derivational link)<br />

"buy" and "purchase" are approximately equivalent (synonyms)<br />

"plans for X" can mean "plans for building X" (world knowledge)<br />

This process was repeated for all 122 positive entailments. From this, we found the<br />

knowledge requirements could be grouped into approximately 15 major categories,<br />

namely knowledge of:<br />

1. Synonyms<br />

2. Hypernyms<br />

3. Irregular word forms<br />

4. Proper nouns<br />

5. Adverb-adjective relations<br />

6. Noun-adjective relations<br />

7. Noun-verb relations and their semantics (e.g., a consumer is the AGENT of a<br />

consume event)<br />

8. Purpose of artifacts<br />

9. Polysemy vs. homonymy (related vs. unrelated senses of a word form)<br />

10. Typical/plausible behavior (planes fly, bombs explode, etc.)<br />

11. Core world knowledge (e.g., time, space, events)<br />

12. Specific world knowledge (e.g., bleeding involves loss of blood)<br />

13. Knowledge about actions and events (preconditions, effects)<br />

14. Paraphrases (linguistically equivalent ways of saying the same thing)<br />

15. Other<br />
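To make categories 1 and 2 concrete, the following sketch shows how an entailment check can consult synonym and hypernym knowledge. The lookup tables are toy stand-ins for WordNet, not actual WordNet data, and the function names are ours:

```python
# Toy stand-ins for WordNet synonym and hypernym knowledge
# (categories 1 and 2 above); all entries are illustrative only.
SYNONYMS = {"purchase": {"buy"}, "die": {"perish"}}
HYPERNYMS = {"bleed": {"lose blood"}}  # hypothetical hypernym entry

def lexically_entailed(t_word, h_word):
    """True if h_word is t_word itself, one of its synonyms,
    or one of its listed hypernyms."""
    return (h_word == t_word
            or h_word in SYNONYMS.get(t_word, set())
            or h_word in HYPERNYMS.get(t_word, set()))

def hypothesis_covered(t_words, h_words):
    """Every content word of H must be matched by some word of T."""
    return all(any(lexically_entailed(t, h) for t in t_words)
               for h in h_words)

# (3.T)/(3.H): "purchased" -> "bought" holds via the synonym table.
assert lexically_entailed("purchase", "buy")
assert hypothesis_covered({"iran", "purchase", "plans"},
                          {"iran", "buy", "plans"})
```

A full system would replace the tables with WordNet lookups and add the remaining knowledge types (paraphrases, world knowledge, etc.).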

Of these, WordNet already has rich coverage of synonyms, hypernyms, adverb-adjective<br />

relations, and noun-adjective relations. It also has knowledge of noun-verb<br />

relations, although it does not distinguish between the different semantic types of this<br />

relation (e.g., AGENT, INSTRUMENT, EVENT); and it has some knowledge about<br />

the semantic similarity of highly polysemous verbs. In addition, WordNet has some<br />

knowledge of irregular word forms and proper nouns, and additional information is<br />

easily obtainable from other existing resources. The remaining knowledge types are<br />

still lacking; our goal is to extend WordNet to help provide more of this kind of<br />

knowledge. Note that we do not view WordNet as the sole supplier of knowledge;<br />

rather, we wish to increase its utility as a contributing knowledge resource for systems<br />

performing advanced question-answering.<br />

3 Recent WordNet Extensions<br />

Based on this analysis, we are making several extensions to WordNet, which we<br />

describe in the following sections.



3.1 Morphosemantic links<br />

WordNet contains mostly paradigmatic relations, i.e., relations among synsets with<br />

words belonging to the same part of speech (POS). Version 2 introduced cross-POS<br />

links, so-called "morphosemantic links" among synsets that were not only<br />

semantically but also morphologically related [6]. There are currently tens of<br />

thousands of manually encoded noun-verb (sense) connections, linking derivationally<br />

related nouns and verbs, e.g.:<br />

abandon#v1 - abandonment#n3<br />

rule#v6 - ruler#n1<br />

catch#v4 - catcher#n1<br />

Importantly, the appropriate senses of the nouns and verbs are paired, e.g., "ruler"<br />

and "rule" refer to the measuring stick and the marking or drawing with a ruler,<br />

respectively, rather than to a governor and governing, which makes for a different<br />

pair. What WordNet does not currently inform about, however, is the nature of the<br />

relation. For example:<br />

abandonment#n3 is the EVENT of abandon#v1<br />

ruler#n1 is the INSTRUMENT of rule#v6<br />

catcher#n1 is the AGENT of catch#v4<br />

Knowledge of the nature of such relations is essential for many question-answering<br />

tasks. For example, given<br />

(4.T) "Dodge produces ProHeart devices",<br />

the system needs to recognize that "producer" refers to the AGENT ("Dodge"), "production"<br />

refers to the EVENT ("produces"), and "product" to the RESULT ("ProHeart devices"),<br />

a prerequisite for correctly answering questions asking about the<br />

producer/production/product.<br />

The scale of adding this information manually is somewhat daunting; there are<br />

approximately 21,500 noun-verb (sense) links needing to be typed in WordNet. (We<br />

have not yet considered morphosemantic links among synsets from other parts of<br />

speech, which could also contribute to WordNet's usefulness as a tool for automated<br />

question answering.)<br />

We have devised the following semi-automated approach:<br />

1. We extract the noun-verb pairs with a particular morphological relation, (e.g., "-er"<br />

nouns such as "builder"-"build")<br />

2. We determine the default relation for these pairs (e.g., the noun is the AGENT of<br />

the action expressed by verb)<br />

3. We manually go through the list of pairs, marking pairs not conforming to the default<br />

relation.<br />

4. We inspect and group the marked pairs, assigning the correct relations to them.
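The four steps above can be sketched as follows. The pair list, the default relation, and the exception table are invented for illustration; the real input is the list of roughly 21,500 WordNet noun-verb links:

```python
# Step 1: toy "-er" noun-verb pairs (illustrative, not WordNet data).
ER_PAIRS = [("builder", "build"), ("catcher", "catch"),
            ("ruler", "rule"), ("broiler", "broil")]

# Step 2: default relation for the "-er" morphological class.
DEFAULT = "AGENT"

# Step 3: manually marked exceptions (here, both nouns denote
# instruments rather than agents; assignments are illustrative).
EXCEPTIONS = {("ruler", "rule"): "INSTRUMENT",
              ("broiler", "broil"): "INSTRUMENT"}

def type_pairs(pairs):
    """Step 4: assign the default relation unless the pair was marked."""
    return {pair: EXCEPTIONS.get(pair, DEFAULT) for pair in pairs}

typed = type_pairs(ER_PAIRS)
assert typed[("builder", "build")] == "AGENT"
assert typed[("ruler", "rule")] == "INSTRUMENT"
```

Only the exceptions require manual classification, which is what makes the procedure faster than labelling every pair.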



This methodology is substantially faster than simply labelling each pair one by<br />

one, as only exceptions to the default relation need to be manually classified. In<br />

addition, this method has revealed the surprisingly high degree to which the generally<br />

accepted one-to-one mapping of morphemes to meanings is violated.<br />

Furthermore, it is interesting to see that across the morphological classes, a limited<br />

inventory of semantic relations applies (for details see [7]).<br />

3.2 Purpose links<br />

A second type of knowledge often needed in question-answering is the function or<br />

purpose of artifacts (natural entities like stones and trees do not have an inherent<br />

function). For example, given:<br />

(5.T) "The soldier was killed in a gun fight"<br />

(5.H) "The soldier was shot"<br />

we need to know that a gun is for shooting in order to infer that 5.H plausibly follows<br />

from 5.T. Knowledge of what an artifact is intended for and how it is typically used<br />

enables a computer to make a plausible guess about implicit events that are not<br />

overtly expressed in a text. So our goal is to add links among noun and verb synsets in<br />

WordNet such that the verbs denote the intended and typical function or purpose of<br />

the nouns.<br />

The number of such links is potentially huge, as almost any object can be used for<br />

almost any function. Thus, one can kill someone with a stiletto shoe, using it as a<br />

weapon. Similarly, a tree stump could be sat on when no chair is available. Worse,<br />

just about any solid object of a certain size can be used for hitting. We try to limit our<br />

links to those expressing the intended function, similar to the Telic role of<br />

Pustejovsky's qualia structure [8]. Corpus data, e.g., [9], can be used to identify the most frequent noun-verb<br />

cooccurrences and usually confirm one's intuition about which noun-verb synset<br />

pairs should be linked.<br />

Manually adding the links is a daunting task. However, a semi-automated approach<br />

is possible, using existing morphosemantic links in WordNet. As noted by Clark and<br />

Clark [10], English has a productive and fairly regular rule whereby many nouns<br />

can be used as verbs, and in many cases, the verb denotes the noun's intended<br />

function (or, put differently, the noun is the Instrument for carrying out the action<br />

expressed by the verb). Examples are "gun"(n)-"gun"(v): A gun is for gunning;<br />

"pencil"(n)-"pencil"(v): A pencil is for penciling, a hyponym of writing. In cases<br />

where there is no corresponding verb, e.g., for "car"(n), we can search up the<br />

hypernym tree until a more general noun is found which does have a corresponding<br />

verb, e.g., "car"(n) is a "transport"(n), linked to "transport"(v), thus a "car" is for<br />

"transporting".<br />

We are currently inspecting the list of so-called zero-derived (homographic) noun-verb<br />

pairs in WordNet and classifying them as described in 3.1. Those pairs where the<br />

noun is an Instrument will be encoded with purpose links. Similarly, all noun-verb<br />

pairs from the different morphological classes (-er, -al, -ment, -ion, etc.) that were<br />

classified as expressing an Instrument relation can be labeled as "Purpose."



The automatic extraction of pairs related via a specific affix (Step 1 in 3.1 above)<br />

generates a list of candidate pairs that is validated and corrected by the same<br />

lexicographer who manually inspects the pairs for their semantic relation. Most pairs<br />

that are generated are valid, but a few false hits must be discarded. For example, the<br />

noun synset {coax, ethernet cable} was paired with the verb "coax", which would lead<br />

to the statement "An ethernet cable is for coaxing". In the majority of cases the<br />

computer's guess is sensible, and hence construction of the database is much faster<br />

than working from scratch.<br />

3.3 World Knowledge - WordNet Glosses<br />

WordNet contains a substantial amount of knowledge within its glosses. In particular,<br />

note that knowledge about a word (sense) is not just contained in that sense's gloss<br />

and example sentences, but also in its use in other glosses and example sentences. For<br />

example, for the word "lawn", WordNet includes mention that a lawn:<br />

• needs watering;<br />

• can have games played on it;<br />

• can be flattened, mowed;<br />

• can have chairs on it and other furniture;<br />

• can be cut/mowed;<br />

• can have things growing on it;<br />

• has grass;<br />

• can have leaves on it; and<br />

• can be seeded.<br />

Despite this promise, this knowledge is largely locked up in informal English text,<br />

and difficult to extract in a machine-usable form (although there has been some work<br />

on translating the glosses to logic, e.g., [11,12]). The glosses were not originally<br />

written with machine interpretation in mind, and as a result the output of machine<br />

interpretation is often syntactically valid but semantically meaningless logic. To<br />

address this challenge, we are proceeding along two fronts: first, we are developing an<br />

improved language processor specifically designed for interpreting the WordNet<br />

glosses; second, we are manually rephrasing some of the glosses to create more<br />

regularity in their structure, so that the resulting machine interpretation is improved.<br />

To scope this work, we are focusing on "Core WordNet". Because WordNet<br />

contains tens of thousands of synsets referring to highly specific animals, plants,<br />

chemical compounds, etc. that are less relevant to NLP, the Princeton WordNet group<br />

has compiled a CoreWordNet, consisting of 5,000 synsets that express frequent and<br />

salient concepts. These were selected as follows. First, a list with the most frequent<br />

strings from the BNC was automatically compiled and all WordNet synsets for these<br />

strings were pulled out. Second, two raters determined which of the senses of these<br />

strings expressed "salient" concepts [13]. The resulting top 5,000 concepts comprise<br />

the core that we are focusing on and, as a result of this method of data collection,<br />

contain a mixture of general and (common) domain-specific terms. (CoreWordNet<br />

is downloadable from http://wordnet.cs.princeton.edu/downloads.html)
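The selection procedure might be sketched as follows, with invented frequency counts standing in for the BNC list and a set standing in for the raters' salience judgments:

```python
# Toy frequency counts and a toy salience "oracle"; the real procedure
# used BNC frequencies and two human raters.
BNC_FREQ = {"time": 180000, "axolotl": 40, "money": 90000, "village": 30000}
SALIENT = {"time", "money", "village"}  # stand-in for rater judgments

def core_concepts(freq, salient, top_n=3):
    """Rank strings by frequency, keep those judged salient,
    and return the top_n survivors."""
    frequent = sorted(freq, key=freq.get, reverse=True)
    return [w for w in frequent if w in salient][:top_n]

assert core_concepts(BNC_FREQ, SALIENT) == ["time", "money", "village"]
```

In the real CoreWordNet, the cut-off is 5,000 synsets rather than the toy top_n used here.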



3.4 World Knowledge - Core Theories<br />

In addition to the specific world knowledge that might be obtained from the glosses,<br />

question-answering sometimes requires more fundamental, "core" knowledge of the<br />

world, e.g., about space, time, events, cognition, people and activities. Because of its<br />

more general nature, such knowledge is less likely to come from the WordNet<br />

glosses, and instead we are encoding some of this knowledge by hand as a set of "core<br />

theories". Although these theories contain only a small number of concepts (synsets),<br />

these concepts are also often general, meaning that information about them can be<br />

applied to a large number of other WordNet concepts. For example, WordNet has 517<br />

"vehicle" nouns, and so any general knowledge about vehicles in general is<br />

potentially applicable to all these subtypes; similarly WordNet has 185 "cover" verbs,<br />

so general knowledge about the nature of covering can potentially apply to all these<br />

subtypes. In general, the broad coverage of WordNet can be funneled into a much<br />

smaller defined core, which can then be richly axiomatized, and the resulting axioms<br />

applied to much of the wider vocabulary in WordNet.<br />

To identify these theories, we sorted words in Core WordNet into groups based on<br />

(a somewhat intuitive notion of) coherence, resulting in 15 core theories (listed with a<br />

selection of the words in them):<br />

• Composite Entities: perfect, empty, relative, secondary, similar, odd, ...<br />

• Scales: step, degree, level, intensify, high, major, considerable, ...<br />

• Events: constraint, secure, generate, fix, power, development, ...<br />

• Space: grade, inside, lot, top, list, direction, turn, enlarge, long, ...<br />

• Time: year, day, summer, recent, old, early, present, then, often, ...<br />

• Cognition: imagination, horror, rely, remind, matter, estimate, idea, ...<br />

• Communication: journal, poetry, announcement, gesture, charter, ...<br />

• Persons and their Activities: leisure, childhood, glance, cousin, jump, ...<br />

• Microsocial: virtue, separate, friendly, married, company, name, ...<br />

• Material World: smoke, shell, stick, carbon, blue, burn, dry, tough, ...<br />

• Geo: storm, moon, pole, world, peak, site, village, sea, island, ...<br />

• Artifacts: bell, button, van, shelf, machine, film, floor, glass, chair, ...<br />

• Food: cheese, potato, milk, break, cake, meat, beer, bake, spoil, ...<br />

• Macrosocial: architecture, airport, headquarters, prosecution, ...<br />

• Economic: import, money, policy, poverty, profit, venture, owe, ...<br />

We are first focusing on Time and Event words. We have developed underlying<br />

ontologies of time and event concepts, explicating the key notions in these domains<br />

[14,15]. For example, the temporal ontology axiomatizes topological temporal<br />

concepts like before, duration concepts, and concepts involving the clock and<br />

calendar. The event ontology axiomatizes notions like subevent, and the internal<br />

structure of events and processes. We are then defining, or at least characterizing, the<br />

meanings of the various word senses in terms of these underlying theories. For<br />

example, to fix something is to bring about a state in which all the components of the<br />

thing are functional. This effort is of course a very labor intensive project, but since



we are concentrating on the synsets in the core WordNet, we believe we will achieve<br />

the maximum impact for the labor we put into it.<br />
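As an illustration in our own notation (not the project's published axioms), the characterization of "fix" given above might be rendered as a first-order axiom:

```latex
\forall e, x\; \big[\, \mathit{fix}(e, x) \rightarrow
  \exists s\; \big(\, \mathit{bringAbout}(e, s) \,\wedge\,
  \forall c\; (\, \mathit{componentOf}(c, x) \rightarrow
    \mathit{functionalIn}(c, s) \,)\, \big) \big]
```

That is, a fixing event e on x brings about a state s in which every component c of x is functional; the predicate names here are hypothetical placeholders for the core-theory vocabulary.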

Because of the richness of WordNet's hypernym links, in principle these axioms<br />

can be heavily reused for reasoning about WordNet word senses. A number of the<br />

textual entailment problems in our test suite appeal directly to this knowledge, for<br />

example, judging the validity of this entailment:<br />

(6.T) Baghdad has seen a spike in violence since the summer.<br />

(6.H) There was greater violence in Baghdad since the summer.<br />

requires reasoning about the core notion of change in a quantity ("spike", "rise"),<br />

rather than anything specific about Baghdad, violence, or summer. This kind of<br />

knowledge - namely the meaning of these core words and their relationships - is being<br />

encoded in these core theories.<br />

4 Status and Summary<br />

The work that we have described here is still a work in progress: To date, we have<br />

corrected/validated about half of the machine-generated database of morphosemantic<br />

links; made an initial start on the purpose links; have completed a first pass on logical<br />

forms for WordNet glosses and are focusing on improving both the phrasing and<br />

interpretation of Core WordNet; and have completed some of the core theories and<br />

are in the process of linking their core notions to WordNet word senses. Our goal is<br />

that these extensions will substantially improve WordNet's utility for language-based<br />

problems that require reasoning as well as basic lexical information, and we are<br />

optimistic that these will improve WordNet's ability to meet the increasingly strong<br />

requirements demanded by modern day language-based applications.<br />

Acknowledgements<br />

This work was supported by the AQUAINT Program of the Disruptive Technology<br />

Office under contract number N61339-06-C-0160.<br />

References<br />

1. Graesser, A. C.: Prose Comprehension Beyond the Word. Springer, NY (1981)<br />

2. Miller, G. A.: WordNet: a lexical database for English. J. Communications of the ACM.<br />

38(11), 39–41 (1995)<br />

3. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA<br />

(1998)<br />

4. Giampiccolo, D., Magnini, B., Dagan, I., Dolan, B.: The Third PASCAL Recognizing<br />

Textual Entailment Challenge. In: Proc. 2007 Workshop on Textual Entailment and<br />

Paraphrasing, pp 1–9. PA: ACL. (2007)



5. Clark, P., Harrison, P., Thompson, J., Murray, W., Hobbs, J., Fellbaum, C.: On the Role<br />

of Lexical and World Knowledge in RTE3. In: ACL-PASCAL Workshop on Textual<br />

Entailment and Paraphrases, June 2007. Prague, CZ (2007)<br />

6. Miller, G. A., Fellbaum, C.: Morphosemantic links in WordNet. J. Traitement<br />

Automatique des Langues 44(2), 69–80 (2003)<br />

7. Fellbaum, C., Osherson, A., Clark, P.E.: Putting Semantics into WordNet's<br />

"Morphosemantic" Links. In: Proceedings of the Third Language and Technology<br />

Conference, Poznan, Poland, October 5–7. (2007)<br />

8. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge, MA (1995)<br />

9. Clark, P., Harrison, P.: The Reuters Tuple Database. (Available on request from<br />

peter.e.clark@boeing.com) (2003)<br />

10. Clark, E., Clark, H.: When nouns surface as verbs. J. Language 55, 767–811 (1979)<br />

11. Harabagiu, S.M., Miller, G.A., Moldovan, D.I.: WordNet 2 - A Morphologically and<br />

Semantically Enhanced Resource In: Proc. SIGLEX 1999, pp. 1–8. (1999)<br />

12. Fellbaum, C., Hobbs, J.: WordNet for Question Answering (AQUAINT II Project<br />

Proposal). Technical Report, Princeton University (2004)<br />

13. Boyd-Graber, J., Fellbaum, C., Osherson, D., Schapire, R.: Adding dense, weighted,<br />

connections to WordNet. In: Proceedings of the Third Global WordNet Meeting, Jeju<br />

Island, Korea, January 2006 (2006)<br />

14. Hobbs, J. R., Pan, F.: An Ontology of Time for the Semantic Web. J. ACM Transactions<br />

on Asian Language Information Processing 3(1), March 2004. (2004)<br />

15. Hobbs, J.R.: Encoding Commonsense Knowledge. Technical Report, ISI.<br />

http://www.isi.edu/~hobbs/csk.html (2007)


An Evaluation Procedure for Word Net Based Lexical<br />

Chaining: Methods and Issues<br />

Irene Cramer and Marc Finthammer<br />

Faculty of Cultural Studies, University of Dortmund, Germany<br />

irene.cramer|marc.finthammer@uni-dortmund.de<br />

Abstract. Lexical chaining is regarded as a valuable resource for NLP applications,<br />

such as automatic text summarization or topic detection. Typically,<br />

lexical chainers use a word net to compute semantically motivated partial text<br />

representations. However, their output is normally evaluated with respect to an<br />

application since generic evaluation criteria have not yet been determined and<br />

systematically applied. This paper presents a new evaluation procedure meant to<br />

address this issue and provide insight into the chaining process. Furthermore, the<br />

paper exemplarily demonstrates its application for a lexical chainer using GermaNet<br />

as a resource.<br />

1 Project Context and Motivation<br />

Converting linear text documents into documents publishable in a hypertext environment<br />

is a complex task requiring methods for the segmentation, reorganization, and<br />

linking. The HyTex project, funded by the DFG, aims at the development of conversion<br />

strategies based on text-grammatical features 1 . One focus of our work is on topic-based<br />

linking strategies using lexical and thematic chains. In contrast to lexical chains, thematic<br />

chains are based on a selection of central words, so-called topic anchors (e.g., words<br />

able to outline the content of a complete passage), which are, as in lexical chaining,<br />

connected via semantically meaningful edges. An illustration is given in Fig. 1.<br />

We intend to use lexical chaining for the construction of thematic chains: on the<br />

one hand as a feature for the extraction of topic anchors and on the other hand as a<br />

tool for the calculation of thematic structure, as shown in Fig. 1. For this purpose, we<br />

implemented a lexical chainer for German corpora based on GermaNet. In order to perform<br />

an in-depth analysis and evaluation of this chainer as well as to gain insight into<br />

the whole chaining process we developed a detailed evaluation procedure. We argue<br />

that this procedure is applicable to any lexical chainer regardless of the algorithm or resources<br />

used and helps to fine-tune the parameter setting ideal for a specific application.<br />

We also present a detailed evaluation of our own lexical chainer and illustrate the issues<br />

and challenges we encountered using GermaNet as a resource.<br />

1 See our project web pages http://www.hytex.info/ for more information about the concept of<br />

thematic chains and the project context.


An Evaluation Procedure for Word Net Based Lexical Chaining... 121<br />

[Figure omitted: diagram of a topic chainer linking topic anchors 1, 2, 3, ..., n via synonym, hyponym, and meronym edges, showing a top-level topic with topic continuation, topic splitting, and topic composition.]<br />

Fig. 1. Topic chaining example<br />

Paper plan: The remainder of this paper is structured as follows: Section 2 describes<br />

the basic aspects of lexical chaining and presents a detailed, new evaluation procedure.<br />

Section 3 presents the resources used for our lexical chainer and the evaluation.<br />

Section 4 discusses our preprocessing component necessary to handle the rather complex<br />

German morphology and well-known challenges, such as proper names, in lexical<br />

chaining. Section 5 discusses our chaining based disambiguation experiments. Section<br />

6 presents a short overview of eight semantic relatedness measures and compares their<br />

values with the results of a human judgment experiment that we conducted. Section 7<br />

outlines the evaluation of our chaining with respect to our application scenario and the<br />

project context. Section 8 summarizes and concludes the paper.<br />

2 Lexical Chaining<br />

Based on the concept of lexical cohesion [1], computational linguists, e.g. [2], developed<br />

a method to compute partial text representations: lexical chains. To illustrate the idea<br />

an annotation is given as an example in Fig. 2. It shows that lexical chaining is achieved<br />

by the selection of vocabulary and significantly accounts for the cohesive structure of<br />

a text passage. The chains span over passages linking lexical items, where the linking<br />

is based on the semantic relations existing between them. Typical semantic relations<br />

considered in this context are synonymy, antonymy, hyponymy, hypernymy, meronymy<br />

and holonymy as well as complex combinations of these which are computed on the<br />

basis of lexical semantic resources such as WordNet [3]. In addition to WordNet, which<br />

has been used in the majority of cases e.g. [4], [5], [6], Roget’s Thesaurus [2] and<br />

GermaNet [7] have already been applied.
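A minimal greedy chainer in the spirit of these systems can be sketched as follows; the relation table is a toy stand-in for WordNet/GermaNet lookups, and real chainers additionally score relation strength and disambiguate senses:

```python
# Toy relation table standing in for lexical-semantic lookups
# (pairs are illustrative, loosely based on the Fig. 2 example).
RELATED = {frozenset(p) for p in [("beech-tree", "leaf"), ("leaf", "leaves"),
                                  ("rest", "tired"), ("tired", "asleep")]}

def related(a, b):
    return a == b or frozenset((a, b)) in RELATED

def chain(tokens):
    """Greedily attach each token to the first chain containing
    a semantically related member; otherwise open a new chain."""
    chains = []
    for tok in tokens:
        for c in chains:
            if any(related(tok, member) for member in c):
                c.append(tok)
                break
        else:
            chains.append([tok])
    return chains

out = chain(["beech-tree", "rest", "tired", "leaf", "asleep", "leaves"])
assert out == [["beech-tree", "leaf", "leaves"], ["rest", "tired", "asleep"]]
```

The unsystematic relations noted in Fig. 2 (e.g., "yellow/golden/brown" with "leaves") are exactly those such a table-driven chainer misses.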


122 Irene Cramer and Marc Finthammer<br />

Jan sat down to rest at the foot of a huge beech-tree.<br />

Now he was so tired that he soon fell asleep;<br />

and a leaf fell on him, and then another, and then<br />

another, and before long he was covered all over<br />

with leaves, yellow, golden and brown.<br />

Chain 1: sat down, rest, tired, fell asleep<br />

Chain 2: beech-tree, leaf, leaves<br />

Unsystematic relations not yet considered in<br />

resource for lexical chaining: foot / huge – beech-tree;<br />

yellow / golden / brown – leaves<br />

Fig. 2. Chaining example adapted from [1]<br />

Several natural language applications, such as text summarization e.g. [8], [9], malapropism<br />

recognition [4], automatic hyperlink generation e.g. [5], question answering e.g.<br />

[10] and topic detection/topic tracking e.g. [11] benefit from lexical chains as a valuable<br />

text representation.<br />

In this paper we present the evaluation of our own implementation of a lexical<br />

chainer for German, GLexi, which is based on the algorithms described by [4] and<br />

[8] and was developed to support the extraction of thematic structures and topic<br />

development. Like most systems, GLexi consists of the fundamental modules shown in<br />

Table 1, which reveals that preprocessing – that is, the selection of the so-called chaining<br />

candidates and the determination of relevant information about these candidates, like text<br />

position and part-of-speech – plays a major role in the whole process. A chaining candidate<br />

is the fundamental chain element; it is a token comprised of all bits of information<br />

belonging to it.<br />
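Chaining candidate selection on pre-tagged input might look like this; the tagged sentence and the STTS-style tag filter are illustrative, and a real pipeline would first run the tokenization, POS tagging, and chunking listed in Table 1:

```python
# Candidate selection on pre-tagged tokens. The sentence and the
# STTS-style noun tags (NN, NE) are illustrative examples.
TAGGED = [("Der", "ART"), ("Hypertext", "NN"), ("wird", "VAFIN"),
          ("segmentiert", "VVPP"), ("Textstruktur", "NN")]

def chaining_candidates(tagged, keep_tags=("NN", "NE")):
    """Keep nouns and named entities as chaining candidates,
    recording each candidate's text position."""
    return [(i, tok) for i, (tok, tag) in enumerate(tagged)
            if tag in keep_tags]

assert chaining_candidates(TAGGED) == [(1, "Hypertext"), (4, "Textstruktur")]
```

The recorded positions are part of the information bundle that makes up a chaining candidate.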

We argue that a sophisticated preprocessing may enhance coverage, which is acknowledged<br />

to be a crucial aspect in the development of a lexical chaining system e.g.<br />

[5], [8], and [4]. Accordingly, we address several ideas to improve the coverage of our<br />

system. At least two issues independent of language influence this aspect:<br />

– limitations imposed on the whole process by the size and coverage of the lexical<br />

semantic resource used,<br />

– and the presence of proper names in the text, which cannot be resolved without<br />

extensive preprocessing.



Table 1. Overview of chainer modules<br />

Module | Subtasks<br />

preprocessing of corpora | chaining candidate selection: determine chaining window, sentence boundaries, tokens, POS-tagging, chunks, etc.<br />

core chaining algorithm: calculation of chains or meta-chains | lexical semantic resource look-up (e.g., WordNet), scoring of relations, sense disambiguation<br />

output creation | rating/scoring of chain strength; build application-specific representation<br />

However, coverage is even more critical for German because<br />

– of its complex morphology (e.g. inflection and word formation)<br />

– and the smaller coverage of GermaNet in comparison to WordNet.<br />

Both aspects as well as coverage in general are discussed in detail in the following<br />

sections.<br />

In order to formally evaluate the performance – in terms of precision and recall – of<br />

GLexi for various parameter settings a (preferably standardized and freely available)<br />

test set would be required. To our knowledge there is no such resource – neither for English<br />

nor for German. Therefore, we have started to investigate the development of such<br />

a gold standard for German corpora. Initial results are discussed in [12]. Our experiments<br />

show that the manual annotation of lexical chains is a demanding task, which has<br />

also been emphasized in the work by [13], [14] and [15]. The rich interaction between<br />

various principles to achieve a cohesive text structure seems to distract annotators. We<br />

therefore argue that the evaluation of a lexical chainer might be best performed in four<br />

steps:<br />

– evaluation of coverage: amount of chaining candidates the chainer is able to<br />

process,<br />

– evaluation of disambiguation quality: number of chaining candidates correctly<br />

disambiguated with respect to lexical semantic resource,<br />

– evaluation of quality of semantic relatedness measures: comparison with human<br />

judgment,<br />

– evaluation of chains with respect to concrete application.<br />

This procedure ensures that the most relevant parameters in the evaluation of our system,<br />

GLexi, can be judged separately and also enables us to gain the necessary insight<br />

into the chaining process.
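The first step, coverage, reduces to a simple ratio; the candidate list and resource below are toy examples, not our GermaNet figures:

```python
# Coverage: the share of chaining candidates found in the lexical
# semantic resource (toy data, not our GermaNet results).
def coverage(candidates, resource):
    found = sum(1 for c in candidates if c in resource)
    return found / len(candidates)

resource = {"Hypertext", "Text", "Struktur"}
candidates = ["Hypertext", "Text", "Topikalisierung", "Struktur"]
assert coverage(candidates, resource) == 0.75
```

Steps two to four (disambiguation quality, relatedness quality, application fit) are measured analogously against gold annotations, human judgments, and the target application, respectively.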



3 Resources<br />

We based the evaluation of our system and all experiments described in this paper on<br />

three main resources: GermaNet as the lexical semantic lexicon for our chainer, the<br />

HyTex project corpus and a set of word pairs compiled in a human judgment experiment<br />

for the evaluation steps discussed in Sect. 6.2.<br />

3.1 GermaNet<br />

GermaNet [16] is a machine readable lexical semantic lexicon for the German language<br />

developed in 1996 within the LSD Project at the Division of Computational Linguistics<br />

of the Linguistics Department at the University of Tübingen. Version 5.0 covers<br />

approximately 77,000 lexical units – nouns, verbs, adjectives and adverbs as well as<br />

some multi word units – grouped into approximately 53,500 so-called synonym sets.<br />

GermaNet contains approximately 4,000 lexical (between lexical units) and approximately<br />

64,000 conceptual (between synonym sets) connections. Although it has much<br />

in common with the English WordNet [3] there are some differences; see [17] for more<br />

information about this issue. The most important difference in our opinion is the fact<br />

that GermaNet is much smaller than WordNet, which has a negative impact on the coverage.<br />

However, we found that none of the other differences, such as the presence of<br />

artificial concepts, has much influence on the results of our chainer.<br />

3.2 Corpus<br />

For the evaluation steps mentioned in Sect. 2 we used a part of the HyTex corpus, which<br />

contains 130 documents (approximately 3 million words). It was compiled and in parts<br />

manually annotated in project phase I; see [18] for more information. The HyTex corpus<br />

consists of three subcorpora: the so-called core corpus, supplementary corpus and statistics<br />

corpus. The corpora contain scientific papers, technical specifications, tutorials and<br />

textbook chapters, as well as FAQs about language technology and hypertext research.<br />

In the core corpus logical text structure is marked, for example the organization of<br />

documents into chapters, sections, passages, figures, footnotes, tables etc. is annotated<br />

using DocBook-based XML tags; see [19] for more information. In order to split the<br />

documents into chainable sections, we used the core corpus and segmented the documents<br />

according to this annotation. The homogeneity and relevance of a chain largely<br />

depends on its length and thus on the length of the underlying text. We found the average<br />

length of a section to be adequate for chaining of our domain-specific corpus.<br />

We also decided to select only nouns and noun phrases as chaining candidates because<br />

our experiments revealed that terminology plays the key role in scientific and technical<br />

documents.
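As an illustration of this segmentation step, the following minimal Python sketch splits a DocBook-style document into section texts using only the standard library; the tag names and the sample document are invented for illustration and do not reflect the actual HyTex annotation scheme.<br />

```python
import xml.etree.ElementTree as ET

def split_into_sections(docbook_xml, section_tag="sect1"):
    """Return the whitespace-normalized text of each section element."""
    root = ET.fromstring(docbook_xml)
    sections = []
    for sect in root.iter(section_tag):
        # itertext() yields all text inside the section, including titles.
        sections.append(" ".join(" ".join(sect.itertext()).split()))
    return sections

doc = """<chapter>
  <sect1><title>Hypertext</title><para>Early hypertext research ...</para></sect1>
  <sect1><title>Chaining</title><para>Lexical chaining builds on ...</para></sect1>
</chapter>"""
print(split_into_sections(doc))
```

Each returned string would then be passed to the chainer as one chainable section.<br />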


An Evaluation Procedure for Word Net Based Lexical Chaining... 125<br />

3.3 Set of Word Pairs<br />

In order to evaluate the quality of a relatedness measure, a set of pre-classified word<br />

pairs (in our case for German) is necessary. In previous work for English, most researchers<br />

used Rubenstein and Goodenough’s list [20] or Miller and Charles’s list [21].<br />

For German there are – to our knowledge – three sets of word pairs: a translation of<br />

Rubenstein and Goodenough’s list by [22], a manually generated set of 350 word pairs<br />

by [23], and a semi-automatically generated set by [24]. Unfortunately, we could not<br />

find any of these German sets published. We also argue that the translation of a list<br />

constructed originally for English subjects might bias the results and therefore decided<br />

to compile our own set of word pairs as can be seen in Table 2. The goal was to cover a<br />

wide range of relatedness types, i.e. systematic and unsystematic relations, and relatedness<br />

levels, i.e. various degrees of relation strength. We also included nouns of diverse<br />

semantic classes, e.g. abstract nouns, such as das Wissen (Engl. knowledge), and<br />

concrete nouns, such as das Bügeleisen (Engl. flat-iron). We thus constructed a<br />

list of approximately 320 word pairs, picked 100 of these to evenly meet the constraints<br />

mentioned above and randomized them. We also included words which occur in more than<br />

one word pair (up to 8 times); these are grouped into consecutive blocks. We asked<br />

35 subjects to rate the word pairs on a 5-level scale (0 = not related to 4 = strongly related).<br />

The subjects were instructed to base the rating on their intuition about any kind of<br />

conceivable relation between the two words. We used this list and the human judgment<br />

to evaluate the semantic relatedness measures described in Sect. 6.1.<br />
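A measure is typically scored against such judgments by averaging the 35 ratings per pair and rank-correlating the resulting means with the measure's values. A minimal sketch; the ratings and measure outputs below are invented, and the Spearman formula ignores ties:<br />

```python
def mean_rating(ratings):
    """Average a list of 0-4 relatedness judgments for one word pair."""
    return sum(ratings) / len(ratings)

def ranks(xs):
    """1-based rank of each value (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation without tie correction."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

human_means = [mean_rating([4, 4, 3]), mean_rating([2, 2, 3]), mean_rating([0, 0, 1])]
measure_vals = [0.9, 0.5, 0.1]              # invented measure outputs
print(spearman(human_means, measure_vals))  # 1.0 when the rankings agree
```
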

4 Evaluation Phase I – Preprocessing Methods<br />

We conducted several experiments to investigate the coverage of GermaNet and thus<br />

the coverage of GLexi. We found that GermaNet contains 56.42% of the 28,772 noun<br />

tokens mentioned in the corpus. We concluded from an analyzed sample that this coverage<br />

issue stems from the rich German morphology, domain-specific terminology and<br />

proper names, the latter two of which are not sufficiently covered by GermaNet. We therefore implemented<br />

the preprocessing architecture shown in Fig. 3. A document is first segmented<br />

into sections and then split into sentences and tokens. In addition, for each<br />

token a list of features is extracted, such as position in the document (with respect to<br />

sentence and section), part-of-speech, lemma, and morphology 2 . On this basis the preprocessing<br />

component generates one or several alternative chaining candidates, e.g. the<br />

first alternative would be the singular instead of a plural, like for cats ⇒ cat. The second<br />

alternative considers compounds when applicable. Since our corpus is very rich in<br />

compounds this plays a major role in the implementation of our system and is discussed<br />

in more detail in Sect. 4.1. Technical terminology and proper names are also considered<br />

separately as alternatives.<br />

2 For our study we used the Insight Discoverer TM Extractor Version 2.1. (cf. http://www.temisgroup.com/).<br />

We thank the TEMIS group for kindly permitting us to use this technology in the<br />

framework of our project.
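The candidate-generation logic just described can be sketched as a simple fallback chain. The toy lexicon below stands in for a GermaNet look-up, and all entries are illustrative:<br />

```python
# Toy stand-in for a GermaNet look-up; all entries are invented.
LEXICON = {"cat", "mat", "shake", "milk"}

def chaining_candidates(token, lemma, compound_parts=None):
    """Look-up alternatives in order of preference: original form,
    lemma, then the compound head (the rightmost component)."""
    alternatives = [token, lemma]
    if compound_parts:
        alternatives.append(compound_parts[-1])
    return alternatives

def select_chaining_element(token, lemma, compound_parts=None):
    """Return the first alternative covered by the resource, else None."""
    for alt in chaining_candidates(token, lemma, compound_parts):
        if alt in LEXICON:
            return alt
    return None  # uncovered; left to the NER / terminology handling

print(select_chaining_element("cats", "cat"))                                # cat
print(select_chaining_element("milkshake", "milkshake", ["milk", "shake"]))  # shake
```
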



Table 2. Word pairs and human judgment mean value<br />

Word 1 Word 2 Mean Value Word 1 Word 2 Mean Value<br />

Nahrungsmittel Essen 3.94 Sonne Strom 2.51<br />

Wasser Flüssigkeit 3.94 Wasser Nebel 2.49<br />

Eltern Kind 3.86 Wasser Trockenheit 2.43<br />

Blume Pflanze 3.86 Schwimmbad Ferien 2.40<br />

Angst Furcht 3.86 Kino Theater 2.40<br />

Kamin Schornstein 3.80 Nahrungsmittel Tier 2.34<br />

Blume Tulpe 3.80 Wissen Alter 2.31<br />

Sonne Sommer 3.71 Würfel Mathematik 2.23<br />

Blume Duft 3.69 Mensch Hund 1.91<br />

Wasser Fisch 3.69 Wasser Palme 1.89<br />

Mensch Lebewesen 3.66 Schwimmbad Ausdauer 1.77<br />

Schwimmbad Bademeister 3.63 Würfel Betrug 1.57<br />

Riese Gigant 3.63 Würfel Kugel 1.49<br />

Mitarbeiter Kollege 3.60 Nahrungsmittel Jahreszeit 1.46<br />

Behandlung Therapie 3.54 Schwimmbad Eis 1.43<br />

Lampe Leuchte 3.49 Wüste Quelle 1.34<br />

Entdecker Expedition 3.49 Mensch Weltraum 1.26<br />

Ozean Tiefe 3.46 Wetter Hoffnung 1.26<br />

Wahl Demokratie 3.43 Licht Bremse 1.17<br />

Badekappe Schwimmer 3.40 Nahrungsmittel Zahn 1.11<br />

Würfel Zufall 3.37 Schwimmbad Stadt 1.09<br />

Wissen Kenntnis 3.34 Wissen Vergnügen 1.03<br />

Schwimmbad Becken 3.31 Beschleunigung Lautstärke 1.03<br />

Würfel Spiel 3.31 Geographie System 0.80<br />

Nahrungsmittel Hunger 3.31 Computer Hotel 0.71<br />

Bewegung Tanz 3.26 Pflanze Klebstoff 0.54<br />

Kälte Wärme 3.20 Datum Auslastung 0.54<br />

Mensch Verstand 3.20 Sonne Arzt 0.31<br />

Nahrungsmittel Restaurant 3.20 Glaube Rennen 0.29<br />

Wissen Schule 3.17 Mensch Wolke 0.20<br />

Zuverlässigkeit Freundschaft 3.17 Sonne Dirigent 0.17<br />

Politiker Bürgermeister 3.17 Nation Garten 0.17<br />

Wissen Quiz 3.09 Mittagessen Becken 0.17<br />

Blume Wasser 3.09 Farbe Richter 0.14<br />

Herbst Winter 3.03 Volk Punkt 0.11<br />

Kontinent Landkarte 3.03 Richtung Lied 0.11<br />

Sonne Leben 3.00 Schleuder Schallplatte 0.09<br />

Wissen Intelligenz 3.00 Löffel Baum 0.09<br />

Märchen Geschichte 2.94 Nahrungsmittel Kabel 0.09<br />

Sonne Stern 2.91 Hitze Familie 0.09<br />

Unterhaltung Programm 2.91 Wasser Rundfunk 0.09<br />

Etage Wohnung 2.83 Rausch Monat 0.06<br />

Wasser Pirat 2.80 Tasse Motor 0.03<br />

Treppe Aufzug 2.77 Dach Wal 0.03<br />

Haushalt Ordnung 2.74 Schwimmbad Gabel 0.03<br />

Blume Honig 2.74 Gardine Bleistift 0.03<br />

Blume Liebe 2.71 Oase Bügeleisen 0.03<br />

Nahrungsmittel Händler 2.66 Wäscheleine Toastbrot 0.03<br />

Mensch Krankheit 2.57 Würfel Wasser 0.03<br />

Tür Fenster 2.54 Flosse Drucker 0.00



Table 3. Coverage of GermaNet<br />

The approximately 29,000 (noun) tokens in our corpus split into<br />

– 56% in GermaNet<br />

– 44% not in GermaNet; of these: 15% inflected, 12% compounds, 17% small, uncovered classes (see Table 4)<br />

[Fig. 3 (flowchart): an input text ("The cats Tom and Lucy lie on the mat and drink a milkshake. Suddenly, …") passes through preprocessing, which produces chaining candidates with candidate features (e.g. cats → cat, NN; Tom → Tom, NE; Lucy → Lucy, NE; mat → mat, NN; milkshake → milk|shake, NN); a GermaNet look-up then checks whether the original or an alternative is covered; the chaining elements and their features are selected (e.g. cats → cat; Tom/Lucy → NE; milkshake → shake) and passed to the chaining step, which outputs the chains.]<br />

Fig. 3. Preprocessing architecture



4.1 German Morphology<br />

Compared to English, the German noun morphology is relatively complex: especially<br />

the presence of four cases and compounds, which are written as one word and not<br />

divided by blanks, plays a major role in our chaining system.<br />

Notes on German inflection: In order to ensure that inflected nouns can be handled<br />

accurately we rely on lemmatization. Inflection in German means four cases and<br />

singular/plural forms.<br />

Coverage improvement on the basis of inflection processing: On the basis of our<br />

lemmatization step, we were able to replace approximately 15% of the nouns by their<br />

lemmata and could thus increase the coverage to 71%.<br />

Open Issues: However, we found that there are some cases in which the original<br />

(plural) form in the text should not be normalized to its singular form, e.g. the German<br />

word Daten (Engl. data or dates) can be lemmatized to Datum (Engl. date); the same<br />

holds for Medien (Engl. media) and Medium (Engl. psychic, data carrier). Thus, when<br />

lemmatized the words change their meaning. Moreover, the plural form is not included<br />

in GermaNet. Consequently, our system uses as a chaining element the first alternative<br />

of the original, e.g. Datum instead of Daten. Of course, in our domain specific corpus<br />

Daten (Engl. data) and Medien (Engl. media) are frequent words (Daten occurred<br />

78 times in the corpus, Medien 41 times), which serve in the chains as glue for a list<br />

of other chaining elements and therefore need to be carefully considered. In addition,<br />

lemmatization is not very reliable for compounds. Nevertheless, we think that the results<br />

mentioned above emphasize that this preprocessing step is a necessary aspect to<br />

improve the coverage of a baseline chaining system.<br />

Notes on German compounds: Compounds are frequent in our limited domain<br />

corpus. Two or more (free) morphemes are combined into one word, the compound,<br />

e.g. Druckerpatrone (components: Drucker and Patrone; Engl. ink cartridge).<br />

Sometimes, the components are additionally joined by a so-called Fugenelement (Engl.<br />

gap element), e.g. Liebeslied (components: Liebe and Lied, gap element: s;<br />

Engl. love song). Typically, the complete compound inherits the grammatical features,<br />

such as gender, of its last – so-called head – component, i.e. the one at the rightmost position,<br />

e.g. das Lied (gender: neuter; Engl. song) and das Liebeslied (gender:<br />

neuter), while it is die Liebe (gender: feminine; Engl. love). In addition to these<br />

grammatical features of compounds in German there are at least two semantically motivated<br />

classes: the semantically transparent and the intransparent compounds. Semantically<br />

transparent describes a compound for which the meaning of the whole can be<br />

deduced from the meaning of its parts, e.g. a Liebeslied (Engl. love song) is a kind<br />

of Lied (this component is the head of the compound; Engl. song), where the component<br />

Liebe (Engl. love) can be seen as the modifier of the head. In contrast, the meaning<br />

of a semantically intransparent compound cannot be deduced from its parts, e.g.<br />

Rotkehlchen (Engl. robin; components: rot, Engl. red, and Kehlchen, which<br />

can be split into Kehle, Engl. throat and -chen diminutive suffix). An ideal lexical<br />

semantic resource would cover all intransparent compounds, whereas the transparent<br />

ones would not necessarily be included since it is possible to derive their meaning<br />

intellectually or automatically. In principle GermaNet accounts for this rule; however, there<br />

are as always some compounds which are not included.<br />

Coverage improvement on the basis of compound processing: On the basis of<br />

the morphological analysis we were able to include previously uncovered words, i.e.<br />

approximately 12% of the nouns could be replaced by their compound head word (e.g.<br />

Liebeslied would be replaced with Lied) and thus increase the coverage to 83%.<br />

Open Issues: However, this step has at least two major drawbacks. First, the morphological<br />

analysis generated by the Insight Discoverer TM Extractor Version 2.1 contains<br />

all possible readings, e.g. the German word Agrarproduktion (Engl. agricultural<br />

production) might be split among other things into Agrar (Engl. agricultural),<br />

Produkt (Engl. product) and Ion (Engl. ion [chem.]). The automatic selection of a<br />

correct reading is in some cases demanding and the effect on the whole chaining process<br />

might be severe – e.g. given the word Produktion and the morphological analysis<br />

mentioned the chainer could decide to replace the word Produktion, given it cannot<br />

be found in GermaNet, with the word Ion, which could completely mislead the disambiguation<br />

of word sense in the chaining and thus the whole chaining process itself. Second,<br />

compounds containing more than two components could be split into several headwords,<br />

e.g. the head-word of the compound Datenbankbenutzerschnittstelle<br />

(Engl. data base user interface) could be Benutzerschnittstelle (Engl. user interface)<br />

or Schnittstelle (Engl. interface) or even only Stelle (Engl. position<br />

or area 3 ). In our future work, we therefore plan to investigate which parameter settings<br />

might be ideal on the one hand to improve the coverage and on the other hand<br />

to account for semantic disambiguation performance. Nevertheless, we think that morphological<br />

analysis of compounds is a crucial aspect in the preprocessing of our lexical<br />

chainer.<br />
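A naive version of such a compound analysis can be sketched as follows. This rightmost-longest-head heuristic with a small list of linking elements is purely illustrative; the paper relies on the analyses produced by the Insight Discoverer Extractor instead:<br />

```python
def split_compound(word, lexicon, linking_elements=("s", "es", "n", "en")):
    """Naive split of a German compound into (modifier, head).

    Prefers the longest head (earliest split point) and allows one
    Fugenelement at the end of the modifier. Returns None if no split
    against the lexicon is found. Toy heuristic for illustration only.
    """
    word_l = word.lower()
    for i in range(1, len(word_l) - 1):
        head = word_l[i:]
        if head not in lexicon:
            continue
        modifier = word_l[:i]
        if modifier in lexicon:
            return modifier, head
        # Allow a linking element (Fugenelement) on the modifier.
        for fuge in linking_elements:
            if modifier.endswith(fuge) and modifier[: -len(fuge)] in lexicon:
                return modifier[: -len(fuge)], head
    return None

lex = {"liebe", "lied", "drucker", "patrone"}
print(split_compound("Liebeslied", lex))      # ('liebe', 'lied')
print(split_compound("Druckerpatrone", lex))  # ('drucker', 'patrone')
```

Preferring the earliest split point keeps the head as long as possible, which mirrors the ambiguity discussed above for multi-component compounds.<br />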

4.2 Smaller Classes of Uncovered Material<br />

As Table 3 shows, with our first preprocessing step we were able to include approximately<br />

27% of the words, which we could initially not find in GermaNet, i.e. approximately<br />

15% on the basis of lemmatization and approximately 12% on the basis of compound<br />

analysis. We examined a sample of the remaining 17%, the results are shown in<br />

Table 4. We found in the sample approximately 15% proper names, approximately 30%<br />

foreign words, especially technical terminology in English, approximately 25% abbreviations,<br />

and approximately 20% nominalized verbs, which are not sufficiently included<br />

in GermaNet and very prominent in German technical documents. The rest (not shown<br />

in Table 4) consists of incorrectly tokenized or POS-tagged material, such as broken<br />

web links.<br />

No matter which language is considered, proper names are a well-known challenge<br />

in lexical chaining, e.g. [5]. They are semantically central items in most corpora and<br />

therefore need to be handled with care. The same holds for technical terminology, in<br />

3 Note: This is the correct though in this context semantically inadequate translation.



Table 4. Detailed analysis of small classes not covered by GermaNet<br />

The small, uncovered classes (see Table 3) split into<br />

15% proper names 30% foreign words 25% abbreviations 20% nominalized verbs<br />

many cases multi-word units, which are obviously very frequent and relevant in technical<br />

and academic documents. We deal with both in the second phase of our preprocessing<br />

component. However, note that we only treat the classical named entities, i.e. names<br />

belonging to people, locations, and organizations. We do not yet cover other proper<br />

names.<br />

We included the recognition of proper names and multi-word units in our preprocessing.<br />

After the basic preprocessing, such as sentence boundary detection, tokenization<br />

and lemmatization, which is accomplished by the Insight Discoverer TM Extractor<br />

Version 2.1, we run the second preprocessing phase, which splits into the following two<br />

subtasks:<br />

– Proper name recognition and classification: We use a simple named entity recognizer<br />

(NER) for German 4 , which tags person names, locations, and organizations.<br />

– Simple chunking of multi-word units and simple phrases: We use the part-of-speech<br />

tags computed in the first preprocessing step by the Insight Discoverer TM Extractor<br />

Version 2.1 to construct simple phrases.<br />

Of course, these are interim solutions, and we plan to investigate strategies to improve<br />

the second preprocessing phase in our future work. Because we found names of<br />

conferences and product names to be relatively frequent, we intend to extend our NER<br />

system accordingly. Most of the technical terminology in our corpus is not included<br />

in GermaNet and could thus not be considered in the chaining. However, in the HyTex<br />

project we developed a terminological lexicon for our corpus (called TermNet), see [25]<br />

and [26], which we plan to use in addition to GermaNet. Ultimately, we hope this will<br />

again improve the coverage of our chainer. While it is thus far unclear how to handle<br />

nominalized verbs and abbreviations, the statistics shown in Table 4 emphasize their<br />

relevance, and they certainly need to be considered with care in our future work.<br />

To conclude, without any preprocessing only 56% of the noun tokens in our corpus<br />

are chainable. Approximately 67% of the remaining nouns can be handled with morphological<br />

analysis and a very simple NER system. The remaining approximately 33%<br />

consists of abbreviations, foreign words, nominalized verbs and broken material<br />

as well as not yet covered proper names and technical terminology, which we intend<br />

to deal with in an expansion of our lexical semantic resource, i.e. in a combination of<br />

GermaNet and TermNet, statistical relatedness measures based on web counts and a<br />

refinement of our preprocessing components.<br />

4 It is our own machine learning based implementation of a simple NER system.
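These percentages can be reproduced from Tables 3 and 4 with a small arithmetic check; the figures below are the rounded shares quoted above (the 56% is the rounded 56.42%):<br />

```python
# Rounded token shares from Tables 3 and 4 (fractions of all noun tokens).
in_germanet = 0.56
inflected = 0.15          # recovered via lemmatization
compounds = 0.12          # recovered via compound head replacement
small_classes = 0.17      # remaining small, uncovered classes
proper_name_share = 0.15  # share of the small classes that are proper names

not_in_germanet = inflected + compounds + small_classes
ner_covered = proper_name_share * small_classes  # handled by the simple NER

coverage_after_morphology = in_germanet + inflected + compounds
handled_share = (inflected + compounds + ner_covered) / not_in_germanet

print(round(coverage_after_morphology, 2))  # 0.83, the 83% from Sect. 4.1
print(round(handled_share, 2))              # ~0.67 of the remaining nouns
```
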



5 Evaluation Phase II – Chaining-based Word Sense<br />

Disambiguation<br />

In addition to the coverage issues described in Sect. 4 word sense disambiguation has<br />

a high impact on the performance of a lexical chainer. That is, if incorrectly disambiguated,<br />

a word with several word senses, such as bank or mouse, could mislead the<br />

complete chaining algorithm and cause the construction of inappropriate chains. As a<br />

matter of course, the disambiguation performance of a chainer is not able to outperform<br />

high-quality WSD systems, such as presented at the Senseval workshops, and it is not<br />

our purpose to compete against these systems but to locate potential sources of error in<br />

the chaining procedure. Consequently, the second step in our evaluation procedure is<br />

related to word sense disambiguation, in our case the selection of an appropriate synonym<br />

set in GermaNet. In principle, there are at least two different methods: the greedy<br />

selection of a word sense and the subsequent selection. Greedy word sense disambiguation<br />

means to choose the first matching synonym set which exhibits a suitable path or<br />

a semantic relatedness measure value. In contrast, subsequent disambiguation, see e.g.<br />

[9], means to first assemble all possible readings, i.e. all in principle suitable paths or<br />

semantic relatedness measure values, and then, given this information, select the best<br />

match. However, both methods have their pros and cons: the greedy selection is simple<br />

and straightforward, but it tends to pick the wrong word sense in cases in which the<br />

correct reading of a word cannot be determined until the rest of the potential chaining<br />

partners are examined. The subsequent word sense disambiguation addresses exactly<br />

this issue, but it is rather complex, especially when several relatedness measures are<br />

to be considered. In addition to these two methods, there are several intermediate strategies between<br />

the greedy and the subsequent disambiguation: e.g. the appropriate synonym set of a<br />

word might be determined on the basis of a majority vote when all possible combinations<br />

containing this word are read. Alternatively, the information content (see Sect.<br />

6.1) might be useful to pick a word sense.<br />
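The two selection strategies can be contrasted in a few lines; the sense inventory and relatedness values below are invented for illustration:<br />

```python
def greedy_sense(senses, partner_senses, rel, threshold=0.5):
    """Greedy selection: take the first sense that exhibits a suitable
    relatedness value to any sense of the chaining partner."""
    for s in senses:
        if any(rel(s, p) >= threshold for p in partner_senses):
            return s
    return None

def subsequent_sense(senses, partner_senses, rel):
    """Subsequent selection: assemble all readings first, then return
    the sense belonging to the globally best reading."""
    best_value, best_sense = max(
        (rel(s, p), s) for s in senses for p in partner_senses
    )
    return best_sense

# Invented relatedness values between sense ids of 'bank' and 'money'.
REL = {("bank/building", "money/currency"): 0.6,
       ("bank/institute", "money/currency"): 0.9}
rel = lambda a, b: REL.get((a, b), 0.0)

senses = ["bank/building", "bank/institute"]  # examination order matters
print(greedy_sense(senses, ["money/currency"], rel))      # bank/building
print(subsequent_sense(senses, ["money/currency"], rel))  # bank/institute
```

The greedy variant stops at the first reading above the threshold, while the subsequent variant still finds the stronger reading.<br />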

Analysis of the chaining-based word sense disambiguation: In lexical chaining,<br />

the disambiguation is essentially based on the selection of a word sense with respect<br />

to a path or relatedness measure value between synonym sets. For example, a pair of<br />

words A, with three senses, and B, with two senses, has six possible readings: thus, the<br />

probability to pick the correct one is only 1/6. The more senses a word pair exhibits,<br />

the likelier it is to pick an incorrect reading for at least one of the two words. Table 5<br />

shows the distribution of word senses for the (noun) tokens 5 in our corpus. Obviously,<br />

almost every second token features more than one word sense in GermaNet. That means<br />

in the worst case every second token can in principle mislead the chainer in the case of<br />

an incorrect disambiguation.<br />

5 We consider tokens instead of types because in principle every single occurrence of a word<br />

might exhibit a different word sense. We have such examples in our corpus, e.g. in one sentence<br />

the word text is used with three different senses.



Table 5. Overview of the number of word senses occurring in our corpus<br />

1 sense 2 senses 3 senses 4 senses > 4 senses<br />

∼ 53% ∼ 22% ∼ 15% ∼ 7% ∼ 3%<br />

word A | word B | sense of A | sense of B | Wu-Palmer value | rank<br />

Text | Hypertext | 1 | 1 | 0.9231 | 1<br />

Text | Hypertext | 2 | 1 | 0.8333 | 2<br />

manually annotated (correct) word sense: Text – sense 1, Hypertext – sense 1<br />

best Wu-Palmer value = correct word sense (rank 1)<br />

Fig. 4. Example ranking of the various readings<br />

However, it is the basic idea of lexical chaining that lexicalized coherence in the<br />

text accounts for the mutually correct disambiguation of the words in a pair. In order to<br />

investigate the disambiguation quality, we randomly selected a corpus sample and computed<br />

the relatedness values. We then ranked the possible readings for each word pair<br />

according to their relatedness values. An example is shown in Fig. 4. We evaluated this<br />

against our manual annotation of word senses. The results are shown in Table 6. The<br />

three best relatedness measures in this context, Resnik, Wu-Palmer and Lin, correctly<br />

disambiguate approximately 50% of the word pairs in our sample. For all eight measures<br />

the correct reading is on the first four ranks in the majority of the cases. Although<br />

this disambiguation accuracy is only mediocre, it outperforms the baseline (approximately<br />

39% correct disambiguation on rank 1), i.e. the performance of a chainer using<br />

the information content of a word to disambiguate its word sense. As mentioned above<br />

an additional alternative method to select the correct word sense is the majority voting:<br />

for a list of word pairs with one given word and all possible chaining partners in<br />

the text (e.g. mouse - computer, mouse - hardware, mouse - keyboard, mouse - etc.),<br />

the word sense, which is supported by most of the top-ranked relatedness measure values,<br />

is supposed to be the correct one. Our experiments showed that a majority voting<br />

is able to enhance the accuracy and bring the rate in some cases up to 63% correct<br />

disambiguation. We plan to investigate in our future work how we can again improve<br />

the disambiguation quality of our chainer. We especially plan to explore the method of<br />

meta-chaining proposed in [9] and to adapt it for a multiple relatedness measure<br />

chaining framework. In addition, the integration of a WSD system might positively influence<br />

the performance of our chainer.<br />
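The majority voting scheme described above can be sketched as follows, using the mouse example; the sense labels and top-ranked readings are invented:<br />

```python
from collections import Counter

def majority_vote_sense(partners, top_sense_for_pair):
    """Each chaining partner 'votes' with the target-word sense that its
    top-ranked reading supports; the most frequent sense wins."""
    votes = Counter(top_sense_for_pair(p) for p in partners)
    return votes.most_common(1)[0][0]

# Invented top-ranked readings for 'mouse' paired with other corpus words.
TOP = {"computer": "mouse/device", "hardware": "mouse/device",
       "keyboard": "mouse/device", "cat": "mouse/animal"}

winner = majority_vote_sense(["computer", "hardware", "keyboard", "cat"], TOP.get)
print(winner)  # mouse/device (3 votes against 1)
```
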

Table 6. Overview of semantic relatedness-based disambiguation performance<br />

correct disamb. on | Graph Path | Tree Path | Wu-Palmer | Leacock-Chodorow<br />

rank 1 | 34.93% | 42.13% | 50.67% | 34.93%<br />

rank 1 – 4 | 79.20% | 80.80% | 86.40% | 79.20%<br />

correct disamb. on | Hirst-StOnge | Resnik | Jiang-Conrath | Lin<br />

rank 1 | 17.07% | 57.60% | 37.60% | 50.13%<br />

rank 1 – 4 | 19.20% | 88.80% | 77.87% | 87.20%<br />

6 Evaluation Phase III – Semantic Relatedness and Similarity<br />

The third step in our evaluation procedure is related to the semantic measures, which are<br />

calculated on the basis of a lexical semantic resource (and word frequency counts) and<br />

used in the construction of lexical chains. A semantic measure expresses how much two<br />

words have to do with each other. The notion of semantic measure is controversially<br />

discussed in the literature, e.g. [27]. The two most relevant terms in this context are<br />

semantic similarity and semantic relatedness, defined according to [27] as follows:<br />

– Semantic similarity: Word pairs are considered to be semantically similar if any<br />

synonymy or hypernymy relations hold. (Examples: forest - wood ⇒ synonymy,<br />

flower - rose ⇒ hypernymy, rose - oak ⇒ common hypernym: plant)<br />

– Semantic relatedness: Word pairs are considered to be semantically related if any<br />

systematic relation, such as synonymy, antonymy, hypernymy, holonymy, or any<br />

unsystematic relation holds. Compared to the semantic similarity measures this is<br />

the more general concept, as it includes any intuitive association or linguistically<br />

formalized relation between words. (Examples: flower - gardener or monkey - banana<br />

⇒ intuitive association, tree - branch ⇒ holonymy, day - night ⇒ antonymy)<br />

According to the definition by [27], semantic similarity is a subtype of semantic relatedness;<br />

in the following section we discuss various relatedness measures. In order<br />

to explore these measures and their relevant characteristics, we used the results of our<br />

human judgment experiment described in Sect. 3.3.<br />

6.1 GermaNet-based Semantic Relatedness Measures<br />

We expect that good lexical chains include systematic and unsystematic relations, a<br />

position which has also been stressed by the experiments reported in [13] and [14].



In fact, most of the established measures merely consider synonymy and hypernymy.<br />

Therefore, they actually fall under the notion of semantic similarity.<br />

Figure 5 outlines how the calculation of the relatedness measures interacts with the<br />

chaining algorithm and the semantic resource. When the preprocessing is completed,<br />

the chaining algorithm selects chaining candidate pairs, in other words, word pairs, for<br />

which the relatedness needs to be determined (see Fig. 5 – Query 1: relatedness of<br />

word A and B?). Next, the relatedness measure component (RM component) performs<br />

a look-up in the semantic resource in order to extract all available features, such as<br />

shortest path length or information content of a word, which are necessary to calculate<br />

the relatedness value (see Fig. 5 – Query 2: semantic information about A and B?). On<br />

the basis of these features, the RM component computes a value which represents the<br />

strength of the semantic relation between the two words.<br />

[Fig. 5 (flowchart): the preprocessed input text reaches the chaining algorithm, which sends Query 1 ("relatedness of word A and B?") to the relatedness measure component; that component sends Query 2 ("semantic information about A and B?") to the semantic resource; the results flow back (result Q2, then result Q1), and the chaining algorithm outputs the chains.]<br />

Fig. 5. Use of relatedness measures in chaining<br />
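The query flow of Fig. 5 can be sketched as a thin mediating component; the feature table and the placeholder measure below are invented and are not one of the eight measures discussed in this paper:<br />

```python
class RelatednessComponent:
    """Mediates between the chaining algorithm (Query 1) and the lexical
    semantic resource (Query 2); a toy sketch of the flow in Fig. 5."""

    def __init__(self, lookup_features, measure):
        self.lookup_features = lookup_features  # Query 2: resource look-up
        self.measure = measure                  # pluggable measure

    def relatedness(self, word_a, word_b):      # Query 1, asked by the chainer
        fa = self.lookup_features(word_a)
        fb = self.lookup_features(word_b)
        if fa is None or fb is None:
            return None                         # word not covered
        return self.measure(fa, fb)

# Invented feature (depth in a toy hyponym-tree) and placeholder measure.
DEPTHS = {"cat": 5, "animal": 2}
rm = RelatednessComponent(DEPTHS.get, lambda a, b: 1.0 / (1 + abs(a - b)))
print(rm.relatedness("cat", "animal"))  # 0.25
print(rm.relatedness("cat", "dog"))     # None: 'dog' missing from the toy data
```
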

The various measures introduced in the literature use different features and therefore<br />

also cover different concepts or aspects of semantic relatedness. We have implemented<br />

eight of these measures, which are briefly sketched below. All eight measures are<br />

based on a lexical semantic resource, in our case GermaNet, and some additionally<br />

utilize a word frequency list 6 .<br />

The first four measures use a hyponym-tree induced from GermaNet. That means,<br />

given GermaNet represented as a graph, we exclude all edges except the hyponyms.<br />

6 We used a word frequency list computed by Dr. Sabine Schulte im Walde on the basis of<br />

the Huge German Corpus (see http://www.schulteimwalde.de/resource.html). We thank Dr.<br />

Schulte im Walde for kindly permitting us to use this resource in the framework of our project.



Since this gives us a forest of nine trees, we then connect them to an artificial root and<br />

thus construct the required GermaNet hyponym-tree.<br />
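This construction, and the tree features the measures below rely on (depth, least common subsumer, shortest path), can be sketched on a toy forest; the synonym-set names are invented, and rel_lc and rel_wp follow Eqs. (1) and (2):<br />

```python
import math

# Toy hyponym forest (child -> parent); GermaNet's nine hyponym trees are
# connected to an artificial ROOT in the same way. All names are invented.
PARENT = {
    "entity": "ROOT", "event": "ROOT",  # two of the nine tree roots
    "plant": "entity", "flower": "plant", "rose": "flower", "oak": "plant",
}

def ancestors(s):
    """Path from s up to ROOT, ordered bottom-up."""
    path = [s]
    while s != "ROOT":
        s = PARENT[s]
        path.append(s)
    return path

def depth(s):
    """Length of the path from ROOT to synonym set s."""
    return len(ancestors(s)) - 1

def lcs(s1, s2):
    """Least common subsumer: deepest vertex subsuming both synonym sets."""
    up1 = set(ancestors(s1))
    for a in ancestors(s2):
        if a in up1:
            return a

def shortest_path(s1, s2):
    c = lcs(s1, s2)
    return (depth(s1) - depth(c)) + (depth(s2) - depth(c))

def rel_lc(s1, s2, tree_depth=4):  # Eq. (1); path clamped to 1 for s1 == s2
    return -math.log(max(shortest_path(s1, s2), 1) / (2 * tree_depth))

def rel_wp(s1, s2):                # Eq. (2)
    return 2 * depth(lcs(s1, s2)) / (depth(s1) + depth(s2))

print(lcs("rose", "oak"))     # plant
print(rel_wp("rose", "oak"))  # 2*2/(4+3)
```
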

– Leacock-Chodorow [28]: Given a hyponym-tree, the Leacock-Chodorow measure<br />

computes the length of the shortest path between two synonym sets and scales it by<br />

the depth of the complete tree.<br />

rel_LC(s_1, s_2) = −log( sp(s_1, s_2) / (2 · D_Tree) )   (1)<br />

s_1 and s_2: the two synonym sets examined; sp(s_1, s_2): length of the shortest path between<br />

s_1 and s_2 in the hyponym-tree; D_Tree: depth of the hyponym-tree<br />

– Wu-Palmer [29]: Given a hyponym-tree, the Wu-Palmer measure utilizes the least<br />

common subsumer in order to compute the similarity between two synonym sets.<br />

The least common subsumer is the deepest vertex which is a direct or indirect hypernym<br />

of both synonym sets.<br />

rel_WP(s_1, s_2) = 2 · depth(lcs(s_1, s_2)) / (depth(s_1) + depth(s_2))   (2)<br />

depth(s): length of the shortest path from the root to vertex s; lcs(s_1, s_2): least common<br />

subsumer of s_1 and s_2<br />

– Resnik [30]: Given a hyponym-tree and frequency list, the Resnik measure utilizes<br />

the information content in order to compute the similarity between two synonym<br />

sets. As typically defined in Information Theory, the information content is the<br />

negative logarithm of the probability. Here the probability is calculated on the basis<br />

of subsumed frequencies. A subsumed frequency of a synonym set is the sum of<br />

frequencies of the set of all words which are in this synonym set, or a direct or<br />

indirect hyponym synonym set.<br />

p(s) := ( Σ_{w ∈ W(s)} freq(w) ) / TotalFreq   (3)<br />

IC(s) := −log p(s)   (4)<br />

rel_Res(s_1, s_2) = IC(lcs(s_1, s_2))   (5)<br />

freq(w): frequency of a word within a corpus; W(s): set of all words in the synonym set s and<br />

in its direct/indirect hyponym synonym sets; TotalFreq: sum of the frequencies of<br />

all words in GermaNet; IC(s): information content of the synonym set s<br />

– Jiang-Conrath [31]: Given a hyponym-tree and frequency list, the Jiang-Conrath<br />

measure computes the distance (as opposed to similarity) of two synonym sets. The<br />

information content of each synonym set is included separately in this distance<br />

value, while the information content of the least common subsumer of the two<br />

synonym sets is subtracted.<br />

dist_JC(s_1, s_2) = IC(s_1) + IC(s_2) − 2 · IC(lcs(s_1, s_2))   (6)<br />


136 Irene Cramer and Marc Finthammer<br />

– Lin [32]: Given a hyponym-tree and a frequency list, the Lin measure computes the<br />

semantic relatedness of two synonym sets. As the formula clearly shows, the same<br />

expressions are used as in Jiang-Conrath. However, the structure is different, as the<br />

expressions are divided, not subtracted.<br />

rel_Lin(s1, s2) = 2 · IC(lcs(s1, s2)) / (IC(s1) + IC(s2)) (7)<br />

– Hirst-StOnge [4]: In contrast to the four above-mentioned methods, the Hirst-<br />

StOnge measure computes the semantic relatedness on the basis of the whole GermaNet<br />

graph structure. It classifies the relations considered into 4 classes: extra<br />

strongly related, strongly related, medium strongly related, and not related. Two<br />

words are considered to be<br />

• extra strongly related if they are identical;<br />

• strongly related if they are synonyms or antonyms, or if one of the two words is part<br />

of the other one and additionally a direct relation holds between them;<br />

• medium strongly related if there is a path in GermaNet between the two which<br />

is shorter than six edges and matches the patterns defined by [2].<br />

In any other case the two words are considered to be unrelated. The relatedness<br />

values in the case of extra strong and strong relations are fixed values, whereas the<br />

medium strong relation is calculated based on the path length and the number of<br />

changes in direction.<br />

– Tree-Path (Baseline 1): Given a hyponym-tree, the simple Tree-Path measure computes<br />

the length of a shortest path between two synonym sets. Due to its simplicity,<br />

the Tree-Path measure serves as a baseline for more sophisticated similarity measures.<br />

dist Tree (s 1 , s 2 ) = sp(s 1 , s 2 ) (8)<br />

– Graph-Path (Baseline 2): Given the whole GermaNet graph structure, the simple<br />

Graph-Path measure calculates the length of a shortest path between two synonym<br />

sets in the whole graph, i.e. the path can make use of all relations available in<br />

GermaNet. Analogous to the Tree-Path measure, the Graph-Path measure gives us<br />

a very rough baseline for other relatedness measures.<br />

dist Graph (s 1 , s 2 ) = sp Graph (s 1 , s 2 ) (9)<br />

sp Graph (s 1 , s 2 ): Length of a shortest path between s 1 and s 2 in the GermaNet<br />

graph<br />
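To make the definitions above concrete, the following is a compact Python sketch of the tree- and IC-based measures and the two path baselines (Eqs. 2–9). The data structures are illustrative assumptions, not GermaNet's actual API: the hyponym-tree is a child→parent dictionary, hyponym links a parent→children dictionary, the full graph an adjacency dictionary, and frequencies plain per-synset counts.

```python
import math
from collections import deque

def path_to_root(s, parent):
    """Synset s followed by all its direct and indirect hypernyms."""
    path = [s]
    while s in parent:          # the root has no parent entry
        s = parent[s]
        path.append(s)
    return path

def depth(s, parent):
    """Depth of s in the hyponym-tree (root has depth 0)."""
    return len(path_to_root(s, parent)) - 1

def lcs(s1, s2, parent):
    """Least common subsumer: deepest vertex on both root paths."""
    ancestors2 = set(path_to_root(s2, parent))
    return next(s for s in path_to_root(s1, parent) if s in ancestors2)

def tree_path(s1, s2, parent):
    """Baseline 1 (Eq. 8): shortest-path length in the hyponym-tree."""
    c = lcs(s1, s2, parent)
    return depth(s1, parent) + depth(s2, parent) - 2 * depth(c, parent)

def wu_palmer(s1, s2, parent):
    """Wu-Palmer relatedness (Eq. 2)."""
    c = lcs(s1, s2, parent)
    return 2 * depth(c, parent) / (depth(s1, parent) + depth(s2, parent))

def ic(s, children, freq, total_freq):
    """Information content (Eqs. 3-4) from subsumed frequencies."""
    def subsumed(v):
        return freq.get(v, 0) + sum(subsumed(c) for c in children.get(v, ()))
    return -math.log(subsumed(s) / total_freq)

def resnik(s1, s2, parent, children, freq, total):
    """Resnik relatedness (Eq. 5): IC of the least common subsumer."""
    return ic(lcs(s1, s2, parent), children, freq, total)

def jiang_conrath(s1, s2, parent, children, freq, total):
    """Jiang-Conrath (Eq. 6) - a distance, not a relatedness."""
    i1, i2 = ic(s1, children, freq, total), ic(s2, children, freq, total)
    return i1 + i2 - 2 * resnik(s1, s2, parent, children, freq, total)

def lin(s1, s2, parent, children, freq, total):
    """Lin relatedness (Eq. 7): same terms as Jiang-Conrath, divided."""
    i1, i2 = ic(s1, children, freq, total), ic(s2, children, freq, total)
    return 2 * resnik(s1, s2, parent, children, freq, total) / (i1 + i2)

def graph_path(s1, s2, graph):
    """Baseline 2 (Eq. 9): BFS shortest path over all relations."""
    seen, queue = {s1}, deque([(s1, 0)])
    while queue:
        node, d = queue.popleft()
        if node == s2:
            return d
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None     # no path in the graph
```

The sketch deliberately omits Hirst-StOnge, whose path patterns and fixed scores are not fully specified in this summary.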

Differences and Challenges: Most of the measures described in this section are<br />

completely based on the hyponym-tree. Therefore, many potentially useful edges of the<br />

word net graph structure are not considered, which affects the holonymy (in GermaNet<br />

approximately 3,800 edges), meronymy (in GermaNet approximately 900 edges) and<br />

antonymy 7 (in GermaNet approximately 1,300 edges) relations. Some of the measures<br />

7 Because antonyms are mostly organized as co-hyponyms, they are – in fact – not completely<br />

discarded in the hyponym-tree-based approaches.<br />



An Evaluation Procedure for Word Net Based Lexical Chaining... 137<br />

additionally use the least common subsumer. Word pairs featuring potentially different<br />

levels of relation are thus subsumed 8 . One could also question whether this is the only relevant<br />

information to be found in the hyponym-tree for a word pair. Interesting features such<br />

as network density or node depth are not included. Moreover, several measures rely<br />

on the concept of information content, for which a frequency list is required. Thus, the<br />

performance of experiments utilizing different lists as a basis is not directly comparable.<br />

Especially for lexical chaining, unsystematic relations are considered to be relevant,<br />

see e.g. [21] and [14]. However, these are not in GermaNet and consequently cannot<br />

be considered in any of the measures mentioned above. We therefore expect them to<br />

produce many false negatives, i.e. low relation values for word pairs which are judged<br />

by humans to be (strongly) related.<br />

Interpretation of relatedness measure values: Most of the relatedness measures<br />

mentioned in Sect. 6.1 are continuous, with the exception of Hirst-StOnge, Tree-Path<br />

and Graph-Path which are all discrete. All of the measures range in a specific interval<br />

between 0 (not related) and a maximum value, mostly 1. In any case, for each measure<br />

the interval could be normalized into a value ranging between 0 and 1. For the three<br />

distance measures, Jiang-Conrath, Tree-Path and Graph-Path, a concrete distance value<br />

can be converted into its corresponding relatedness value by subtracting it from the theoretical<br />

maximum distance. If we plotted the empirically determined relatedness<br />

values 9 against ideal relatedness measure values, we would get exemplary distribution<br />

functions as shown in Fig. 6a. For a specific empirically determined value, e.g. 0.5, we<br />

then obtained different values for the various measures considered, e.g. 0.27 for measure<br />

A and 0.94 for measure B. Thus, the values of a specific relatedness measure A<br />

range between 1 and approximately 0.94 for an empirically determined interval of relation<br />

strengths (e.g. the word pair is strongly related) whereas a relatedness measure<br />

B exhibits values between 1 and 0.27 for the same relations. In order to profitably use<br />

this information in our chaining system, we need to interpret the values and thus find<br />

intervals mapping between e.g. classes of relation strength and measure values 10 . In any<br />

case, the distribution functions will be noisy, as shown in Fig. 6b – at best indicating<br />

a trend function. However, as Figures 7a–c, 8a–c and 9a–b illustrate, the real values of<br />

our eight measures plotted against the empirically determined relatedness values do not<br />

display any kind of obvious trend function.<br />
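The distance-to-relatedness conversion described above can be sketched in one line; the theoretical maximum distance is assumed to be known for the measure at hand:

```python
def distance_to_relatedness(dist, max_dist):
    """Convert a distance into a relatedness score in [0, 1]:
    subtract from the theoretical maximum, then normalize by it."""
    return (max_dist - dist) / max_dist
```

Applied to Jiang-Conrath, Tree-Path or Graph-Path, this yields values directly comparable to the relatedness measures.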

8 Given a pair of words w A and w B and their least common subsumer LCS AB, all pairs of a<br />

descendant of w A and a descendant of w B have LCS AB as their least common subsumer.<br />

9 These are the values deduced from our human judgment experiment mentioned in Sect. 3.3.<br />

10 Note that we need to discriminate between the distribution functions (considering empirically<br />

determined values and measure values, as exemplarily shown in Fig. 6) and the relatedness<br />

functions (as mentioned in Sect. 6.1). Although the two are equal with regards to their output<br />

(concrete measure values), they differ with respect to their input dimension and type.



[Figure: measure value plotted against "real" relatedness (both axes 0.00–1.00); curves for a linear measure and for measures A and B, without (a) and with (b) noise]<br />

Fig. 6. Idealized (a) and noisy distribution (b) of semantic relatedness values



[Panels (a)–(c): each measure plotted against human judgment; y-axis: relatedness (0.00–1.00), x-axis: word pairs ordered by relatedness value]<br />

Fig. 7. Leacock-Chodorow (a), Wu-Palmer (b) and Resnik (c) each plotted against human judgment



[Panels (a)–(c): each measure plotted against human judgment; y-axis: relatedness (0.00–1.00), x-axis: word pairs ordered by relatedness value]<br />

Fig. 8. Jiang-Conrath (a), Lin (b) and Hirst-StOnge (c) each plotted against human judgment



[Panels (a)–(b): each measure plotted against human judgment; y-axis: relatedness (0.00–1.00), x-axis: word pairs ordered by relatedness value]<br />

Fig. 9. Tree-Path (a) and Graph-Path (b) each plotted against human judgment



6.2 Comparison of Human Judgment and GermaNet-based Measures<br />

Figures 7a–c, 8a–c and 9a–b show values of the various measures for all word pairs of<br />

our human judgment experiment described in Sect. 3.3. Although the inter-annotator<br />

agreement in the human judgment experiment is relatively high (correlation: 0.76 +/-<br />

0.04) 11 , the correlation between the various measures and the human judgment is relatively<br />

low (see Table 7). In addition, the trend functions potentially underlying the (very<br />

noisy) graphs in Figures 7a–c, 8a–c and 9a–b are not obvious at all.<br />

Table 7. Correlation coefficients: human judgment vs. relatedness measures<br />

Graph-Path Tree-Path Wu-Palmer Leacock-Chodorow<br />

correl. coeff. 0.41 0.42 0.36 0.48<br />

Hirst-StOnge Resnik Jiang-Conrath Lin<br />

correl. coeff. 0.47 0.44 0.45 0.48<br />
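Coefficients of this kind can be computed as a plain Pearson correlation between the human scores and a measure's outputs over the same word pairs. A dependency-free sketch (the two lists are assumed to be aligned per word pair):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two aligned score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```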

In order to use one of these measures or a combination of them in GLexi, we<br />

need to determine the best measure(s) and, because a lexical chainer mostly works with<br />

classes of relatedness, a function that maps these values into discrete intervals of relatedness.<br />

We question whether a relatedness measure used in a lexical chainer has to be<br />

continuous; a continuous value can misleadingly suggest an unrealistic degree<br />

of accuracy. Instead, a measure mapping from a list of features, such as relation type,<br />

network density or node depth etc., into three classes, such as not related, related and<br />

strongly related might be more adequate. The class distribution in our human judgment<br />

experiment shown in Fig. 10 confirms this idea. Because of the relatively low correlation<br />

between the measure values and the human judgment, the extreme noise in the<br />

distribution functions shown in Figures 7a–c, 8a–c and 9a–b, and the fact that interesting<br />

features of GermaNet are not yet considered in the calculation of the relatedness<br />

values, we assume that none of the measures presented in this paper is in fact appropriate<br />

for lexical chaining in German. In our future work we plan to integrate these findings<br />

into a Machine Learning based mapping between GermaNet-based features (and word<br />

counts, co-occurrence) and discrete classes of relatedness.<br />

7 Evaluation Phase IV – Application-oriented Evaluation<br />

The constraints imposed on our lexical chainer by the application scenario, i.e. the extraction<br />

of topic anchors and the topic chaining itself, are as follows: Firstly, we intend<br />

to utilize the structure and information about a specific text encoded in the lexical<br />

11 The inter-annotator agreement in our study is slightly lower than those reported in the literature<br />

for English because we considered systematically and unsystematically related word pairs as<br />

well as abstract and tricky nouns.



[Bar chart: number of judgments (0–1200) per relatedness level, from Level 0 (= no relation) to Level 4 (= strong relation); bar labels: 28%, 28%, 15%, 19%, 10%]<br />

Fig. 10. Distribution of human judgment<br />

chains as input features for the extraction of topic anchors. Especially, the length of a<br />

chain, the density and strength of its internal linking structure should be of great importance.<br />

Admittedly, additional chaining of independent features could be necessary<br />

to ultimately determine the topic anchors of a text passage. Secondly, we plan to use<br />

the same algorithms and resources for the construction of both lexical and topic chains.<br />

Merely the chaining candidates, i.e. all noun tokens for lexical chaining and exclusively<br />

topic anchors for topic chaining, account for the difference between the two types of<br />

chaining. However, we assume that for both chaining types a net structure could be superior<br />

to linearly organized chains. This kind of structure for a passage of a newspaper<br />

article, which we computed on the basis of our lexical chainer, is shown in Fig. 11.<br />

The article covers child poverty in German society; accordingly, the essential concepts<br />

are Kind (Engl. child), Geld (Engl. money), Deutschland (Engl. Germany), and<br />

Staat (Engl. state). On the basis of, among other things, edge density and frequency,<br />

we calculated the most relevant words (especially, Kind, Geld, Deutschland, and<br />

Staat), which we then accordingly highlighted in the graph shown in Fig. 11. Finally,<br />

the parameter settings, which we found to be reasonable on the basis of the evaluation<br />

phases I–III, need to be integrated with the constraints imposed on our lexical chainer<br />

by our application in our future work.



Fig. 11. Input for topic chaining: net structure-based lexical chaining example<br />

8 Conclusions and Future Work<br />

We explored the various components and aspects of lexical chaining for German corpora<br />

of technical and academic documents. We presented a detailed evaluation procedure<br />

and discussed the performance of our chaining system with respect to these aspects.<br />

We could show that preprocessing plays a major role due to the complex morphology<br />

in German and furthermore that technical terminology and proper names are<br />

of great importance. Additionally, we discussed the performance of a simple chaining-based<br />

word sense disambiguation and outlined a method to enhance this aspect. We also<br />

presented a human judgment experiment which was conducted in order to evaluate the<br />

various semantic relatedness measures for GermaNet. We were able to show that it is<br />

thus far very difficult to determine the function mapping between the measure values<br />

and relatedness classes.<br />

We now plan to continue this work on four levels: Firstly, we hope to further improve<br />

the preprocessing; i.e. we plan to enhance the compound analysis and the basic NER<br />

system. In addition, we intend to integrate components for the handling of abbreviations<br />

and technical terminology. Secondly, we aim to develop a sophisticated chaining-based<br />

disambiguation methodology which incorporates the idea of meta-chains and other potentially<br />

useful features. Thirdly, we plan to investigate alternative relatedness measures,<br />

especially Machine Learning based approaches, which map between sets of features<br />

and discrete classes of relatedness. Finally, we intend to further explore our lexical



chainer with respect to topic chaining and thus to evaluate our chainer in an application<br />

oriented manner.<br />

References<br />

1. Halliday, M.A.K., Hasan, R.: Cohesion in English. Longman, London (1976)<br />

2. Morris, J., Hirst, G.: Lexical cohesion computed by thesaural relations as an indicator of the<br />

structure of text. Computational linguistics 17(1) (1991)<br />

3. Fellbaum, C., ed.: WordNet. An Electronic Lexical Database. The MIT Press (1998)<br />

4. Hirst, G., St-Onge, D.: Lexical chains as representations of context for the detection and<br />

correction of malapropisms. In Fellbaum, C., ed.: WordNet: An electronic lexical database.<br />

(1998)<br />

5. Green, S.J.: Building hypertext links by computing semantic similarity. IEEE Transactions<br />

on Knowledge and Data Engineering 11(5) (1999)<br />

6. Teich, E., Fankhauser, P.: Wordnet for lexical cohesion analysis. In: Proc. of the 2nd Global<br />

WordNet Conference (<strong>GWC</strong>2004). (2004)<br />

7. Mehler, A.: Lexical chaining as a source of text chaining. In: Proc. of the 1st Computational<br />

Systemic Functional Grammar Conference, Sydney. (2005)<br />

8. Barzilay, R., Elhadad, M.: Using lexical chains for text summarization. In: Proc. of the<br />

Intelligent Scalable Text Summarization Workshop (ISTS’97). (1997)<br />

9. Silber, G.H., McCoy, K.F.: Efficiently computed lexical chains as an intermediate representation<br />

for automatic text summarization. Computational Linguistics 28(4) (2002)<br />

10. Novischi, A., Moldovan, D.: Question answering with lexical chains propagating verb arguments.<br />

In: Proc. of the 21st International Conference on Computational Linguistics and 44th<br />

Annual Meeting of the Association for Computational Linguistics. (2006)<br />

11. Carthy, J.: Lexical chains versus keywords for topic tracking. In: Computational Linguistics<br />

and Intelligent Text Processing. Lecture Notes in Computer Science. Springer (2004)<br />

12. Stührenberg, M., Goecke, D., Diewald, N., Mehler, A., Cramer, I.: Web-based annotation<br />

of anaphoric relations and lexical chains. In: Proc. of the Linguistic Annotation Workshop,<br />

ACL 2007. (2007)<br />

13. Morris, J., Hirst, G.: Non-classical lexical semantic relations. In: Proc. of HLT-NAACL<br />

Workshop on Computational Lexical Semantics. (2004)<br />

14. Morris, J., Hirst, G.: The subjectivity of lexical cohesion in text. In Chanahan, J.C., Qu, C.,<br />

Wiebe, J., eds.: Computing attitude and affect in text. Springer (2005)<br />

15. Beigman Klebanov, B.: Using readers to identify lexical cohesive structures in texts. In:<br />

Proc. of ACL Student Research Workshop (ACL2005). (2005)<br />

16. Lemnitzer, L., Kunze, C.: Germanet - representation, visualization, application. In: Proc. of<br />

the Language Resources and Evaluation Conference (LREC2002). (2002)<br />

17. Lemnitzer, L., Kunze, C.: Adapting germanet for the web. In: Proc. of the 1st Global Wordnet<br />

Conference (<strong>GWC</strong>2002). (2002)<br />

18. Beißwenger, M., Wellinghoff, S.: Inhalt und Zusammensetzung des Fachtextkorpus. Technical<br />

report, University of Dortmund, Germany (2006)<br />

19. Lenz, E.A., Lüngen, H.: Annotationsschicht: Logische Dokumentstruktur. Technical report,<br />

University of Dortmund, Germany (2004)<br />

20. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of<br />

the ACM 8(10) (1965)



21. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similiarity. Language and<br />

Cognitive Processes 6(1) (1991)<br />

22. Gurevych, I.: Using the structure of a conceptual network in computing semantic relatedness.<br />

In: Proc. of the IJCNLP 2005. (2005)<br />

23. Gurevych, I., Niederlich, H.: Computing semantic relatedness in german with revised information<br />

content metrics. In: Proc. of OntoLex 2005 - Ontologies and Lexical Resources,<br />

IJCNLP 05 Workshop. (2005)<br />

24. Zesch, T., Gurevych, I.: Automatically creating datasets for measures of semantic relatedness.<br />

In: Proc. of the Workshop on Linguistic Distances (ACL 2006). (2006)<br />

25. Beißwenger, M., Storrer, A., Runte, M.: Modellierung eines Terminologienetzes für das automatische<br />

Linking auf der Grundlage von WordNet. In: LDV-Forum, 19 (1/2) (Special issue<br />

on GermaNet applications, edited by Claudia Kunze, Lothar Lemnitzer, Andreas Wagner).<br />

(2003)<br />

26. Kunze, C., Lemnitzer, L., Lüngen, H., Storrer, A.: Towards an integrated owl model for<br />

domain-specific and general language wordnets. (in this volume)<br />

27. Budanitsky, A., Hirst, G.: Semantic distance in wordnet: An experimental, application-oriented<br />

evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources<br />

at NAACL-2000. (2001)<br />

28. Leacock, C., Chodorow, M.: Combining local context and wordnet similarity for word sense<br />

identification. In Fellbaum, C., ed.: WordNet: An electronic lexical database. (1998)<br />

29. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proc. of the 32nd Annual<br />

Meeting of the Association for Computational Linguistics. (1994)<br />

30. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In:<br />

Proc. of the IJCAI 1995. (1995)<br />

31. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy.<br />

Proc. of the International Conference on Research in Computational Linguisics (1997)<br />

32. Lin, D.: An information-theoretic definition of similarity. In: Proc. of the 15th International<br />

Conference on Machine Learning. (1998)


On the Utility of<br />

Automatically Generated WordNets<br />

Gerard de Melo and Gerhard Weikum<br />

Max Planck Institute for Informatics<br />

Campus E1 4<br />

66123 Saarbrücken, Germany<br />

{demelo,weikum}@mpi-inf.mpg.de<br />

Abstract. Lexical resources modelled after the original Princeton WordNet<br />

are being compiled for a considerable number of languages; however,<br />

most have yet to reach a comparable level of coverage. In this paper,<br />

we show that automatically built WordNets, created from an existing<br />

WordNet in conjunction with translation dictionaries, are a suitable alternative<br />

for many applications, despite the errors introduced by the automatic<br />

building procedure. Apart from analysing the resources directly,<br />

we conducted tests on semantic relatedness assessment and cross-lingual<br />

text classification with very promising results.<br />

1 Introduction<br />

One of the main requirements for domain-independent lexical knowledge bases,<br />

apart from an appropriate data model, is a satisfactory level of coverage. WordNet<br />

is the most well-known and most widely used lexical database for English<br />

natural language processing, and is the fruit of over 20 years of manual work<br />

carried out at Princeton University [1]. The original WordNet has inspired the<br />

creation of a considerable number of similarly-structured resources for other<br />

languages (“WordNets”); however, compared to the original, many of these still<br />

exhibit a rather low level of coverage due to the laborious compilation process. In<br />

this paper, we argue that, depending on the particular task being pursued, one<br />

can instead often rely on machine-generated WordNets, created with translation<br />

dictionaries from an existing WordNet such as the original WordNet.<br />

The remainder of this paper is laid out as follows. In Section 2 we provide an<br />

overview of strategies for building WordNets automatically, focusing in particular<br />

on a recent machine learning approach. Section 3 then evaluates the quality<br />

of a German WordNet built using this technique, examining the accuracy, coverage,<br />

as well as the general appropriateness of automatic approaches. This is<br />

followed by further investigations motivated by more pragmatic considerations.<br />

After considering human consultation in Section 4, we proceed to look more<br />

closely at possible computational applications, discussing our results in monolingual<br />

tasks such as semantic relatedness estimation in Section 5, and multilingual


148 Gerard de Melo and Gerhard Weikum<br />

ones such as cross-lingual text classification in Section 6. We conclude with final<br />

remarks and an exploration of future research directions in Section 7.<br />

2 Building WordNets<br />

In this section, we summarize some of the possible techniques for automatically<br />

creating WordNets fully aligned to an existing WordNet. We do not consider<br />

the so-called merge model, which normally requires some pre-existing WordNet-like<br />

thesaurus for the new language, and instead focus on the expand model,<br />

which mainly relies on translations [2]. The general approach is as follows: (1)<br />

Take an existing WordNet for some language L 0 , usually Princeton WordNet<br />

for English; (2) for each sense s listed by the WordNet, translate all the terms<br />

associated with s from L 0 to a new language L N using a translation dictionary;<br />

(3) additionally retain all appropriate semantic relations between senses in order<br />

to arrive at a new WordNet for L N .<br />
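These three steps can be sketched as a naive baseline that deliberately skips the crucial sense-filtering step; all names and data structures here are illustrative assumptions (senses map to term lists, the dictionary maps a source term to its translations):

```python
def expand_wordnet(source_senses, dictionary):
    """Naive expand-model baseline: attach every dictionary translation of a
    sense's terms to that sense. The sense inventory (and hence its semantic
    relations) carries over unchanged; no filtering of inappropriate
    translations is performed here."""
    new_wordnet = {}
    for sense, terms in source_senses.items():
        translated = sorted({t for term in terms
                             for t in dictionary.get(term, ())})
        if translated:
            new_wordnet[sense] = translated
    return new_wordnet
```

The bank/Bank example below shows exactly why this baseline is too permissive: every sense of "bank" inherits the German "Bank", including the riverbank sense.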

The main challenge lies in determining which translations are appropriate for<br />

which senses. A dictionary translating an L 0 -term e to an L N -term t does not<br />

imply that t applies to all senses of e. For example, considering the translation of<br />

English “bank” to German “Bank”, we can observe that the English term can also<br />

be used for riverbanks, while the German “Bank” cannot (and likewise, German<br />

“Bank” can also refer to a park bench, which does not hold for the English term).<br />

In order to address these problems, several different heuristics have been proposed.<br />

Okumura and Hovy [3] linked a Japanese lexicon to an ontology based on<br />

WordNet synsets. They considered four different strategies: (1) simple heuristics<br />

based on how polysemous the terms are with respect to the number of translations<br />

and with respect to the number of WordNet synsets (2) checking whether<br />

one ontology concept is linked to all of the English translations of the Japanese<br />

term (3) compatibility of verb argument structure (4) degree of overlap between<br />

terms in English example sentences and translated Japanese example sentences.<br />

Another important line of research starting with Rigau and Agirre [4], and<br />

extended by Atserias et al. [5] resulted in automatic techniques for creating preliminary<br />

versions of the Spanish WordNet and later also the Catalan WordNet<br />

[6]. Several heuristic decision criteria were used in order to identify suitable translations,<br />

e.g. monosemy/polysemy heuristics, checking for senses with multiple<br />

terms having the same L N -translation, as well as heuristics based on conceptual<br />

distance measures. Later, these were combined with additional Hungarian-specific<br />

heuristics to create a Hungarian nominal WordNet [7].<br />

Pianta et al. [8] used similar ideas to produce a ranking of candidate synsets.<br />

In their work, the ranking was not used to automatically generate a WordNet<br />

but merely as an aid to human lexicographers that allowed them to work at<br />

faster pace. This approach was later also adopted for the Hebrew WordNet [9].<br />

A more advanced approach that requires only minimal human work lies in<br />

using machine learning algorithms to identify more subtle decision rules that can


On the Utility of Automatically Generated WordNets 149<br />

rely on a number of different heuristic scores with different thresholds. We will<br />

briefly summarize our approach [10]. A classifier f is trained on labelled examples<br />

(x i ,y i ) for pairs (t i ,s i ), where t i is an L N -term and s i is a candidate sense for t i .<br />

Each labelled instance consists of a real-valued feature vector x i , and an indicator<br />

y i ∈ Y = {0,1}, where 1 denotes a positive example, which implies that linking<br />

t i with sense s i is appropriate, and 0 characterizes negative examples. Based<br />

on these training examples, f classifies new unseen test instances by computing<br />

a confidence value y ∈ [0,1] that indicates to what degree an association is<br />

predicted to be correct. One may then obtain a confidence value y t,s for each<br />

possible pair (t,s) where t is a L N -term translated to an L 0 -term that is in turn<br />

linked to a sense s. These values can be used to create the new WordNet by<br />

either maintaining all y t,s as weights in order to create a weighted WordNet, or<br />

alternatively one can use confidence thresholds to obtain a regular unweighted<br />

WordNet. For the latter case, we use two thresholds α 1 , α 2 , and accept a pair<br />

(t,s) if y t,s ≥ α 1 , or alternatively if α 1 > y t,s ≥ α 2 and y t,s > y t,s ′ for all s ′ ≠ s.<br />
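The two-threshold acceptance rule can be sketched as follows; the confidence values are assumed to be given as a dictionary from (term, sense) pairs to y:

```python
def accept_pairs(conf, alpha1=0.5, alpha2=0.45):
    """Keep (t, s) if y >= alpha1, or if alpha2 <= y < alpha1 and s is
    the uniquely best-scoring candidate sense for t."""
    accepted = []
    for (t, s), y in conf.items():
        if y >= alpha1:
            accepted.append((t, s))
        elif y >= alpha2:
            # accept only if s strictly beats every rival sense of t
            rivals = [y2 for (t2, s2), y2 in conf.items()
                      if t2 == t and s2 != s]
            if all(y > y2 for y2 in rivals):
                accepted.append((t, s))
    return accepted
```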

The feature vectors x i are created by computing a variety of scores based<br />

on statistical properties of the (t,s) pair as feature values. We mainly rely on a<br />

multitude of semantic overlap scores reflecting the idea that senses with a high<br />

semantic proximity to other candidate senses are more likely to be appropriate, as<br />

well as polysemy scores that reflect the idea that a sense becomes more important<br />

when there are few relevant alternative senses. The former are computed as<br />

∑_{e∈φ(t)} max_{s′∈σ(e)} γ(t, s′) · rel(s, s′) (1)<br />

while for the latter we use<br />

∑_{e∈φ(t)} 1_{σ(e)}(s) / ( 1 + ∑_{s′∈σ(e)} γ(t, s′) · (1 − rel(s, s′)) ) . (2)<br />

In these formulae, φ(t) yields the set of translations of t, σ(e) yields the set of<br />

senses of e, γ(t,s) is a weighting function, and rel(s,s ′ ) is a semantic relatedness<br />

function between senses. The characteristic function 1 σ(e) (s) yields 1 if s ∈ σ(e)<br />

and 0 otherwise. We use a number of different weighting functions γ(t,s) that<br />

take into account lexical category compatibility, corpus frequency information,<br />

etc., as well as multiple relatedness functions rel(s,s ′ ) based on gloss similarity<br />

and graph distance (cf. Section 5.2).<br />
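The two feature families can be sketched directly from these definitions; φ, σ, γ and rel are passed in as callables, and all concrete instantiations (weighting schemes, relatedness functions) are left open, as in the paper:

```python
def overlap_score(t, s, phi, sigma, gamma, rel):
    """Semantic overlap (Eq. 1): for each translation e of t, take the best
    weighted relatedness between s and any sense of e, summed over e."""
    return sum(max((gamma(t, s2) * rel(s, s2) for s2 in sigma(e)),
                   default=0.0)
               for e in phi(t))

def polysemy_score(t, s, phi, sigma, gamma, rel):
    """Polysemy (Eq. 2): s gains weight when the competing senses of e
    are few or weakly weighted."""
    return sum((1.0 if s in sigma(e) else 0.0) /
               (1 + sum(gamma(t, s2) * (1 - rel(s, s2))
                        for s2 in sigma(e)))
               for e in phi(t))
```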

This approach has several advantages over previous proposals: (1) Apart from<br />

the translation dictionary, it does not rely on additional resources such as monolingual<br />

dictionaries with field descriptors, verb argument structure information,<br />

and the like for the target language L N , and thus can be used in many settings,<br />

(2) the learning algorithm can exploit real-valued heuristic scores rather than<br />

just predetermined binary decision criteria, leading to a greater coverage, (3) the<br />

algorithm can take into account complex dependencies between multiple scores<br />

rather than just single heuristics or combinations of two heuristics.



3 Analysis of a Machine-Generated WordNet<br />

In the remainder of this paper, we will focus on a German-language WordNet<br />

produced using the machine learning technique described above, as it is the most<br />

advanced approach. The WordNet was generated from Princeton WordNet 3.0<br />

and the Ding translation dictionary [11] using a linear kernel support vector machine<br />

[12] with posterior probability estimation as implemented in LIBSVM [13].<br />

The training set consisted of 1834 candidate mappings for 350 randomly selected<br />

German terms that were manually classified as correct (22%) or incorrect. The<br />

values α 1 = 0.5 and α 2 = 0.45 were chosen as classification thresholds.<br />

3.1 Accuracy and Coverage<br />

In order to evaluate the quality of the WordNet generated in this manner, we<br />

considered a test set of term-sense mappings for 350 further randomly selected<br />

terms. We then determined whether the resulting 1624 mappings, which had not<br />

been involved in the WordNet building process, corresponded with the entries<br />

of our new WordNet. Table 1 summarizes the results, showing the precision and<br />

recall with respect to this test set.<br />

Table 1. Evaluation of precision and recall on an independent test set<br />

            precision  recall<br />
nouns       79.87      69.40<br />
verbs       91.43      57.14<br />
adjectives  78.46      62.96<br />
adverbs     81.81      60.00<br />
overall     81.11      65.37<br />
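The precision and recall figures reported here follow the usual set-based definitions over term-sense mappings; a small sketch (the example mappings below are made up for illustration, not taken from the test set):

```python
# Set-based precision and recall over term-sense mappings. The example
# mappings below are invented for illustration, not taken from the test set.

def precision_recall(predicted, gold):
    """predicted: mappings emitted by the system; gold: mappings judged correct."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

predicted = {("Haus", "house.n.01"), ("Bank", "bank.n.02"), ("Bank", "bank.n.09")}
gold = {("Haus", "house.n.01"), ("Bank", "bank.n.09"), ("See", "lake.n.01")}

p, r = precision_recall(predicted, gold)
print(f"precision={100*p:.2f} recall={100*r:.2f}")  # precision=66.67 recall=66.67
```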

The results demonstrate that indeed a surprisingly high level of precision<br />

and recall can be obtained with fully automated techniques, considering the<br />

difficulty of the task. While the precision might not fulfil the high lexicographical<br />

standards adopted by traditional dictionary publishers, we shall later see that it<br />

suffices for many practical applications. Furthermore, one may of course obtain<br />

a higher level of precision at the expense of a lower recall by adjusting the<br />

acceptance thresholds. For very high recall levels, an increased precision might<br />

not be realistic even using purely manual work, considering that Miháltz and<br />

Prószéky [7] report an inter-annotator agreement of 84.73% for such mappings.<br />

Table 2 shows that applying the classification thresholds to all terms in the<br />

dictionary leads to a WordNet with a considerable coverage. While smaller than<br />

GermaNet 5.0 [14], one of the largest WordNets, it covers more senses than any<br />

of the original eight WordNets delivered by the EuroWordNet project [2]. Table 3


On the Utility of Automatically Generated WordNets 151<br />

Table 2. Quantitative Assessment of Coverage of the German WordNet<br />

            sense mappings  terms  lexicalized synsets<br />
nouns       53146           35089  28007<br />
verbs       13875           5908   6304<br />
adjectives  21799           13772  9949<br />
adverbs     4243            2992   2593<br />
total       93063           55522  46853<br />

gives an overview of the polysemy of the terms as covered by our WordNet, with<br />

arithmetic means computed from the polysemy either of all terms, or exclusively<br />

from terms polysemous with respect to the WordNet.<br />

Table 3. Polysemy of Terms and Mean Number of Lexicalizations (excluding unlexicalized senses)<br />

            mean term polysemy  mean term polysemy        mean no. of sense<br />
                                (excl. monosemous terms)  lexicalizations<br />
nouns       1.51                2.95                      1.90<br />
verbs       2.35                4.36                      2.20<br />
adjectives  1.58                2.79                      2.19<br />
adverbs     1.42                2.52                      1.64<br />
total       1.68                3.07                      1.99<br />

A more qualitative assessment of the accuracy and coverage revealed the<br />

following issues:<br />

– Non-Uniformity of Coverage: While even many specialized terms are included<br />

(e.g. “Kokarde”, “Vasokonstriktion”), certain very common terms were found<br />

to be missing (e.g. “Kofferraum”, “Schloss”). This seems to arise from the<br />

fact that common terms tend to be more polysemous, though frequently such<br />

terms also have multiple translations, which tends to facilitate the mapping<br />

process. One solution would be manually adding mappings for terms with<br />

high corpus frequency values, which due to Zipf’s law would quickly improve<br />

the relative coverage of the terms in ordinary texts.<br />

– Lexical Gaps and Incongruencies: Another issue is the lack of terms for which<br />

there are no lexicalized translations in the English language, or which are not<br />

covered by the source WordNet, e.g. the German word “Feierabend” means<br />

the finishing time of the daily working hours. The solution could consist



in smartly adding new senses to the sense hierarchy based on paraphrasing<br />

translations (e.g. as a hyponym of “time” for our current example).<br />

– Multi-word expressions in L N : Certain multi-word translations in L N might<br />

be considered inappropriate for inclusion in a lexical resource, e.g. the Ding<br />

dictionary lists “Jahr zwischen Schule und Universität” as a translation of<br />

“gap year”. By generally excluding all multi-word expressions one would<br />

also likely drop a lot of lexicalized expressions, e.g. German “runde Klammer”<br />

(parenthesis). A much better solution is to automatically mark all multi-word<br />

expressions as possibly unlexicalized whenever no matching entry is<br />

found in monolingual dictionaries.<br />

3.2 Relational Coverage<br />

By producing mappings to senses of an existing source WordNet, we have the<br />

great advantage of immediately being able to import relations between those<br />

synsets. An excerpt of some of the relations we imported is given in Table 4.<br />

Table 4. An excerpt of some of the imported relations. We distinguish full links between<br />

two senses both with L_N lexicalizations, and outgoing links from senses with an L_N<br />

lexicalization.<br />

relation          full links  outgoing<br />
hyponymy          26324       60062<br />
hypernymy         26324       33725<br />
similarity        10186       14785<br />
has category      2131        2241<br />
category of       2131        6135<br />
has instance      641         5936<br />
instance of       641         1131<br />
part meronymy     2471        6029<br />
part holonymy     2471        3408<br />
member meronymy   400         734<br />
member holonymy   400         1517<br />
subst. meronymy   190         325<br />
subst. holonymy   190         414<br />

Lexical relations between particular terms cannot, in general, be transferred<br />

automatically, e.g. a region domain for a term in one language, signifying in what<br />

geographical region the term is used, will not apply to a second language. However,<br />

certain lexical relations such as the derivation relation still provide valuable<br />

information when interpreted as a general indicator of semantic relatedness, as<br />

can be seen in Table 5, which shows the results of a human evaluation for several<br />

different relation types. Incorrect relations are almost entirely due to incorrect<br />

term-sense mappings.



Table 5. Quality assessment for imported relations: for each relation type, 100 randomly<br />

selected links between two senses with L_N lexicalizations were evaluated.<br />

relation                             accuracy<br />
hyponymy, hypernymy                  84%<br />
similarity                           90%<br />
category                             91%<br />
instance                             93%<br />
part meronymy, holonymy              83%<br />
member meronymy, holonymy            89%<br />
subst. meronymy, holonymy            83%<br />
antonymy (as sense opposition)       95%<br />
derivation (as semantic similarity)  96%<br />

3.3 Structural Adequacy<br />

As mentioned earlier, our machine learning approach is very parsimonious with<br />

respect to L N -specific prerequisites, and hence scales well to new languages.<br />

Some lexicographers contend that using one WordNet as the structural basis<br />

for another WordNet does not do justice to the structure of the new language’s<br />

lexicon.<br />

The most significant issue is certainly that the source WordNet may lack<br />

senses for certain terms in the new language, as in the case of the German<br />

“Feierabend”. This point has already been addressed in Section 3.1.<br />

Apart from this, it seems that general structural differences between languages<br />

rarely cause problems. When new WordNets are built independently from<br />

existing WordNets, many of the structural differences will not be due to actual<br />

conceptual differences between languages, but rather result from subjective decisions<br />

made by the individual human modellers [8].<br />

Some of the rare examples of cultural differences affecting relations between<br />

two senses include perhaps the question of whether the local term for “guinea<br />

pig” should count as a hyponym of the respective term for “pet”. For such cases,<br />

our suggestion is to manually add relation attributes that describe the idea of<br />

a connection being language-specific, culturally biased, or based on a specific<br />

taxonomy rather than holding unconditionally.<br />

A more general issue is the adequacy of the four lexical categories (parts of<br />

speech) considered by Princeton WordNet. Fortunately, most of the differences<br />

between languages in this respect either concern functional words, or occur at<br />

very fine levels of distinctions, e.g. genus distinctions for German nouns, and thus<br />

are conventionally considered irrelevant to WordNets, though such information<br />

could be derived from monolingual dictionaries and added to the WordNet.



4 Human Consultation<br />

One major disadvantage of automatically built WordNets is the lack of native-language<br />

glosses and example sentences, although this problem is not unique to<br />

automatically-built WordNets. Because of the great effort involved in compiling<br />

such information, manually built WordNets such as GermaNet also lack glosses<br />

and example sentences for the overwhelming majority of the senses listed. In<br />

this respect, automatically produced aligned WordNets have the advantage of at<br />

least making English-language glosses accessible.<br />

Another significant issue is the quality of the mappings. As people are more<br />

familiar with high-quality print dictionaries, they do not expect to encounter<br />

incorrect entries when consulting a WordNet-like resource.<br />

In contrast, we found that machine-generated WordNets can instead be used<br />

to provide machine-generated thesauri, where users expect to find more generally<br />

related terms rather than precise synonyms and gloss descriptions. In order to<br />

generate such a thesaurus, we relied on a simple technique that looks up all<br />

senses of a term as well as certain related senses, and then forms the union of<br />

all lexicalizations of these senses (Algorithm 4.1 with n h = 2, n o = 2, n g = 1).<br />

Table 6 provides a sample entry from the German thesaurus resulting from our<br />

WordNet, and demonstrates that such resources can indeed be used for example<br />

as built-in thesauri in word processing applications.<br />

Algorithm 4.1 Thesaurus Generation<br />

Input: a WordNet instance W (with function σ for retrieving senses and σ −1 for retrieving<br />

the set of all terms for a sense), number of hypernym levels n h , number of hyponym levels<br />

n o, number of levels for other general relations n g, set of acceptable general relations R<br />

Objective: generate a thesaurus that lists related terms for any given term<br />

1: procedure GenerateThesaurus(W, R)<br />

2: for each term t from W do ⊲ for every term t listed in the WordNet<br />

3: T ← ∅ ⊲ the list of related terms for t<br />

4: for each sense s ∈ σ(t) do ⊲ for each sense of t<br />

5: for each sense s ′ ∈ Related(W, s, n h , n o, n g, R) do<br />

6: T ← T ∪ σ −1 (s ′ ) ⊲ add lexicalizations of s ′ to T<br />

7: output T as list of related terms for t<br />

8: function Related(W, s, n h , n o, n g, R)<br />

9: S ← {s}<br />

10: for each sense s ′ related to s with respect to W do ⊲ recursively visit related senses<br />

11: if (s ′ hypernym of s) ∧ (n h > 0) then<br />

12: S ← S ∪ Related(W, s ′ , n h − 1, 0,0, ∅)<br />

13: else if (s ′ hyponym of s) ∧ (n o > 0) then<br />

14: S ← S ∪ Related(W, s ′ , 0, n o − 1, 0, ∅)<br />

15: else if ∃r ∈ R : (s ′ stands in relation r to s) ∧ (n g > 0) then<br />

16: S ← S ∪ Related(W, s ′ , 0,0, n g − 1, R)<br />

17: return S



Table 6. Sample entries from generated thesaurus (which contains entries for 55522<br />

terms, each entry listing 17 additional related terms on average)<br />

headword: Leseratte<br />

Buchgelehrte, Buchgelehrter, Bücherwurm, Geisteswissenschaftler, Gelehrte,<br />

Gelehrter, Stubengelehrte, Stubengelehrter, Student, Studentin, Wissenschaftler<br />

headword: leserlich<br />

Lesbarkeit, Verständlichkeit<br />

deutlich, entzifferbar, klar, lesbar, lesenswert, unlesbar, unleserlich, übersichtlich<br />

5 Monolingual Applications<br />

5.1 General Remarks<br />

Although at first it might seem that having WordNets aligned to the original<br />

WordNet is mainly beneficial for cross-lingual tasks, it turns out that the alignment<br />

also proves to be a major asset for monolingual applications, as one can<br />

leverage much of the information associated with the Princeton WordNet, e.g.<br />

the included English-language glosses, as well as a wide range of third-party resources,<br />

incl. topic domain information [15], links to ontologies such as SUMO<br />

[16] and YAGO [17], etc.<br />

For instance, for the task of word sense disambiguation, a preliminary study<br />

using an algorithm that maximizes the overlap of the English-language glosses<br />

[18] showed promising results, although we were unable to evaluate it more adequately<br />

due to the lack of an appropriate sense-tagged test corpus. One problem<br />

we encountered, however, was that the generated WordNet sometimes did not<br />

cover all of the terms and senses to be disambiguated, which means that it is<br />

not an ideal sense inventory for word sense disambiguation tasks.<br />

Apart from that, generated WordNets can be used for most other tasks that<br />

the English WordNet is usually employed for, including text and multimedia<br />

retrieval, text classification, text summarization, as well as semantic relatedness<br />

estimation, which we will now consider in more detail.<br />

5.2 Semantic Relatedness<br />

Several studies have attempted to devise means of automatically approximating<br />

semantic relatedness judgments made by humans, predicting e.g. that most<br />

humans consider the two terms “fish” and “water” semantically related. Such<br />

relatedness information is useful for a number of different tasks in information<br />

retrieval and text mining, and various techniques have been proposed, many relying<br />

on lexical resources such as WordNet. For the German language, Gurevych<br />

[19] reported that Lesk-style similarity measures based on the similarity of gloss<br />

descriptions [20] do not work well in their original form because GermaNet features<br />

only very few glosses, and those that do exist tend to be rather short. With



machine-generated aligned WordNets, however, one can apply virtually any existing<br />

measure of relatedness that is based on the English WordNet, because<br />

English-language glosses and co-occurrence data are available.<br />

We proceeded using the following assessment technique. Given two terms t_1,<br />

t_2, we estimate their semantic relatedness using the maximum relatedness score<br />

between any pair of their senses:<br />

rel(t_1, t_2) = max_{s_1∈σ(t_1)} max_{s_2∈σ(t_2)} rel(s_1, s_2) (3)<br />
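Formula (3) amounts to a nested maximum over sense pairs. A small sketch, where the sense inventory and score table are hypothetical stubs rather than real WordNet data:

```python
# Formula (3) as code: term relatedness is the best score over any pair of
# senses. The sense inventory and scores below are hypothetical stubs.

def term_relatedness(t1, t2, sigma, sense_rel):
    return max(sense_rel(s1, s2) for s1 in sigma(t1) for s2 in sigma(t2))

inventory = {"fish": ["fish.n.01"],
             "water": ["water.n.01", "body_of_water.n.01"]}
scores = {("fish.n.01", "water.n.01"): 0.4,
          ("fish.n.01", "body_of_water.n.01"): 0.7}

sigma = inventory.__getitem__
sense_rel = lambda a, b: scores.get((a, b), 0.0)

print(term_relatedness("fish", "water", sigma, sense_rel))  # 0.7
```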

For the relatedness scores, we consider three different approaches.<br />

1. Graph distance: We consider the graph constituted by WordNet’s senses<br />

and sense relations, and compute proximity scores for nodes in the graph<br />

by taking the maximum of the products of relation-specific edge weights for<br />

any two paths between two nodes.<br />

2. Gloss Similarity: For each sense in WordNet, extended gloss descriptions are<br />

created by concatenating the glosses and lexicalizations associated with the<br />

sense as well as those associated with certain related senses (senses connected<br />

via hyponymy, derivation/derived, member/part holonymy, and instance relations,<br />

as well as two levels of hypernyms). Each gloss description is then<br />

represented as a bag-of-words vector, where each dimension represents the<br />

TF-IDF value of a stemmed term from the glosses. For two senses s_1, s_2,<br />

one then computes the inner product of the two corresponding gloss vectors<br />

c_1, c_2 to determine the cosine of the angle θ_{c_1,c_2} between them, which<br />

characterizes the amount of term overlap for the two context strings:<br />

cos θ_{c_1,c_2} = 〈c_1, c_2〉 / (‖c_1‖ · ‖c_2‖) (4)<br />

3. Maximum: Since the two measures described above are based on very different<br />

information, we combined them into a meta-method that always chooses<br />

the maximum of these two relatedness scores.<br />
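The gloss-similarity measure reduces to a cosine between bag-of-words vectors as in formula (4). A stripped-down sketch: raw term counts stand in for the TF-IDF weights, and the gloss-expansion step over related senses is omitted.

```python
# Stripped-down gloss similarity: cosine between bag-of-words vectors as in
# formula (4). Raw term counts stand in for TF-IDF weights, and the gloss
# expansion over related senses is omitted for brevity.
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def gloss_rel(gloss1, gloss2):
    return cosine(Counter(gloss1.split()), Counter(gloss2.split()))

# The meta-method (measure 3) just takes the maximum of the two measures.
def combined_rel(graph_score, gloss1, gloss2):
    return max(graph_score, gloss_rel(gloss1, gloss2))

print(round(gloss_rel("feline mammal with whiskers",
                      "domestic feline mammal"), 3))  # 0.577
```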

For evaluating the approach, we employed three German datasets [19, 21]<br />

that capture the mean of relatedness assessments made by human judges. In<br />

each case, the assessments computed by our methods were compared with these<br />

means, and Pearson’s sample correlation coefficient was computed. The results<br />

are displayed in Table 7, where we also list the current state-of-the-art scores<br />

obtained for GermaNet and Wikipedia as reported by Gurevych et al. [22].<br />
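Pearson's sample correlation coefficient can be computed directly from the two score vectors; a small sketch with invented judgment vectors (not the GUR/ZG data):

```python
# Pearson's sample correlation coefficient between mean human judgments and
# system relatedness scores. The two vectors below are invented examples.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [3.9, 0.4, 2.5, 1.1]       # mean relatedness judgments
system = [0.82, 0.05, 0.51, 0.30]  # computed relatedness scores
print(round(pearson(human, system), 2))  # 0.99
```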

The results show that our semantic relatedness measures lead to near-optimal<br />

correlations with respect to the human inter-annotator agreement correlations.<br />

The main drawback of our approach is a reduced coverage compared to Wikipedia<br />

and GermaNet, because scores can only be computed when both parts of a term<br />

pair are covered by the generated WordNet.



Table 7. Evaluation of semantic relatedness measures, using Pearson's sample correlation<br />

coefficient. We compare our three semantic relatedness measures based on<br />

the automatically generated WordNet with the agreement between human annotators<br />

and scores for two alternative measures as reported by Gurevych et al. [22], one based<br />

on Wikipedia, the other on GermaNet.<br />

                        GUR65             GUR350            ZG222<br />
                        r      coverage   r      coverage   r      coverage<br />
Inter-Annot. Agreement  0.81   (65)       0.69   (350)      0.49   (222)<br />
Wikipedia (ESA)         0.56   65         0.52   333        0.32   205<br />
GermaNet (Lin)          0.73   60         0.50   208        0.08   88<br />
Gen. WordNet (graph)    0.72   54         0.64   185        0.41   89<br />
Gen. WordNet (gloss)    0.77   54         0.59   185        0.47   89<br />
Gen. WordNet (max.)     0.75   54         0.67   185        0.44   89<br />

One advantage of our approach is that it may also be applied without any<br />

further changes to the task of cross-lingually assessing the relatedness of English<br />

terms with German terms. In the following section, we will take a closer look at<br />

the general suitability of our WordNet for multilingual applications.<br />

6 Multilingual Applications<br />

6.1 General Remarks<br />

We can distinguish the following two categories of applications with multilingual<br />

support.<br />

– multilingual applications that need to support certain operations on more<br />

than just a single language, e.g. word processors with thesauri for multiple<br />

languages<br />

– multilingual applications that perform cross-lingual operations<br />

By creating isolated WordNets for many different languages one addresses<br />

only the first case. For the second case, one can use multiple WordNets for<br />

different languages where the senses are strongly interlinked. The ideal case is<br />

when there is no sense duplication, i.e. if two words in different languages share<br />

the same meaning, they should be linked to the same sense. The techniques<br />

described in Section 2 achieve this by producing WordNets that are strictly<br />

aligned to the source WordNet whenever appropriate.<br />

Aligned WordNets thus can be used for various cross-lingual tasks, including<br />

cross-lingual information retrieval [23], and cross-lingual text classification,<br />

which will now be studied.



6.2 Cross-Lingual Text Classification<br />

Text classification is the task of assigning text documents to the classes or categories<br />

considered most appropriate, thereby e.g. topically distinguishing texts<br />

about thermodynamics from others dealing with quantum mechanics. This is<br />

commonly achieved by representing each document using a vector in a high-dimensional<br />

feature space where each feature accounts for the occurrence of a<br />

particular term from the document set (a bag-of-words model), and then applying<br />

machine learning techniques such as support vector machines. For more<br />

information, please refer to Sebastiani’s survey [24].<br />

Cross-lingual text classification is a much more challenging task. Since documents<br />

from two different languages obviously have completely different term<br />

distributions, the conventional bag-of-words representations perform poorly. Instead,<br />

it is necessary to induce representations that tend to give two documents<br />

from different languages similar representations when their content is similar.<br />

One means of achieving this is the use of language-independent conceptual<br />

feature spaces where the feature dimensions represent meanings of terms rather<br />

than just the original terms. We process a document by removing stop words,<br />

performing part-of-speech tagging and lemmatization using the TreeTagger [25],<br />

and then map each term to the respective sense entries listed by the WordNet<br />

instance. In order to avoid decreasing recall levels, we do not disambiguate in any<br />

way other than acknowledging the lexical category of a term, but rather assign<br />

each sense s a local score w_{t,s} / ∑_{s′∈σ(t)} w_{t,s′} whenever a term t is mapped to multiple<br />

senses s ∈ σ(t). Here, w_{t,s} is the weight of the link from t to s as provided by<br />

the WordNet if the lexical category between document term and sense match,<br />

or 0 otherwise. We test two different setups: one relying on regular unweighted<br />

WordNets (w t,s ∈ {0,1}), and another based on a weighted German WordNet<br />

(w t,s ∈ [0,1]), as described in Section 2. Since the original document terms may<br />

include useful language-neutral terms such as names of people or organizations,<br />

they are also taken into account as tokens with a weight of 1. By summing up<br />

the weights for each local occurrence of a token t (a term or a sense) within a<br />

document d, one arrives at document-level token occurrence scores n(t,d), from<br />

which one can then compute TF-IDF-like feature vectors using the following<br />

formula:<br />

log(n(t,d) + 1) · log( |D| / |{d ∈ D | n(t,d) ≥ 1}| ) (5)<br />

where D is the set of training documents.<br />
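The sense-scoring and feature-weighting steps might look as follows. This is a sketch under simplifying assumptions: the sense lexicon is a toy stand-in, and the lexical-category check described in the text is omitted.

```python
# Sketch of the sense-based representation: a term occurrence is spread
# over its senses in proportion to the link weights w(t,s), and token
# scores are mapped to features via formula (5). The lexicon is a toy
# stand-in; the lexical category check from the text is omitted.
import math

def sense_scores(term, lexicon):
    """Local score of sense s: w(t,s) / sum over s' of w(t,s')."""
    weights = lexicon.get(term, {})
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()} if total else {}

def feature_value(n_td, n_docs, doc_freq):
    """Formula (5): log(n(t,d)+1) * log(|D| / |{d : n(t,d) >= 1}|)."""
    return math.log(n_td + 1) * math.log(n_docs / doc_freq)

lexicon = {"Bank": {"bank.n.02": 3.0, "bank.n.09": 1.0}}
print(sense_scores("Bank", lexicon))   # {'bank.n.02': 0.75, 'bank.n.09': 0.25}
print(round(feature_value(3, 200, 40), 3))  # log(4)*log(5) = 2.231
```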

This approach was tested using a cross-lingual dataset derived from the<br />

Reuters RCV1 and RCV2 collections of newswire articles [26, 27]. We randomly<br />

selected 15 topics shared by the two corpora in order to arrive at (15 choose 2) = 105 binary<br />

classification tasks, each based on 200 training documents in one language,<br />

and 600 test documents in a second language, likewise randomly selected, while<br />

ensuring equal numbers of positive and negative examples in order to avoid<br />




biased error rates. We considered a) German training documents and English<br />

test documents and b) English training documents and German test documents.<br />

For training, we relied on the SVMlight implementation [28] of support vector<br />

machine learning [12], which is known to work very well for text classification.<br />

Table 8. Evaluation of cross-lingual text classification in terms of micro-averaged<br />

accuracy, precision, recall, and F_1-score for a German-English as well as an English-<br />

German setup. We compare the standard bag-of-words TF-IDF representation with<br />

two WordNet-based representations, one using an unweighted, the other based on a<br />

weighted German WordNet.<br />

                      acc.   prec.  rec.   F_1<br />
German-English<br />
TF-IDF                80.56  77.49  86.14  81.59<br />
WordNet (unweighted)  87.09  85.27  89.68  87.42<br />
WordNet (weighted)    87.98  85.48  91.51  88.39<br />
English-German<br />
TF-IDF                78.82  79.19  78.20  78.69<br />
WordNet (unweighted)  85.39  87.38  82.74  84.99<br />
WordNet (weighted)    87.47  87.73  87.07  87.40<br />

The results in Table 8 clearly show that automatically built WordNets aid in<br />

cross-lingual text classification. Since many of the Reuters topic categories are<br />

business-related, using only the original document terms, which include names of<br />

companies and people, already works surprisingly well, though presumably not<br />

well enough for use in production settings. By considering WordNet senses, both<br />

precision and recall are boosted significantly. This implies that English terms in<br />

the training set are being mapped to the same senses as the corresponding German<br />

terms in the test documents. Using the weighted WordNet version further<br />

improves the recall, as more relevant terms and senses are covered.<br />

7 Conclusions<br />

We have shown that machine-generated WordNets are useful for a number of<br />

different purposes. First of all, of course, they can serve as a valuable starting<br />

point for establishing more reliable WordNets, which would involve manually<br />

extending the coverage and addressing issues arising from differences between<br />

the lexicons of different languages.<br />

At the same time, machine-generated WordNets can be used directly without<br />

further manual work to generate thesauri for human use, or for a number of<br />

different natural language processing applications, as we have shown in particular<br />

for semantic relatedness estimation and cross-lingual text classification.



In the future, we would like to investigate techniques for extending the coverage<br />

of such statistically generated WordNets to senses not covered by the original<br />

Princeton WordNet. We hope that our research will aid in contributing to making<br />

lexical resources available for languages which to date have not been dealt<br />

with by the WordNet community.<br />

References<br />

1. Fellbaum, C., ed.: WordNet: An Electronic Lexical Database (Language, Speech,<br />

and Communication). The MIT Press (1998)<br />

2. Vossen, P.: Right or wrong: Combining lexical resources in the EuroWordNet<br />

project. In: Proc. Euralex-96. (1996) 715–728<br />

3. Okumura, A., Hovy, E.: Building Japanese-English dictionary based on ontology for<br />

machine translation. In: Proc. Workshop on Human Language Technology, HLT,<br />

Morristown, NJ, USA, Association for Computational Linguistics (1994) 141–146<br />

4. Rigau, G., Agirre, E.: Disambiguating bilingual nominal entries against WordNet.<br />

In: Proc. Workshop on the Computational Lexicon at the 7th European Summer<br />

School in Logic, Language and Information, ESSLLI. (1995)<br />

5. Atserias, J., Climent, S., Farreres, X., Rigau, G., Rodríguez, H.: Combining multiple<br />

methods for the automatic construction of multilingual WordNets. In: Proc.<br />

International Conference on Recent Advances in NLP. (1997) 143–149<br />

6. Benitez, L., Cervell, S., Escudero, G., Lopez, M., Rigau, G., Taulé, M.: Methods<br />

and tools for building the Catalan WordNet. In: Proc. ELRA Workshop on<br />

Language Resources for European Minority Languages at LREC 1998. (1998)<br />

7. Miháltz, M., Prószéky, G.: Results and evaluation of Hungarian Nominal Word-<br />

Net v1.0. In: Proc. Second Global WordNet Conference, Brno, Czech Republic,<br />

Masaryk University (2004)<br />

8. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: Developing an aligned<br />

multilingual database. In: Proc. First International Global WordNet Conference,<br />

Mysore, India. (2002) 293–302<br />

9. Ordan, N., Wintner, S.: Hebrew WordNet: a test case of aligning lexical databases<br />

across languages. International Journal of Translation 19(1) (2007)<br />

10. de Melo, G., Weikum, G.: A machine learning approach to building aligned wordnets.<br />

In: Proc. International Conference on Global Interoperability for Language<br />

Resources, ICGL. (2008)<br />

11. Richter, F.: Ding Version 1.5, http://www-user.tu-chemnitz.de/~fri/ding/.<br />

(2007)<br />

12. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)<br />

13. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. (2001)<br />

14. Hamp, B., Feldweg, H.: GermaNet — a lexical-semantic net for German. In:<br />

Proc. ACL Workshop Automatic Information Extraction and Building of Lexical<br />

Semantic Resources for NLP Applications, Madrid (1997)<br />

15. Bentivogli, L., Forner, P., Magnini, B., Pianta, E.: Revising the Wordnet Domains<br />

hierarchy: semantics, coverage and balancing. In: Proc. COLING 2004 Workshop<br />

on Multilingual Linguistic Resources, Geneva, Switzerland (2004) 94–101



16. Niles, I., Pease, A.: Linking lexicons and ontologies: Mapping WordNet to the<br />

Suggested Upper Merged Ontology. In: Proc. 2003 International Conference on<br />



Words, Concepts and Relations<br />

in the Construction of Polish WordNet ⋆<br />

Magdalena Derwojedowa 1 , Maciej Piasecki 2 , Stanisław Szpakowicz 3,4 ,<br />

Magdalena Zawisławska 1 , and Bartosz Broda 2<br />

1 Institute of the Polish Language, Warsaw University,<br />

{derwojed,zawisla}@uw.edu.pl<br />

2 Institute of Applied Informatics, Wrocław University of Technology,<br />

{maciej.piasecki,bartosz.broda}@pwr.wroc.pl<br />

3 School of Information Technology and Engineering, University of Ottawa,<br />

szpak@site.uottawa.ca<br />

4 Institute of Computer Science, Polish Academy of Sciences<br />

Abstract. A Polish WordNet has been under construction for two years.<br />

We discuss the organisation of the project, the fundamental assumptions,<br />

the tools and the resources. We show how our work differs from that<br />

done on EuroWordNet and BalkaNet. In a year we expect the network<br />

to reach 20000 lexical units. Some 12000 entries will have been completed<br />

by hand. Work on others will be automated as far as possible; to that<br />

end, we have developed statistics-based semantic similarity functions and<br />

methods based on a form of chunking. The preliminary results show that<br />

at least semi-automated acquisition of relations is feasible, so that the<br />

lexicographers’ work may be reduced to revision and approval.<br />

1 Organisation of the project<br />

Ever since the initial burst of popularity of the original WordNet [1, 2], there<br />

has been little doubt how useful WordNets are in Natural Language Processing.<br />

For those who work with a language that lacks a WordNet, the question is not<br />

whether, but how and how fast to construct such a lexical resource. The construction<br />

is costly, with the bulk of the cost due to the high linguistic workload.<br />

This appears to have been the case, in particular, in two multinational WordNet-building<br />

projects, EuroWordNet [3] and BalkaNet [4]. The recent developments<br />

in automatic acquisition of lexical-semantic relations suggest that the cost might<br />

be reduced. Our project to construct a Polish WordNet (plWordNet) explores<br />

this path as a supplement to a well-organized and well-supported effort of a team<br />

of linguists/lexicographers.<br />

⋆ Work financed by the Polish Ministry of Education and Science, Project No. 3 T11C<br />

018 29.


Words and Concepts in the Construction of Polish WordNet 163<br />

The three-year project started in November 2005. The Polish Ministry of<br />

Education and Science funds it with a very modest ca. 65000 euro (net). The<br />

stated main objective is the development of algorithms of automatic acquisition<br />

of lexical-semantic relations for Polish, but we envisage the manual, software-assisted<br />

creation of some 15000 to 20000 lexical units 5 (LUs) as an important<br />

side-effect. The evolving network also plays an essential role in the automated acquisition<br />

of relations. We describe the current state of the project in Section 3.3.<br />

We will automate part of the development effort. A core of about 7000 LUs<br />

has been constructed completely manually; in a form of bootstrapping, the remainder<br />

of the initial plWordNet will be built semi-automatically. Algorithms<br />

that generate synonym suggestions from a large corpus [5] will make suggestions<br />

for the linguists to act upon. The ultimate responsibility for every entry rests<br />

with its authors, in keeping with our general principle of high trustworthiness<br />

of the resource. We must, however, try to reduce the linguists’ workload and<br />

thus the time it takes to construct a network of a size comparable to several<br />

much more established European WordNets. We have allotted the funds approximately<br />

in the proportion 1 : 2 to manual work and to the software design and<br />

development work.<br />

The remainder of the paper presents a more detailed overview of decisions<br />

made and work done till now, reviews the lessons learned, and sketches the plan<br />

for the last year of this project.<br />

2 Fundamental assumptions<br />

The backbone of any WordNet is its system of semantic relations. Two principles<br />

guided our design of the set of relations for Polish WordNet (plWordNet): we<br />

should — for obvious portability reasons — stay as close as possible to the<br />

Princeton WordNet (WN) set and the EuroWordNet (EWN) set, but we should<br />

also respect the specific properties of the Polish language, especially its very rich<br />

morphology. Tables 1 and 2 summarise our decisions 6.<br />

In our description we have kept the division of lexemes into grammatical<br />

classes (parts of speech, as in WN): nouns, verbs and adjectives. Relations other<br />

than relatedness and pertainymy connect lexemes in the same class. Some relations<br />

are symmetrical (e.g., if A is an antonym of B, then B is an antonym of A;<br />

the hyponymy-hypernymy pair is symmetrical, too), while others are not (e.g.,<br />

holonymy: a spoke is part of a wheel, but not every wheel has spokes). We refer<br />

to this property of semantic relations as reversibility.<br />

5 We consider it a more precise measure of WordNet size than the number of synsets.<br />

Variously interconnected LUs – lexemes, generally speaking – are the basic building<br />

blocks of plWordNet.<br />

6 EWN has introduced a number of other relations which are not relevant to the<br />

discussion in this paper.<br />


164 Magdalena Derwojedowa et al.<br />

WordNet            EuroWordNet           Polish WordNet<br />

synonymy           synonymy              synonymy<br />

antonymy           antonymy              antonymy<br />

–                  –                     conversion<br />

hypo-/hypernymy    hypo-/hypernymy       hypo-/hypernymy<br />

mero-/holonymy     mero-/holonymy        mero-/holonymy<br />

entailment         –                     entailment<br />

troponymy          –                     troponymy<br />

cause              caused/is caused by   –<br />

derived form       derived               –<br />

pertainym          pertainymy            relatedness<br />

–                  –                     pertainymy<br />

similar to         –                     –<br />

participle         –                     –<br />

see also           –                     –<br />

attribute          –                     –<br />

–                  role                  –<br />

–                  has subevent          –<br />

–                  in manner of          –<br />

–                  be in state           –<br />

–                  fuzzynymy             fuzzynymy<br />

Table 1. Semantic relations in WordNet, EuroWordNet and Polish WordNet<br />

relation           grammatical class           reversibility<br />

                   noun    verb    adjective<br />

synonymy           +       +       +           +<br />

hypo-/hypernymy    +       +       +           +<br />

antonymy           +       +       +           +<br />

conversion         +       +       +           +<br />

mero-/holonymy     +       –       –           –<br />

entailment         –       +       –           –<br />

troponymy          –       +       –           –<br />

relatedness        +       +       +           –<br />

derived form       +       +       +           –<br />

fuzzynymy          +       +       +           –<br />

Table 2. Properties of the semantic relations in Polish WordNet



In plWordNet, relations hold between LUs — pairs of lexemes. For example,<br />

the adjective mądry ‘wise’ is antonymous with głupi ‘stupid’, but its synonym<br />

inteligentny ‘intelligent’ has a different antonym, nieinteligentny ‘unintelligent’;<br />

mąż ‘husband’ is a converse of żona ‘wife’, while its synonym małżonek ‘spouse’<br />

has the converse małżonka ‘spouse’. A derived form has obviously one root.<br />

From EWN, we adopted the fuzzynymy relation. It is meant for pairs of<br />

lexemes which are clearly connected semantically, but which the lexicographer<br />

cannot fit into the existing system of more sharply delineated relations. The<br />

practice bore out our decision. We found, even in the basic vocabulary of the<br />

core list of lexical units, numerous instances of fuzzynymy (przylądek - morze,<br />

‘cape’ - ‘sea’, pacjent - przychodnia ‘patient’ - ‘walk-in clinic’). Future research<br />

includes a review of the fuzzynymy class to see if some subtypes of relations<br />

recur; this might be very interesting material for further linguistic investigation.<br />

There is one relation unique to plWordNet: conversion (narzeczony - narzeczona<br />

‘fiancé’ - ‘fiancée’, rodzic - dziecko ‘parent’ - ‘child’, kupić - sprzedać ‘to<br />

buy’ - ‘to sell’). Following Apresjan [6, pp. 242-265], we consider such cases to<br />

be different from antonymy.<br />

Contrary to our initial expectation, hypo/hypernymy applies not only to<br />

nouns and verbs (samochód - pojazd ‘car’ - ‘vehicle’, biec - poruszać się ‘to run’<br />

- ‘to move’), but also to adjectives (turkusowy - niebieski ‘turquoise’ - ‘blue’).<br />

In fact, adjectival hypo/hypernymy has turned out to be relatively widespread,<br />

once we allowed the lexicographers to note it.<br />

Neither WN nor EWN support relations that enable an effective rendition of<br />

the semantic variation carried by rich morphology and productive derivation. In<br />

Polish, we have verb aspect (szyć - uszyć ‘to sew - to have sewn’), reflexivity (golić<br />

- golić się ‘to shave someone - to shave oneself’), subtle derivation via prefixes<br />

(gnić - przegnić, nadgnić, wygnić etc. ‘to rot - to rot through - to become partially<br />

rotten - to rot out’), diminutives (kot ‘cat’ - kotek, koteczek, kocio, kotuś, kotunio;<br />

mały ‘small’ - malutki, maluteńki, malusieńki, maluśki), augmentatives (dziewczyna<br />

‘girl’ - dziewucha, dziewczynisko, dziewuszysko), expressive names (kobieta<br />

‘woman’ - kobiecina ‘a simple or poor woman’), gender pairs (malarz - malarka<br />

‘painter masc - painter fem ’), names of offspring (kot ‘cat’ - kocię ‘kitten’), names<br />

of action (strzelać ‘to shoot’ - strzelanie ‘shooting’, strzelanina ‘fusillade’), names<br />

of abstracts (nienawidzieć ‘to hate’ - nienawiść ‘hatred’, mądry ‘wise’ - mądrość<br />

‘wisdom’), names of places (jeść ‘to eat’ - jadalnia ‘dining room’), names of<br />

carriers of attribute (rudy ‘red-haired’ - rudzielec ‘someone red-haired’), names<br />

of agents of action (palić ‘smoke’ - palacz ‘smoker’), relational adjectives (uniwersytet<br />

‘university’ - uniwersytecki ‘university (in noun-noun compounds)’).<br />

Analogous phenomena were considered in Czech WordNet [7].<br />

To account for this variety somehow, we decided to extend two relations,<br />

relatedness and pertainymy. In the former, we placed the most regular types of<br />

word formation: names of actions, abstract names, pure aspectual pairs (without<br />

any other semantic “surplus”, e.g., pisać ‘to write’ - napisać ‘to have written’,



kupić ‘to have bought’ - kupować ‘to buy habitually or to be buying’), causative<br />

verbs (martwić się ‘to worry’ - martwić (kogoś) ‘to worry someone’), relational<br />

adjectives and adjectival participles (which we do not consider as verb forms<br />

but as separate lexemes). The pertainymy relation accounts for the less regular<br />

word forms: names of places, carriers of attributes, agents of actions, offspring,<br />

augmentative, expressive and diminutive forms, gender pairs and names of nationalities.<br />

The prefixed verbs and “impure” aspectual pairs are captured by<br />

troponymy. Although we tried to fit as much as possible into the WN and EWN<br />

relation structure, we agree with the Czech WordNet team: it is necessary to go<br />

beyond that set of relations if we are to take into consideration the specificity of<br />

Slavic languages (Pala and Smrž 2004: 86).<br />

It is perhaps unexpected that the most problematic lexical-semantic relation<br />

turned out to be the fundamental one: synonymy. It helped little that this semantic<br />

notion is so well explored. There are two approaches to synonymy. One<br />

approach defines synonyms as lexemes with the same lexical meaning but with<br />

different shades of meaning; the other requires synonyms to be substitutable in<br />

some contexts [6, pp. 205-207]. In our opinion, neither approach works well in<br />

a semantically motivated network. We sharpened the criterion by positing that<br />

synonyms have the same hypernym and the same meronym (if they have any).<br />

For example, the lexemes twarz, morda, gęba, ryj, pysk, facjata, buzia, pyszczek<br />

(all of them mean more or less ‘the face’) can be considered synonymous in<br />

a wide sense. There are valid substitutions in some contexts (e.g., dał mu w<br />

twarz/mordę/gębę ‘he hit him in the face’; pogłaskała go po twarzy/buzi/pyszczku<br />

‘she stroked his face’). They do not, however, have the same hypernym and<br />

meronym: morda is an expressive name of a face, but not a body part. We regard<br />

such expressive names as hyponyms of the unmarked lexemes such as ‘face’; there<br />

is the same stance in [8]. One of the effects of this decision is that our synsets are<br />

very narrow, sometimes even with one element, but the hypo/hypernymy tree is<br />

much deeper.<br />
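Stated operationally, the sharpened criterion admits two LUs into one synset only if their hypernyms agree and their recorded meronyms (if any) agree as well. A minimal Python sketch, with invented stand-in dictionaries loosely following Fig. 1 (diacritics omitted); this is an illustration, not the project's actual data model:

```python
# Hypothetical relation store standing in for plWordNet data.
hypernym = {
    "twarz": "czesc ciala", "glowa": "czesc ciala",  # 'face', 'head' -> 'body part'
    "lico": "twarz", "oblicze": "twarz",             # elevated names of the face
}
meronym = {
    "twarz": "nos",    # a 'face' has a 'nose' as a part
    "glowa": "mozg",   # a 'head' has a 'brain' as a part
}

def may_be_synonyms(a: str, b: str) -> bool:
    """Candidate synonyms must share a hypernym and, where recorded, a meronym."""
    if hypernym.get(a) != hypernym.get(b):
        return False
    ma, mb = meronym.get(a), meronym.get(b)
    return ma is None or mb is None or ma == mb

print(may_be_synonyms("twarz", "glowa"))   # False: same hypernym, but parts differ
print(may_be_synonyms("lico", "oblicze"))  # True: same hypernym, no recorded parts
```

Under this check, wide-sense synonyms with a different position in the part-whole or hypernymy structure fall out of the synset, which is exactly what narrows the synsets and deepens the tree.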

The problem with the definition of synonymy also arose in Bulgarian WordNet (Koeva,<br />

Mihov, Tinchev 2004: 62): “In Princeton WordNet the substitution criteria<br />

for SYNONYMY is mainly adopted [...] The consequences from such an approach<br />

are at least two — not only the exact SYNONYMY is included in the data base<br />

(a context is not every context). Second, it is easy to find contexts in which words<br />

are interchangeable, but still denoting different concepts (for example hypernyms<br />

and hyponyms), and there are many words which have similar meanings and by<br />

definition they are synonyms but are hardly interchangeable in any context due<br />

to different reasons — syntactic, stylistic, etc. (for example an obsolete and a<br />

common word)”.<br />

In our opinion, the vagueness of the synonymy definition and the lack of<br />

formal tools for establishing the synonymy of lexemes put in doubt the legitimacy<br />

of synonymy as the basic type of relation in lexical-semantic networks. It would<br />

appear that all relations link LUs. Suppose that B and D are (near-)synonyms



[Figure 1: a graph of the synsets {mózg}, {włosy}, {głowa}, {nos}, {twarz}, {policzek}, {usta}, {buzia, buźka}, {gęba, facjata}, {morda}, {lico, oblicze}, {ryj}, {pysk} and {pyszczek}, linked by hypo/hypernymy and meronymy/holonymy.]<br />

Fig. 1. The lexical unit twarz ‘face’ and its neighbours; straight arrows represent<br />

hypo/hypernymy, wavy arrows – meronymy/holonymy; mózg - ‘brain’, włosy - ‘hair’,<br />

nos - ‘nose’, policzek - ‘cheek’, usta - ‘mouth’.<br />

and B is a hypernym of synonymous A and C; in certain contexts D may be<br />

substituted for B, and is also a hypernym of A and C.<br />

The plWordNet project is building the semantic network from scratch; we<br />

decided not merely to translate the WN trees (in WordNet 3.0), because that<br />

would reflect the structure of English rather than Polish. We did try to translate<br />

the higher levels of WN, only to discover a few serious problems. 1) Many<br />

lexemes in WN can hardly be considered to denote frequent, basic or most general<br />

concepts in Polish; examples include skin flick ‘film pornograficzny’, party<br />

favour ‘pamiątka z przyjęcia’, butt end ‘grubszy koniec’, apple jelly ‘galaretka<br />

jabłkowa’. 2) WN glosses are not precise enough to let us find the Polish equivalent,<br />

or there may be no lexical Polish equivalent at all (other than calques<br />

of English words); examples of untranslatable entries include changer, modifier,<br />

communicator, acquirer, banshee, custard pie, marshmallow. 3) Translating WN<br />

would create nodes in the hypo/hypernymy structure that represent unnecessary<br />

or artificial concepts; examples include emotional person ‘osoba uczuciowa’, immune<br />

person ‘osoba uodporniona’, large person ‘duży człowiek’, rester ‘odpoczywający’,<br />

smiler ‘uśmiechający się’, states’ rights ‘prawa stanowe’.<br />

Our fundamental design decision was corroborated by the experience of the<br />

Czech WordNet team [7, pp. 84-85]. The BalkaNet project systematically recorded<br />

concepts from other languages (mainly from English, based on WN), not<br />

lexicalized in the language at hand. [...] The Czech team noticed problems with<br />

the translation of equivalents and the corresponding gaps with regard to English.<br />

They observed two types of cases where it was not possible to find synonyms (or



even near-synonyms). The Czech synsets had no lexical equivalents in English<br />

because of the difference in lexicalizations and conceptualization, or because of<br />

the typological differences between those two languages; there are, for example,<br />

no such phenomena in English as Czech verb aspect, reflexive verbs or rich<br />

word formation. It is well known that concepts are not universal, nor are they<br />

expressed in the same way across languages (this is true even of so basic a<br />

notion as colour), although ethnocentrism can sometimes still be observed —<br />

see Wierzbicka’s criticism of that approach [9, p. 193]. We decided to describe<br />

the lexicalization and conceptualization in Polish as accurately as possible. We<br />

think that it is much more interesting to compare two semantic networks that<br />

reflect the real nature of two natural languages than to create a hybrid, which<br />

in fact would be just an English semantic network translated into Polish.<br />

Near the end of year 2 of the plWordNet project, the noun network (the<br />

intended vocabulary) is ready. Work must be completed on verbs and adjectives.<br />

See Section 3.3 for more details.<br />

3 Tools and resources<br />

3.1 The linguist’s tool<br />

We now discuss software support for the Polish WordNet enterprise: a dedicated<br />

editor and algorithms that support lexicographers’ decisions. Two years ago,<br />

all available tools – such as [10–12] – required editing the source format, not<br />

exactly linguist-friendly. A much more apt editor, DEBVisDic [13], was not yet<br />

available 7 . We therefore chose to design our own WordNet editor, plWNApp,<br />

with tight coupling of the envisaged development procedure and the linguistic<br />

tasks. [14] present the implementation in some detail; here, we focus on its use<br />

as a tool.<br />

Linguists edit synsets and relations using plWNApp, which also supports<br />

verification and control by coordinators of the project’s linguistic side. Written<br />

in Java, so practically fully portable, plWNApp has a client-server architecture<br />

with a central database. Clients transparently connect to the database via the<br />

Internet, though a version that allows work on a local copy of the database is<br />

also maintained. Efficiency, even on low-end computers, was a priority. Network<br />

communication is efficient due to caching data exchanged with the database.<br />

While caching might put screen data out of sync for up to two minutes, this has not<br />

happened in 1.5 years of use by a large, distributed group of linguists.<br />

Linguists work via a Graphical User Interface and never edit source files.<br />

Every user downloads an appropriate current version of the WordNet from the<br />

server. Data are exported and archived in XML, in a special format that we plan<br />

to replace with a standard format once we have identified a fitting one. The<br />

7 Early on, our project was also constrained by a commercial connection.



coordinators can edit source files; they did that during the initial assignment<br />

of lexical units (LUs) to domains. The coordinators’ stronger tool also supports<br />

definition of new lexical-semantic relations, invasive changes in the database and<br />

elements of group management. Both versions check on the fly such basic things<br />

as the existence of synsets/LUs or the appropriateness of relation instances to be<br />

added. More sophisticated diagnostic procedures have been designed, and some<br />

already installed.<br />

Core plWordNet will have a complete description of selected LUs, so<br />

plWNApp distinguishes system LUs and user LUs. Only coordinators can add<br />

the former; other linguists introduce user LUs to complete synsets under construction.<br />

Our linguistic assumptions suggested support for three main tasks:<br />

1. construct an initial, broad synset for a given system LU;<br />

2. correct and divide initial synsets into more cohesive, almost always smaller<br />

synsets;<br />

3. link synsets by lexical-semantic relations.<br />

To support these tasks, plWNApp’s user interface features two perspectives:<br />

the LU perspective (Fig. 2) and the synset perspective (Fig. 3). The former is organised<br />

around selecting a LU and defining synsets and LU relations for it. A linguist<br />

would traverse the list of system LUs in the domain assigned to her and, for each<br />

LU, define all synsets to which it belongs. System LUs thus serve as starting<br />

points in synset construction.<br />

The intended result of task 1 was to group LUs in broad sets of near-synonyms,<br />

but pairs of synsets often overlapped because of a lack of precision in<br />

the grouping criteria. In order to support coordinators in task 2, we added the<br />

comparison perspective, showing two lists of synsets that share at least k LUs.<br />

Coordinators can edit or merge synsets, or move LUs around. We soon discovered,<br />

however, that correction – task 2 – is only possible when done together<br />

with task 3, supported by the synset perspective. According to the definition<br />

of synsets and synset relations, a LU can participate in a synset only because<br />

of what we know about this synset’s relations. In the comparison perspective,<br />

synsets are isolated from the structure of synset relations, and coordinators find<br />

it very hard to determine the correctness of the overlap between two synsets. In<br />

the next version of plWNApp, we will enhance this perspective to a comparison<br />

of structures of synset relations around two synsets.<br />

In the synset perspective, each user interaction was to begin with the selection<br />

of a source synset which either must be corrected or is chosen as the starting<br />

node of a relation instance. Next, the user was to divide the source synset<br />

into two or to select a target synset, and then to pick a relation between the<br />

two (hypo/hypernymy when dividing the source synset). The added relation instances<br />

appear in a table at the bottom of the synset perspective. Predictably,<br />

practice diverged significantly from the initial ideas. The relation table was used



Fig. 2. The LU perspective<br />

most often, gradually becoming the central point of the synset perspective. Extracting<br />

a hypo/hypernym synset directly from the source synset was a very rare<br />

operation. Linguists preferred to create a new hypo/hypernym synset and move<br />

some LUs from the source synset, one by one. It may be easier to decide on one<br />

LU than on a group. In any event, the synset perspective is the basic tool in<br />

transforming the initial synsets into the deepened hierarchy of narrow synsets, in<br />

keeping with our fundamental assumptions. Also, the table shows only relations<br />

of the selected source synsets, so linguists suggested extending the table to a<br />

graph view. We plan to introduce the possibility of editing synsets and synset<br />

relations in combination with the enhanced comparison perspective.<br />

Early on, we found that consistency among linguists was a concern. In order<br />

to increase consistency, we introduced substitution tests. For each relation in<br />

plWordNet – for synsets and for LUs – there is a morphologically generic test<br />

with slots for LUs from the linked synsets or for the linked LUs. (Coordinators<br />

can edit definitions.) Slots are filled with the appropriate morphological forms.



Fig. 3. The Synset perspective<br />

Whenever a relation instance is to be added, plWNApp generates a test instance<br />

and shows it to the linguist.<br />

The tool associates domains not only with LUs but also with synsets. A LU<br />

is assigned to some domains when it is added to the database. The domain of<br />

a synset is that of its first LU, usually the system LU that started this synset.<br />

Domains offer a simple but useful way of dividing work among linguists. It is the<br />

coordinators’ task to merge domain subsets. This is not trouble-free: occasionally,<br />

two linguists working on two close domains created a similar, overlapping<br />

structure of synsets and synset relations. An enhanced comparison perspective<br />

should help adjust such overlaps.



3.2 Toward automation<br />

Work on extending plWNApp to support semi-automatic WordNet construction<br />

is under way. We will build software tools that:<br />

– offer better corpus-browsing capability,<br />

– criticize existing WordNet content,<br />

– suggest possible instances of relations.<br />

The browsing tools are based on the statistical analysis of a large corpus in<br />

search of distributional associations of LUs. One can identify potential collocations<br />

and extract a semantic similarity function (SSF), which for a pair of<br />

LUs returns a real-valued measure of their similarity. As our examples showed,<br />

real multiword LUs are a minority among the extracted collocations, and it would be<br />

very hard to add new multiword LUs automatically on the basis of a collocation<br />

list. A linguist, however, can easily spot possible new multiword LUs if shown a<br />

candidate list.<br />

SSFs are based on Harris’s Distributional Hypothesis [15], aptly summarized<br />

in [16]: ‘The distributional hypothesis is usually motivated by referring to the<br />

distributional methodology developed by Zellig Harris (1909-1992). (...) Harris’<br />

idea was that the members of the basic classes of these entities behave distributionally<br />

similarly, and therefore can be grouped according to their distributional<br />

behavior. As an example, if we discover that two linguistic entities, w1 and w2,<br />

tend to have similar distributional properties, for example that they occur with<br />

the same other entity w3, then we may posit the explanandum that w1 and w2<br />

belong to the same linguistic class. Harris believed that it is possible to typologize<br />

the whole of language with respect to distributional behavior, and that such<br />

distributional accounts of linguistic phenomena are “complete without intrusion<br />

of other features such as history or meaning.”’<br />

Many methods of SSF construction have been proposed. A serious problem<br />

is their comparison. A SSF produces real values. Manual inspection of even<br />

several real numbers is very hard on people. While all known SSF algorithms<br />

produce interesting results, how do we choose a SSF that distinguishes really<br />

similar LUs (synonyms or close hypo/hypernym) from other groupings? Core<br />

plWordNet, constructed manually, can serve as the basis for evaluation. Following<br />

[17], we evaluate a SSF by applying it in solving a version of WordNet-Based<br />

Synonymy Test (WBST; see also [18]): given a word and four candidates, separate<br />

the actual synonym from distractors. The test is automatically generated<br />

from plWordNet; for evaluation, different SSFs were extracted from the IPI PAN<br />

corpus 8 [5] for the same set of LUs.<br />

8 The IPI PAN Corpus contains about 254 million tokens and is rather unbalanced:<br />

most of the text in the corpus comes from newspapers, transcripts of parliamentary<br />

sessions and legal texts; however, it also includes artistic prose and scientific texts.<br />
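The WBST itself reduces to a forced choice: given the question word and four candidates, the SSF must score the true synonym above the three distractors. A hedged sketch, with invented placeholder scores standing in for a real SSF:

```python
# Hypothetical similarity scores; a real run would call an SSF trained on a corpus.
SCORES = {
    ("auto", "samochod"): 0.8,   # true synonym ('car')
    ("auto", "rower"): 0.3,      # distractor ('bicycle')
    ("auto", "kwiat"): 0.1,      # distractor ('flower')
    ("auto", "okno"): 0.2,       # distractor ('window')
}

def wbst_item_correct(word, synonym, distractors, ssf):
    """One WBST item: the SSF answers by picking the highest-scoring candidate."""
    candidates = [synonym] + list(distractors)
    chosen = max(candidates, key=lambda c: ssf(word, c))
    return chosen == synonym

ok = wbst_item_correct("auto", "samochod",
                       ["rower", "kwiat", "okno"],
                       lambda a, b: SCORES[(a, b)])
print(ok)  # True
```

Accuracy over many such items, generated automatically from plWordNet synsets, is the figure reported for each SSF variant below.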



We tested several versions of SSF, achieving the best result of 90.92% in<br />

WBST generated from plWordNet for a SSF based on the Rank Weight Function<br />

(RWF) [19]: on the basis of SSF_RWF we can distinguish a synonym from three<br />

randomly selected words in some 90% cases. However, in a more difficult version<br />

of the WBST, called Extended WBST [18], in which decoys are chosen from LUs<br />

similar to the answer, the application of the same SSF_RWF gave an accuracy<br />

of 53.52%. Though the ability of SSF_RWF to distinguish among semantically<br />

related LUs is limited, it was added to plWNApp as a browsing tool. SSF_RWF is<br />

used to produce lists of the k LUs most similar to a given one. Such a list can help<br />

linguists look among the top positions in the list for possibly omitted synonyms<br />

and hypo/hypernyms.<br />

SSF_RWF is loosely correlated with similarity functions based on plWordNet<br />

but it is hard to find any threshold above which the similarity value guarantees<br />

the existence of the synonymy or hypo/hypernymy relation. In an experiment, we<br />

chose the value 0.2 as a threshold (on the basis of manual inspection). Next, one of<br />

the authors manually assessed a statistically significant sample of LU pairs with<br />

the similarity above the threshold, according to the synset relations: synonymy,<br />

hypo/hypernymy, meronymy and holonymy. Half of the pairs did not express<br />

any of these relations. The other half appeared to be worth browsing. In 7% of<br />

cases we found two synonyms already present in plWordNet, but only 1% of<br />

new synonym pairs. 20% of pairs were close hypo/hypernyms (not necessarily<br />

direct) already present in plWordNet, and 16% of new close hypo/hypernyms<br />

and co-hyponyms were discovered. 1% of known meronyms and holonyms were<br />

found and 5% of new ones were discovered.<br />

SSFs are intended to extract more rather than fewer semantic relations between<br />

LUs. We will reintroduce restrictions by way of clustering of the results<br />

of SSF – constructing proto-synsets. We also want to apply statistical lexico-syntactic<br />

patterns – for example, in the style of [20] – to a large corpus, in order<br />

to extract candidate instances of plWordNet relations. The extracted instances<br />

will be used to combine the clusters resulting from grouping LUs into a network of<br />

synset relations. The results of automatic extraction will always be anchored to<br />

plWordNet, because we want to extend it gradually, at each step adding a small<br />

set of new LUs automatically suggested for inclusion. After each iteration of<br />

automatic acquisition, linguists will be asked to verify and correct the proposed<br />

proto-synsets and instances of relations. The proposals will be clearly marked in<br />

plWNApp.<br />

3.3 The current state of the system<br />

At the time of this writing, plWordNet contains 12483 LUs grouped in 8095<br />

synsets, with 6059 synset relations and 5379 LU relations. Table 3 shows more detailed<br />

facts. While we feel that the number of LUs is more important than the number


174 Magdalena Derwojedowa et al.<br />

of synsets (Section 2), Table 3 separates relations between synsets and LUs —<br />

the former hold for every LU in a synset.<br />

LUs | LU relations | synset relations<br />

nouns 8307 | antonymy 1952 | hypo/hypernymy 4293<br />

verbs 3317 | converse 47 | holonymy 919<br />

adjectives 3053 | relatedness 1534 | meronymy 847<br />

 | pertainymy 1175 | <br />

 | fuzzynymy 671 | <br />

all 14677 | all 5379 | all 6059<br />

Table 3. plWordNet in numbers, September 2007<br />

The average rate of polysemy is 1.46 (calculated as the average number of<br />

synsets including the given homonymous LU, as in [1]), and the average size of<br />

a synset is 2.04 LUs. The detailed data appear in Tables 4 and 5.<br />
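The two averages can be computed as sketched below (toy data; plWordNet's internal representation is not shown in the paper). Polysemy is the average number of synsets containing a given LU; synset size is the average number of LUs per synset:<br />

```python
from collections import defaultdict

synsets = {  # toy data: synset id -> its LUs
    1: ["zamek", "twierdza"],
    2: ["zamek"],            # "zamek" is polysemous (castle / lock)
    3: ["pies"],
}

# Invert the mapping: LU -> set of synsets it belongs to.
lu_to_synsets = defaultdict(set)
for sid, lus in synsets.items():
    for lu in lus:
        lu_to_synsets[lu].add(sid)

avg_polysemy = sum(len(s) for s in lu_to_synsets.values()) / len(lu_to_synsets)
avg_synset_size = sum(len(lus) for lus in synsets.values()) / len(synsets)
print(avg_polysemy)     # 4/3 on this toy data; 1.46 reported for plWordNet
print(avg_synset_size)  # 4/3 on this toy data; 2.04 reported for plWordNet
```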

Synsets to which a homonymous LU belongs<br />

1 2 3 4 5 6 7 8 9 ≥ 10 Avg WN<br />

All LUs [%] 73.45 16.10 5.98 2.32 1.02 0.58 0.30 0.09 0.05 0.12 1.47 –<br />

Nouns LUs [%] 74.11 16.35 6.00 2.19 0.77 0.38 0.16 0.03 – – 1.41 1.24<br />

Verbs LUs [%] 79.40 14.73 4.04 1.34 0.28 0.18 0.04 – – – 1.29 2.17<br />

Adj. LUs [%] 64.61 17.10 8.23 3.84 2.53 1.56 0.97 0.34 0.25 0.55 1.79 1.40<br />

Table 4. The level of LU polysemy in plWordNet, September 2007 (WN means the<br />

Princeton WordNet 3.0)<br />

4 Observations and future work<br />

Our work to date has taught us a few valuable lessons. Of much use, though less<br />

interest, is what we found about facilitating the linguists’ task. An important<br />

observation concerns the starting point of any properly conceived WordNet: it<br />

must be corpus-based. The core vocabulary should consist of words that are<br />

frequent in real-life text. We have learnt that, for that particular purpose, certain<br />

balance in the corpus is extremely important. In our case, slightly too much formal<br />

text resulted in a shortage of everyday vocabulary, such as names of edible<br />

plants and food in general, animals and so on, in exchange for a higher than<br />

average number of economic and legal terms.


Words and Concepts in the Construction of Polish WordNet 175<br />

LUs in a synset<br />

1 2 3 4 5 6 7 8 9 ≥ 10<br />

All synsets [%] 46.50 25.03 15.87 7.66 2.77 1.05 0.53 0.20 0.17 0.2<br />

Noun synsets [%] 65.93 19.45 7.92 3.83 1.38 0.63 0.36 0.13 0.17 0.19<br />

Verb synsets [%] 1.74 47.07 28.76 12.28 6.10 2.30 0.87 0.32 0.24 0.32<br />

Adj. synsets [%] 15.69 26.17 33.04 17.29 4.87 1.47 0.87 0.33 0.13 0.13<br />

Table 5. The number of LUs per synset in plWordNet<br />

Experiments with translating the Princeton WordNet indiscriminately clearly<br />

show that only the top levels of the hierarchy may carry over to other languages<br />

intact; this transfer probably works because the top of the hierarchy may well be universal.<br />

We must work out the lower level afresh, if we want a WordNet that represents<br />

the lexical system, or at least much of the lexical system, of the language at<br />

hand — see Section 2.<br />

Last but not least, we feel that for a WordNet to cover as much vocabulary<br />

of a given language as possible, it would need its own set of relations — many of<br />

them derivational in nature. This, however, would make it hard to use WordNets<br />

for multilingual NLP tasks, a most likely “killer app” of the near future. In<br />

the end, then, one ought to keep balance between too few but rather universal<br />

relations (such as antonymy or hypernymy) and too many, too detailed language-specific<br />

derivational relations. We believe that any criterion for choosing a useful<br />

set of relations should consider the feasibility of future NLP tasks and linguistic<br />

credibility.<br />

On the computing side of the plWordNet project, we see further fine-tuning<br />

of semantic similarity functions as a major task for the near future. Although<br />

the results thus far are very promising, too much noise can be observed in the<br />

data (about 50% — see Section 3.2). One cannot keep naive thresholds as a<br />

means of constraining the output of SSFs. We must first of all take a look at<br />

multi-word expressions. We have already developed language-specific methods<br />

of extracting Polish multi-word expressions from a corpus [21], but more work is<br />

necessary. We need to build more natural groupings of words based on SSFs. One<br />

approach that we will try is to use fuzzy clustering algorithms. The preliminary<br />

results are again promising. On the other hand, pattern-based methods are very<br />

accurate and have been widely used to extract relations for WordNet; an early<br />

example is [22]. We will try to combine pattern-based methods with clustering.<br />

One way to accomplish this is to do machine learning of patterns on the basis<br />

of statistical and cluster information provided by an SSF; it should at least be<br />

useful in disambiguating lexico-semantic relations from the output of an SSF, but<br />

it also might help build the WordNet up in a weakly supervised manner.
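A lexico-syntactic pattern of the kind pioneered in [22] can be sketched as follows; the actual patterns used for Polish are not given here, so this English "such as" pattern is purely illustrative:<br />

```python
import re

# A minimal Hearst-style matcher: "Xs such as A, B and C" yields
# (A, X), (B, X), (C, X) as hyponym/hypernym candidate pairs.
PATTERN = re.compile(r"(\w+)s such as ([\w, ]+)")

def hyponym_pairs(text):
    """Extract (hyponym, hypernym) candidates from one simple pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(r",\s*|\s+and\s+", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

print(hyponym_pairs("They sell fruits such as apples, pears and plums."))
# [('apples', 'fruit'), ('pears', 'fruit'), ('plums', 'fruit')]
```

A real system would of course run such patterns over lemmatized corpus text and filter the candidates statistically, as in [20].<br />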



References<br />

1. Miller, G.A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., Langone, H., Haskell,<br />

B.: WordNet — a lexical database for the English language. Homepage of the<br />

project (2007)<br />

2. Fellbaum, C., ed.: WordNet — An Electronic Lexical Database. The MIT Press<br />

(1998)<br />

3. Vossen, P.: EuroWordNet general document version 3. Technical report, University<br />

of Amsterdam (2002)<br />

4. Tufiş, D., Cristea, D., Stamou, S.: BalkaNet: Aims, methods, results and perspectives.<br />

a general overview. Romanian Journal of Information Science and Technology<br />

7(1–2) (2004) 9–43 Special Issue.<br />

5. Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version. Institute of Computer<br />

Science PAS (2004)<br />

6. Apresjan, J.D.: Semantyka leksykalna. Synonimiczne środki języka (Lexical semantics.<br />

The means of synonymy in language). Ossolineum Wrocław (2000)<br />

7. Pala, K., Smrž, P.: Building Czech Wordnet. Romanian Journal of Information<br />

Science and Technology 7(1–2) (2004) 79–88<br />

8. Dubisz, S., ed.: Uniwersalny słownik języka polskiego [Universal Dictionary of<br />

Polish Language], electronic version 0.1. PWN (2004)<br />

9. Wierzbicka, A.: Język–umysł–kultura. PWN (2000)<br />

10. Tengi, R.I.: Chapter 4 in [2], 105–127<br />

11. Louw, M.: Polaris User’s Guide: the EuroWordNet database editor. EuroWordNet<br />

(LE-4003), Deliverable D023D024. Technical report, Lernout & Hauspie, Antwerp,<br />

Belgium (1998)<br />

12. Horák, A., Smrž, P.: New features of wordnet editor VisDic. Romanian Journal of<br />

Information Science and Technology 7(1–2) (2004) 201–213<br />

13. Horák, A., Pala, K., Rambousek, A., Povolný, M.: DEBVisDic — first version of<br />

new client-server wordnet browsing and editing tool. In: Proceedings of the Third<br />

International WordNet Conference — <strong>GWC</strong> 2006, Masaryk University (2006) 325–<br />

328<br />

14. Piasecki, M., Koczan, P.: Environment supporting construction of the Polish Wordnet.<br />

In Vetulani, Z., ed.: Proceedings of the 3rd Language and Technology Conference,<br />

2007, Poznań. (2007) 519–523<br />

15. Harris, Z.S.: Mathematical Structures of Language. Interscience Publishers, New<br />

York (1968)<br />

16. Sahlgren, M.: The Word-Space Model. PhD thesis, Stockholm University (2006)<br />

17. Freitag, D., Blume, M., Byrnes, J., Chow, E., Kapadia, S., Rohwer, R., Wang, Z.:<br />

New experiments in distributional representations of synonymy. In: Proceedings<br />

of the Ninth Conference on Computational Natural Language Learning (CoNLL-<br />

2005), Ann Arbor, Michigan, Association for Computational Linguistics (2005)<br />

25–32<br />

18. Piasecki, M., Szpakowicz, S., Broda, B.: Extended similarity test for the evaluation<br />

of semantic similarity functions. In Vetulani, Z., ed.: Proceedings of the 3rd<br />

Language and Technology Conference, 2007, Poznań. (2007) 104–108<br />

19. Piasecki, M., Szpakowicz, S., Broda, B.: Automatic selection of heterogeneous<br />

syntactic features in semantic similarity of Polish nouns. In: Proceedings of the<br />

Text, Speech and Dialogue 2007 Conference. LNAI 4629, Springer (2007) 99–106



20. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically<br />

harvesting semantic relations. In: Proceedings of the 21st International Conference<br />

on Computational Linguistics and 44th Annual Meeting of the Association<br />

for Computational Linguistics, ACL (2006) 113–120<br />

21. Broda, B., Derwojedowa, M., Piasecki, M.: Recognition of structured collocations<br />

in an inflective language. In: Proceedings of the International Multiconference on<br />

Computer Science and Information Technology — 2nd International Symposium<br />

Advances in Artificial Intelligence and Applications (AAIA’07). (2007) 247–256<br />

22. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In:<br />

Proceedings of COLING-92, Nantes, France, The Association for Computer Linguistics<br />

(1992) 539–545<br />

23. Koeva, S., Mihov, S., Tinchev, T.: Bulgarian Wordnet – structure and validation.<br />

Romanian Journal of Information Science and Technology 7(1–2) (2004) 61–78<br />

24. Hamp, B., Feldweg, H.: GermaNet — a lexical-semantic net for German. In:<br />

Proceedings of ACL workshop Automatic Information Extraction and Building of<br />

Lexical Semantic Resources for NLP Applications, Madrid, ACL (1997) 9–15<br />

25. Derwojedowa, M., Piasecki, M., Szpakowicz, S., Zawisławska, M.: Polish wordnet<br />

on a shoestring. In: Biannual Conference of the Society for Computational<br />

Linguistics and Language Technology, Tübingen. (2007) 169–178<br />

26. Piasecki, M., team: Polish WordNet, the Web interface. (2007)


Exploring and Navigating: Tools for GermaNet<br />

Marc Finthammer and Irene Cramer<br />

Faculty of Cultural Studies, University of Dortmund, Germany<br />

marc.finthammer|irene.cramer@udo.edu<br />

Abstract. GermaNet is regarded as a valuable resource for German NLP applications,<br />

corpus research, and teaching. This demo presents three GUI-based<br />

tools meant to facilitate the exploration of and navigation through it.<br />

1 Motivation<br />

GermaNet [1], the German equivalent of WordNet [2], represents a valuable lexical-semantic<br />

resource for numerous German natural language processing (NLP) applications.<br />

However, in contrast to WordNet, only a few graphical user interface (GUI) based<br />

tools have been created up to now for the exploration of GermaNet. In principle, in order<br />

to get an idea of it, the user is left alone with a collection of XML files and<br />

insufficient means for navigation or exploration 1 . Various sub-tasks of our research in<br />

the DFG (German Research Foundation) funded project HyTex 2 , such as lexical chaining,<br />

highly rely on the semantic knowledge represented in GermaNet. While intensively<br />

working with it, we accumulated a list of properties a GermaNet GUI should<br />

feature and accordingly implemented the GermaNet Explorer. In addition, during the<br />

course of our research on lexical chaining for German corpora [3], we also investigated<br />

semantic relatedness and similarity measures based on GermaNet as a resource. The<br />

results of this work led us to the implementation of eight GermaNet-based relatedness<br />

measures, which we provide as Java TM API, the so-called GermaNet-Measure-API. In<br />

order to facilitate the use, we also developed a GUI for this API, the so-called GermaNet<br />

Pathfinder. We think that these three tools simplify the use of and work with<br />

GermaNet: they can be integrated into various NLP applications and can also be used<br />

as a resource for the visual exploration of and navigation through GermaNet. All three<br />

tools are freely available for download.<br />

1 As a matter of course, the NLP community working with German data is much smaller than the<br />

one working with English data; consequently, the development of tools for German resources,<br />

such as GermaNet, takes more time.<br />

2 The HyTex project aims at the development of text-to-hypertext conversion strategies based<br />

on text-grammatical features. Please, refer to our project web pages http://www.hytex.info/ for<br />

more information about our work.



2 GermaNet Explorer<br />

Many researchers working with GermaNet have the same experience: they lose their<br />

way in the rich, complex structure of its XML-representation. In order to solve this<br />

problem, we implemented the GermaNet Explorer, of which a screenshot is shown in<br />

Figure 1. Its most important features are: the word sense retrieval function (Figure 1,<br />

region 1) and the structured presentation of all semantic relations pointing to/from the<br />

synonym set (synset) containing the currently selected word sense (Figure 1, region 2).<br />

Fig. 1. Screenshot GermaNet Explorer<br />

In addition, the GermaNet Explorer offers a visual, graph-based navigation function.<br />

A synset (in Figure 2 [Rasen, Grünfläche] Engl. lawn) is displayed in the center<br />

of a navigation graph surrounded by its direct semantically related synsets, such as<br />

hypernyms (in Figure 2 [Nutzfläche, Grünland]) above the current synset, hyponyms<br />

(in Figure 2 [Kunstrasen, Kunststoffrasen] and [Grüngürtel])<br />

below, holonyms (in Figure 2 [Grünanlage, Gartenanlage, Eremitage]) to<br />

the left, and meronyms (in Figure 2 [Graspflanze, Gras]) to the right. In order to<br />

navigate the graph representation of GermaNet, one simply clicks on a related synset,<br />

in other words one of the rectangles surrounding the current synset shown in Figure 2.<br />

Subsequently, the visualization is refreshed: the selected synset moves into the center<br />

of the displayed graph and the semantically related synsets are updated accordingly.



Fig. 2. Screenshot GermaNet Explorer – Visual Graph Representation<br />

Fig. 3. Screenshot GermaNet Explorer – Representation of the List of All GermaNet Synsets



In addition, the GermaNet Explorer features a representation of all synsets, which<br />

is illustrated in Figure 3, region 1. It also provides retrieval, filter, and sort functions<br />

(Figure 3, region 2). Further, the GermaNet Explorer offers the same functions as<br />

shown in Figure 3 and a similar GUI for the list of all word senses. We found that these<br />

functions, both for the word senses and the synsets, provide a very detailed insight into<br />

the modeling and structure of GermaNet and thus helped us to understand its strengths<br />

and weaknesses.<br />

We were already able to successfully utilize the GermaNet Explorer in various areas<br />

of our research and teaching. For example, in experiments on the manual annotation of lexical<br />

chains in German corpora, our subjects used the GermaNet Explorer to find paths representing<br />

semantic relatedness between two words. This work is partially described in<br />

[4]. We also found it helpful for the visualization of lexical semantic concepts and thus<br />

for the training of our students in courses on e.g. semantics. We hence argue that the<br />

GermaNet Explorer represents a tool which is applicable in many scenarios.<br />

3 GermaNet Pathfinder and Measure-API<br />

Semantic relatedness measures express how strongly the meanings of two words are connected.<br />

This is essential information in various NLP applications and is extensively<br />

discussed in the literature, e.g. [5]. Many measures have already been investigated and<br />

implemented for the English WordNet; however, there are only a few publications addressing<br />

measures based on GermaNet, e.g. [6] as well as [3]. The calculation of semantic<br />

relatedness is a subtask of our research in HyTex; we therefore implemented<br />

eight GermaNet 3 and three Google TM 4 based measures. Because of the–compared to<br />

WordNet–different structure of GermaNet, it was necessary to re-implement and adapt<br />

algorithms discussed in the literature and in parts already available for WordNet. The<br />

GermaNet-Measure-API is implemented as a Java TM class library and consists of a hierarchically<br />

organized collection of measure classes, which provide methods to perform<br />

operations such as the calculation of specific relatedness values between words and<br />

synsets or the automated distance-to-relatedness conversion. In order to additionally<br />

facilitate the integration of these measures into user-defined applications and to allow<br />

the straightforward comparison and evaluation of the different measures, we also implemented<br />

a GUI, the GermaNet Pathfinder, shown in Figure 4. The most important<br />

features of these tools are: the calculation of the semantic relatedness between two<br />

words (or two synsets) with various adjustable parameter settings (Figure 4, region<br />

1), the easy-to-apply Java TM interface, which ensures the simple and fast integration of<br />

all measures into any application, and the visualization of the calculated relatedness<br />

3 For more information about the measures implemented as well as our research on lexical/thematic<br />

chaining and the performance of our GermaNet based lexical chainer, please refer to [3]<br />

in this volume.<br />

4 The three Google TM measures are based on co-occurrence counts and realize different algorithms<br />

to convert these counts into values representing semantic relatedness.
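The footnote above does not name the three conversion algorithms; one standard way of turning hit counts into a score, sketched below with invented counts, is the normalized Google distance of Cilibrasi and Vitányi (smaller values mean more related):<br />

```python
from math import log

# Normalized Google distance from hit counts: fx and fy are the counts
# of the two words, fxy their co-occurrence count, n the index size.
def ngd(fx, fy, fxy, n):
    """NGD; 0 means the words always co-occur, larger = less related."""
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# hypothetical hit counts for two words and their co-occurrence
score = ngd(fx=120_000, fy=80_000, fxy=15_000, n=10**10)
print(score)  # roughly 0.18 for these invented counts
```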



Fig. 4. Screenshot GermaNet Pathfinder<br />

as a path in GermaNet, which is shown in Figure 5. For a given word pair all possible<br />

readings, in other words all synsets, are considered to calculate the relatedness (or<br />

paths) with respect to GermaNet (Figure 4, region 2).<br />

We already successfully used the GermaNet-Measure-API in our lexical chainer,<br />

called GLexi. We also found the GermaNet Pathfinder very helpful to explore GermaNet<br />

and retrace semantically motivated paths, which is illustrated in Figure 5 as the<br />

(shortest) path between Blume (Engl. flower) and Baum (Engl. tree). This path consists<br />

of three steps (hypernymy – hyponymy – hyponymy) and traverses two synsets; it thus<br />

represents the kind of (indirect) semantic relation relevant in e.g. lexical chaining.<br />
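Path search of this kind can be sketched as a breadth-first search over the relation graph. The graph below is a hand-made toy fragment, not the real GermaNet hierarchy, and the distance-to-relatedness conversion is one simple choice among many:<br />

```python
from collections import deque

# Toy undirected graph; edges stand for hypernym/hyponym links.
neighbours = {
    "Blume": ["Pflanze"],
    "Pflanze": ["Blume", "Holzgewaechs"],
    "Holzgewaechs": ["Pflanze", "Baum"],
    "Baum": ["Holzgewaechs"],
}

def shortest_path(start, goal):
    """Breadth-first search for a shortest relation path between synsets."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path("Blume", "Baum")
print(path)  # a three-step path, as in the Blume-Baum example above
distance = len(path) - 1              # number of relation steps
relatedness = 1.0 / (1 + distance)    # one simple distance-to-relatedness conversion
```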

4 Open Issues and Future Work<br />

We have already used the GermaNet Explorer, GermaNet-Measure-API and Pathfinder<br />

in our research on thematic chaining, as a tool for the manual annotation of lexical<br />

chains and as a resource in our seminars. In our future work, we plan to further explore<br />

the possible fields of application, e.g. for training students and annotators. The<br />

research on relatedness measures both for GermaNet and WordNet among others [5]<br />

shows that the established algorithms are not yet able to satisfactorily represent the semantic<br />

relations between two words. In particular, human-judgement experiments show that<br />

the correlation between the relatedness measures and the intuition of subjects is much<br />

too low. We therefore plan to investigate alternative relatedness measures, which we also<br />



intend to integrate into the GermaNet Pathfinder. However, the usefulness of the GermaNet<br />

Explorer and Pathfinder is constrained by the coverage and modeling quality of<br />

the underlying semantic lexicon. Therefore, we also hope to hereby provide tools to<br />

see behind GermaNet’s curtain and to thus facilitate the user-centered work with this<br />

interesting and valuable resource.<br />

Fig. 5. Screenshot GermaNet Pathfinder – Illustration of a Shortest Path Between Blume and<br />

Baum<br />

References<br />

1. Lemnitzer, L., Kunze, C.: GermaNet – representation, visualization, application. In: Proc. of<br />

the Language Resources and Evaluation Conference (LREC2002). (2002)<br />

2. Fellbaum, C., ed.: WordNet. An Electronic Lexical Database. The MIT Press (1998)<br />

3. Cramer, I., Finthammer, M.: An evaluation procedure for wordnet-based lexical chaining:<br />

methods and issues. In: this volume. (to appear)<br />

4. Stührenberg, M., Goecke, D., Diewald, N., Mehler, A., Cramer, I.: Web-based annotation of<br />

anaphoric relations and lexical chains. In: Proc. of the Linguistic Annotation Workshop, ACL<br />

2007. (2007)



5. Budanitsky, A., Hirst, G.: Semantic distance in wordnet: An experimental, application-oriented<br />

evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources<br />

at NAACL-2000. (2001)<br />

6. Gurevych, I., Niederlich, H.: Computing semantic relatedness in German with revised information<br />

content metrics. In: Proc. of OntoLex 2005 - Ontologies and Lexical Resources,<br />

IJCNLP 05 Workshop. (2005)


Using Multilingual Resources<br />

for Building SloWNet Faster<br />

Darja Fišer<br />

Department of Translation, Faculty of Arts, University of Ljubljana<br />

Aškerčeva 2, 1000, Ljubljana, Slovenia<br />

darja.fiser@guest.arnes.si<br />

Abstract. This project report presents the results of an approach in which<br />

synsets for Slovene WordNet were induced automatically from parallel corpora<br />

and already existing WordNets. First, multilingual lexicons were obtained from<br />

word-aligned corpora and compared to the WordNets in various languages in<br />

order to disambiguate lexicon entries. Then appropriate synset ids were attached<br />

to Slovene entries from the lexicon. In the end, Slovene lexicon entries sharing<br />

the same synset id were organized into a synset. The results were evaluated<br />

against a gold standard and checked by hand.<br />

Keywords: multilingual lexica, parallel corpora, word senses, word-alignment.<br />

1 Introduction<br />

Automated approaches for WordNet construction, extension and enrichment all aim to<br />

facilitate faster, cheaper and easier development. But they vary according to the<br />

resources that are available for a particular language. These range from Princeton<br />

WordNet (PWN) [7], the backbone of a number of WordNets [16, 14], to machine<br />

readable bilingual and monolingual dictionaries which are used to disambiguate and<br />

structure the lexicon [11], and taxonomies and ontologies that usually provide a more<br />

detailed and formalized description of a domain [6].<br />

For the construction of Slovene WordNet we have leveraged the resources at our<br />

disposal, which are mainly corpora. Based on the assumption that the translation<br />

relation is a plausible source of semantics we have used multilingual parallel corpora<br />

to extract semantically relevant information. The idea that senses of ambiguous words<br />

in the source language (SL) are often translated into distinct words in the target language (TL), and that all SL words that are<br />

translated into the same TL word share some element of meaning has already been<br />

explored by e.g. [13] and [10]. Our work is also closely related to what has been<br />

reported by [1], [3] and [17].<br />

The paper is organized as follows: the methodology used in the experiment is<br />

explained in the next section. Sections 3 and 4 present and evaluate the results and the<br />

last section gives conclusions and work to be done in the future.



2 Methodology<br />

2.1 Parallel Corpora<br />

The experiment was conducted on two very different corpora, the MultextEast corpus<br />

[2] and the JRC-Acquis corpus [14].<br />

The former is relatively small (100,000 words per language) and it only contains a<br />

single text, the novel “1984” by George Orwell. Although the corpus consists of a single<br />

literary text, it is written in a plain, contemporary style and contains general<br />

vocabulary. But because it had already been sentence-aligned and tagged, as many as<br />

five languages could be used (English, Czech, Romanian, Bulgarian and Slovene).<br />

The latter, by contrast, contains EU legislation and is very domain-specific. It is<br />

also the biggest parallel corpus of its kind, covering 21 languages (about 10 million words per<br />

language). However, the JRC-Acquis is paragraph-aligned with HunAlign [18] but is<br />

not tagged, lemmatized, sentence- or word-aligned. This means that the pre-processing<br />

stage was a lot more demanding than with the 1984 corpus. We were therefore forced<br />

to initially limit the languages involved to English, Czech and Slovene with the aim of<br />

extending it to Bulgarian and Romanian as soon as tagging information becomes<br />

available for these languages.<br />

The English and Slovene parts of the JRC-Acquis corpus were first tokenized,<br />

tagged and lemmatised with totale [4], while the Czech part was kindly tagged for us with<br />

Ajka [14] by the team from the Faculty of Informatics at Masaryk University in<br />

Brno. We included the first 2000 documents from the corpus in the dataset and<br />

filtered out all function words.<br />

Both corpora were sentence- and word-aligned with Uplug [15] for which the<br />

slowest but best performing ‘advanced setting’ was used. It first creates basic clues<br />

for word alignments, then runs GIZA++ [13] with standard settings and aligns words<br />

with the existing clues. Alignments with the highest confidence measure are learned<br />

and the last two steps are repeated three times. The output of the alignment process is<br />

a file containing word links with information on word link certainty between the<br />

aligned pair of words and their unique ids.<br />

2.2 Extracting Translations of One-Word Literals<br />

Word-alignments were used to create bilingual lexicons. In order to reduce the noise<br />

in the lexicon as much as possible, only 1:1 links between words of the same part of<br />

speech were taken into account. All alignments occurring only once were discarded.<br />
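The filtering described above can be sketched as follows (the record format is invented; Uplug's real output is an XML file of word links):<br />

```python
from collections import Counter

# (source lemma, source POS, target lemma, target POS) per alignment link
links = [
    ("house", "N", "hisa", "N"),
    ("house", "N", "hisa", "N"),
    ("house", "N", "dom", "N"),      # occurs only once -> discarded
    ("free", "A", "svoboden", "A"),
    ("free", "A", "svoboden", "A"),
    ("run", "V", "hiter", "A"),      # POS mismatch -> discarded
    ("run", "V", "hiter", "A"),
]

counts = Counter(links)
# Keep links with matching POS that occur more than once.
lexicon = sorted({(src, tgt) for (src, src_pos, tgt, tgt_pos), c in counts.items()
                  if src_pos == tgt_pos and c > 1})
print(lexicon)  # [('free', 'svoboden'), ('house', 'hisa')]
```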

In this experiment, synonym identification and sense disambiguation were<br />

performed by observing semantic properties of words in several languages. This is<br />

why the information from bilingual word-alignments was combined into a<br />

multilingual lexicon. The lexicon is based on English lemmas and their word ids, and<br />

it contains all their translation variants found in other languages. The obtained<br />

multilingual lexicon was then compared to the already existing WordNets in the<br />

corresponding languages.



For English, PWN was used while for Czech, Romanian and Bulgarian WordNets<br />

from the BalkaNet project [16] were used. There were two reasons for using BalkaNet<br />

WordNets: (1) the languages included in the project correspond to the multilingual<br />

corpus we had available; and (2) the WordNets were developed in parallel, they cover<br />

a common sense inventory and are also aligned to one another as well as to PWN,<br />

making the intersection easier.<br />

If a match was found between a lexicon entry and a literal of the same part of<br />

speech in the corresponding WordNet, the synset id was remembered for that<br />

language. If after examining all the existing WordNets there was an overlap of synset<br />

ids across all the languages for the same lexicon entry, it was assumed that the words<br />

in question all describe the concept marked with this id. Finally, the concept was<br />

extended to the Slovene part of the multilingual lexicon entry and the synset id<br />

common to all the languages was assigned to it. All the Slovene words sharing the<br />

same synset id were treated as synonyms and were grouped into synsets.<br />
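The id-intersection step can be sketched like this (toy entries and synset ids; the aligned WordNets are, of course, much larger):<br />

```python
# One multilingual lexicon entry; the words and synset ids are invented.
entry = {"en": "castle", "cs": "hrad", "bg": "zamak", "sl": "grad"}

wordnets = {  # language -> word -> set of synset ids it belongs to
    "en": {"castle": {"ENG-001", "ENG-007"}},  # polysemous in English
    "cs": {"hrad": {"ENG-001"}},
    "bg": {"zamak": {"ENG-001"}},
}

# The Slovene word inherits a synset id only if that id is shared by its
# translations in every language consulted.
common = set.intersection(*(wordnets[lang].get(entry[lang], set())
                            for lang in wordnets))
print(common)  # the ids assigned to the Slovene word 'grad'
```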

2.3 Extracting Translations of Multi-Word Literals<br />

The automatic word-alignment used in this experiment only provides links between<br />

individual words, not phrases. However, simply ignoring all the expressions that<br />

extend beyond word boundaries would be a serious limitation of the proposed<br />

approach, especially because so much energy has been invested in the preparation of<br />

the resources. The second part of the experiment is therefore dedicated to harvesting<br />

multi-word expressions from parallel corpora.<br />

The starting point was a list of multi-word literals we extracted from PWN. It<br />

contains almost 67,000 unique expressions. A great majority of those (almost 61,000)<br />

are from nominal synsets. Another interesting observation is that most of the<br />

expressions (more than 60,000) appear in only one synset and are therefore<br />

monosemous. Again, most nouns are monosemous (almost 57,000) and there are only<br />

about 150 nouns that have more than three senses. The highest number of senses for<br />

nouns is 6, much lower than for verbs which can have up to 19 senses. We therefore<br />

concluded that sense disambiguation of multi-word expressions will not be a serious<br />

problem, and limited the approach only to English and Slovene. Bearing in mind the<br />

differences between the two languages, we also assumed that we would not be very<br />

successful in finding accurate translations of e.g. phrasal verbs automatically, which<br />

is why we decided to first look for two- and three- word nominal expressions only.<br />

First, the Orwell corpus was searched for the nominal multi-word expressions from<br />

the list. If an expression was found, the id and part of speech for each constituent<br />

word was remembered. This information was then used to look for possible Slovene<br />

translations of each constituent word in the file with word alignments. In order to<br />

increase the accuracy of the target multi-word expressions, translation candidates had<br />

to meet several constraints:



(1) a Det-Noun phrase could only be translated by a single Noun (example: ‘a<br />

people’ – ‘narod’);<br />

(2) a Det-Adj phrase could only be translated by a single Noun or by a single<br />

Adj_Pl (example: ‘the young’ – ‘mladina’ or ‘mladi’);<br />

(3) an (Adj-)Adj-Noun phrase could only be translated by an (Adj-)Adj-Noun<br />

phrase (example: ‘blind spot’ – ‘slepa pega’);<br />

(4) a (Adj-)Noun-Noun phrase could be translated either by an (Adj-)Adj-Noun or<br />

by a Noun-Noun_gen phrase (examples: ‘swing door’ – ‘nihajna vrata [Adj-<br />

N]’, ‘death rate’ – ‘stopnja umrljivosti [N-N_gen]’, exceptions: ‘cloth cap’<br />

which is translated into Slovene as ‘pokrivalo iz blaga [a cap made of cloth]’,<br />

‘chestnut tree’ – ‘kostanj’);<br />

(5) a Noun-Prep-Noun phrase could be translated by a Noun-Noun_gen or by an<br />

Adj-Noun phrase (examples: ‘loaf of bread’ – ‘hlebec kruha’, ‘state of war’ –<br />

‘vojno stanje’, exception: ‘Republic of Slovenia’ – ‘Republika Slovenija[N-<br />

N_nom]’);<br />

(6) a Noun-Noun-Noun phrase could only be translated by a Noun-Noun_gen-<br />

Noun_gen phrase (example: ‘infant mortality rate’ – ‘stopnja umrljivost<br />

otrok’, exception: ‘corn gluten feed’ – ‘krma iz koruznega glutena [feed made<br />

of corn gluten]’).<br />
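The constraints (1)–(6) above amount to a small pattern table mapping source POS patterns to licensed target patterns; the tag names below are invented for illustration, as the paper does not give a tagset:<br />

```python
# Source POS pattern -> list of licensed target POS patterns.
ALLOWED = {
    ("Det", "Noun"):          [("Noun",)],
    ("Det", "Adj"):           [("Noun",), ("Adj_Pl",)],
    ("Adj", "Noun"):          [("Adj", "Noun")],
    ("Noun", "Noun"):         [("Adj", "Noun"), ("Noun", "Noun_gen")],
    ("Noun", "Prep", "Noun"): [("Noun", "Noun_gen"), ("Adj", "Noun")],
    ("Noun", "Noun", "Noun"): [("Noun", "Noun_gen", "Noun_gen")],
}

def acceptable(source_pos, target_pos):
    """Is this target POS pattern a licensed translation of the source?"""
    return tuple(target_pos) in ALLOWED.get(tuple(source_pos), [])

print(acceptable(["Noun", "Prep", "Noun"], ["Adj", "Noun"]))  # 'state of war' case
print(acceptable(["Det", "Noun"], ["Adj", "Noun"]))           # not licensed
```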

Because word-alignment is far from perfect, alignment errors were avoided by<br />

checking whether translation candidates actually appear as a phrase in the<br />

corresponding sentence in the corpus. If a translation was not found for all the parts of<br />

the multi-word expression in the file with alignments, an attempt was made to recover<br />

the missing translations by first locating the known translated word in the corpus and<br />

then using the above-mentioned criteria to guess the missing word from the context.<br />

In the end, the canonical word forms for the successfully translated expressions were extracted<br />

from the corpus and all phrases sharing the same synset id were joined into a single<br />

synset.<br />
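The final joining step can be sketched as a simple grouping of translated phrases by synset id. The synset ids and the second phrase for n-1 below are invented examples, used only to show the grouping.

```python
from collections import defaultdict

def build_synsets(translations):
    """Group (synset_id, canonical_phrase) pairs into {synset_id: phrases}."""
    synsets = defaultdict(set)
    for synset_id, phrase in translations:
        synsets[synset_id].add(phrase)
    return dict(synsets)
```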

3 Results<br />

3.1 Word-Based Approach<br />

The first version of the Slovene WordNet (SLOWN0) was created by translating<br />

Serbian synsets [12] into Slovene with a Serbian-Slovene dictionary [5]. The main<br />

disadvantage of that approach was the inadequate disambiguation of polysemous<br />

words, therefore requiring extensive manual editing of the results. In the current<br />

approach we tried to use multilingual information to improve the disambiguation<br />

stage and generate more accurate synsets.<br />

In the experiment with the Orwell corpus, four different settings were tested, each<br />

of them adding one more language [8]. Table 1 shows the number of nominal one-word<br />

synsets generated from the Orwell corpus, depending on the number of languages<br />

involved. Recall drops significantly when a new language is added. On the other<br />

hand, the average number of literals per synset is not affected.
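The underlying disambiguation idea, namely that a sense survives only if it is supported by the aligned translations in every language considered, can be sketched as a set intersection (the sense ids below are invented). Each added language can only shrink the surviving set, which is consistent with the recall drop reported above.

```python
def intersect_senses(candidates_per_language):
    """Keep only the sense ids supported by every language's alignment.

    candidates_per_language: one set of candidate sense ids per language.
    """
    result = set(candidates_per_language[0])
    for senses in candidates_per_language[1:]:
        result &= set(senses)
    return result
```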


Using Multilingual Resources for Building SloWNet Faster 189<br />

The same approach was also tested on the JRC-Acquis corpus, which is from an<br />

entirely different domain and is much larger [9]. It is interesting to observe the change<br />

in synset coverage and quality resulting from the different dataset.<br />

However, because the corpus is not annotated with the linguistic information<br />

needed in this experiment, we could only implement the approach on English, Czech<br />

and Slovene in this setting. Note that although the corpus used was much larger, the<br />

number of the generated synsets is only slightly higher. This could be explained by<br />

the high degree of repetition and domain-specificity of texts from the dataset.<br />

Table 1. Nominal synsets generated by leveraging existing multi-lingual resources (one-word<br />

literals only).<br />

          SLOWN0  SLOWN1  SLOWN2  SLOWN3  SLOWN4  SLOWNJRC<br />

nouns      3,210   2,964     870     671     291     3,528<br />

max l/s       40      10       7       6       4         9<br />

avg l/s      4.8  1.4362     1.4     1.4     1.7       2.6<br />

3.2 Phrase-Based Approach<br />

Nominal multi-word literals were extracted from PWN and then translated into<br />

Slovene based on word-alignments. In order to avoid alignment errors, some<br />

restrictions on the translation patterns were introduced and phrase candidates were<br />

checked in the Slovene corpus as well. This simple approach to match phrases in<br />

word-aligned parallel corpora yielded more synsets than was initially expected. If it<br />

was extended to other patterns, even more multi-word literals could be obtained.<br />

Another approach would be to use statistical co-occurrence measures to check the<br />

validity of more elusive patterns.<br />

Table 2. Nominal synsets generated from parallel corpora<br />

(two-word and three-word literals only).<br />

                    ORWELL         JRC<br />

mwe’s found            163       5,652<br />

mwe’s translated  121 (73%)  1,984 (34%)<br />

max l/s                  4           2<br />

avg l/s               1.29        1.13



4 Evaluation<br />

4.1 Synset Quality<br />

Automatic evaluation was performed against a manually created gold standard. Its<br />

literals were compared to literals in the automatically induced WordNets with regard<br />

to which synsets they appear in. This information was used to calculate precision,<br />

recall and f-measure.<br />

Precision gives the proportion of retrieved synset ids for a literal that are also relevant, out of all<br />

synset ids retrieved for that literal. Recall is the proportion of relevant synset ids retrieved for a<br />

literal out of all relevant synset ids available for that literal. Finally, precision and<br />

recall were combined in the traditional f-measure: (2 * P * R) / (P + R). This seems a<br />

fairer alternative to simply evaluating synsets because of the restricted input<br />

vocabulary.<br />
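As a sketch, the per-literal measures can be computed over sets of synset ids as follows. This mirrors the definitions above, not the actual evaluation script, and the ids are invented.

```python
def prf(retrieved, relevant):
    """Precision, recall and F-measure over synset-id sets for one literal."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)          # retrieved AND relevant ids
    p = correct / len(retrieved) if retrieved else 0.0
    r = correct / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0    # traditional f-measure
    return p, r, f
```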

                 2 lang   3 lang   4 lang   5 lang<br />

precision total  62.22%   69.80%   74.04%   77.37%<br />

recall total     82.24%   77.27%   75.13%   75.88%<br />

f-1 total        70.84%   73.19%   74.53%   76.62%<br />

Fig. 1. A comparison of precision, recall and f-measure for nominal synsets according to the<br />

number of languages used in the disambiguation stage of automatic synset induction from the<br />

Orwell corpus.<br />

Figure 1 shows the drop in recall and increase in precision and f-measure each time<br />

a new language is added to the disambiguation stage, peaking at 77.37% for precision,<br />

75.88% for recall and 76.62% for f-measure. The results for the JRC-Acquis corpus<br />

are worse due to fewer languages involved and less accurate word-alignment<br />

(precision: 67.0%, recall: 72.0% and f-measure: 69.4%).



4.2 Multi-Word Expressions<br />

Because there was virtually no overlap between the gold standard and the synsets<br />

containing automatically translated multi-word expressions, all the synsets obtained<br />

from the Orwell corpus were checked by hand. As can be seen in Table 3, about a third<br />

of the generated literals were completely wrong.<br />

The errors were analyzed and grouped into categories. Most errors (17 synsets)<br />

occurred because an English multi-word expression should be translated into Slovene<br />

with a single word (e.g. ‘top hat’ – ‘cilinder’). The next category (12 synsets)<br />

contains alignment errors in which one of the constituent words is mistranslated or a<br />

translation is missing (e.g. ‘mortality rate’ – ‘umrljivost otrok’, should be ‘stopnja<br />

smrtnosti’). In the next category there are 8 expressions that have been translated<br />

correctly but cannot be included in the synset because the senses of the translation<br />

and the original synset are not the same (e.g. ‘white knight’ – ‘beli tekač’ as in chess,<br />

should be ‘beli vitez’ as in business takeovers). And finally, there are 12 borderline<br />

cases that contain a correct translation but also an error (e.g. ‘black hole’ – ‘črna<br />

odprtina[wrong]’ and ‘črna luknja[correct]’).<br />

Table 3. Manual evaluation of multi-word expressions obtained from the Orwell corpus.<br />

ORWELL<br />

completely wrong 39 (32%)<br />

contain some errors 12 (10%)<br />

fully correct 70 (58%)<br />

total no. of synsets 121<br />

A larger-scale evaluation of multi-word expressions harvested from the JRC-<br />

Acquis has not been carried out but is planned for the near future. A quick overview<br />

of the results suggests that the quality of the generated synsets is comparable to the<br />

ones obtained from the Orwell corpus.<br />

5 Conclusions<br />

In this paper we have presented an approach to automatically generate WordNet<br />

synsets from two parallel corpora. The method works best on nouns that are<br />

disambiguated against several languages. The limitation of the word-alignment based<br />

approach was successfully overcome by using the alignment information to form<br />

multi-word expressions.<br />

However, the issue of adding multi-word units to WordNet is far from exhausted.<br />

More sophisticated statistics-based methods could be used to find more reliable<br />

translations of multi-word units. Another possibility to get even more added value<br />

from parallel corpora would be an attempt to identify (domain-specific) multi-word<br />

expressions that are not part of PWN and add them to Slovene WordNet.



Acknowledgements<br />

I would like to thank Aleš Horák from the Faculty of Informatics, Brno Masaryk<br />

University, for POS-tagging and lemmatizing the Czech part of the JRC-Acquis<br />

corpus.<br />

References<br />

1. Diab, M.: The Feasibility of Bootstrapping an Arabic WordNet leveraging Parallel Corpora<br />

and an English WordNet. In: Proceedings of the Arabic Language Technologies and<br />

Resources. NEMLAR, Cairo (2004)<br />

2. Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H., Petkevic V., Tufis, D.: Multext-East: Parallel<br />

and Comparable Corpora for Six Central and Eastern European Languages. In: Proceedings<br />

of ACL/COLING98, pp. 315–19. Montreal, Canada (1998)<br />

3. Dyvik, H.: Translations as semantic mirrors: from parallel corpus to wordnet. Revised<br />

version of paper presented at the ICAME 2002 Conference in Gothenburg. (2002)<br />

4. Erjavec, T., Ignat, C., Pouliquen, B., Steinberger, R.: Massive multilingual corpus<br />

compilation: ACQUIS Communautaire and totale. In: Proceedings of the Second Language<br />

Technology Conference. Poznan, Poland (2005)<br />

5. Erjavec, T., Fišer, D.: Building Slovene WordNet. In: Proceedings of the 5th International<br />

Conference on Language Resources and Evaluation LREC'06. 24-26th May 2006, Genoa,<br />

Italy (2006)<br />

6. Farreres, X., Gibert, K., Rodriguez, H.: Towards Binding Spanish Senses to WordNet Senses<br />

through Taxonomy Alignment. In: Proceedings of the Second Global WordNet Conference,<br />

Brno, Czech Republic, January 20-23, 2004, pp. 259–264 (2004)<br />

7. Fellbaum, C. (ed.): WordNet. An Electronic Lexical Database. MIT Press, Cambridge,<br />

Massachusetts (1998)<br />

8. Fišer, D.: Leveraging parallel corpora and existing wordnets for automatic construction of<br />

the Slovene wordnet. In: Proceedings of the 3rd Language and Technology Conference<br />

L&TC'07, 5-7 October 2007. Poznan, Poland (2007a)<br />

9. Fišer, D.: A multilingual approach to building Slovene WordNet. In: Proceedings of the<br />

workshop on A Common Natural Language Processing Paradigm for Balkan Languages<br />

held within the Recent Advances in Natural Language Processing Conference RANLP'07.<br />

26 September 2007, Borovets, Bulgaria (2007b)<br />

10. Ide, N.; Erjavec, T.; Tufis, D.: Sense Discrimination with Parallel Corpora. In: Proceedings<br />

of ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future<br />

Directions, pp. 54–60. Philadelphia (2002)<br />

11. Knight, K., Luk. S.: Building a Large-Scale Knowledge Base for Machine Translation. In:<br />

Proceedings of the American Association of Artificial Intelligence AAAI-94. Seattle, WA.<br />

(1994)<br />

12. Krstev, C., Pavlović-Lažetić, G., Vitas, D., Obradović, I.: Using textual resources in<br />

developing Serbian wordnet. J. Romanian Journal of Information Science and Technology<br />

7(1-2), 147–161 (2004)<br />

13. Och, F. J., Ney, H.: A Systematic Comparison of Various Statistical Alignment Models. J.<br />

Computational Linguistics 29(1), 19–51 (2003)<br />

14. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual<br />

database. In: Proceedings of the First International Conference on Global WordNet, Mysore,<br />

India, January 21-25, 2002 (2002)



15. Resnik, Ph., Yarowsky, D.: A perspective on word sense disambiguation methods and their<br />

evaluation. In: ACL-SIGLEX Workshop Tagging Text with Lexical Semantics: Why, What,<br />

and How? April 4-5, 1997, Washington, D.C., pp. 79–86 (1997)<br />

16. Sedlacek, R., Smrz, P.: A New Czech Morphological Analyser ajka. In: Proceedings of the<br />

4th International Conference, Text, Speech and Dialogue. Zelezna Ruda, Czech Republic<br />

(2001)<br />

17. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D.: The<br />

JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of<br />

the 5th International Conference on Language Resources and Evaluation. Genoa, Italy, 24-<br />

26 May 2006 (2006)<br />

18. Tiedemann, J.: Recycling Translations - Extraction of Lexical Data from Parallel Corpora<br />

and their Application in Natural Language Processing. Doctoral Thesis, Studia Linguistica<br />

Upsaliensia 1 (2003)<br />

19. Tufis, D., Cristea, D., Stamou, S.: BalkaNet: Aims, Methods, Results and Perspectives. A<br />

General Overview. In: Dascalu, Dan (ed.): Romanian Journal of Information Science and<br />

Technology Special Issue. 7(1-2), 9–43 (2000)<br />

20. van der Plas, L., Tiedemann, J.: Finding Synonyms Using Automatic Word Alignment and<br />

Measures of Distributional Similarity. In: Proceedings of ACL/COLING 2006 (2006)<br />

21. Varga, D., Halacsy, P., Kornai, A., Nagy, V., Nemeth, L., Tron, V.: Parallel corpora for<br />

medium density languages. In: Proceedings of RANLP’2005, pp. 590–596. Borovets,<br />

Bulgaria (2005)


The Global WordNet Grid Software Design<br />

Aleš Horák, Karel Pala, and Adam Rambousek<br />

Faculty of Informatics<br />

Masaryk University<br />

Botanická 68a, 602 00 Brno<br />

Czech Republic<br />

{hales,pala,xrambous}@fi.muni.cz<br />

Abstract. In this paper we show how the Global WordNet Grid software is<br />

designed. The goal of the Grid is to provide a free network of WordNets<br />

linked together through interlingual indexes. The Masaryk University NLP<br />

Centre has taken on the preparation of the Grid and the design of its<br />

software background. All participating WordNets will be encapsulated by a<br />

DEB (Dictionary Editor and Browser) server established for this purpose.<br />

The following text presents design details of the new DEBGrid application,<br />

which offers three types of public and authenticated user access to the<br />

Grid WordNet data.<br />

Key words: WordNet; DEB platform; DEBVisDic; Global WordNet<br />

Grid<br />

1 Introduction<br />

In June 2000, the Global WordNet Association (GWA [1]) was established by<br />

Piek Vossen and Christiane Fellbaum. The purpose of this association is to “provide<br />

a platform for discussing, sharing and connecting WordNets for all languages<br />

in the world.” One of the most important actions of GWA is the Global Word-<br />

Net Conference (<strong>GWC</strong>) that is being held every two years on different places<br />

all over the world. The second <strong>GWC</strong> was organized by the MU NLP Centre in<br />

Brno and the NLP Centre members are actively participating in GWA plans and<br />

activities. A new idea that was born during the third <strong>GWC</strong> in Korea is called the<br />

Global WordNet Grid with the purpose of providing a free network of smaller<br />

(at the beginning) WordNets linked together through ILI. The Grid preparation<br />

is currently just starting and the MU NLP Centre is going to secure its software<br />

background.<br />

The idea of connecting WordNets was first suggested during the Balkanet<br />

project (2001–2004 [2]), in which the Patras team developed the core of the WordNet<br />

Management System designed to link all the WordNets developed in the course<br />

of the project (Deliverable 9.1.04, September 2004).



It was tested successfully on the Greek and Czech WordNets. However, the<br />

Patras team did not proceed with it, and the system remained only a partial<br />

research result that was not pursued further. Before the end of the Balkanet<br />

project, the Czech team decided to re-implement the local version of the VisDic<br />

browser and editor using a client/server architecture. This was the origin of<br />

the DEBVisDic tool, which was fully implemented only after the Balkanet<br />

project finished. A fully operational version of DEBVisDic was presented at the 3rd Global<br />

WordNet Conference 2006 in Korea [3]. In our view this client/server tool will<br />

become the software background for the Grid preparation mentioned above (see<br />

Section 3.2 below).<br />

2 The Global WordNet Grid<br />

Since the release of the first publicly available WordNet, the Princeton WordNet [4], more than<br />

fifty national WordNets have been developed all over the world. However, the<br />

availability of the WordNets is limited – that is why the idea of a completely<br />

free Global WordNet Grid has appeared.<br />

It is a known fact that, for instance, the results of the EuroWordNet are not<br />

freely accessible though the participants of the project have developed (and are<br />

developing) more complete and larger WordNets for the individual languages.<br />

Practically the same can be said also about the results of the Balkanet project.<br />

If one wants to exploit WordNets for different languages it is always necessary<br />

to get in touch with the developers and ask them for the permission to use the<br />

WordNet data.<br />

Another reason for building a completely free Global WordNet<br />

Grid is that the particular WordNets can be linked to selected<br />

ontologies (e.g. SUMO/MILO) and domains. This has already taken place with the<br />

WordNets developed in the Balkanet project. The links to the ontologies should<br />

be provided for all WordNets included in the Global WordNet Grid.<br />

The Grid also provides a common core of 4,689 synsets serving as a shared<br />

set of concepts for all the Grid’s languages. These synsets are selected from the<br />

EuroWordNet Common Base Concepts used in many WordNet projects.<br />

3 DEBGrid – the DEB Application for the Global<br />

WordNet Grid<br />

The DEBGrid application will be built on top of the DEBVisDic application, with the<br />

DEB server set up either at the NLP Centre of Masaryk University in Brno or<br />

by the Global WordNet Association. The DEB platform provides<br />

an important foundation for the universal features of the WordNet Grid.<br />



3.1 The DEB Architecture<br />

The Dictionary Editor and Browser (DEB) platform [3, 5, 6] has been developed<br />

as a general framework for the fast development of a wide range of dictionary writing<br />

applications. The DEB platform provides several very important foundations<br />

that are common to most of the intended dictionary systems. These foundational<br />

features include:<br />

– a strict separation of the client and server parts in the application design.<br />

The server part provides all the necessary data manipulation functions like<br />

data storage and retrieval, data indexing and querying, but also various kinds<br />

of data presentations using templates. In DEB, the dictionary entries are<br />

stored using a common XML format, which makes it possible to design and implement<br />

dictionaries and lexicons of all types (monolingual, translational, thesauri,<br />

ontologies, encyclopaedias etc.). The client part of the application concentrates<br />

on the user interaction with the server part; it does not perform any<br />

complicated data manipulation. The client and server parts communicate by<br />

means of the standard HTTP (or secured HTTPs) protocol.<br />

– a common administrative interface that makes it possible to manage user accounts, including<br />

user access rights to particular dictionaries and services, dictionary<br />

schema definitions, entry locking administration or entry templates definitions.<br />

– XML database backend for the actual dictionary data storage. Currently, we<br />

are working with the Oracle Berkeley DB XML [7, 8] database, which provides<br />

a flexible XML database with standard XPath and XQuery interfaces.<br />

The DB XML database is well suited for processing complicated XML structures,<br />

however, we (and according to private discussions other DB XML users<br />

as well) have encountered efficiency problems when processing certain kinds<br />

of queries that result in large lists of answers. Simple processing of the data<br />

(like export or import of the whole dictionary) is not a problem as the whole<br />

English WordNet export (over 100,000 entries) takes less than 1 minute, but<br />

searching for values of specific subtags can take several seconds in such a large<br />

dictionary even when indexes are used. We are currently working on several<br />

solutions for this, which include link caching, specific DB XML indexing and<br />

also trying a completely different database backend. The key advantage for<br />

all the DEB applications is that a replacement of the DB XML backend with<br />

another database will be a completely transparent process which does not<br />

need any change in the applications themselves.<br />
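As an illustration of the kind of subtag lookup mentioned above, a synset stored as XML can be queried for its literals. The toy element names below are assumptions for illustration, not the actual DEB schema, and the sketch uses a plain XML parser rather than the DB XML backend.

```python
import xml.etree.ElementTree as ET

# Invented, simplified synset record; real DEB entries are richer.
SYNSET_XML = """
<SYNSET>
  <ID>ENG20-02001223-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL sense="1">dog</LITERAL>
    <LITERAL sense="1">domestic dog</LITERAL>
  </SYNONYM>
</SYNSET>
"""

def literals(xml_text):
    """Return the literal strings of a synset stored as XML."""
    root = ET.fromstring(xml_text)
    return [lit.text for lit in root.findall("./SYNONYM/LITERAL")]
```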

Based on these common features, several widely used dictionary<br />

applications have been implemented, including the well-known WordNet editor<br />

DEBVisDic, which has recently been used in the development of several national WordNets<br />

(Czech, Polish, Hungarian and South African languages). With this evidence,<br />

we believe that DEB is the right concept for the Global WordNet Grid<br />

data provision.



3.2 The DEBGrid Design and Implementation<br />

In the DEB platform environment, all the WordNets are usually stored on a single<br />

DEBVisDic server. In the Grid, most of the WordNets will also be stored in<br />

this way. However, since the Grid may eventually be composed of a large number of<br />

WordNet dictionaries developed by different organizations, this solution may not<br />

always be the best option (for example, because of licensing issues). Thanks to<br />

the client-server nature of the DEB platform, DEBGrid can offer three possible<br />

ways of encapsulating WordNets in the server:<br />

– a WordNet can be physically stored on the central server. This is the traditional<br />

DEBVisDic setup and offers the best performance.<br />

– a WordNet can be stored on a DEBVisDic server located at the WordNet<br />

owner’s institution. All servers in the Grid can then communicate with each<br />

other (depending on the server setup). The Central Grid server for this Word-<br />

Net has only the knowledge of which server to contact, instead of having<br />

the full WordNet database stored locally, and all queries are dynamically<br />

resolved over the Internet. This option may be slower as it depends on the<br />

quality of connection to different servers and their performance. On the other<br />

hand, the WordNet owner has full control over the displayed data and access<br />

permissions.<br />

– a mixed solution – some WordNets are stored on central server and some<br />

are stored on their respective owners’ servers. This is just an extension of<br />

the previous option. Again, the performance of the whole Grid depends on<br />

the performance of single servers, but the speed can be improved if the most<br />

used WordNets are stored on the central server.<br />
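A central Grid server following these options might route queries roughly as sketched below. The registry contents, the URL, and both backends are invented for illustration; they are not the DEBGrid implementation.

```python
# Hypothetical registry: where each WordNet lives in the Grid.
REGISTRY = {
    "wn-cze": {"where": "local"},
    "wn-dut": {"where": "remote", "url": "https://wordnet.example.org/deb"},
}

def resolve(wordnet, synset_id, local_store, fetch_remote):
    """Answer from the local store, or delegate to the owner's DEB server."""
    entry = REGISTRY[wordnet]
    if entry["where"] == "local":
        return local_store[(wordnet, synset_id)]
    # Remote WordNets are resolved dynamically over the Internet.
    return fetch_remote(entry["url"], wordnet, synset_id)
```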

The DEB framework provides several possibilities of working with the WordNet<br />

data. All the types of the Grid access undergo the same control of service and<br />

user management with the option to provide information for public (anonymous)<br />

access as well as authenticated access for registered users.<br />

Basically, each WordNet in the Grid can be presented to the Grid users in<br />

one of the following forms:<br />

a) by means of a simple purely HTML interface working in any web browser.<br />

This interface is able to display one WordNet dictionary or the same synset in<br />

several WordNets. Synsets are displayed using XSLT templates – the server<br />

can provide several views of the synset data, ranging from a terse view up<br />

to a detailed view. The view can be even different for each dictionary. An<br />

example of such presentation of one synset in three WordNets is displayed<br />

in Figure 1. This type of WordNet view is probably the best for public<br />

anonymous access to the Grid, since it does not need any installation of user<br />

software or packages.<br />

b) using the full DEBVisDic application. This application needs to be installed<br />

as an extension of the freely available Firefox web browser, but it offers



Fig. 1. The web interface of DEBGrid with three interlinked WordNets.<br />

much more complex functionality than the web access. Each WordNet is opened<br />

in its own window which offers several views of the WordNet data (a textual<br />

preview, hypero/hyponymic tree structures, user query lists or XML) and<br />

also the possibility to edit the data (for users with the write permissions).<br />

With this type of the Grid access, the user would have the most advanced<br />

environment for working with the Grid WordNets.<br />

c) by means of a defined interface of the DEBVisDic server. This way any<br />

external application may query the server and receive WordNet entries (in<br />

XML or other form) for subsequent processing. In this way, local external<br />

applications can easily process the Grid data in standard formats.<br />

In all cases, users (or external applications) could authenticate with a login and<br />

password over secure HTTP connection. Each user can be given a read-only or<br />

read-write access to particular WordNets.<br />

For some applications it is useful to have a visualization tool that makes it possible to<br />

view synsets and their links as graphs. Such a tool, called Visual Browser [9], is under<br />

development at the MU NLP Centre. Its important feature is the ability to<br />

process WordNet synsets from a DEB server storage and convert them into the<br />

RDF notation for visualization. Visual Browser is also suitable for representing<br />

ontologies that can and will be integrated within Global WordNet Grid.



4 Conclusions<br />

In this article, we have presented a report of the design and implementation of<br />

the Global WordNet Grid software background. The basic idea of the WordNet<br />

Grid introduced by P. Vossen, Ch. Fellbaum and A. Pease at <strong>GWC</strong> 2006 includes<br />

establishing an interlinked network of national WordNets connected by means<br />

of the interlingual indexes. In the starting phase the Grid contains only a subset<br />

of the EuroWordNet Base Concepts with nearly 5,000 synsets.<br />

The management and intelligent processing of the included WordNets is<br />

driven by the DEB development platform tool called DEBGrid. This tool is<br />

built on top of the DEBVisDic WordNet editor and thus provides a versatile environment<br />

for working with a large number of WordNets in one place and style.<br />

Acknowledgements<br />

This work has been partly supported by the Academy of Sciences of the Czech<br />

Republic under the project T100300419, by the Ministry of Education of CR in<br />

the National Research Programme II project 2C06009 and by the Czech Science<br />

Foundation under the project 201/05/2781.<br />

References<br />

1. The Global WordNet Association. (2007) http://www.globalwordnet.org/.<br />

2. Balkanet project website, http://www.ceid.upatras.gr/Balkanet/. (2002)<br />

3. Horák, A., Pala, K., Rambousek, A., Povolný, M.: First version of new client-server<br />

WordNet browsing and editing tool. In: Proceedings of the Third International<br />

WordNet Conference - <strong>GWC</strong> 2006, Jeju, South Korea, Masaryk University, Brno<br />

(2006) 325–328<br />

4. Miller, G.: Five Papers on WordNet. International Journal of Lexicography 3(4)<br />

(1990) Special Issue.<br />

5. Horák, A., Pala, K., Rambousek, A., Rychlý, P.: New clients for dictionary writing<br />

on the DEB platform. In: DWS 2006: Proceedings of the Fourth International<br />

Workshop on Dictionary Writings Systems, Italy, Lexical Computing Ltd., U.K.<br />

(2006) 17–23<br />

6. Horák, A., Rambousek, A.: Dictionary Management System for the DEB Development<br />

Platform. In: Proceedings of the 4th International Workshop on Natural<br />

Language Processing and Cognitive Science (NLPCS, aka NLUCS), Funchal, Portugal,<br />

INSTICC PRESS (2007) 129–138<br />

7. Chaudhri, A.B., Rashid, A., Zicari, R., eds.: XML Data Management: Native XML<br />

and XML-Enabled Database Systems. Addison Wesley Professional (2003)<br />

8. Oracle Berkeley DB XML web (2007)<br />

http://www.oracle.com/database/berkeley-db/xml.<br />

9. Nevěřilová, Z.: The Visual Browser Project. http://nlp.fi.muni.cz/projects/<br />

visualbrowser (2007)


The Development of a Complex-Structured<br />

Lexicon based on WordNet<br />

Aleš Horák 1 , Piek Vossen 2 , and Adam Rambousek 1<br />

1 Faculty of Informatics<br />

Masaryk University<br />

Botanická 68a, 602 00 Brno<br />

Czech Republic<br />

{hales,xrambous}@fi.muni.cz<br />

2 Faculteit der Letteren<br />

Vrije Universiteit van Amsterdam<br />

De Boelelaan 1105, 1081 HV Amsterdam<br />

The Netherlands<br />

Piek.Vossen@irion.nl<br />

Abstract. The Cornetto project develops a new complex-structured<br />

lexicon for the Dutch language. The lexicon comprises information from<br />

two current electronic dictionaries – the Referentie Bestand Nederlands<br />

(RBN), which contains FrameNet-like structures, and the Dutch Word-<br />

Net (DWN) with the usual WordNet structures. The Cornetto lexicon<br />

(stored in the Cornetto database) will be linked to English WordNet<br />

synsets and have detailed descriptions of lexical items in terms of morphologic,<br />

syntactic, combinatoric and semantic information. The database<br />

is organized in four data collections – lexical units, synsets, ontology<br />

terms and the Cornetto identifiers. The Cornetto identifiers are specifically<br />

used for managing the relations between lexical units on the one<br />

hand and synsets on the other hand. The mapping is first created automatically,<br />

but then revised manually by lexicographers. Special interfaces<br />

have been developed to compare the different perspectives of organizing<br />

concepts (lexical units versus synsets versus ontology terms).<br />

In this article, we describe the background information about the Cornetto<br />

project and the implementation of necessary project tools that are<br />

based on the DEBVisDic tool for WordNet editing. The development of<br />

the Cornetto clients is a joint project of the Masaryk University in Brno<br />

and the University of Amsterdam.<br />

Key words: Cornetto project; WordNet; DEB platform; DEBVisDic<br />

1 Introduction<br />

Cornetto is a two-year Stevin project (STE05039) in which a lexical semantic<br />

database is built that combines WordNet with FrameNet-like information [1]



for Dutch. The combination of the two lexical resources will result in a much<br />

richer relational database that may improve natural language processing (NLP)<br />

technologies, such as word-sense disambiguation and language-generation systems.<br />

In addition to merging the WordNet and FrameNet-like information, the<br />

database is also mapped to a formal ontology to provide a more solid semantic<br />

backbone.<br />

The database will be filled with data from the Dutch WordNet [2] and the<br />

Referentie Bestand Nederlands [3]. The Dutch WordNet (DWN) is similar to<br />

the Princeton WordNet for English, and the Referentie Bestand (RBN) includes<br />

frame-like information as in FrameNet plus additional information on the combinatorial<br />

behaviour of words in a particular meaning.<br />

Both DWN and RBN are semantically based lexical resources. RBN uses a<br />

traditional structure of form-meaning pairs, so-called Lexical Units [4]. Lexical<br />

Units contain all the necessary linguistic knowledge that is needed to properly use<br />

the word in a language. The Synsets are concepts as defined by [5] in a relational<br />

model of meaning. Synsets are mainly conceptual units strictly related to the<br />

lexicalization pattern of a language. Concepts are defined by lexical semantic<br />

relations. For Cornetto, the semantic relations from EuroWordNet are taken as<br />

a starting point [2].<br />

Within the project, we try to clarify the relations between Lexical Units<br />

and Synsets, and between Synsets and an ontology. DEBVisDic is specifically<br />

adapted for this purpose.<br />

In the next section we give a short overview of the structure of the database.<br />

The following sections give some background information on DEBVisDic and<br />

explain the specific adaptations and clients that have been developed to support<br />

the work of mapping the three resources.<br />

2 The Cornetto Lexical Database<br />

The Cornetto database (CDB) consists of 3 main data collections:<br />

– Collection of Lexical Units, mainly derived from the RBN<br />

– Collection of Synsets, mainly derived from DWN<br />

– Collection of Terms and axioms, mainly derived from SUMO and MILO<br />

In addition to the three data collections, a separate table of so-called Cornetto<br />

Identifiers (CIDs) is provided. These identifiers capture the relations between<br />

the lexical units and the synsets in the CDB, and also link to the original word senses<br />

and synsets in RBN and DWN.<br />

DWN was linked to WordNet 1.5. WordNet domains are mapped to Word-<br />

Net 1.6 and SUMO is mapped to WordNet 2.0 (and most recently to Word-<br />

Net 2.1). In order to apply the information from SUMO and WordNet domains


202 Aleš Horák, Piek Vossen, and Adam Rambousek<br />

Fig. 1. Cornetto Lexical Units, showing the preview and editing form<br />

to the synsets, we need to exploit the mapping tables between the different versions<br />

of WordNet. We used the tables that have been developed for the MEANING<br />

project [6, 7]. For each equivalence relation to WordNet 1.5, we consulted a<br />

table to find the corresponding WordNet 1.6 and WordNet 2.0 synsets, and via<br />

these we copied the mapped domains and SUMO terms to the Dutch synsets.<br />

The structure for the Dutch synsets thus consists of:<br />

– a list of synonyms<br />

– a list of language internal relations<br />

– a list of equivalence relations to WordNet 1.5 and WordNet 2.0<br />

– a list of domains, taken from WordNet domains<br />

– a list of SUMO mappings, taken from the WordNet 2.0 SUMO mapping<br />
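The chaining of mapping tables described above can be sketched as follows. This is a minimal illustration only: the table entries, synset identifiers and field names below are invented, whereas the real MEANING tables map synset offsets between WordNet versions.<br />

```python
# Hypothetical inter-version mapping tables (invented entries for illustration).
wn15_to_wn16 = {"ENG15-001": "ENG16-010"}
wn16_to_wn20 = {"ENG16-010": "ENG20-100"}
domains_wn16 = {"ENG16-010": ["gastronomy"]}   # WordNet domains, keyed on 1.6
sumo_wn20 = {"ENG20-100": ["Food"]}            # SUMO terms, keyed on 2.0

def decorate(dutch_synset):
    """Follow the equivalence link to WordNet 1.5, then chain the mapping
    tables to collect domains (via 1.6) and SUMO terms (via 2.0)."""
    wn15 = dutch_synset["eq_wn15"]
    wn16 = wn15_to_wn16.get(wn15)
    wn20 = wn16_to_wn20.get(wn16)
    dutch_synset["eq_wn20"] = wn20
    dutch_synset["domains"] = domains_wn16.get(wn16, [])
    dutch_synset["sumo"] = sumo_wn20.get(wn20, [])
    return dutch_synset

syn = decorate({"synonyms": ["soep"], "eq_wn15": "ENG15-001"})
```

Following this chain once per equivalence relation yields the domain and SUMO lists in the synset structure above.<br />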

The structure of the lexical units is fully based on the information in the RBN.<br />

The specific structure differs for each part of speech. At the highest level it<br />

contains:<br />

– orthographic form<br />

– morphology<br />

– syntax<br />

– semantics<br />

– pragmatics<br />

– examples<br />

The above structure is defined for single-word lexical units. A separate structure<br />

will be defined later in the project for multi-word units. Explaining the full<br />

structure here would take too much space; we refer to the Cornetto website [8] for<br />

more details.



Fig. 2. Cornetto Synsets window, showing a preview and a hyperonymy tree<br />

3 The DEB Platform<br />

The Dictionary Editor and Browser (DEB) platform [9, 10] offers a development<br />

framework for any dictionary writing system application that needs to store the<br />

dictionary entries in the XML format structures. The most important property<br />

of the system is the client-server nature of all DEB applications. This enables<br />

distributed authoring teams to work fluently on one common data<br />

source. The actual development of applications within the DEB platform can be<br />

divided into the server part (the server side functionality) and the client part<br />

(graphical interfaces with only basic functionality). The server part is built from<br />

small parts, called servlets, which allow a modular composition of all services.<br />

The client applications communicate with servlets using the standard HTTP<br />

web protocol.<br />

For server data storage, the current database backend is provided by<br />

Berkeley DB XML [11], which is an open source native XML database providing<br />

XPath and XQuery access into a set of document containers.<br />
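As a rough illustration of this storage model, the sketch below holds entries as XML documents and answers a query by inspecting an element value. Python's xml.etree stands in for the real Berkeley DB XML XQuery engine, and the entry schema and IDs are invented.<br />

```python
import xml.etree.ElementTree as ET

# A "document container" of dictionary entries, in the spirit of Berkeley
# DB XML; the entry structure and ids are invented for illustration.
container = [
    '<entry id="lu-1"><form>kat</form><pos>noun</pos></entry>',
    '<entry id="lu-2"><form>lopen</form><pos>verb</pos></entry>',
]

def query_by(tag, value):
    """Return ids of entries whose <tag> subelement equals value,
    mimicking an XPath query such as /entry[pos='noun']."""
    hits = []
    for doc in container:
        root = ET.fromstring(doc)
        if root.findtext(tag) == value:
            hits.append(root.get("id"))
    return hits
```

In the real system such queries are sent by the client over HTTP to a servlet, which evaluates them against the document containers.<br />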

The user interface, which forms the most important part of a client application,<br />

usually consists of a set of flexible forms that dynamically cooperate with the<br />

server parts. To meet this requirement, DEB has adopted the concepts of<br />

the Mozilla Development Platform [12]. The Firefox Web browser is one of the many<br />

applications created using this platform. The Mozilla Cross Platform Engine<br />

provides a clear separation between application logic and definition, presentation<br />

and language-specific texts.



3.1 New DEB Features for the Cornetto Project<br />

During the Cornetto project, the nature of the Cornetto database structure<br />

called for several features that were not yet present in the (still evolving)<br />

DEB platform. The main new functionalities include:<br />

– entry locking for concurrent editing. Editing of entries by remote users was<br />

already possible in DEB; however, exclusive write access to the same dictionary<br />

item was not enforced by the server. The new functions offer per-user<br />

entry locking (called from the client application, e.g. when entering<br />

the edit form). The list of all server locks is presented in the DEB administration<br />

interface, which allows the locks to be handled either manually or automatically<br />

on special events (logout, timeout, loading a new entry, . . . ).<br />

– link display preview caching. Because the database design (correctly)<br />

handles all references through entity IDs, each operation, such as displaying a structured<br />

entry preview or an edit form, may run a huge number (tens or hundreds)<br />

of extra database queries to display textual representations instead of<br />

the entity ID numbers. The drawback of this compact database model is that<br />

the query response time for a single entry can slow to seconds. To overcome<br />

this increase in the number of link queries, we have introduced the concept of<br />

preview caching. With this mechanism the server computes all kinds of previews<br />

at the time a modified entry is saved and stores them in special entry variables (either<br />

XML subtags or XML metadata). When a preview or edit form is constructed,<br />

the linked textual representations are taken from the preview<br />

caches instead of being fetched by extra queries.<br />

– edit form functionalities – the lexicographic experts within the Cornetto<br />

project have suggested several new user interface functions that are also useful<br />

for other DEB-based projects, such as collapsing parts of the edit form, entry<br />

merging and splitting functions, or new kinds of automatic inter-dictionary<br />

queries, so-called AutoLookUps.<br />

All these added functionalities are directly applicable in any DEB application like<br />

DEBVisDic or DEBDict.<br />
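The per-user entry locking described above can be sketched as follows. This is a simplified model only: the lock key shape, timeout length and method names are assumptions, not the actual DEB server interface.<br />

```python
import time

class LockTable:
    """Per-user entry locks with a timeout, released on logout."""

    def __init__(self, timeout=300.0):
        self.timeout = timeout
        self.locks = {}  # (dictionary, entry_id) -> (user, acquired_at)

    def acquire(self, dictionary, entry_id, user, now=None):
        """Called e.g. when a client opens the edit form for an entry."""
        now = time.time() if now is None else now
        key = (dictionary, entry_id)
        holder = self.locks.get(key)
        if holder and holder[0] != user and now - holder[1] < self.timeout:
            return False  # another user is currently editing this entry
        self.locks[key] = (user, now)
        return True

    def release_all(self, user):
        """Called on logout: drop every lock the user holds."""
        self.locks = {k: v for k, v in self.locks.items() if v[0] != user}

locks = LockTable(timeout=300.0)
ok1 = locks.acquire("cornetto-lu", "lu-42", "alice", now=0.0)
ok2 = locks.acquire("cornetto-lu", "lu-42", "bob", now=10.0)  # refused
locks.release_all("alice")                                    # e.g. logout
ok3 = locks.acquire("cornetto-lu", "lu-42", "bob", now=20.0)  # now succeeds
```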

4 The New DEBVisDic Clients<br />

Since one of the basic parts of the Cornetto database is the Dutch WordNet, we<br />

have decided to use DEBVisDic as the core for Cornetto client software. We have<br />

developed four new modules, described in more detail below. All the databases<br />

are linked together and also to external resources (Princeton English WordNet<br />

and SUMO ontology), thus every possible user action had to be very carefully<br />

analyzed and described.<br />

During several months of active development and extensive communication<br />

between Brno and Amsterdam, a lot of new features emerged in both



Fig. 3. Cornetto Identifiers window, showing the edit form with several alternate mappings<br />

server and client, and many of these innovations were also introduced into the<br />

DEBVisDic software. This way, each user of this WordNet editor benefits from<br />

the Cornetto project.<br />

The user interface is the same as for all the DEBVisDic modules: the upper part<br />

of the window is occupied by the query input line and the query result list, and<br />

the lower part contains several tabs with different views of the selected entry.<br />

Searching for entries supports several query types – a basic one is to search for a<br />

word or part of a word; the result list may be limited by adding an exact sense number.<br />

For more complex queries users may search for any value of any XML element<br />

or attribute, even with a value taken from other dictionaries (the latter is used<br />

mainly by the software itself for automatic lookup queries).<br />
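A minimal sketch of these two query flavours follows; the entry records, field names and sense numbering are invented for illustration (in the real system, such queries travel over HTTP to a DEB servlet).<br />

```python
import xml.etree.ElementTree as ET

# Invented lexical-unit entries for illustration.
entries = [
    '<lu><form>band</form><sense>1</sense><pos>noun</pos></lu>',
    '<lu><form>band</form><sense>2</sense><pos>noun</pos></lu>',
    '<lu><form>verband</form><sense>1</sense><pos>noun</pos></lu>',
]
docs = [ET.fromstring(e) for e in entries]

def search_word(word, sense=None, substring=False):
    """Basic query: match a word (or part of a word), optionally
    restricted to an exact sense number."""
    out = []
    for d in docs:
        form = d.findtext("form")
        match = word in form if substring else form == word
        if match and (sense is None or d.findtext("sense") == sense):
            out.append((form, d.findtext("sense")))
    return out

def search_element(tag, value):
    """Complex query: match the value of any XML element."""
    return [d.findtext("form") for d in docs if d.findtext(tag) == value]
```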

The tabs in the lower part of the window are defined per dictionary type,<br />

but each dictionary contains at least a preview of an entry and a display of the<br />

entry XML structure. The entry preview is generated using XSLT templates, so<br />

it is very flexible and offers plenty of possibilities for entry representation.



4.1 Cornetto Lexical Units<br />

The Cornetto foundation is formed by Lexical Units, so let us describe their<br />

client package first. Each entry contains complex information about morphology,<br />

syntax, semantics and pragmatics, and also lots of examples with complex<br />

substructure. Thus one of the important tasks was to design a preview to display<br />

everything needed by the lexicographers without the need to scroll much. The<br />

examples were moved to a separate tab and only their short résumé stayed on the<br />

main preview tab.<br />

Lexical units also contain semantic information from RBN that cannot be<br />

published freely because of licensing issues. Thus DEBVisDic here needs to differentiate<br />

the preview content based on the actual user’s access rights.<br />

The same ergonomic problem had to be resolved in the edit form. The whole<br />

form is divided into smaller groups of related fields (e.g. morphology) and it is<br />

possible to hide or display each group separately. By default, only the most<br />

important parts are displayed and the rest is hidden.<br />

Another new feature developed for Cornetto is the option to split the edited<br />

entry. Basically, this function copies the full content of the edited entry into a new one. This<br />

way, users may easily create two lexical units that differ only in some selected<br />

details.<br />

Because of the links between all the data collections, every change in lexical<br />

units has to be propagated to Cornetto Synsets and Identifiers. For example,<br />

when deleting a lexical unit, the corresponding synonym has to be deleted from<br />

the synset dictionary.<br />
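The deletion propagation just mentioned can be sketched as follows; the record shapes and identifiers are invented, not the actual Cornetto schema.<br />

```python
# Invented in-memory stand-ins for the three linked collections.
lexical_units = {"lu-7": {"form": "kat"}}
synsets = {"syn-3": {"synonyms": ["lu-7", "lu-9"]}}
cids = {"cid-1": {"lu": "lu-7", "synset": "syn-3", "selected": True}}

def delete_lexical_unit(lu_id):
    """Remove the LU, drop it from every synset's synonym list, and
    unselect every Cornetto identifier that points at it."""
    lexical_units.pop(lu_id, None)
    for syn in synsets.values():
        if lu_id in syn["synonyms"]:
            syn["synonyms"].remove(lu_id)
    for cid in cids.values():
        if cid["lu"] == lu_id:
            cid["selected"] = False

delete_lexical_unit("lu-7")
```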

4.2 Cornetto Synsets<br />

Synsets are even more complex than lexical units, because they contain lots<br />

of links to different sources – links to lexical units, relations to other synsets,<br />

equivalence links to Princeton English WordNet, and links to the ontology.<br />

Again, designing a user-friendly preview containing all the information was<br />

very important. Even here, we had to split the preview into two tabs – the first<br />

with the synonyms, domains, ontology, definition and short representation of<br />

internal relations, and the second with full information on each relation (both<br />

internal and external to the English WordNet). Each link in the preview is clickable<br />

and displays the selected entry in the corresponding dictionary window (for<br />

example, clicking on a synonym opens a lexical unit preview in the lexical unit<br />

window).<br />

The synset window also offers a tree view representing a hypernym/hyponym<br />

tree. Since the hypero/hyponymic hierarchy in WordNet is not a simple tree<br />

but a directed graph, another tab provides the reversed tree displaying links<br />

in the opposite direction (this concept was introduced in the VisDic WordNet<br />

editor). The tree view also contains information about each subtree’s significance<br />

– like the number of direct hyponyms or the number of all the descendant synsets.



The synset edit form looks similar to the form in the lexical units window,<br />

with less important parts hidden by default. When adding or editing links, users<br />

may use the same queries as in dictionaries to find the right entry.<br />

4.3 Cornetto Identifiers<br />

The lexical units and synsets are linked together using the Cornetto Identifiers<br />

(CID). For each lexical unit, the automatic aligning software produced several<br />

mappings to different synsets (with different score values). At the very beginning,<br />

the most probable one was marked as the “selected” mapping.<br />

In the course of the work, users have several ways of confirming the automatic<br />

choice, choosing another offered mapping, or creating an entirely new link.<br />

For example, a user can remove the incorrect synonym from a synset and the<br />

corresponding mapping will be marked as unselected in CID. Another option is<br />

to select one of the alternate mappings in the Cornetto Identifiers edit form. Of<br />

course, this action leads to an automatic update of synonyms.<br />

The most convenient way to confirm or create links is to use the Map current<br />

LU to current Synset function. This action can be run from any Cornetto client<br />

package, either by a keyboard shortcut or by clicking on the button. All the<br />

required changes are checked and carried out on the server, so the client software<br />

does not need to worry about the actual actions necessary to link the lexical unit<br />

and the synset.<br />
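The identifier workflow above can be sketched as follows: the aligner proposes scored candidate mappings, the highest-scoring one starts out selected, and a lexicographer may switch to another candidate. The field names and scores are assumptions, not the real CID schema.<br />

```python
def initial_selection(mappings):
    """Mark the highest-scoring candidate mapping as "selected"."""
    best = max(mappings, key=lambda m: m["score"])
    for m in mappings:
        m["selected"] = m is best
    return mappings

def select_mapping(mappings, synset_id):
    """User action: make the mapping to synset_id the selected one."""
    for m in mappings:
        m["selected"] = m["synset"] == synset_id
    return mappings

# Invented candidate mappings for one lexical unit.
cid = initial_selection([
    {"synset": "syn-1", "score": 0.42, "selected": False},
    {"synset": "syn-2", "score": 0.91, "selected": False},
])
cid = select_mapping(cid, "syn-1")  # lexicographer overrides the aligner
```

In the real system, changing the selected mapping also triggers the automatic update of synonyms on the server.<br />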

4.4 Cornetto Ontology<br />

The Cornetto Ontology is based on SUMO and so is the client package. The<br />

ontology is used in synsets, as can be seen in Figure 2. The synset preview<br />

shows a list of ontology relation triplets – relation type, variable, and variable<br />

or ontology term.<br />

Clicking on the ontology term opens the term preview. A user can also browse<br />

the tree representing the ontology structure.<br />

5 Conclusions<br />

We have presented the design and implementation of new tools supporting<br />

the work on the Dutch Cornetto project, which is developing a new complex-structured<br />

lexicon. The tools are built on top of the DEB platform, which is currently<br />

used in six full-featured dictionary writing systems (DEBDict, DEBVisDic,<br />

PRALED, DEB CPA, DEB TEDI and Cornetto). The Cornetto tools are closely<br />

related to the DEBVisDic system which, within the Cornetto project, has shown<br />

the versatility of its design and has been supplemented with new features<br />

reusable not only for work with other national WordNets but also for any other<br />

DEB application.<br />



Acknowledgments<br />

The Cornetto project is funded by the Nederlandse Taalunie and STEVIN. This<br />

work has also partly been supported by the Ministry of Education of the Czech<br />

Republic within the Center of basic research LC536 and in the Czech National<br />

Research Programme II project 2C06009.<br />

References<br />

1. Fillmore, C., Baker, C., Sato, H.: FrameNet as a ’net’. In: Proceedings of the Language<br />

Resources and Evaluation Conference (LREC 04), Volume 4, 1091–1094,<br />

Lisbon, ELRA (2004)<br />

2. Vossen, P., ed.: EuroWordNet: a multilingual database with lexical semantic networks<br />

for European Languages. Kluwer (1998)<br />

3. Maks, I., Martin, W., de Meerseman, H.: RBN Manual. (1999)<br />

4. Cruse, D.: Lexical semantics. Cambridge, England: University Press (1986)<br />

5. Miller, G., Fellbaum, C.: Semantic networks of English. Cognition (1991)<br />

6. WordNet mappings, the Meaning project (2007)<br />

http://www.lsi.upc.es/~nlp/tools/mapping.html.<br />

7. Daudé, J., Padró, L., Rigau, G.: Validation and tuning of WordNet mapping techniques.<br />

In: Proceedings of the International Conference on Recent Advances in Natural<br />

Language Processing (RANLP’03), Borovets, Bulgaria (2003)<br />

8. The Cornetto project web site (2007)<br />

http://www.let.vu.nl/onderzoek/projectsites/cornetto/start.htm.<br />

9. Horák, A., Pala, K., Rambousek, A., Rychlý, P.: New clients for dictionary writing<br />

on the DEB platform. In: DWS 2006: Proceedings of the Fourth International<br />

Workshop on Dictionary Writing Systems, Italy, Lexical Computing Ltd., U.K.<br />

(2006) 17–23<br />

10. Horák, A., Pala, K., Rambousek, A., Povolný, M.: First version of new client-server<br />

WordNet browsing and editing tool. In: Proceedings of the Third International<br />

WordNet Conference - <strong>GWC</strong> 2006, Jeju, South Korea, Masaryk University, Brno<br />

(2006) 325–328<br />

11. Chaudhri, A.B., Rashid, A., Zicari, R., eds.: XML Data Management: Native XML<br />

and XML-Enabled Database Systems. Addison Wesley Professional (2003)<br />

12. Feldt, K.: Programming Firefox: Building Rich Internet Applications with Xul.<br />

O’Reilly (2007)


WordNet-anchored Comparison of<br />

Chinese-Japanese Kanji Word<br />

Chu-Ren Huang 1 , Chiyo Hotani 2 , Tzu-Yi Kuo 1 , I-Li Su 1 , and Shu-Kai Hsieh 3<br />

1 Institute of Linguistics, Academia Sinica, Nankang, Taipei, Taiwan 115<br />

2 Seminar für Sprachwissenschaft, University of Tuebingen, Germany<br />

3 Department of English, National Taiwan Normal University<br />

1 {churen, ivykuo, isu}@sinica.edu.tw<br />

2 inatohc@hotmail.com<br />

3 shukai@gmail.com<br />

1 Introduction<br />

Chinese and Japanese are two typologically different languages sharing the same<br />

orthography since they both use Chinese characters in written text. What makes this<br />

sharing of orthography unique among languages in the world is that Chinese<br />

characters (kanji in Japanese and hanzi in Chinese) explicitly encode information of<br />

semantic classification [1,2]. This partially explains the process of Japanese adopting<br />

Chinese orthography even though the two languages are not related. The adaptation is<br />

supposed to be based on meaning and not on cognates sharing some linguistic forms.<br />

However, this meaning-based view of kanji/hanzi orthography faces a great challenge<br />

given the fact that Japanese and Chinese form-meaning pairs do not have a strict one-to-one<br />

mapping. There are meanings instantiated with different forms, as well as the same<br />

forms representing different meanings. The character 湯 is one of the most famous faux<br />

amis. It stands for ‘hot soup’ in Chinese and ‘hot spring’ in Japanese. In<br />

sum, these are two languages whose forms are supposed to be organized<br />

according to meaning, but show inconsistencies.<br />

WordNets as lexical knowledgebases, on the other hand, assume a basic semantic<br />

taxonomy which can be universally represented regardless of the linguistic distance.<br />

In other words, they assume that the organization of words around synsets and lexical<br />

semantic relations is universal. This position is partially supported by the various<br />

languages with comprehensive WordNets.<br />

It is important to note that WordNet and the Chinese character orthography are not<br />

as different as they appear. WordNet assumes that there are some generalizations in<br />

how concepts are clustered and lexically organized in languages and proposes an<br />

explicit lexical-level representation framework which can be applied to all languages<br />

in the world. Chinese character orthography intuited that there are some conceptual<br />

bases for how meanings are lexically realized and organized, and hence devised a sub-lexical<br />

level representation to represent semantic clusters. Based on this observation, the<br />

study of cross-lingual homo-forms between Japanese and Chinese in the context of<br />

WordNet offers a unique window for different approaches to lexical<br />

conceptualization. Since Japanese and Chinese use the same character set with the


210 Chu-Ren Huang, Chiyo Hotani, Tzu-Yi Kuo, I-Li Su, and Shu-Kai Hsieh<br />

same semantic primitives (i.e. radicals), we can compare their conceptual system with<br />

the same atoms when there are variations in meanings of the same word-forms. When<br />

this is overlaid over WordNet, we get to compare the ontology of the two<br />

representation systems.<br />

From a more practical point of view, unified lexical resources are necessary in<br />

advanced multilingual knowledge processing. Princeton WordNet [3] is a lexical<br />

resource commonly used. The Chinese WordNet (CWN) [4] has already been created, but<br />

there is no Japanese WordNet available yet. Since the Japanese and Chinese<br />

writing systems (hanzi) and their semantic meanings are closely related, analyzing this<br />

relation may speed up the creation of a Japanese WordNet aligned with CWN<br />

by providing statistical information on the form-meaning mapping of Japanese and<br />

Chinese words. In this paper, we examine and analyze the forms of hanzi and the<br />

semantic relations between CWN and the dictionary of the Japanese Electronic<br />

Dictionary Research Institute [5].<br />

2 Literature Review<br />

WordNet-like lexical knowledgebases for Chinese include HowNet, Chinese Concept<br />

Dictionary (CCD) [6], and Chinese WordNet [4, 7]. However, these are all<br />

constructed at the word level and do not explicitly refer to characters or character<br />

composition. Wong and Pala [8] were probably the first to link the semantic<br />

radicals of Chinese characters to a linguistic ontology, the EWN top ontology in their<br />

work. The first full-scale lexical knowledgebase work based on Chinese characters<br />

and semantic radicals is found in two recent doctoral dissertations: Chou [9] and Hsieh [10].<br />

Chou’s Hantology maps Chinese character radicals to the SUMO ontology and builds an<br />

ontology-based representation of character changes and variations. Hsieh’s HanziNet<br />

is a WordNet-like knowledgebase that takes Chinese characters as basic units and utilizes<br />

the semantic information from the semantic radicals. Their work has been converged<br />

and integrated in our recent proposal to utilize Chinese characters to build<br />

multilingual knowledge infrastructure [10].<br />

For Japanese, the National Institute of Information and Communications Technology (NICT) has<br />

recently started the first project to construct a Japanese WordNet. There is also a long<br />

tradition of working on Kanji, especially in terms of font rendition and character and<br />

word dictionaries (for instance, by the CJK Dictionary Institute, www.cjk.org).<br />

Unfortunately, we are not aware of any systematic work in Japan linking kanji with<br />

WordNet-like lexical knowledgebases.<br />

3 Resources: CWN, EDR and List of Character Variants<br />

In order to do a character-based and sense-anchored comparison of Chinese and<br />

Japanese words, we employed three important resources: CWN, EDR, and a mapping<br />

table between Chinese and Japanese characters.



EDR<br />

The EDR Electronic Dictionary is a machine-tractable dictionary that contains the<br />

lexical knowledge of Japanese and English constructed by the Japanese Electronic<br />

Dictionary Research Institute [5]. 1 It contains a list of 325,454 Japanese words (jwd)<br />

and their descriptions. In this study, the English translation, the English definition and<br />

the Part-of-Speech category (POS) of each jwd are used to determine their senses and<br />

semantic relations to their Chinese counterparts.<br />

CWN<br />

The Chinese WordNet currently contains a list of 8,624 Chinese words (cwd) and<br />

their descriptions. In this experiment, the English translations, the English definition,<br />

the Part-of-Speech category (POS) and the corresponding synset of all senses of each<br />

cwd are used to determine the semantic relations. These high and mid-frequency<br />

words represent over 20,000 synsets. Since CWN is still in progress and contains only<br />

words whose senses are manually analyzed and confirmed by corpus data, we can<br />

supplement the synsets not covered by CWN with data from the translation-based<br />

Sinica BOW (Academia Sinica Bilingual Ontological WordNet<br />

http://bow.sinica.edu.tw ).<br />

List of Hanzi Variants<br />

Modern kanji and hanzi systems are both descendants of the Chinese character system<br />

of the Tang dynasty over 1,200 years ago. However, even though frequent contacts<br />

were maintained, the long periods of development as separate systems still resulted in a<br />

small set of glyph variants. For instance, a character with the basic meaning of ‘elder<br />

sister’ is represented by two glyph variants in Chinese hanzi and Japanese kanji, as<br />

shown in (1).<br />

(1) Example Character Variants in Chinese and Japanese<br />

姊 (Chinese hanzi)<br />

姉 (Japanese kanji)<br />

It is important to note that these glyph variants cannot be dealt with simply as<br />

font variants. First, they are highly conventionalized and can have different meanings in<br />

the context of each language. Second, they are given different codes in Unicode<br />

coding space based on the conventionality arguments. Hence, in terms of automatic<br />

searching and comparison, these pairs will be recognized as different characters<br />

unless stipulated otherwise. In our study, we use a list of 125 pairs of Japanese and<br />

Chinese character variants compiled and provided to us by Christian Wittern of Kyoto<br />

University’s Institute for Studies in Humanities.<br />
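Normalizing such variants before comparison can be sketched as below; the single table entry is the pair from example (1), standing in for the full 125-pair list.<br />

```python
# Variant table: Japanese kanji -> corresponding Chinese hanzi.
# Only the pair from example (1) is shown; the real list has 125 pairs.
variant_table = {"姉": "姊"}

def normalize(word):
    """Rewrite Japanese glyph variants to their Chinese counterparts,
    so that variant pairs compare as identical characters."""
    return "".join(variant_table.get(ch, ch) for ch in word)
```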

1<br />

http://www2.nict.go.jp/r/r312/EDR/index.html



4 Methodology and Procedure<br />

4.1 Character-based Word Mapping Between Chinese and Japanese<br />

Each Japanese word jwd and Chinese word cwd is analyzed as a string of characters<br />

c 1 …c n . Each jwd is compared with all cwd’s for their character-string similarity. Each<br />

matched pair must satisfy one of the three criteria below and is classified as<br />

such. It is important to note that with the help of the list of variant characters, we are<br />

able to establish character identity regardless of their different surface glyphs.<br />

(I) Identical Character Sequence Pairs, where the numbers of characters in<br />

the jwd and cwd are identical and all the corresponding n th characters in the two<br />

words are also identical. We call these pairs homographic pairs. They can be<br />

exemplified by 頭 ‘head’, and 歌 手 ‘singer’. 2<br />

(II) Identical Character Component Pairs, where the numbers of characters<br />

in the jwd and cwd are identical, and both contain the same set of characters in<br />

different order. We call these pairs homomorphemic pairs. 3 They can be exemplified<br />

by Japanese 制 限 vs. Chinese 限 制 ‘to restrict’; and Japanese 律 法 vs. Chinese 法<br />

律 ‘law’.<br />

(III) Partly Identical Pairs, where at least one Kanji in the jwd matches with<br />

a Hanzi in the cwd. For example Japanese 相 合 can be paired with Chinese 相 對 於 ,<br />

合 力 , 相 形 之 下 , 看 相 , 縫 合 etc. The semantic relation between each pair, if it does<br />

exist, may be quite distant. However, including this class allows us to cover all<br />

possible mappings, as well as to study the kind of conceptual clustering represented<br />

by shared characters in either language.<br />

Jwd-cwd word pairs in such mapping groups are searched and compared with the<br />

following algorithm: (1) A jwd and a cwd are compared. If the words are identical,<br />

then they are considered as a homographic pair. (2) For all non-homographic pairs, if<br />

the two words have the same string length, then check the characters contained in<br />

each word. If both contain the exact same set of characters, then they are a<br />

homomorphemic pair. (3) If the pair has different string lengths or does not contain the<br />

exact same character sets, check if there is any character shared by the pair. If there are one or<br />

more shared characters, then the pair is a partly identical pair.<br />

After the mapping procedure, if the jwd is not mapped to any of the cwd, the jwd is<br />

classified into the (IV) uniquely Japanese group. If a cwd is not mapped by any of the jwd,<br />

it is classified into the (V) uniquely Chinese group.<br />
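The three-way classification above can be sketched as follows, with variant normalization folded in so that glyph variants count as identical characters; the variant table holds only the pair from example (1).<br />

```python
# Japanese kanji -> Chinese hanzi variant table (one illustrative pair).
variant_table = {"姉": "姊"}

def normalize(word):
    return "".join(variant_table.get(ch, ch) for ch in word)

def classify(jwd, cwd):
    """Classify a jwd-cwd pair by the three criteria of Section 4.1."""
    j, c = normalize(jwd), normalize(cwd)
    if j == c:
        return "homographic"                       # criterion (I)
    if len(j) == len(c) and sorted(j) == sorted(c):
        return "homomorphemic"                     # criterion (II)
    if set(j) & set(c):
        return "partly identical"                  # criterion (III)
    return None                                    # no mapping at all

r1 = classify("歌手", "歌手")    # identical sequence
r2 = classify("制限", "限制")    # same characters, different order
r3 = classify("相合", "相對於")  # one shared character
r4 = classify("姉", "姊")        # glyph variants compare as identical
```

Checking criterion (I) before (II) respects footnote 3: homomorphemic pairs refer only to non-homographic ones.<br />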

2<br />

Note that these forms are used in both languages but cannot be expected to have the exact same<br />

meaning. Hence the free translation intends to capture only the rough conceptual equivalence<br />

in both languages.<br />

3<br />

Logically, all homographic pairs are also homomorphemic pairs. However, for classificatory<br />

and comparative reasons, we use homomorphemic pairs to refer only to non-homographic<br />

ones.



4.2 Establishing Semantic Relation in Word Pairs<br />

After the character-based mapping, the senses of (I) homographic pairs and (II)<br />

homomorphemic pairs are compared in order to establish their cross-lingual semantic<br />

relations according to the following three classifications:<br />

(I-1, II-1) Synonym pairs with identical POS:<br />

E.g.<br />

(1-1) 以 降 : afterwards (noun)<br />

(2-1) 兄 弟 (Japanese) and 弟 兄 (Chinese): brother (noun)<br />

(I-2, II-2) Synonym pairs with unmatched POS: words in a pair are synonym with<br />

different POS or POS of at least one of the words in the pair is missing.<br />

E.g.<br />

(1-2) 意 味 : sense (noun in EDR and verb in CWN)<br />

(2-2) 定 規 (Japanese) and 規 定 (Chinese): rule (noun in EDR and no POS is<br />

indicated in CWN)<br />

(I-3, II-3) Unknown relation: the relation is not determinable by machine<br />

processing with the given information at this point.<br />

E.g. Japanese Chinese<br />

(1-3) 灰 : ash (noun) 灰 : dust (no POS indicated)<br />

(2-3) 愛 心 : affection (noun) 心 愛 : dear, darling (no POS indicated)<br />

In order to find the relation of J-C word pairs, the jwd and the cwd in a pair are<br />

compared according to the following information:<br />

(2)<br />

Jwd: English translation in EDR (jtranslation), POS<br />

Cwd: English translations in CWN (ctranslations), POS, cwd synset (English)<br />

The comparisons are done in the following manner: check if the jtranslation<br />

matches any of the ctranslations or a word in the cwd synset. If no match is<br />

found, the pair has unknown relation. If a match is found, check if the POS are<br />

identical. If the POS are identical, the pair is a synonym pair with identical POS.<br />

Otherwise the pair is a synonym pair with unmatched POS.<br />

After the process, synonym pairs with identical POS and synonym pairs with<br />

unmatched POS are examined manually to see if they are really synonyms.<br />
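As a rough sketch, the comparison just described can be written as follows; the dictionary fields (translation, translations, synset, pos) are our own illustrative assumptions, not the actual EDR or CWN record layout:

```python
# Hedged sketch of the pair-comparison step: classify a J-C pair as a
# synonym pair with identical POS, a synonym pair with unmatched POS,
# or a pair with an unknown relation.

def classify_pair(jwd: dict, cwd: dict) -> str:
    # Does the EDR English translation (jtranslation) match any CWN
    # translation or any word in the cwd's English synset?
    candidates = set(cwd["translations"]) | set(cwd["synset"])
    if jwd["translation"] not in candidates:
        return "unknown"
    # A match was found: compare parts of speech. A missing POS on either
    # side counts as unmatched.
    if jwd["pos"] and cwd["pos"] and jwd["pos"] == cwd["pos"]:
        return "synonym_same_pos"
    return "synonym_unmatched_pos"
```

For example, a pair like (1-2) above (noun in EDR, verb in CWN) would come out as a synonym pair with unmatched POS.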

Unknown Relation Analysis<br />

The pairs with unknown relation are divided into the following four different groups.<br />

Only jtranslation is missing (Only comparison info. of jwd is missing)<br />

E.g.<br />

(1-3-A) No English translation for 足 in EDR<br />

(2-3-A) No English translation for 運命 in EDR


214 Chu-Ren Huang, Chiyo Hotani, Tzu-Yi Kuo, I-Li Su, and Shu-Kai Hsieh<br />

Only ctranslations and cwd synset are missing (Only comparison info. of cwd is<br />

missing)<br />

E.g.<br />

(1-3-B) No English translation nor synset for 有無 in CWN<br />
(2-3-B) No English translation nor synset for 明星 in CWN<br />

No comparison info. is missing.<br />

E.g.<br />
(1-3-C) Japanese 火力: firepower (noun); Chinese 火力: power, powerfulness, potency (no POS)<br />
(2-3-C) Japanese 末期: end (noun); Chinese 期末: concluding, final, last, terminal (noun)<br />

Jtranslation, ctranslations and cwd synset are missing (comparison info. of both jwd and cwd is missing)<br />

E.g.<br />

(1-3-D) No English translation nor synset for 機動 in both EDR and CWN<br />
(2-3-D) No English translation nor synset for 山中 in EDR and for 中山 in CWN<br />

Then groups (A), (B) and (C) are sorted into possible synonym pairs and non-synonym pairs using the following methods.<br />

For (A): check if the definition of the jwd contains any of the ctranslations or any word in the cwd synset. If it does, the pair is a possible synonym pair; otherwise it is a non-synonym pair.<br />
For (B): check if the definition of the cwd contains the jtranslation. If it does, the pair is a possible synonym pair; otherwise it is a non-synonym pair.<br />
For (C): apply both of the methods used for (A) and (B).<br />
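A minimal sketch of this definition-based sorting, under the same assumed field names as before (they are illustrative, not the authors' schema):

```python
# Hedged sketch of the fallback for unknown-relation pairs in groups
# (A)-(C): look for the missing comparison information inside the
# dictionary definitions instead.

def sort_unknown_pair(jwd: dict, cwd: dict) -> str:
    """Return 'possible_synonym' or 'non_synonym'."""
    # Method used for (A): does the jwd definition mention any
    # ctranslation or any word in the cwd synset?
    c_words = set(cwd.get("translations", [])) | set(cwd.get("synset", []))
    if any(w in jwd.get("definition", "") for w in c_words):
        return "possible_synonym"
    # Method used for (B): does the cwd definition mention the jtranslation?
    jt = jwd.get("translation", "")
    if jt and jt in cwd.get("definition", ""):
        return "possible_synonym"
    # For group (C) both checks apply; if neither fires, the pair is
    # taken to be a non-synonym pair.
    return "non_synonym"
```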

5 Results<br />

Hanzi Mapping<br />

Table 1. J-C Hanzi Similarity Distribution (number of jwds; number of J-C word pairs).<br />
(1) Identical Hanzi Sequence Pairs: 2881 jwds; 20580 pairs<br />
  Without variant mapping: 2815 jwds; 20199 pairs<br />
  Difference: +66 jwds; +381 pairs<br />
(2) Different Hanzi Order Pairs: 207 jwds; 481 pairs<br />
  Without variant mapping: 204 jwds; 473 pairs<br />
  Difference: +3 jwds; +8 pairs<br />
(3) Partly Identical Pairs: 267036 jwds; 8492103 pairs<br />
  Without variant mapping: 264917 jwds; 8405427 pairs<br />
  Difference: +2119 jwds; +86676 pairs<br />
(4) Independent Japanese: 55330 jwds<br />
  Without variant mapping: 57518 jwds<br />
  Difference: -2188 jwds<br />
(5) Independent Chinese: 736 cwds<br />
  Without variant mapping: 851 cwds<br />
  Difference: -115 cwds<br />

Finding Synonyms (Word Relations)<br />

Table 2. Identical Hanzi Sequence Pairs (20580 pairs): Synonymous Relation Distribution<br />
(columns: 1-to-1 form-meaning pairs found by machine processing (% in (1)); 1-to-1 form-meaning pairs found by manual analysis (% in (1)); many-to-many form-meaning pairs found by manual analysis*)<br />
(1-1) Synonym with the same POS pairs: 92 (0.4%); 35 (0.2%); 26<br />
  Without variant mapping: 92; 35; 26<br />
  Difference: ±0; ±0; ±0<br />
(1-2) Synonym with unmatched POS pairs: 439 (2.1%); 262 (1.3%); 153<br />
  Without variant mapping: 425; 254; 150<br />
  Difference: +14; +8; +3<br />
(1-3) Unknown relation: 20049 (97.4%); -; -<br />
  Without variant mapping: 19682; -; -<br />
  Difference: +367; -; -<br />



Table 3. Identical Hanzi But Different Order Pairs (481 pairs): Synonymous Relation Distribution<br />
(columns: 1-to-1 form-meaning pairs found by machine processing (% in (2)); 1-to-1 form-meaning pairs found by manual analysis (% in (2)); many-to-many form-meaning pairs found by manual analysis*)<br />
** (2-1) Synonym with the same POS pairs: 0 (0.0%); 0 (0.0%); 0<br />
  Without variant mapping: 0; 0; 0<br />
  Difference: ±0; ±0; ±0<br />
(2-2) Synonym with unmatched POS pairs: 14 (2.9%); 11 (2.3%); 10<br />
  Without variant mapping: 14; 11; 10<br />
  Difference: ±0; ±0; ±0<br />
(2-3) Unknown relation: 467 (97.1%); -; -<br />
  Without variant mapping: 459; -; -<br />
  Difference: +8; -; -<br />

* A many-to-many form-meaning pair refers to a mapping between a group of jwds which have the same senses and a group of cwds that corresponds to those jwds.<br />
** No pair is found in (2-1) because every jwd in (2-1) also has an identical-Hanzi-sequence cwd in the given data.<br />

Unknown Relation Analysis<br />

Table 4. Identical Hanzi Sequence Pairs with Unknown Relation (20049 pairs): Distribution<br />
(columns: number of pairs (% in 1-3); possible synonym pairs (% in 1-3); non-synonym pairs (% in 1-3))<br />
(A) Missing the Japanese translation: 8618 (43.0%); 607 (3.0%); 8011 (40.0%)<br />
  Without variant mapping: 8428; 590; 7838<br />
  Difference: +190; +17; +173<br />
*** (B) Missing the Chinese translation and the synset: 2298 (11.5%); 0 (0.0%); 2298 (11.5%)<br />
  Without variant mapping: 2275; 0; 2275<br />
  Difference: +23; ±0; +23<br />
(C) No missing information: 5832 (29.1%); 322 (1.6%); 5510 (27.5%)<br />
  Without variant mapping: 5720; 296; 5424<br />
  Difference: +112; +26; +86<br />
(D) Missing both translations and the synset: 3301 (16.5%); -; -<br />
  Without variant mapping: 3259; -; -<br />
  Difference: +42; -; -<br />

Table 5. Identical Hanzi But Different Order Pairs with Unknown Relation (467 pairs): Distribution<br />
(columns: number of pairs (% in 2-3); possible synonym pairs (% in 2-3); non-synonym pairs (% in 2-3))<br />
(A) Missing the Japanese translation: 207 (44.3%); 7 (1.5%); 200 (42.8%)<br />
  Without variant mapping: 199; 5; 194<br />
  Difference: +8; +2; +6<br />
*** (B) Missing the Chinese translation and the synset: 46 (9.9%); 0 (0.0%); 46 (9.9%)<br />
  Without variant mapping: 46; 0; 46<br />
  Difference: ±0; ±0; ±0<br />
(C) No missing information: 151 (32.3%); 10 (2.1%); 141 (30.2%)<br />
  Without variant mapping: 151; 10; 141<br />
  Difference: ±0; ±0; ±0<br />
(D) Missing both translations and the synset: 63 (13.5%); -; -<br />
  Without variant mapping: 63; -; -<br />
  Difference: ±0; -; -<br />

*** In both groups (B), none of the CWN entries has a definition either; therefore no possible synonym pairs are found.



6 Conclusion<br />

In this paper, we present our study of Japanese and Chinese lexical semantic relations based on Hanzi sequences and their semantic relations. We compared the EDR electronic dictionary [5] with the Chinese WordNet [4] in order to examine the nature of cross-lingual lexical semantic relations.<br />
The following tables summarize the Japanese-Chinese form-meaning relation distribution found in this experiment.<br />

Table 6. Identical Hanzi Sequence Pairs (20580 pairs): Lexical Semantic Relation<br />
(columns: pairs found to be synonyms (% in (1)); pairs found to be non-synonyms (% in (1)); unknown relation (% in (1)))<br />
Machine Analysis: 1460 (7.1%); 15819 (76.9%); 3301 (16.0%)<br />
  Without variant mapping: 1403; 15537; 3259<br />
  Difference: +57; +282; +42<br />
Including Manual Analysis: 1226 (6.0%); 16053 (78.0%); 3301 (16.0%)<br />
  Without variant mapping: 1175; 15765; 3259<br />
  Difference: +51; +288; +42<br />

Table 7. Identical Hanzi But Different Order Pairs (481 pairs): Lexical Semantic Relation<br />
(columns: pairs found to be synonyms (% in (2)); pairs found to be non-synonyms (% in (2)); unknown relation (% in (2)))<br />
Machine Analysis: 31 (6.4%); 387 (80.5%); 63 (13.1%)<br />
  Without variant mapping: 29; 381; 63<br />
  Difference: +2; +6; ±0<br />
Including Manual Analysis: 28 (5.8%); 390 (81.1%); 63 (13.1%)<br />
  Without variant mapping: 26; 384; 63<br />
  Difference: +2; +6; ±0<br />

Hanzi variants were not taken into account in the previous experiment. This time, with Hanzi variants taken into account, more J-C word pairs were found in which each word of the pair contains a Hanzi character that is a variant of the character in the other word; such words are thus actually related in a sense.<br />

As the tables show, more than 75% of the pairs are found to be non-synonyms. However, it is not certain whether these pairs are really non-synonyms, or what their actual semantic relations are. In further experiments, we will try to find the semantic relations (not only synonymous relations) of those pairs currently found to be non-synonym pairs, and will analyze the relation between Japanese and Chinese Hanzi more closely to obtain more accurate results.



References<br />

1. Xu, S.: ShuoWenJieZi ('The Explanation of Words and the Parsing of Characters'). This edition: ZhongHua, Beijing (121/2004)<br />

2. Chou, Y.M., Huang, C.R.: Hantology: An Ontology based on Conventionalized<br />

Conceptualization. In: Proceedings of the Fourth OntoLex Workshop. A workshop held in<br />

conjunction with the second IJCNLP. October 15. Jeju, Korea (2005)<br />

3. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)<br />

4. Chinese WordNet. http://www.ling.sinica.edu.tw/cwn<br />

5. EDR Electronic Dictionary Technical Guide. Japanese Electronic Dictionary Research<br />

Institute. Online version, http://www2.nict.go.jp/r/r312/EDR/ENG/E_TG/E_TG.html (1995)<br />

6. Yu, J., Liu, Y., Yu, S.: The Specification of Chinese Concept Dictionary. J. Journal of<br />

Chinese Language and Computing 13(2), 176–193. (2003)<br />

7. Huang, C.R., Tseng, E.I.J., Tsai, D.B.S.: Cross-lingual Portability of Semantic<br />

Relations: Bootstrapping Chinese WordNet with English WordNet Relations. Presented at the Third Chinese Lexical Semantics Workshop. May 1-3. Academia Sinica (2002)<br />

8. Wong, S.H.S., Pala, K.: Chinese Characters and Top Ontology in EuroWordNet. In: Singh, U.N. (ed.) Proceedings of the First Global WordNet Conference. Mysore, India (2002)<br />

9. Chou, Y.M.: Hantology. [In Chinese]. Doctoral Dissertation. National Taiwan University.<br />

(2005)<br />

10. Chou, Y.M., Hsieh, S.K., Huang, C.R.: HanziGrid: Toward a knowledge infrastructure for<br />

Chinese characters-based cultures. To appear in: Ishida, T., Fussell, S.R., Vossen, P.T.J.M.<br />

(eds.) Intercultural Collaboration I. Lecture Notes in Computer Science, State-of-the-Art<br />

Survey. Springer-Verlag (2007)<br />

11. Hsieh, S.K., Huang, C.R.: When Conset Meets Synset: A Preliminary Survey of an<br />

Ontological Lexical Resource based on Chinese Characters. In: Proceedings of the 2006<br />

COLING/ACL Joint Conference. Sydney, Australia. July 17–21 (2006)<br />

12. HowNet. http://www.keenage.com<br />

13. Hsieh, S.K.: Hanzi, Concept and Computation: A Preliminary Survey of Chinese Characters<br />

as a Knowledge Resource in NLP. Doctoral Dissertation. University of Tübingen (2006)<br />

14. Huang, C.R., Lin, W.Y., Hong, J.F., Su, I.L.: The Nature of Cross-lingual Lexical Semantic<br />

Relations: A Preliminary Study Based on English-Chinese Translation Equivalents. In:<br />

Proceedings of the Third International WordNet Conference, pp. 180–189. Jeju, January 22–<br />

25 (2006)


Paranymy: Enriching Ontological<br />

Knowledge in WordNets<br />

Chu-Ren Huang, Pei-Yi Hsiao, I-Li Su, and Xiu-Ling Ke<br />

Institute of Linguistics, Academia Sinica<br />

Nankang, Taipei, Taiwan 115<br />

{churen, pyxiao, isu, vitake}@gate.sinica.edu.tw<br />

Abstract. This paper studies and explicates the rich conceptual relations among<br />

sister terms within the WordNet framework. We follow Huang et al. [1] and<br />

define those sister terms by a lexical semantic relation, named paranymy. Using<br />

paranymy, instead of the original indirect approach of defining the sister terms<br />

as words that have the same hypernym, enables WordNets to represent and<br />

enrich ontological knowledge. The familiar ontological problem of ‘ISA<br />

overload’ [2] can be solved by identifying and classifying conceptually salient<br />

groups among those sister terms. A set of paranyms comprises terms that are grouped together by one conceptual principle, as evidenced by their linguistic behavior. We believe that treating paranymy as a bona fide lexical<br />

semantic relation allows us to explicitly represent such classificatory<br />

information and enriches the ontological layer of WordNets.<br />

1 Introduction<br />

One area where formal ontologies are considered to be more powerful than WordNets<br />

(sometimes referred to as linguistic ontologies) is the explicit mechanism for defining<br />

concepts that allows inferences [3]. It is easy to observe that the sister terms in<br />

Princeton WordNet can still contain a cluster of undifferentiated concepts. For<br />

instance, the direct hyponyms of ‘cardinal compass point’ in Princeton WordNet are four sister terms: north, south, east, and west. These four terms are not really equal because there are two conceptually salient pairs, north-south and east-west, that play<br />

important roles in other conceptual classifications. In terms of conceptual<br />

representation, the simple IS-A defining hypernymy in WordNets is inadequate to<br />

deal with the complex conceptual relations among the sister terms, known as ‘ISA<br />

overload’ [2]. There are at least two possible approaches to deal with ISA overload in<br />

ontologies. Guarino [4] suggests a finer classification of upper categories, while<br />

SUMO [3] implements the two contrast pairs with antecedent axioms. In this paper,<br />

we explore the possibility of solving this ISA overload problem while maintaining the<br />

original WordNet structure and sister term relations. We present two critical parts of<br />

our treatment of paranymy as a lexical semantic relation in this paper: (i) the<br />

classificatory criteria for those elements in order to define X as a C, and (ii) the salient<br />

relation(s) among those different elements in X.



Huang et al. [1] examined sets of coordinate terms and discovered that the<br />

semantic relation, antonymy, was commonly used to explain the relation among those<br />

coordinate terms. However, antonymy and other relations, such as near-synonymy,<br />

are inadequate to account for their conceptual clustering or entailments. In order to<br />

give a more precise and richer semantic representation of lexical conceptual structure<br />

and ontology, the idea of paranymy was proposed. It was claimed that this proposal<br />

allows WordNets to incorporate representations of semantic fields.<br />

In Princeton WordNet (PWN, [5]), sister terms are defined as those coordinate<br />

words that have the same hypernym (also called “superordinate” in PWN). Such an approach indeed enables the representation of some ontological knowledge. However, hypernymy is quite general: it is not a concept specific enough to cover the more detailed relations among its set of hyponyms. When we reconsider the relations among those hyponyms, we realize that these coordinate terms can be reclassified into conceptually salient groups. Earlier works on the theory of semantic fields, such as [6] and [7], provided a clear explication of how lexical concepts cluster without actually laying out a comprehensive conceptual hierarchy. Therefore,<br />

in this paper, we plan to identify the salient groups, try to improve the comprehensive<br />

conceptual hierarchy, and enrich the ontological knowledge of WordNets by means of<br />

classificatory information with a richer layer.<br />

In what follows, section 2 discusses the different phenomena of sister terms in<br />

WordNets. The types and definitions of paranyms are explained in section 3, with practical examples in Chinese. The conclusion of this paper is given in section 4.<br />

2 Sister terms (coordinate terms) in WordNets<br />

It is easy to observe that not all coordinate terms are equal when detailed lexical<br />

analysis is done for a set of coordinate terms sharing the same hypernym. For<br />

example, when people talk about seasons, the first intuition for this concept is the four seasons: spring, summer, fall (or autumn), and winter. Other terms for seasons,<br />

such as dry season and rainy season, are not thought of intuitively as parallel as the<br />

four seasons although all of them share the same superordinate concept, “seasons in a<br />

year”. The same situation happens in the contrast between North vs. Southeast.<br />

Generally speaking, North and Southeast are both hyponyms of the concept,<br />

geographic direction. However, when we talk about the concept of geographic<br />

direction, only the four cardinal compass points, namely East/West/South/North,<br />

would come up intuitively as a set of hyponyms under this concept. Neither the<br />

North/Southeast pair nor the South/Northeast pair may be viewed as the four main<br />

directions at an equivalent level. In WordNets, the knowledge representation for those<br />

phenomena is very unclear because all those situations are simply classified by using<br />

the relation sister terms. This is a typical dilemma of ISA overload: two sets of unequal hyponyms, the four seasons vs. the dry and rainy seasons, are grouped as sister<br />

terms. Similarly, although the four cardinal compass points and all other non-cardinal<br />

compass points such as Southeast and Northeast, are all directions, grouping them<br />

with the same ISA relation has two parallel problems: that they are not equally



privileged, and that they form contrast pairs among themselves based on the relation<br />

of opposite directions.<br />

It is important to notice that conceptual dependencies entail linguistic collocations.<br />

For instance, the relations among the four cardinal compass points are revealed in<br />

various collocations formed by the North/South pair (or the East/West pair). Such<br />

collocations are fairly productive, while other combinations, such as South/East,<br />

would be regarded as rare. In terms of conceptual structure and knowledge<br />

representation, it is essential to further specify the direction contrast pairs of<br />

North/South and East/West among the four main directions. Such conventional<br />

collocation will play an important role in our reclassification of hyponyms.<br />

3 The definition and types of paranymy<br />

The semantic relation paranymy is used to refer to the relation between any two<br />

lexical items belonging to the same semantic classification in [1]. A paranymy relation<br />

must conform to the following basic requirements. The first requirement is that<br />

paranyms need to be a set of coordinate terms since they share the same hypernym<br />

(also called “superordinate”). Secondly, paranyms have to share the same<br />

classificatory criteria. The second requirement is critical and has very interesting<br />

consequences because the same conceptual space/semantic field can be partitioned<br />

differently by different criteria. For example, as shown by example (1), (1a) and (1b)<br />

are both possible exhaustive enumerations of the concept “seasons in a year.” People<br />

who live in a certain area, such as Southeast Asia, may prefer to use (1b) to describe<br />

their “seasons in a year”; however, to other people in the world, the four seasons of<br />

(1a) are the default.¹<br />

(1) Two sets of paranyms of the main concept “seasons in a year”<br />

a. chun1/xia4/qiu1/dong1<br />

“spring/summer/fall(autumn)/winter”<br />

b. gan1 ji4/yu3 ji4<br />

“dry season/rainy season”<br />

In addition, paranymy can capture how these concepts cluster by stipulating the same criterion they share for conceptual classification. As shown in (1) above, elements of these two different classifications, such as xia4 (summer) in (1a) and gan1 ji4 (dry season) in (1b), do not stand in direct contrast with each other although they are<br />

coordinate terms of the same concept “seasons in a year”. In other words, (1a) and<br />

(1b) do not belong to the same semantic field, which is defined by minimal semantic<br />

contrasts [6].<br />

¹ Please note that we are distinguishing 'rainy season (i.e. monsoon season)' as a primary classification of seasons from secondary classifications of seasons, such as winter and spring being rainy seasons in Taiwan.



One important consequence of allowing classificatory criteria to determine a set of<br />

paranyms is that the same set of sister terms may receive overlapping classification,<br />

such as in (2).<br />

(2) Directions in Chinese<br />

a. si4mian4 ‘four directions’<br />

dong1/xi1/nan2/bei3<br />

‘East/West/South/North’<br />

b. ba1fang1 ‘eight directions’<br />

dong1/xi1/nan2/bei3/dong1nan2/xi1bei3/dong1bei3/xi1nan2<br />

‘East/West/South/North/SouthEast/NorthWest/NorthEast/SouthWest’<br />

The examples in (2) show that the cardinality of directions in Chinese does not<br />

have to be four. What is more important is that (2a) is a subset of (2b). In our<br />

definition of paranymy, the relation is governed by a classificatory criterion. For<br />

instance, South and SouthWest are paranyms under the classification of ba1fang1<br />

(Eight directions), but not under the classification of si4mian4 (Four directions).<br />
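Because paranymy is governed by a classificatory criterion, a paranym test has to be relative to that criterion. A minimal sketch (the set names and English glosses are our own illustrative encoding):

```python
# Paranym sets are stored per classificatory criterion: the same two terms
# can be paranyms under one criterion but not under another.

PARANYM_SETS = {
    "si4mian4 (four directions)": {"east", "west", "south", "north"},
    "ba1fang1 (eight directions)": {"east", "west", "south", "north",
                                    "southeast", "northwest",
                                    "northeast", "southwest"},
}

def are_paranyms(a: str, b: str, criterion: str) -> bool:
    """Two distinct terms are paranyms only relative to a criterion."""
    members = PARANYM_SETS[criterion]
    return a != b and a in members and b in members
```

South and SouthWest then test as paranyms under ba1fang1 but not under si4mian4, exactly as in (2).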

It is important to note that this current study differs from [1] in terms of our treatment<br />

of what they called complementary paranymy. Huang et al. [1] proposed three types<br />

of paranymies: complementary, contrary, and overlapping. After our further study<br />

based on extensive examples from the Chinese WordNet (CWN) and data from two<br />

Austronesian languages (i.e., Formosan languages in Taiwan), Kavalan and Seediq [8,<br />

9], we decided that the relation complementary paranymy was intended to capture is better characterized by simply using the semantic relation antonymy. Examples are given in (3) and (4). Basically, such a complementary relation is a binary pair, and<br />

the typical criterion of this type is “either A or B.” More specifically, under a concept,<br />

there are only two possible nodes, A or B and these two nodes are contradictory.<br />

Therefore, in this relation, either A or B will appear, which also implies that the positive of one term necessarily entails the negative of the other. Therefore, we can<br />

say that the relation between A and B is actually antonymy. This can be exemplified<br />

by (3) and (4).<br />

(3) Data from CWN<br />

State of life: si3/huo2 “dead/alive”<br />

Amount: dan1/fu4 “singular/plural”<br />

(4) Data from Kavalan and Seediq<br />

Kavalan: binus/putay ‘alive/dead’<br />

Seediq: muudus/muhuqin ‘alive/dead’<br />

Hence we propose to revise the classification of [1] to include only two types of<br />

paranymy: contrary and overlapping.



3.1 Contrary Paranymy<br />

Contrary paranymy conforms to a condition that each of a set of terms is related to all<br />

the others by the relation of incompatibility [10]. The paranyms in this type are<br />

gradable and their senses are usually contrary. Contrary paranymy allows<br />

intermediate terms, so it is possible to have something that is neither A nor B. For<br />

example, something may be warm if it is neither hot nor cold. Besides, contrary<br />

paranyms are usually relative, for instance, a thick pencil is likely to be thinner than a<br />

thin girl. Contrary paranyms are classified under the perceptional or conventional<br />

paradigms. The perceptional paradigm is based on human perception or senses, for<br />

example, the superordinate node of fast/slow is speed. Whether a speed is fast or slow depends on somebody’s perception, and such perception varies from one person to another.<br />

The conventional paranyms are shown in Fig. 1a. In Chinese, there are various ways of addressing parents based on the register in which those terms are conventionally used, so those terms should be further classified into different groups rather than all being placed directly under the same superordinate, parent. As shown in Fig. 1b, after re-clustering the sister terms we get three sub-classes according to the register of the terms used to address parents. Such a re-clustered classification makes the conceptual structure clearer.<br />

Figure 1a. Parents Addressing<br />

Figure 1b. Parents Addressing (Concepts Re-clustered by the register)<br />

Besides, contrary paranyms can also be divided on the basis of collocation. For example, Fig. 2a shows a series of coordinate terms in Chinese that are all used to address one’s spouse under the colloquial register. Due to the collocations of those terms, most native speakers think that the contrary paranym of “xian1 sheng1” (husband) is “tai4 tai4” (wife) rather than “qi1 zi5” (wife). Similarly, the contrary paranym of “zhang4 fu1” (husband) is “qi1 zi5” (wife) rather than “tai4 tai4” (wife).<br />

Figure 2a. Spouse addressing under the colloquial register<br />

Figure 2b. Spouse addressing (Concepts Re-clustered based on the collocation)<br />

By the relation of paranymy, we can give a more precise account of the coordinate terms or hyponyms, especially those of the contrary type. A process of re-clustering sister terms can be formulated, as given in Fig. 3, and such conditions can therefore be applied to augment wordnets with descriptions of important linguistic collocations and relations.<br />
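The re-clustering process of Fig. 3 can be sketched as a small grouping function. The feature dictionary and register values below are our own illustrative assumptions (Fig. 1b actually yields three register sub-classes; two are shown here for brevity):

```python
# Sketch of Figure 3: sister terms sharing a hypernym are regrouped into
# paranym sets by a classificatory criterion (here, register).

def recluster(sister_terms: dict, criterion) -> dict:
    """Group sister terms into paranym sets by a classificatory criterion."""
    groups = {}
    for term, features in sister_terms.items():
        groups.setdefault(criterion(features), set()).add(term)
    return groups

# Hypothetical parent-addressing terms (cf. Figure 1), annotated by register.
terms = {
    "fu4qin1": {"referent": "father", "register": "formal"},
    "ba4ba5":  {"referent": "father", "register": "colloquial"},
    "mu3qin1": {"referent": "mother", "register": "formal"},
    "ma1ma5":  {"referent": "mother", "register": "colloquial"},
}
by_register = recluster(terms, lambda f: f["register"])
```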

3.2 Overlapping Paranymy<br />

Overlapping paranymy is defined as the case containing a paradigmatic relation of<br />

inclusion and that of exclusion in linear structures. In other words, two sister terms<br />

belonging to this type have some features in common, and meanwhile, comprise other<br />

distinct features. Overlapping paranyms may include some cases illustrating the relation of incompatibility and oppositeness,² in which the contrastive part is more predominant than the overlap, and they also include near-synonyms, where the features the terms share are considerable and more salient than those that differ (e.g., [10, 11]).<br />
² Please note that the overlapping paranymy we describe in this paper does not refer to the overlapping antonyms that Cruse [12] terms good/bad, and other antonyms having evaluative polarity as part of their meaning.<br />
Figure 3. Process of sister-term re-clustering: sister terms are re-clustered, using the same classificatory criterion or the collocation, into new contrary paranyms.<br />
As [1]<br />

explicated, the type of overlapping paranymy is elaborated on the basis of<br />

conventions, which are consistently shared by a language community and conform to<br />

their experience. The contexts in which the contrast in a pair of overlapping paranyms<br />

is foregrounded or not, as well as how their semantics overlaps, depends on discoursal<br />

conventions. This is evident in the choice made between the two conventional<br />

expressions of greeting, good afternoon and good evening. Both expressions are possible alternatives in a certain time period, say the late afternoon, which indicates the overlap between the time periods denoted by these two sister terms, afternoon and evening.<br />

Therefore, they are regarded as overlapping paranyms.<br />

The following examples extracted from CWN illustrate a similar case. For<br />

instance, both coordinate terms in (5) are overlapping, in that either can be chosen as the term for a large stream of water, while they are differentiated from<br />

each other in some other contexts. Besides, in (6), both xiang1 zi5 and he2 zi5 can be<br />

used to refer to “box”, but when we see a container for a diamond ring, we may call it<br />

he2 zi5 rather than xiang1 zi5. Conversely, we may call a container for a TV set<br />

xiang1 zi5 rather than he2 zi5. To capture such a relation between two sister terms,<br />

which the traditional semantic relations, such as antonymy and near-synonymy,<br />

cannot deal with, we appeal to overlapping paranymy.<br />

(5) A large natural stream of water: jiang1/he2 “river”<br />

(6) A (usually rectangular) container: xiang1 zi5/he2 zi5 “box”



4 Conclusion<br />

Using paranymy, we can further analyze the relations among the sister terms within the WordNet framework and enrich the ontological knowledge in WordNets.<br />

We introduce this semantic relation, paranymy, into our CWN system, and the ontological knowledge of sister terms is indeed elaborated as a result. More precise<br />

and clear classifications for clustering the coordinate terms can be obtained. For<br />

example, according to the criteria of paranymy, the relationship between brother and<br />

sister can be clustered into three classifications. The first classification is based on the<br />

same gender but different birth order (older or younger), such as “ge1 ge1” (elder<br />

brother) and “di4 di4” (younger brother) / “jie3 jie3” (elder sister) and “mei4 mei4”<br />

(younger sister). The second classification groups terms denoting different genders but the same birth order (older or younger), for instance, “ge1 ge1” (elder brother)<br />

and “jie3 jie3” (elder sister) / “di4 di4” (younger brother) and “mei4 mei4” (younger<br />

sister). The third type is based on the concept of collateral relatives by blood, so those<br />

four coordinate terms, “ge1 ge1”, “jie3 jie3”, “di4 di4”, and “mei4 mei4” are all<br />

grouped together under the concept of sibling. These three distinct relations for paranyms illustrate the enrichment of the knowledge system.<br />

It is important to note that the knowledge enrichment nature of paranymy comes<br />

not only from the introduction of this lexical semantic relation but also from the<br />

definition that requires different sets of paranyms to be differentiated by their conceptual<br />
classificatory criteria. Such knowledge can be encoded explicitly by simply listing the<br />
different subsets of paranyms: for instance, {East, West, South, North} and {East, West,<br />
South, North, SouthEast, NorthWest, SouthWest, NorthEast} are listed as two<br />
different paranymy relations. However, the classificatory criteria can also be explicitly represented,<br />

as in Figures 1 and 2. Our tentative proposal is to maintain the items-and-relations-only<br />
approach established by PWN. However, it can also be argued that the<br />

conceptual criteria for classification must be explicitly represented in order to resolve<br />

ISA overload. This architectural problem will be addressed in future studies.<br />

References<br />

1. Huang, C.R., Su, I.L., Hsiao, P.Y., Ke, X.L.: Paranyms, Co-Hyponyms and Antonyms:<br />

Representing Semantic Fields with Lexical Semantic Relations. In: Chinese Lexical<br />

Semantics Workshop. May 20-23. Hong Kong: Hong Kong Polytechnic University (2007)<br />

2. Guarino, N.: The Role of Identity Conditions in Ontology Design. In: Proceedings of IJCAI-<br />

99 workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future<br />

Trends. Stockholm, Sweden, IJCAI. Lecture Notes in Computer Science, 1661:221–227.<br />

Springer (1999a)<br />

3. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In: Proceedings of the 2nd<br />

International Conference on Formal Ontology in Information Systems. Ogunquit, Maine<br />

(2001)<br />

4. Guarino, N.: Avoiding IS-A Overloading: The Role of Identity Conditions in Ontology<br />

Design. Intelligent Information Integration (1999b)<br />

5. WordNet. http://wordnet.princeton.edu/


228 Chu-Ren Huang, Pei-Yi Hsiao, I-Li Su, and Xiu-Ling Ke<br />

6. Grandy, R. E.: Semantic Fields, Prototypes, and the Lexicon. In: Lehrer, A., Kittay, E.F.<br />
(eds.) Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, pp.<br />
103–122. Lawrence Erlbaum Associates, Hillsdale, NJ (1992)<br />

7. Lehrer, A.: Names and Naming: Why We Need Fields and Frames. In: Lehrer, A., Kittay,<br />

E.F. (eds.) Frames, Fields, and Contrasts: New Essays in Semantic and Lexical<br />

Organization, pp. 123–142. Lawrence Erlbaum Associates, Hillsdale, NJ (1992)<br />

8. Chang, Y.: Kavalan Reference Grammar. Yuan-liou Publisher, Taipei (2000a)<br />

9. Chang, Y.: Seediq Reference Grammar. Yuan-liou Publisher, Taipei (2000b)<br />

10. Cruse, A. D.: Meaning in Language: An Introduction to Semantics and Pragmatics, Second<br />

Edition. Oxford University Press, New York (2004)<br />

12. Cruse, A. D.: Lexical Semantics. Cambridge University Press, Cambridge (1986)<br />

13. Chinese WordNet. http://cwn.ling.sinica.edu.tw/<br />

14. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA<br />

(1998)<br />

15. Huang, C.R., Tsai, D.B.S., Zhu, M.X., He, W.R., Huang, L.W., Tsai, Y.N.: Sense and<br />

Meaning Facet: Criteria and Operational Guidelines for Chinese Sense Distinction.<br />

Presented at the Fourth Chinese Lexical Semantics Workshop. June 23–25, Hong Kong,<br />

Hong Kong City University (2003)<br />

16. Saeed, J. I.: Semantics. Blackwell Publishers Ltd, Oxford (1997)<br />

17. Sun, J. T.S.: Position-dependent Phonology. In: The 2007 Research Result Symposium of<br />

the Institute of Linguistics, Academia Sinica (2007)


Proposing Methods of Improving<br />

Word Sense Disambiguation for Estonian<br />

Kadri Kerner<br />

University of Tartu<br />

Institute of Estonian and General Linguistics<br />

Liivi 2–308, Tartu, Estonia<br />

kadri.kerner@ut.ee<br />

Abstract. This paper proposes some methods for making word sense<br />
disambiguation (WSD) a more feasible task for Estonian. Both<br />
automatic and manual WSD are kept in mind. Firstly, the paper gives an<br />
overview of WSD and of the manual sense annotation of Estonian, as well as<br />
a brief overview of the Word Sense Disambiguation Corpus of Estonian<br />
(WSDCEst). Based on this corpus it is possible to examine contextual and other<br />
patterns of the target word in order to create disambiguation rules. The corpus<br />
is annotated on the basis of the Estonian WordNet (EstWN). The second part of the paper<br />
discusses the fine-grainedness of EstWN: one way towards improving WSD<br />
is reducing the fine-grained sense inventory of the resources, which<br />
can be done by grouping similar word senses in EstWN. In doing so, the paper<br />
follows some existing work with WordNet.<br />

Keywords: Word sense disambiguation, semantic annotation, corpora, Estonian<br />

WordNet, similar sense grouping.<br />

1 Word Sense Disambiguation of Estonian Language<br />

Word sense disambiguation is the first step in the semantic analysis of a language.<br />
WSD is needed for other natural language processing applications, such as machine<br />
translation, information retrieval, etc. Since WSD is still one of the difficult problems<br />

in NLP, it is important to find methods of improving it.<br />

Word sense disambiguation is closely connected to morphological and syntactic<br />

disambiguation.<br />

Lexical entries (literals) in EstWN 1 are presented in the nominative singular form for nouns<br />
and the supine form for verbs. In real texts, words mostly appear in their full richness of<br />
forms. Lemmatization and part-of-speech tagging are done with the Estmorf tagger [3]. In<br />
sense annotation we considered only nouns (_S_ com) and non-auxiliary verbs<br />
(_V_ main or _V_ mod).<br />

1<br />

For details see Orav et al. in this volume.


230 Kadri Kerner<br />

The modal verbs are explicitly marked in the output of the morphological<br />
disambiguator (_V_ mod). When a verb is marked as such, the senses that do not<br />
correspond to modal senses can be removed; e.g. the verb saama has 12 senses<br />
in EstWN, but only 2 of them (can or may) correspond to the modal use of the word [11].<br />

The output of the morphological analyzer often contains valuable information for<br />

word sense disambiguation. In some cases the word-form used in the text can<br />
uniquely specify the sense of the word, although its lemma is ambiguous; e.g. the<br />
word palk can mean either salary or log of a tree, but its genitive form is different for<br />
each meaning (either palga or palgi). By using only the lemma we ignore this<br />
distinction, which can be explicitly present in the text [5].<br />
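The two pruning heuristics just described (removing non-modal senses when the tagger emits a modal tag, and letting a distinctive surface form decide between senses) can be sketched as follows. The tag string and the tiny sense inventories are hypothetical stand-ins for Estmorf and EstWN data, not the real resources.<br />

```python
# Illustrative sketch of the two pruning heuristics described above.
# The tag name "_V_ mod" and the mini sense inventories are hypothetical
# stand-ins for Estmorf/EstWN data, not the real resources.

# Candidate senses per lemma; each sense carries optional restrictions.
SENSES = {
    "saama": [
        {"id": 10, "gloss": "can", "modal": True},
        {"id": 11, "gloss": "may", "modal": True},
        {"id": 9, "gloss": "attain, get to", "modal": False},
    ],
    "palk": [
        {"id": 1, "gloss": "salary", "gen_form": "palga"},
        {"id": 2, "gloss": "log of a tree", "gen_form": "palgi"},
    ],
}

def prune_senses(lemma, pos_tag=None, word_form=None):
    """Drop candidate senses incompatible with the morphological analysis."""
    candidates = SENSES.get(lemma, [])
    if pos_tag == "_V_ mod":           # verb marked as modal by the tagger
        candidates = [s for s in candidates if s["modal"]]
    if word_form:                      # a distinctive surface form decides
        narrowed = [s for s in candidates if s.get("gen_form") == word_form]
        if narrowed:
            candidates = narrowed
    return candidates

print([s["id"] for s in prune_senses("saama", pos_tag="_V_ mod")])  # [10, 11]
print([s["id"] for s in prune_senses("palk", word_form="palgi")])   # [2]
```

Either heuristic alone already shrinks the candidate set before any contextual rule is consulted.<br />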

At the moment the input text contains no information about its syntactic structure;<br />
most importantly, verbal phrases and other multi-word units are not marked as<br />
such. The syntactic structure could also help to reduce the number of possible senses to<br />
choose from. For example, the most frequent verb olema (English be, have) has five<br />
frequent senses: only one sense is present in complementary clauses, three<br />
senses appear in existential sentences, and one in possessive sentences. Linguistic<br />
knowledge about the nature of the sentence can help the disambiguation process of a<br />
human annotator [5].<br />

2 Manual Tagging of Word Senses<br />

During four years, around 110 000 running words were looked over and all content<br />
words in the texts were manually annotated according to EstWN word senses. Twelve<br />
linguists and students of linguistics tagged the senses of nouns and verbs in the texts; each<br />
text was disambiguated by two persons. A pre-filtering system added the lexeme and the<br />
number of senses found in EstWN for each word to be annotated. Annotators marked in<br />
brackets the EstWN sense number that, in their opinion, best matched the sense in which<br />
the word was used. If the word was missing from EstWN, “0” was marked as the sense<br />
number, and if the word was found in EstWN but lacked the appropriate sense, “+1” was<br />
marked. If inconsistencies were met, they were discussed until agreement was<br />
achieved. In about 20% of cases the disambiguators had different opinions. This fact<br />
also indicates the most problematic entries in EstWN and the need to reconsider the<br />
borders of meaning of some concepts.<br />

3 Word Sense Disambiguation Corpus of Estonian 2<br />

The computational linguistics research group of the University of Tartu has<br />
developed the Word Sense Disambiguation Corpus of Estonian (WSDCEst). The source<br />
texts are mostly fiction and were manually annotated.<br />
It should be kept in mind that not all words can be disambiguated, but only content<br />
words. Although normally nouns, verbs, adjectives and adverbs are considered as<br />

2<br />

http://www.cl.ut.ee/korpused/semkorpus


Proposing Methods of Improving Word Sense Disambiguation… 231<br />

content words [10], in WSDCEst only nouns and verbs were subject to<br />

disambiguation.<br />

WSDCEst consists of two parts: the base corpus and sentences expressing motion (from<br />
the corpus of generated grammar of Estonian single clauses). For the base corpus, 43<br />
texts (each text file contains about 2500 tokens) were chosen for word sense<br />
disambiguation from the Corpus of the Estonian Literary Language (CELL) sub-corpus<br />
of Estonian fiction from the 1980s. Table 1 gives an overview of the current data of<br />
WSDCEst.<br />
The size of the base corpus is presently being increased, mainly with newspaper<br />
texts.<br />

Table 1. Words and senses in WSD Corpus of Estonian<br />
(columns: morphologically analyzed units; % of running words in text; % of all substantives; % of all verbs)<br />
Base corpus: 113870; 34,65; 64,97; 90,48<br />
Sentences expressing motion: 5738; 66,76; 72,50; 91,48<br />

4 Exploiting and Examining the WSDCEst<br />

It is possible to observe contextual and other patterns in the word sense<br />
disambiguation corpus in order to create certain rules for certain word senses. These<br />
rules are meant to improve the disambiguation task. For Estonian, around<br />
200 rules for frequent nouns were found. Both automatic and manual WSD could<br />
benefit from using these rules. As we are currently increasing the WSDCEst, we hope<br />
to get detailed feedback on and evaluation of these rules from human annotators.<br />
The next step is to include these rules as an independent module in an existing<br />
automatic WSD tool, Semyhe 3 , to test their effectiveness.<br />
Rules are represented in the disambiguation manual in such a way that a human<br />
annotator can easily follow them, for example:<br />

RULE: choose keel sense-2 (English language):<br />

If the directly preceding word is a genitive attribute;<br />

RULE: choose keel sense-4 (English tongue):<br />

If the word keel (English tongue) is in the form of singular allative.<br />
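Rules of this kind can also be encoded as data for automatic application, e.g. as predicates over a token and its neighbours. The sketch below is a hypothetical encoding of the two keel rules; the token attributes (lemma, case) stand in for real morphological analyzer output.<br />

```python
# A sketch of encoding context rules like the keel examples above so that
# they can be applied automatically. Token attributes are hypothetical
# analyzer output, not the real Estmorf tag set.

RULES = [
    # (lemma, chosen sense, predicate over previous / target / next token)
    ("keel", 2,  # 'language': directly preceded by a genitive attribute
     lambda prev, tok, nxt: prev is not None and prev["case"] == "gen"),
    ("keel", 4,  # 'tongue': the target itself is in singular allative
     lambda prev, tok, nxt: tok["case"] == "sg_all"),
]

def apply_rules(tokens, i):
    """Return the sense number chosen by the first matching rule, or None."""
    prev = tokens[i - 1] if i > 0 else None
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    tok = tokens[i]
    for lemma, sense, pred in RULES:
        if tok["lemma"] == lemma and pred(prev, tok, nxt):
            return sense
    return None

# 'eesti keel' (the Estonian language): genitive attribute precedes keel.
sent = [{"lemma": "eesti", "case": "gen"}, {"lemma": "keel", "case": "nom"}]
print(apply_rules(sent, 1))  # 2
```

Keeping the rules as data makes it easy to add, test and eliminate rules as the corpus grows.<br />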

3<br />

http://uuslepo.it.da.ut.ee/~kaarel/NLP/eng_semyhe/



Sometimes the near/local context gives the right sense of a target word. For<br />
example, if the word marriage occurs in the sentence, then the words man and woman<br />
are in the sense of married people (husband and wife) [4]. Another example: the<br />
word clock is in the sense of clock time (not a watch) if there is a number in the near<br />
context. For determining some word senses, the directly preceding or<br />
following word is important. For example, the word language (in the sense of speech) can be<br />
tagged with a suitable sense number only if it is directly followed by the Estonian<br />
adposition järgi. Also, the genus of a word or the plural/singular form of a target<br />
word can point to the suitable sense. For example, if the word soul (Estonian hing) is<br />
in the singular illative form, it always gets only one suitable sense number. By<br />
examining the word sense disambiguation corpus, several other minor observations<br />
were made, such as: the target word’s position in the sentence helps to determine the<br />
correct sense (for example, the words man and woman at the beginning of a sentence tend<br />
to carry only one sense); the target word’s appearance in a specific construction<br />
indicates the sense of the target word.<br />

It was possible to find rules for those word senses which are more frequent in<br />
WSDCEst. Yet words with high frequency in texts are often abstract and therefore<br />
difficult to describe, for example the words thing (Estonian asi), mind (Estonian meel),<br />
and thought (Estonian mõte). Since abstract nouns are very common, it could be more<br />
useful to deal with them rather than concentrating on concrete nouns.<br />

5 WordNet as Resource for WSD<br />

WordNets as lexical-semantic databases are often used for WSD because of their<br />
multilingualism, their various semantic relations and, of course, their free<br />
availability. Yet it has been argued that WordNet is too fine-grained a resource for<br />
WSD, because natural language processing applications do not need such a high<br />
level of granularity. Of course, different NLP applications also need different kinds of<br />
granularity: machine translation systems may need more distinct senses, while<br />
information retrieval can operate even at the homograph level. It is hard for an automatic<br />
WSD system to disambiguate between very many senses. It can also be extremely<br />
complicated for human annotators to tag the correct sense if the senses are too<br />
similar. For example, there are 11 senses of the Estonian noun asi (English thing) in<br />
EstWN, and some of these senses are so similar that even the context or the whole text<br />
does not give the correct sense.<br />

6 Grouping Similar Senses<br />

Therefore it could be useful to group some similar word senses in order to make WSD<br />
a more feasible task for both automatic systems and human annotators. The ideas for<br />
Estonian described here are closely related to existing work, e.g. [2], [8],<br />
[6], [7].



6.1 Grouping Similar Senses According to the Disagreement of Human<br />

Annotators<br />

Part of our research focuses on exploiting the agreement and disagreement of human<br />
annotators: are there any remarkable and important clusters of similar senses? By<br />
processing the human annotators’ disagreement files we have found possible clusters of<br />
similar senses for 50 frequent Estonian verbs and 25 frequent Estonian nouns.<br />
The clusters represent senses which are similar in the human annotators’ minds and can<br />
therefore be grouped. For example, Table 2 shows a high-frequency co-occurrence of<br />
sense numbers 2 and 3, which could indicate that these senses of the<br />
Estonian verb hakkama may be grouped into one sense only.<br />

Table 2. Sense clusters of hakkama (English to begin)<br />
Combination of sense numbers -- Frequency<br />
2 (English approach, deal with) -- 3 (English become, come): 17<br />
2 (English approach, deal with) -- 5 (English become, start): 10<br />
2 (English approach, deal with) -- 6 (English catch, grab): 9<br />
3 (English become, come) -- 5 (English become, start): 3<br />
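Cluster frequencies of the kind shown in Table 2 can be obtained by counting how often two sense numbers compete in the annotators' disagreement files. A minimal sketch, assuming the disagreements are available as pairs of sense numbers (a hypothetical simplification of the actual file format):<br />

```python
# Counting how often two sense numbers compete in annotators' disagreement
# files, as in Table 2. The input format is a hypothetical simplification.
from collections import Counter

def sense_pair_counts(disagreements):
    """disagreements: iterable of (sense_by_annotator_A, sense_by_annotator_B)."""
    counts = Counter()
    for a, b in disagreements:
        if a != b:
            counts[tuple(sorted((a, b)))] += 1  # order of annotators is irrelevant
    return counts

# Toy data for the verb hakkama: annotators often hesitate between 2 and 3.
data = [(2, 3), (3, 2), (2, 5), (2, 3), (5, 3)]
for (a, b), n in sense_pair_counts(data).most_common():
    print(f"{a} -- {b}: {n}")
```

The highest-frequency pairs are the candidate clusters of similar senses.<br />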

There is little disagreement among word senses that do not involve auto-hyponymy<br />
and/or sisters (co-hyponyms). Another very important observation is that human<br />
annotators disagree less when all the representation fields of EstWN are properly<br />
filled (hyperonym(s), synset, definition, explanation). The research showed<br />
that explanations of highly polysemous words seem to be very important for<br />
human annotators; when missing explanations are added, the difference between senses<br />
becomes more definite to human annotators. The fact that some words are not highly<br />
polysemous usually (but not always) indicates minor disagreement among human<br />
annotators.<br />

Examining the human annotators’ disagreement files also reveals problems in EstWN,<br />
such as missing examples, overlapping synsets or explanations, and overly fine-grained senses.<br />
In many cases it is impossible to determine the one and only sense. Sometimes it is<br />
not even necessary [12], and sometimes the nearby context allows different senses.<br />
It is difficult to distinguish word senses that are detectable in EstWN but not<br />
visible in the real usage of text (or language). In some cases the disagreement between<br />
human annotators arises because of a lack of lexicographical knowledge (or because the<br />
human annotator is somewhat superficial).<br />

Exploiting similar sense clusters can be helpful in revealing insufficiencies of<br />
EstWN. For example, if all the sense numbers of a word combine with each other, it<br />
can be assumed that the distribution of senses is incomplete and needs to be<br />
improved. If there is no disagreement among human annotators, then there are no<br />
remarkable sense clusters, and the senses of this particular word are therefore<br />
reasonably distributed and not too fine-grained. Our research also showed that words<br />
with abstract meanings are difficult to annotate and make up essential sense clusters.



Some researchers [11], [13] claim that words representing so-called Base Concepts<br />
are difficult to annotate semantically (apparently because of their broad meanings);<br />
this research confirmed that claim. The boundaries and the area of usage of a<br />
hyperonym or hyponym should be represented very precisely in EstWN. The<br />
tendency seems to be that hyponyms, as narrower meanings, are easier to distinguish<br />
than hyperonyms. That is why top concepts tend to combine with many<br />
different word senses (see the example in Table 3).<br />

Table 3. Sense clusters of saama (English to get)<br />
Combination of sense numbers -- Frequency<br />
10 (English can) -- 11 (English may): 24<br />
10 (English can) -- 9 (English attain, get to): 9<br />
10 -- 2: 8<br />
10 -- 6: 5<br />
10 -- 3: 4<br />
10 -- 7: 4<br />

6.2 Grouping Similar Senses by Processing Semantic Relations<br />

The research on examining the disagreement of human annotators suggests<br />
another possibility for grouping similar senses: processing certain semantic<br />
relations, for example auto-hyponymy and co-hyponymy/sisterhood. This idea relates to<br />
the work of Peters et al. [9]. When two senses share the same hyperonym, they<br />
are usually intuitively similar; they highlight different aspects of the given word. The<br />
case of auto-hyponymy also often shows the similarity and tight relatedness of two (or<br />
more) senses. Consider, for example, the EstWN noun aasta (English year) in the following<br />
example:<br />

ajavahemik 1, periood 1 (English amount of time)<br />
hyp=>aasta2<br />
hyp=>aasta1 (kalendriaasta1) (English twelvemonth)<br />
hyp=>aasta4<br />
hyp=>aasta3<br />

In this example, aasta2 (English year in the 2nd sense) is a hyperonym of other senses of<br />
the same word year, aasta1 and aasta4. Comparing the representation fields of<br />
senses 1, 2 and 4 in EstWN gives the possibility to group these senses.
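A sketch of how such auto-hyponymy configurations could be detected automatically: whenever one sense of a lemma is the hyperonym of other senses of the same lemma, those senses are proposed as a group. The miniature hierarchy below is an illustrative stand-in for EstWN, not the actual database.<br />

```python
# Grouping candidates from auto-hyponymy: if one sense of a lemma is the
# hyperonym of other senses of the same lemma (as with aasta above), those
# senses are proposed as a group. The mini-hierarchy is a hypothetical
# stand-in for EstWN data.

HYPERONYM = {  # (lemma, sense) -> its hyperonym (lemma, sense)
    ("aasta", 1): ("aasta", 2),
    ("aasta", 4): ("aasta", 2),
    ("aasta", 2): ("ajavahemik", 1),
    ("aasta", 3): ("ajavahemik", 1),
}

def auto_hyponym_groups(lemma):
    """Group each hyperonym sense of `lemma` with its own-lemma hyponyms."""
    groups = {}
    for (lem, sense), (hlem, hsense) in HYPERONYM.items():
        if lem == lemma and hlem == lemma:  # auto-hyponymy: same lemma twice
            groups.setdefault(hsense, {hsense}).add(sense)
    return [sorted(g) for g in groups.values()]

print(auto_hyponym_groups("aasta"))  # [[1, 2, 4]]
```

Senses 1, 2 and 4 of aasta fall into one proposed group, while sense 3, a plain co-hyponym, stays outside it.<br />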



6.3 Grouping Similar Senses Using “Semantic Mirroring” and Translational<br />

Equivalents<br />

“Semantic mirroring” is a method which allows one to derive synsets and senses of words<br />
by using translational equivalents in parallel corpora [1]. Using this method, it is also<br />
possible to determine new semantic relations (hyperonymy, hyponymy, etc.).<br />
In this experiment, I tried to use a simplified version of the semantic mirroring<br />
method in order to determine similar word senses. EstWN and English translational<br />
equivalents via the ILI link were used. Mirroring was done manually, for selected nouns<br />
only, in order to test the effectiveness of this method. Consider, for example, the noun thing<br />
(Estonian asi), whose senses are represented in the following example (this noun has<br />
altogether 11 senses, represented in bold; hyperonyms are represented before the<br />
sign =>):<br />

olev 2 (English something, somebody)<br />

hyp=>asi1 (objekt 1) (English object)<br />

hyp=>asi3<br />

hyp=>asi4 (tehisasi 2, artefakt 2) (English artefact)<br />

tehisasi 2, artefakt 2, asi 4<br />

hyp=>asi12 (värk 1, tühi-tähi 1, asjad 1, asjandus 2) (English stuff)<br />

töö 3 (English work)<br />

hyp=>asi5 (toimetus 2, ettevõtmine) (English project, task, undertaking)<br />

ütlus 1, väljend 1 (English saying, expression)<br />

hyp=>asi6<br />

juht 4, juhtum 1, sündmus 1 (English happening)<br />

hyp=>asi7 (lugu2)<br />

asi iseeneses 1, idee 2, abstraktsioon 1 (English abstraction)<br />

hyp=>asi8 (nähtus2) (English thing)<br />

tegu 2 (English action)<br />

hyp=>asi10<br />

seisund 4, olukord 4, situatsioon 1 (English situation)<br />

hyp=>asi11<br />

atribuut 1, omadus 2 (English attribute)<br />

hyp=>asi9<br />

Figure 1 shows every possible English translation of the Estonian word<br />
asi and, vice versa, the possible translations back to Estonian (irrelevant<br />
translations are left out and not considered). The circles contain synsets of both<br />
languages that correspond to each other.<br />

As a result, it can be observed that some senses of the Estonian word asi are indeed<br />
similar and can therefore be grouped (senses 3, 7, 8, 9, 10, 11):<br />

(5) ASI, ettevõtmine, ülesanne, toimetus (English project, task, undertaking)<br />

(4) ASI, tehisasi, artefakt (English artefact, artifact)<br />

(3, 6, 7, 8 ,9, 10, 11) ASI, lugu, nähtus (English thing)<br />

(1) ASI, objekt (English inanimate object, object, physical object)<br />

(12) ASI, tühi-tähi, värk, asjandus, asjad (English stuff, sundries, sundry, whatsis)



[Figure: the Estonian word ASI in the centre, linked on one side to Estonian synset members (ettevõtmine, toimetus, ülesanne; tehisasi, artefakt; objekt; nähtus, lugu; tühi-tähi, värk, asjandus, asjad) and on the other to their English translational equivalents (project, task, undertaking; artefact, artifact; object, inanimate object, physical object; thing; stuff, sundries, sundry, whatsis).]<br />

Fig. 1. Simplified Semantic Mirroring of the Noun asi<br />
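The grouping step of this simplified mirroring can be sketched as follows: senses whose sets of English translational equivalents overlap are merged into one group. The translation table below is an illustrative fragment, not real ILI data, and the single-pass merge is a simplifying assumption.<br />

```python
# A hypothetical sketch of the mirroring step: senses of 'asi' whose
# English translation sets overlap are proposed for grouping. The
# dictionary is an illustrative fragment, not real ILI data.

TRANSLATIONS = {  # sense number of 'asi' -> English translational equivalents
    1: {"object", "physical object"},
    4: {"artefact", "artifact"},
    5: {"project", "task", "undertaking"},
    7: {"thing"},
    8: {"thing"},
    12: {"stuff", "sundries", "whatsis"},
}

def mirror_groups(translations):
    """Merge senses whose translation sets share at least one equivalent.

    A single pass suffices here; a full implementation would take the
    transitive closure of the overlap relation."""
    groups = []
    for sense, eqs in translations.items():
        for group in groups:
            if group["eqs"] & eqs:           # shared translation: same group
                group["senses"].add(sense)
                group["eqs"] |= eqs
                break
        else:
            groups.append({"senses": {sense}, "eqs": set(eqs)})
    return [sorted(g["senses"]) for g in groups]

print(mirror_groups(TRANSLATIONS))  # [[1], [4], [5], [7, 8], [12]]
```

Senses 7 and 8 are merged because both mirror back through the English equivalent thing, mirroring the grouping shown above.<br />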

7 Conclusions and Future Work<br />

Future work includes gathering feedback from human annotators: how much did<br />
they benefit from using the contextual and other rules for WSD of frequent Estonian nouns?<br />
It is also important to test and evaluate, in an Estonian automatic WSD system, the rules<br />
that were created by examining the WSDCEst. A suitable formalism for<br />
describing these rules must be devised. As the WSDCEst is increasing in size, it is<br />
possible to create new rules and, at the same time, to eliminate unsuitable<br />
ones.<br />

There were some words whose senses could not be grouped with the semantic<br />
mirroring method, although intuitively they seem similar and are used in the same<br />
contexts in a text. It would be necessary to use parallel corpora for similar sense<br />
grouping because of the real use of language; this task in turn requires very large<br />
parallel corpora.<br />

Another direction of future work is finding the best method for grouping similar word<br />
senses for Estonian. It might be useful to combine different grouping methods<br />
(semantic mirroring, human annotators’ opinions, some semantic relations); an option<br />
would be using systematic polysemy patterns and/or computing semantic similarity. It



could be reasonable to group word senses considering different NLP applications and<br />
specific domains (music, food, etc.).<br />

Acknowledgements<br />

The work described here was supported by the National Program "Language<br />

Technology Support of Estonian Language” projects No. EKKTT04-5, EKKTT06-11<br />
and EKKTT07-21, and the Government Target Financing project SF0182541s03<br />

("Computational and language resources for Estonian: theoretical and applicational<br />

aspects").<br />

References<br />

1. Dyvik, H.: Translations as a semantic knowledge source. In: Proceedings of the Second<br />
Baltic Conference on Human Language Technologies, pp. 27–38. Tallinn (2005)<br />

2. Hovy, E.M., Marcus, M., Palmer, M., Pradhan, S., Ramshaw, L., Weischedel, R.:<br />

OntoNotes: The 90% Solution. Short paper. In: Proceedings of the Human Language<br />

Technology / North American Association of Computational Linguistics conference (HLT-<br />

NAACL 2006). New York, NY (2006)<br />

3. Kaalep, H-J.: An Estonian morphological analyser and the impact of a corpus on its<br />

development. J. Computers and the Humanities 31, 115–133 (1997)<br />

4. Kahusk, N., Kaljurand, K.: Results of Semyhe: (kas tasub naise pärast WordNet ümber<br />
teha?). In: Pajusalu, R., Hennoste, T. (eds.) Catcher of the Meaning, pp. 185–195.<br />

Publications of the Department of General Linguistics / University of Tartu 3 (2002)<br />

5. Kerner, K., Vider, K.: Word Sense Disambiguation Corpus of Estonian. In: Proceedings of<br />
the Second Baltic Conference on Human Language Technologies, pp. 143–148. Tallinn<br />

(2005)<br />

6. Mihalcea, R., Moldovan, D.: Automatic Generation of a Coarse Grained WordNet. In:<br />

Proceedings of NAACL Workshop on WordNet and Other Lexical Resources. Pittsburgh,<br />

PA (2001)<br />

7. Mihalcea, R., Chklovski, T.: Exploiting Agreement and Disagreement of Human Annotators<br />

for Word Sense Disambiguation. In: Proceedings of the Conference on Recent Advances in<br />

Natural Language Processing, pp. 4–12. Borovetz, Bulgaria (2003)<br />

8. Navigli, R.: Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation<br />

Performance. In: Proc. of the 44th Annual Meeting of the Association for Computational<br />

Linguistics joint with the 21st International Conference on Computational Linguistics<br />

(COLING-ACL 2006), pp. 105–112. Sydney, Australia (2006)<br />

9. Peters, W., Peters, I., Vossen, P.: Automatic sense clustering in EuroWordNet. In: Proc. of the<br />

1st Conference on Language Resources and Evaluation (LREC). Granada, Spain (1998)<br />

10. Stevenson, M., Wilks, Y.: The interaction of knowledge sources in word sense<br />

disambiguation. J. Computational Linguistics 27 (3), 321–349 (2001)<br />

11. Vider, K.: Notes about labelling semantic relations in Estonian WordNet. In:<br />
Christodoulakis, D. N., Kunze, C., Lemnitzer, L. (eds.) Proceedings of the Workshop on<br />
Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and<br />
Evaluation; Third International Conference on Language Resources and Evaluation<br />
(LREC 2002), pp. 56–59. Las Palmas de Gran Canaria (2002)<br />

12. Vider, K., Orav, H.: Concerning the difference between a conception and its application in<br />

the case of the Estonian wordnet. In: Sojka, P., Pala, K., Smrz, P., Fellbaum, Ch., Vossen, P.<br />

(eds.) Proceedings of the Second International WordNet Conference, pp. 285–290. Masaryk<br />

University, Brno (2004)<br />

13. Vossen, P., Kunze, C., Wagner, A., Dutoit, D., Pala, K., Sevecek, P.: Set of Common Base<br />
Concepts in EuroWordNet-2. Deliverable 2D001, WP3.1, WP4.1, EuroWordNet, LE4-8328.<br />
Amsterdam (1998)


Morpho-semantic Relations in WordNet –<br />

a Case Study for two Slavic Languages<br />

Svetla Koeva 1 , Cvetana Krstev 2 , and Duško Vitas 3<br />

1<br />

Department of Computational Linguistics, Institute of Bulgarian, 52 Shipchenski prohod,<br />

1113 Sofia, Bulgaria<br />

svetla@ibl.bas.bg<br />

2<br />

Faculty of Philology, University of Belgrade, Studentski trg 3,<br />

11000 Belgrade, Serbia<br />

3<br />

Faculty of Mathematics, University of Belgrade, Studentski trg 16,<br />

11000 Belgrade, Serbia<br />

{cvetana, vitas}@matf.bg.ac.yu<br />

Abstract. In this paper we present the problem of representing morpho-semantic<br />
relations in WordNets, especially the problems that arise when<br />
WordNets for languages that differ significantly from English are being<br />
developed on the basis of the Princeton WordNet, which is the case for<br />

Bulgarian and Serbian. We present the derivational characteristics of these two<br />

languages, how these characteristics are presently encoded in corresponding<br />

WordNets, and give some guidelines for their better coverage. Finally, we<br />

discuss the possibility to automatically generate new synsets and/or new<br />

relations on the basis of the most frequent and most regular derivational<br />

patterns.<br />

Keywords: global WordNet, morpho-semantic relations, derivational relations<br />

1 Introduction<br />

The aims of this paper are to present the current stage of the encoding of morpho-semantic<br />
relations in the Bulgarian and Serbian WordNets, to briefly sketch the<br />
derivational properties of the Slavic languages based on observations from Bulgarian<br />
and Serbian, to discuss the nature of morpho-semantic relations and their reflection in<br />
the WordNet structure, and to analyze the positive and negative consequences of an<br />
automatic insertion of Slavic derivational relations into it.<br />

WordNet is a lexical-semantic network whose nodes are synonymous sets<br />
(synsets) linked by the semantic or extralinguistic relations existing between them<br />

[3], [8]. The WordNet structure also includes semantic and morpho-semantic relations<br />

between literals (simple words or multiword expressions) constituting the different<br />

synsets. The WordNet is represented as a graph. The cross-lingual nature of the<br />

global WordNet is provided by establishing the relation of equivalence between<br />

synsets that express the same meaning in different languages [15].


240 Svetla Koeva, Cvetana Krstev, and Duško Vitas<br />

The global WordNet offers extensive data for successful implementation in<br />
different application areas such as cross-lingual information and knowledge<br />
management, cross-lingual content management and text data mining, cross-lingual<br />
information extraction and retrieval, multilingual summarization, machine translation,<br />
etc. Therefore the proper maintenance of the completeness and consistency of the<br />
global WordNet is an important prerequisite for any type of text processing for which<br />
it is intended.<br />

The structure of the paper follows the outlined goals. In the following section<br />
we present a short analysis of related work. In the third section, we briefly describe<br />
the properties of Slavic derivational morphology based on examples from Bulgarian<br />
and Serbian and their reflection in the WordNet structure. The fourth section explains<br />
how the morpho-semantic relations are encoded in the Bulgarian and Serbian WordNets<br />
respectively. We then discuss ways to incorporate the (Slavic) derivational<br />
relations into the WordNet structure and some limitations of their automatic insertion.<br />
Finally, we raise some problematic questions connected with the presented study and<br />
propose future work to be done.<br />

2 Related work<br />

WordNets have been developed for most of the Slavic languages – Bulgarian,<br />
Serbian, Czech, Russian, Polish, Slovenian – and some initial work has been done for<br />
Croatian. WordNets for three Slavic languages (Czech – started within the<br />
EuroWordNet (EWN) project – Bulgarian and Serbian) have been developed in the scope of<br />
the Balkanet project (BWN) [2], [11] and have later continued to develop as nationally<br />
funded projects 1 or on a volunteer basis.<br />

The Princeton WordNet (PWN) was originally designed as a collection of synsets<br />

that represent synonymous English lexemes which are connected to one another with<br />

a few basic semantic relations, such as hyponymy, meronymy, antonymy and<br />

entailment [3], [8]. This same structure has basically been mirrored in most of the<br />

WordNets developed on the basis of PWN. The structural differences of Slavic<br />

languages, which share many similar features, have motivated the enrichment of<br />

WordNets with new information. The added information is mostly related to the<br />

inflectional and derivational richness of the language in question. For instance,<br />

information related to inflectional properties has been added to all lexemes in<br />

Bulgarian [4] and Serbian [2] WordNets, and for Serbian some rudimentary semantic<br />

relations that can be inferred from derivational connectedness, for instance<br />

derived-pos (for possessive adjectives) and derived-gender (for gender motion) [2],<br />

have been added too. On the other hand, the recognized importance of PWN, and<br />

the global WordNet in general, for various NLP applications has initiated major<br />

additions and modifications of PWN itself.<br />

The existence of derivational relations that exhibit fairly regular behavior and<br />

that connect lexemes belonging to the same or to different categories has seemed to<br />

many a good starting point for substantial WordNet enrichment. We will<br />

1 http://dcl.bas.bg/bulNet/general_en.html


Morpho-semantic Relations in WordNet – a Case Study… 241<br />

present the most interesting approaches. All these approaches rely on the fact that if<br />

there is a derivational relation between two lexemes belonging to different synsets<br />

then most probably there is a kind of semantic relation between the synsets to which<br />

the lexemes belong.<br />

The automatic enrichment of WordNet on the basis of the derivational relations has<br />

been proposed and used for the Czech WordNet [10]. The basic and most productive<br />

derivational relations in Czech have been included in a Czech morphological analyzer<br />

and generator, and semantic labels were added to the derivational relations.<br />

The sharing of semantic information across WordNets has been proposed by [1].<br />

Namely, if WordNets for several languages are connected to each other (for instance<br />

via the Interlingual Index (ILI) [15], as has been done for the WordNets developed in the scope<br />

of the EuroWordNet and Balkanet projects), then semantically related synsets in a source<br />

language for which the connection has been established on the basis of the<br />

derivational relatedness of some of the lexemes can be used to connect the synsets in<br />

a target language whose lexemes may not exhibit any derivational relation.<br />

A method to improve the internal connectivity of PWN has been proposed in [9].<br />

The existing synsets have been manually connected on the basis of the automatically<br />

produced list of pairs of lexemes that are (potentially) derivationally, and therefore<br />

also semantically, connected. In this paper we will try to show why we find the last<br />

approach the most appropriate for Bulgarian and Serbian.<br />
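The approach of [9] can be illustrated with a toy candidate generator (our own sketch; the suffix list and word lists are illustrative, not the actual resources used in [9]): lexeme pairs are proposed automatically whenever a noun looks like a verb plus a known derivational suffix, and the resulting list is then validated manually.

```python
# Illustrative English suffix patterns: (suffix, assumed semantic label)
SUFFIXES = [("er", "agent"), ("ing", "action")]

def candidate_pairs(verbs, nouns):
    """Propose (verb, noun, label) triples where the noun equals
    verb + suffix; every triple still requires manual validation."""
    out = []
    for verb in verbs:
        for suffix, label in SUFFIXES:
            if verb + suffix in nouns:
                out.append((verb, verb + suffix, label))
    return out

pairs = candidate_pairs(["teach", "read"], {"teacher", "reader", "reading"})
```

On real data such a generator over-proposes (e.g. corner is not derived from corn), which is exactly why the final synset-linking step in [9] remains manual.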

3 Slavic derivation in WordNet structure<br />

Derivation is highly productive in all Slavic languages. Some of the most frequent<br />

and regular derivational mechanisms in Bulgarian and Serbian are given in Table<br />

1. The status of the listed derivational mechanisms is not the same. Some of them<br />

represent more or less frequent models that are not applicable to every lemma<br />

with a certain syntactic or semantic property, while other models can always be<br />

applied. For instance, the pattern Verb → Noun denoting a profession is one of<br />

numerous derivational patterns in Bulgarian (уча → учител) and Serbian (učiti →<br />

učitelj), while the pattern Verb → Verbal noun is a general rule that can be applied to<br />

all imperfective verbs in the two languages. Similarly, a possessive adjective exists<br />

for every animate noun [12]. We call this phenomenon regular derivation since in<br />

some respect it extends the notion of an inflectional class.<br />

Formally, regular derivation is performed by derivational operators that<br />

significantly influence the structuring of the lexicon of Slavic languages. The analysis<br />

of this phenomenon is given in [13], [14] using examples of the processing of possessive<br />

and relational adjectives, amplification and gender motion in various English-Serbian<br />

and Serbian-English dictionaries. Moreover, the derivational potential is, as a rule,<br />

connected to the specific sense of a lemma (see sections 5 and 7).



Table 1. Some of the derivational mechanisms in Bulgarian and Serbian<br />

Relation | Bulgarian | Serbian | English<br />

Aspect pairs | уча → науча | učiti → naučiti 2 | teach – learn<br />

Verb → noun | уча → учител | učiti → učitelj | teach – teacher<br />

Verb → noun | уча → ученик | učiti → učenik | learn – student<br />

Verb → noun | уча → училище | učiti → učilište 3 | learn – school<br />

Verb → noun | — | učiti → učionica | learn – classroom<br />

Verb → noun | уча → учебник | učiti → udžbenik | learn – textbook<br />

Verb → noun | уча → учен | učiti → učenjak | learn – scientist<br />

Verbal noun | уча → учение | učiti → učenje | learn – studies<br />

Verbal noun | уча → учене | — | learn – study<br />

Collective noun | ученик → ученичество | — | student – schooldays<br />

Verb → adjective | уча → учебен | učiti → učen | learn – educational<br />

Verb → adjective | уча → учен | učiti → učevan | learn – educated<br />

Relative adjective | учител → учителски | učitelj → učiteljski | of or related to teacher<br />

Possessive adjective | учител → учителски | učitelj → učiteljev | male – female teacher<br />

Gender pairs | учител → учителка | učitelj → učiteljica | teacher – female teacher<br />

Gender pairs | — | učiti → učenica | student – female student<br />

Diminutive | ученик → учениче | učenik → učeničić | student – little student<br />
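The contrast drawn above between lexically restricted patterns and always-applicable rules can be sketched as two kinds of derivational operators (a toy model; the string rules below are deliberately naive and cover only the уча examples from Table 1):

```python
def verbal_noun(verb: str) -> str:
    """General rule: every imperfective verb has a verbal noun.
    Naive string rule covering only the уча -> учене example."""
    return verb[:-1] + "ене" if verb.endswith("а") else verb + "ене"

# Lexically restricted pattern: must be listed per lemma (and per sense)
AGENT_NOUNS = {"уча": "учител"}

def agent_noun(verb: str):
    """Profession noun, or None when the pattern does not apply."""
    return AGENT_NOUNS.get(verb)
```

The design point is that an always-applicable operator can be a function of the form of the lemma, whereas a restricted pattern must ultimately be backed by a lexicon.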

4 Current state of morpho-semantic relations in Bulgarian and<br />

Serbian WordNets<br />

Eight semantic relations between synsets are represented (in correspondence with<br />

the Princeton WordNet) in the Bulgarian [4], [5] and Serbian [2] WordNets. These<br />

relations are: hypernymy, meronymy (three of the recognized subtypes are registered),<br />

subevent, caused, be in state, verb group, similar to and also see (also see<br />

in PWN actually encodes two different relations: between verbs and between<br />

adjectives; the former is a kind of morpho-semantic relation between literals<br />

roughly corresponding to Slavic verb aspect, while the latter is a semantic<br />

relation of similarity between synsets). Three extralinguistic relations between synsets<br />

are encoded as well: usage domain, category domain and region domain. The<br />

The WordNet structure also includes semantic and morpho-semantic (derivational)<br />

relations among literals belonging to the same or to different synsets. The semantic<br />

relations between literals are synonymy and antonymy (in the Bulgarian and Serbian<br />

WordNets antonymy links synsets); the derivational ones are derived, participle and derivative in<br />

Bulgarian, and derived-pos, derived-gm, and derived-vn in Serbian.<br />

4.1 Encoded morpho-semantic relations<br />

The morpho-semantic relations in the Bulgarian and Serbian WordNets link synsets<br />

although they derivationally apply only to the literals (single-word and multi-word<br />

2 There is actually a whole list of perfective verbs that correspond to the imperfective verb учити: izučiti,<br />

naučiti, obučiti, preučiti (se), podučiti, poučiti, priučiti, proučiti.<br />

3 Today obsolete.



lemmas). On the other hand, morpho-semantic relations express different kinds of<br />

semantic relations which hold between synsets. Neither the derivational links between<br />

the exact literals nor labels [10] for the respective semantic relations operating<br />

between synsets have been encoded so far in the Bulgarian and Serbian WordNets. The<br />

subsumed morpho-semantic relations are briefly presented below (some statistical<br />

data are shown in Table 2):<br />

Derivative is an asymmetric inverse intransitive relation between derivationally<br />

and semantically related noun and verb. For example the Bulgarian literal водя from<br />

the synset {насочвам:1, насоча:1, водя:4, напътвам:1, напътя:1, направлявам:1}<br />

(the corresponding English synset is {steer:1, maneuver:1, maneuver:2, manoeuvre:2,<br />

direct:11, point:4, head:5, guide:1, channelize:1, channelise:1} with a definition<br />

‘direct the course; determine the direction of traveling’) is in derivative relation with<br />

the noun водач from the synset {водач:3} (the corresponding English synset is<br />

{guide:2} with a meaning ‘someone who shows the way by leading or advising’).<br />

Derived is an asymmetric inverse intransitive relation between derivationally and<br />

semantically related adjective and noun. For example the literal меден from the<br />

Bulgarian synset {меден:1} (the English equivalence {cupric:1, cuprous:1} with a<br />

definition ‘of or containing divalent copper’) is in a derived relation with the literal<br />

мед from the synset {мед:2, Cu:1} (in English → {copper:1, Cu:1, atomic number<br />

29:1}). A productive derivational process relates Slavic nouns to the respective relative<br />

adjectives with the general meaning ‘of or related to the noun’. For example, the<br />

Bulgarian relative adjective {стоманен:1} defined as ‘of or related to steel’ has the<br />

Serbian equivalent {čelični:1} with exactly the same definition. In English<br />

this relation is expressed by the respective nouns used with an adjectival function<br />

(rarely at the derivational level, consider wooden–wood, golden–gold), thus the<br />

concepts exist in English as well and the mirror nodes should be envisaged.<br />

Participle is an asymmetric inverse intransitive relation between derivationally and<br />

semantically related adjective denoting result of an action or process and the verb<br />

denoting the respective action or process. Consider играя from {играя:7} (the<br />

English equivalent {play:1} with a definition ‘participate in games or sport’) which is<br />

in a Participle relation with the literal игран from {игран:1} denoting ‘(of games)<br />

engaged in’ for the English counterpart {played:1}. All Bulgarian verbs produce<br />

participles (the number of participles varies from one to four depending on the<br />

properties of the source verb) which are considered as verb forms constituting<br />

complex tenses or passive voice. On the other hand, a large part of the Bulgarian<br />

participles act as adjectives with a separate meaning. Similar relations between a<br />

verb and its participles hold for Serbian.<br />

It can be seen that the actual derivational relations are established between<br />

particular literals although the synsets are formally linked (the actual semantic<br />

relation between synsets, whose marker is the derivation itself, is not labeled). The<br />

English derivative, derived, and participle relations have been automatically transferred to<br />

the Bulgarian WordNet. As they are language-specific and there is obviously no one-to-one<br />

mapping between English and Bulgarian, the expanded links are manually<br />

validated. A specification of whether a given morpho-semantic relation exists in English<br />

only is declared in a synset note (SNote).<br />

The relation eng_derivative has also been automatically transferred to Serbian,<br />

although the corresponding derivational relation may hold in Serbian as well, but need<br />

not (see the Serbian example in section 5). The new relations derived-pos, derived-vn,<br />

and derived-gender have been introduced in Serbian WordNet to relate possessive and<br />

relative adjectives, verbal nouns and female (or male) doublets, assigned mainly to<br />

the Balkan specific or Serbian specific synsets.<br />

Table 2. Statistical data for the encoded morpho-semantic relations<br />

in Bulgarian and Serbian WordNets.<br />

Number of | BG WN | SR WN | PWN 2.0<br />

Synsets | 29,136 | 13,612 | 115,424<br />

Literals | 56,223 | 23,139 | 203,147<br />

Relations | 53,144 | 18,210 4 | 204,948<br />

Derived | 1,696 | 314 | 1,296<br />

Derivative | 8,920 | 83 5 | 36,630<br />

Participle | 212 | 0 | 401<br />

4.2 Not-encoded morpho-semantic relations<br />

The general observation is that not all existing derivative, derived, and especially<br />

participle links are marked in the Bulgarian and Serbian WordNets. The main reason<br />

lies in the language-specific character of word formation, given the fact<br />

that an exact correspondence with the PWN has mostly been followed in the expand<br />

WordNet model. As a result, a lot of language-specific derivational relations (that can<br />

be described in terms of derivative, derived, and participle relations) remain<br />

unexpressed in Bulgarian and Serbian WordNets. For example the literals from the<br />

Bulgarian synset {метален:1, металически:1} corresponding to the English synset<br />

{metallic:1, metal:1} with a definition: ‘containing or made of or resembling or<br />

characteristic of a metal’ are derived from the literal метал from the synset<br />

{метал:1, метален елемент} equal to the English synset {metallic element:1,<br />

metal:1} with a definition ‘any of several chemical elements that are usually shiny<br />

solids that conduct heat or electricity and can be formed into sheets etc’. Nevertheless,<br />

the corresponding derived relation is not encoded in the Bulgarian WordNet. Consider<br />

the following more complicated example. The literal пекар from the Bulgarian<br />

synset {пекар:1, хлебар:1, фурнаджия:1} (English equivalent {baker:2, bread<br />

maker:1} with a definition ’someone who bakes bread or cake’) is in a derivative<br />

relation with the literal пека from the synset {пека:1, опичам:1, опека:1,<br />

изпичам:1, изпека:1} (in English {bake:1} with a definition ‘cook and make edible<br />

by putting in a hot oven’). Moreover the second target literal хлебар is in a<br />

derivational relation with the source literal хляб from the synset {хляб:1} (in English<br />

{bread:1, breadstuff:1, staff of life:1} with a definition ‘food made from dough of<br />

flour or meal and usually raised with yeast or baking powder and then baked’), while<br />

the third one фурнаджия is in a derivational relation with the source literal фурна<br />

from {пекарница:1, фурна:2} (in PWN {bakery:1, bakeshop:1, bakehouse:1} with a<br />

definition ‘a workplace where baked goods (breads and cakes and pastries) are<br />

4 Without extralinguistic relations: category and region, and relation eng_derived.<br />

5 Includes relations: derived-pos, derived-vn, and derived-gender.



produced or sold’). None of the three existing derivational relations is encoded in the<br />

Bulgarian WordNet so far.<br />

In Serbian, for instance, the adjective synset {zamisliv:1} (English equivalent is<br />

{conceivable:2, imaginable:1, possible:3} with a definition ‘possible to conceive or<br />

imagine’) is not linked with the verbal synset {zamisliti:2y, koncipirati:1b} (in<br />

English {imagine:1, conceive of:1, ideate:1, envisage:1} with a definition ‘form a<br />

mental image of something that is not present or that is not the case’), although<br />

the relation derived, or some more specific one, would be appropriate.<br />

4.3 Language-specific morpho-semantic relations<br />

There are systematic morpho-semantic differences concerning derivational<br />

mechanisms between English and Slavic languages [7]. Some of the most productive<br />

derivational relations in Slavic languages are briefly presented here: namely verbal<br />

aspect pairs, gender pairs, and diminutives.<br />

4.3.1 Aspect pairs<br />

Verb aspect is a category that occurs in all Slavic languages, and its nature is very<br />

complex. Generally speaking, verb aspect in Slavic languages can be described<br />

as a relation between the action and its bound (limit), regardless of the person, speaker<br />

and speech act. Perfective verbs express integrity and completeness, while<br />

imperfective verbs express lack of integrity or a process (duration, recurrence). Each<br />

Slavic verb is either perfective or imperfective; there are a number of verbs that are<br />

bi-aspectual and act as both imperfective and perfective. Most verbs form strict pairs<br />

where perfective and imperfective members form a derivational relation between two<br />

lexemes expressing generally the same meaning. The Bulgarian verbs are classified<br />

as: imperfective (perfective correspondent exists), perfective (imperfective<br />

correspondent exists), bi-aspectual, imperfective tantum (perfective correspondent<br />

does not exist), perfective tantum (imperfective correspondent does not exist). In<br />

Bulgarian WordNet the aspect pairs are introduced in one and the same synset with an<br />

LNote (literal note) describing the respective aspect. For example {съчинявам:2<br />

LNOTE: imperfective, съчиня:2 LNOTE: perfective, пиша:4 LNOTE: imperfective,<br />

написвам:2 LNOTE: imperfective, напиша:2 LNOTE: perfective} (an equivalent of<br />

the English synset {write:1, compose:3, pen:1, indite:1} with a definition ‘produce a<br />

literary work’). Similarly, in the Serbian WordNet the aspect pairs are introduced in the<br />

same synset. For instance, in the synset {zamišljati:2x, zamisliti:2x, dočaravati:2x,<br />

dočarati:2x, predočavati:1, predočiti:1} (in English {visualize:1, visualise:3,<br />

envision:1, project:9, fancy:1, see:4, figure:3, picture:1, image:1} with a definition<br />

‘imagine; conceive of; see in one's mind’), the LNOTE element corresponding to each<br />

literal describes the inflectional and derivational properties of the verb, e.g. the LNOTE<br />

content for the imperfective verb zamišljati is V1+Imperf+Tr+Iref+Ref, while the<br />

LNOTE content for the perfective correspondent zamisliti is V162+Perf+Tr+Iref+Ref<br />

[6]. In most cases, however, perfective verbs derived from an imperfective verb by<br />

prefixation express a different meaning and are not in the same synset; for example, the<br />

perfective verb uraditi ‘do, perform’ and its imperfective correspondent raditi are<br />

not in the same synset.
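The LNOTE encoding shown above lends itself to simple machine processing. The sketch below (our own, assuming only the `+`-separated feature format visible in the examples from [6]) extracts the aspect feature of each literal in a synset:

```python
def aspect(lnote: str):
    """Extract the aspect feature from a +-separated LNOTE string."""
    feats = lnote.split("+")
    if "Imperf" in feats:
        return "imperfective"
    if "Perf" in feats:
        return "perfective"
    return None  # no aspect feature present

# Literal -> LNOTE, from the Serbian example above
synset = {
    "zamišljati": "V1+Imperf+Tr+Iref+Ref",
    "zamisliti": "V162+Perf+Tr+Iref+Ref",
}
by_aspect = {literal: aspect(note) for literal, note in synset.items()}
```

Pairing the imperfective and perfective members found this way would make the intra-synset aspect relation explicit at the literal level.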



4.3.2 Gender pairs<br />

Gender pairing is a systematic phenomenon in Slavic languages that displays a binary<br />

morpho-semantic opposition: male → female, and as a general rule there is no<br />

corresponding concept lexicalized in English. The derivation applies mainly to<br />

nouns expressing professional occupations, but also to female (or male)<br />

correspondents of nouns denoting representatives of animal species. For example,<br />

Bulgarian synset {преподавател:2, учител:1, инструктор:1} and Serbian synset<br />

{predavač:1} that correspond to the English {teacher:1, instructor:1} with a<br />

definition: ‘a person whose occupation is teaching’ have their female gender<br />

counterparts {преподавателка, учителка, инструкторка} and {predavačica} with a<br />

feasible definition ‘a female person whose occupation is teaching’.<br />

There are some exceptions where, as in English, one and the same word is used<br />

both for masculine and feminine in Bulgarian and Serbian, for example<br />

{президент:1}, which corresponds to the English synset {president:3} with a<br />

definition ‘the chief executive of a republic’; as a tendency, the masculine noun<br />

can be used to refer to females. Following the PWN practice, the female counterparts<br />

are encoded in Bulgarian and Serbian WordNets as hyponyms of the corresponding<br />

synset with the male counterpart. For example {актриса:1} (English equivalent<br />

{actress:1} with a definition ‘a female actor’) is a hyponym of {актьор:1, артист:}<br />

(corresponding to the English synset {actor:1, histrion:1, player:3, thespian:1, role<br />

player:2} expressing the meaning ‘a theatrical performer’). The introduction of a new<br />

relation describing the female–male opposition of nouns in Slavic<br />

languages might be foreseen, as has already been done for Serbian.<br />

4.3.3 Diminutives<br />

Diminutives are a standard derivational class for expressing concepts that relate to<br />

small things. Diminutives display a sort of morpho-semantic opposition, big →<br />

small; however, sometimes they may express an emotional attitude too. Thus the<br />

following cases can be found with diminutives: the standard relation big → small thing,<br />

consider {стол:1} corresponding to English {chair:1} with a meaning ‘a seat for one<br />

person, with a support for the back’ and {столче:1} with a feasible meaning ‘a little<br />

seat for one person, with a support for the back’; and a small thing toward which an emotional<br />

attitude is expressed. Also, the Serbian synset {lutka:1} that corresponds to the English<br />

{doll:1, dolly:3} with a meaning ‘a small replica of a person, used as a toy’ is related<br />

to {lutkica}, which has both a diminutive and a hypocoristic meaning. There might be<br />

some occasional cases when this kind of concept is lexicalized in English, {foal:1}<br />

with a definition: ‘a young horse’, {filly:1} with a definition: ‘a young female horse<br />

under the age of four’, but in general these concepts are expressed in English by<br />

phrases.<br />

For the moment, diminutives are included in the Bulgarian and Serbian WordNets<br />

only in the rare cases when the English equivalent is lexicalized. On the other hand,<br />

a diminutive (in some cases more than one lexeme) can be derived from almost<br />

every concrete noun. Consequently, a place for diminutives in the WordNet structure has<br />

to be provided.<br />



5 The nature of morpho-semantic (derivational) relations<br />

One of the most important features of the morpho-semantic relations is that, being<br />

derivational relations between literals (i.e. an assistant is a person that assists, a participant<br />

is a person that participates, etc.), they also express regular semantic oppositions<br />

holding between synsets [9]. The derivational relation linking assist and assistant<br />

from the respective synsets {help:1, assist:1, aid:1} ‘give help or assistance; be of<br />

service’ and {assistant:1, helper:1, help:4, supporter:3} ‘a person who contributes to<br />

the fulfillment of a need or furtherance of an effort or purpose’ implies a kind of<br />

semantic relation over synsets formulated in [10] as an agentive relation existing<br />

between an action and its agent.<br />

A given morpho-semantic relation may be realized by different derivational<br />

mechanisms. Consider the literals from the Bulgarian synset {певец:2, вокалист:1}<br />

(in English {singer:1, vocalist:1, vocalizer:2, vocaliser:2} with a definition ‘a person<br />

who sings’), the former one певец is derived with the suffix –ец from the literal пея<br />

constituting the synset {пея:1} (the English equivalent {sing:2} with a definition<br />

‘produce tones with the voice’}, while the second one вокалист is derived with the<br />

suffix –ист from the literal вокализирам belonging to the synset {вокализирам:1}<br />

(in English {vocalize:2, vocalise:1} with a definition ‘sing with one vowel’).<br />

On the other hand, different derivational mechanisms might correspond to different<br />

semantic relations. For example in Bulgarian, as well as in English the verb чeта<br />

from the synset {чета:3; прочитам:2; прочета:2} corresponding to the English<br />

synset {read:1} with a definition ‘interpret something that is written or printed’ has<br />

the following derivates among others:<br />

– the noun четене from the synset {четене:1} ↔ {reading:1}, with a definition:<br />

‘the cognitive process of understanding a written linguistic message’. The derivation<br />

transforms the verb into a verbal noun. The respective relation between synsets is<br />

formulated as an action relation in [10].<br />

– the noun читател from the synset {читател:1} ↔ {reader:1}, with a<br />

definition: ‘a person who enjoys reading’. The derivational relation links the source<br />

verb with a noun built by affixation. The respective relation between synsets<br />

expresses a property over the underlying action.<br />
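The two derivatives of чета suggest that the suffix itself can act as a (fallible) formal pointer to the semantic relation. A hypothetical labeling table for just these two Bulgarian patterns might look like the following (real resources would need many more patterns, and every proposed link would still require manual validation):

```python
# Hypothetical suffix -> semantic-relation table (illustrative only)
SUFFIX_RELATION = {
    "ене": "action",    # чета -> четене: verbal noun, the action relation of [10]
    "тел": "property",  # чета -> читател: a property over the underlying action
}

def propose_relation(noun: str):
    """Guess the synset-level semantic relation from the noun's suffix."""
    for suffix, relation in SUFFIX_RELATION.items():
        if noun.endswith(suffix):
            return relation
    return None  # no known derivational suffix
```

The sense-ambiguity problem discussed next is exactly where such a purely formal pointer breaks down: the same suffix on graphically identical literals can signal different oppositions.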

In some cases, when the source literal has more than one meaning, the exact<br />

correspondences with the derivatives can be traced. Consider the verb чeта from the<br />

synset {чета:1, прочитам:1, прочета:1} equivalent with the English synset {read:3}<br />

with a definition ‘look at, interpret, and say out loud something that is written or<br />

printed’. Its verbal noun derivative четене from the synset {четене:1; поетическо<br />

четене:1; рецитал:1} (in English {recitation:2, recital:3, reading:7}) expresses a<br />

meaning related to the meaning of the source: ‘a public instance of reciting<br />

or repeating (from memory) something prepared in advance’. As the source has derivative<br />

counterparts in two different synsets (equivalent to {read:1} and {read:3}), this<br />

presupposes a corresponding difference in the meanings of the resulting derivatives.<br />

Thus the same derivational mechanism might indicate different semantic<br />

oppositions if it targets graphically equivalent literals expressing different meanings<br />

(the observed difference in the semantic oppositions remains undistinguished). It is<br />

natural that the synsets {read:1} and {read:3} are related with a verb group relation.



The semantic part of the morpho-semantic relations is not language-specific;<br />

language-specific are the derivational mechanisms of lexicalization. There are several<br />

English derivatives of the literal paint from {paint:3} with a definition ‘make a<br />

painting of':<br />

En 1. {paint:1} – ‘a substance used as a coating to protect or decorate a surface<br />

(especially a mixture of pigment suspended in a liquid); dries to form a hard coating’<br />

En 2. {painter:1} – ‘an artist who paints’<br />

En 3. {painting:1, picture:2} – ‘graphic art consisting of an artistic composition<br />

made by applying paints to a surface’<br />

En 4. {painting:2} – ‘creating a picture with paints’<br />

None of the corresponding Bulgarian equivalents:<br />

Bg 1. {боя:2}<br />

Bg 2. {живописец:1, художник:1}<br />

Bg 3. {картина:3}<br />

Bg 4. {живопис:1}<br />

are derivatives of the Bulgarian synset equivalent to {paint:3} – {рисувам:2;<br />

нарисувам:2}. Nevertheless the same semantic oppositions exist in Bulgarian<br />

although they are not marked with any semantic or morpho-semantic relations.<br />

In Serbian, some of the synsets related to {naslikati:1 LNOTE:<br />

V101+Perf+Tr+Iref} (equal to {paint:3}) include derivatives, while the others do not<br />

(e.g. {boja:2x, farba:1x}). The derivative relation is transferred from English to<br />

Serbian WordNet, but the name of the relation has not been changed in order to<br />

indicate that the origin of the relation is English, and that it may hold for Serbian but<br />

need not, as shown by the same example.<br />

Sr 1. {boja:2x, farba:1x}<br />

Sr 2. {slikar:1}<br />

Sr 3. {slika:1}<br />

Sr 4. {slikarstvo:1}<br />

This means that the derivational relations in a particular language might be<br />

successfully used not only for detecting a given semantic opposition; moreover,<br />

they can be exploited for the identification of the corresponding semantic relations in<br />

other languages where lexicalization is expressed by different mechanisms. Thus we<br />

have to make a clear distinction between derivation as a literal relation (asymmetric,<br />

inverse, and intransitive) and the semantic oppositions between synsets for which the<br />

derivation itself might be a formal pointer.<br />
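The paint example suggests a transfer procedure, sketched below under the simplifying assumption that synsets are identified by their English equivalents (the "agent" label is our own illustrative choice): a semantic opposition detected via derivation in English is carried over to the Bulgarian synset pair even though no Bulgarian derivational link exists.

```python
# English synset pairs linked because their literals are derivationally related
EN_SEMANTIC_LINKS = [("paint:3", "painter:1", "agent")]

# Cross-lingual equivalence: English synset id -> Bulgarian synset
EN_TO_BG = {
    "paint:3": "{рисувам:2, нарисувам:2}",
    "painter:1": "{живописец:1, художник:1}",
}

def transfer(links, mapping):
    """Project semantic links onto another language via synset equivalence.
    The projected relation is purely semantic: the target literals need not
    be derivationally related (рисувам and живописец are not)."""
    return [(mapping[a], mapping[b], rel)
            for a, b, rel in links
            if a in mapping and b in mapping]

bg_links = transfer(EN_SEMANTIC_LINKS, EN_TO_BG)
```

This is the separation argued for in the text: derivation stays a literal-level pointer in one language, while the inferred semantic opposition travels across the equivalence links.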

6 Approaches to cover Slavic specific derivations in WordNet<br />

There are several possible approaches for covering different lexicalizations resulting<br />

from derivation in different languages [7], [11]:<br />

– to treat them as denoting specific concepts and to define appropriate synsets<br />

(gender pairs in Bulgarian and Serbian; relative adjectives in Bulgarian and Serbian);<br />

– to include them in the synset with the word they were derived from (verb aspect<br />

in Bulgarian and in most of the cases in Serbian);<br />

– to omit their explicit mentioning (diminutives in Bulgarian);



– to provide source literals with an inflectional–derivational description that encompasses these<br />

phenomena as well.<br />

Treating morpho-semantic relations such as verb aspect, relative adjectives, gender<br />

pairs and diminutives among others in Slavic languages as relations that involve<br />

language-specific concepts requires an ILI addition for the languages where the<br />

concepts are present (and, respectively, lexical gaps in the rest). This solution is<br />

grounded in the following observations:<br />

– Verb aspect pairs, relative adjectives, feminine gender pairs and diminutives<br />

denote a unique concept;<br />

– Verb aspect pairs, relative adjectives, feminine gender pairs and diminutives are<br />

lexicalized with a separate word in Bulgarian, Serbian, Czech and other Slavonic<br />

languages;<br />

– Relative adjectives, feminine gender pairs and diminutives in most cases<br />

belong to a different category or a different inflectional class compared to the word from<br />

which they are derived (there are some exceptions regarding category,<br />

such as diminutives that are derived from neuter nouns in Bulgarian).<br />

Although the new WordNets do not yet compare with PWN’s coverage, the former<br />

are continuously extended and improved, so that a balanced global multilingual<br />

WordNet is foreseen. For that reason, the task of properly encoding different levels of<br />

lexicalization in different languages is becoming more and more important in<br />

view of the various Natural Language Processing tasks. The Slavic languages possess<br />

a rich derivational morphology which has to be incorporated into the strict one-to-one<br />

mapping with the ILI.<br />

7 Automatic building of derivational relations in Bulgarian and<br />

Serbian<br />

The derivational relations for literals that already exist in WordNet can be interpreted<br />

in terms of derivational morphology, e.g., the noun teacher is derived from the verb<br />

teach and so on. WordNet already contains a lot of words that are produced by the<br />

derivational morphology rules: verbal nouns are linked with verbs, etc. In order to<br />

make explicit the morpho-semantic relations that already exist, it would be necessary<br />

to include more links. On the other hand, special attention has to be paid to the<br />

language specific derivational relations (some of them valid for big language families<br />

as Slavic languages). Several problems can be formulated following the observations<br />

and analyses presented in this study:<br />

It is necessary to distinguish pure derivation from the semantic relations whose<br />

meaning is presupposed by the derivation itself. Concerning the Bulgarian and Serbian<br />

WordNets this will be reflected particularly in the proper encoding of the derivational<br />

links between exact literals as it has been done in PWN; in the identification of<br />

derivational relations between literals already encoded in WordNets (comparing with<br />

PWN or exploiting language-specific derivational models), and in introducing<br />

language-specific derivations at their appropriate place in the WordNet structure,<br />

providing the exact correspondence with other languages.
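The literal-level encoding of derivational links described above can be sketched as a small data structure. This is an illustrative sketch only, not the actual Bulgarian/Serbian WordNet data model; the class names, relation label, and synset identifiers are made-up placeholders.

```python
# Sketch: a derivational link attached to individual literals rather than to
# whole synsets, as proposed above (cf. teacher <- teach in PWN).
# Synset identifiers here are placeholders, not real WordNet IDs.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Literal:
    lemma: str
    synset_id: str                                   # placeholder identifier
    derived_from: List[Tuple[str, str]] = field(default_factory=list)

teach = Literal("teach", "v-001")
teacher = Literal("teacher", "n-001",
                  derived_from=[("teach", "agentive")])

def derivational_sources(literal: Literal) -> List[str]:
    """Lemmas this literal is derivationally linked to."""
    return [lemma for lemma, _relation in literal.derived_from]

print(derivational_sources(teacher))  # ['teach']
```

Storing the relation on the literal rather than the synset keeps the link between the exact word pair, which is what the exact-literal encoding discussed above requires.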


250 Svetla Koeva, Cvetana Krstev, and Duško Vitas<br />

More generally, a theoretical investigation is needed to describe the nature of<br />

the semantic relations to which derivations are formal pointers. Once a consistent<br />

classification is provided, the respective semantic relations might be identified in the<br />

WordNets on the basis of the derivational ones in a particular language.<br />

Several tasks may be done semi-automatically: to link literals instead of synsets<br />

with derivational relations; and to identify synsets where the potentially derivationally<br />

related literals appear. Below we provide some observations on why complete<br />

automation is not appropriate, although the derivational regularities are in most<br />

cases well established.<br />

Although derivation is in many cases regular in the sense that it yields predictable<br />

results, it cannot be freely used for generation since it can lead to over-generation;<br />

namely, one could generate something which exists in a language system but does not<br />

exist in language usage. For instance, in Bulgarian and Serbian an abstract noun can<br />

be regularly derived (with a suffix –ост; –ost) from a descriptive adjective X meaning<br />

‘the quality of something that has the characteristic X’, and a prefix ( –не, –ne; –без,<br />

–bez, etc.) can be used to produce both the adjective and a noun with the opposite<br />

meaning. One such example in Serbian is osećajan ‘be able to respond to affective<br />

changes’ → osećajnost ‘the ability to respond to affective changes’ → bezosećajan<br />

‘not being able to respond to affective changes’ → bezosećajnost ‘the inability to<br />

respond to affective changes’. However, if the same pattern is applied to the adjective<br />

sličan ‘marked by correspondence or resemblance’ → sličnost ‘the quality of being<br />

similar’ → ?nesličan ‘not similar’ → ?nesličnost ‘the quality of being dissimilar’, the<br />

last two lexemes in the sequence, though easily understood, are not lexicalized.<br />
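The over-generation problem just described can be illustrated with a short sketch that applies the regular pattern and checks the candidates against an attested-word list. The `ATTESTED` set is a toy stand-in for a real dictionary or corpus lookup, and the suffix rule is a simplification of the pattern in the text.

```python
# Sketch of over-generation: the regular Serbian pattern adjective -> noun in
# -ost, plus ne- negation, yields forms that are possible in the language
# system but not all attested in usage. ATTESTED is a toy lexicon.

def derive_family(adjective: str) -> list:
    """Apply the regular pattern: X -> X-ost, ne-X, ne-X-ost."""
    # adjectives in -an drop the 'a' before -ost (sličan -> sličnost)
    noun = adjective[:-2] + "nost" if adjective.endswith("an") else adjective + "ost"
    return [adjective, noun, "ne" + adjective, "ne" + noun]

ATTESTED = {"sličan", "sličnost"}  # toy stand-in for a dictionary check

candidates = derive_family("sličan")
print([(word, word in ATTESTED) for word in candidates])
# nesličan and nesličnost are system-possible but flagged as unattested
```

A filter of this kind is exactly why the insertion must remain semi-automatic: the generator cannot know on its own which outputs exist in language usage.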

In the context of WordNet production it is not sufficient to produce new synsets,<br />

for instance by applying the regular derivational mechanisms. It is equally important<br />

to place the generated synsets in the already existing network consisting of various<br />

relations. For instance, in Serbian the nouns sposobnost, vidljivost and popustljivost<br />

are regularly generated from the adjectives sposoban ‘having the necessary means or<br />

skill to do something’, vidljiv ‘having the characteristics that make it visible’ and<br />

popustljiv ‘easily managed or controlled’. However, the produced nouns have three<br />

different hypernyms: osnovna karakteristika {quality:1}, svojstvo {property:3}, and<br />

osobina {trait:1}. The correct placement of newly generated synsets in an existent<br />

network is not straightforward.<br />

It has been noted (in section 5) that many senses of some words are distinguished<br />

by their different derivational capabilities. For instance, the Serbian verb polaziti has five<br />

different meanings according to the Serbian explanatory dictionary, and one<br />

submeaning of the second presented meaning is ‘to go somewhere regularly and often<br />

to perform some duty’. That meaning is the only one from which the noun polaznik<br />

‘someone who attends a school or a course’ can be derived by the agentive relation<br />

(realized by a suffix –ik).<br />
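The polaziti/polaznik case shows that derivational relations must attach to individual senses rather than to the lemma as a whole. A minimal sketch of such sense-level bookkeeping follows; the sense inventory is a simplified illustration, not the actual dictionary entry or HuWN/SerNet format.

```python
# Sketch: attaching a derivational relation to one specific sense, since only
# one sense of polaziti licenses the agentive noun polaznik (suffix -ik).
# Sense keys and glosses below are illustrative simplifications.

SENSES = {
    ("polaziti", "2a"): {
        "gloss": "to go somewhere regularly and often to perform some duty",
        "derivatives": {"agentive": "polaznik"},
    },
    ("polaziti", "1"): {
        "gloss": "(another sense, with no agentive derivative)",
        "derivatives": {},
    },
}

def agent_noun(lemma: str, sense: str):
    """Agentive derivative of a specific sense, or None if not lexicalized."""
    return SENSES.get((lemma, sense), {}).get("derivatives", {}).get("agentive")

print(agent_noun("polaziti", "2a"))  # polaznik
print(agent_noun("polaziti", "1"))   # None
</n```

Keying the derivative on the (lemma, sense) pair, rather than the lemma alone, prevents an automatic tool from wrongly propagating polaznik to the other four senses.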

It has already been stated in [8] that even derivation that seems very predictable<br />

can show very unpredictable behavior. Some derivational mechanisms in Bulgarian<br />

and Serbian are very predictable, such as the production of possessive adjectives<br />

from (mostly) animate nouns. As a consequence, possessive adjectives are<br />

not listed in traditional Serbian dictionaries. The production of verbal nouns from<br />

imperfective verbs is also regular and yields a predictable meaning, ‘the act of doing<br />

something’. The verbal nouns are, however, listed as separate entries in Bulgarian and


Morpho-semantic Relations in WordNet – a Case Study… 251<br />

Serbian dictionaries. Besides the predicted meaning they often acquire an additional<br />

meaning. For instance, the verbal nouns учение in Bulgarian, učenje in Serbian and<br />

pečenje in Serbian are derived from imperfective verbs уча, učiti ‘to study’ and peći<br />

‘to roast’. Besides the predicted meanings ‘the act of studying’ and ‘the act of<br />

roasting’ they have acquired in Serbian the additional meanings ‘doctrine’ and ‘roast<br />

meat’, respectively. In the case of other derivational mechanisms it can be more<br />

difficult to establish the meaning of the derived word. For instance, adjectives pričljiv<br />

and čitljiv in Serbian are derived respectively from the verbs pričati ‘to talk’ and<br />

čitati ‘to read’ using the same suffix –iv. Both verbs are imperfective and can be used<br />

both as transitive and intransitive: Marko priča priču ‘Marko tells the story’, Marko<br />

puno priča ‘Marko speaks a lot’, Puno ljudi čita knjigu ‘A lot of people read the<br />

book’, Marko puno čita ‘Marko reads a lot’. The meaning of the adjective pričljiv is<br />

derived from the intransitive usage of the verb (namely, Marko puno priča implies<br />

Marko je pričljiv ‘Marko is talkative’), while the adjective čitljiv is derived from the<br />

transitive usage (here Puno ljudi čita knjigu implies Knjiga je čitljiva ‘The book is<br />

easy to read’).<br />

The complexity of the issue of automation is best illustrated by the derivation of<br />

gender pairs in Serbian, since they exhibit all the previously mentioned problems. If<br />

we consider the derivation of female counterparts for the nouns of professions we<br />

encounter the following situations:<br />

- The female counterpart morphologically does not exist: for instance, sudija<br />

‘judge’ is therefore used for both men and women;<br />

- The female counterpart morphologically exists but is never used: vojnik<br />

‘soldier’ vs. *vojnica and žena vojnik ‘(woman) soldier’;<br />

- The female counterpart exists and is exclusively used for women performing<br />

that profession or function: kelner ‘waiter’ and kelnerica ‘waitress’;<br />

- The female counterpart exists but the male noun is also sometimes used for<br />

women: profesor ‘(man or woman) professor’ and profesorka ‘(woman) professor’;<br />

- The female counterpart exists but does not mean quite the same as the noun it<br />

was derived from: sekretar ‘secretary’ is treated as someone performing a highly<br />

responsible function, as opposed to sekretarica ‘(woman) secretary’, who<br />

performs low-level tasks in an organization;<br />

- The female counterpart exists but it has acquired a different meaning, so it is<br />

not used to denote a woman performing a certain function: saobraćajac ‘traffic cop’ vs.<br />

saobraćajka ‘car accident’.<br />
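The six situations just listed can be recorded so that a semi-automatic tool knows which candidate links are safe and which need a human validator. The entries and status labels below paraphrase the examples in the text; this is an illustrative sketch, not an actual resource.

```python
# Sketch: gender-pair situations as a lookup that a semi-automatic linking
# tool could consult. Only the clear-cut "exclusively used" case is safe to
# link without manual inspection; everything else goes to a human validator.

GENDER_PAIRS = {
    "sudija":      (None,          "no morphological counterpart"),
    "vojnik":      ("*vojnica",    "form possible but never used"),
    "kelner":      ("kelnerica",   "exclusively used for women"),
    "profesor":    ("profesorka",  "male noun also used for women"),
    "sekretar":    ("sekretarica", "counterpart differs in meaning"),
    "saobraćajac": ("saobraćajka", "counterpart has a different meaning entirely"),
}

def safe_to_link(male_noun: str) -> bool:
    """True only for the unambiguous case; all others need manual checking."""
    _female, note = GENDER_PAIRS[male_noun]
    return note == "exclusively used for women"

print([noun for noun in GENDER_PAIRS if safe_to_link(noun)])  # ['kelner']
```

That only one of the six situations survives the filter is the point of the section: gender-pair derivation exhibits every problem that blocks full automation.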

8 Conclusions and future work<br />

We have briefly presented the current stage of the encoding of morpho-semantic<br />

relations in the Bulgarian and Serbian WordNets. Building on the derivational<br />

properties of Slavic languages, we provided some observations on the complex<br />

nature of morpho-semantic relations and presented some examples demonstrating the<br />

negative consequences of a purely automatic insertion of Slavic derivational<br />

relations into the WordNet structure. We believe we have added further evidence<br />

supporting the approach presented in [9], namely the utilization of a semi-automatic



identification or insertion of morpho-semantic relations. Such an approach would<br />

significantly facilitate WordNet development, although manual confirmation on<br />

the basis of the automatically produced lists of suggested pairs has to be provided.<br />

Further development of both the Bulgarian and Serbian WordNets is closely<br />

connected with an investigation into the theoretical grounds of the nature of<br />

morpho-semantic relations. At the first stage, the encoding of derivational relations<br />

between exact literals instead of synsets is foreseen. Another important task is the<br />

introduction of Slavic language-specific derivations in a uniform way, providing at the<br />

same time ILI correspondences. The accomplishment of these tasks will also be reflected in<br />

the successful implementation of approaches based on cross-lingual information<br />

extraction, retrieval, and data mining, multilingual summarization, machine<br />

translation, etc.<br />

References<br />

1. Bilgin, O., Çetinoğlu, Ö., Oflazer, K.: Morphosemantic Relations In and Across WordNets<br />

– A Study Based on Turkish. In: Sojka, P., et al. (eds.) Proceedings of the Global WordNet<br />

Conference, pp. 60–66. Brno (2004)<br />

2. Christodoulakis, D. (ed.): Design and Development of a Multilingual Balkan WordNet<br />

(BalkaNet IST-2000-29388) – Final Report. (2004)<br />

3. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge,<br />

Mass. (1998)<br />

4. Koeva, S., Tinchev, T., Mihov, S.: Bulgarian WordNet – Structure and Validation. J.<br />

Romanian Journal of Information Science and Technology 7(1-2), 61–78 (2004)<br />

5. Koeva, S.: Bulgarian WordNet – development and perspectives. In: International Conference<br />

Cognitive Modeling in Linguistics, 4–11 September 2005, Varna (2005)<br />

6. Krstev, C., Vitas, D., Stanković, R., Obradović, I., Pavlović-Lažetić, G.: Combining<br />

Heterogeneous Lexical Resources. In: Proceedings of the Fourth International Conference<br />

on Language Resources and Evaluation, vol. 4, pp. 1103-1106. Lisbon, May 2004 (2004)<br />

7. Krstev, C., Koeva, S., Vitas, D.: Towards the Global WordNet. In: First International<br />

Conference of Digital Humanities Organizations (ADHO) Digital Humanities 2006, pp.<br />

114–117. Paris-Sorbonne, 5-9 July 2006 (2006)<br />

8. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K. J.: Five Papers on WordNet. J.<br />

Special Issue of International Journal of Lexicography 3(4) (1990)<br />

9. Miller, G. A., Fellbaum, C.: Morphosemantic links in WordNet. J. Traitement automatique<br />

des langues 44(2), 69–80 (2003)<br />

10. Pala, K., Hlavačková, D.: Derivational Relations in Czech WordNet. In: Proceedings of the<br />

Workshop on Balto-Slavonic Natural Language Processing, ACL, Prague, pp. 75–81 (2007)<br />

11. Stamou S., Oflazer, K., Pala, K., Christoudoulakis, D., Cristea, D., Tufis, D., Koeva, S.,<br />

Totkov, G., Dutoit, D., Grigoriadou, M.: BALKANET: A Multilingual Semantic Network<br />

for the Balkan Languages. In: Proceedings of the International WordNet Conference, pp.<br />

12–14. Mysore, India, 21-25 January 2002 (2002)<br />

12. Vitas, D., Krstev, C.: Regular derivation and synonymy in an e-dictionary of Serbian. J.<br />

Archives of Control Sciences, Polish Academy of Sciences 51(3), 469–480 (2005)<br />

13. Vitas, D.: Morphologie dérivationnelle et mots simples: Le cas du serbo-croate, In:<br />

Lingvisticae Investigationes Supplementa 24 (Lexique, Syntaxe et Lexique-Grammaire /<br />

Syntax, Lexis & Lexicon-Grammar - Papers in honour of Maurice Gross), pp. 629–640.<br />

John Benjamin Publ. Comp. (2004)



14. Vitas, D., Krstev, C.: Restructuring Lemma in a Dictionary of Serbian. In: Erjavec, T.,<br />

Žganec Gros, J. (eds.) Informacijska družba IS 2004 – Jezikovne tehnologije,<br />

Institut "Jožef Stefan", Ljubljana (2004)<br />

15. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks for<br />

European Languages. Kluwer Academic Publishers, Dordrecht (1999)


Language Independent and Language Dependent<br />

Innovations in the Hungarian WordNet<br />

Judit Kuti, Károly Varasdi, Ágnes Gyarmati, and Péter Vajda<br />

Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest<br />

1068 Budapest, Benczúr u. 33.<br />

{kutij,varasdi,aagnes,vajda}@nytud.hu<br />

Abstract. In this paper we present innovations that proved to be useful during<br />

the development of the Hungarian WordNet (HuWN). Some of these are<br />

language-independent but part-of-speech-related expansions of the structure,<br />

which hopefully serve the accuracy of representation, as in the case of new<br />

relation types in the adjectival WordNet. Others seemed necessary because of<br />

the restricted applicability of the expand method to building the WordNet of a<br />

language typologically different from that of the model WordNet, English. The<br />

Hungarian system of preverbs called for an expansion of the verbal structure in<br />

a way that fits the characteristics of Hungarian verbs expressing aspect and<br />

Aktionsart. Treating verbs as eventualities and using the notion of nucleus<br />

introduced by Moens&Steedman, we classified some of the Hungarian verbal<br />

synsets according to aspectual types and introduced new relations to the<br />

WordNet. This enabled us to represent both linguistically and<br />

psycholinguistically relevant pieces of information in the network.<br />

Keywords: WordNet, verb, event structure, event ontology, aspect, Aktionsart<br />

1 Introduction – the Hungarian WordNet<br />

In the present paper we examine some of the problems encountered during the<br />

development of the Hungarian WordNet (HuWN 1 ), which, due to their language- or<br />

part-of-speech-specific nature, called for different, alternative solutions from<br />

the ones offered by the WordNets 2 serving as models for HuWN. The development<br />

of HuWN started along the lines of the general principle of the expand model 3 .<br />

However, this methodology presupposes the adaptability of a “source-WordNet”<br />

structure to another database of this kind, and with this, the basic compatibility of the<br />

internal semantic net of the lexicalised concepts in the two languages. It is,<br />

accordingly, inevitable when dealing with typologically different languages that some<br />

1<br />

The basic model for the HuWN was the Princeton WordNet, but we have used some ideas of<br />

the GermaNet database, as well.<br />

2<br />

When talking about a specific WordNet of a given language, we refer to it with the<br />

widespread, trademark-like spelling, using capital 'W' and 'N', while when referring to the<br />

database type as to a common noun we use lower case letters.<br />

3<br />

The term was introduced by Vossen, [10] p.53.



complementary methodological steps be added or certain modifications be carried out<br />

in order to do justice to the linguistic characteristics of the “target” language. One of<br />

the most relevant characteristics of Hungarian, as opposed to English, from the point<br />

of view of building a WordNet, is that, through preverbs, the verb contains much<br />

information related to aspect and Aktionsart. When building a verbal semantic<br />

network, it is, thus, essential to examine the event structure of verbs. Accordingly, the<br />

first part of our paper shows the ways we have worked out for storing and representing<br />

certain semantic relations stemming from the event structure of verbs – within the<br />

framework facilitated by WordNet as a genre. In the second part of the paper we<br />

present new relation types introduced to the Hungarian WordNet on the basis of more<br />

general – not language-dependent – considerations. They concern the adjectival<br />

WordNet, and are meant to be alternative suggestions for the representation of some<br />

atypical adjective-clusters.<br />

2 Language-dependent factors in structuring HuWN<br />

As WordNets were originally designed to describe the hierarchical structure of nouns,<br />

and it is nouns that constitute a preponderant part of existing WordNets, one has to<br />

pay special attention to representing verbal relations in the given framework as<br />

accurately as possible. For example, the choice of two distinct names for relation<br />

types that can be considered equivalent − troponymy for verbs vs hyponymy for<br />

nouns − in the two respective parts of speech in PWN already indicates that a<br />

meaning representation framework for verbs cannot be solely designed on the basis of<br />

the existing grounds for a nominal hierarchy, not even in the case of a language like<br />

English, in which verbs as lexical units bear little or no information related to aspect<br />

or Aktionsart. With this in mind, in this section we first present some fundamental<br />

statements on event structure and aspectuality of verbs, and an elementary event-structure<br />

called nucleus introduced by Moens&Steedman. Subsequently, we hope to<br />

show that by using the notion of nucleus we acquire a means that enables us to<br />

1. incorporate lexicalised meanings into WordNet more easily than was possible<br />

previously<br />

2. represent psycholinguistically relevant pieces of information that were so far<br />

missing from the Hungarian WordNet<br />

3. store information that proves to be useful for computational linguistic<br />

applications of the HuWN.



2.1 Eventualities and their aspectual properties<br />

2.1.1 Logical implication between verbal meanings<br />

It is necessary to examine in what way the relation of logical implication holds<br />

between verbs 4 since this is what both the relations troponymy and hyponymy are<br />

based on. The propositions implied by a sentence are highly dependent on its aspect,<br />

as illustrated by the following examples:<br />

1. Mari éppen ment át az utca túloldalára, amikor megpillantotta Jánost.<br />

'Mary was crossing the street when she saw John.'<br />

2. Mari átment az utca túloldalára, amikor megpillantotta Jánost.<br />

'Mary crossed the street when she saw John.'<br />

While sentence (1) does not imply that Mary actually crossed the street − she might<br />

have turned back to greet John −, sentence (2) does imply that Mary did finish crossing<br />

the street (moreover, the pragmatic implicature suggesting that Mary crossed the<br />

street because she had noticed John, is also present).<br />

The difference between the two main clauses in Hungarian is merely aspectual: the<br />

first one is in the progressive, while the second one is in the perfective aspect, each<br />

possessing a different logical potential. 5 It is, thus, obvious that the question<br />

of which implications the preverb and verb as a whole can take part in is not<br />

separable from its aspectual value in the sentence. Although in Hungarian the actual<br />

aspect of a sentence is of course determined by many factors in the sentence besides<br />

the verb, its aspectual potential − as well as the sentences it can imply − is largely<br />

determined by the event structure of the verb.<br />

In Hungarian some preverbs can bear information related to both aspect and<br />

Aktionsart. This alone might make Hungarian seem to be similar to Slavic languages.<br />

However, on the one hand, Hungarian does not express aspect in as predictable a<br />

manner as, e.g., Russian, whose WordNet we could have used as a basis for the<br />

Hungarian one, if the two languages had had enough similarities. On the other hand,<br />

aspect and Aktionsart in Hungarian are interwoven in a way that is unique among the<br />

languages for which a WordNet has been developed so far. The perfective aspect, for<br />

example, goes almost always hand in hand with one of the Aktionsart-types that are<br />

present in Hungarian (see [4], p.45.).<br />

Furthermore, Hungarian has an extremely rich system of preverbs which can<br />

modify the meaning of the verb, making it inevitable, when dealing with Hungarian, to<br />

consider aspectual characteristics as much as possible within the given framework. As<br />

already mentioned, the basic verbal relations, hypo- and hypernymy as well as<br />

troponymy, were elaborated based on the pattern of nominal relations, meaning that<br />

the WordNet-methodology requires that semantic relations between morphemes hold<br />

4<br />

When talking about aspectual properties of verbs, we should, in fact, be talking about verbal<br />

phrases, since verbs on their own are underspecified with respect to this kind of information,<br />

see [9].<br />

5<br />

The above phenomenon is known as the imperfective paradox, see [2].



through logical implications. While in the case of nouns one can show that N1 is a<br />

hyponym of N2 − by checking whether the pattern ''it is true for each X that if X is an<br />

N1, then X is an N2'' holds −, this is not possible for verbs, since one can only<br />

establish logical relations between propositions or the sentences expressing them, but<br />

the logical structure of sentences is determined by verbs together with their modifiers<br />

and complements. However, the verb-complement relation is highly asymmetrical:<br />

the logical potential of the sentence is determined by the verb; complements are only<br />

more or less passive participants. 6 As the PWN, which has served as a basis for<br />

HuWN, does not contain aspectual information due to the lack of morphological<br />

marking of aspect in English, another way had to be found for representing typically<br />

occurring phenomena related to aspect in Hungarian.<br />
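The nominal hyponymy test quoted above ("it is true for each X that if X is an N1, then X is an N2") can be sketched as a set-inclusion check over toy extensions. The nouns and individuals below are invented for illustration; in a WordNet the relation is of course encoded directly rather than computed from extensions.

```python
# Sketch of the hyponymy test: N1 is a hyponym of N2 iff every X that is an
# N1 is also an N2, i.e. the extension of N1 is a subset of that of N2.
# Toy extensions; real WordNets store the relation as explicit links.

EXTENSION = {
    "poodle": {"Fido", "Rex"},
    "dog":    {"Fido", "Rex", "Lassie"},
    "animal": {"Fido", "Rex", "Lassie", "Tom"},
}

def is_hyponym(n1: str, n2: str) -> bool:
    """True if each X that is an N1 is also an N2 (set inclusion)."""
    return EXTENSION[n1] <= EXTENSION[n2]

print(is_hyponym("poodle", "dog"))   # True
print(is_hyponym("dog", "poodle"))   # False
```

The sketch also makes the paper's point concrete: no comparable entity-level test exists for verbs, because their implications are only defined for whole sentences.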

A first framework for approaching aspect in general is provided by Zeno Vendler<br />

[8], who developed a type of event ontology in a way that would become useful for<br />

linguistic theory. This system was later elaborated by Emmon Bach [1], and worked<br />

out for computational linguistics by Marc Moens and Mark Steedman [6]. Drawing on<br />

Moens&Steedman's work we would like to suggest a way to structure aspectually<br />

related verb meanings in WordNet.<br />

2.1.2 Aspectual classes according to Vendler and Bach<br />

Vendler's classification of eventualities 7 distinguishes between four aspectual classes<br />

according to the internal temporal structure of the event expressed by the verb.<br />

Together with their arguments and with context the four event types according to<br />

Vendler may take different aspects: activities (e.g. swim) typically take the<br />

progressive aspect, accomplishments (go out of the room) take both the progressive<br />

and perfective aspect, and achievements (blow up) take the perfective aspect. States<br />

take neither the progressive nor the perfective aspect. The classification as further<br />

developed and extended by Bach represents aspectual categories in a binary system,<br />

highlighting the existence of point expressions that are different from achievements<br />

(e.g. click). In Bach's terminology Vendler's accomplishments are called protracted<br />

events, achievements are called culminations, while point expressions are called<br />

happenings.<br />

Vendler's four aspectual classes are also characterised by whether the interval of<br />

the event is divisible or not – i.e. whether the eventuality denoted by the verb holds<br />

for most of the subintervals, as well. Accordingly, of the four aspectual classes<br />

activities and states may be considered homogeneous eventualities, since they are<br />

expressed by predicates any sub-intervals of which may be described by the very<br />

same predicates.<br />

6<br />

In the case of verbs with direct object complements it is also the direct object that takes part<br />

in determining the aspect of the sentence. However, the impact of the direct object on the<br />

aspect can be relatively well predicted from the event structure of the verb and the properties<br />

of the object, so we do not have to specifically deal with this in the framework of the<br />

WordNet.<br />

7<br />

We are using the term eventuality after Bach, see [1].



Fig. 1. Classification of eventualities according to Bach<br />

Accomplishments and achievements, on the other hand, are in this respect coherent<br />

units of different kinds of heterogeneous event-components. Point expressions are<br />

also taken to be non-complex eventualities. From the point of view of constructing the<br />

Hungarian verbal WordNet it is the representation of complex eventualities –<br />

achievements and accomplishments – in a way that does justice to their aspectual<br />

properties that is of the greatest interest to us. One may interpret their complexity with<br />

the help of the so-called nucleus-structure introduced by Moens&Steedman.<br />

2.1.3 The event-nucleus of Moens&Steedman<br />

Moens&Steedman introduce a classification of eventualities relying on but further<br />

refining Vendler's aspectual classes. Their central notion is that of an event-nucleus,<br />

which might be called a tripartite structure or triad, as well. The reason for the latter<br />

name is that an idealised eventuality consists of potentially three components<br />

belonging together: preparatory phase, telos/culmination and consequent state.<br />

Fig.2. The event-nucleus of Moens&Steedman<br />

One may also represent the triad as an ordered triple < a; b; c > where<br />

a=preparatory phase, b=telos and c=consequent state. Moens&Steedman place this<br />

idealised event-unit beyond the level of linguistically manifested lexicalised<br />

meanings. The components of the event-nucleus are thus filled with meta-linguistic<br />

and not with lexicalised linguistic elements. 8 Treating the three nucleus-components 9<br />

as a unit may be justified as follows. When testing a lexicalised expression with<br />

linguistic tests sensitive to aspectual properties (in Hungarian the tests of the<br />

progressive and the perfective) the co-occurrence of no more than the three<br />

8<br />

Since we may only refer to these with linguistic elements, we will use small capitals so that<br />

they can be distinguished from italicised linguistic elements.<br />

9<br />

Here we are dealing with the event-components irrespective of whether they are lexicalised<br />

or not.



components outlined above may be shown. Since we are examining eventualities<br />

from an aspectual point of view, this fact must be considered relevant. We may, thus,<br />

acquire information about the aspectual properties of a verb expressing a certain<br />

eventuality by looking at which of the three event-components described above are<br />

conceptually present. Take the example of the eventuality lexicalised with the verbal<br />

phrase go out of the room: the existence of the first component can be tested by<br />

looking at whether the expression can be put into the progressive. The existence of the<br />

third component, which practically goes hand in hand with the presence of the second<br />

one, can be tested by examining whether the expression can be put into the perfective<br />

(see [6]). Due to certain characteristics of the Hungarian language the easiest way we<br />

can test whether the second and third components of the triad are conceptualised is by<br />

translating the Hungarian sentence into English and putting the translated equivalent<br />

into Present Perfect / Progressive. 10<br />

3. János éppen ment ki az épületből, amikor találkoztam vele.<br />

‘John was going out of the building when I met him.'<br />

4. Mire Zsuzsa megérkezett, addigra János kiment az épületből.<br />

'By the time Sue arrived, John has gone out of the building.'<br />

As a result of the two tests we can see that the phrase go out of the building<br />

conceptualises all the three components of the triad:<br />

< a; b; c ><br />
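The triad notation and the two tests just described can be sketched as a small class: the progressive test probes for the preparatory phase, the perfective test for the culmination together with its consequent state. The names are illustrative, not the actual HuWN encoding.

```python
# Sketch: the nucleus as an ordered triple < a; b; c >, with the two tests
# described above (progressive -> preparatory phase conceptualised;
# perfective -> culmination and consequent state conceptualised).

from dataclasses import dataclass

@dataclass(frozen=True)
class Nucleus:
    preparatory: bool    # a
    culmination: bool    # b
    consequent: bool     # c

    def passes_progressive_test(self) -> bool:
        return self.preparatory

    def passes_perfective_test(self) -> bool:
        # a consequent state goes hand in hand with a culmination
        return self.culmination and self.consequent

go_out_of_the_building = Nucleus(True, True, True)   # < a; b; c >
print(go_out_of_the_building.passes_progressive_test())  # True
print(go_out_of_the_building.passes_perfective_test())   # True
```

An eventuality passing both tests, like go out of the building, conceptualises the full triad; one failing the perfective test lacks the telos and consequent state.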

Moens&Steedman elaborate on the categories named by Vendler/Bach by adding<br />

the factors of the existence or lack of the triad-components. In order to see how the<br />

classification according to the triad-components relates to the classification of<br />

Vendler/Bach, let us look at Table 1. This shows the classification of eventualities<br />

according to the factors taken into consideration by Moens&Steedman (+/– atomic<br />

and +/– consequent state), explicitly referring to the equivalents in Vendler's and<br />

Bach's system, where possible (in cases where the new terminology differs from the<br />

former one, we have indicated the former in brackets).<br />

10<br />

Since this methodology may be surprising at first, some explanation is in order. In<br />

Hungarian – as opposed to English – there are no clear-cut and simple tests that are sensitive<br />

enough to the aspectual properties of a sentence (or verb phrase). Realizing the<br />

impossibility of providing a usable test battery for Hungarian, we chose a detour, as it were,<br />

through a proxy in English. Benefiting from the situation that everybody in the WordNet<br />

developers' team spoke English on an advanced level and had learnt to be sensitive to certain<br />

aspectual features in English, we decided to rely on our tacit knowledge of the aspectual<br />

features we wanted to test. When translating a Hungarian sentence into the English Present<br />

Perfect or Progressive, one had to judge its aspectual acceptability irrespective of whether the<br />

translation was correct in any other respect. Obviously, this methodological shortcut should<br />

be backed by further research in second language acquisition to be of sound theoretical value,<br />

but we believe that used with sufficient care it provides a reliable tool when the tests in the<br />

object language prove too complicated for practical usage.



Table 1. Eventualities based on the system of Moens&Steedman<br />

                 non-states                                         states<br />
                 +conseq                     –conseq<br />
atomic           culmination                 point                  resemble,<br />
                 (=ACHIEVEMENT):             hiccup, tap, wink      understand,<br />
                 recognize                                          love,<br />
extended         culminated process          process                know<br />
                 (=ACCOMPLISHMENT):          run, swim, walk,<br />
                 build a house,              play the piano<br />
                 eat a sandwich<br />
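Table 1's two binary features can be rendered as a simple lookup; states fall outside the feature grid. This is a toy rendering of the classification, not an exhaustive classifier.

```python
# Sketch of Table 1 as a lookup: Moens&Steedman's event classes from the two
# binary features (+/- atomic, +/- consequent state), for non-states.

CLASSES = {
    (True,  True):  "culmination (achievement)",
    (True,  False): "point",
    (False, True):  "culminated process (accomplishment)",
    (False, False): "process",
}

def classify(atomic: bool, consequent: bool) -> str:
    return CLASSES[(atomic, consequent)]

print(classify(atomic=True,  consequent=False))  # point (hiccup, tap, wink)
print(classify(atomic=False, consequent=True))   # culminated process
```

The four cells of the grid line up with Vendler's achievement, point expression, accomplishment, and activity, as the table indicates.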

Theoretically 2³ different potential aspectual types may be distinguished according<br />

to the conceptual presence of the nucleus-components, listed as follows. 11<br />

< a; Ø; Ø ><br />

< Ø; b; Ø ><br />

< Ø; Ø; c ><br />

The coherence of the nucleus components is more than mere temporal<br />

sequentiality: it is what Moens&Steedman call contingency, "a term related, but not<br />

identical to a notion like causality" [6].<br />

The mutual dependency among the three components of the nucleus means that<br />

none of them can be seen as preparatory phase, culmination or consequent state per<br />

se. An eventuality that, based on the above tests, seems to possess a preparatory<br />

phase, but lacks both culmination and consequent state (could be marked as<br />

<p, Ø, Ø>) cannot be seen as a preparatory process as it does not precede anything. By<br />

analogy, an eventuality that, based on the above tests, seems to possess a consequent<br />

state but lacks a culmination (could be marked as <Ø, Ø, c>) cannot be seen as a<br />

consequent state, just like an eventuality with what seems to be a point of<br />

culmination, but lacking both preparatory phase and consequent state (could be<br />

marked as <Ø, t, Ø>) cannot be interpreted as a telos. In other words, a triad having a<br />

consequent state implies that the triad also has a culmination point. However, the<br />

three respective components seemingly appearing on their own may easily be<br />

interpreted as corresponding to the notions process and state as used by Vendler and to<br />

the Bachian point expression.<br />

Although the three non-complex eventualities (process, point, state) are not<br />

discussed in depth further by Moens&Steedman, we found it important to deal with<br />

them in HuWN, and follow the above convention of showing the aspectual<br />

information in an ordered triple. Accordingly, the above listed possible combinations<br />

of the nucleus-components, each standing for one possible aspectual verb-subtype, are<br />

illustrated with examples, as follows:<br />

11 The sign Ø refers to non-conceptualised components of the triad.


Language Independent and Language Dependent Innovations… 261<br />

<Ø, Ø, Ø>: no example<br />

<p, t, c>: befelhősödik ('become cloudy')<br />

<p, t, Ø>: no example<br />

<Ø, t, c>: eltörik ('break')<br />

<p, Ø, c>: no example<br />

<p, Ø, Ø>: fut ('run')<br />

<Ø, t, Ø>: kattan ('click')<br />

<Ø, Ø, c>: szeret ('love')<br />

Three of the possible combinations are excluded on epistemological grounds:<br />

(i) A nucleus having no components at all can be discussed neither conceptually<br />

nor linguistically. An eventuality (ii) having a preparatory phase and a culmination<br />

point, as well as one (iii) having a preparatory phase and a consequent state cannot be<br />

lexicalised due to the coherence of the telos and the consequent state.<br />

Besides the remaining five lexicalised possibilities of nucleus-component<br />

combinations we have, however, seen the need for marking a sixth possible aspectual<br />

type in HuWN. As mentioned above, in many cases linguistic tests in Hungarian are<br />

unreliable in the sense that they provide ambiguous results even for native speakers.<br />

For the sake of usability in Hungarian language technology applications we<br />

considered it necessary to explicitly mark those cases in HuWN where the<br />

Hungarian test for the progressive did not result in a clearly grammatical sentence, but<br />

the English equivalent did. One such example can be seen in (5):<br />

5. János éppen gyógyult meg, amikor huzatot kapott a füle és újra belázasodott.<br />

John was getting better when his ear caught a chill and he got a fever again.<br />

In cases like the one mentioned above we decided to mark the first component of the<br />

nucleus "unmarked", designating this with an x: <x, t, c>.<br />

2.2 The notion of the nucleus in HuWN<br />

As we have seen, the conceptual presence or absence of meta-language elements<br />

beyond the lexicalised expressions can be tested with the help of Moens&Steedman's<br />

nucleus structure. The number of components a verb conceptualizes compared to an<br />

idealized complex event unit provides information on the telicity or atelicity of a<br />

given eventuality. If the third component of a nucleus denoted by a given verb is<br />

expressed,12 the eventuality is telic; if the component is not present, the eventuality is<br />

atelic.<br />
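The telicity rule just stated can be sketched as a toy encoding (illustrative only, not the actual HuWN data format; we write "p", "t", "c" for the preparatory phase, telos/culmination and consequent state, and None for a non-conceptualised component):

```python
# A minimal sketch of the nucleus triple and the telicity rule: a verb
# conceptualises some subset of (preparatory phase, telos, consequent state).
ASPECT_CLASS = {
    ("p", "t", "c"): "culminated process (accomplishment)",
    (None, "t", "c"): "culmination (achievement)",
    ("p", None, None): "process",
    (None, "t", None): "point",
    (None, None, "c"): "state",
}

def is_telic(triple):
    # A triad with a consequent state also has a culmination, so the
    # third component alone decides telicity.
    return triple[2] is not None

fut = ("p", None, None)     # fut 'run'
eltorik = (None, "t", "c")  # eltörik 'break'

assert not is_telic(fut)
assert is_telic(eltorik)
assert ASPECT_CLASS[fut] == "process"
```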

2.2.1 Representing telicity in HuWN<br />

Of the six patterns mentioned above, whose lexicalisation is enabled by the presence of the<br />

respective nucleus-components, it is only complex eventualities that can be<br />

12 As mentioned in the previous section, the presence of the consequent state as third<br />

component entails the presence of the culmination point as second component.



telic. To get an overview of these complex eventualities from an aspectual point of<br />

view, the representation in ordered triples introduced in 2.3 seems appropriate,<br />

as can be seen in Table 2:<br />

Table 2. Telicity of complex eventualities illustrated by the tripartite event structure of<br />

Moens&Steedman<br />

Components of the triad | The metalinguistic name for the conceptualised components of the phrase lexicalising the triad | Telicity of the VP<br />
<p, t, c>               | to exit: <p, t, c>                                                                            | +consequent state, telic<br />
<Ø, t, c>               | blow up: <Ø, t, c>                                                                            | +consequent state, telic<br />

Of the simple eventualities, processes and states are usually considered atelic while<br />

point expressions (on their own, without context) are underspecified for this kind of<br />

information. When constructing HuWN, the question arises whether and how to<br />

represent meanings that should be synonyms according to the notion of synonymy in<br />

WordNet and yet differ aspectually. The notion of the nucleus helps us answer:<br />

aspectual differences can and should be represented in HuWN. If a meaning<br />

represented as a synset in the WordNet is transformed into a minimal proposition, one<br />

can determine whether the consequent state of the appropriate nucleus is present. 13<br />

By encoding whether a meaning has a consequent state (and hence a telos), through<br />

assigning to it one of the six conceptualization patterns of the triad components, the<br />

telicity of the eventuality expressed by the verb will be made explicit. This<br />

information is stored in HuWN in a similar way as in the case of the information on<br />

verb frames: we indicate which of the three triad components is conceptualised in<br />

Hungarian on the level of the literals.<br />

As already introduced, for the sake of uniformity and transparency we follow the<br />

convention of showing the aspectual information in an ordered triple even in the case<br />

of simple eventualities mentioned in 2.2 and 2.3. Accordingly, the ordered triple of<br />

the verb fut 'run' is (<p, Ø, Ø>). This triple shows on the one hand that the eventuality<br />

expressed by the verb fut is atelic, and on the other that it is a Vendlerian process,<br />

indicated by the preparatory phase being solely present.<br />

2.2.2 Complex eventualities in HuWN<br />

Besides the possibility of storing a minimal amount of aspectual information<br />

concerning the given literal in a verb synset, the relational structure of the WordNet<br />

and the nucleus taken as a single unit allow us to propose another extension to the<br />

13 Transforming verbal meanings into minimal propositions is ensured in the WordNet by<br />

mapping all the possible verbal subcategorisation frames of a given literal onto its synset.<br />

Sometimes several verb frames are merged into one verb frame record with optional<br />

arguments. In this case verbs should be considered with the minimal number of obligatory<br />

arguments. E.g.: the verb frame eszik, ’eat’ contains an optional direct object, so the minimal<br />

predicate should be formed without an object, and that predicate is atelic.



verb synset structure. In the case of complex eventualities whose certain triad<br />

components are not only conceptually present, but are lexicalised, as well, the unity of<br />

these components can be represented. Although the structure of PWN is based on a<br />

hierarchical system, an alternative structure has already been accepted for adjectives<br />

in PWN. By analogy, it should also be possible to organise the verb synsets in a<br />

slightly different way from nouns. The tripartite structure described above may be<br />

mapped onto the system of WordNet in the form of relations. The meta-language<br />

level described by Moens&Steedman's nucleus structure can be mapped onto the level<br />

of lexicalised elements, represented by WordNet synsets. The connection of the two<br />

levels is shown in Figure 3.<br />

Fig. 3. Applying the event-nucleus of Moens&Steedman to the synsets of WordNet<br />

Artificial nodes introduced in HuWN [5] are suitable for naming meta-language<br />

nuclei, e.g. the complex eventuality denoting the change of state from wet to dry, in<br />

the above example.14 The relational structure of the WordNet allows introducing three<br />

new relations, relating the respective triad-components to the meta-language<br />

nucleus-unit, represented by an artificial node. These new relations point to<br />

the appropriate artificial node and they are called is_preparatory_phase_of,<br />

is_telos_of and is_consequent_state_of, respectively, based on the names of the<br />

different nucleus components.<br />

Meanings that are lexicalised by a single verb in English but not in Hungarian can<br />

thus be distinguished: the same meaning might be present in Hungarian both as a verb<br />

with a preverb providing more aspectual information and as a verb without a preverb,<br />

more underspecified for aspectual information. In the above example, the Hungarian<br />

14 Artificial nodes are written with capital letters to distinguish them from natural language<br />

synsets.



szárad and megszárad synsets are both equivalent to the English {dry:2}.15 Without<br />

integrating the nucleus system into the WordNet, the synset megszárad could be<br />

placed into HuWN only as a hyponym of szárad, considering the originally available<br />

relations. However this kind of representation would not distinguish the different<br />

implicational relation between the above mentioned two meanings, but would merge<br />

them into a hyponym−hypernym relation. 16 After having integrated the nucleus<br />

system into the HuWN, there is no need for an additional explicit relation between the<br />

components of a nucleus: they are already connected through the artificial node.<br />

Following the path of the relations is_preparatory_phase_of and is_telos_of, it is easy<br />

to determine that the synset szárad represents the preparatory phase of the nucleus<br />

whose other lexicalised component is megszárad; hence megszárad implies szárad,<br />

while the implication does not hold in the other direction.17<br />
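The asymmetric implication just described can be illustrated with a toy traversal over the three new relations (relation names and synsets follow the text; the artificial node name BECOME_DRY and the dictionary layout are invented for this sketch):

```python
# Each edge points from a synset to the artificial node naming its nucleus.
edges = {
    ("szárad", "is_preparatory_phase_of"): "BECOME_DRY",
    ("megszárad", "is_telos_of"): "BECOME_DRY",
    ("száraz", "is_consequent_state_of"): "BECOME_DRY",
}

def implies(a, b):
    # a implies b if a lexicalises the telos and b the preparatory phase
    # of the same nucleus: reaching the culmination entails that the
    # preparatory process took place, but not the other way round.
    telos_nucleus = edges.get((a, "is_telos_of"))
    return (
        telos_nucleus is not None
        and telos_nucleus == edges.get((b, "is_preparatory_phase_of"))
    )

assert implies("megszárad", "szárad")      # megszárad implies szárad
assert not implies("szárad", "megszárad")  # but not vice versa
```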

As we have seen, verbs belonging to the same triad (often with and without a<br />

preverb respectively) can be placed more accurately in HuWN with the help of the<br />

new relations. Furthermore, the relation is_consequent_state_of is not restricted to<br />

verbs: the third component of the triad mentioned above is the adjective synset száraz<br />

({dry:1}). This psycholinguistically relevant piece of information is present in HuWN<br />

but would be lost if we had strictly held onto the structure of PWN without the tools<br />

for representing triads.<br />

2.3 Possible applications<br />

Besides the fact that one of the main tasks of a WordNet is to provide a uniform<br />

representation for the idiosyncratic properties of the lexical items, the extension of<br />

HuWN in the proposed way may bring practical benefits, as well. As we have seen, it<br />

can be easily deduced from the triad whether a given verb is telic or atelic, perfect or<br />

progressive, respectively. A Hungarian-English MT system may be improved by<br />

using this information provided in HuWN, e.g. in the area of matching the verb tenses<br />

in the source and the target language more appropriately. Since there are only two<br />

morphologically marked tenses in Hungarian (present and past), a rule-based MT<br />

system would select the same two tenses in the target language, simple present and<br />

simple past, respectively. Inaccurate translations would emerge inevitably. However,<br />

the above outlined information integrated into HuWN would improve the system. In<br />

Hungarian, for example, morphologically present tense forms of a telic verb have a<br />

future reference. The English equivalent of the Hungarian sentence Felhívom Pétert is<br />

not I call Peter, but I will call Peter. Similarly, progressive past tense verb forms<br />

should be matched with the past continuous form of the appropriate verb, instead of<br />

selecting the simple past form: the Hungarian Péter az udvaron játszott should be<br />

15 szárad 'is drying' (v), megszárad 'get dry' (v), száraz 'dry' (a)<br />

16 By analogy to the nominal hypernymy relation, one way of conceiving of this relation<br />

between verbs would be basing it on selectional restrictions. E.g. the synsets {hervad,<br />

fonnyad} ('fade', 'wither') and {rohad} ('rot') would have such an ideal hypernymy relation,<br />

since the former selects plants as subject, while there is no such restriction on the subject of<br />

the latter one.<br />

17 See Section 2.1.3 for a discussion on the connection (called contingency by<br />

Moens&Steedman) between the components of a triad.



matched to the English Peter was playing in the yard, instead of the expected Peter<br />

played in the yard. Aspectual information may also be used in generating sentences,<br />

be it in translation from English to Hungarian or in other tasks requiring<br />

generation.<br />
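Assuming telicity and progressivity can be read off HuWN, a tense-selection rule of the kind described might be sketched as follows (a deliberately simplified toy rule, not an actual MT component):

```python
def english_tense(hu_tense, telic, progressive):
    """Toy rule: Hungarian present-tense telic verbs get a future reading
    (Felhívom Pétert -> I will call Peter); progressive past maps to past
    continuous (Péter az udvaron játszott -> Peter was playing in the yard)."""
    if hu_tense == "present":
        return "future" if telic else "simple present"
    if hu_tense == "past":
        return "past continuous" if progressive else "simple past"
    raise ValueError("Hungarian marks only present and past morphologically")

assert english_tense("present", telic=True, progressive=False) == "future"
assert english_tense("past", telic=False, progressive=True) == "past continuous"
```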

These properties of verbs may be helpful in text understanding, as well. The<br />

knowledge of these idiosyncratic properties of verbs is an important component of a<br />

computer's internal representation. Without this information, just by considering the<br />

temporal adverbials (possibly) present in the sentence, it is not possible to accurately<br />

represent or reconstruct the temporal structure of a narrative.<br />

3 Language independent new relation types in the HuWN<br />

When constructing the HuWN, we tried to remain as faithful as possible to the<br />

structure of PWN 2.0, on which HuWN is based. This was not always possible, as<br />

English and Hungarian show differences in the organisation of their lexicons and in some<br />

word-association tests. In what follows, we will describe two problematic cases we<br />

have faced, and for which we provide alternative solutions to the one suggested in<br />

PWN.<br />

Some descriptive adjectives do not fit into the typical bipolar cluster structure of<br />

PWN. They occur in clusters having more focal synsets than the usual number, i.e.<br />

more than two adjectives are meant to express opposing values of an attribute (see<br />

Figure 4).<br />

Fig. 4. Atypical adjective clusters<br />

The focal synsets of these domains form a "triangle" along the near_antonym<br />

relations running between each pair among them. Considering this representation, it<br />

might be deduced that these attributes are not bipolar but three-dimensional, having<br />

three marked "poles". In the present section we argue for an alternative kind of<br />

representation, which, with the help of two new relations, enables adherence to the<br />

original bipolar structure of adjective clusters.<br />

Descriptive adjectives are organised in clusters along semantic similarity and<br />

antonymy between words (instead of concepts), reflecting psychological principles<br />

[3]. Consider the example in Figure 5/b. The adjective pair pozitív 'positive' and negatív<br />

'negative' are the opposing poles of their domain. The situation of the word semleges<br />

'neutral' is odd. Its English equivalent occurs as a third focal synset in the same<br />

domain as positive and negative in PWN. Relying on word association tests for<br />

Hungarian, we did not follow the solution of PWN when inserting semleges ('neutral')<br />

into HuWN. While the words pozitív and negatív do evoke each other in word<br />

association tests, the relations between pozitív and semleges and between negatív and<br />

semleges are not as straightforward. Although the word semleges does evoke pozitív,<br />

the antonym of pozitív is the adjective negatív. Loosening the scope of the usage<br />

of the relation near_antonym in order to enable antonym triangles to fit into a<br />

WordNet might cause anomalies in regular bipolar clusters as well (cf. direct and<br />

indirect antonyms). Therefore we have defined a new relation as an alternative way of<br />

dealing with the case of triangles described above.<br />

The adjectives pozitív and negatív determine a bipolar domain. This domain differs<br />

from the typical domains in the number and structure of its members. Apart from the<br />

two focal synsets, there is another adjective whose role is marked, but, as we have<br />

already shown, it is not a real antonym of either pozitív or negatív. Furthermore, this<br />

special adjective expresses a value lying exactly in the middle of the domain.<br />

Therefore, the new relation we are proposing here, and have already used in<br />

HuWN, is called scalar middle; it points to both focal synsets of the given domain (Fig. 5).<br />

Fig. 5. The middle relation<br />

It should be noted that the newly introduced relation scalar middle can be used in<br />

any bipolar domain where the exact middle value (either actually being, or conceptually<br />

considered as, a discrete point) is lexically marked, e.g. in the domain determined by<br />

the adjectives alsó-felső-középső ('lower-upper-middle'). Although we have defined<br />

scalar middle in relation to HuWN, it may be used in other WordNets, as well, since<br />

the above described case is not limited to the Hungarian language alone.<br />
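Encoded as pointers, the difference from near_antonym is that scalar middle runs from a single synset to both focal synsets, while antonymy still holds only between the two poles (a schematic sketch; relation names follow the text, the data layout is invented):

```python
# Schematic encoding of a bipolar adjective cluster with a lexicalised middle.
cluster = {
    "near_antonym": {("pozitív", "negatív"), ("negatív", "pozitív")},
    "scalar_middle": {("semleges", "pozitív"), ("semleges", "negatív")},
}

def focals_of_middle(word):
    # The scalar middle points to both focal synsets of its domain.
    return {tgt for src, tgt in cluster["scalar_middle"] if src == word}

# semleges points to both focals, yet is an antonym of neither:
assert focals_of_middle("semleges") == {"pozitív", "negatív"}
assert ("semleges", "pozitív") not in cluster["near_antonym"]
```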

At first sight the scalar middle relation could be used in the example shown in<br />

(5.a). The two opposing poles of the domain are {működő, aktív} 'active' and<br />

{kialudt} 'extinct, inactive', while the midpoint is denoted by {alvó, inaktív}<br />

'dormant, inactive'. In this domain, however, the middle value of the attribute cannot<br />

be considered as discrete. Furthermore, the synset {alvó, inaktív} might be considered<br />

to be in similar_to relation with {működő, aktív}, as the adjective alvó 'dormant'<br />

refers to a "presently not functioning volcano", thus having a closer meaning to<br />

{működő, aktív}, just as langyos 'lukewarm' is in similar_to relation with meleg<br />

'warm'.<br />

The domain specified by these three synsets differs from the aforementioned<br />

domains not only because of the similarities and contrasts between its members.



These adjectives also constrain their scope: they can only refer to volcanoes, and the<br />

WordNet has to account for this semantic relation. PWN and BalkaNet relate these<br />

adjectives through the antonymy relation, and do not even indicate their<br />

relation with the noun that these adjectives exclusively modify.<br />

The synset-triple concerning volcanoes is not the only triangle of this kind present<br />

in the semantic lexicon. For another simple example, we refer to the adjectives<br />

egynyári-kétnyári-évelő 'annual, biennial, perennial'. Had we only the near_antonym<br />

relation at our disposal, the information that the respective adjectives can only refer to<br />

plants would have to be omitted, and the fact that these three adjectives belong<br />

together could only be represented in the form of a triangle among them.<br />

When taking a closer look, one can see that the adjectives mentioned above<br />

partition the extension of the particular noun, i.e. they divide the set it denotes, e.g. all<br />

the plants in the last example, into disjoint subsets. This motivates the name of the<br />

suggested new relation: partitions, which is represented as a pointer pointing from the<br />

adjectives to the noun synset they partition (see Fig. 6.).<br />

Fig. 6. The partitions relation<br />

With the introduction of this new relation the explicit designation of the opposition<br />

between the adjective synsets becomes redundant, since due to the nature of the<br />

partitioning relation they may only be mutually exclusive. Although the partitions<br />

relation is similar to the category_domain relation of the WordNet, the two relations<br />

should not be confused. Category_domain relates the given adjectival meaning and<br />

the domain it can be used in, e.g. {egyvegyértékű, monovalens} 'monovalent' –<br />

{kémia, vegyészet} 'chemistry', but does not specify the noun(s) it can modify, even if<br />

it can modify a certain noun exclusively.<br />
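Because partitions points at the noun synset itself, mutual exclusivity between the partitioning adjectives can be derived rather than stored, as this schematic sketch shows (relation name from the text; the noun synset name növény 'plant' and the data layout are illustrative):

```python
# Each partitioning adjective points to the noun synset it partitions.
partitions = {
    "egynyári": "növény",  # 'annual'    -> 'plant'
    "kétnyári": "növény",  # 'biennial'  -> 'plant'
    "évelő": "növény",     # 'perennial' -> 'plant'
}

def mutually_exclusive(adj1, adj2):
    # Adjectives partitioning the same noun denote disjoint subsets, so an
    # explicit antonymy pointer between them would be redundant.
    return adj1 != adj2 and partitions.get(adj1) == partitions.get(adj2)

assert mutually_exclusive("egynyári", "évelő")
```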

4 Conclusion<br />

In the present paper we have tried to show on the example of the Hungarian WordNet<br />

in what ways the WordNet-structure as conceived of in PWN may be exploited and<br />

extended in order to represent some language-specific phenomena of languages<br />

typologically different from English. Although specifically implemented for solving a<br />

linguistic situation in the Hungarian language, the implementation of the nucleus-structure<br />

in the WordNet in the form of relations might prove to be useful for other<br />

languages with a rich morphology showing aspectual distinctions, as well. Similarly,<br />

the use of the adjectival relations as suggested in the present paper enables us to<br />

represent semantic relations in a form that remains faithful to some of the<br />

original ideas behind the WordNet-structure, while hopefully allowing for an equally<br />

accurate representation of semantic relations as the originally offered alternative.<br />

References<br />

1. Bach, E.: The Algebra of Events. Linguistics and Philosophy 9, 5-16 (1986)<br />

2. Dowty, D.: Word Meaning and Montague Grammar. D. Reidel, Dordrecht (1979)<br />

3. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)<br />

4. Kiefer, F.: Aspektus és akcióminőség. Különös tekintettel a magyar nyelvre. [Aspect and Aktionsart, with Special Respect to Hungarian]. Akadémiai Kiadó, Budapest (2006)<br />

5. Kuti, J., Vajda, P., Varasdi, K.: Javaslat a magyar igei WordNet kialakítására. [A Proposal for Building the Hungarian Verbal WordNet]. In: Alexin, Z., Csendes, D. (eds.) III. Magyar Számítógépes Nyelvészeti Konferencia, pp. 79-88. Szegedi Tudományegyetem, Szeged (2005)<br />

6. Moens, M., Steedman, M.: Temporal Ontology and Temporal Reference. Computational Linguistics 14(2), 15-28 (1988)<br />

7. Tufis, D. et al.: BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. Romanian Journal of Information Science and Technology 7(1-2), 1-35 (2004)<br />

8. Vendler, Z.: Verbs and Times. The Philosophical Review 66, 143-160 (1957)<br />

9. Verkuyl, H. J.: On the Compositional Nature of the Aspects. Foundations of Language, Supplementary Series, 15. Reidel, Dordrecht (1972)<br />

10. Vossen, P.: EuroWordNet General Document. Technical Report, EuroWordNet (LE2-4003, LE4-8328) (1999)


Introducing the African Languages WordNet<br />

Jurie le Roux 1 , Koliswa Moropa 1 , Sonja Bosch 1 , and Christiane Fellbaum 2<br />

1 Department of African Languages, University of South Africa,<br />

PO Box 392, 0003 UNISA, South Africa<br />

{lrouxjc, moropck, boschse}@unisa.ac.za<br />

2<br />

Department of Psychology, Princeton University, USA<br />

fellbaum@clarity.princeton.edu<br />

Abstract. This paper introduces the African Languages WordNet project which<br />

was launched in Pretoria during March 2007. The construction of a WordNet<br />

for African Languages is discussed. Adding African languages to the WordNet<br />

web will enable many natural language processing (NLP) applications such as<br />

cross-linguistic information retrieval and question answering, and significantly<br />

aid machine translation. Some of the accomplishments of the WordNet<br />

workshop as well as the main challenges facing the development of WordNets<br />

for African languages are examined. Finally, we look at future challenges.<br />

Keywords: WordNet, African languages, Bantu language family, agglutinating<br />

languages, noun class system<br />

1 Introduction<br />

We discuss the construction of a WordNet for African Languages. Our main focus is<br />

on the challenges posed by languages that are typologically distinct from those for<br />

which the original WordNet design was conceived.<br />

1.1 African Languages WordNet: Laying the Foundations 1<br />

African Languages WordNets present an exciting addition to the WordNets of the<br />

world. NLP applications will be enabled not only for each of the African languages in<br />

isolation, but powerful cross-linguistic applications such as machine translation will<br />

be made possible by linking the African languages WordNets to one another and to<br />

the many global WordNets. Moreover, our initial investigations into the lexicons of<br />

different African languages suggest interesting similarities and differences among the<br />

1 The African Languages WordNet effort, aiming to create an infrastructure for WordNet<br />

development for African languages, began with a week-long workshop funded by the Meraka<br />

Institute (CSIR) at the University of Pretoria in March, 2007. Christiane Fellbaum<br />

(Princeton), Piek Vossen (Amsterdam) and Karel Pala (Brno) facilitated. Linguists and<br />

computer scientists representing 9 official South African languages were introduced to<br />

WordNet lexicography and familiarized with the lexicographic editing tool DebVisDic<br />

(http://nlp.fi.muni.cz/trac/deb2/wiki/DebVisDicManual).



African languages. We work with the nine official African languages of South Africa,<br />

viz. isiZulu, isiXhosa, isiSwati, isiNdebele, Tšhivenda, Xitsonga, Sesotho (Southern<br />

Sotho), Sesotho sa Leboa (Northern Sotho) and Setswana (Western Sotho). These<br />

languages all belong to the Bantu language family and are grammatically closely<br />

related. The Nguni languages, i.e. isiZulu, isiXhosa, isiSwati and isiNdebele form one<br />

group. The Sotho languages, viz. Sesotho, Sesotho sa Leboa and Setswana (Western<br />

Sotho) form another group with Tšhivenda and Xitsonga being more or less on their<br />

own.<br />

An important reason for distinguishing between these groups of languages lies in<br />

their distinct orthographies. In the case of the Nguni languages, a conjunctive system<br />

of writing is adhered to with a one-to-one correlation between orthographic words and<br />

linguistic words. For example, the isiZulu orthographic word siyakuthanda (si-ya-ku-thand-a)<br />

'we like it' is also a linguistic word. The Sotho languages as well as<br />

Tšhivenda and Xitsonga on the other hand, are disjunctively written, and the above<br />

mentioned single isiZulu orthographic word is written as four orthographic words in<br />

Sesotho sa Leboa, namely re a go rata ‘we like it’. These four orthographic entities<br />

constitute one linguistic word.<br />
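The contrast between the two orthographies can be made concrete with a trivial sketch: counting space-delimited tokens, the same linguistic word surfaces as one orthographic word in isiZulu and four in Sesotho sa Leboa (examples from the text):

```python
def orthographic_words(text):
    # Orthographic words are simply the space-delimited tokens.
    return text.split()

# isiZulu (conjunctive): one orthographic word = one linguistic word.
assert len(orthographic_words("siyakuthanda")) == 1
# Sesotho sa Leboa (disjunctive): four orthographic words, one linguistic word.
assert len(orthographic_words("re a go rata")) == 4
```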

1.2 Language resources<br />

A problem facing the African languages WordNet is the limited availability of<br />

electronic language resources such as large corpora, parallel corpora, electronic<br />

dictionaries and machine-readable lexicons, particularly in comparison to those<br />

available for other languages. For the moment we need to rely on both monolingual<br />

and bilingual dictionaries available in some African languages for semantic<br />

information. African languages are still lagging behind with regard to corpus<br />

compilation. General corpora are available for all nine African languages at the<br />

University of Pretoria, but with access restrictions which involve on-site computer<br />

processing of the corpora and downloading only the results of the queries. The sizes<br />

of the various corpora range from 1 million tokens for isiNdebele to 5.8 million<br />

tokens for Sesotho sa Leboa [1].<br />

2 Challenges<br />

2.1 Morphological complexities<br />

For the African languages we identify various specific challenges that have not been<br />

addressed by other WordNets. The first and foremost is that these languages are<br />

primarily agglutinative languages:<br />

a) based on a noun class system according to which nouns are categorised by<br />

prefixal morphemes; and<br />

b) having roots/stems as the constant core element from which words or word<br />

forms are constructed.



These morphological complexities make the notion of "words" and their<br />

lexicographic treatment particularly challenging when one tries to form synsets.<br />

Traditional lexicography does not reflect the linguistic intuitions of native speakers<br />

and does not withstand modern linguistic analysis, as illustrated in the following<br />

examples representing the disjunctively as well as the conjunctively written<br />

languages.<br />

Entries in some traditional Setswana (Western-Sotho) dictionaries follow the<br />

alphabetical order of stems rather than the prefix that precedes the stem (a complete<br />

noun is formed by a prefix and a stem, a complete verb is syntactically formed by<br />

prefixes and/or suffixes around a root). The stem, with the prefixes following it where<br />

applicable, is presented in bold. The following is an example of an entry found<br />

under a in a Setswana dictionary:<br />

ádímí, mo- ba- dev < adima, borrower, lender<br />

This implies that "words" are listed under the first letter of their stems or roots<br />

except in those cases where prefixes have coalesced with the stems or roots, e.g.<br />

mmútla pl mebútla, hare<br />

Other Setswana dictionaries do, however, take the noun as it is as an entry, for<br />

example,<br />

moadimi N. cl 1 mo-, SING. OF baadimi, DER. F. adima, a lender; a borrower.<br />

Verbs are entered under the verb stem, e.g.<br />

ádímā, borrow, lend<br />

ádíngwā pass < adima, be lent or borrowed<br />

Adjectives are entered under the specific adjectival root but with all the noun class<br />

prefixes that it may take e.g.<br />

ntlê, 1. adj dí-, bó- má-, gó-, lé- má-, ló- dí-, mó- bá-, mó- mé- or sé- dí-,<br />

beautiful, pretty, handsome<br />

Adverbs do not present major problems, since only a few adverbs are primitive,<br />

showing no derivation from any other part of speech. The great majority are<br />

derivative, being formed mainly from nouns and pronouns by prefixal and suffixal<br />

inflexion. There are a considerable number of nouns and pronouns which may be used<br />

as adverbs without undergoing any change of form at all. Take the following entry for<br />

example,<br />

ntlé, (ntlê), 1. adv in (i) fa ntlê, here outside; (ii) kwa ntlê, outside; 2. conj in kwa<br />

ntlê ga, besides or with the exception of; 3. n le-, outside; le- ma-, faeces (human)


272 Jurie le Roux, Koliswa Moropa, Sonja Bosch, and Christiane Fellbaum<br />

The Nguni languages (isiXhosa, isiZulu, siSwati and isiNdebele) are particularly<br />

challenging, as they are written conjunctively; for example, basic noun forms are<br />

written together with concord morphemes. Entries in traditional dictionaries for these<br />

languages follow the alphabetical order of stems rather than the prefix that precedes<br />

the stem (a complete noun is formed by a prefix and a stem). The stem is presented in<br />

bold to distinguish it from the prefix. Below are example entries found under k in<br />

isiXhosa dictionaries:<br />

i-khaka (shield)<br />

isi-khumba (skin)<br />

u-khokho (ancestor -usually great-grandparent)<br />

The three nouns begin with different prefixes i.e. i-, isi-, u- which denote different<br />

noun classes, but the common feature is the first letter of the stem, k. Verbs are<br />

entered with the infinitive prefix uku-, followed by the verb stem. WordNets for these<br />

languages will follow this pattern for representing synset members. As WordNets are<br />

organized entirely by meaning and not alphabetically, any look-up difficulties that<br />

this format poses for conventional dictionaries will not arise.<br />
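As a rough illustration of the contrast described above, the following sketch uses the isiXhosa forms cited in the text; the data structures and synset identifiers are hypothetical illustrations, not the project's actual format. It shows why stem-based alphabetization groups all three nouns under k, while a wordnet keyed by concept sidesteps alphabetical placement altogether:

```python
# Traditional Nguni dictionaries alphabetize by the (bolded) stem, not by the
# noun-class prefix. Forms are from the isiXhosa examples in the text.
nouns = [
    ("i-", "khaka", "shield"),
    ("isi-", "khumba", "skin"),
    ("u-", "khokho", "ancestor"),
]

def dictionary_key(prefix, stem):
    """Traditional look-up key: the first letter of the stem, ignoring the prefix."""
    return stem[0]

# All three entries fall under 'k' despite their different class prefixes.
keys = {dictionary_key(p, s) for p, s, gloss in nouns}
print(keys)  # {'k'}

# A wordnet instead groups full surface forms (prefix + stem) under a concept,
# so alphabetical placement never matters. Synset ids here are invented.
synsets = {
    "shield.n.01": ["ikhaka"],
    "skin.n.01": ["isikhumba"],
    "ancestor.n.01": ["ukhokho"],
}
```

Because look-up proceeds from concept to word form, the full conjunctive form can be stored directly as a synset member.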

2.2 Roots, words, and word class membership<br />

As with the original WordNet, African Languages WordNets will include only "open-class<br />

words", viz. nouns, verbs, adjectives and adverbials. For example, all Setswana<br />

words can feature as one of these categories, since most "closed-class<br />

words", such as pronouns, are in any case derived from these word classes. This<br />

standpoint may seem to create some problems, but we will pursue it. Take the example<br />

of the root -ntlê [ntl], i.e. -ntlè or -ntlé depending on the tone: it generally means<br />

'beauty' or 'outside', and can therefore function as a noun, an adjective or an adverbial.<br />

One entry can be bontlê (beauty), as a noun:<br />

bontlê (beauty)<br />

Leina bontlê le na le bokao ba (Noun bontlê has the sense):<br />

1. bontlê - boleng bo bo kgatlhang (quality that pleases)<br />

Malatodi: maswê, mmê, mpe<br />

(Antonyms: dirt, ugly)<br />

However, if we now try to enter this 'word' as an adjective, we start to encounter<br />

problems. If we enter it as a word, i.e. as a complete linguistic unit, we need to take<br />

into account that the root -ntlê can take different prefixes; in fact, every noun<br />

class generates a prefix for this root, and we must decide what our entry should<br />

be. If we take the root as our entry, we need to indicate all the possible prefixes that it<br />

may take. Our entry can then be something like:<br />

-ntlê (-ntlè)<br />

Modi wa letlhaodi -ntlê o na le bokao jwa (Adjective stem -ntlê has the sense):


Introducing the African Languages WordNet 273<br />

1. yô montlê, ba bantlê, ô montlê, ê mentlê, lê lentlê, a mantlê, sê sentlê, tsê<br />

dintlê, ê ntlê, lô lontlê, bô bontlê, gô gontlê<br />

- boleng bo bo kgatlang (quality that pleases)<br />

Malatodi: maswê, mmê, mpe<br />

(Antonyms: dirt, ugly)<br />

The user will now only get the word, for example yo montle, as part of the synset -<br />

ntlê and not as a word on its own. On the other hand, if we take every possible word,<br />

i.e. the combination of the root with the prefixes, as our entries, we will need to make<br />

12 different entries for -ntlê (beautiful) as an adjective. Since the user encounters the<br />

complete form in speech or writing, the latter option seems the more practical.<br />

However, because we are working with 'lexemes' per se, we need to find a way to<br />

make one entry to present -ntlê (beautiful) as an adjective and as an adverb. It<br />

therefore seems that we need to opt for -ntlê as the entry. We can then accommodate<br />

-ntlê as a noun and also as an adverbial, via sub-entries for the noun bontlê and<br />

the adverb sentlê. The entry for the lexeme -ntlê can then be:<br />

-ntlê (-ntlè)<br />

Modi wa letlhaodi -ntlê o na le bokao jwa (Adjective stem -ntlê has the sense):<br />

1. yô montlê, ba bantlê, ô montlê, ê mentlê, lê lentlê, a mantlê, sê sentlê, tsê<br />

dintlê, ê ntlê, lô lontlê, bô bontlê, gô gontlê<br />

- boleng bo bo kgatlang (quality that pleases)<br />

Malatodi: maswê, mmê, mpe<br />

(Antonyms: dirt, ugly)<br />

Modi wa letlhalosi -ntlê o na le bokao jwa (Adverbial stem -ntlê has the sense):<br />

1. sentlê - ka mokgwa o o kgatlang e bile o siame (with<br />

quality that pleases and is right)<br />

Letswa: -ntlê<br />

(Derivative: -ntlê)<br />

Leina bontlê le na le bokao jwa (Noun bontlê has the sense):<br />

1. bontlê - boleng bo bo kgatlang (quality that pleases)<br />

Malatodi: maswê, mmê, mpe<br />

(Antonyms: dirt, ugly)<br />

Letswa: -ntlê<br />

(Derivative: -ntlê)<br />

For the adverbial it is now necessary to call it an adverbial root (modi wa<br />

letlhalosi), but since it is still the 'lexeme' which expresses the notion of 'beautiful',<br />

'pretty', 'nice' and 'well' it can feature under the same entry.<br />

The root -ntlê i.e. -ntlé, as a lexeme, expresses the meaning 'outside'. It is first and<br />

foremost adverbial but can take noun prefixes and be part of a noun. It can also be a<br />

conjunctive according to traditional dictionary entries. We will therefore form a<br />

synset for -ntlê (-ntlé): first as an adverb, with the prefixes it takes, incorporating<br />

it also as a conjunct (adverbial), and then as a noun with the prefixes<br />

it takes. The entry can then be:



-ntlê (-ntlé)<br />

Modi wa letlhalosi -ntlê o na le bokao jwa (Adverbial stem -ntlê has the sense):<br />

1. fa ntlê - gaufi le bokafantlê ga sengwe kgotsa lefelô (close<br />

to the outside of something or place)<br />

Malatodi: mo têng<br />

(Antonym: inside)<br />

2. kwa ntlê - bokafantlê ga sengwe kgotsa lefelô (outside<br />

something or place)<br />

Malatodi: ka mo têng<br />

(Antonym: on the inside)<br />

3. kwa ntlê ga - go tlogêla kgotsa go lesa mo go tse di leng teng<br />

(separate or leave from those present)<br />

Malatodi: le, e bile, gape<br />

(Antonyms: and, also, more)<br />

Leina lentlê le na le bokao jwa (Noun lentlê has the sense):<br />

1. lentlê - masalêla a dijô a a ntshetswang kwa ntlê (body<br />

waste matter)<br />

For verbs the verb stem, i.e. the root plus the ending of the infinitive form, is taken<br />

as the lexeme. A typical entry for a verb can be:<br />

-búa (speak)<br />

Lediri go búa le na le bokao jwa (Verb go búa has the sense):<br />

1. go búa - go dumisa mafoko ka maikaêlêlô a go itsese yo<br />

mongwe sengwe (utter words in order to let someone else<br />

know something)<br />

Malatodi: go didimala, go tuulala<br />

(Antonyms: keep quiet, to silence)<br />

The infinitive form of the verb can also be used as a noun. An extra entry should<br />

therefore be made under -búa to cater for this possibility, viz.<br />

Leina go búa le na le bokao jwa (Noun go búa has the sense):<br />

1. go búa - ntlha ya go dumisa mafoko ka maikaêlêlô a go itsese<br />

yo mongwe sengwe (the uttering of words in order to let<br />

someone else know something)<br />

Malatodi: go didimala, go tuulala<br />

(Antonyms: keeping quiet, silencing)<br />

2.3 Deverbatives<br />

As the formation of deverbatives (nouns from verb stems) is very common in the<br />

languages under discussion, we also need to include these elements under the verb<br />

stem, since they form part of the same lexeme. Since verb synsets are connected by a<br />

variety of lexical entailment pointers [2], the above entry can be extended by adding<br />

the deverbative forms as well, as illustrated in the following Setswana examples:



-búa<br />

Lediri go búa le na le bokao jwa (Verb go búa has the sense) :<br />

1. go búa - go dumisa mafoko ka maikaêlêlô a go itsese yo<br />

mongwe sengwe (utter words in order to let someone else<br />

know something)<br />

Malatodi: go didimala, go tuulala<br />

(Antonyms: keep quiet, to silence)<br />

Leina go búa le na le bokao jwa (Noun go búa has the sense) :<br />

1. go búa - ntlha ya go dumisa mafoko ka maikaêlêlô a go itsese<br />

yo mongwe sengwe (the uttering of words in order to let<br />

someone else know something)<br />

Malatodi: go didimala, go tuulala<br />

(Antonyms: keeping quiet, silencing)<br />

Leina mmúi le na le bokao jwa (Noun mmúi has the sense):<br />

1. mmúi - yo o dumisang mafoko ka maikaêlêlô a go itsese<br />

yo mongwe sengwe (person who utters words in<br />

order to let someone else know something)<br />

Letswa: -búa<br />

(Derivative: -búa)<br />

Leina mmúisi le na le bokao jwa (Noun mmúisi has the sense):<br />

1. mmúisi - yo o balang sengwe (reader of something)<br />

Letswa: -búisa<br />

(Derivative: -búisa)<br />

Leina mmúisiwa le na le bokao jwa (Noun mmúisiwa has the sense):<br />

1. mmúisiwa - yo go dumiswang mafoko ka maikaêlêlô a go<br />

itsese go ene (one to whom words are pronounced)<br />

Letswa: -búisiwa<br />

(Derivative: -búisiwa)<br />

Leina mmúêlêdi le na le bokao jwa (Noun mmúêlêdi has the sense):<br />

1. mmúêlêdi - yo o dumisang mafoko ka ntlha ya go buêlêla<br />

(one who utters words in order to speak for)<br />

Letswa: -búêlêla<br />

(Derivative: -búêlêla)<br />
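The Letswa (Derivative) lines above amount to derivational pointers from deverbative nouns back to the verb stems they derive from. A minimal sketch, assuming a plain dictionary-based representation (the linking structure is an illustrative assumption, not the project's actual data model):

```python
# Deverbative nouns point back to their source verb stems, mirroring the
# "Letswa" (Derivative) lines in the Setswana entry above.
derivative_of = {
    "mmúi": "-búa",          # speaker   < speak
    "mmúisi": "-búisa",      # reader    < -búisa
    "mmúisiwa": "-búisiwa",  # addressee < -búisiwa
    "mmúêlêdi": "-búêlêla",  # advocate  < -búêlêla
}

def deverbatives(stem, links):
    """All nouns whose derivational pointer leads directly to `stem`."""
    return sorted(noun for noun, s in links.items() if s == stem)

print(deverbatives("-búa", derivative_of))  # ['mmúi']
```

Traversing such pointers lets an application collect a whole derivational family under a single verb-stem lexeme.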

2.4 Heteronyms<br />

It should also be noted that we need to make a separate entry for heteronyms, i.e.<br />

partial homonyms, which are totally different lexemes. In this case it is a difference in<br />

tone. The entry for the heteronym is then:<br />

-bua



Lediri go bua le na le bokao jwa (Verb go bua has the sense):<br />

1. go bua - go tlosa letlalo ka go dirisa thipa (to take off skin by using a<br />

knife)<br />

Leina go bua le na le bokao jwa (Noun go bua has the sense):<br />

1. go bua - ntlha ya go tlosa letlalo ka go dirisa thipa (process of taking<br />

off skin by using a knife)<br />

From the above it becomes clear that the Setswana WordNet will follow a pattern<br />

of using different morphological elements as entries, i.e. linguistic words for nouns,<br />

stems for verbs and roots for adjectives and adverbs. As WordNets are organized<br />

entirely by meaning and not alphabetically, this will not cause any look-up<br />

difficulties.<br />
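The entry policy just summarized (linguistic words for nouns, stems for verbs, roots for adjectives and adverbs) can be sketched as a small dispatch table; the names and structure below are illustrative assumptions only, not the Setswana WordNet's actual schema:

```python
# Illustrative sketch of the Setswana citation-form policy described above.
ENTRY_UNIT = {
    "noun": "word",       # e.g. bontlê
    "verb": "stem",       # e.g. -búa
    "adjective": "root",  # e.g. -ntlê
    "adverb": "root",     # e.g. -ntlê
}

def entry_form(pos, word=None, stem=None, root=None):
    """Select the citation form for a lemma according to its part of speech."""
    unit = ENTRY_UNIT[pos]
    return {"word": word, "stem": stem, "root": root}[unit]

print(entry_form("noun", word="bontlê", root="-ntlê"))       # bontlê
print(entry_form("adjective", word="bontlê", root="-ntlê"))  # -ntlê
```

The same underlying root thus surfaces as different citation forms depending on word class, while remaining one lexeme in the net.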

As stated above all Setswana words can be accommodated within WordNet's four<br />

prescribed word classes, i.e. nouns, adjectives, verbs and adverbs. All so-called<br />

'qualificatives', i.e. noun qualifiers, can be accommodated under nouns and adjectives.<br />

All verbal qualifiers can feature under verbs and adverbs. Some problems may be<br />

encountered, but none seems to be unsolvable.<br />

3 Prior work<br />

Traditionally, word categories in the Bantu languages are grouped together according<br />

to their function in the sentence and their grammatical relationship to one another.<br />

Cole [3] and Doke [4] distinguish six major categories for Setswana and IsiZulu<br />

respectively, viz. Substantives (Subjects and Objects), Qualificatives (Nominal<br />

modifiers), Predicatives (Verbs and Copulatives), Descriptives (Adverbs and<br />

Ideophones), Conjunctives (Connectives), and Interjectives (Exclamations). This<br />

classification is based on grammatical features and falls away the moment we<br />

consider language units in terms of lexemes.<br />

Substantives and Qualificatives are all nominal and can feature under nouns and<br />

adjectives with the derived forms featuring under verbs and/or adverbs. The<br />

categories for verbs and adverbs will logically accommodate verbs and adverbs.<br />

Copulatives are non-verbal structures and need not be accommodated since their<br />

existence is based on structure.<br />

Doke [5] proposed the term "ideophone" for a part of speech which describes a<br />

predicate, qualificative or adverb in respect to manner, colour, sound, smell, action,<br />

state or intensity. In contrast to the linguistic word in the Bantu languages, which is<br />

characterised by a number of morphemes such as prefixes and suffixes, as well as a<br />

root or stem, the ideophone consists only of a root which simultaneously functions as<br />

a stem and a fully-fledged word. This is illustrated in the following IsiZulu examples:<br />

Bathula bathi du ("They kept completely quiet")<br />

Ingilazi iwe yathi phahla phansi ("The glass fell smashing on the floor")<br />

Amanzi abomvu klubhu! ("The water is as red as blood")



As lexemes, ideophones are descriptive of sound, colour, smell, manner,<br />

appearance, state, action or intensity. The majority of ideophones are therefore<br />

adverbial and will feature under the specific adverbial root. Where they indicate<br />

colour, for instance, they will feature under adjectives.<br />

Most conjunctives are derived forms and will therefore feature under the category<br />

from which they are derived. Since we see most so-called conjunctives as adverbials<br />

anyway [6], only a few remain as true conjunctions in Setswana. These elements can<br />

however also be accommodated as adverbials since they generate some modifying<br />

meaning. By far the majority of interjectives are only nouns or pronouns being used<br />

vocatively or verbs being used imperatively. The rest are expressive of some aspect<br />

and can be accommodated under the specific lexeme expressing that particular aspect,<br />

e.g. cold.<br />

4 Examples of African Languages WordNets<br />

Examples of noun and verb synsets in isiXhosa and isiZulu are given below. Table 1<br />

shows how the single concept expressed by English vehicle corresponds to two<br />

concepts (synsets) with distinct word forms in isiXhosa; the difference is revealed in<br />

the definitions, "wheeled vehicle" vs. "vehicle used for transporting people and<br />

goods." For each concept, hyponyms are grouped according to the mode of movement<br />

(vehicles that travel by air, road, water, rail, etc.). The second definition refers to a<br />

vehicle without wheels, with one example provided. Table 1 also shows meronyms<br />

(engine, tyres and wheels) linked to vehicle by the part-whole relation familiar from<br />

other WordNets.<br />

Importantly, the isiXhosa and isiZulu synsets are connected to one another, so that<br />

corresponding words and synsets are given, allowing a direct comparison of<br />

corresponding and distinct concepts and lexicalizations.<br />

Verb synsets are connected by a variety of lexical entailment pointers [2]. Tables 2<br />

and 3 show the verbs corresponding to English walk and put along with their<br />

troponyms (manner subordinates). For example, -gxanyaza refers to walking in a<br />

certain manner (fast with fairly long strides). It is interesting to note that English also<br />

has many verbs expressing manners of walking, but they do not always match those in<br />

the African languages.<br />

5 Software<br />

African WordNets will use the editing tool DebVisDic, freeware multilingual<br />

software, designed for the development, maintenance and efficient exploitation of the<br />

aligned WordNets [7]. Initial experiments showed it to be well suited and adaptable to<br />

the construction of African Languages WordNets.



Table 1. IsiXhosa and isiZulu Noun Synsets<br />

ISIXHOSA<br />

ISIZULU<br />

inqwelo (vehicle)<br />

isithuthi (vehicle)<br />

Def. 1. isithuthi esinamavili sokuthutha<br />

abantu nempahla<br />

(a vehicle with wheels for transporting people<br />

and goods)<br />

Ezomoya (air)<br />

inqwelomoya (aeroplane)<br />

inqwelontaka (helicopter)<br />

ijethi (jet)<br />

Ezendlela (road)<br />

imoto / ikari (car)<br />

ibhasi / udula-dula (bus)<br />

itrakhi (truck)<br />

ilori (lorry)<br />

iveni (van/ bakkie)<br />

itekisi (taxi)<br />

ibhayisekile (bicycle)<br />

isithuthuthu (motorbike)<br />

Ezesiporo (rail)<br />

uloliwe / ujujuju (train)<br />

igutsi (goods train)<br />

utramu (tram)<br />

Isithuthi sasemoyeni (air)<br />

indiza (aeroplane)<br />

ibhanoyi (aeroplane)<br />

Isithuthi sasemgaqweni (road)<br />

isithuthuthu (motorbike)<br />

imoto (car)<br />

Isithuthi sikajantshi (rail)<br />

ingolovane (cocopan)<br />

isitimela (train)<br />

Ezasemanzini (water)<br />

inqanawa (ship)<br />

inkwili (submarine)<br />

isikhephe (boat)<br />

Isithuthi sasemanzini (water)<br />

umkhumbi (boat)<br />

umkhumbingwenya (submarine)<br />

Meronym (part-whole relation)<br />

injini (engine), ivili lokuqhuba (steering wheel),<br />

amavili (wheels),<br />

Def. 2. isithuthi esingenamavili sokuthutha<br />

abantu nempahla<br />

(a vehicle without wheels for transporting<br />

people and goods)<br />

isileyi (sledge)



Table 2. IsiXhosa verb –hamba<br />

ISIXHOSA<br />

-hamba (walk)<br />

-gxanyaza (walk fast with fairly long strides)<br />

-cotha (walk slowly)<br />

-khasa (crawl)<br />

-yantaza (walk aimlessly)<br />

-thwakuza (walk aimlessly)<br />

-nyalasa (walk boldly)<br />

-ndolosa (walk proudly)<br />

-khawuleza (walk fast)<br />

-qhwalela (limp)<br />

Table 3. IsiXhosa and IsiZulu verb -beka<br />

ISIXHOSA<br />

-beka (put)<br />

ISIZULU<br />

-beka/ faka (put)<br />

-gcina (keep)<br />

-londoloza (save, e.g. save money)<br />

-thwala (put on)<br />

-ngcwaba (bury someone)<br />

-fulela (put a roof on)<br />

-gqoka (wear)<br />

-emboza (cover)<br />

-thwala (put on)<br />

-endlala (make)<br />

6 Future work and conclusion<br />

It should be clear from the above discussion that the intention with the African<br />

Languages WordNet is not to just produce another wordlist or a conventional<br />

dictionary based on grammatical analysis. With the lexeme and not the word as our<br />

point of departure we hope to produce a unique WordNet for African languages,<br />

driven by 'meaning' and not by copying existing WordNets. It is our belief that<br />

'meaning' can only come from within the language and cannot be interpreted in terms<br />

of another language.<br />

The long term aim of this project is the development of aligned WordNets for<br />

African languages spoken in South Africa (i.e. languages belonging to the Bantu



language family) as multilingual knowledge resources which could be extended to<br />

include a wide variety of related languages from other parts of Africa. Such research<br />

and development would depend on the commitment of researchers to continue the<br />

work begun with great enthusiasm, the co-operation of numerous language<br />

institutions, the availability of a variety of language resources as well as further<br />

financial support following the seed research funding.<br />

References<br />

1. University of Pretoria Department of African Languages:<br />

http://www.up.ac.za/academic/humanities/eng/eng/afrlan/eng/initiative.htm<br />

2. Fellbaum, C. (ed.): WordNet. An Electronic Lexical Database. The MIT Press, Cambridge<br />

(1998)<br />

3. Cole, D.T.: An Introduction to Tswana grammar. Longmans, Green and Co., Ltd, London,<br />

Cape Town, New York (1955)<br />

4. Doke, C. M.: Textbook of Zulu Grammar. University of the Witwatersrand Press,<br />

Johannesburg (1973)<br />

5. Doke, C. M.: Bantu Linguistic Terminology. Longmans, London (1935)<br />

6. Quirk, R., Greenbaum, S., Leech, G., Svartvik, J.: A Comprehensive Grammar of the English<br />

language. Longman, London; New York (1985)<br />

7. DEBVisDic Manual. http://nlp.fi.muni.cz/trac/deb2/wiki/DebVisDicManual<br />

8. Snyman, J.W., Shole, J.S., Le Roux, J.C.: Dikišinare ya Setswana English Afrikaans<br />

Dictionary Woordeboek. Via Afrika Limited, Pretoria (1990)


Towards an Integrated OWL Model for Domain-<br />

Specific and General Language WordNets<br />

Harald Lüngen 1, Claudia Kunze 2 , Lothar Lemnitzer 2 , and Angelika Storrer 3<br />

1 Justus-Liebig-Universität Gießen, 2 University of Tübingen,<br />

3 University of Dortmund<br />

luengen@uni-giessen.de, kunze@sfs.uni-tuebingen.de,<br />

angelica.storrer@uni-dortmund.de, lothar@sfs.uni-tuebingen.de<br />

Abstract. This paper presents an approach to integrate the general language<br />

WordNet GermaNet with TermNet, a German domain-specific ontology. Both<br />

resources are represented in the Web Ontology Language OWL. For GermaNet,<br />

we adopted the OWL model suggested by van Assem et al. [3] for the Princeton<br />

WordNet, for TermNet we developed a slightly different model better suited to<br />

terminologies. We will show how both resources can be inter-related using the<br />

idea of plug-in relations (as proposed by Magnini and Speranza 2002). In contrast<br />

to earlier plug-in approaches, our method of connecting general language<br />

WordNets with domain-specific terminology does not impose changes on the<br />

structure of these two types of lexical representations. We therefore consider<br />

our proposal to be a step towards the interoperability of lexical-semantic resources.<br />

Keywords. WordNets; GermaNet; OWL; terminology<br />

1 Introduction<br />

WordNets (like the Princeton WordNet, cf. [1]) have been used in various applications<br />

of text processing, information retrieval, and information extraction. When these<br />

applications process documents dealing with a specific domain, one needs to combine<br />

knowledge about the domain-specific vocabulary represented in domain ontologies<br />

with lexical repositories representing general language vocabulary. In this context, it<br />

is useful to represent and inter-relate the entities and relations in both types of resources<br />

using a common representation language. In this paper we discuss an integrated<br />

representation model for domain-specific and general language resources using<br />

the Web Ontology Language OWL. The model was tested by relating entities of the<br />

German WordNet GermaNet to corresponding entities of the German domain ontology<br />

TermNet [2].<br />

In Section 3, the main characteristics of these two resources are described. We built<br />

on the W3C approach to convert Princeton WordNet in RDF/OWL [3] and adapted<br />

them to GermaNet. For the domain ontology TermNet a different model was developed.<br />

The main classes and properties of both models are discussed in Section 4. The<br />

focus of this paper is on the question how the entities of the two OWL models — the<br />

model of the general language WordNet GermaNet and the model of the domain ontology<br />

TermNet — can be linked in a principled fashion. For this purpose, we defined


282 Harald Lüngen, Claudia Kunze, Lothar Lemnitzer, and Angelika Storrer<br />

OWL-properties that relate entities of the two lexical resources, following the basic<br />

idea of the so-called plug-in approach by [4] for linking general language with domain-specific<br />

WordNets. Section 5 discusses the plug-in approach and our adaptation<br />

of it with reference to appropriate examples from GermaNet and TermNet. With our<br />

work, we aim at contributing to the emergent issue of interoperability between language<br />

resources.<br />

2 Related Work<br />

The work presented in this paper is inspired by the plug-in approach, which was developed<br />

in the context of ItalWordNet [5] and was originally proposed by [4]. However,<br />

rather than focusing on the processing aspects of the original method, in the present<br />

study we propose a declarative model of interlinking general language with<br />

domain-specific WordNets from the perspective of explicitly defined plug-in relations,<br />

which differ slightly from the ones proposed by Magnini and Speranza [4].<br />

These relations allow for connecting specific terms with appropriate concepts, but do<br />

not modify the original resources and concepts.<br />
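The declarative flavour of such plug-in relations can be sketched with plain subject/predicate/object triples; every identifier below (the property name, the synset and term ids) is a hypothetical illustration and does not reflect the OWL vocabulary the authors actually define:

```python
# Hedged sketch: plug-in links between a general-language wordnet and a
# domain terminology as RDF-style triples, leaving both resources unmodified.
GN = "germanet:"   # general-language resource (ids hypothetical)
TN = "termnet:"    # domain-specific resource (ids hypothetical)

triples = {
    (TN + "Verknuepfung", "plugin:correspondsTo", GN + "Verbindung"),
    (TN + "Verweis",      "plugin:correspondsTo", GN + "Verbindung"),
}

def plugged_in(synset, triples):
    """Domain entities linked via a plug-in property to a given synset."""
    return sorted(s for s, p, o in triples if o == synset)

print(plugged_in(GN + "Verbindung", triples))
# ['termnet:Verknuepfung', 'termnet:Verweis']
```

The point of the design is that the triples live alongside the two resources rather than inside either one, so neither structure is altered.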

Subsequent applications of the plug-in approach, like ArchiWordNet [6] or Jur-<br />

WordNet [7], implement plug-in relations for extending generic resources with domain<br />

terms from a processing perspective. The procedures lead to merged concepts<br />

and additional features being integrated into or added to the original databases.<br />

De Luca and Nürnberger [8] describe an approach that relates an OWL representation<br />

of EuroWordNet to an OWL representation of domain terms. In their approach,<br />

terms are directly mapped onto synsets without any reference to intermediate relations.<br />

By defining distinct OWL plug-in properties our model aims to capture, in addition,<br />

different types of semantic correspondence between general language and domain-specific<br />

concepts.<br />

In Vossen’s [9] approach, WordNet is adapted to the field and the needs of a specific<br />

organisation by extending it to include domain-specific vocabulary and removing<br />

concepts (and thus word senses) that are irrelevant for the organisation. In contrast,<br />

both the plug-in approach and the approach introduced in this paper are neutral with<br />

respect to the question whether a global ontology is extended by a specialised ontology<br />

or the other way around. Furthermore, the plug-in approach and the present approach<br />

do not address the question of how to automatically derive a domain ontology<br />

from a text collection; they are applicable to both automatically derived ontologies<br />

and hand-crafted ones. Moreover, Vossen’s [9] approach is procedural, meaning that<br />

its focus is on the specification of an extraction and integration algorithm, whereas the<br />

aim of the present paper is to declaratively model and specify the relational structure<br />

of the interface between a general and a domain-specific ontology in a formal language,<br />

i.e. the Semantic Web Ontology Language OWL.<br />

3 Lexical and Terminological Resources<br />

Our approach was developed and tested using an OWL model for a representative<br />

subset of the German WordNet GermaNet and an OWL model for the German termi-


Towards an Integrated OWL Model for Domain-Specific and… 283<br />

nological WordNet TermNet. In this section we outline the main characteristics of<br />

GermaNet and TermNet.<br />

3.1 Characteristics of GermaNet<br />

GermaNet is a lexical-semantic WordNet for German which has been developed<br />

along the lines of the Princeton WordNet [1], covering the most important and frequent<br />

general language concepts and lexical-semantic relations holding between the<br />

concepts and lexical units represented, like hyponymy, meronymy and antonymy<br />

[10]. As is typical of WordNets, the central unit of representation is the synset, which<br />

comprises all synonyms or lexical units of a given concept. GermaNet presently covers<br />

more than 53 000 synsets with some 76 000 lexical units, among them nouns,<br />

verbs and adjectives. A basic subset of GermaNet (15 000 concepts) has been integrated<br />

into the polylingual EuroWordNet database [11]. The following features distinguish<br />

GermaNet from the data model of the Princeton WordNet, version 2.0:<br />

1. The use of so-called artificial, non-lexicalised concepts, in order to achieve<br />

well-formed taxonomic hierarchies. For example, the artificial concept<br />

Schultyplehrer (‘school type teacher’) has been introduced to act as a hyper(o)nym<br />

of the lexicalised concepts Grundschullehrer (‘primary school<br />

teacher’), Realschullehrer (‘secondary school teacher’), Berufsschullehrer<br />

(‘vocational school teacher’) etc.;<br />

2. Named entities are explicitly marked. Proper names in GermaNet primarily<br />

occur in the geographic domain; 1<br />

3. In GermaNet, the taxonomic approach is also applied to the representation of<br />

adjectives, as opposed to WordNet's satellite approach (based upon the notion<br />

of similarity with regard to different adjective clusters);<br />

4. Meronymy is deemed a generic relation in GermaNet;<br />

5. GermaNet verbs are provided with an exhaustive list of sub-categorisation<br />

frames and example phrases.<br />

The data model of GermaNet is depicted in Fig. 1 as an entity-relationship graph.<br />

This model guided the conversion process of GermaNet objects and relations into<br />

XML elements and attributes.<br />

1 In version 2.1 of WordNet, however, over 7,600 synsets were manually classified as instances<br />

and tagged as such (cf. [12]).



Fig. 1. Entity-relationship diagram for GermaNet<br />

3.2 Characteristics of TermNet<br />

TermNet is a lexical resource that was developed in a project on automated<br />

text-to-hypertext conversion (cf. [13]). TermNet represents more than 400 German technical<br />

terms occurring in a corpus with documents in the domains “text-technology” and<br />

“hypertext research.” Most terms are noun terms, including multiword terms composed<br />

of a noun and an adjective modifier such as bidirektionaler Link (engl. ‘bidirectional<br />

link’). The entities and relations introduced for the Princeton WordNet [1] are<br />

fundamental for the structure of TermNet. The two basic entities of the TermNet<br />

model are terms (the analogue to word/lexical unit in the WordNet model) and termsets<br />

(the analogue to synsets in the WordNet model). Terms in TermNet are lexical<br />

units for which the technical meaning is explicitly defined in the documents of our<br />

corpus. Termsets contain technical terms that denote the same or a quite similar topic<br />

in different approaches to a given domain (cf. [14]). Terms are related by lexical relations,<br />

e.g. isAbbreviationOf, and termsets are related by conceptual relations, e.g.<br />

isHyponymOf, isMeronymOf. The data model of TermNet is illustrated by the<br />

ER-diagram in Fig. 2.<br />

For automated hyperlinking, and probably for other applications, it is useful to<br />

know that term A occurring in document X denotes a category similar to the one denoted<br />

by term B occurring in document Y. Unlike other standards and proposals for<br />

representing thesauri (e.g. [15, 16, 17]), TermNet focuses on the representation of semantic<br />

correspondences between terms defined in different taxonomies or in competing<br />

scientific schools. Since competing taxonomies or schools may all have their<br />

benefits, we do not want to decide which terminology is to be preferred. Thus, the<br />

current TermNet model deliberately does not label terms as “preferred term.”<br />

Since the entity type TermSet is crucial for the purpose of representing semantic<br />

correspondences between technical terms defined in competing schools, we want to<br />

explain the idea behind it using an example from German hypertext terminology:<br />

Kuhlen [18] and Tochtermann [19] both introduced a terminology for hypertext concepts<br />

that influenced the usage of technical terms in German publications on hypertext


Towards an Integrated OWL Model for Domain-Specific and… 285<br />

research. Both authors provide definitions for the concept hyperlink and specify a taxonomy<br />

of subclasses (external link, bidirectional link etc.). But Kuhlen uses the term<br />

Verknüpfung in his taxonomy (extratextuelle Verknüpfung, bidirektionale Verknüpfung)<br />

while in Tochtermann’s taxonomy the term Verweis is used (with subclasses<br />

like externer Verweis, bidirektionaler Verweis). The definitions of the concepts and<br />

subconcepts given by these authors are slightly different, and the two taxonomies are<br />

not isomorphic. As a consequence, in a scientific document on the subject domain, a<br />

term from the Kuhlen taxonomy cannot be replaced by the corresponding term from<br />

the Tochtermann taxonomy. After all, the purpose of defining terms is exactly to bind<br />

their word forms to the semantics specified in the definition. The usage of technical<br />

terms in documents may then serve to indicate the theoretical framework or scientific<br />

school to which the paper belongs. In our OWL model of TermNet, on the one hand<br />

we represent relations between terms of the same taxonomy, on the other hand we<br />

capture categorial correspondences between terms of competing taxonomies by assigning<br />

similar terms to the same termset.<br />

Fig. 2. Entity-relationship diagram for TermNet (entities: Term and Termset; attributes<br />

include ID, POS, domain, definition, and orthographic variant; relations: member,<br />

subclass, LSR between terms, and CR between termsets)<br />

4 OWL Models of GermaNet and TermNet<br />

The Web Ontology Language OWL was created by the W3C Ontology Working<br />

Group as a standard for the specification of ontologies in the context of the Semantic<br />

Web. OWL comprises the three sublanguages OWL Light, OWL DL, and OWL Full,<br />

which differ in their expressivity. An ontology in the sublanguage OWL DL can be


286 Harald Lüngen, Claudia Kunze, Lothar Lemnitzer, and Angelika Storrer<br />

interpreted according to description logics (DL), and DL-based reasoning software<br />

(e.g. RacerPro 2 or Pellet 3 ) can be applied to check its consistency or to draw inferences<br />

from it. To take advantage of this, our OWL models of GermaNet, TermNet<br />

and the plug-in structure all remain within the OWL DL dialect.<br />

Several approaches to convert PWN to OWL and to make it available for Semantic<br />

Web applications exist (e.g. [20, 21, 3]). In all these, the individual synsets and lexical<br />

units are rendered as instances of the OWL ontology. Although alternative modelling<br />

options have been discussed (cf. [22]), in the present project we adhere to an instance<br />

model as proposed by [3].<br />

4.1 GermaNet OWL Model<br />

In our OWL model, sets of GN concepts are represented as classes (owl:Class),<br />

while the properties of and relations between concepts are represented as OWL properties<br />

(owl:ObjectProperty or owl:DatatypeProperty) of these classes. For the<br />

two basic objects in the E-R-model of GN (Fig. 1), the classes Synset and LexicalUnit<br />

are introduced. Following the W3C model for PWN [3], we introduce NounSynset,<br />

VerbSynset, AdjectiveSynset, and AdverbSynset as immediate subclasses of Synset, as<br />

well as NounUnit, VerbUnit, AdjectiveUnit, and AdverbUnit as immediate subclasses<br />

of LexicalUnit.<br />
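In OWL notation, these class declarations might look roughly as follows (a sketch only; the rdf:ID spellings are our assumption, not necessarily the identifiers used in the distributed model):<br />

```xml
<!-- Sketch (assumed identifiers): POS-based subclasses of Synset and LexicalUnit -->
<owl:Class rdf:ID="Synset"/>
<owl:Class rdf:ID="NounSynset">
  <rdfs:subClassOf rdf:resource="#Synset"/>
</owl:Class>
<owl:Class rdf:ID="LexicalUnit"/>
<owl:Class rdf:ID="NounUnit">
  <rdfs:subClassOf rdf:resource="#LexicalUnit"/>
</owl:Class>
```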

Table 1. Features of OWL object properties for GermaNet<br />

Property | Domain | Range | Characteristics | Inverse Property | Local Restrictions<br />
hasExample | Synset | Example | - | - | -<br />
Conceptual Relations (CR)<br />
hasMember | Synset | LexicalUnit | inverse-functional | memberOf | POS-based<br />
memberOf | LexicalUnit | Synset | functional | hasMember | POS-based<br />
isHyperonymOf | Synset | Synset | transitive | isHyponymOf | POS-based<br />
isHyponymOf | Synset | Synset | transitive | isHyperonymOf | POS-based<br />
isHolonymOf | NounSynset | NounSynset | - | - | -<br />
isMeronymOf | NounSynset | NounSynset | - | - | -<br />
isAssociatedWith | Synset | Synset | - | - | -<br />
entails | VerbSynset | VerbSynset | - | - | -<br />
causes | VerbSynset | VerbSynset ∪ AdjectiveSynset | - | - | -<br />
Lexical-semantic Relations (LSR)<br />
hasAntonym | LexicalUnit | LexicalUnit | symmetric | hasAntonym | POS-based<br />
hasPertainym | LexicalUnit | LexicalUnit | - | - | -<br />
isParticipleOf | VerbUnit | VerbUnit | - | - | -<br />

2 cf. http://www.racer-systems.com<br />

3 cf. http://pellet.owldl.com



For modelling the lexicalisation relation between synsets and lexical units, an<br />

OWL Object Property called hasMember with domain Synset and range LexicalUnit<br />

is introduced. For each POS-based subclass of Synset (e.g. NounSynset), a restriction<br />

of the range of hasMember to the corresponding subclass of LexicalUnit (e.g.<br />

NounUnit) is encoded using owl:allValuesFrom.<br />
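Such a local range restriction might be sketched as follows (identifiers assumed):<br />

```xml
<!-- Sketch: NounSynset locally restricts the range of hasMember to NounUnit -->
<owl:Class rdf:about="#NounSynset">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember"/>
      <owl:allValuesFrom rdf:resource="#NounUnit"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
```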

OWL is particularly well-suited to model the two basic relation types CR and LSR.<br />

Both types hold between internally defined classes and thus correspond to object<br />

properties in OWL. Like classes, properties can be arranged in a hierarchy in OWL<br />

<owl:ObjectProperty rdf:ID="isHyperonymOf"><br />
  <rdfs:subPropertyOf rdf:resource="#conceptualRelation"/><br />
  <rdf:type rdf:resource="&owl;TransitiveProperty"/><br />
  <rdfs:domain rdf:resource="#Synset"/><br />
  <rdfs:range rdf:resource="#Synset"/><br />
  <owl:inverseOf rdf:resource="#isHyponymOf"/><br />
</owl:ObjectProperty><br />

Listing 1: OWL code for the introduction of hypernymy<br />

using the rdfs:subPropertyOf construct. Our model thus contains two top-level object<br />

properties, conceptualRelation (with domain and range = Synset) and lexicalSemanticRelation<br />

(with domain and range = LexicalUnit). The OWL characteristics of<br />

their respective subproperties are shown in Table 1. Hypernymy, for example, is encoded<br />

as an owl:ObjectProperty called isHyperonymOf with domain and range<br />

= Synset, as an immediate subproperty of conceptualRelation, and as the inverse<br />

property of hyponymy, cf. Listing 1.<br />

Similar to hasMember, for each POS-based subclass of Synset, the range of<br />

isHyperonymOf is restricted to synsets of the same subclass. Relations that do not<br />

hold between internally defined classes, but in which a range in the form of an XML<br />

Schema data type like string or boolean is assigned to an internal class, are modelled<br />

as OWL datatype properties. In the case of GN, they are obviously the ones that are<br />

represented as ellipses in the E-R model of GN (Fig. 1). Table 2 contains a survey of<br />

datatype properties in the OWL model of GN with their respective domain, range and<br />

function status.<br />
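A single datatype property from this survey might be declared as follows (a sketch; we assume the usual XML entity references for the owl: and xsd: namespaces are declared in the ontology header):<br />

```xml
<!-- Sketch: functional datatype property hasOrthographicForm -->
<owl:DatatypeProperty rdf:ID="hasOrthographicForm">
  <rdf:type rdf:resource="&owl;FunctionalProperty"/>
  <rdfs:domain rdf:resource="#LexicalUnit"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>
```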

Table 2. Features of OWL datatype properties for GermaNet<br />

Property | Domain | Range | Functional<br />
POS | Synset | "N"|"V"|"A"|"ADV" | yes<br />
hasParaphrase | Synset | xs:string | no<br />
isArtificial | Synset ∪ LexicalUnit | xs:boolean | (yes)<br />
isProperName | NounSynset ∪ NounUnit | xs:boolean | (yes)<br />
hasOrthographicForm | LexicalUnit | xs:string | yes<br />
hasSenseInt | LexicalUnit | xs:positiveInteger | yes<br />
isStylisticallyMarked | LexicalUnit | xs:boolean | (yes)<br />
hasFrame | VerbUnit ∪ Example | xs:string | no<br />
hasText | Example | xs:string | yes



A subset of GermaNet (54 synset and 104 lexical unit instances including all conceptual<br />

and lexical-semantic relations holding between them) has been encoded in<br />

OWL according to the model presented above, using the Protégé ontology editor 4 .<br />

The GermaNet subset contains most of the candidate synsets for plugging in TermNet<br />

terms. Furthermore, this exemplary subset contains at least one instance of each conceptual<br />

and each lexical-semantic relation. We employed the reasoning software<br />

RacerPro to ensure its consistency within OWL DL. An automatic conversion of the<br />

complete GermaNet 5.0 is under way.<br />
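Instance data in this model might then look roughly like the following sketch (the individual names and the literal are invented for illustration):<br />

```xml
<!-- Hypothetical instance data: one noun synset with one member lexical unit -->
<NounSynset rdf:ID="synset_Link">
  <hasMember>
    <NounUnit rdf:ID="lexUnit_Link">
      <hasOrthographicForm rdf:datatype="&xsd;string">Link</hasOrthographicForm>
    </NounUnit>
  </hasMember>
</NounSynset>
```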

4.2 TermNet OWL Model<br />

The complete TermNet in its OWL representation contains 425 technical terms and<br />

206 termsets. In the OWL model we define all terms as classes, the instances of which<br />

are those objects in the real world that are denoted by the respective terms (e.g., an instance<br />

of the term externer Verweis is a concrete hyperlink in a hyperdocument compliant<br />

with Tochtermann’s definition of this term). Since we only account for nominal<br />

terms, all terms are subclasses of the superclass NounTerm. We use the<br />

rdfs:subClassOf property to relate narrower terms to broader terms within the same<br />

taxonomy (e.g., we define Kuhlen’s term extratextuelle Verknüpfung as a subclass of<br />

<owl:Class rdf:ID="Term_Verweis"><br />
  <rdfs:subClassOf rdf:resource="#NounTerm"/><br />
  <rdfs:subClassOf><br />
    <owl:Restriction><br />
      <owl:onProperty rdf:resource="#isMemberOf"/><br />
      <owl:allValuesFrom rdf:resource="#Termset_Link"/><br />
    </owl:Restriction><br />
  </rdfs:subClassOf><br />
</owl:Class><br />

Listing 2: OWL code for the assignment of terms to termsets<br />

his broader term Verknüpfung). By modelling terms as classes we benefit from the<br />

mechanism of feature inheritance tied to the predefined rdfs:subClassOf property.<br />

In addition, we are able to represent disjointness between classes using the OWL<br />

owl:disjointWith construct. By defining that the sets of instances denoted by the<br />

terms externer Verweis and interner Verweis are disjoint, we make sure that a link object<br />

in a document can only be assigned to one of these classes. In other words, a link<br />

object can either be an instance of the class externer Verweis or an instance of the<br />

class interner Verweis (although it may quite well be an instance of both externer<br />

Verweis and bidirektionaler Verweis). Terms of competing taxonomies that represent<br />

similar categories (like externer Verweis and extratextuelle Verknüpfung from the example<br />

in Sect. 3.2) are assigned to the same termset. For this purpose termsets are defined<br />

as subclasses of the superclass NounTermSet. Terms are assigned to termsets using<br />

4 cf. http://protege.stanford.edu<br />

the object property tn:isMemberOf (with NounTerm as domain and NounTermSet<br />

as range). The inverse property is tn:hasMember. Since termsets and terms are modelled<br />

as classes, we cannot simply adopt the definition of the gn:memberOf object<br />

property specified in the GermaNet OWL model (cf. Sect. 4.1). Instead, we had to<br />

use the owl:allValuesFrom restriction to assign all instances of a term class to the<br />

respective termset class. Listing 2 illustrates how the term Verweis is assigned to the<br />

termset Termset_Link (which comprises other terms like Verknüpfung, Link, Hyperlink,<br />

Kante etc.).<br />
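The disjointness statements described above might be sketched as follows (class names are our assumption):<br />

```xml
<!-- Sketch: sibling terms of one taxonomy declared pairwise disjoint -->
<owl:Class rdf:about="#Term_ExternerVerweis">
  <owl:disjointWith rdf:resource="#Term_InternerVerweis"/>
</owl:Class>
```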

In addition to the taxonomic relations specified between terms of the same taxonomy<br />

by means of the rdfs:subClassOf property, we also represent hierarchical relations<br />

between termsets, e.g. we want to account for the fact that all terms assigned to<br />

the termset TermSet_Link have a broader meaning than the terms assigned to the<br />

termset TermSet_Monodirektionaler_Link. For this purpose, we defined the<br />

property tn:isHypernymOf, which relates termsets containing broader terms to termsets<br />

containing more specific terms. Its inverse property is isHyponymOf. Listing 3<br />

demonstrates how the more specific termset TermSet_Monodirektionaler_Link is defined<br />

to be a hyponym of the broader termset TermSet_Link by means of the property<br />

isHyponymOf and an owl:allValuesFrom restriction.<br />

Table 3. Features of OWL object properties for TermNet<br />

Property | Domain | Range | Characteristics | Inverse Property<br />
hasMember | NounTermSet | NounTerm | inverse-functional | isMemberOf<br />
isMemberOf | NounTerm | NounTermSet | functional | hasMember<br />
Relations between termsets<br />
isHypernymOf | NounTermSet | NounTermSet | transitive | isHyponymOf<br />
isHyponymOf | NounTermSet | NounTermSet | transitive | isHypernymOf<br />
isHolonymOf | NounTermSet | NounTermSet | - | -<br />
isMeronymOf | NounTermSet | NounTermSet | - | -<br />
Relations between terms<br />
isAbbreviationOf | NounTerm | NounTerm | - | isExpansionOf<br />
isExpansionOf | NounTerm | NounTerm | - | isAbbreviationOf<br />

<owl:Class rdf:ID="TermSet_Monodirektionaler_Link"><br />
  <rdfs:subClassOf rdf:resource="#NounTermSet"/><br />
  <rdfs:subClassOf><br />
    <owl:Restriction><br />
      <owl:onProperty rdf:resource="#isHyponymOf"/><br />
      <owl:allValuesFrom rdf:resource="#TermSet_Link"/><br />
    </owl:Restriction><br />
  </rdfs:subClassOf><br />
</owl:Class><br />

Listing 3: OWL code for the isHyponymOf relation between termsets<br />

The object properties isMeronymOf and isHolonymOf were introduced to account<br />

for part-whole relations between objects denoted by the terms of two termsets. The<br />

property isAbbreviationOf relates short terms to their expanded forms within the same<br />

taxonomy. Table 3 provides an overview of the properties defined in the OWL TermNet<br />

model. The rdfs:subClassOf property between terms of the same taxonomy is<br />

not included in this overview because its semantics is predefined.<br />

5 Representing Plug-in Relations in OWL<br />

Providing domain-specific extensions for general language resources in order to capture<br />

and exploit the respective advantages of both resource types in natural language<br />

processing and semantic web applications has been discussed in the approaches by [9]<br />

and [4].<br />

Vossen [9] describes a procedure to extract a hierarchy of terms (called “topics”)<br />

from a document collection, e.g. the set of all documents used in a specific organisation,<br />

and to subsequently combine it with WordNet. This is achieved by merging topics<br />

from the extracted hierarchy with matching WN concepts. The kind of matching<br />

criterion used is not specified; from the examples given one can assume that simple<br />

string matching is applied. Similar to the plug-in approach, one of the features of the<br />

resulting hierarchy is that the lower levels of the WN hierarchy and the possible upper<br />

levels of the terminological hierarchy are discarded. Vossen's procedure only identifies<br />

plug-in synonymy and plug-in near synonymy, which are not differentiated in the<br />

new hierarchy.<br />

The resulting hierarchy is subsequently trimmed by automatically removing those<br />

concepts that are irrelevant in the domain of the document collection, i.e. removing<br />

unwanted sense ambiguities that were introduced by the merger of the two resources.<br />

Finally, a procedure to fuse the compositional hierarchy with a so-called “private” or<br />

“personal” ontology, which apparently is a more domain-specific upper level ontology<br />

designed for the organisation and its document collection, is presented. For the<br />

fusing procedure, an “interface level” with matching concepts or topics from the<br />

source and target hierarchies seems to be externally defined, i.e. criteria other than<br />

string matching could potentially be applied. In this step, subtrees of the combined hierarchy<br />

are placed under the interface nodes of the private ontology; thus, it can be<br />

regarded as another instance of merging a global with a specialised ontology.<br />

Whereas Vossen first builds ad-hoc terminologies from large document collections<br />

using information retrieval and term extraction methods and then links the resulting<br />

terms to WordNet synsets, Magnini and Speranza have proposed the plug-in approach<br />

which serves to link two (independently) existing resources of different types, namely<br />

the general-language ItalWordNet (IWN) and the specialised ontology ECOWN from<br />

the economic domain. Plug-in is a special instance of ontology merging, which is<br />

normally concerned with aligning resources of the same type.<br />

Various kinds of plug-in relations serve to combine the relevant synsets of both resources.<br />

The plug-in approach yields a common hierarchy in which the top concepts<br />

of the specialised ontology are “eclipsed” while the subordinate concepts, the terms,<br />

are imported into the general language ontology. A relatively small number of instances<br />

of plug-in relations (269) suffices to integrate 4662 ECOWN concepts into<br />

ItalWordNet (cf. [4]).<br />

ECOWN synsets are linked to a small domain ontology; 100 basic terms dominating<br />

relevant subhierarchies have been selected by experts on the basis of relevance and frequency<br />

of use. The following scenarios of correspondences between IWN synsets and<br />

ECOWN terms are discussed:



1. Overlapping concepts: generic terms from the economic domain which also play a<br />

role in general language;<br />

2. Overdifferentiation: a given ECOWN synset corresponds to more than one IWN<br />

concept, or an IWN synset corresponds to more than one ECOWN concept—these<br />

phenomena can be traced to different sense distinctions made by lexicographers vs.<br />

terminologists;<br />

3. Gaps: for terms which have no general language counterpart, a suitable hypernym<br />

in the generic resource is selected.<br />

The first scenario is captured by plug-in synonymy for overlapping synsets in IWN<br />

and ECOWN. A new plug-synset is created which replaces the corresponding IWN<br />

and ECOWN synsets in the integrated resource. This plug-synset takes its synonyms<br />

and hyponyms from the terminological resource and its hypernym from the generic<br />

resource. As a consequence, the terminological hypernym and the general language<br />

hyponyms are eclipsed.<br />

The case of overdifferentiation is dealt with by plug-in near-synonymy. A new<br />

plug-synset is created which also takes its hypernym from IWN and its synonyms<br />

and hyponyms from ECOWN.<br />

In order to bridge the gap between IWN synset and ECOWN synset in the third<br />

scenario, plug-in hyponymy is applied. Two new plug-synsets are derived: one for the<br />

superordinate IWN synset (Plug-IWN) and one for the subordinate ECOWN synset<br />

(Plug-ECOWN). Plug-IWN takes its synonyms and hyponyms from IWN, and its hyponyms<br />

also include the Plug-ECOWN node. Plug-ECOWN relates to synonyms and<br />

hyponyms from ECOWN. Plug-ECOWN is assigned a new hypernym: Plug-IWN replaces<br />

the former hypernym from ECOWN.<br />

The integration process is realised in four steps that centre around the plug-in relations.<br />

Thus, plug-in can be seen as a dynamic device with regard to merging two resources.<br />

The procedure yields new concepts, the plug-in concepts. The status of these<br />

merged plug-in concepts remains unclear—whether they constitute new lexical items,<br />

new terms or artificial concepts.<br />

The plug-in approach has also been used and enhanced for Jur-WordNet [7] and<br />

ArchiWordNet [6], two domain-specific WordNet extensions. Jur-WordNet addresses<br />

theoretical considerations regarding common language versus expert language, and<br />

emphasises the citizens' perspective on law terms, applying more or less the original<br />

plug-in relations. For ArchiWordNet, several plug-in procedures (substitutive, integrative,<br />

hyponymic and inverse plug-ins) are developed to replace or rearrange MultiWordNet<br />

hierarchies and integrate them with ArchiWordNet hierarchies. Furthermore,<br />

synsets may be enriched with terminological features, synonyms may be added<br />

or deleted from synsets, and relations may be added or deleted for specific synsets.<br />

Within this merging process, a lot of manual work specific to the resources in question<br />

had to be done, an effort that may not carry over to any other pair of<br />

resources.<br />

The plug-in approach offers an attractive model for linking TermNet to GermaNet,<br />

as both resources are also WordNet-based and of different coverage and specificity<br />

with a significant number of overlapping concepts. We primarily focus on modelling<br />

the relationships between general language and domain-specific concepts, and we use<br />

the plug-in metaphor for the relational model, less for the integration process. Thus,<br />

from our linking procedure, no new plug-in concepts evolve as the outcome of merging<br />

general language synsets with terms. The original databases, GermaNet and



TermNet, remain unchanged, but are supplemented with the relational structure provided<br />

by the established plug-in links.<br />

As described in Section 4, in our OWL models, TermNet terms are modelled as<br />

classes and GermaNet synsets as individuals. Within OWL DL, a meta-class of term<br />

classes cannot be built, i.e. OWL classes cannot be declared to be OWL individuals<br />

without resorting to OWL Full. Thus, within OWL DL, the alignment can only be realised<br />

by restricting the range of a plug-in property to the individual that represents<br />

the corresponding GN synset. We distinguish three different linking scenarios between<br />

TermNet terms and GermaNet synsets:<br />

<owl:Class rdf:about="&tn;Term_Link"><br />
  <rdfs:subClassOf><br />
    <owl:Restriction><br />
      <owl:onProperty rdf:resource="&plg;attachedToNearSynonym"/><br />
      <owl:hasValue rdf:resource="&gn;Link"/><br />
    </owl:Restriction><br />
  </rdfs:subClassOf><br />
</owl:Class><br />

Listing 4: OWL code for a relation instance of attachedToNearSynonym<br />

1. Correspondence between a given TermNet term and a GermaNet synset, for example<br />

between the term tn:Term_Link and the GermaNet noun synset gn:Link. The<br />

corresponding object property plg:attachedToNearSynonym has tn:NounTerm as<br />

domain and gn:Synset as range. By using an owl:hasValue restriction over<br />

plg:attachedToNearSynonym, every individual of the class tn:Term_Link is assigned<br />

the individual gn:Link (see Listing 4). Since we do not assume pure synonymy<br />

for a corresponding term-synset pair, no synonymy link is established for<br />

plugging general language with domain language; the closest sense-relation being<br />

near-synonymy.<br />

2. A TermNet term cannot be assigned a corresponding GermaNet synset but is a<br />

subclass of another TermNet term which in turn is linked to a GermaNet synset by<br />

plg:attachedToNearSynonym. For instance, the term<br />

tn:Term_MonodirektionalerLink stands in a subclass relation with the term tn:Term_Link,<br />

which itself is linked to the GermaNet synset gn:Link by<br />

plg:attachedToNearSynonym. The property plg:attachedToGeneralConcept relates<br />

a term class like tn:Term_MonodirektionalerLink with a GermaNet synset which<br />

stands in a plg:attachedToNearSynonym relation with a superordinate term. Thus, a<br />

relation between indirectly linked concepts is made explicit and also serves to reduce<br />

the path length between semantically similar concepts for applications in<br />

which semantic distance measures are calculated. In this respect, we go beyond the<br />

scope of the plug-in approach which does not account for indirect links.<br />

3. A TermNet term cannot be assigned a corresponding GermaNet synset, and, furthermore,<br />

no suitable hypernym for linking the term is available in the GermaNet<br />

data. But the term can be linked to a holonym concept in GermaNet, via the plug-in<br />

relation plg:attachedToHolonym. For example, the TermNet term tn:Term_Anker



(meaning 'anchor', i.e. a part of a link in the domain of hypertext research) has no<br />

semantic counterpart in GermaNet, but can be linked to the superordinate holonym<br />

in GermaNet, the synset gn:Link, by a plg:attachedToHolonym relation. This plug-in<br />

relation is unique to our approach and has not been derived from the original<br />

model.<br />
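By analogy with Listing 4, the second linking scenario might be encoded along the following lines (a sketch; the hasValue restriction and the entity references for the tn:, gn:, and plg: namespaces are our assumptions):<br />

```xml
<!-- Sketch (scenario 2): indirect attachment to a general GermaNet concept -->
<owl:Class rdf:about="&tn;Term_MonodirektionalerLink">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="&plg;attachedToGeneralConcept"/>
      <owl:hasValue rdf:resource="&gn;Link"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
```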

Using the Protégé ontology editor and the reasoners RacerPro and Pellet, we encoded<br />

150 OWL restrictions representing plug-in relations for plugging terms into the<br />

synsets of the representative subset of GermaNet, 27 of which are<br />

plg:attachedToNearSynonym, 103 plg:attachedToGeneralConcept, and 20<br />

plg:attachedToHolonym. In the actual integration of resources, the owl:imports construct<br />

is applied to import both GermaNet and TermNet into the OWL file<br />

containing the plug-in specifications, while the original GermaNet and TermNet<br />

OWL ontologies remain unchanged and reside in their separate files. The integrated<br />

ontology is within OWL DL, and the reasoning software confirms its consistency.<br />
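The ontology header of the plug-in file might be sketched as follows (the file URIs are placeholders, not the actual locations of the resources):<br />

```xml
<!-- Sketch: plug-in ontology importing both source ontologies unchanged -->
<owl:Ontology rdf:about="">
  <owl:imports rdf:resource="http://example.org/germanet.owl"/>
  <owl:imports rdf:resource="http://example.org/termnet.owl"/>
</owl:Ontology>
```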

For identifying the necessary plug-in relation instances, we adapted the basic concepts<br />

identification and alignment steps specified for the integration procedure in [4],<br />

using a correspondence list of GermaNet synset and TermNet term pairs which was<br />

derived on the basis of string matching. The remaining 377 TermNet terms will be<br />

linked when the complete GermaNet is available in OWL.<br />

Applying this approach to integrating the residual TermNet terms or even further<br />

terminologies, we might possibly encounter terms without any corresponding hypernymic<br />

or holonymic concept in GermaNet. A complete alignment of both resources<br />

will yield the relevant number of instances regarding different plug-in relations<br />

and the number of concepts that cannot be linked by one of the three relations.<br />

The outcome will show whether the introduction of further types of plug-in relations<br />

is required.<br />

Since we decided to model TermNet terms as OWL classes and GermaNet synsets<br />

as OWL individuals, the inverse relations of the plug-in relations cannot be defined<br />

within OWL DL, i.e. with Synset as their domain and the meta-class of all terms as<br />

range. This would however be a desirable feature of the model, even if drawing inferences<br />

is possible without it.<br />

6 Conclusions and Future Work<br />

Recently, the discussion about interoperability of language resources, including lexical<br />

resources of all kinds, has gained momentum. Interoperability issues are, for example,<br />

the focus of the newly-launched EU-project CLARIN (Common Language Resources<br />

and Technology Infrastructure, cf. www.clarin.eu). Interoperability issues<br />

include the development of standards for various kinds of resources. For WordNets<br />

and similar resources, the Lexical Markup Framework (LMF, [23]) is of utmost importance.<br />

True interoperability, however, is more than imposing format standards on<br />

resources. It should pave the way to merging and combining resources in the context<br />

of an application, even if they do not adhere to a common format standard, a requirement<br />

which often cannot be met. The plug-in approach, as we present it here, shows<br />

how lexical resources can be merged by a set of relations, while the resources themselves<br />

are left untouched. We will demonstrate that our approach can be applied to<br />

other terminological resources and WordNets.



The next steps in our research are the automatic conversion of the complete GermaNet<br />

into the OWL model presented above, and a completion of the definition of<br />

plug-in relation instances needed to connect TermNet to it. We will also implement<br />

and process a test suite of queries to the integrated ontology that are typical of text-technological<br />

applications such as thematic chaining and discourse parsing (cf.<br />

[22, 23]), i.e. determining (transitive) hypernyms and calculating path lengths and semantic<br />

distances between synsets or units. In our approach, the merging of plug-in<br />

configurations and the pruning of the upper level of the specialised ontology as well<br />

as the lower level of the general ontology are deliberately shunned. Thus, if the effect<br />

of eclipse as described in [4] is desired, it will have to be produced by the query resolution<br />

procedure. However, we believe that this is the right place for it to go.<br />

Another aspect of our work is worth mentioning. The aforementioned conversions<br />

of Princeton WordNet into an OWL format [20, 3, 21] convert synsets into OWL individuals.<br />

This is surprising both from a lexicographical and a terminological point of<br />

view. Synsets are assumed to represent concepts that are lexicalised by the lexical<br />

units which a synset contains. The conversion of a synset into an OWL class seems<br />

therefore more natural. For instance, the concept dog represents a class of animals, of<br />

which e.g. Fido is an instance. Arguably a conversion of synsets into instances is due<br />

to restrictions of the OWL-DL formalism and in particular of the tools which process<br />

OWL-encoded data. Sanfilippo et al. [25], for instance, deem the modelling of a larger<br />

amount of synsets as classes “impractical for a real-world application.” Elsewhere we<br />

have reported about experiments with an alternative modelling of GermaNet, in which<br />

synsets as well as lexical units have been modelled as OWL classes (cf. [22] for details).<br />

We will therefore investigate how and with which consequences OWL classes such<br />

as the Synset class can be modelled as meta-classes, with the individual synsets being<br />

instances of this meta-class. Schreiber [26] pointed out the growing need for such an<br />

extension in the Semantic Web. Thus far, the definition of meta-classes was only possible<br />

within the dialect of OWL Full. Pan et al. [27] introduce a variant of OWL,<br />

called OWL FA, which provides a well-defined meta-modelling extension to OWL<br />

DL, preserving decidability. Still, the success of such an extension of OWL DL<br />

hinges on the availability of processing tools for this dialect of OWL. From our point<br />

of view, though, such an extension will facilitate linguistically more adequate representations<br />

of lexical-semantic and terminological resources. We will continue to investigate<br />

and to tap the potential of upcoming modelling standards.<br />
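For illustration, the meta-class modelling discussed above might be sketched as follows; note that typing a class as an instance of another class in this way leaves OWL DL and requires OWL Full (or an extension such as OWL FA):<br />

```xml
<!-- Sketch (OWL Full): the synset for 'dog' as both an instance of the
     meta-class Synset and a class with the individual Fido as its instance -->
<owl:Class rdf:ID="Synset"/>
<owl:Class rdf:ID="Dog">
  <rdf:type rdf:resource="#Synset"/>
</owl:Class>
<rdf:Description rdf:ID="Fido">
  <rdf:type rdf:resource="#Dog"/>
</rdf:Description>
```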

References<br />

1. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge,<br />

MA (1998)<br />

2. Kunze, C., Lemnitzer, L., Lüngen, H., Storrer, A.: Repräsentation und Verknüpfung<br />

allgemeinsprachlicher und terminologischer Wortnetze in OWL. Zeitschrift für<br />

Sprachwissenschaft 26(2) (2007)<br />

3. van Assem, M., Gangemi, A., Schreiber, G.: RDF/OWL Representation of WordNet. W3C<br />

Public Working Draft of 19 June 2006 of the Semantic Web Best Practices and Deployment<br />

Working Group. Online: http://www.w3.org/TR/wordnet-rdf/ (2006)<br />

4. Magnini, B., Speranza, M.: Merging Global and Specialized Linguistic Ontologies. In: Proceedings<br />

of Ontolex 2002, pp. 43–48. Las Palmas de Gran Canaria, Spain (2002)



5. Roventini, A., Alonge, A., Bertagna, F., Calzolari, N., Cancila, J., Girardi, C., Magnini, B.,<br />

Marinelli, R., Speranza, M., Zampolli, A.: ItalWordNet: Building a Large Semantic Database<br />

for the Automatic Treatment of the Italian Language. In: Zampolli, A., Calzolari, N.,<br />

Cignoni, L. (eds.) Computational Linguistics in Pisa, Special Issue of Linguistica<br />

Computazionale, Vol. XVIII-XIX. Istituto Editoriale e Poligrafico Internazionale, Pisa-<br />

Roma (2003)<br />

6. Bentivogli, L., Bocco, A., Pianta, E.: ArchiWordNet: Integrating WordNet with Domain-<br />

Specific Knowledge. In: Sojka, P. et al. (eds.) Proceedings of the Global WordNet Conference<br />

2004, pp. 39–46 (2004)<br />

7. Bertagna, F., Sagri, M.T., Tiscornia, D.: Jur-WordNet. In: Sojka, P. et al. (eds.) Proceedings<br />

of the Global WordNet Conference 2004, pp. 305–310 (2004)<br />

8. DeLuca, E.W., Nürnberger, A.: Converting EuroWordNet in OWL and extending it with<br />

domain ontologies. In: Proceedings of the GLDV-Workshop on Lexical-semantic and ontological<br />

resources, pp. 39–48 (2007)<br />

9. Vossen, P.: Extending, trimming and fusing WordNet for technical documents. In: Proceedings<br />

of NAACL-2001 Workshop on WordNet and other Lexical Resources Applications.<br />

Pittsburgh, USA (2001)<br />

10. Kunze, C.: Lexikalisch-semantische Wortnetze. In: Carstensen, K.-U. et al. (eds.)<br />

Computerlinguistik und Sprachtechnologie: Eine Einführung, pp. 386–393. Spektrum,<br />

Heidelberg (2001)<br />

11. Vossen, P.: EuroWordNet: A multilingual database with lexical-semantic networks. Kluwer<br />

Academic Publishers, Dordrecht (1999)<br />

12. Miller, G.A., Hristea, F.: WordNet Nouns: Classes and Instances. J. Computational Linguistics<br />

32(1) (2006)<br />

13. Lenz, E.A., Storrer, A.: Generating hypertext views to support selective reading. In:<br />

Proceedings of Digital Humanities, pp. 320–323. Paris (2006)<br />

14. Beißwenger, M., Storrer, A., Runte, M.: Modellierung eines Terminologienetzes für das<br />

automatische Linking auf der Grundlage von WordNet. In: LDV-Forum 19(1/2) (Special<br />

issue on GermaNet applications, edited by Claudia Kunze, Lothar Lemnitzer, Andreas Wagner),<br />

pp. 113–125 (2003)<br />

15. ISO 1986: International Organisation for Standardization. Documentation – Guidelines for<br />

the establishment and development of monolingual thesauri. ISO 2788-1986 (1986)<br />

16. ANSI/NISO: Guidelines for the construction, format and management of monolingual<br />

thesauri. ANSI/NISO z39.19-2003 (2003)<br />

17. Miles, A., Brickley, D. (eds.): SKOS Core Guide. W3C Working draft 2, November 2005.<br />

Online: http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102 (2005)<br />

18. Kuhlen, R.: Hypertext. Ein nicht-lineares Medium zwischen Buch und Wissensbank.<br />

Springer, Berlin (1998)<br />

19. Tochtermann, K.: Ein Modell für Hypermedia: Beschreibung und integrierte<br />

Formalisierung wesentlicher Hypermediakonzepte. Shaker, Aachen (1995)<br />

20. Ciorăscu, I., Ciorăscu, C., Stoffel, K.: Scalable Ontology Implementation Based on<br />

knOWLer. In: Proceedings of the 2nd International Semantic Web Conference (ISWC2003),<br />

Workshop on Practical and Scalable Semantic Systems. Sanibel Island, Florida (2003)<br />

21. van Assem, M., Menken, M.R., Schreiber, G., Wielemaker, J., Wielinga, B.: A Method for<br />

Converting Thesauri to RDF/OWL. In: Proceedings of the 3rd International Semantic Web<br />

Conference (ISWC 2004), Lecture Notes in Computer Science 3298 (2004)<br />

22. Lüngen, H., Storrer, A.: Domain ontologies and wordnets in OWL: Modelling options. In:<br />

OTT'06. Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from<br />

Structured Information. In: Publications of the Institute of Cognitive Science (PICS), vol. 1.<br />

University of Osnabrück (2007)<br />

23. Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., Soria, C.:<br />

Lexical Markup Framework (LMF) for NLP Multilingual Resources. In: Proceedings of the<br />

Workshop on Multilingual Language Resources and Interoperability, pp. 1–8. Sydney (2006)


296 Harald Lüngen, Claudia Kunze, Lothar Lemnitzer, and Angelika Storrer<br />

24. Cramer, I.M., Finthammer, M.: An Evaluation Procedure for Word Net Based Lexical<br />

Chaining: Methods and Issues. In this volume (2008)<br />

25. Sanfilippo, A., Tratz, S., Gregory, M., Chappell, A., Whitney, P., Posse, C., Paulson, P.,<br />

Baddeley, B., Hohimer, R., White, A.: Automating Ontological Annotation with WordNet.<br />

In: Proceedings of the 5th International Workshop on Knowledge Markup and Semantic<br />

Annotation (SemAnnot2005) located at the 4th Semantic Web Conference. Galway/Ireland<br />

(2005)<br />

26. Schreiber, G.: The Web is not Well-formed. In: IEEE Intelligent Systems 17(2), pp.<br />

79–80 (2002)<br />

27. Pan, J.Z., Horrocks, I., Schreiber, G.: OWL FA: A Metamodeling Extension of OWL DL.<br />

In: Proceedings of the Workshop OWL: Experiences and directions. Galway/Ireland (2005)<br />

28. Farrar, S.: Using Ontolinguistics for language description. In: Schalley, A. and Zaeferer, D.<br />

(eds.) Ontolinguistics: How Ontological Status Shapes the Linguistic Coding of Concepts.<br />

Mouton de Gruyter, Berlin (2007)<br />

29. Staab, S., Studer, R. (eds.): Handbook on Ontologies. International Handbooks on Information<br />

Systems. Springer, Heidelberg (2004)


The Possible Effects of Persian Light Verb<br />

Constructions on Persian WordNet<br />

Niloofar Mansoory and Mahmood Bijankhan<br />

University of Tehran, Linguistics Department<br />

nmansoory@gmail.com, mbjkhan@ut.ac.ir<br />

Abstract. This paper deals with a special class of Persian verbs, called complex<br />

verbs (CVs). The paper can be divided into two parts. The first part discusses<br />

the Persian verbal system, and is mainly concerned with the syntactic and<br />

semantic properties of Persian complex verbs or light verb constructions<br />

(LVCs). In the second part we discuss the possible effects of<br />

these syntactic and semantic properties on the Persian verb hierarchy and Persian<br />

WordNet.<br />

Keywords: complex verbs, simple verbs, verb hierarchy, Persian WordNet<br />

1 Introduction<br />

Persian complex verbs have been discussed by a number of authors with respect to<br />

their syntactic and semantic properties. Some scholars have studied the phenomenon<br />

within a syntax-based approach, considering Persian CVs as elements whose<br />

syntactic and semantic properties are determined post-syntactically rather<br />

than in the lexicon; these scholars include Karimi [1], Dabir-Moghadam [2], Folli<br />

[3], and Karimi-Doostan [4]. Some authors like Vahedi-Langrudi [5] have<br />

taken an intermediate approach. For example, Vahedi-Langrudi provides evidence for<br />

Persian CVs as V (lexical units) and V max (verb phrases). With reference to recent<br />

approaches in the Minimalist Program, he suggests that Persian CVs be placed both<br />

in the lexicon and the syntax. In addition to theoretical linguists, researchers in NLP<br />

systems have also been interested in Persian light verb constructions (LVCs) 1 . As<br />

multiword expressions, Persian LVCs pose many problems in processing and<br />

generating Persian, since they have both lexical and phrasal properties.<br />

Megerdoomian [6], [7] discusses the issue from a computational and NLP perspective<br />

and states that we will ignore the productive nature of CVs if we simply list them in<br />

the lexicon.<br />

This paper is concerned with the lexical and computational aspects of the issue and<br />

discusses the possible effects of the compositional and idiomatic aspects of Persian<br />

CVs on Persian WordNet. The paper is organized as follows: Section 2 explains the<br />

1 The terms complex verbs, light verb constructions, compound verbs, and complex predicates<br />

are used synonymously in the literature to refer to Persian complex verbs.



Persian verbal system. Section 3 deals with the semantic connections between the<br />

light verb (LV) and the nonverbal (NV) elements of Persian LVCs. In Section 4 we<br />

elaborate on the nature of these semantic connections, illustrating the semantic<br />

regularities in LV uses of zædæn (to hit). Finally, in Section 5 the possible effects of<br />

Persian LVCs on Persian WordNet are discussed.<br />

2 The Persian verbal system<br />

One of the significant characteristics of the Persian verbal system is its small number<br />

of simple verbs. Verbal concepts in Persian are mostly expressed by the combination<br />

of a nonverbal element (NV) 2 and a light verb, the result of which is traditionally<br />

called “compound verb”. This process is very productive and as a result, the number<br />

of simple verbs is less than 200, which is very small in comparison to languages like<br />

English. Some of the Persian simple verbs can also function as LVs in LVCs<br />

(complex verbs). Examples include zædæn (to hit), kærdæn (to do), xordæn (to eat),<br />

dadæn (to give), bordæn (to take), and others. The preverbal (or nonverbal) elements<br />

in such constructions range over a number of phrasal categories which are usually<br />

nouns and sometimes adjectives, adverbs, and prepositional phrases (Karimi [1];<br />

Karimi-Doostan [4]). Some examples are as follows:<br />

N + V<br />

A) pa zædæn (lit. leg hit) ‘to pedal’<br />

B) email zædæn 3 (lit. email hit) ‘to send an email’<br />

Adj + V<br />

C) lus kærdæn (lit. spoiled doing) ‘to spoil’<br />

D) agah kærdæn (lit. informed doing) ‘to inform’<br />

Adv + V<br />

E) birun ændaxtæn (lit. out throwing) ‘to fire’<br />

F) bala keshidæn (lit. up pulling) ‘to advance’<br />

PP + V<br />

G) æz dæst dadæn (lit. of hand giving) ‘to lose’<br />

H) be xater aværdæn (lit. to memory bringing) ‘to remember’<br />

The most conflicting characteristic of Persian CVs is that they have both word-like<br />

properties and phrasal properties. As Karimi [1] and Goldberg [8] suggest, Persian<br />

CVs have one single word stress like simple words. Meanwhile, they undergo<br />

derivational processes that are typically restricted to zero level categories.<br />

Nevertheless, the majority of them cannot be considered frozen lexical elements<br />

since their NV and LV can be separated by elements such as the future<br />

auxiliary, negation and inflectional affixes, and emphatic elements (Karimi [1];<br />

Müller [9]; Folli [3]).<br />

2 As Karimi [1] mentions, the nonverbal element in Persian LVCs is not restricted to native<br />

Persian elements, and includes borrowed words.



Considering these and other syntactic features of Persian CVs, most of the authors<br />

suggest that LV and NV elements in Persian LVCs are separately generated and<br />

combined in syntax and become semantically fused at a different, later level (Karimi<br />

[1]; Dabir-Moghaddam [2]). From the semantic point of view, LVCs are traditionally<br />

listed as separate entries in Persian dictionaries since their semantic properties are the<br />

same as simple verbs. But the problem is that simply listing Persian CVs in the<br />

lexicon and adopting a lexical approach cannot explain their phrase-like properties<br />

and the productivity of the process in this language. As Folli [3] suggests, Persian<br />

LVCs pose a more serious problem for lexicalist accounts in that it would essentially<br />

need to claim that Persian CVs are instances of idioms receiving a separate entry<br />

along with their syntactic structure. On the other hand, a nonlexicalist approach seems<br />

to have more capabilities in accounting for the syntactic freedom of NV elements and<br />

LVs in these constructions.<br />

In the existing literature, among the authors who have explained the phenomenon<br />

from the syntactic perspective, Dabir-Moghaddam [2] suggests that some Persian CVs<br />

are the result of syntactic incorporation. On the contrary, Karimi [1] maintains that<br />

Persian CVs cannot be the result of syntactic incorporation, suggesting that for a<br />

better description of such constructions, it would be more helpful to consider both<br />

their semantic and syntactic specifications. In the next section, we will discuss some<br />

semantic connections between the NV elements and LVs in Persian LVCs.<br />

3 Some Semantic Properties of Persian CVs<br />

Persian CVs do not yield a uniform interpretation: they may receive either an<br />

idiomatic or a compositional interpretation. Even from the compositional LVCs,<br />

we cannot extract a clear pattern showing that they are fully compositional. In other<br />

words, the meaning of the whole is not directly derived from the meaning of the parts.<br />

Consider the following examples:<br />

1) ræng zædæn (lit. paint hit) ‘to paint’<br />

roghæn zædæn (lit. oil hit) ‘to oil’<br />

2) hærf zædæn (lit. words hit) ‘to talk’<br />

færyad zædæn (lit. shout hit) ‘to shout’<br />

3) dæst zædæn (lit. hand hit) ‘to clap’<br />

pa zædæn (lit. leg hit) ‘to pedal’<br />

4) zæng zædæn (lit. ring hit) ‘to call’<br />

telefon zædæn (lit. phone hit) ‘to call’<br />

email zædæn (lit. email hit) ‘to send an email’<br />

fax zædæn (lit. fax hit) ‘to send a fax’<br />

5) lægæd zædæn (lit. foot hit) ‘to kick’<br />

sili zædæn (lit. hand’s flat hit) ‘to strike with the flat of the hand’<br />

chækosh zædæn (lit. hammer hit) ‘to hammer’



6) ja zædæn (lit. space hit) ‘to give up’<br />

pærse zædæn (lit. fooling hit) ‘to fool around’<br />

jush zædæn (lit. boiling hit) ‘to tense up’<br />

bærgh zædæn (lit. shining hit) ‘to shine’<br />

All the examples above are a sample of Persian CVs containing a noun and the LV<br />

zædæn. The meaning of the verb zædæn as a simple verb in Persian is “to hit”. As the<br />

examples suggest, this verb, apart from its full verbal usage, can occur in LVCs. In<br />

LV uses of zædæn, where it co-occurs with different PV elements, it conveys new<br />

meanings that are not directly related to its simple verbal meaning. In (1) it means<br />

coating; in (2) it denotes doing; in (3) movement; in (4) sending; and in (5) the<br />

meaning of the LV zædæn is identical to its full verbal meaning. The most<br />

conflicting examples are those in (6), where it seems impossible to postulate a<br />

meaning for zædæn.<br />

The polysemous nature of the LV zædæn shows in the above examples: some<br />

constructions are semantically opaque and idiomatic (6), some are compositional (5)<br />

and some are semi-compositional (1-4). Classifying Persian CVs, Karimi [1] suggests<br />

that most Persian compositional CVs can be considered as idiomatically combining<br />

expressions whose idiomatic meaning is composed on the basis of the meaning of<br />

their parts [1]. In this sense, we may consider most of the above semi-compositional<br />

examples as idiomatically combining expressions. However, the problem is that it<br />

would be counter-intuitive to analyze each of the constructions as a separate lexical<br />

entry in the lexicon, since it is possible to extract certain patterns from them. It is<br />

certain that giving an elaborate and detailed picture of such patterns requires an<br />

examination of a large set of data and adoption of an approach that can explain their<br />

compositional and productive properties. In this paper we have attempted to show the<br />

existence of these productive patterns.<br />

4 The semantic patterns in LV uses of zædæn<br />

In (1) zædæn is combined with two nominal elements, namely ræng (paint) and<br />

roghæn (oil). The two nouns belong to the same semantic class; that is, coating (in<br />

WordNet 2.0, paint and oil are under the same synset {coat, coating}). There are other<br />

examples in which zædæn combines with nominal elements with the same semantic<br />

feature as ræng and roghæn 3 :<br />

7) kerem zædæn (lit. cream hit) ‘to cream’<br />

shampoo zædæn (lit. shampoo hit) ‘to shampoo’<br />

sabun zædæn (lit. soap hit) ‘to soap’<br />

In the examples above, the meaning of LV zædæn is {coat, cover} or {put on,<br />

apply}. This meaning is far removed from its full verb meaning (hit).<br />

3 It is noteworthy that in WordNet 2.0 coating is a kind of artifact. However, cream is not a<br />

coating but an instrumentation under the concept artifact. Shampoo and soap fall under<br />

neither of these concepts; they are kinds of substance.



Now consider examples (2). Both hærf (words or a brief statement) and færyad<br />

(outcry) are a kind of communication (WordNet 2.0 also classifies their English<br />

equivalents as a kind of communication). In this semantic space, the common<br />

properties of the nominal elements seem to activate a different meaning of zædæn;<br />

that is, doing. Other nouns with the same semantic attributes may also trigger the<br />

same meaning of zædæn. Examples are naære zædæn (to roar) and jigh zædæn (to<br />

scream). The nominal PV elements in these constructions are also a kind of auditory<br />

communication (similar to their English counterparts in WordNet 2.0). Examples (3)<br />

show another type of semi-compositional constructions in which the nominal element<br />

seems to trigger a new meaning of the LV zædæn; that is, movement. Both dæst<br />

(hand) and pa (leg) are external body parts. The meaning of the whole constructions<br />

in such cases indicates a repeated movement of the organ involved. A similar example<br />

is pelk zædæn (to move the eyelids). The nominal elements in (4), except for zæng (in<br />

its literal sense as ringing), are a kind of instrumentation used for communication 4 .<br />

Here, the meaning implied by the LV is communicating via, which is totally different<br />

from its full verb meaning. Among the nouns involved, zæng in the CV zæng zædæn<br />

(to call) can be interpreted as phone in its connotational sense (similar to the English<br />

verb ring meaning call). In this sense, it poses a problem for our analysis since it does<br />

not fall within the class of instrumentation as the other nominal elements in (4).<br />

Examples (5) illustrate a different pattern. Here, the light verb is synonymous to its<br />

full verb meaning (hit) and the preverbal nominal elements do not belong to the same<br />

semantic class. The meaning involved here is compositional in the sense that the<br />

meaning of the LVCs is semantically transparent and can be given by the sum of its<br />

parts.<br />

The examples in (6) illustrate a completely different class of LVCs in which the<br />

meaning is idiomatic. Here, the nominal elements, like those in (5), do not<br />

belong to a specific semantic domain as they have no common semantic attribute. The<br />

most important property of these LVCs is that the meaning that the LV implies is not<br />

predictable, so that the meaning of the whole construction cannot be interpreted<br />

compositionally. Here, no semantic pattern emerges and as such the nominal elements<br />

involved cannot be classified under a specific semantic group (unlike those in 1–4).<br />

So it seems plausible to consider them as totally frozen expressions and suppose them<br />

to be stored in the lexicon as individual lexical entries.<br />

In this section, we illustrated the variation involved in the interpretation of Persian<br />

LVCs and CVs by studying some constructions resulting from the combination of the<br />

LV zædæn and a number of nominal PV elements. We revealed the existence of some<br />

semantic regularities in these constructions. Our analysis implies that the group of<br />

nominal elements with the same semantic property, when combined with the same LV,<br />

produces a group of LVCs in which the meaning of the whole can be interpreted<br />

compositionally (1-5). In these constructions, the meaning that the verbal element or<br />

LV implies is predictable with respect to the semantic regularities among PV<br />

elements.<br />
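
The regularity just described (same-class nominal elements triggering the same LV sense) can be pictured as a simple lookup. The following Python sketch is purely illustrative and not part of the paper; the noun classes and sense labels are hand-picked stand-ins for WordNet-style synsets.<br />

```python
# Illustrative only: hand-coded noun classes standing in for WordNet synsets.
NOUN_CLASS = {
    "ræng": "coating", "roghæn": "coating", "kerem": "coating",
    "hærf": "communication", "færyad": "communication",
    "dæst": "body part", "pa": "body part",
    "telefon": "instrumentation", "fax": "instrumentation",
}

# Sense triggered in the LV zædæn by each noun class (examples (1)-(4)).
LV_SENSE = {
    "coating": "apply/put on",
    "communication": "do/perform",
    "body part": "move repeatedly",
    "instrumentation": "communicate via",
}

def zaedaen_sense(noun):
    """Predict the sense of zædæn for a given preverbal noun; nouns outside
    the known classes get the full-verb reading (compositional, as in (5))
    or must be listed individually as idioms (as in (6))."""
    return LV_SENSE.get(NOUN_CLASS.get(noun), "hit (full verb) / idiomatic")

print(zaedaen_sense("ræng"))     # apply/put on
print(zaedaen_sense("telefon"))  # communicate via
print(zaedaen_sense("lægæd"))    # hit (full verb) / idiomatic
```

In a real Persian WordNet the class lookup would follow hypernym links rather than a hand-made table; the point is only that the LV sense is predictable from the semantic class of the nominal element.<br />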

When there is no regularity among nominal elements or nominal elements do not<br />

have common semantic attributes (5 and 6), two different groups of LVCs result: (1)<br />

4 WordNet 2.0 classifies telephone and fax as instrumentation, but puts email under the<br />

concept communication.



LVCs in which the meaning of the LV does not deviate from its full verb meaning<br />

and the interpretation of the CV takes place compositionally; (2) CVs in which the<br />

LV is semantically different from its full verb. This group of LVCs are idioms, and<br />

we cannot find any semantic pattern in them.<br />

In the next section we will discuss the possible effects of these findings on the<br />

structure of Persian WordNet.<br />

5 The Possible effects of Persian LVCs on Persian WordNet<br />

So far, no serious attempt has been made by Iran's governmental centers or<br />

universities to build a Persian WordNet. Some preliminary steps have been taken<br />

outside Iran (Keyvan et al. [10]); there is also research done inside Iran addressing the<br />

adjective hierarchy for Persian WordNet (Famiyan and Aghajaney [11]).<br />

The present paper provides a theoretical basis for adopting an efficient strategy to<br />

build a Persian WordNet. In order to decide between the Merge and Expand approaches (or<br />

a combination of the two), which are applied in constructing WordNets for dozens of<br />

languages around the world, we found it reasonable to concentrate first on the<br />

language-specific properties of Persian. One of the most conflicting issues in the<br />

Persian verbal system, as mentioned in section 2, is the specific characteristics of its<br />

LVCs. In order to construct the Persian WordNet, it was very important to answer a<br />

crucial question: should we consider all Persian CVs as frozen lexical elements and<br />

simply place them as individual lexical entries in the Persian verb hierarchy? If not, it<br />

would be necessary to find a way to illustrate the productive nature of some semantic<br />

patterns in Persian LVCs.<br />

Studying a group of Persian LVCs, a sample of which is presented in (3), we have<br />

classified the constructions under the three categories of compositional, semi-compositional, and idiomatic. According to this classification, we propose that in<br />

building the Persian verb hierarchy, in order to show the productive nature of the first two<br />

classes, we should connect each Persian LV with the specific class of nouns which<br />

trigger an embedded meaning in the LV, when combined with it. To do so, we can list<br />

every LV after its full verb meaning as an individual synset and define the meaning it<br />

implies in relation to the class of nouns triggering this meaning.<br />

For example, recall the properties of the verb zædæn. In Persian WordNet, after<br />

the synset which illustrates the full verb meaning of this verb, we can list other<br />

synsets defining other meanings for this verb as it is used as an LV. One meaning is to<br />

put on or apply (on a surface), when it joins with the nouns under the concept coating.<br />

Then, under the synset we can list the CVs constructed by this pattern or link the LV<br />

to the nouns involved. The other meaning would be doing, when it joins with the<br />

nouns under the concept communication. In this case, too, we can list the available<br />

CVs after the synset or link the synset to the nouns involved. A similar procedure can<br />

be followed for other meanings mentioned before.<br />

In the case of idiomatic LVCs, we have no option but to list them as individual<br />

lexical entries in Persian WordNet.<br />
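
As a rough illustration of the proposed entry structure (this sketch is ours, not the authors' implementation), each light verb could carry, after its full-verb synset, one synset per LV sense linked to the noun class that triggers it, while idiomatic CVs are stored as frozen individual entries:<br />

```python
from dataclasses import dataclass, field

@dataclass
class LVSense:
    gloss: str            # meaning the LV takes on, e.g. "apply, put on"
    trigger_class: str    # noun class that triggers this sense
    examples: list = field(default_factory=list)

@dataclass
class LightVerbEntry:
    lemma: str
    full_verb_gloss: str
    lv_senses: list = field(default_factory=list)
    idioms: dict = field(default_factory=dict)   # frozen CVs listed separately

zaedaen = LightVerbEntry(
    lemma="zædæn",
    full_verb_gloss="to hit",
    lv_senses=[
        LVSense("apply, put on", "coating", ["ræng zædæn", "roghæn zædæn"]),
        LVSense("do, perform", "communication", ["hærf zædæn", "færyad zædæn"]),
        LVSense("move repeatedly", "external body part", ["dæst zædæn", "pa zædæn"]),
        LVSense("communicate via", "instrumentation", ["telefon zædæn", "fax zædæn"]),
    ],
    idioms={"ja zædæn": "to give up", "bærgh zædæn": "to shine"},
)

def senses_for(entry, noun_class):
    """Return the LV glosses triggered by a given noun class."""
    return [s.gloss for s in entry.lv_senses if s.trigger_class == noun_class]

print(senses_for(zaedaen, "coating"))  # ['apply, put on']
```

Listing the available CVs under each sense, or linking the sense to the noun synsets themselves, are the two options mentioned above; either way the productive patterns stay visible instead of being flattened into frozen entries.<br />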

This approach is comprehensive from both theoretical and practical perspectives.<br />

First, it attempts to illustrate the existing semantic regularities that contribute to the



productivity of Persian LVCs and to predict the possible LVCs in this language. Second,<br />

because in the process of building Persian WordNet we have to work on a large<br />

amount of data, a comprehensive classification of Persian CVs will be done.<br />

Moreover, from the practical point of view, Persian WordNet will not be a mere copy<br />

of PWN and will present the features and properties specific to Persian. In this way,<br />

apart from the verb hierarchy, it is possible to have some language-specific<br />

classifications of nominal concepts which can be categorized into different groups<br />

with respect to their relations with the verbal concepts in the process of constructing<br />

LVCs.<br />

6 Conclusion<br />

Building a WordNet for Persian requires a comprehensive study of this language. In<br />

this paper we discussed one of the properties of this language, namely LVCs. We<br />

introduced a classification of LVCs and proposed a new method of listing Persian CVs<br />

in an electronic database serving as Persian WordNet.<br />

References<br />

1. Karimi, S.: Persian Complex Verbs: Idiomatic or Compositional. J. Lexicology 3, 273–318<br />

(1997)<br />

2. Dabir-Moghaddam, M.: Compound Verbs in Persian. J. Studies in the Linguistic Sciences 27,<br />

25–59 (1997)<br />

3. Folli, R., Harley, H., Karimi, S.: Determinants of the event type in Persian Complex<br />

Predicates. In: Astruc, L., Richards, M. (eds.) Cambridge occasional papers in Linguistics,<br />

pp. 100–120 (2003)<br />

4. Karimi-Doostan, G.: Light Verbs and Structural Case. J. Lingua 115, 1737–1756 (2004)<br />

5. Vahedi-Langrudi, M.: The syntax, semantics and argument structure of complex predicates<br />

in modern Farsi. PhD dissertation. University of Ottawa (1996)<br />

6. Megerdoomian, K.: Event Structure and Complex Predicates in Persian. J. Canadian Journal<br />

of Linguistics 46, 97–125 (2001)<br />

7. Megerdoomian, K.: A Semantic Template for Light Verb Constructions. In: The 1st<br />

Workshop on Persian Language and Computer, pp. 99–106 (2004)<br />

8. Goldberg, A.: Words by default: The Persian complex predicate construction. In: Francis,<br />

E.J., Michaelis, L.A. (eds.) Mismatch. Center for the Study of Language and Information,<br />

Stanford, pp. 117–146 (2003)<br />

9. Müller, S.: Persian Complex Predicates. In: Proceedings of the 13th International Conference on<br />

Head-Driven Phrase Structure Grammar, pp. 247–267 (2003)<br />

10. Keyvan, F., Borjian, H., Kasheff, M., Fellbaum, C.: Developing PersiaNet: The Persian<br />

WordNet. In: Proceedings of the 3rd Global WordNet Conference, pp. 315–318 (2006)<br />

11. Famiyan, A., Aghajaney, D.: Towards Building a WordNet for Persian Adjectives. In:<br />

Proceedings of the 3rd Global WordNet Conference, pp. 307–308 (2006)<br />

12. Miller, G.: WordNet 2.0. http://wordnet.princeton.edu (2003)


Towards a Morphodynamic WordNet<br />

of the Lexical Meaning<br />

Nazaire Mbame<br />

LRL, UBP Clermont 2, France<br />

mbame@LRL.univ-bpclermont.fr<br />

Abstract. We aim at conceiving a new form of semantic organisation that could<br />

help to build in the future what we entitled: Morphodynamic WordNet of a<br />

language like English. This new form of semantic organisation is<br />

consociationist and gestaltist, in contrast with the associationist one that<br />

still structures today's dictionaries. To illustrate, we are going to take<br />

the example of the lexical root “trench-” that the lexical items to trench,<br />

trenching, trencher, trenchspade, trenchcoat, trenchweapon, trenchknife, etc.<br />

repeat in their morphology. After the study of its semantic morphogenesis, we<br />

will draw up the corresponding schematic organisation that needs to be<br />

computed and implemented in the frame of the Morphodynamic WordNet<br />

project.<br />

Keywords. Gestalt, Morphogenesis, categorization, Morphodynamic WordNet,<br />

semantic forms.<br />

1 Introduction<br />

By Morphodynamic WordNet, we intend a schematic representation in which lexical<br />

meanings and items derive one from another according to their natural causality. By<br />

semantic morphogenesis (or morphosemantic genesis) we intend the generation<br />

process of lexical meanings by particularisation, differentiation, and categorial<br />

transposition. We are going to present in this paper the general aspects of our<br />

Morphodynamic WordNet project. We will begin by presenting our theoretical<br />

framework and then offer a descriptive example of application.<br />

2 Theoretical Framework<br />

We derive our representational model from diverse sources. The first is the<br />

phenomenological and gestaltist theory of the concept as developed by Husserl,<br />

Gurwitsch, and Merleau-Ponty. According to this theory, the concept or noema comes<br />

from the sensible reality (objects and events) such as it is perceived and experienced<br />

through different points of view and facets. In the same way this reality is perceived and<br />

experienced, the corresponding noema or concept is organised. So,



there is a kind of structural isomorphism between the concept and the corresponding<br />

experienced reality.<br />

The other source of our inspiration is the catastrophe theory of R. Thom [1, 2],<br />

who, while talking about concept and signification, stipulated that the concept (and<br />

thus the signification) presents a “nucleus” around which we find a gestalt of its own<br />

deriving points of view, facets, forms, etc., which are its “presuppositions”. Between<br />

the “nucleus” and its “presuppositions”, we find structural genetic relationships. In<br />

the Theory of Semantic Forms of Cadiot and Visetti [3], in which the philosophical<br />

assumptions of phenomenology and gestalt theory are exploited for application in semantics,<br />

the same kind of functional organization of the concept reappears through the notions of<br />

“motive” and “profile”, the former denoting a kind of undifferentiated “nucleus” and<br />

the latter the different ways or forms by which this semantic nucleus appears to us, or<br />

can be perceived and intuited.<br />

Without being subject to semiotisation by acoustic and graphic signifiers,<br />

concepts would be useless in linguistics. We find the basis of this semiotisation in the<br />

materialist and sensualist philosophy of language, in the way Cassirer [4] presents it<br />

in the Philosophy of Symbolic Forms, vol. 1: Language.<br />

3 Morphogenesis of the lexical meaning: example of the lexical<br />

root “trench-”<br />

R. Thom [1, 2] defined morphogenesis as the generation (or destruction) of forms. We are<br />

going to see how this concept works in the semantic domain, especially in the<br />

generation of what we call semantic forms (uses, senses, etc.) of lexical items, which<br />

dictionaries just collect in the associationist way. A semantic form [3] is qualitative,<br />

individual and original by itself. Moreover, it is not isolated: it integrates a kind of<br />

system structured by genetic relationships. Morphogenesis of semantic forms can thus<br />

be defined as the generation process of lexical senses, meanings, uses, etc. It starts<br />

from a semantic nucleus semiotized by a lexical root, and yields semantic forms<br />

relating to this nucleus by proper reduction, particularization, differentiation and<br />

categorial transposition. The emerging semantic forms are often signalled by lexical<br />

items repeating, in their morphology, the lexical root of the nucleus. For example, to<br />

trench, trenching, trencher, trenchspade, trenchcoat, trenchweapon, trenchknife, etc.<br />

are repeating the lexical root “trench-”. This is the material proof of the genetic<br />

relationships these lexical items already share at the conceptual level. All these words<br />

are modifying the lexical root “trench-” and the corresponding nucleus in the frame<br />

of a dynamic process of categorization.<br />
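
The gestalt of items derived from a root can be sketched as a small network. The following Python fragment is a speculative illustration of ours; the nucleus gloss and the operation labels are assumptions, not the author's formalism:<br />

```python
# Speculative sketch: each derived item points back to the nucleus together
# with the generative operation assumed to have produced it.
TRENCH_NET = {
    "root": "trench-",
    "nucleus": "cut matter out to free a longitudinal opening",
    "derived": [
        ("to trench",   "categorial transposition (verb)"),
        ("trenching",   "particularisation (process noun)"),
        ("trencher",    "differentiation (agent/instrument)"),
        ("trenchspade", "differentiation (instrument compound)"),
        ("trenchcoat",  "differentiation (artifact compound)"),
    ],
}

def derived_by(net, operation):
    """List the items generated by a given kind of operation."""
    return [item for item, op in net["derived"] if operation in op]

print(derived_by(TRENCH_NET, "differentiation"))
# ['trencher', 'trenchspade', 'trenchcoat']
```

Unlike an associationist dictionary listing, the genetic link from each item back to the nucleus stays explicit, which is what the Morphodynamic WordNet is meant to compute over.<br />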

The semantic nucleus is phenomenological, descriptive and generic. Its nature<br />

should first be determined, before trying to see how its semantic morphogenesis (or<br />

morphosemantic genesis) works. As we said, the semantic forms (meanings, uses,<br />

senses...) generated by means of this morphogenesis are not isolated: they integrate a<br />

kind of gestalt, a system where they derive one from another in accordance with their<br />

natural causality.


306 Nazaire Mbame<br />

3.1 The semantic morphogenesis of the lexical root “trench-” and its schematic<br />

representation<br />

Let us take the example of the lexical notion “trench”. The free online dictionary<br />
lists for it at least 20 lexical entries and uses, for example: to trench, trenching, the<br />
trench, trencher, trenchspade, trenchcoat, trenchweapon, trenchknife, etc. Each of<br />
these lexical items is introduced individually with its definition(s), without showing<br />
the semantic structural relationships they all share around the lexical root “trench-”.<br />
In our perspective, this lexical root gives access to a proto semantic nucleus that is<br />
continuously categorized in different ways.<br />

So, what is this nucleus? A serious semiotic study could be done on it in the way<br />
Cassirer [4] suggests in his Philosophy of Symbolic Forms, 1: Language, with the<br />
help of Indo-European roots [5]. Nevertheless, pragmatically, in comparison with the<br />
verb to trench, this proto nucleus seems to express a virtual and undifferentiated<br />
idea of cutting into the matter and taking it out to free a longitudinal opening with<br />
plane vertical sides... Husserl [6] introduced the method of phenomenological<br />
reduction, which consists in investigating, in a descriptive way, the perceptive noema<br />
(the concept or linguistic signified) in order to determine its actual and potential<br />
forms, structures, facets and points of view of (re)presentation. This method is going<br />
to help us determine the semantic morphogenesis of the nucleus “trench-”. By<br />
means of morphogenesis (or conceptual categorization), semantic forms relating to<br />
this nucleus are generated. Afterwards, we will draw up the schematic representation<br />
of this morphogenesis, which has to be computed and implemented for the needs of<br />
the Morphodynamic WordNet project.<br />

The question that arises immediately concerns the difference between the<br />
nucleus expressed by the lexical root “trench-” and the semantic form expressed by<br />
the verb to trench. In fact, objectively, the verb to trench considers the nucleus of the<br />
lexical root “trench-” from the point of view of its temporal anchorage. Besides, we<br />
find the lexical item trenching (act of trenching), which binds this nucleus into spatial<br />
praxis. Space and time thus primitively particularize the lexical root “trench-”.<br />
In practice, the act of trenching - that is, the act of cutting into the matter and taking<br />
it out to free a longitudinal opening... - could be assimilated to the act of digging. This<br />
explains why, in some cases, to trench is defined as to dig alongside its other<br />
definition to cut. Digging implies cutting, and these actions are “sequential parts” of<br />
the act of trenching. Digging and cutting are thus semantic features of this global<br />
action that, in some pragmatic contexts, the verb to trench can promote and express.<br />

The action expressed by to trench sketches a pragmatic scenario in which we find<br />
interacting participants. The actantial reduction of this pragmatic scenario yields<br />
individual lexical concepts such as: trencher (man who trenches), trenching spade,<br />
trencher, etc. (instruments used in trenching). As we can see, the lexical item<br />
trencher already stabilizes two meanings: the meaning of the intentional person who<br />
trenches, and the meaning of the machine used in trenching (cutting, digging). These<br />
are the semantic poles of its polysemy.<br />

In praxis, the act of trenching needs to be direct, exact, calculated and strong. In<br />
linguistic usage, the lexical expression to be trenchant transposes these qualities to<br />
human beings in order to qualify them from these points of view. For a person, to be<br />
trenchant means to be direct, exact and strong in judging. Qualities relating primarily<br />
to the praxis of


Towards a Morphodynamic WordNet of the Lexical Meaning 307<br />

trenching are then transferred to human beings in order to point out some of their<br />
spiritual and intellectual abilities.<br />

This categorial transposition of to be trenchant is viewed as a kind of<br />
particularization of the nucleus “trench-”. Here, transposition means transferring<br />
properties of an object of a category A to an object of a category B by virtue of their<br />
similarities, analogy, etc. This vindicates Cassirer [4], who claimed that even the<br />
most abstract concepts are rooted in praxis, in sensible experience.<br />

From the expression to be trenchant derive, conceptually and morphologically, the<br />
adverb trenchantly (in a trenchant way) and the nominal trenchancy (the faculty of<br />
being trenchant). The related conceptual categorization goes along with the<br />
corresponding change of grammatical class. The principal outcome of the action of<br />
trenching is to free a longitudinal opening, which the lexical item the trench denotes<br />
in particular. At its proper level, this notion specifies itself into natural trench and<br />
artificial trench; the former denoting a trench created by natural catastrophes, and the<br />
latter a trench created by human beings or animals. Natural/artificial thus categorize<br />
the concept of trench, as defined above. Furthermore, at their own levels, natural<br />
trench / artificial trench can be subject to conceptual particularization. For example,<br />
we find natural trenches located on land and natural trenches located in the sea, the<br />
fact of being located on land or in the sea being the matter of the corresponding<br />
semantic refinement. The particularization process does not stop at these levels: it<br />
continues, because we find different kinds of natural land trench (the Atacama<br />
trench, for example) and different kinds of natural sea trench (the Japan trench, the<br />
Bougainville trench). Along with that, we have Trenchtown, a quarter of Kingston,<br />
Jamaica. This quarter bears the name trench because a natural land trench divides it.<br />

We started from the generic notion trench (in the sense of a longitudinal hole), and<br />
we observed how the process of its dynamic categorization works relative to its<br />
branching node natural trench. Relative to its other branching node, artificial trench,<br />
we have, for example, the categories draining trench (an artificial trench helping to<br />
drain water), military trench (an artificial trench dug for war), etc. In particular, we<br />
are going to focus on military trench, because its semantic morphogenesis is very<br />
wide and rich.<br />

3.2 Military trench and its semantic morphogenesis<br />

The adjective military (of military trench) differentiates the generic notion artificial<br />
trench by binding it to the military context. This is the matter of its categorization. In<br />
English, many lexical items formed with the lexical root “trench-” are semantically<br />
motivated in this context. For example, slit trench categorizes military trench by the<br />
detail of its practical function of allowing the evacuation of soldiers. A slit trench is<br />
therefore a kind of military trench with specific properties and function.<br />

During their stay in military trenches, soldiers were exposed to climatic<br />
disturbances (snow, rain, cold...), and they needed adequate equipment for their<br />
protection. Sometimes, they could also contract diseases relating to their environment<br />
and living conditions. In this regard, we find expressions like trench cap and<br />
trenchcoat, originally signifying the cap or coat that soldiers used to wear in trenches<br />
to protect themselves from rain, cold, etc. Nowadays, trenchcoat is totally free



from the proto military context, because it generally signifies a kind of overcoat that<br />
common people can wear to protect themselves from rain and cold. The original<br />
properties and function of this equipment are kept. We also find trench fever and<br />
trench foot, denoting specific illnesses that soldiers could contract during their stay<br />
in military trenches. We thus consider trench cap, trench coat, trench fever and<br />
trench foot as semantic forms relating to the generic notion military trench. They<br />
make evident and semiotize some of its potential points of view and facets. They are<br />
its immediate deriving semantic forms. Films where actors wear trenchcoats (or<br />
overcoats) are called trenchcoat films. We also find trenchcoat mafia, referring to a<br />
gang of youngsters responsible for the Columbine killings in the USA. This gang<br />
earned the qualification trenchcoat because its members used to wear black<br />
overcoats. So trenchcoat films and trenchcoat mafia particularize the generic notion<br />
trenchcoat according to the contexts of its application. They are its deriving<br />
categorial semantic forms.<br />

As we said, the notion military trench gives access to a military context full of<br />
local particularizing points of view. For example, when a soldier died, he was buried<br />
in the trench in a manner called trench interment. In trenches, soldiers used codes<br />
named trench codes. Furthermore, the notion of trench warfare actualizes in praxis<br />
the military context that is just latent in military trench. Conceptually, trench<br />
interment, trench codes and trench warfare also derive from the generic notion<br />
military trench. They are its categorial semantic forms.<br />

When there is war, soldiers use arms and fighting strategies. The notions trench<br />
fire, trench raiding and trench weapons apprehend the generic notion trench warfare<br />
from the points of view of the arms used in it, of its fighting strategies, etc. They are<br />
consequently its deriving semantic forms. At its proper level, trench weapon<br />
categorizes itself into trench mortar, trench knife, trench gun, etc., which are its<br />
kinds. Dictionaries also mention trench war, a video game reproducing trench<br />
warfare. Here, the concept of trench warfare is considered from the point of view of<br />
its possible reproduction as a video game. In relation to trench war, we find the<br />
expression trench trophy, designating the trophy that the winner of a trench war<br />
video game obtains.<br />

In the context of actual trench warfare, it could occur that, after an offensive that<br />
initially pulled soldiers out of their trenches, they moved back to these trenches for<br />
self-protection. This potential strategic retreat of soldiers is what the verb to entrench<br />
brings into evidence in the context of trench warfare. From this verb derives<br />
conceptually the nominal entrenchment. And by generalization of its meaning, we<br />
find some uses of to entrench in which it only means the act of withdrawing from an<br />
offensive position and hiding behind a protection that is no longer just a trench, but<br />
could also be a wall, for example.<br />

In dictionaries, we find other usages of to entrench in which it promotes various<br />
ideas of digging, occupying a trench, fortifying, securing, etc. All these ideas<br />
particularize the verb to entrench according to points of view relating to its<br />
contextual applications and categorial transpositions. They are its deriving categorial<br />
semantic forms. For the rest, dictionaries also list expressions like trench Schottky<br />
barrier, trencher friend, trench effect, trencher cap, trench mouth, etc. These notions<br />
categorize the nucleus “trench-” directly, or indirectly through some of its<br />
morphosemantic categories. Additional studies will be made to determine the points



of view they promote. After the complete study of the morphosemantic genesis of<br />

“trench-”, we will then be able to reproduce its complete schematic representation as<br />

the following figure sketches it:<br />

[Figure: schematic representation of the morphosemantic genesis of the nucleus<br />
“trench-”, with the nucleus at the centre and derived semantic forms branching out:<br />
to trench, trenching, trenching spade, trencher (man / instrument), the trench (natural<br />
sea opening, natural land opening, artificial opening, castle fortification), Atacama<br />
trench, Japan trench, Bougainville trench, Trenchtown (quarter), trenchcoat, trench<br />
cap, trenchcoat films, military trench, slit trench, trench codes, trench interment,<br />
trench warfare (actual war), trench fire, trench wars, trench drain, to entrench,<br />
entrenchment, to retrench, to be trenchant, trenchantly, trenchancy, trench fever,<br />
trench foot.]
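The derivation structure that the figure sketches lends itself to a simple computational representation. The following Python sketch is our own illustration, not part of the paper's formalism: the class, the node names and the operation labels are assumptions. It stores the nucleus as the root of a graph whose edges record the categorizing operation that generates each semantic form:<br />

```python
# A minimal rooted derivation graph for the nucleus "trench-".
# Each edge records the generating operation (particularization,
# categorial transposition, actantial reduction, ...).
from collections import defaultdict

class MorphoGraph:
    def __init__(self, root):
        self.root = root
        self.edges = defaultdict(list)  # parent -> [(child, operation)]

    def derive(self, parent, child, operation):
        self.edges[parent].append((child, operation))

    def forms(self, node):
        """All semantic forms reachable from `node` (depth-first)."""
        out = []
        for child, _ in self.edges[node]:
            out.append(child)
            out.extend(self.forms(child))
        return out

g = MorphoGraph('trench-')
g.derive('trench-', 'to trench', 'temporal anchorage')
g.derive('trench-', 'trenching', 'spatial praxis')
g.derive('to trench', 'trencher', 'actantial reduction')
g.derive('trench-', 'the trench', 'outcome particularization')
g.derive('the trench', 'military trench', 'contextual differentiation')
g.derive('military trench', 'trenchcoat', 'particularization')
g.derive('trenchcoat', 'trenchcoat films', 'contextual application')

print(g.forms('military trench'))  # ['trenchcoat', 'trenchcoat films']
```

Walking the graph from any node then enumerates exactly the "deriving semantic forms" discussed in the text.<br />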



4 Conclusion<br />

As shown above, the lexical root “trench-” is the genetic patrimony of all the lexical<br />
items which repeat and modify it at the conceptual and morphological levels. As<br />
stipulated by R. Thom [1, 2], semantics is thus a kind of genetics. The above<br />
schematic representation should be completed, computed and implemented within<br />
the project of creating a Morphodynamic WordNet of English. The aim of this<br />
Morphodynamic WordNet is to reproduce schematically the morphosemantic<br />
derivations of lexical meanings, uses, etc. in their natural way of processing. It should<br />
also specify the semantic links and catastrophes which cause and support these<br />
derivations. Moreover, the stable morphosemantic poles of this WordNet should be<br />
illustrated by specific linguistic examples concerning the individual meanings they<br />
fix. In this semantic domain, morphogenesis generates semantic forms by differential<br />
variations [7], starting from an undifferentiated semantic nucleus situated at the<br />
centre, which lexical roots usually signal. This semantic morphogenesis is extendable<br />
and limitless.<br />

References<br />

1. Thom, R.: Stabilité structurelle et morphogenèse. Essai d'une théorie générale des modèles.<br />
New York, Benjamin & Paris, Ediscience (1972)<br />
2. Thom, R.: Modèles Mathématiques de la Morphogenèse. Christian Bourgois (1980)<br />
3. Cadiot, P., Visetti, Y.-M.: Pour une théorie des formes sémantiques: motif/profil/thème.<br />
PUF, Paris (2001)<br />
4. Cassirer, E.: La philosophie des formes symboliques, 1 le Langage. Traduction française,<br />
Collection «le sens commun». Éditions de Minuit (1972)<br />
5. Köbler, G.: Indogermanisches Wörterbuch (3. Auflage),<br />
http://homepage.uibk.ac.at/~c30310/idgwbhin.html<br />
6. Husserl, E.: Idées directrices pour une Phénoménologie. Gallimard, Paris (1950)<br />
7. Petitot, J., Varela, F., Roy, J.-M., Pachoud, B.: Naturaliser la phénoménologie de Husserl.<br />
CNRS Éditions, Paris (2002)<br />
8. Cassirer, E.: Substance et fonction: éléments pour une théorie du concept. Éditions de<br />
Minuit, Paris (1977)<br />
9. Fellbaum, C.: WordNet, an Electronic Lexical Database. MIT Press (1998)<br />
10. Cruse, A.: Lexical Semantics. Cambridge University Press, Cambridge (1986)<br />
11. Gurwitsch, A.: Théorie du champ de la conscience. Desclée de Brouwer, Paris (1957)<br />
12. Husserl, E.: Recherches Logiques, Recherches III, IV et V. PUF, Paris (1972)<br />
13. Mbame, N.: Part-whole relations: their ontological, phenomenological and lexical<br />
semantic aspects. PhD thesis, UBP Clermont 2 (2006)<br />
14. Petitot, J.: Morphogenèse du sens. PUF, Paris (1985)<br />
15. Petitot, J.: Physique du Sens. CNRS, Paris (1992)<br />
16. Rosenthal, V., Visetti, Y.-M.: Sens et temps de la Gestalt. Intellectica 28, 147-227 (1999)<br />
17. Rosenthal, V.: Formes, sens et développement: quelques aperçus de la microgenèse.<br />
http://www.revue-texto.net/Inedits/Rosenthal/Rosenthal_Formes.html<br />
18. The Free Dictionary, http://www.thefreedictionary.com<br />
19. Visetti, Y.-M.: Anticipations linguistiques et phases du sens. In: Sock, R., Vaxelaire, B.<br />
(2004)


Methods and Results of the Hungarian<br />

WordNet Project 1<br />

Márton Miháltz 1, Csaba Hatvani 2, Judit Kuti 3, György Szarvas 4, János Csirik 4,<br />
Gábor Prószéky 1, and Tamás Váradi 3<br />
1 MorphoLogic, Orbánhegyi út 5, H-1126 Budapest<br />
{mihaltz, proszeky}@morphologic.hu<br />
2 University of Szeged, Department of Informatics, Árpád tér 2, H-6720 Szeged<br />
hacso@inf.u-szeged.hu<br />
3 Research Institute for Linguistics, Hungarian Academy of Sciences, Benczúr utca 33,<br />
H-1068 Budapest<br />
{kutij, varadi}@nytud.hu<br />
4 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and<br />
University of Szeged, Aradi vértanúk tere 1, H-6720 Szeged<br />
{szarvas, csirik}@inf.u-szeged.hu<br />

Abstract. This paper presents a complete outline of the results of the Hungarian<br />

WordNet (HuWN) project: the construction process of the general vocabulary<br />

Hungarian WordNet ontology, its validation and evaluation, the construction of<br />

a domain ontology of financial terms built on top of the general ontology, and<br />

two practical applications demonstrating the utilization of the ontology.<br />

1 Introduction<br />

This paper presents a complete outline of the results of the Hungarian WordNet<br />

(HuWN) project: the construction process of the general vocabulary Hungarian<br />

WordNet ontology, its validation and evaluation, the construction of a domain<br />

ontology of financial terms built on top of the general ontology, and two practical<br />

applications demonstrating the utilization of the ontology.<br />

The quantifiable results of the project may be summarized as follows. The<br />
Hungarian WordNet comprises over 40,000 synsets, of which 2,000 synsets form<br />
part of a business domain specific ontology. The proportion of the different<br />
parts of speech in the general ontology follows that observed in the Hungarian<br />
National Corpus and includes approximately 19,400 noun, 3,400 verb, 4,100<br />
adjective and 1,100 adverb synsets.<br />

In the following section, we describe our construction methodology in detail for<br />
the various parts of speech. In Section 3, we present our validation and evaluation<br />
methodology, and in the last section we present the information extraction and the<br />

1 The work presented was produced by the Research Institute for Linguistics of the Hungarian<br />
Academy of Sciences, the Department of Informatics, University of Szeged, and<br />
MorphoLogic in a 3-year project funded by the European Union ECOP program<br />
(GVOP-AKF-2004-3.1.1.)



word sense disambiguation corpus building applications that make use of the<br />

ontology.<br />

2 Ontology construction<br />

The development of the HuWN followed the methodology called the expand<br />
model by [8]. Although this general principle seemed applicable in the case of the<br />
nominal, adjectival and adverbial parts of our WordNet, naturally, some minor<br />
adjustments to language-specific needs were allowed as well. In the case of verbs,<br />
however, some major modifications were necessary. Due to the typological<br />
differences between English and Hungarian, some of the linguistic information that<br />
Hungarian verbs express through preverbs (related to aspect and aktionsart) called<br />
for a representation method for Hungarian verbs different from the one in PWN.<br />
This new representation, together with some other innovations in the adjectival part<br />
of the Hungarian WordNet, is described in detail in [2] and in a separate paper<br />
submitted to the Conference.<br />

A second principle we decided to comply with was so-called conceptual density,<br />
as defined by [6]. This means that if a nominal or verbal synset was selected for<br />
inclusion in the Hungarian ontology, all its ancestors were also added to the<br />
ontology. This way the resulting ontology is dense, in the sense that it does not<br />
contain conceptual gaps. This has the advantage that later extensions of the HuWN<br />
can be performed by further extending the important parts of the hierarchies, without<br />
the need for constant validation and searching for gaps in the upper levels.<br />
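The conceptual-density criterion can be enforced mechanically: whenever a synset is selected, the closure of its hypernym chain is selected too. A self-contained sketch (the toy hypernym table and synset identifiers are our own illustration, not HuWN data):<br />

```python
# Toy hypernym table: synset -> list of direct hypernyms.
HYPERNYMS = {
    'trench.n': ['depression.n'],
    'depression.n': ['geological_formation.n'],
    'geological_formation.n': ['object.n'],
    'object.n': ['entity.n'],
    'entity.n': [],
}

def with_ancestors(selected, hypernyms):
    """Return `selected` plus every ancestor, so the hierarchy has no gaps."""
    closed = set()
    stack = list(selected)
    while stack:
        s = stack.pop()
        if s not in closed:
            closed.add(s)
            stack.extend(hypernyms.get(s, []))
    return closed

print(sorted(with_ancestors({'trench.n'}, HYPERNYMS)))
```

Because every selected node drags its ancestors in, later extensions only ever grow the tree downward, as the paragraph above notes.<br />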

During the construction of the HuWN there were several work steps in which the<br />
use of monolingual resources was necessary: the Concise Hungarian Explanatory<br />
Dictionary (Magyar értelmező kéziszótár, EKSz) ([4]), a monolingual explanatory<br />
dictionary, the Hungarian National Corpus ([7]), and a subcategorisation frame table<br />
of the most frequent verbs in Hungarian, developed by the Research Institute for<br />
Linguistics of the Hungarian Academy of Sciences.<br />

The relation types that have been retained from the Princeton WordNet are hypo-<br />
and hypernymy, antonymy, meronymy (substance, member and part), attribute<br />
(be_in_state), pertainym, similar (similar_to), entailment (subevent), cause (causes)<br />
and also see (in the case of adjectives). Since the verbal relation indicating super-<br />
and subordination, called troponymy in PWN, is called hypernymy in the version<br />
imported into the VisDic WordNet-building tool we have used, we have adopted the<br />
latter name. Some new relation types were also introduced, partly because of<br />
language-specific phenomena (the relations within the nucleus structure have to be<br />
mentioned here) and partly for other, language-independent reasons: two new<br />
relations introduced in the adjectival HuWN, scalar middle and partitions, represent<br />
the latter type of new relations. These are described in detail in a separate paper<br />
submitted to the Conference.<br />

A primary concern when starting the ontology building was to provide a large<br />
overlap between the vocabulary covered by the Hungarian WordNet and other<br />
WordNets developed in recent years. Accordingly, we decided to take the BalkaNet<br />
Concept Set ([6]) (altogether 8,516 synsets) as a basis for the



expand model, and find a Hungarian equivalent for all its synsets, or state if the given<br />

meaning is non-lexicalised in Hungarian.<br />

2.1 Nouns<br />

2.1.1 Translation of the BCS and adding the LBC<br />

We first implemented the nominal part of the BalkaNet Concept Set (BCS sets 1, 2<br />

and 3 together), consisting of 5,896 Princeton WordNet 2.0 noun synsets.<br />

First, we applied several machine-translation heuristics, developed earlier ([3]) in<br />

order to get rough translations for as many literals as possible. This comprised about<br />

50% of all BCS synsets. These were then manually examined, corrected and extended<br />

with further synonyms using the VisDic editor. We also allowed for many-to-one and<br />

one-to-many mappings between the ILI and HuWN synsets. The BCS synsets that<br />

remained untranslated by automatic means were translated manually and processed in<br />

a similar way. The lexicographers also linked related entries from the EKSz<br />
dictionary to as many synsets as possible, and added definitions based on EKSz<br />
definitions.<br />

As a starting point, we adopted all the semantic relations among the synsets from<br />

PWN 2.0. After the translation of all the BCS synsets to Hungarian was complete, we<br />

manually checked all the adopted relations and modified the hierarchies according to<br />

specifics of Hungarian lexical semantics.<br />

Following the EuroWordNet methodology, we then added our Local Base<br />
Concepts (LBCs): synsets for basic-level and important Hungarian concepts not<br />
covered by the common core of the BCS. For this, we used a list of the most frequent<br />
nouns in the Hungarian National Corpus and of those used most frequently as genus<br />
terms in the definitions of the EKSz monolingual dictionary. For each of these, we<br />
identified the most frequent sense in the EKSz, then identified the subset for which<br />
no references were made in the Hungarian BCS. For these, we created 250 additional<br />
synsets, which constitute the local base concepts for Hungarian.<br />
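The candidate selection just described is essentially set arithmetic. A hedged sketch with toy data (the word lists are invented placeholders, not the actual HNC or EKSz material):<br />

```python
# Hypothetical Local Base Concept selection: frequent corpus nouns and
# frequent dictionary genus terms, minus senses already covered by the
# translated BCS (all data here is toy illustration).
frequent_nouns = {'ember', 'ház', 'víz', 'kutya'}       # from the HNC
genus_terms    = {'ember', 'víz', 'eszköz', 'kutya'}    # from EKSz definitions
covered_in_bcs = {'ember', 'víz'}                       # already referenced in BCS

lbc_candidates = (frequent_nouns | genus_terms) - covered_in_bcs
print(sorted(lbc_candidates))  # ['eszköz', 'ház', 'kutya']
```

Each surviving candidate would then get a new synset, as the paragraph above describes for the 250 actual LBCs.<br />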

2.1.2 Concentric extension based on the ILI<br />

After the creation of the concepts of the Base Concept Set and the Local Base<br />
Concepts, we decided to extend the Hungarian nominal WordNet concentrically,<br />
considering in several iterations the direct descendants of the ILI projection of the<br />
current Hungarian WordNet as candidates. This way, the conceptual density criterion<br />
was automatically satisfied during the expansion, and we added general concepts<br />
from the upper levels of the concept hierarchy (since we started with the Base<br />
Concept Set).<br />

Since upper-level synsets usually have more than one hyponym descendant, in<br />
each iteration we had to select the one to two thousand most promising candidates<br />
from the 30-40 thousand available. We used four, not necessarily concordant,<br />
characteristics for ranking:<br />

Translation: The concept candidate was preprocessable with automatic synset<br />

translation heuristics ([3]). This way the creation and correct insertion of the concept



to the Hungarian hierarchy was easier to carry out, as one or more literals of the<br />
original English synset were already available in Hungarian for the linguist expert.<br />

Frequency: The concept had high frequency in English corpora (British National<br />

Corpus, American National Corpus First Release, SemCor). This usually indicates<br />

that the concept itself appears frequently in communication and thus adding it to the<br />

WordNet under construction was sensible.<br />

Overlap with other languages: The candidate synset was conceptualized in<br />
WordNets for several languages besides English. This way we could maximize the<br />
overlap between Hungarian and foreign WordNets, which can be beneficial in<br />
multilingual applications like machine translation; furthermore, we could extend the<br />
ontology with concepts that many other research groups had found useful enough to<br />
add to their own WordNets.<br />

Number of relations: In the initial phases of the extension it made sense to take<br />

into account how many new synsets would become reachable by adding the one in<br />

question to the ontology. This way we could increase the number of candidates for<br />

later phases of the concentric extension.<br />
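The four characteristics above can be combined into a single ranking score. The following sketch is our own illustration of one plausible scheme: the field names, the normalization and the equal weights are all assumptions, since the paper does not state how the characteristics were combined:<br />

```python
# Hedged sketch: rank expansion candidates by a weighted sum of the
# four characteristics (field names and weights are illustrative).
def score(c, weights=(1.0, 1.0, 1.0, 1.0)):
    w_tr, w_fr, w_ov, w_rel = weights
    return (w_tr * c['auto_translatable']    # 1 if heuristics produced a draft
            + w_fr * c['corpus_freq']        # normalized English corpus frequency
            + w_ov * c['wordnet_overlap']    # share of other WordNets containing it
            + w_rel * c['new_relations'])    # normalized newly reachable synsets

candidates = [
    {'id': 'dog.n.01', 'auto_translatable': 1, 'corpus_freq': 0.9,
     'wordnet_overlap': 0.8, 'new_relations': 0.4},
    {'id': 'cur.n.01', 'auto_translatable': 0, 'corpus_freq': 0.1,
     'wordnet_overlap': 0.3, 'new_relations': 0.1},
]
ranked = sorted(candidates, key=score, reverse=True)
print([c['id'] for c in ranked])  # ['dog.n.01', 'cur.n.01']
```

In each iteration, the top one to two thousand candidates of such a ranking would be passed to the lexicographers.<br />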

2.1.3 Complete hierarchies for selected domains<br />

As an additional extension method, we chose several domains for which all of the<br />

synsets in all of the hyponym subtrees in Princeton WordNet 2.0 were implemented in<br />

Hungarian. We did this to try to reach maximum encyclopedic coverage of the<br />

following areas:<br />

• Geographic concepts and instances (countries, capitals and major cities,<br />

member states, geopolitical and other important regions, continents, names<br />

of important bodies of water, mountain peaks and islands)<br />

• Human languages and language families<br />

• Names of people<br />

• Monetary units of the world.<br />

We added 3,200 synsets based on these criteria.<br />

2.1.4 Domain synsets<br />

In order to enable the coding of domain relations for synsets to be implemented in the<br />

future, we translated all the PWN 2.0 category and region domain synsets. We also<br />

extended the set of region domain synsets with a collection of specific Hungarian<br />

region names.<br />

We decided to neglect the Princeton WordNet usage domain relationships because<br />
of several inconsistencies observed in PWN (e.g. in some cases the usage<br />
classification pertains to all literals in a synset, while in other cases it does not).<br />
Instead, we used a fixed list of our own usage codes, which could be applied<br />
individually to each literal using VisDic, providing a more flexible approach.<br />
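The design choice here, attaching usage codes to individual literals rather than to the synset as a whole, can be sketched as a data model (the codes, words and field names are illustrative, not the HuWN schema):<br />

```python
# Sketch: usage codes live on literals, not on the synset, so synonyms
# in one synset can differ in register (all values are illustrative).
synset = {
    'id': 'huw-0001',
    'literals': [
        {'word': 'kutya', 'usage': []},          # stylistically neutral
        {'word': 'eb',    'usage': ['formal']},  # code applies to this literal only
    ],
}

formal_words = [l['word'] for l in synset['literals'] if 'formal' in l['usage']]
print(formal_words)  # ['eb']
```

With a synset-level code, both synonyms would be forced into the same register, which is exactly the PWN inconsistency the paragraph above describes.<br />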

2.1.5 Proper names<br />

National WordNets contain a certain proportion of entity names among their<br />
nominal synsets. Among these are universal ones, like the world's countries and<br />
capitals or world-famous artists, scientists and politicians, and ones that are<br />
important for the particular nation or country.



We added a considerable number of named entities that were found most useful<br />
for the Hungarian WordNet, after the following processing steps:<br />

• Standardization (format and character encoding)<br />

• Selection (selection of categories to incorporate to the ontology and selection of<br />

instances for chosen categories)<br />

• Extension (we collected different transliterations, synonyms and paraphrases of the<br />

selected entities)<br />
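The three processing steps above can be sketched as a small pipeline. The entity records, the category filter and the variant lists below are entirely illustrative assumptions, not the project's actual data:<br />

```python
# Illustrative named-entity pipeline: standardize the raw names, select
# by category, then extend each entity with transliterations/synonyms.
raw_entities = [
    {'name': ' Budapest ', 'category': 'capital'},
    {'name': 'Petőfi Sándor', 'category': 'poet'},
    {'name': 'Duna ', 'category': 'river'},
]
KEEP = {'capital', 'river'}                  # categories to incorporate
VARIANTS = {'Duna': ['Danube', 'Donau']}     # collected transliterations

def pipeline(entities):
    out = []
    for e in entities:
        name = e['name'].strip()             # standardization (format cleanup)
        if e['category'] not in KEEP:        # selection of categories/instances
            continue
        out.append({'literals': [name] + VARIANTS.get(name, [])})  # extension
    return out

print(pipeline(raw_entities))
```

Each resulting literal list would then become the basis of one proper-name synset.<br />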

2.2 Verbs and adjectives<br />

In the case of verbs, after an initial phase of applying the expand method, it became<br />
obvious that the simple translation of English synsets, with the same hierarchical<br />
relations between them, would not result in a coherent Hungarian semantic network,<br />
even if local modifications were allowed. Consequently, we decided to make more<br />
extensive use of our monolingual resources, and tried to apply a methodology that<br />
would satisfy both the need for alignment with the standard WordNet (at least<br />
concerning the core vocabulary) and the need for a representation that does justice to<br />
the language-specific lexical characteristics of Hungarian.<br />

Lacking frequency data for verb senses, we started out from the frequency data of<br />
Hungarian verbal subcategorisation frames, which in Hungarian carry specific<br />
enough syntactic information to come close to determining sense frequency. We<br />
included all the senses of the 800 most frequent Hungarian verbal subcategorisation<br />
frames in the Hungarian WordNet and made sure they had English equivalents,<br />
while also allowing for approximate interlingual connections (the eq_near_synonym<br />
relation). If the equivalent of a Hungarian synset was found outside the range of the<br />
BCS, the criterion of conceptual density was followed in all cases.<br />

In order to achieve a more consistent hierarchy of HuWN, we decided that<br />

although the Hungarian synsets themselves should be connected to the PWN<br />

equivalent synsets, their internal structure should be developed independently of<br />
the English one.<br />

In the case of adjectives, the translation of the BCS synsets proved not to present<br />
such problems, and concerned only approx. 300 synsets. Given that these were all<br />
focal synsets of different descriptive adjective clusters, we followed the expansion<br />
method: we added the respective satellite synsets to the translated focal ones and,<br />
where necessary, added the antonym half-cluster as well. This work, however,<br />
included some minor adjustments, since the lexicalized antonym pairs and their<br />
satellite synsets are highly language-specific, which should be reflected in the<br />
ontology. Some further structural changes implemented in the adjectival WordNet<br />
concerned antonym clusters which were not centered around a bipolar scale, but<br />
which had three circular antonym relations ([1]).


316 Márton Miháltz et al.<br />

2.3 Adverbs<br />

Considering the ratio of the parts of speech observed in corpora, we decided to add<br />

about 1,000 adverbial synsets in addition to the synsets of the localized BCS, which did<br />

not contain any adverb synsets.<br />

Because of the lack of adverbial sense frequency data for Hungarian, we decided to<br />

translate the approximately 1,000 most frequent adverbial senses in PWN 2.0. In order to<br />

accomplish this, we first selected PWN synsets containing at least one literal that<br />

occurred at least once in that sense in the SemCor sense-tagged corpus. Next, we<br />

added up all the frequencies of all the surface forms of all the adverbs in the American<br />

National Corpus for each PWN 2.0 adverb synset, and selected synsets with a score of<br />

at least 1. The intersection of these two sets formed 1,013 adverbial synsets, which<br />

were automatically and manually translated and edited as outlined above.<br />
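The two-filter selection procedure described above can be sketched as a set intersection. The synset records and frequency tables below are invented for illustration and do not reproduce the project's actual data.

```python
# Sketch of the adverb synset selection described above: keep PWN 2.0 adverb
# synsets that (a) have at least one literal occurring in that sense in
# SemCor and (b) have a summed ANC surface-form frequency of at least 1.
# All ids and counts here are hypothetical.

def select_adverb_synsets(semcor_sense_counts, anc_surface_freqs, synsets):
    """synsets maps synset id -> list of (literal, surface forms)."""
    selected = []
    for sid, literals in synsets.items():
        in_semcor = any(semcor_sense_counts.get((sid, lit), 0) >= 1
                        for lit, _ in literals)
        anc_score = sum(anc_surface_freqs.get(form, 0)
                        for _, forms in literals for form in forms)
        if in_semcor and anc_score >= 1:
            selected.append(sid)
    return selected

# Toy example: only 'quickly' passes both filters.
synsets = {
    "adv-1": [("quickly", ["quickly", "quicker"])],
    "adv-2": [("archly", ["archly"])],
}
semcor = {("adv-1", "quickly"): 3}    # 'archly' never sense-tagged in SemCor
anc = {"quickly": 120, "quicker": 4}  # 'archly' absent from the ANC counts
print(select_adverb_synsets(semcor, anc, synsets))  # -> ['adv-1']
```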

We then carried out a number of revisions in order to adjust for Hungarian<br />

semantics and morphology:<br />

• Separated and added senses for adverbs that have both time and place meaning.<br />

• For adverbs of place, we identified the possible direction subgroups determined by<br />

case suffixes, and made each subgroup complete.<br />

• Merged PWN synsets that could be expressed by a single Hungarian adverb sense.<br />

2.4 The financial domain ontology<br />

Besides the construction of general purpose language ontologies, developing domain<br />

ontologies for specific terminologies is important, since the vocabularies of general<br />

language ontologies are rarely capable of covering the specific language of a special<br />

scientific or technical domain. The financial domain ontology connected to the<br />

general HuWN ontology served as a basis for an information extraction application,<br />

described in section 4.1.<br />

We used two different approaches to add domain-relevant terms to the Hungarian<br />

WordNet. First, we made use of the high coverage of Princeton WordNet. By manual<br />

inspection, we located 32 concepts in PWN that we found to contain relevant terms in<br />

the domains of economy, enterprise and commerce. We added the 1,200 synsets that<br />

are in the hyponym subtrees of these domain top concepts.<br />

As a second step, we examined a domain corpus consisting of short business<br />

articles and collected candidate domain-relevant terms from the text. Those that were<br />

not already in HuWN were added to the ontology as synsets, along with their synonyms.<br />

The following table summarizes the distribution of the domain terms observed in the<br />

corpus over the different parts of speech:<br />


Methods and Results of the Hungarian WordNet Project 317<br />

Table 1.<br />

POS        Terms<br />
Noun        2835<br />
Adjective    270<br />
Adverb         6<br />
Verb         181<br />
Overall     3292<br />

3. Validation and Evaluation<br />

3.1 Validation<br />

In the final phase of the project, we focused on merging the parts of the ontology<br />

developed at the different project sites and performing several integrity and<br />

consistency checks, following [5]. The majority of the most frequent and serious<br />

problems were automatically identifiable with simple scripts and were then corrected<br />

manually. These included structural problems like:<br />

• invalid sense ids<br />

• same synsets connected with holonym and hypernym relations<br />

• same synsets connected with similar to and near antonym relations<br />

• duplicate synset ids<br />

• duplicated relation between two synsets<br />

• invalid characters in a literal, definition or usage example (character encoding issues)<br />

• invalid relation types (mostly typos)<br />

• improper linking to the EKSZ monolingual explanatory dictionary<br />

• lexicalized (non-named entity) synset with empty or missing definition/usage example<br />

• mismatching part-of-speech tag and id suffix<br />

• Hungarian local synset with missing external relation<br />

• direct circles in hierarchical relations<br />

• duplicate literals in synsets<br />

• invalid relation (connected synset does not exist or has different POS than<br />

required)<br />

• the same definition is used in more than one synset<br />
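Two of the structural checks listed above can be sketched with simple scripts of the kind mentioned; the record layout used here is hypothetical and does not reflect the actual HuWN database schema.

```python
# Minimal sketch of two automatic integrity checks from the list above:
# duplicate synset ids, and direct circles in a hierarchical relation.
# The dict-based records are illustrative stand-ins for the real data.

def find_duplicate_ids(records):
    seen, dups = set(), set()
    for rec in records:
        sid = rec["id"]
        if sid in seen:
            dups.add(sid)
        seen.add(sid)
    return dups

def find_direct_circles(hypernym):
    """hypernym: synset id -> hypernym id; flag pairs A->B where also B->A."""
    return {tuple(sorted((a, b)))
            for a, b in hypernym.items()
            if hypernym.get(b) == a}

records = [{"id": "n-1"}, {"id": "n-2"}, {"id": "n-1"}]
hyp = {"n-1": "n-2", "n-2": "n-1", "n-3": "n-2"}
print(find_duplicate_ids(records))   # -> {'n-1'}
print(find_direct_circles(hyp))      # -> {('n-1', 'n-2')}
```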

We also checked some semantic inconsistencies that required manual inspection of<br />

the database by linguist experts, without major computer assistance:<br />

• central synsets of two adjective clusters connected with near antonym relation (we<br />

considered these improper uses of the near antonym relation and changed them to<br />

also see relations)<br />

• unreasonable sense distinctions: two synsets could be merged as they represented<br />

practically the same concept (here we collected synsets that shared several<br />

literals)<br />



3.2 Evaluation method<br />

In order to assess the relevance of synsets added to the Hungarian WordNet, we<br />

evaluated random samples from the whole WordNet, from the Base Concept Sets and<br />

from the whole hyponym trees we incorporated into the Hungarian ontology, and<br />

compared them to the synsets that received the highest rank during one of the<br />

concentric extension phases.<br />

The evaluation was performed in the following way:<br />

1. We generated a random sample of 200 synsets from the concepts we wanted<br />

to evaluate.<br />

2. Two native Hungarian speakers independently evaluated the importance of<br />

synsets according to their usefulness in a linguistic ontology. They had to<br />

assign a score ranging from 1 to 10 to each concept. The higher the value they<br />

assigned to the concept, the more relevant it was from their point of view. The<br />

agreement rate of the annotators, averaged over all the samples, was 78.67%<br />

(considering the agreement to be 100% in case they assigned the same value<br />

to the synset in question and 0% if the difference between their scores was<br />

maximal).<br />

3. We took the average of the scores assigned by the two linguists for each<br />

synset and then calculated the average and deviance of scores over the 200<br />

element samples.<br />
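The agreement measure described above can be sketched as follows; the scores are invented and the linear scaling between identical scores (100%) and the maximal difference (0%) is the interpretation stated in the parenthesis above.

```python
# Sketch of the inter-annotator agreement measure described above:
# 100% for identical scores, 0% for the maximal difference (9 on the
# 1-10 scale), linear in between, averaged over the sample.
# The score lists below are invented for illustration.

def agreement_rate(scores_a, scores_b, max_diff=9):
    per_synset = [1.0 - abs(a - b) / max_diff
                  for a, b in zip(scores_a, scores_b)]
    return 100.0 * sum(per_synset) / len(per_synset)

a = [7, 3, 10, 5]
b = [7, 6, 1, 5]
print(round(agreement_rate(a, b), 2))  # -> 66.67
```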

3.3 Results<br />

The columns of the following two tables represent the segments of the ontology from<br />

which we generated the 200-synset samples. These were:<br />

NONBCS: the set of English synsets that are not among the base concept sets.<br />

BCS1: 1st Base Concept Set<br />

BCS2: 2nd Base Concept Set<br />

BCS3: 3rd Base Concept Set<br />

CONC_1: a random sample of synsets added during the first concentric extension<br />

phase<br />

TREE: a random sample of synsets that were added during the extension of<br />

Hungarian WordNet by whole hyponym subtrees<br />

CONC_2_CAND: a random sample of the candidates for the second concentric<br />

extension phase<br />

LIT_FREQ: top ranked synsets from the candidates for the second extension<br />

phase using frequency-based ranking<br />

ILI_OVL: top ranked synsets from the candidates for the second extension phase<br />

according to the number of foreign WordNets they appear in (see Table 3)<br />



Table 2.<br />

         NONBCS  BCS1  BCS2  BCS3  CONC_1  TREE<br />
Mean       4.51  6.56  6.21  5.03    5.71  4.21<br />
Deviance   2.48  2.78  2.20  2.45    1.71  2.61<br />

Table 3.<br />

         CONC_2_CAND  LIT_FREQ  ILI_OVL<br />
Mean            4.25      5.26     8.32<br />
Deviance        2.27      1.74     1.25<br />

In summary, we conclude that it is worthwhile to construct evaluation heuristics for<br />

the selection of synset candidates with which to extend WordNets. Some heuristics clearly<br />

helped to incorporate more useful concepts into the ontology than adding synsets<br />

without considering their relevance.<br />
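The ILI_OVL heuristic defined above can be sketched as a simple ranking by coverage count; the ILI ids and the per-WordNet coverage sets are invented for illustration.

```python
# Sketch of the ILI_OVL heuristic: rank candidate synsets by the number
# of foreign WordNets whose inter-lingual index they appear in.
# The coverage sets below are hypothetical.

def rank_by_ili_overlap(candidates, wordnet_coverages):
    """wordnet_coverages: one set of covered ILI ids per foreign WordNet."""
    def score(ili):
        return sum(ili in cov for cov in wordnet_coverages)
    return sorted(candidates, key=score, reverse=True)

coverages = [{"ili-1", "ili-2"}, {"ili-1"}, {"ili-1", "ili-3"}]
print(rank_by_ili_overlap(["ili-2", "ili-3", "ili-1"], coverages))
# -> ['ili-1', 'ili-2', 'ili-3']
```

Because Python's `sorted` is stable, candidates with equal overlap keep their original (e.g. frequency-based) order, so the two heuristics compose naturally.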

4. Applications<br />

4.1 Information extraction<br />

Our information extraction engine was developed to identify the event type (such as<br />

sales, privatisation, litigation, etc.) and the participating entities (e.g., the seller, buyer<br />

and the price in a sale) expressed in short business news texts.<br />

We created so-called event frame descriptions manually after analyzing our<br />

collected business news corpus. Each frame description defines an event, and contains<br />

participants in specific roles that correspond to the main verb and its typical<br />

arguments. In the implementation of the IE engine, a parser first identifies the main<br />

syntactic constituents in the input text, and then it tries to match these to the elements<br />

of the candidate event frames. There are several kinds of constraints that have to be<br />

satisfied for a match. Lexical constraints can either be specified as strings, or as synset<br />

ids corresponding to hyponym subtrees of the HuWN ontology. Semantic constraints<br />

are expressed by so-called semantic meta-features, or basic semantic categories, such<br />

as “human”, “company”, “currency” etc. that are mapped to HuWN synsets and all<br />

their hyponyms. There are also syntactic and morphological constraints, which are<br />

checked against the output of the parser and the underlying morphological analyzer.<br />

Finally, the IE engine ranks the candidate event frame matches for the output<br />

according to the ratio of event participants matched.<br />
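The final ranking step can be sketched as follows; the frame structures and role sets are simplified, hypothetical stand-ins for the engine's actual event frame descriptions.

```python
# Sketch of the frame ranking described above: candidate event frames are
# ordered by the ratio of event participants matched. Frame names and
# roles are invented; constraint checking itself is abstracted away.

def rank_frames(frames, matched_roles):
    """frames: {frame name: set of role names};
    matched_roles: {frame name: set of roles the parser matched}."""
    def ratio(name):
        roles = frames[name]
        return len(matched_roles.get(name, set()) & roles) / len(roles)
    return sorted(frames, key=ratio, reverse=True)

frames = {
    "sale": {"seller", "buyer", "price"},
    "litigation": {"plaintiff", "defendant"},
}
matched = {"sale": {"seller", "buyer"}, "litigation": {"plaintiff"}}
print(rank_frames(frames, matched))  # -> ['sale', 'litigation']
```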

In this approach, the use of ontological categories allows for a simpler and more<br />

easily understood layout of the event frames. The main advantage of the use of synset ids<br />



and semantic types (as opposed to bare lexical listings) lies in the fact that the<br />

vocabulary of the IE engine can be easily customized and extended by adding new<br />

concepts to the ontology, without the need to modify the original event frame<br />

descriptions.<br />

4.2 Creating an annotated corpus for WSD<br />

In parallel with the construction of the ontology itself, we selected 39 words that had<br />

several commonly used senses and built a lexical sample word sense disambiguation<br />

corpus for Hungarian. This corpus is freely available for research and teaching<br />

purposes 2 and consists of 350-500 labeled examples for each polysemous lexical item.<br />

The sense tags were taken from the synset ids of the senses of the polysemous words<br />

in HuWN.<br />

The corpus follows the SensEval lexical sample format in order to ease its use for<br />

testing systems developed for other previous lexical sample datasets. The annotation<br />

was performed by two independent annotators. The initial annotation had an average<br />

inter-annotator agreement rate of 84.78%. Disagreements were later resolved by<br />

consensus of the two annotators and a third independent linguist. The most common<br />

sense covers 66.12% of the instances and an average of 4 further senses share the<br />

remaining percentage of the labeled examples.<br />
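The most-frequent-sense coverage figure quoted above can be computed as the share of labeled instances carrying the commonest sense tag; the tags below are invented and do not come from the actual corpus.

```python
# Sketch of the most-frequent-sense (MFS) coverage computation: the
# percentage of instances labeled with the single commonest sense tag.
# The tag list is hypothetical.
from collections import Counter

def mfs_coverage(sense_tags):
    counts = Counter(sense_tags)
    return 100.0 * counts.most_common(1)[0][1] / len(sense_tags)

tags = ["s1", "s1", "s2", "s1", "s3"]
print(mfs_coverage(tags))  # -> 60.0
```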

References<br />

1. Gyarmati, Á., A. Almási, D. Szauter: A melléknevek beillesztése a Magyar WordNetbe.<br />

[Inclusion of Adjectives into the Hungarian WordNet] In: Alexin Z., Csendes D. (ed.):<br />

MSZNY2006 - IV. Magyar Számítógépes Nyelvészeti Konferencia, SZTE, Szeged, pp. 117–<br />

126 (2006)<br />

2. Kuti, J., K. Varasdi, J. Cziczelszki, Á. Gyarmati, A. Nagy, M. Tóth, P. Vajda: Hungarian<br />

WordNet and representation of verbal event structure. To appear in Acta Cybernetica (2008)<br />

3. Miháltz, M., Prószéky, G.: Results and Evaluation of Hungarian Nominal WordNet v1.0. In:<br />

Proceedings of the Second International WordNet Conference (GWC 2004), Brno, Czech<br />

Republic, pp. 175–180 (2004)<br />

4. Pusztai, F. (ed.): Magyar értelmező kéziszótár. Budapest, Akadémiai Kiadó (1972)<br />

5. Smrz, P.: Quality Control and Checking for Wordnets Development: A Case Study of<br />

BalkaNet. Romanian Journal of Information Science and Technology, Special Issue 7(1–2)<br />

(2004)<br />

6. Tufiş, D., Cristea, D., Stamou, S.: BalkaNet: Aims, Methods, Results and Perspectives. A<br />

General Overview. Romanian Journal of Information Science and Technology, Special<br />

Issue 7(1–2) (2004)<br />

7. Váradi, T.: The Hungarian National Corpus. In: Proceedings of the Second International<br />

Conference on Language Resources and Evaluation, Las Palmas, pp. 385–389 (2002)<br />

8. Vossen, P. (ed.): EuroWordNet General Document. EuroWordNet (LE2-4003, LE4-8328),<br />

Part A, Final Document Deliverable D032D033/2D014 (1999)<br />

2 Please contact the authors for information about obtaining the WSD corpus.<br />


Synset Based Multilingual Dictionary:<br />

Insights, Applications and Challenges<br />

Rajat Kumar Mohanty 1 , Pushpak Bhattacharyya 1 , Shraddha Kalele 1 ,<br />

Prabhakar Pandey 1 , Aditya Sharma 1 , and Mitesh Kopra 1<br />

1 Department of Computer Science and Engineering<br />

Indian Institute of Technology Bombay, Mumbai - 400076, India<br />

{rkm, pb, shraddha, pande, adityas, miteshk}@cse.iitb.ac.in<br />

Abstract. In this paper, we report our effort at the standardization, design and<br />

partial implementation of a multilingual dictionary in the context of three large<br />

scale projects, viz., (i) Cross Lingual Information Retrieval, (ii) English to<br />

Indian Language Machine Translation, and (iii) Indian Language to Indian<br />

Language Machine Translation. These projects are large scale, because each<br />

project involves 8-10 partners spread across the length and breadth of India with<br />

great amount of language diversity. The dictionary is based not on words but on<br />

WordNet SYNSETS, i.e., concepts. Identical dictionary architecture is used for<br />

all the three projects, where source to target language transfer is initiated by<br />

concept to concept mapping. The whole dictionary can be looked upon as an M<br />

X N matrix where M is the number of synsets (rows) and N is the number of<br />

languages (columns). This architecture maps the lexeme(s) of one language, standing<br />

for a concept, with the lexeme(s) of other languages standing for the<br />

same concept. In actual usage, a preliminary WSD identifies the correct row for<br />

a word and then a lexical choice procedure identifies the correct target word<br />

from the corresponding synset. Currently the multilingual dictionary is being<br />

developed for 11 languages: English, Hindi, Bengali, Marathi, Punjabi, Urdu,<br />

Tamil, Kannada, Telugu, Malayalam and Oriya. Our work with this framework<br />

makes us aware of many benefits of this multilingual concept based scheme<br />

over language pair-wise dictionaries. The pivot synsets, with which all other<br />

languages link, come from Hindi. Interesting insights emerge and challenges are<br />

faced in dealing with linguistic and cultural diversities. Economy of<br />

representation is achieved on many fronts and at many levels. We have been<br />

eminently assisted by our long standing experience in building the WordNets of<br />

two major languages of India, viz., Hindi and Marathi which rank 5th (~500<br />

million) and 14th (~70 million) respectively in the world in terms of the number<br />

of people speaking these languages.<br />

Keywords: Multilingual Dictionary, Dictionary Standardization, Concept<br />

Based Dictionary, Light Weight WSD and Lexical Choice, Multilingual<br />

Dictionary Database


322 Rajat Kumar Mohanty et al.<br />

1 Introduction<br />

In any natural language application, dictionary look-up plays a vital role. We report a<br />

model for a multilingual dictionary in the context of large-scale natural language<br />

processing applications in the areas of Cross Lingual IR and Machine Translation.<br />

Unlike any conventional monolingual or bilingual dictionary, this model adopts the<br />

Concepts expressed as WordNet synsets as the pivot to link languages in a very<br />

concise and effective way. The paper also addresses the most fundamental question in<br />

any lexicographer’s mind, viz., how to maintain lexical knowledge, especially in a<br />

multilingual setup, with the best possible levels of simplicity and economy? The case<br />

study of multiple Indian languages with special attention to three languages belonging<br />

to two different language groups (Germanic and Indic) within the Indo-<br />

European family (English, Hindi and Marathi) throws light on various linguistic<br />

challenges in the process of dictionary development.<br />

The roadmap of the paper is as follows. Section 2 motivates the work. Section 3<br />

is on related work. The proposed synset based model for multilingual dictionary is<br />

presented in section 4. Section 5 is on how to tackle the problem of correct lexical<br />

choice on the target language side in an actual MT situation through a novel idea of<br />

word alignment. Linguistic challenges are discussed in Section 6. Creation, storage<br />

and maintenance of the multilingual dictionary is an involved task, and the<br />

computational framework for the same is described in section 7. Section 8 concludes<br />

the paper.<br />

2 Motivation<br />

Our mission is to develop a single multilingual dictionary for all Indic languages plus<br />

English in an effective way, economizing on time and effort. We first discuss the<br />

disadvantages of conventional language pair-wise dictionaries.<br />

2.1 Disadvantages of Conventional Bilingual Dictionaries<br />

In a typical bilingual dictionary, a word of L1 is taken to be a lexical entry and for<br />

each of its senses the corresponding words in L2 are given. It is possible that one sense<br />

of Wi in L1 is exactly the same as one of the senses of Wj in L1. This means that Wi and<br />

Wj are synonymous for a given sense. An example of this is dark and evil, where one<br />

of the senses of dark and evil overlaps as for example in dark deeds and evil deeds.<br />

This phenomenon is abundant in any natural language. In a conventional dictionary,<br />

there is no mechanism to relate Wi with Wj in L1, though they conceptually express the<br />

same meaning. In turn, the corresponding words for Wi and Wj in L2 are in no way related<br />

to each other though conceptually they are. That is a major drawback, because of<br />

which conventional pair-wise dictionaries cannot be used effectively in natural<br />

language applications, especially when multiple languages are involved.<br />

The other disadvantage of the conventional dictionary is the duplication of<br />

manual labor. If an MT system is to be developed involving n languages, n(n-1)<br />


Synset Based Multilingual Dictionary… 323<br />

language pair-wise dictionaries have to be created. For instance, if we consider 6<br />

languages, 30 bilingual dictionaries have to be constructed. Additionally, 15 perfect<br />

bilingual lexicographers will be required - by no means an easy condition to meet.<br />

Finally, the effort of incorporating semantic features in O(n^2) dictionaries is<br />

duplicated by n/2 lexicographers - a wastage of manual labor and time.<br />

3 Related Work<br />

Our model has been inspired by the need to efficiently and economically represent the<br />

lexical elements and their multilingual counterparts. The situation is analogous to<br />

EuroWordNet [1] and Balkanet [2] where synsets of multiple languages are linked<br />

among themselves and to the Princeton WordNet ([3], [4]) through Inter-lingual<br />

Indices (ILI). Our framework is similar, except for a crucial difference in the form of<br />

cross word linkages among synsets (explained in section 5). Another difference is that<br />

there are semantic and morpho-syntactic attributes attached to the concepts and their<br />

word constituents to facilitate MT. The Verbmobil project [5] for speech-to-speech<br />

multilingual MT had pair-wise linked lexicons. To the best of our knowledge, no<br />

major machine translation or CLIR project involving multiple large languages has<br />

ever used concept-based dictionaries.<br />

The framework has indeed been motivated by our creation of the Marathi<br />

WordNet [6] by transferring from the Hindi WordNet [7]. We noticed the ease of<br />

linking the concepts when two languages with close kinship were involved ([8], [9]).<br />

4 Proposed Model: Concept-based Multilingual Dictionary<br />

We propose a model for developing a single dictionary for n languages, in which there<br />

are linked concepts expressed as synsets and not as words. For each concept, semantic<br />

features, which are universal, are worked out only once. As for morpho-syntactic<br />

features, their incorporation will demand much less effort, if languages are grouped<br />

according to their families; in other words we can take advantage of the fact that close<br />

kinship languages share morpho-syntactic properties. Table 1 illustrates the<br />

concept-based dictionary model considering three languages from two different families.<br />



Table 1. Proposed multilingual dictionary model<br />

Concepts (Concept ID: concept description) | L1 (English) | L2 (Hindi) | L3 (Marathi)<br />

02038: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (सूयर्, सूरज, भानु, िदवाकर, भास्कर, भाकर, िदनकर, रिव, आिदत्य, िदनेश, सिवता, ुष्कर, िमिहर, अंशुमान, अंशुमाली) | (सूयर्, भानु, िदवाकर, भास्कर, भाकर, िदनकर, िम, िमिहर, रिव, िदनेश, अकर्, सिवता, गभिस्त, चंडांशु, िदनमणी)<br />

04321: a youthful male person | (male_child, boy) | (लड़का, बालक, बच्चा, छोकड़ा, छोरा, छोकरा, लौंडा) | (मुलगा, ोरगा, ोर, ोरगे)<br />

06234: a male human offspring | (son, boy) | (ु, बेटा, लड़का, लाल, सुत, बच्चा, नंदन, ूत, तनय, तनुज, आत्मज, बालक, कुमार, िचरंजीव, िचरंजी) | (मुलगा, ु, लेक, िचरंजीव, तनय)<br />

Given a row, the first column is the pivot for the n languages, describing a<br />

concept. Each concept is assigned a unique ID. The columns (2-4) show the<br />

appropriate words expressing the concepts in respective languages. To express the<br />

concept ‘04321: a youthful male person’, there are two lexical elements in English,<br />

which constitute a synset. There are seven words in Hindi which form the Hindi<br />

synset, and four words in Marathi which constitute the Marathi synset for the same<br />

concept, as illustrated in Table 1. The members of a particular synset are arranged in<br />

the order of their frequency of usage for the concept in question. The proposed model<br />

thus defines an M X N matrix as the multilingual dictionary, where each row expresses<br />

a concept and each column is for a particular language.<br />

4.1 Advantages of the concept-based multilingual dictionary<br />

(a) The first advantage of the proposed model is economy of labor and storage.<br />

Semantic features like [±Animate, ±Human, ±Masculine, etc.] are assigned to a<br />

nominal concept and not to any individual lexical item of any language. Similarly, the<br />

semantic features, such as [+Stative (e.g., know), +Activity (e.g., stroll),<br />

+Accomplishment (e.g., say), +Semelfactive (e.g., knock), +Achievement (e.g., win)]<br />

are assigned to a verbal concept. These semantic features are stored only once for<br />

each row and become applicable independent of any language. Consequently, lexical<br />

entries with highly enriched semantic features can be added to a dictionary for as<br />

many languages as required within a short span of time.<br />

(b) The dictionary developed in this approach also serves all purposes that either a<br />

monolingual or bilingual dictionary serves. A monolingual or bilingual dictionary can



automatically be generated from this concept-based multilingual dictionary. The<br />

quality of such monolingual or bilingual dictionaries is better than that of any<br />

conventional bilingual dictionary in terms of lexical features.<br />

(c) The model admits of the possibility of extracting a domain specific dictionary for<br />

all or any specific language pair. This is because the synsets or concepts pertaining to<br />

a domain can be selected from among the rows in the M X N concepts vs. languages<br />

matrix.<br />

(d) The language group which lacks competence in the pivot language (which in our<br />

case is Hindi) can benefit from the already worked out languages. It may be the case<br />

that the lexicographers of language L6 do not have enough competence in the pivot<br />

language Lpivot. They can look for a language Ln which they are comfortable with and<br />

use Ln as a pivot to link L6. This paves the way for the seamless integration of a new<br />

language into the multilingual dictionary.<br />

5 Word-Alignment in the Proposed Model<br />

In an actual MT situation, for every word or phrase in the source language a single<br />

word or phrase in the target language will have to be produced. The multilingual<br />

dictionary proposed by us links concepts which are sets of synonymous words. This is<br />

a major difference from the conventional bilingual dictionary in which a word (SW1)<br />

in the source language is typically mapped to one or more words in the target<br />

language depending upon the number of senses SW1 has. This implies that for each<br />

sense of SW1, there is a single target language word TW1. In our concept-based<br />

approach, even if we choose the right sense of a word in the source language (SW1),<br />

there is still the hurdle of choosing the appropriate target language word. This lexical<br />

choice is a function of complex parameters like situational aptness and native speaker<br />

acceptability. For example, the concept of ‘the state of having no doubt of something’<br />

is expressed through the Hindi synset having six members (िनश्शंक, अनाशंिकत, आशंकाहीन,<br />

बेखटक, बेिफ़, संशयहीन) and through the Marathi synset having four members (िनःशंक,<br />

िनधार्स्त, िनात, शंकारिहत). However, the third member in the Hindi synset आशंकाहीन is<br />

appropriately mapped to the fourth member in the Marathi synset शंकारिहत. Though the<br />

mapping of the third member in the Hindi synset (i.e., आशंकाहीन) with the first member<br />

of the Marathi synset (i.e., िनःशंक) expresses the same meaning, this substitution<br />

sounds quite unnatural to the native speakers.<br />

We tackle the problem of correct lexical choice on the target language side by<br />

proposing a novel approach of word-alignment across the synsets of languages.<br />

Word-alignment refers to the mapping of each member of a synset with the most appropriate<br />

member of the synset of another language. For instance, when the word लड़का ‘boy’ in<br />

Hindi in the sense of ‘a young male person’ needs to be lexically transferred to<br />

Marathi, there are four choices available in the synset, as illustrated in Figure 1.



Marathi Synset: मुलगा /HW1, ोरगा /HW6, ोर /HW2, ोरगे /HW6<br />

Hindi Synset: लड़का /HW1, बालक /HW2, बच्चा /HW3, छोकड़ा /HW4, छोरा /HW5, छोकरा /HW6, लौंडा /HW7<br />

English Synset: male-child /HW1, boy /HW2<br />

Fig. 1. Illustration of aligned synset members for the concept: a youthful male person<br />

Considering Hindi as the pivot, we propose that each of the four words in the Marathi<br />

synset be linked to the appropriate Hindi word in the direction Marathi→Hindi and<br />

each of the two words in the English synset has to be linked with the appropriate Hindi<br />

word in the direction English→Hindi. As a result, the first and the third member of<br />

the Marathi synset (i.e., मुलगा and ोर) are mapped to two different Hindi words (i.e.,<br />

मुलगा→लड़का, ोर→बच्चा). The second and the fourth member in the Marathi synset are<br />

linked to one word (i.e., ोरगा→छोकरा and ोरगे→छोकरा) in the Hindi synset. Three words<br />

in the Hindi synset (i.e., HW4, HW5, HW7) are left without being linked, as shown in<br />

Figure 1. In a situation when a Marathi word is aligned with a single Hindi word<br />

(e.g., मुलगा→लड़का) for a particular concept in the direction Marathi→Hindi, from<br />

our past experience we assume that the lexical transfer in the reverse direction<br />

(Hindi→Marathi) also holds good, yielding लड़का→मुलगा.<br />

Following this strategy of alignment of synset members of Marathi (or any other<br />

language) with the synset members of the pivot (i.e., Hindi in the present scenario),<br />

there are four types of situations when performing a lexical transfer from any<br />

language to any other:<br />

Situation (1) One-to-One<br />

Situation (2) Many-to-One<br />

Situation (3) One-to-Many<br />

Situation (4) No link<br />

In situation (1), the source word is found to be linked to a single target word, via a<br />

synset member of the pivot if it is neither the source nor the target for any lexical<br />

transfer. For instance, the Marathi word मुलगा can be transferred to the Hindi target<br />

word लड़का, and the Marathi word मुलगा can be transferred to the English target word<br />

‘boy’ via the pivot Hindi word लड़का. In situation (1), there is virtually no problem in<br />

performing the lexical transfer maintaining the best naturalness to the target language



speakers. In situation (2), two words from the source language synset are linked to a<br />

single word in the target language, e.g., ोरगा→छोकरा and ोरगे→छोकरा. Hence, there is no<br />

issue involved in lexical transfer maintaining naturalness. Situation (3) arises<br />

when the pivot is taken as the source language in any practical application, e.g.,<br />

Hindi→Marathi. The lexical transfer involves a puzzle with respect to the naturalness<br />

of the target word. Since the members of a synset are ordered according to their<br />

frequency of usage for a concept, we are inclined to choose the first member of the<br />

target synset as the best in this situation. For instance, the source Hindi word छोकरा<br />

‘boy’ has two choices in the target Marathi synset, i.e., ोरगा and ोरगे, as shown in<br />

figure 1. Since ोरगा appears prior to ोरगे, we choose ोरगा for lexical transfer. In<br />

situation (4), where no link is available between the source word and the target word,<br />

we choose the first member of the target synset for lexical transfer. If we need to<br />

transfer the Marathi word ोर to English, there is no further link available, since<br />

it stops at बच्चा/HW3 in the pivot (cf. figure 1). However, we choose the first member<br />

of the English synset, i.e., boy for Marathi ोर, which is quite appropriate and widely<br />

acceptable. Similarly, if the English word boy happens to be the source in the sense of<br />

‘a youthful male person’, the first member of the Marathi synset (i.e., मुलगा) is chosen<br />

as the target for lexical transfer, even if its link stops at बालक/HW2 in the pivot (cf.<br />

figure 1). In section 8, we present a user-friendly tool to align the members of the<br />

synsets across languages with respect to a particular concept. We also present a<br />

lexical transfer engine to make the aligned data usable in any system.<br />
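The selection rules of situations (1)–(4) can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the romanized words (mulga, ladka, etc.) stand in for the Devanagari forms of figure 1, and the `synsets`/`links` structures are assumptions made only for the example.

```python
# Minimal sketch of pivot-based lexical transfer (situations 1-4).
# Romanized words stand in for the Devanagari forms of figure 1; the
# data structures are assumptions made for this illustration only.

# Synset members are ordered by frequency of usage (most frequent first);
# `links` aligns (language, word) pairs with the pivot (Hindi) word.
synsets = {
    "marathi": ["mulga", "porga", "porge", "por"],
    "hindi":   ["ladka", "balak", "chhokra", "bachcha"],
    "english": ["boy", "male child"],
}
links = {
    ("marathi", "mulga"): "ladka",
    ("marathi", "porga"): "chhokra",
    ("marathi", "porge"): "chhokra",
    ("english", "boy"):   "ladka",
}

def transfer(word, src, tgt):
    """Transfer `word` from language `src` to language `tgt` for one
    shared concept, following situations (1)-(4)."""
    pivot = word if src == "hindi" else links.get((src, word))
    if pivot is not None:
        if tgt == "hindi":
            # situations (1)/(2): the pivot word itself is the target
            return pivot
        # situation (3): several target words may share the pivot link;
        # synset order encodes frequency, so take the first match
        for candidate in synsets[tgt]:
            if links.get((tgt, candidate)) == pivot:
                return candidate
    # situation (4): no usable link - fall back to the first member
    return synsets[tgt][0]
```

For instance, `transfer("chhokra", "hindi", "marathi")` picks the earlier of the two linked Marathi words, and an unlinked source word falls back to the first member of the target synset, mirroring the choices described above.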

6 Linguistic Challenges Involved<br />

In the process of synset-based multilingual dictionary development, we face a number<br />

of challenges to deal with linguistic and cultural diversity. In this section, we present a<br />

few cases that we experienced while dealing with three languages, i.e., English, Hindi<br />

and Marathi.<br />

(a) A concept may be expressed using different syntactic categories in different<br />

languages. For example, the nominal concept कलौंजी ‘stuffed vegetable’ in Hindi is<br />

expressed through an adjectival concept भरली ‘stuffed’ in the expression भरलेली भाजी<br />

‘stuffed vegetable’ in Marathi.<br />

(b) It is often the case that a concept is expressed through a synthetic expression in<br />

one language, but through a single-word expression in another language. For<br />

example, the concept ‘reduce to bankruptcy’ is expressed through a single word in<br />

English but through a synthetic expression in Hindi and Marathi, as illustrated in<br />

Table 2.


328 Rajat Kumar Mohanty et al.<br />

Table 2. Illustration of single word vs. synthetic expressions<br />

Concept | English | Hindi | Marathi<br />
‘reduce to bankruptcy’ | (V) | िदवाला िनकालना (N+V) ‘to make bankrupt’ | िदवाळे काढणे (N+V) ‘to make bankrupt’<br />
‘resulting from careful thought’ | considered (ADJ) | िवचारूवर्क िकया हुआ (ADV+VERB) ‘thoughtfully done’ | िवचारूवर्क के लेला (ADV+VERB) ‘thoughtfully done’<br />
‘least in age than the other person’ | youngest (ADJ) | किन (ADJ) | सवार्त लहान (N+ADJ) ‘among-all less-in-age’<br />

Considering Hindi as the pivot in the process of dictionary development in our<br />

approach, one has to deal with two kinds of situations: (i) a synthetic expression in the<br />

pivot to a single-word expression in the other language, (ii) a single-word expression<br />

in the pivot to a synthetic expression in the other language. In situation (ii), the<br />

question arises as to the morpho-syntactic category of the entry in the dictionary,<br />

because the synthetic element is often constituted of different syntactic categories,<br />

as shown in Table 2. In such a situation, we consider the grammatical function of<br />

the synthetic element and assign the category accordingly. For example, the<br />

Marathi expression िवचारूवर्क के लेला (ADV+VERB) ‘thoughtfully done’ refers to an<br />

adjectival function at the grammatical level, hence its syntactic category is assigned<br />

as ‘adjective’.<br />

(c) When a word expressing the meaning specific to a particular language and culture<br />

has to be mapped to another language in the dictionary, we find two ways to<br />

express the concept in another language: (i) using a synthetic expression, (ii) using<br />

transliteration when the synthetic expression would be too large. For example, the<br />

culture-specific concept of ‘ornaments and other gifts given to the bride by the bridegroom<br />

on the day of wedding’ is lexicalized in Hindi yielding चढ़ावा, but a Marathi speaker<br />

has to use a larger synthetic expression िववाहसमयी वराकडून वधुला िदले जाणारे दािगने ‘at-the-time-of-wedding–bridegroom–bride–<br />

given–ornament’ to express the same<br />

concept. The Hindi word सेहरा ‘garland’ is a culture-specific word which has no<br />

lexical equivalent in Marathi. Even using a large synthetic expression does not<br />

express the borrowed concept naturally. In such a situation, we transliterate the<br />

culture specific word into Marathi.<br />

It is also the case that a concept is culture-specific to a language other than the<br />

pivot. For example, the Marathi culture specific concept, e.g., माहेरवाशीण ‘a woman<br />

who has come to stay at her parents' place after her marriage’, is not expected to<br />

be available in the pivot language dictionary in the initial phase. Therefore, such<br />

culture-specific concepts are added to the Marathi dictionary in a monolingual<br />

manner without being mapped to the pivot language. But those are marked for<br />

review using the dictionary development tool. At a later phase, the language-<br />

specific cultural concepts can be collected and systematically added to the pivot



language to enrich the pivot, and in turn, the whole multilingual dictionary with<br />

multicultural concepts.<br />

(d) Given Hindi as the pivot language, when we develop and link the Marathi<br />

dictionary, we come across a strange situation. A concept initially recorded in the Hindi<br />

dictionary, having a singleton member in the pivot synset, can be expressed through<br />

more than one finer concept in Marathi. The Hindi word फ़ीका means ‘the food<br />

prepared with less sugar, salt or spice’, the equivalent of which is expressed in<br />

Marathi through three distinct words expressing three distinct finer concepts, i.e.,<br />

अगोड ‘less sweet’, अळणी ‘less salty’, and िमळिमळत ‘less spicy’. These three words<br />

cannot be taken as the members of a single synset in Marathi for the concept ‘the<br />

food prepared with less sugar, salt or spice’, since the three-way finer meaning<br />

distinction is very natural to Marathi speakers. Had it been the case that Marathi<br />

were the pivot, we could have been tempted to add three different concepts into<br />

the Marathi dictionary, and in turn, the Hindi dictionary could have included फ़ीका against<br />

three concepts implying that फ़ीका has three senses. As long as Hindi is the pivot, the<br />

finer concepts found in Marathi (e.g., अगोड ‘less sweet’, अळणी ‘less salty’, and<br />

िमळिमळत ‘less spicy’) cannot be mapped to the coarse concept found in Hindi (e.g.,<br />

फ़ीका ‘the food prepared with less sugar, salt or spice’). However, at a later phase of<br />

the dictionary development process, the finer concepts of Marathi (or any other<br />

languages) can be identified, and added to the pivot language, i.e., Hindi, after<br />

which the other languages can borrow the concepts from the same pivot to enrich<br />

their dictionary in the multilingual setting. The computational tool (cf. section 8)<br />

provides support for marking such cases for review, and for retrieving them all when<br />

one decides to add them to the pivot-language synsets.<br />

8 Computational Framework for the Multilingual Dictionary<br />

For effective implementation of our idea of a synset-based multilingual dictionary, we<br />

carefully designed the dictionary development process, which is, in fact, expected to<br />

involve a number of human lexicographers. Figure 2 shows the complete semiautomatic<br />

data flow in the dictionary development process.



Fig. 2. Data flow in the dictionary development process<br />

The pivot synsets are extracted from the existing Hindi WordNet along with the<br />

concept descriptions, syntactic category and examples. For convenience, an<br />

appropriate template is used for multilingual dictionary development, as illustrated in<br />

Table 3.<br />

Table 3. Dictionary entry template<br />

ID             :: 02691516<br />
CAT            :: verb<br />
CONCEPT        :: be in a state of movement or action<br />
EXAMPLE        :: "The room abounded with screaming children"<br />
SYNSET-ENGLISH :: (abound, burst, bristle)<br />
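A record in this template is simple to parse mechanically. The following sketch assumes only the `FIELD :: value` layout shown in Table 3; the parsing code is illustrative, not part of the described system.

```python
# Sketch: parsing one dictionary-entry record in the "FIELD :: value"
# template of Table 3 (illustrative code, not the project's own tool).
record = """\
ID :: 02691516
CAT :: verb
CONCEPT :: be in a state of movement or action
EXAMPLE :: "The room abounded with screaming children"
SYNSET-ENGLISH :: (abound, burst, bristle)
"""

def parse_entry(text):
    entry = {}
    for line in text.splitlines():
        field, _, value = line.partition("::")
        field, value = field.strip(), value.strip()
        if field.startswith("SYNSET"):
            # synset fields hold a frequency-ordered list: "(w1, w2, ...)"
            value = [w.strip() for w in value.strip("()").split(",")]
        entry[field] = value
    return entry

entry = parse_entry(record)
```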

The whole process, shown in figure 2, is implemented using a centralized MySQL<br />

database and a Java GUI. Screenshots of the GUI windows are shown in figures<br />

3 and 4. The language and task configuration window is shown in figure 3, and the synset<br />

entry interface is shown in figure 4. The tool accepts data in Unicode only.



Fig. 3. Language and Task Configuration Window<br />

Fig. 4. Synset entry and word-alignment interface



Once the dictionary is built out of the multilingual data as shown in figure 4, a lexical<br />

transfer engine provides the following for various usages:<br />

(i) Given a word in any language, get all the records in the specified template in the<br />

same language or in any other language. (useful for a WSD system)<br />

(ii) Given a word in any language and its part-of-speech, get all the records in<br />

the specified template in the same language or in any other language. (useful for a<br />

WSD system)<br />

(iii) Given a word in any language with respect to a particular concept, get the most<br />

appropriate translation of that word in any other language. (useful for lexical<br />

transfer in an MT system, if a WSD system is embedded in the MT system)<br />

(iv) Given a word in any language, get the most probable translation of that word in<br />

any other language. (useful for lexical transfer in an MT system having no WSD<br />

system embedded and in a cross-lingual information retrieval system)<br />
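The four query types can be viewed as one lookup with optional part-of-speech, concept and target-language filters. The sketch below assumes a flat record schema and a romanized placeholder word; it illustrates the interface, not the engine's actual code.

```python
# Sketch of queries (i)-(iv) over concept-aligned records. The schema and
# the romanized placeholder word "bharna" are assumptions for illustration.
RECORDS = [
    {"id": "02691516", "cat": "verb",
     "concept": "be in a state of movement or action",
     "lang": "english", "synset": ["abound", "burst", "bristle"]},
    {"id": "02691516", "cat": "verb",
     "concept": "be in a state of movement or action",
     "lang": "hindi", "synset": ["bharna"]},
]

def lookup(word, src_lang, tgt_lang=None, pos=None, concept=None):
    """(i) word -> records; (ii) adds a POS filter; (iii) adds a concept
    filter; (iv) word alone -> most probable translation (first member)."""
    hits = [r for r in RECORDS
            if r["lang"] == src_lang and word in r["synset"]
            and (pos is None or r["cat"] == pos)
            and (concept is None or r["concept"] == concept)]
    if tgt_lang is None:
        return hits                       # queries (i) and (ii)
    ids = {r["id"] for r in hits}         # concepts the source word has
    return [r["synset"][0] for r in RECORDS
            if r["id"] in ids and r["lang"] == tgt_lang]
```

Because synset members are frequency-ordered, returning the first member of each matching target synset realizes the "most probable translation" behaviour of query (iv).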

Using this lexical transfer engine, the multilingual dictionary is accessible online<br />

through a user-friendly website having a facility for obtaining feedback from online<br />

dictionary users. The feedback obtained from online users is expected to be useful for<br />

further development of this invaluable lexical resource.<br />

9 Conclusion and Future Directions<br />

We have reported here our experiences in the construction of a multilingual dictionary<br />

framework that is being used across language groups to create large scale MT and<br />

CLIR systems. Many challenges are faced on the way, chief amongst them being the<br />

one-on-one production of a target language lexeme corresponding to a source<br />

language lexeme. On the computational front there are challenges to be tackled for the<br />

maintenance of multilingual data, their insertion, deletion and updating in a spatially<br />

and temporally distributed situation. Among the many advantages of the framework are:<br />

(i) a linguistically sound basis of the dictionary framework, (ii) economy of<br />

representation and (iii) avoidance of duplication of effort. Our future work consists in<br />

incorporating domain sensitivity into the framework and also in solving the challenges<br />

of the distributed access and storage.<br />

References<br />

1. Vossen, Piek (ed.) 1999. EuroWordNet: A Multilingual Database with Lexical Semantic<br />

Networks for European languages. Kluwer Academic Publishers, Dordrecht.<br />

2. Christodoulakis, Dimitris N. 2002. BalkaNet: A Multilingual Semantic Network for Balkan<br />

Languages. EUROPRIX Summer School, Salzburg Austria, September 2002.<br />

3. Miller G., R. Beckwith, C. Fellbaum, D. Gross, K.J. Miller. 1990. “Introduction to WordNet:<br />

An On-line Lexical Database". International Journal of Lexicography, Vol 3, No.4, 235-244.<br />

4. Fellbaum, C. (ed.) 1998, WordNet: An Electronic Lexical Database. The MIT Press.<br />

5. Wahlster, W. (ed.). 2000. Verbmobil: Foundations of Speech-to-Speech Translation.<br />

Springer-Verlag. Berlin, Heidelberg, New York, 2000



6. Marathi Wordnet. http://www.cfilt.iitb.ac.in/wordnet/webmwn<br />

7. Jha., S., D. Narayan, P. Pande, P. Bhattacharyya. 2001. A WordNet for Hindi. Workshop on<br />

Lexical Resources in Natural Language Processing, Hyderabad, India, January, 2001.<br />

8. Ramanand, J., Akshay Ure, Brahm Kiran Singh and Pushpak Bhattacharyya. Mapping and<br />

Structural Analysis of Multilingual Wordnets. IEEE Data Engineering Bulletin, 30(1),<br />

March 2007.<br />

9. Sinha, Manish., Mahesh Reddy and Pushpak Bhattacharyya. 2006. An Approach towards<br />

Construction and Application of Multilingual Indo-WordNet. 3rd Global Wordnet<br />

Conference (GWC 06), Jeju Island, Korea, January 2006.


Estonian WordNet: Nowadays<br />

Heili Orav, Kadri Vider, Neeme Kahusk, and Sirli Parm<br />

University of Tartu, Institute of Estonian and General Linguistics<br />

Abstract. The Estonian WordNet has been under construction since 1998. After finishing<br />

EuroWordNet-2, the Estonian team continued with word sense<br />

disambiguation, using the Estonian WN as the lexicon. Many synsets have been improved<br />

since then. Nowadays the main attention is paid to specific domains<br />

and to completing synsets with glosses and examples. Adverbs constitute a totally<br />

new part of EstWN.<br />

1 Introduction<br />

The Estonian team joined the WordNet community (EuroWordNet-2) at the<br />

beginning of January 1998. In the framework of the Estonian language<br />

technology project, the Estonian WordNet was created during the years 1997–2000.<br />

After some discontinuation, the project has been revived. This year a project<br />

for extending the Estonian WordNet (EstWN) was started, supported by the Estonian National<br />

Programme on Human Language Technology.<br />

In this paper we aim to give an overview of the development of EstWN and the<br />

problems which we face in everyday work. The Estonian WordNet at the<br />

present stage includes 10372 noun synsets, 1580 adjective synsets and 3252 verb<br />

synsets. In parallel, the thesaurus is increasing in size, with new semantic<br />

relations being added and specific domains being specified.<br />

[Figure 1 is a bar chart, “Estonian WordNet in Numbers”, with bars (scale 0–35000) for nouns, verbs and adjectives over the categories: synsets, semantic relations between synsets, word senses, lexical entries (lemmas), and ILI relations.]<br />

Fig. 1. Current state of Estonian WordNet (September 2007)



2 The Dynamics of Progress<br />

We can consider the EuroWordNet-2 project as the first stage of our WordNet building.<br />

It ended up with ca 9500 synsets. Every synset had to have at least<br />

one language-internal relation and one InterLingual Index (ILI) relation. Hyperonymy/hyponymy<br />

links had first priority, but several other semantic relations<br />

were added during more intensive work on specific topics (see Tab. 1).<br />

Table 1. Distribution of language-internal relations by part of speech<br />

Language-internal relations   Nouns   Verbs   Adjectives<br />
Hyperonymy/hyponymy           14752    6444            0<br />
Near-synonymy                   157     145          322<br />
Antonymy                        198     122          126<br />
Causation                       118     188            2<br />
Involvement and roles           262     216            1<br />
Subevents                         7      45            0<br />
Holonymy/meronymy               294       0            0<br />

Among the ILI relations we can see a fairly great number of non-equal synonyms. This<br />

can have two main reasons: first, near-synonymy relations are the result of differences<br />

in word-sense distribution. The members of Estonian synsets sometimes<br />

do not map precisely onto English ones. Second, there are plenty of language-specific<br />

concepts that do not have an equal match in English – they have eq-hyperonymy/hyponymy<br />

relations to describe their exact meaning via ILI.<br />

Table 2. Distribution of interlingual relations by part of speech<br />

Interlingual index relations (ILI)   Nouns   Verbs   Adjectives<br />
eq_synonym                            5980    2153          291<br />
eq_near_synonym                        826    1308           25<br />
eq_has_hyperonym                       708     181            0<br />
eq_has_hyponym                         279     136            0<br />
eq_causes                                2      24            0<br />
eq_is_caused_by                         39      83            0<br />
eq_be_in_state                          91      37            0<br />
eq_involved                            144      95            0<br />
eq_has_holonym                          11       0            0<br />
eq_has_meronym                          50       0            0



The second stage in Estonian WN development started after the end of the EuroWordNet<br />

project. Our main focus was on EstWN applications.<br />

The WSD task on SENSEVAL-2 showed that several word senses were missing<br />

(Kahusk and Vider 2002). Problems in manual disambiguation revealed the<br />

need for more precise sense borders in EstWN, so we added glosses and examples<br />

to many synsets. The glosses come mainly from the Explanatory Dictionary of<br />

Estonian (EKSS) and examples come from our new WSD corpus.<br />

The third stage in Estonian WN development started in the year 2000. In<br />

contrast to the first stage, which involved much manual work, the present-day<br />

EstWN contains about 4500 noun, verb and adjective synsets that were added<br />

from the Estonian dictionary of synonyms (Õim [5]) by automatic extraction.<br />

Still, these are only lexical synonym entries without any glosses, examples or<br />

semantic relations. We have imported some glosses and examples from EKSS, but<br />

language-internal semantic relations and ILI links are provided by lexicographers.<br />

3 Adverbs<br />

In September 2007 we started to add adverbs to the Estonian WordNet. Adverbs<br />

clarify or modify the meaning of a sentence, so they have an important<br />

role in sentence meaning. The most common adverbs in Estonian are those of<br />

quantity and time, and some of them have multiple senses. For example, the adverb ‘veel’<br />

(more; still/yet) means in one context a greater number or quantity and in<br />

another context the time mentioned previously; moreover, the adverb ‘veel’ is<br />

sometimes used in both senses at once (as a quantifier and as a time particle).<br />

We started with adverbs of time, such as ‘täna’ (today) and ‘homme’ (tomorrow),<br />

and with polysemous adverbs, such as ‘jälle’ (again) and ‘juba’ (already).<br />

There are some problems with semantic relations which we have needed to solve.<br />

Estonian adverbs typically express some relation of space, time, manner, degree,<br />

cause, inference, condition, exception, purpose, or means. The question is how<br />

specifically we need to mark semantic relations between adverb meanings, e.g. those<br />

expressing some relation of time. For example, in classical semantic analysis the time adverb ‘veel’<br />

(still/yet) is the antonym of the time adverb ‘juba’ (already), or at least a near<br />

antonym.<br />

In further work, we are continuing with manner adverbs, whose meanings<br />

are mostly linked to adjective senses. Estonian manner adverbs are often<br />

formed by adding the suffix -sti or -lt, as in ‘kiire+sti=kiiresti’ or ‘kiire+lt=kiirelt’<br />

(quickly/rapidly), so these derived adverbs usually inherit the sense of the base<br />

adjective and also its semantic relations; for example, ‘kiirelt’<br />

(rapidly) is an antonym of ‘aeglaselt’ (slowly).
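The inheritance step for derived manner adverbs can be sketched as follows; the lexicon fragment and helper names are assumptions made for the example, not EstWN data.

```python
# Sketch: a derived manner adverb (-sti/-lt) inherits the semantic
# relations of its base adjective. Toy lexicon, not actual EstWN data.
adjective_antonyms = {"kiire": "aeglane"}                 # fast <-> slow
adverb_of = {"kiire": "kiirelt", "aeglane": "aeglaselt"}  # derived forms

def derive_adverb_antonym(adjective):
    """The antonym of the derived adverb is the derived form of the
    base adjective's antonym."""
    base_antonym = adjective_antonyms[adjective]
    return adverb_of[adjective], adverb_of[base_antonym]
```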



4 Adjectives<br />

The most thoroughly examined domain in EstWN is that of the adjectives of personality<br />

traits. In Estonian this specific domain includes around 1200 words or expressions,<br />

which accordingly form around 400 synsets in the Estonian WordNet.<br />

Semantically, words and expressions of character traits converge into certain<br />

concept groups. The composition of the character vocabulary showed the definitions<br />

of intrapersonal or interpersonal qualities to be for the most part broader<br />

and more general, and it is into these two vast categories that the vocabulary is<br />

divided. Based on the material, 55 concept groups of personality traits have been<br />

defined, some with subsequent subgroups, and formed mainly on the basis of<br />

synonymy/antonymy relationships (Orav 2006). In the future we plan to examine<br />

more domains which are represented mostly by adjectives, for example colours,<br />

weather etc.<br />

5 Specific domains<br />

Some students have studied specific domains of the language (e.g.<br />

transportation and motion; see Fig. 2) and increased the number of synsets up<br />

to 500 per domain.<br />

The semantic fields which are covered in detail at this stage of the work are<br />

shown in Fig. 2.<br />

[Figure 2 lists the covered domains.] Processes/actions: directive verbs (270 words), motion verbs (ca 300 words). Entities/phenomena: buildings, music and measure instruments, emotions, food (in the frame of the EuroWordNet-2 project), transportation and weather (ca 300 words both), ceremonies (wedding and funeral). Attributes/properties: adjectives of personality traits (ca 1200 words).<br />

Fig. 2. Specific domains in Estonian WordNet.



6 Availability<br />

An application of EstWN called TEKsaurus is an online service<br />

based on the Estonian WordNet. TEKsaurus is browsable on the Internet1. The<br />

engine behind TEKsaurus is a Python script running on the server. In the first stage,<br />

the EstWN export file is used to generate an index file of literals. The server-side<br />

engine uses the same export file to find the offsets where synset data are found and<br />

presented to the browser.<br />

For more specific description, see Kahusk and Vider (2005).<br />
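The two-stage lookup described above (an offline index of literals, then offset-based access at query time) can be sketched as follows. The export format used here (one literal per line, followed by its synset data) is a simplifying assumption, not the real EstWN export format.

```python
# Sketch of a TEKsaurus-style lookup: (1) build an index mapping each
# literal to a byte offset in the export file, (2) seek to that offset on
# request. The one-entry-per-line format is an assumption for illustration.
import io

export = io.BytesIO(b"puu\tsynset data for 'puu' (tree)\n"
                    b"maja\tsynset data for 'maja' (house)\n")

def build_index(f):
    index, pos = {}, 0
    for line in f:
        literal = line.split(b"\t", 1)[0].decode("utf-8")
        index[literal] = pos              # offset where the entry starts
        pos += len(line)
    return index

def fetch(f, index, literal):
    f.seek(index[literal])                # jump straight to the entry
    return f.readline().decode("utf-8").rstrip("\n")

idx = build_index(export)
```

The index makes each query a single seek and read instead of a scan of the whole export file.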

The Estonian WordNet source file is available at ELRA.<br />

7 Acknowledgements<br />

The Estonian WordNet is supported by the National Programme “Language Technology<br />

Support of Estonian Language” projects No. EKKTT04-5, EKKTT06-11<br />

and EKKTT07-21, and by the Government Target Financing project SF0182541s03<br />

“Computational and language resources for Estonian: theoretical and applicational<br />

aspects.”<br />

References<br />

1. Eesti Kirjakeele Seletussõnaraamat (Explanatory Dictionary of Estonian) Eesti NSV<br />

TA Keele ja Kirjanduse Instituut, Eesti Keele Instituut. Tallinn, 1988–. . .<br />

2. Kahusk, N., and Vider, K. (2002) Estonian WordNet benefits from word sense disambiguation.<br />

In: Proceedings of the 1st International Global WordNet Conference,<br />

Central Institute of Indian Languages, Mysore, India. pp. 26–31<br />

3. Kahusk, N., and Vider, K. (2005) TEKsaurus – The Estonian WordNet Online. In:<br />

The Second Baltic Conference on Human Language Technologies, April 4–5, 2005.<br />

Proceedings, Tallinn. pp. 273–278.<br />

4. Orav, H. (2006) Isiksuseomaduste sõnavara semantika eesti keeles. Dissertationes<br />

Linguisticae Universitatis Tartuensis. 6. Tartu Ülikooli Kirjastus. Tartu.<br />

5. Õim, A. (1991) Sünonüümisõnastik (Estonian dictionary of synonyms) Oma kulu<br />

ja kirjadega. Tallinn<br />

1 http://www.cl.ut.ee/ressursid/teksaurus


Event Hierarchies in DanNet<br />

Bolette Sandford Pedersen 1 and Sanni Nimb 2<br />

1<br />

University of Copenhagen, Njalsgade 80, 2300 S, Denmark,<br />

2<br />

Det Danske Sprog- og Litteraturselskab, Christians Brygge 1, 1219 K, Denmark<br />

bolette@cst.dk, sn@dsl.dk<br />

Abstract. The paper discusses problems related to the building of event<br />

hierarchies on the basis of an existing lexical resource of Danish, Den Danske<br />

Ordbog (DDO). Firstly, we account for the reuse principles adopted in the<br />

project where some of the senses given in DDO are either collapsed or<br />

readjusted. Secondly, we discuss the semantic principles for building the<br />

DanNet event hierarchy. Following the line of Fellbaum, we acknowledge that<br />

the manner relation (troponymy) must be defined as the main taxonomical<br />

principle for describing verbs, but we observe some complications with this<br />

organizing method since many subordinate verbs tend to specify other meaning<br />

dimensions than manner. We suggest encoding verbs that do not follow the<br />

strict manner pattern as ‘orthogonal’ to the basic hierarchy, a strategy which<br />

allows for compatibility between taxonomical and non-taxonomical sister<br />

synsets.<br />

Keywords: WordNets, events, event hierarchies, verbs, troponymy.<br />

1 Introduction<br />

Building meaningful event hierarchies proves to be a challenging task, in many<br />

respects much harder than building taxonomies over 1st order entities. Firstly, event<br />

hierarchies are not quite as intuitive as hierarchies of 1st order entities, and secondly,<br />

there seems to be an extra measure of indeterminacy in the meaning of a verb which<br />

complicates the issue at several levels. The aim of this paper is to present and discuss<br />

some of the principles that we have applied in order to ease the construction of<br />

consistent event hierarchies in the DanNet WordNet, basing the encodings partly on a<br />

big traditional dictionary of Danish, Den Danske Ordbog (DDO) [1], and partly on<br />

encodings from an EU project on semantic computational lexica, SIMPLE (Semantic<br />

Information for Multifunctional, Plurilingual Lexica) [2].<br />

It is generally accepted that current WordNets are built on rather heterogeneous<br />

subsumption relations, a fact which has been discussed and questioned in the literature by<br />

both formal ontologists [3] and WordNet builders [4], [5]. The apparently rather<br />

messy taxonomical structure of many WordNets should however not be judged as<br />

inconsistent or incompetent work, but rather as a result of the fact that they are built<br />

on the basis of corpus-derived lexical data, and thereby they actually represent the<br />

variety and complexity of lexical items with their characteristic heterogeneous mixture



of types and roles. This can be seen as a contrast to formal ontologies where main<br />

attention is paid to types in the ontology skeleton. In DanNet, we distinguish between<br />

types and roles for 1st order entities in the sense that we propose to apply Cruse's<br />

distinctions [6] on nouns between Natural kinds, Functional kinds, and Nominal kinds<br />

(see [5]). This distinction is carried out in order to determine when a synset should be<br />

categorised as “orthogonal” to the hierarchy (i.e. as a role), as in cases of nominal<br />

kinds like for instance climbing trees, and when it is actually a type in the main<br />

taxonomy, as is the case for natural kinds like for instance oaks.<br />

In this paper we wish to examine whether similar distinctions can help clear up 2nd<br />

order entities in terms of event hierarchies. Fellbaum [4] discusses semantically<br />

heterogeneous manner relations (defined as troponymy) and argues that similar cases<br />

do hold between verbs. She gives the examples move and exercise and proposes that<br />

parallel hierarchies should be established, allowing verbs like run and jog to act as<br />

subordinates of both move and exercise. From a practical viewpoint, we claim that<br />

such parallel hierarchies are, however, extremely complicated to build and maintain in<br />

a consistent way over a large scale, and we therefore adopt the solution as mentioned<br />

above of marking non-taxonomical synsets as “orthogonal”. Such a marking indicates<br />

among other characteristics that there is no incompatibility between an orthogonal<br />

sister and a taxonomical sister; an oak may be a climbing tree at the same time, just<br />

as practically any moving event could also be seen as an exercising event in a<br />

specific context.<br />
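The compatibility behaviour of the orthogonal marking can be sketched with a toy fragment, loosely based on the oak / climbing tree and move / exercise examples above; the data and names are assumptions, not DanNet content.

```python
# Sketch: sister synsets are mutually incompatible only when both are
# taxonomical types; an "orthogonal" sister (a role) is compatible with
# any taxonomical sister. Toy fragment, not actual DanNet data.
hyponyms = {
    "tree": [("oak", "taxonomical"), ("birch", "taxonomical"),
             ("climbing tree", "orthogonal")],
    "move": [("run", "taxonomical"), ("exercise", "orthogonal")],
}

def compatible(parent, a, b):
    """An entity/event may instantiate an orthogonal sister and a
    taxonomical sister at once (an oak can be a climbing tree)."""
    kinds = dict(hyponyms[parent])
    return "orthogonal" in (kinds[a], kinds[b])
```

This single flag avoids maintaining full parallel hierarchies while still recording that orthogonal and taxonomical sisters do not exclude each other.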

Two classical lexical phenomena tend to complicate the establishment of event<br />

hierarchies even further, namely those of polysemy and synonymy. In DanNet we take<br />

the sense distinctions given by DDO (which are corpus-based) as our starting point,<br />

but in the case of the verbs, we have realised that some reconsideration of polysemy<br />

and synonymy is necessary when building a WordNet.<br />

Regarding polysemy, some principles for when to split senses and more<br />

importantly when to merge polysemous senses have therefore been developed. Also the<br />

establishment of synonymy calls for further clarification, since it is not always obvious<br />

when two verbs denote one or two events. Finally, we observe that within some<br />

domains, the manner relation is not the main organizing principle at all. In<br />

these cases, we propose an under-specification of the taxonomical description, but<br />

suggest specifying the particular meaning dimension via other specific relations, if<br />

possible.<br />

The paper is organized as follows: we start in Section 2 with a brief introduction to<br />

the DanNet project as a whole, a WordNet project which relies heavily on reuse of<br />

existing lexical data. Then we present some problems especially connected to the<br />

reuse of verb entries from DDO (Section 3). In Section 4 we discuss the building of<br />

DanNet verb hierarchies and introduce the orthogonal hyponym as a way of dealing<br />

with a series of non-typical cases of troponymy. Finally in Section 5 we conclude and<br />

turn to future work, where we plan to combine DanNet with a deeper FrameNet-like<br />

description of verbs.



2 DanNet: Background<br />

DanNet [7] is a collaborative project between a research institution, Center for<br />

Sprogteknologi, University of Copenhagen, and a literary and linguistic society, Det<br />

Danske Sprog- og Litteraturselskab under The Danish Ministry of Culture. In the<br />

project we exploit the large dictionary, DDO (approx. 100,000 senses) and the<br />

ontological resource SIMPLE-DK (11,000 senses), i.e. the Danish part of the EU project<br />

Semantic Information for Multifunctional, Plurilingual Lexica [2].<br />

The first phase of the project is coming to an end, and the DanNet database has<br />

now reached a size of 40,000 synsets, of which 6,000 are verb senses. In the second<br />

phase of the project, the goal is to achieve the complete coverage of DDO, namely<br />

approx. 65,000 senses, disregarding in this context most multiword expressions.<br />

3 Reuse Perspectives<br />

DDO contains 6600 Danish verb lemmas amounting to 19,000 senses in all. For<br />

verbs, as for all other word classes, genus proximum information is assigned in a<br />

specific field in DDO, coinciding with a superordinate verb which is already a part of<br />

the word definition. At first glance, it therefore seems straightforward to reuse this<br />

information on genus proximum when building the event taxonomy in DanNet, just<br />

as is done in the case of nouns. When we look more closely into the data, we see that<br />

many verb senses share the same genus proximum and that the verbs which are most<br />

frequently used as genus proximum often have very vague meanings. As an example,<br />

4755 verb senses (25 % of the total number of senses) share the same 15 verbs as<br />

genus proximum. These 15 verbs are all extremely polysemous lemmas; on average<br />

they in fact have 22 main senses and sub-senses each.<br />
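A count of this kind is straightforward to reproduce from the sense records; the toy records below are assumptions for illustration, not DDO content.

```python
# Sketch: counting how many verb senses share each genus proximum
# (the analysis behind Table 1). Toy sense records, not DDO data.
from collections import Counter

senses = [
    {"lemma": "løbe",  "genus_proximum": "bevæge"},
    {"lemma": "hoppe", "genus_proximum": "bevæge"},
    {"lemma": "skabe", "genus_proximum": "gøre"},
]

counts = Counter(s["genus_proximum"] for s in senses)
top = counts.most_common()                # most used superordinates first
share = counts["bevæge"] / len(senses)    # fraction of all senses
```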

Given the fact that the genus proximum in DDO is not marked with respect to<br />

sense distinctions in the many cases of polysemy, it becomes clear that the genus<br />

proximum information given for verbs in DDO does not automatically indicate a<br />

reusable structure of a general hierarchy, especially not at the top level of the<br />

network. We have therefore also drawn on information from the network of SIMPLE-<br />

DK and have been forced to manually adjust a large set of the hyponymy relations<br />

given in DDO (see Section 4 for further details).<br />

These adjustments regarding the event taxonomy are furthermore challenged by the<br />

classical dilemma of when to merge and when to split senses, see e.g. [8], [9] and<br />

[10]. Frequent verbs are described in DDO at a very detailed level with many senses<br />

and sub-senses (see Table 1). The question is whether we necessarily want to<br />

maintain the fine-grainedness of DDO. Is it at all manageable in a semantic net meant<br />

for computer systems? And if not, how do we ensure a systematic reduction of<br />

senses? And vice versa: are there cases where we need to split DDO senses in order to<br />

capture important ontological differences?


342 Bolette Sandford Pedersen and Sanni Nimb<br />

Table 1. The distribution and the polysemy of verb genus proxima in DDO<br />

Genus proximum | number of main senses and sub-senses of this genus proximum lemma in DDO (without phrasal verb senses and idiom senses) | number of verb senses described by this genus proximum in DDO (total number of verb senses in DDO: 19,000)<br />
gøre (to do) | 25 | 743<br />
være (to be) | 20 | 580<br />
give (to give) | 17 | 506<br />
få (to get/have) | 26 | 413<br />
bevæge (to move) | 4 | 391<br />
have (to have) | 23 | 376<br />
blive (to become) | 11 | 329<br />
fjerne (to remove) | 5 | 229<br />
lade (to let) | 11 | 195<br />
tage (to take) | 71 | 187<br />
komme (to come) | 23 | 187<br />
bringe (to bring) | 8 | 182<br />
gå (to go) | 35 | 171<br />
sætte (to put) | 30 | 161<br />
holde (to hold) | 24 | 105<br />
total: 15 verbs | 333 senses | 4,755 = 25% of all verb senses<br />

Starting with the latter case, we sometimes find verb definitions in DDO covering<br />
what we would in DanNet consider two different senses with different<br />
hyperonyms. One example is krumme (to bend), which is defined as ‘to be or to<br />
become curved’. The definition covers two types of telicity and therefore represents<br />
two different ontological types in DanNet, resulting in a split into two<br />
synsets. Another example is afbilde (to depict), where one definition in DDO in fact<br />
covers two DanNet senses: 1) somebody illustrates something by producing a<br />
mathematical figure; 2) a mathematical figure shows something. The first part of the<br />
definition describes an act by a person, whereas the second rather denotes a state.<br />

Summing up, we apply the following procedure:<br />

• split senses when a sense in DDO covers both an activity and a state.<br />

If we now move to the verbs that are described as polysemous in DDO, it turns out<br />

that 1,500 of the 6,600 verbs in DDO have more than one sense, meaning that approx.<br />
14,000 senses come from polysemous verbs. In other words, each polysemous verb<br />

has an average of almost 10 senses. The general assumption in DanNet is that we<br />

maintain the main sense divisions given in DDO since we rely on them as being<br />

actually corpus-based and therefore relevant for our purpose. For instance, we<br />

maintain the four sense distinctions given for lukke (to close): 1) to close a window,<br />

the eyes, a door 2) to close a bag 3) to close a road, a passage and 4) to stop a function<br />

(e.g. the television). Nevertheless, for some of the more systematic main sense cases,<br />

we have adopted two merging strategies:


Event Hierarchies in DanNet 343<br />

• merge senses describing a certain (physical) act being performed by either a<br />

human being or another living entity. Example: æde (to eat) which has two<br />

main senses in DDO, one for animals and one for humans.<br />

• merge senses describing different valency patterns, but with the same<br />

meaning, such as ergative verbs like geare ned (gear down (fig.)) which can<br />

either be intransitive or transitive.<br />
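The two merging strategies above can be sketched as a predicate over pairs of DDO main senses. The dictionary fields (`act`, `agent`, `valency`) are invented for illustration and do not reflect the actual DDO record structure.

```python
def should_merge(a, b):
    """Decide whether two DDO main senses collapse into one DanNet synset.

    Each sense is a dict with illustrative fields: 'act' (the activity
    described), 'agent' ('human'/'animal'), and 'valency'
    ('transitive'/'intransitive').
    """
    same_act = a["act"] == b["act"]
    # Strategy 1: the same physical act performed by a human vs. another
    # living entity, as in the two DDO main senses of 'æde' (to eat).
    if same_act and {a["agent"], b["agent"]} == {"human", "animal"}:
        return True
    # Strategy 2: identical meaning under different valency patterns,
    # as with ergative verbs like 'geare ned' (gear down).
    if same_act and a["agent"] == b["agent"] and a["valency"] != b["valency"]:
        return True
    return False

aede_human = {"act": "eat", "agent": "human", "valency": "transitive"}
aede_animal = {"act": "eat", "agent": "animal", "valency": "transitive"}
```

With these toy records, `should_merge(aede_human, aede_animal)` is true, mirroring the æde example.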

When it comes to the many sub-senses of verbs in DDO, we generally find far more<br />
cases of potential merging. A main principle is to merge a sub-sense with its main<br />
sense when the sub-sense represents (i) a more restricted sense, or (ii) an extended<br />
sense. In contrast, figurative senses are generally maintained, since they often belong<br />
to a different ontological type. See Fig. 1.<br />

[Figure: a DDO verb entry with its main sense and sub-senses (case 1: restricted sense; case 2: extended sense; case 3: figurative sense) mapped onto the DanNet synsets {SynSet 1} and {SynSet 2}]<br />

Fig. 1. Merging sub-senses from DDO with the main sense<br />

4 Verb Descriptions in DanNet<br />

A preliminary encoding of the first 6,000 verb synsets has recently been completed in<br />

the project. Although highly inspired by the SIMPLE-DK descriptions of events (built<br />

partly on Levin classes, cf. [11] and [12]), we have chosen to apply the EWN Top<br />

Ontology of 2nd Order Entities (see Figure 2) in order to be compatible with other<br />

WordNets developed within this framework. In order to guide the encoding work,<br />

approx. 60 event templates have been established, combining situation types and<br />

situation components in different sets. The main dividing principle is that of telicity,<br />

as reflected in the situation types BoundedEvent and UnboundedEvent. However, in<br />

Danish, telicity is in most cases specified by means of verb particles, and not, as in<br />
Romance languages, in the verbal root. This can be seen for instance in the<br />

verb spise (eat), which seen in isolation denotes an atelic, unbounded event, as opposed<br />
to the phrasal verb spise op (finish one's food), which denotes a telic, bounded event.<br />

Phrasal verbs in general constitute a large part of the encoded senses in DanNet, and<br />

many verbs have parallel encodings as bounded and unbounded events depending on<br />

the presence or absence of a phrasal particle.<br />
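A minimal sketch of the telicity default described above, assuming (as the text suggests for the typical Danish case) that a phrasal particle signals a bounded reading; the real encoding uses the approx. 60 richer event templates.

```python
def situation_type(verb, particle=None):
    """Assign the EWN situation type along the telicity dimension.

    Danish typically marks telicity with a verb particle rather than in
    the verbal root, so the bare verb defaults to the atelic
    UnboundedEvent and the phrasal variant to the telic BoundedEvent.
    """
    return "BoundedEvent" if particle else "UnboundedEvent"

# spise (eat) vs. spise op (finish one's food):
bare = situation_type("spise")
phrasal = situation_type("spise", particle="op")
```

This deliberately crude default also shows why many verbs end up with parallel bounded and unbounded encodings: the same root occurs both with and without a particle.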

Fig. 2. The EWN Top Ontology, cf. [13]<br />

The building process was initiated by working with the genus proxima in DDO<br />
that denote physical events and where the encoded hyperonym proved to be more or<br />
less reliable, such as bevæge sig (move), fjerne (remove), stille (place), ytre (evince,<br />
express) etc. A feature in the DanNet tool enables us to identify such groups directly<br />
from DDO, thereby showing us where to find larger groups of more or less<br />
homogeneous verbs.<br />

The encodings of physical as well as communicative verbs served as the first<br />

building blocks for the event hierarchy. When about half of the verb vocabulary had<br />

been provisionally encoded, the need emerged to work top-down in order to link the<br />
verb groups together in a joint network. For this purpose, 24 Danish top-ontological<br />
verbs were identified, thereby forming a language-specific parallel to the EWN Top<br />
Ontology onto which all other verb senses are subsequently linked.<br />

4.1 Determining the Taxonomical Structure of Events<br />

Where the main organizing mechanism behind 1st and 3rd Order Entities is constituted<br />
by the hyponymy relation, events seem to be better organized along the dimensions of<br />
the manner relation – or troponymy relation – as proposed by Fellbaum. To give an<br />
example from DanNet, guffe (scoff) is a (quick and rough) way of spise (eat), which<br />
again is a way of indtage (consume), which again is a way of handle (act), etc., as<br />
depicted in Figure 3.<br />

Fig. 3. Indtage (consume) with some of its hyponyms<br />

Some verbs in the domain, however, tend to fall out of the pattern of troponymy, or<br />
at least they denote another dimension of the manner relation. The verb trøstespise<br />
(eat for comfort, i.e. be a compulsive eater), which is composed of the two verbs<br />
trøste (comfort) and spise (eat), is an example of a verb which does not<br />
relate to the physical manner of eating (fast, slow, nice, ugly, large amounts, small<br />
amounts), but rather to a psychological dimension of eating. As seen in Figure 4, we<br />
thus assign the hyponym as orthogonal to spise, depicted in the figure by a rhombus.<br />

Another solution would have been to follow Fellbaum, and establish parallel<br />

hierarchies by encoding trøstespise both as an eating event and as a comforting event,<br />

i.e. as a troponym of trøste_sig. Such multiple inheritance is possible in the DanNet<br />

framework, but for pragmatic reasons we have decided not to establish such parallel<br />

hierarchies if they can be avoided, since they prove hard to encode in a consistent<br />

way.



Fig. 4. trøstespise (eat for comfort) as orthogonal to spise (eat)<br />

A similar situation arises when encoding near synonyms. Synonymy further<br />

complicates the establishment of event hierarchies since, compared to 1st Order<br />
Entities, it is often much more unclear when two verbs actually refer to the same<br />

event. Figure 5 is a screenshot from the DanNet encoding tool which illustrates the<br />
problem with the synset tilberede (prepare (food)) together with its co-synonyms<br />
tillave, preparere and lave. Kokkerere (to perform finer cooking) in Danish is another<br />
word for preparing food, but it gives a specific association to finer cooking. Therefore<br />
it has been placed as a subordinate of {tilberede, tillave, lave, preparere}. This is at<br />
first glance unproblematic, except that it specifies another semantic dimension than bage<br />
(bake), pochere (poach), spejle (fry (an egg)), koge (boil) etc., where the manner<br />
component is clearly in focus, describing exactly what kind of heating process the<br />
food undergoes. Therefore, we encode kokkerere as orthogonal to the rest of the<br />
hyponyms of cook, again visualized in the figure by a rhombus. Note that the orthogonal<br />
synset is characterized by being compatible with its sisters (unlike taxonyms); while<br />
performing finer cooking the ingredients may actually undergo frying, baking<br />
and boiling at the same time.<br />
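The compatibility behaviour of orthogonal hyponyms can be modelled with a simple flag on the hyponymy link; the class and field names below are illustrative, not the DanNet database schema.

```python
class Hyponym:
    """A hyponymy link in the event hierarchy.

    orthogonal=True marks a hyponym that specifies a dimension other than
    manner (drawn as a rhombus in the DanNet figures).
    """
    def __init__(self, synset, hypernym, orthogonal=False):
        self.synset = synset
        self.hypernym = hypernym
        self.orthogonal = orthogonal

    def compatible_with(self, sister):
        # Taxonomic sisters exclude one another (baking is not boiling),
        # but an orthogonal hyponym may co-occur with any sister.
        return self.orthogonal or sister.orthogonal

bage = Hyponym("bage", "tilberede")        # bake
koge = Hyponym("koge", "tilberede")        # boil
kokkerere = Hyponym("kokkerere", "tilberede", orthogonal=True)
```

Here `kokkerere.compatible_with(bage)` holds while `bage.compatible_with(koge)` does not, mirroring the observation that finer cooking may take place by means of frying or baking.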

Fig. 5. kokkerere (to perform finer cooking) as orthogonal to tilberede (to cook)<br />

4.2 Other Organizing Principles than Manner?<br />

During our work, we have observed that within some physical domains, the main<br />

organizing principle cannot actually be characterized as a manner relation. Under the<br />

verb fjerne (remove), we find a series of verbs such as afkalke (decalcify, descale),<br />

afluse (delouse), affugte (dehydrate) and affarve (discolour, bleach) which specify<br />

what is removed from the object and not how it is removed. Likewise, in the domains



of most mental verbs, it proves to be the case that more subtle meaning dimensions<br />

are specified in the different hyponyms. Under the verb tænke (think) we find verbs<br />

like dagdrømme (daydream), bekymre sig (worry), forske (investigate) and mindes<br />
(recall), a very heterogeneous group of verbs organized along different dimensions of<br />
meaning that are not satisfactorily labeled as manner relations.<br />

5 Conclusions and Future Work<br />

In this paper we have discussed some problems related to the building of event<br />

hierarchies on the basis of existing lexical resources of Danish. Following the line of<br />

Fellbaum, we acknowledge that the manner relation (troponymy) must be defined as<br />

the main taxonomical principle for describing verbs, but we also observe from the<br />

practical encoding that there are several complications with this organizing method<br />

since many subordinate verbs tend to specify other meaning dimensions than manner.<br />

If we look at verbs denoting physical events, which are actually the least complicated<br />

to work with, the manner relation is by far the most frequent relation, but there are<br />

many exceptions where a verb denotes a slightly different semantic dimension. In<br />

several of these cases, the verb could also be organized under another hyperonym<br />

thereby building parallel hierarchies, a strategy that we have, however, abandoned for<br />
maintenance reasons. Instead we suggest marking verbs that do not follow the strict<br />
manner pattern with a feature stating that they denote an orthogonal dimension of<br />

meaning to the basic hierarchy. This can be seen as parallel to the way that we encode<br />

1st Order Entities, where we distinguish taxonomical and non-taxonomical hyponymy<br />

relations. We believe that by introducing this division, we obtain cleaner event<br />

hierarchies, and we allow for the compatibility between taxonomical and<br />
non-taxonomical synsets (i.e. for kokkerere (to perform finer cooking) to take place by<br />
means of frying and baking, etc.).<br />

Future plans regarding semantic verb descriptions in DanNet include combining<br />

the resource with the already existing syntactic lexical database STO [14] in order to<br />

relate each verb sense to its corresponding valency pattern. We also intend to further<br />

specify the semantics of Danish verbs in a FrameNet-like project for which we are<br />

currently applying for funding. The hypothesis is that groups of verbs sharing the<br />

same hyperonym in DanNet as well as the same ontological type, are also candidates<br />

to be members of the same Semantic Frames in a Danish FrameNet, meaning that<br />

they will share the same semantic roles and to some degree also similar selectional<br />

restrictions.<br />

References<br />

1. DDO = Hjorth, E., Kristensen, K. et al. (eds.): Den Danske Ordbog 1-6 (‘The Danish<br />

Dictionary 1-6’). Gyldendal & Society for Danish Language and Literature (2003–2005)<br />

2. Lenci, A., Bel, N., Busa, F., Calzolari, N., Gola, E., Monachini, M., Ogonowski, A., Peters,<br />

I., Peters, W., Ruimy, N., Villegas, M., Zampolli, A.: ‘SIMPLE – A General Framework for<br />
the Development of Multilingual Lexicons’. International Journal of Lexicography 13, pp.<br />
249–263. Oxford University Press (2000)<br />

3. Guarino, N., Welty, C.: ‘Identity and Subsumption’. In: Green, R., Bean, C.A., Myaeng, S.H.<br />
(eds.) The Semantics of Relationships: An Interdisciplinary Perspective, Information<br />

Science and Knowledge Management. Springer Verlag (2002)<br />

4. Fellbaum, C.: Parallel Hierarchies in the Verb Lexicon. Proceedings of the OntoLex<br />

Workshop, LREC, pp. 27–31. Las Palmas, Spain (2002)<br />

5. Pedersen, B.S., Sørensen, N.: Towards Sounder Taxonomies in Wordnets. In: Oltramari, A.,<br />

Huang, C.R., Lenci, A., Buitelaar, P., Fellbaum, C. (eds.) OntoLex 2006 at 5th International<br />

Conference on Language Resources and Evaluation, pp. 9–16. Genova, Italy (2006)<br />

6. Cruse, D.A.: ‘Hyponymy and Its Varieties’. In: Green, R., Bean, C.A., Myaeng, S.H. (eds.)<br />

The Semantics of Relationships: An Interdisciplinary Perspective, Information Science and<br />

Knowledge Management. Springer Verlag (2002)<br />

7. Asmussen, J., Pedersen, B.S., Trap-Jensen, L.: DanNet: From Dictionary to WordNet. In:<br />
Kunze, C., Lemnitzer, L., Osswald, R. (eds.) GLDV-2007 Workshop on Lexical-Semantic<br />
and Ontological Resources, pp. 1–11. Universität Tübingen, Germany (2007)<br />

8. Hanks, P.: Do word meanings exist? Computers and the Humanities, 34 (1-2), Special<br />

Issue on the Proceedings of the SIGLEX/SENSEVAL Workshop, A. Kilgarriff and M.<br />

Palmer, eds., 171–177 (2000)<br />

9. Kilgarriff, A.: I don’t believe in word senses. Computers and the Humanities, 31 (1-2), 1–<br />

13 (1997)<br />

10. Palmer, M., Dang, H.T., Fellbaum, C.: Making fine-grained and coarse-grained sense<br />

distinctions. Journal of Natural Language Engineering (2005)<br />

11. Levin, B.: English Verb Classes and Alternations - A Preliminary Investigation. The<br />

University of Chicago Press (1993)<br />

12. Pedersen, B. S., Nimb, S.: Semantic Encoding of Danish Verbs in SIMPLE - Adapting a<br />

verb-framed model to a satellite-framed language. In Proceedings from the Second<br />

International Conference on Language Resources and Evaluation, pp. 1405–1412. Language<br />

Resources and Evaluation - LREC 2000, Athens (2000)<br />

13. Vossen, P. (ed.): EuroWordNet General Document. University of Amsterdam (2005)<br />

14. Braasch, A., Pedersen, B.S.: Recent Work in the Danish Computational Lexicon Project<br />

"STO". In: EURALEX Proceedings 2002, Center for Sprogteknologi. Copenhagen (2002)


Building Croatian WordNet<br />

Ida Raffaelli 1 , Marko Tadić 1 , Božo Bekavac 1 , and Željko Agić 2<br />

1 Department of Linguistics<br />

2 Department of Information Sciences<br />

Faculty of Humanities and Social Sciences<br />

University of Zagreb, Ivana Lučića 3, Zagreb, Croatia<br />

{ida.raffelli, marko.tadic, bbekavac, zeljko.agic}@ffzg.hr<br />

Abstract. This paper reports on the prototype Croatian WordNet (CroWN). The<br />

resource has been collected by translating BCS1 and 2 from English, but also<br />

by using a machine-readable dictionary of Croatian for the automatic<br />
extraction of semantic relations and their inclusion into CroWN.<br />

The paper presents the results obtained, discusses some problems encountered<br />

along the way and points out some possibilities for the automated acquisition and<br />
population of synsets and their refinement in the future. In the second part the<br />

paper discusses the lexical particularities of Croatian, which are also shared<br />

with other Slavic languages (verbal aspect and derivation patterns), and<br />

points out the possible problems during the process of their inclusion in<br />

CroWN.<br />

Keywords: WordNet, Croatian language, lexical semantics.<br />

1 Introduction<br />

WordNet has become one of the most valuable resources for any language for which<br />
language technologies are being built. One could say that, given the<br />
state of the art in LT, a WordNet for a particular language can be considered one<br />

of the basic lexical resources for that language. Semantically organized lexicons like<br />

WordNets can have a number of applications such as semantic tagging, word-sense<br />

disambiguation, information extraction, information retrieval, document classification<br />

and retrieval, etc. At the same time, a carefully designed and created WordNet represents<br />
one possible model of the lexical system of a language, and this purely<br />
linguistic value is sometimes neglected or forgotten.<br />

Following, but also widening the original Princeton design of WordNet for English<br />

[7], since EuroWordNet [18] a multilingual approach to building WordNets has taken<br />
hold, resulting in a number of coordinated efforts for more than one language,<br />
such as BalkaNet [17] and MultiWordNet [9]. A comprehensive list of WordNet building<br />
initiatives is available at the Global WordNet Association website 1 .<br />

In spite of efforts to coordinate building of WordNets for Central European<br />

languages (Polish, Slovak, Slovenian, Croatian, Hungarian) since the 2nd <strong>GWC</strong> in Brno<br />

1<br />

http://www.globalwordnet.org/gwa/wordnet_table.htm.


350 Ida Raffaelli, Marko Tadić, Božo Bekavac, and Željko Agić<br />

in 2004, the building of WordNets for these particular languages has proceeded separately<br />
by the respective national teams. The Croatian WordNet (CroWN from now on) is being<br />

built at the Institute of Linguistics, Faculty of Humanities and Social Sciences at the<br />

University of Zagreb. This paper represents the first report on the work-in-progress<br />

and the results it presents are entirely preliminary.<br />

The second section of the paper deals with the method of creating CroWN,<br />

and the dictionaries and corpora used. The third section discusses some particularities of<br />
the Croatian lexical system that have been observed and which have to be taken into<br />
consideration while building CroWN. The paper ends with future plans and<br />

concluding remarks.<br />

2 The Process of Building<br />

2.1 Method<br />

To build a WordNet for a language there are two methods to choose from: 1) the expand<br />
model [19], which in essence takes a source WordNet (usually PWN), translates<br />
a selected set of synsets into the target language and later expands it with the language's own<br />
lexical semantic additions; and 2) the merge model [19], where separate<br />
(sub-)WordNets are built for specific domains and later merged into a single<br />
WordNet. Both approaches have pros and cons: the former is simpler and less<br />
time- and labor-consuming (i.e. also financially), while the latter is usually<br />
quite the opposite. On the other hand, the results of the former approach are WordNets<br />
that are, at the upper hierarchy levels, to a large extent isomorphous with the source<br />
WordNet, thus possibly deviating from the real lexical structure of the language. This<br />
can be noted particularly in the case of typologically different languages, where the<br />
number of discrepancies starts to grow. The latter approach reflects the lexical<br />
semantic structure more realistically, but it can be hard to connect the resulting resource<br />
with other WordNets and to make it usable for multilingual applications.<br />

Having no semantically organized lexicon for Croatian except [13], which<br />
exists only on paper, for the initial stages of building CroWN we were forced to use<br />
the existing monolingual Croatian lexicons which we had in digital form, i.e. [1]. Also,<br />
having very limited human and financial resources, we were forced to opt for the<br />
expand model, but we wanted to keep in mind all the time that it should not be reduced<br />
to a mere “copy, paste and translate” operation and that one should always take care<br />
about the differences between lexical systems. The expand model has been successfully<br />
used in a number of multilingual WordNet projects, so we believed that this direction<br />
could not be wrong if we also include thorough manual checking.<br />

Up until now our top-down approach has been limited to the translation of BCS1, 2<br />

and 3 from BalkaNet and additional data collection from the dictionary and corpora. The<br />
more specialized and more language-specific concepts will be added in further phases<br />
of creating CroWN. Table 1 shows basic statistics of POS in BCS1 and BCS2 of<br />
CroWN. BCS3 is not included since it has not been completely adapted.


Building Croatian WordNet 351<br />

Table 1. Basic statistics on POS in BCS1 and BCS2 of CroWN.<br />

           | BCS1 | BCS2 | Total<br />
Nouns      |  965 | 2245 |  3210<br />
Verbs      |  254 | 1188 |  1442<br />
Adjectives |    0 |   36 |    36<br />
Total      | 1219 | 3469 |  4688<br />

2.2 Dictionary and Its Processing<br />

The only dictionary resource we had available in machine-readable form, and thus usable<br />
for populating CroWN, was [1]. The printed and CD-ROM editions of the dictionary<br />
contain approximately 70,000 dictionary entries. The right-hand side of the lexicographic<br />
articles was divided into several subsections: part-of-speech and other grammatical<br />
information, domain abbreviations (e.g. anat. for anatomy), a number of entry<br />
definitions (containing various examples and synonyms), syntagms and phraseology,<br />
etymology and onomastics. Each of the subsections was labeled in the original dictionary<br />
using a special symbol, making the dictionary easily processable. After extracting the<br />
dictionary data and resolving some technical issues, we were left with 69,279 entries<br />
as candidates for the first phase of CroWN population. At this step, we omitted<br />
grammatical and lexicographic category information, phraseology, etymology and<br />
onomastics from the articles, but this information can easily be added later. In Figure 1<br />

both original and simplified dictionary entries are shown:<br />

pòstanak m<br />
1. pojava, pojavljivanje, nastanak čega<br />
2. prvi trenutak u razvoju čega; postanje<br />
∆ Knjiga ~ka prva biblijska knjiga, govori o postanku svijeta<br />

postanak<br />
postanak DEF pojava, pojavljivanje, nastanak čega<br />
postanak DEF prvi trenutak u razvoju čega; postanje<br />
postanak SINT Knjiga ~ka prva biblijska knjiga, govori o postanku svijeta<br />

Fig. 1. Original and reduced dictionary entry.<br />

Each processed lexicographic element in the reduced dictionary entry was tagged with<br />
the corresponding tag for definition, example and syntagm. Each headword was<br />
repeated before the DEF and SINT tags, indicating that the definition and syntagm<br />
sections are linked to the entry. This redundant form was easily processed with<br />
regular patterns (local grammars) using the NooJ environment [11]. The starting 69,279<br />
entries now contained 88,352 different definition tags and 7,788 syntagm tags. 2<br />

2<br />

Note that the overall number of definitions is even bigger since we omitted as redundant the<br />

tags in single-line entries, i.e. those entries that contain only the headword and its right-side<br />

definition – their processing is trivial.
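The redundant headword-per-line format makes grouping trivial; the following sketch parses a reduced entry like the one in Fig. 1. It is a toy stand-in for the NooJ local grammars actually used, assuming exactly the simplified 'headword TAG text' line shape.

```python
def parse_reduced_entry(lines):
    """Collect the DEF and SINT lines of one reduced dictionary entry."""
    entry = {"headword": None, "definitions": [], "syntagms": []}
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 1:                       # the bare headword line
            entry["headword"] = parts[0]
        elif len(parts) == 3 and parts[1] == "DEF":
            entry["definitions"].append(parts[2])
        elif len(parts) == 3 and parts[1] == "SINT":
            entry["syntagms"].append(parts[2])
    return entry

# The reduced entry from Fig. 1:
sample = [
    "postanak",
    "postanak DEF pojava, pojavljivanje, nastanak čega",
    "postanak DEF prvi trenutak u razvoju čega; postanje",
    "postanak SINT Knjiga ~ka prva biblijska knjiga, govori o postanku svijeta",
]
entry = parse_reduced_entry(sample)
```

The result groups the two definitions and one syntagm under the headword postanak.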



In this first extraction step we aimed at two things: 1) automatic linking of<br />

headwords to their definitions; and 2) creation of a set of well-defined lexical patterns<br />

which will be used to acquire additional knowledge from entries using information<br />

available in the definitions and syntagms sections. We chose definitions and syntagms<br />
over all other lexicographic elements as definitions are more likely to contain<br />
well-formed word links than phraseology: e.g. the entry crn (en. black) has seven<br />
definitions in the dictionary and all of them start with koji je (en. which is, that<br />
is), providing a constant data extraction pattern. The same procedure is applicable to<br />

syntagms – crni humor (en. black humor), crna lista (en. black list), etc.<br />

In dictionary filtering and pattern design, it was our intention to create a correct and<br />

reliable set of WordNet entries containing basic information – their nearest hypo- and<br />

hyperonym classes, basic definitions and possible links to other entries.<br />

In the preliminary test, which was used to determine whether the pattern method is<br />

feasible or not, we defined several lexical patterns and, using NooJ, tested them on our<br />

tagged and filtered dictionary. The simple patterns were defined in order to separate<br />

animate and inanimate nouns and also to try and link these nouns to other entry types<br />

similar in meaning. Some results are given in Table 2.<br />

Table 2. Filtering definitions using lexical patterns.<br />

Pattern | Extracted | Examples<br />
onaj koji (en. the one who) | 2138 | brojač PATTERN broji (en. counter PATTERN counts); psiholog PATTERN se bavi psihologijom (en. psychologist PATTERN does psychology)<br />
osoba koja (en. the person that) | 90 | korisnik PATTERN se koristi računalom (en. user PATTERN uses a computer)<br />
osobina onoga koji (je) (en. property of one who (is)); odlika onoga koji (je) (en. quality of one who (is)) | 170 | aktivnost PATTERN aktivan (en. activity PATTERN active); budnost PATTERN budan (en. awakeness PATTERN awake)<br />
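The pattern filtering in Table 2 amounts to anchored matching over definition strings; a regex sketch follows. The class labels ('person', 'property') are our own illustrative names, and the patterns cover only the variants shown above, not the full NooJ local grammars.

```python
import re

# Each definition-initial pattern indicates a semantic class of the headword.
PATTERNS = [
    (re.compile(r"^(?:onaj koji|osoba koja)\s+(.*)"), "person"),
    (re.compile(r"^(?:osobina|odlika) onoga koji(?: je)?\s+(.*)"), "property"),
]

def classify_definition(definition):
    """Return (class, remainder) for the first matching pattern, else None."""
    for pattern, label in PATTERNS:
        m = pattern.match(definition)
        if m:
            return label, m.group(1)
    return None
```

For instance, the definition of brojač (counter) matches the person pattern, while a definition that starts with neither pattern is left unclassified.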

We can draw several conclusions from results of the type given in this table. The first<br />
one is that the pattern itself, if well-defined, can provide us with insight into the<br />
resulting entries; for example, onaj koji (en. the one who) clearly indicates a person,<br />
while osobina onoga koji (en. a quality of one who) indicates a property of an entity.<br />
Furthermore, although the [1] dictionary was written using a fairly controlled<br />
language subset, our patterns should still undergo parallel expansions in order to<br />
handle the language variety that occurs in definitions (in Table 2: property, quality could<br />

be expanded with feature, attribute, etc.). Patterns should also be tuned with regard to<br />

article tokens occurring on their right-hand sides; some of them could capture related nouns<br />

(psychologist – psychology) while others could link nouns to adjectives (awakeness –<br />

awake). Another possible enhancement to these patterns could be token sensitivity; if<br />

the dictionary were to be preprocessed with a PoS/MSD tagger or a morphological<br />

lexicon [16], pattern surroundings could be inspected and tokens collected with regard<br />
to their MSD and other properties (e.g. obligatory number, case and gender agreement<br />
in attribute constructions). Given these facts, we arrive at a hypothesis to<br />

test: if carefully designed and paired with large, reliable dictionaries and MSD


tagging, pattern detection using local grammars could prove a good method for<br />
semi-automated construction of CroWN. Therefore, future dictionary processing and data<br />
acquisition tasks will include enhancing all processing stages in order to collect even<br />
more definitions and syntagms that were left behind in this first attempt at automatic<br />
CroWN population.<br />

2.3 Corpora<br />

We were aware that harvesting the semantic relations encoded in the existing<br />
machine-readable dictionary would still not be sufficient for building as exhaustive a semantic<br />
net as WordNet should be. Therefore we also turned our attention to Croatian corpora<br />
and text collections in order to detect more examples and validate the existing ones.<br />
As the treatment of compound words in WordNet became more important from<br />
version 1.6 onwards, and since we had developed a system for detecting, collecting and<br />
processing compound words (i.e. syntagms) [5], we decided to include them in<br />
CroWN right after completing the translation of BCS1-3. An overview of compound<br />
words in WordNet and their treatment is given in [10], so we will not go into<br />

details here.<br />

When building an ontology from scratch it is very useful to have a huge source<br />
of potential candidates for ontology population. For this task we used a downloaded<br />
Croatian edition of Wikipedia (http://hr.wikipedia.org), which at that time comprised<br />
30,985 articles. For the identification of distinctive compounds we extracted all explicitly<br />
tagged Wikipedia links that unambiguously point to a concept worded with<br />
at least two lower-case words. An example can be seen in Figure 2.<br />

Fig. 2. Example of targeted compound from Wikipedia (circled text ekumenske teologije).<br />

Definition of internal compound structures serves as filter for elimination of<br />

unwanted candidates. Examples of such patterns are combinations of MSDs like<br />

Adjective + Noun: e.g. plava zastava (en. blue flag); Noun + Noun-in-Genitive: e.g.<br />

djeca branitelja (en. children of defenders); Noun+Preposition+Noun-in-case-


354 Ida Raffaelli, Marko Tadić, Božo Bekavac, and Željko Agić<br />

governed-by-preposition: e.g. hokej na ledu (en. litteraly hockey on ice) etc. The<br />

compound dictionary collected in this way has also been included in lexical pattern<br />

processing of dictionary text described in the previous section.<br />
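The MSD-based filter can be sketched as membership in a whitelist of internal structures. The tag names below are coarse placeholders, not the actual MSD codes used for Croatian:

```python
# Illustrative whitelist of internal compound structures; a real system would
# use full MSD codes for Croatian rather than these coarse placeholder tags.
ALLOWED_PATTERNS = {
    ("Adj", "Noun"),                    # plava zastava
    ("Noun", "Noun-Gen"),               # djeca branitelja
    ("Noun", "Prep", "Noun-PrepCase"),  # hokej na ledu
}

def keep_candidate(msd_tags):
    """Keep a compound candidate only if its MSD tag sequence is allowed."""
    return tuple(msd_tags) in ALLOWED_PATTERNS

print(keep_candidate(("Adj", "Noun")))   # True
print(keep_candidate(("Verb", "Noun")))  # False
```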

Since we are still in the process of collecting and processing basic resources to<br />
create CroWN, we have not yet used the Croatian National Corpus (HNK) for collecting<br />
literals. However, it will be used for corpus evidencing and validation of the<br />
literals within the synsets used in CroWN.<br />

Of course, the last step before the inclusion of new items in CroWN is always<br />
human checking and postprocessing of the retrieved candidates, where the final judgment<br />
about their inclusion and position in CroWN takes place.<br />

3 Particularities of Croatian<br />

In this part of the paper we would like to discuss some underlying problems that we<br />

have detected while we were examining the structure of the Croatian lexical system<br />

which could, we believe, be relevant for building WordNets of other languages.<br />

Besides the necessity of being compatible with other WordNets, CroWN should<br />
preserve and maintain the language specificity of the Croatian lexical system in order to be a<br />
computational lexical database which reflects all the semantic specifics of lexical<br />
structures in Croatian. These specifics will become especially relevant<br />
in the construction of synsets at deeper hierarchical levels.<br />

Besides linking synsets with basic relations such as (near)synonymy, hypo/hypernymy,<br />
antonymy and meronymy, some morphosemantic phenomena typical not only of<br />
Croatian but also of other Slavic languages should be taken into consideration and<br />
integrated in the construction of synsets and the linking of lexical entries within a synset.<br />

The two most problematic language-specific phenomena of Croatian (which are<br />
shared with other Slavic languages) that inevitably have an impact on creating<br />
CroWN are: 1) verbal aspect and 2) derivation. Although these phenomena are<br />
traditionally considered morphological processes, their impact on the semantic<br />
structure of a lexical unit should not be neglected when labeling lexical entries in<br />
CroWN. Moreover, as we will try to show, both of these morphological processes<br />
exhibit some regularity of pattern in Croatian derivation which could be exploited for<br />

automatic labeling of lexical entries. Regular derivational patterns characteristic of<br />
each morphological category should not be considered without close examination of<br />
their role in changing the semantic structure of a given lexical entry in CroWN.<br />
In other words, the regularity of morphosemantic or derivational patterns could be useful<br />
for automatic labeling of senses in CroWN, but at the same time there are many<br />
cases in the lexical system where one of these patterns has considerably motivated<br />
a change of meaning away from the basic lexical item.<br />

3.1 Verbal Aspect<br />

In one of the most recent Croatian grammars [12], aspect is defined as an instrument for<br />
expressing the difference between an ongoing action (imperfective aspect) and an action<br />
that has already been finished (perfective aspect).<br />

Building Croatian WordNet 355<br />

The category of aspect enables the division of verbs in Croatian into perfective and imperfective verbs, which stand<br />

in binary opposition. Perfective verbs can be derived from imperfective verbs<br />
and, vice versa, imperfective verbs can be derived from perfective verbs.<br />

Traditionally, aspectual verb pairs are treated as separate lexical entries, and in<br />
lexica they are sometimes listed as separate headwords and sometimes under the same<br />
headword (usually the imperfective one). Both practices can exist in parallel in the same<br />
dictionary. Some of the most prominent derivational patterns in the formation of both<br />
perfective and imperfective verbs are the following:<br />

1) Perfective verbs can be formed from imperfective verbs by substitution of the<br />
suffix of the verbal stem of an imperfective verb. The perfective verb baciti (en.<br />
to throw), for example, is formed by substituting the suffix -a of the verbal stem of the<br />
imperfective bacati (en. to throw) with the suffix -i. Similar substitution patterns cover<br />
other suffixes.<br />

2) Perfective verbs can be formed by adding a prefix (e.g. pre-, na-, u-, pri-,<br />
do-, od-, pro-, etc.) to the verbal stem of an imperfective verb. Many perfective verbs<br />
are formed this way: gledati (en. to look) – pregledati (en. to look over, to examine),<br />
hodati (en. to walk) – prehodati (en. to walk a distance; often used in a metaphorical<br />
sense, e.g. to walk off a flu), pisati (en. to write) – prepisati (en. to copy in<br />
writing) and many others.<br />

As can be observed from the previous examples, adding the prefix pre- to the<br />
verbal stem of an imperfective verb enables the formation of the perfective verb by a<br />
regular and frequent derivational pattern, but it also triggers non-negligible<br />
changes in the semantic structure of the basic verbal meaning. If we take the example<br />
of the aspect pair pisati (en. to write) – prepisati (en. to copy in writing), the semantic<br />
change of the perfective verb prepisati is quite significant with respect to the<br />
imperfective verb pisati. However, there is another derivational pattern for the<br />
formation of a perfective verb from the imperfective pisati: it is possible to add the<br />
prefix na- to the same verbal stem. The aspect pair pisati – napisati does not exhibit a<br />
significant semantic shift of the derived verb toward a new meaning as in the previous<br />
case. Rather, this derivational pattern introduces only the distinction<br />
between an ongoing and an already finished action.<br />

The aspect pair gledati (en. to look) – pregledati (en. to examine) exhibits the same<br />
pattern, pointing again to a significant semantic shift of the perfective verb,<br />
whereas the aspect pair gledati – pogledati (the perfective verb formed by adding the<br />
prefix po-) is related exclusively with respect to the differentiation of the type of<br />
action.<br />

3) The most prominent pattern for the formation of imperfective verbs from<br />
perfective ones is the substitution of the suffixes of the verbal stem with derivational<br />
morphemes such as -a-, -ava-, and -iva-, as in the examples preporuč-i-ti › preporuč-a-ti,<br />
prouč-i-ti › prouč-ava-ti and uključ-i-ti › uključ-iva-ti.<br />

It is necessary to point out that this kind of formation pattern does not trigger<br />
significant semantic changes in the formed (imperfective) verb. The aspect pair stands in<br />
binary opposition only with respect to the type of action (perfective or imperfective)<br />
it refers to.<br />

Basically, in the Croatian grammars [3] and [12], verbs which differ with respect<br />
to the type of action are considered aspect pairs. However, aspect pairs can<br />
also differ with respect to the nature of the action or the way the action is



effected. This way of differentiating verbs which form an aspect pair is highly<br />

semantically motivated and should be taken into consideration when placing the<br />

lexical entries within a synset. For example, the aspect pairs kopati (en. to dig) –<br />
otkopati (en. to dig up), kopati – zakopati and kopati – pokopati differ semantically<br />
primarily with respect to the nature of the action. In the first aspect pair<br />
the perfective verb expresses the beginning of the action (inchoative meaning); the two<br />
other pairs express the end of the action (finitive meaning). It should also be pointed<br />
out that the perfective verbs zakopati and pokopati do not have the same meaning. The<br />
verb pokopati means “to bury”, whereas zakopati can mean “to bury” but also “to<br />
cover with something”.<br />

Grammar [12] distinguishes 11 different meanings of aspect pairs with respect<br />
to the nature of the action, and it is clear that this task will not be simple or<br />
problem-free. The main issue is whether we can differentiate between these subtle senses using<br />
automatic techniques instead of tedious manual validation against corpora.<br />
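The substitution and prefixation patterns described in 1)–3) lend themselves to a simple generate-and-filter approach. The following sketch is illustrative only: the pattern inventory is incomplete, and each generated candidate would still require validation against corpus evidence, as discussed above:

```python
# Sketch of generating aspect-pair candidates from the patterns in 1)-3).
# The prefix list is illustrative and far from exhaustive.
PERFECTIVIZING_PREFIXES = ["pre", "na", "u", "pri", "do", "od", "pro", "po", "za", "ot"]

def perfective_candidates(imperfective):
    """Generate perfective candidates for an imperfective infinitive (in -ti)."""
    candidates = []
    # Pattern 1: suffix substitution, e.g. bac-a-ti -> bac-i-ti
    if imperfective.endswith("ati"):
        candidates.append(imperfective[:-3] + "iti")
    # Pattern 2: prefixation, e.g. pisati -> napisati, prepisati
    candidates.extend(p + imperfective for p in PERFECTIVIZING_PREFIXES)
    return candidates

def imperfective_candidates(perfective):
    """Pattern 3: replace the stem suffix -i- with -a-, -ava-, -iva-."""
    if not perfective.endswith("iti"):
        return []
    stem = perfective[:-3]
    return [stem + suffix + "ti" for suffix in ("a", "ava", "iva")]

print(perfective_candidates("bacati")[0])   # baciti
print(imperfective_candidates("proučiti"))  # ['proučati', 'proučavati', 'proučivati']
```

Overgeneration is intentional here: of the three candidates for proučiti, only proučavati is attested, and it is exactly this filtering step that requires corpus evidence or manual judgment.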

3.2 Derivation<br />

As Pala and Hlaváčková point out in [8], derivational relations in highly inflectional<br />
languages represent a system of semantic relations that definitely reflects cognitive<br />
structures which may be related to language ontology. Derivational processes are deeply<br />
integrated in the language knowledge of every speaker and represent a system which is<br />
morphologically and semantically highly structured. Therefore, as stressed in [8],<br />
derivational processes cannot be neglected in building the Czech WordNet, or the<br />
WordNet of any other Slavic language.<br />

As already mentioned, derivations in Slavic languages are highly regular and<br />
suitable for automatic processing. In [8], 14 (+2) derivational patterns were<br />
adopted as a starting point for the organization of the so-called derivational nests of<br />
the Czech WordNet. The authors are aware of the main problem concerning derivational<br />
patterns and relations: although there is a significant number of cases where<br />
affixes preserve their meaning, in Czech as well as in Croatian, it should be taken into<br />
consideration that there are also many cases where affixes do not preserve their<br />
prototypical meaning and become semantically opaque. This certainly poses a<br />
problem for the automatic processing of derivational patterns and relations.<br />

If we consider prefixation as one of the possible derivational processes in Croatian, as<br />
in Czech [8], it is beyond doubt that prefixes denote different relations<br />
such as time, place, course of action, and other circumstances of the main action.<br />
There are many cases where a prefix preserves its prototypical meaning, often<br />
related to its prototypical meaning as a preposition, since most prefixes developed from<br />
prepositions. For example, the Croatian prefix na- developed from the<br />
preposition na, with a prototypical meaning referring to the process of directing an<br />
object X onto the surface of an object Y. There are many verbs in Croatian formed with<br />
the prefix na- where the prefix has preserved this meaning: baciti (en. to throw) –<br />
nabaciti (en. to throw on sth./smb.), lijepiti (en. to stick) – nalijepiti (en. to stick sth. on<br />
sth.), skočiti (en. to jump) – naskočiti (en. to jump on sth.).<br />
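Where a prefix preserves its prototypical meaning, a prefixed verb can tentatively be linked back to its base verb by prefix stripping against a lexicon. A rough sketch, with a tiny illustrative lexicon:

```python
# Illustrative only: a real system would strip prefixes against a full lexicon.
PREFIXES = ["pre", "na", "po", "za", "ot", "u", "pri", "do", "od", "pro"]
LEXICON = {"baciti", "lijepiti", "skočiti", "gledati", "pisati", "kopati"}

def base_verb(prefixed):
    """Return the base verb obtained by stripping a known prefix, if the
    remainder is attested in the lexicon; otherwise None."""
    for p in sorted(PREFIXES, key=len, reverse=True):  # try longest prefixes first
        if prefixed.startswith(p) and prefixed[len(p):] in LEXICON:
            return prefixed[len(p):]
    return None

print(base_verb("nabaciti"))    # baciti
print(base_verb("pregledati"))  # gledati
```

Because prefixes such as na- are polysemous, a successful stripping only proposes a derivational link; whether the pair is a pure aspect pair or involves a genuine sense shift (as with napustiti, discussed below) still requires manual or corpus-based judgment.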

Unfortunately, this is not the only meaning of the prefix na- in Croatian. It also<br />
serves for the derivation of a large number of verbs meaning to do sth. to a large extent.



For example: krasti (en. to steal) – nakrasti (en. to steal heavily), kuhati (en. to cook)<br />
– nakuhati (en. to cook a lot of food, or to cook for a long time). In [2] three<br />
more meanings of the prefix na- are listed, and they should be integrated in any kind of<br />
automatic processing of prefixation in CroWN. In our opinion, though, the greatest<br />
problem would be posed by cases where the same verb, as a result of prefixation,<br />
shifts its meaning towards a completely new domain while still preserving some of the<br />
possible meanings of the prefix.<br />

Such an example is the verb napustiti. The verb pustiti means to drop, to let<br />
go/loose, while napustiti has two semantic cores, or two basic meanings. One is related<br />
to the first meaning of the prefix na- (to put X on the surface of Y), namely to drop X<br />
on the surface of Y. The other meaning is related to another possible meaning of na-,<br />
to lead to a result; so napustiti can also mean to abandon, to quit, to give up. The<br />
connection between the two semantic cores is hard to grasp for an average speaker of<br />
Croatian, but it can be explained with respect to the different meanings of the prefix<br />
na-. In CroWN the verb napustiti should be linked to the verb pustiti and its<br />
(near)synonyms, as well as to verbs such as ostaviti and odustati, which are both<br />
(near)synonyms of napustiti. Which co-textual patterns will be detected in the corpus,<br />
and whether there will be any explicit means of unambiguously differentiating between<br />
these senses, remains to be seen.<br />

As the previous examples show, derivational patterns such as suffixation<br />
and prefixation cannot be considered formal processes using affixes with a simple<br />
and unique semantic value. Moreover, in highly grammatically motivated languages<br />
such as Croatian, as in any other Slavic language, suffixation and prefixation<br />
should not be regarded as grammatical processes which always result in the same<br />
transparent and regular semantic changes of the basic lexical item. In many cases<br />
the affixes used in derivational patterns lose their prototypical meaning, enabling<br />
significant changes in the semantic structure of the basic lexical item and thus influencing<br />
the organisation of highly structured morphosemantic relations.<br />

4 Future Plans and Concluding Remarks<br />

Since we are at the very beginning of creating CroWN, this section could be expected to be<br />
quite extensive. In order to keep things moderate, we will list only the most immediate<br />
plans for developing CroWN.<br />

The first step would be the digitalization of the dictionary [13] and its preprocessing<br />
for later use. Being a lexicographically well-formed dictionary of synonyms in<br />
Croatian, this resource would provide us with a huge amount of reliable data for direct<br />
CroWN synset acquisition and refinement.<br />

The next step is refining and elaborating patterns for the extraction of semantic<br />
relations from dictionaries and corpora. This includes not only more complex<br />
lexical patterns but also additional dictionaries and corpora, both mono- and<br />
multilingual, such as the Croatian-English Parallel Corpus [14].<br />

Particularly important for quality checking of CroWN will be verifying the<br />
frequency data of literals and their meanings against the Croatian reference corpus, namely<br />
the Croatian National Corpus [15].



We expect to gain some insight also from checking correspondence with WordNets<br />

of genetically close languages (Slovenian, Serbian) [6,17] as well as culturally close<br />

languages (Slovenian, Czech, Hungarian, German, Italian), particularly at the level of<br />

culturally motivated concepts.<br />

In this paper we have presented the first steps in creating the Croatian WordNet, which<br />
consist of translating BCS1, 2 and 3 from English into Croatian. We have also<br />
described procedures for additional synset population from a machine-readable<br />
monolingual Croatian dictionary using lexical patterns and regular expressions. A<br />
similar procedure has been applied to collecting compound words from a<br />
semistructured corpus of Croatian Wikipedia articles. Particularities of Croatian and<br />
possible problematic issues for defining synset structures are discussed at the<br />
end of the paper, in the hope that solving them will lead to a more thorough and<br />
precise semantic network of the Croatian language.<br />

Acknowledgments<br />

This work has been supported by the Ministry of Science, Education and Sports,<br />

Republic of Croatia, under the grants No. 130-1300646-0645, 130-1300646-1002,<br />

130-1300646-1776 and 036-1300646-1986.<br />

References<br />

1. Anić, V.: Veliki rječnik hrvatskoga jezika. Novi liber, Zagreb (2003)<br />

2. Babić, S.: Tvorba riječi u hrvatskome književnome jeziku. Croatian Academy of Sciences<br />

and Arts-Globus, Zagreb (2002)<br />

3. Barić, E., Lončarić, M., Malić, D., Pavešić, S., Peti, M., Zečević, V., Znika, M.: Priručna<br />

gramatika hrvatskoga književnog jezika. Školska knjiga, Zagreb (1979)<br />

4. Bekavac, B., Šojat, K., Tadić, M.: Zašto nam treba Hrvatski WordNet? In: Granić, J. (ed.)<br />

Semantika prirodnog jezika i metajezik semantike: Proceedings of annual conference of<br />

Croatian Applied Linguistics Society, pp. 733–743. CALS, Zagreb-Split (2004)<br />

5. Bekavac, B., Vučković, K., Tadić, M.: Croatian resources for NooJ (in press)<br />

6. Erjavec, T., Fišer, D.: Building Slovene WordNet. In: Proceedings of the 5th LREC (CD).<br />

Genoa (2006)<br />

7. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA<br />

(1998)<br />

8. Pala, K., Hlaváčková, D.: Derivational Relations in Czech WordNet. In: Proceedings of the<br />

Workshop on Balto-Slavonic Natural Language Processing 2007, pp. 75–81. ACL, Prague<br />

(2007)<br />

9. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual<br />

database. In: Proceedings of the First Global WordNet Conference, pp. 293–302. Mysore,<br />

India (2002)<br />

10. Sharada, B.A., Girish, P.M.: WordNet Has No ‘Recycle Bin’. In: Proceedings of the Second<br />

Global WordNet Conference, pp. 311–319. Brno, Czech Republic (2004)<br />

11. Silberztein, M.: NooJ Manual (2006), http://www.nooj4nlp.net<br />

12. Silić, J., Pranjković, I.: Gramatika hrvatskoga jezika. Školska knjiga, Zagreb (2005)



13. Šarić, L., Wittschen, W.: Rječnik sinonima. Neretva-Universitätsverlag Aschenbeck und<br />

Isensee (2003)<br />

14. Tadić, M.: Building Croatian-English Parallel Corpus. In: Proceedings of the 2nd LREC, pp.<br />

523–530. Athens (2000)<br />

15. Tadić, M.: Building Croatian National Corpus. In: Proceedings of the 4th LREC, pp. 441–<br />

446. Las Palmas (2002)<br />

16. Tadić, M.: Croatian Lemmatization Server. In: Vulchanova, M., Koeva, S. (eds.)<br />

Proceedings of the 5th Formal Approaches to South Slavic and Balkan Languages<br />

Conference, pp. 140–146. Bulgarian Academy of Sciences, Sofia (2006)<br />

17. Tufiş, D. (ed.): Special Issue on the BalkaNet Project. J. Romanian Journal of Information<br />

Science and Technology 7 (1–2), 1–248 (2004)<br />

18. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks.<br />

Kluwer Academic Publishers, Dordrecht (1998)<br />

19. Vossen, P. (ed.): EuroWordNet: General Document, Final, Version 3. University of<br />
Amsterdam (2002), http://www.illc.uva.nl/EuroWordNet/docs/GeneralDocPS.zip


Towards Automatic Evaluation of WordNet Synsets<br />

J. Ramanand and Pushpak Bhattacharyya<br />

Department of Computer Science and Engineering,<br />

Indian Institute of Technology, Bombay<br />

{ramanand@it, pb@cse}.iitb.ac.in<br />

Abstract. Increasing and varied applications of WordNets call for the creation<br />

of methods to evaluate their quality. However, no such comprehensive methods<br />

to rate and compare WordNets exist. We begin our search for WordNet<br />

evaluation strategies by attempting to validate synsets. As synonymy forms the<br />

basis of synsets, we present an algorithm based on dictionary definitions to<br />

verify that the words present in a synset are indeed synonymous. Of specific<br />

interest are synsets in which some members “do not belong”. Our work, thus, is<br />

an attempt to flag human lexicographers’ errors by accumulating evidence<br />

from myriad lexical sources.<br />

1 Introduction<br />

Lexico-semantic networks such as the Princeton WordNet ([1]) are now considered<br />

vital resources for several applications in Natural Language Processing and Text<br />
Mining. WordNets are being constructed in different languages, as seen in the<br />
EuroWordNet ([2]) project and the Hindi WordNet ([3]). Competing lexical networks,<br />
such as ConceptNet ([4]), HowNet ([5]), MindNet ([6]), VerbNet ([7]), and FrameNet<br />

([8]), are also emerging as alternatives to WordNets. Naturally, users would be<br />

interested in knowing not only the relative merits from among a selection of choices,<br />

but also the intrinsic value of such resources. Currently, there are no measures of<br />

quality to evaluate or differentiate these resources.<br />

A study of lexical networks could involve understanding the size and coverage,<br />

domain applicability, and content veracity of the resource. This is especially critical in<br />
cases where WordNets are created by automated means, for instance to leverage<br />
existing content in related languages, in contrast to the slower manual process of<br />
WordNet creation, which has been the traditional method.<br />

The motivation for evaluating WordNets is to help answer questions such as the<br />

following:<br />

1. How to select one lexico-semantic network over another?<br />

2. Is a given WordNet sound and complete?<br />

3. Is this resource usable, scalable, and deployable?<br />

4. Is this WordNet suitable for a particular domain or application?



A theory of evaluation must address the following issues:<br />

1. Establishing criteria to measure intrinsic quality of the content held in these lexical<br />

networks.<br />

2. Establishing criteria to make useful comparisons between different lexico-semantic<br />

networks.<br />

3. Methods to check if a network's quality has improved or declined after content<br />

updates.<br />

4. Establishing criteria to assess the quality of content in the synsets and of the relationships between synsets.<br />

This paper is organized as follows: In Section 2, we briefly survey work related to<br />

the area of ontology evaluation. This is followed in Section 3 by an introduction to the<br />

novel problem of validating synonyms in a synset. In Section 4, we describe our<br />

dictionary-based algorithm in detail. We discuss the experimental setup and results in<br />

Section 5. Finally, in Section 6, we present the key conclusions from our work.<br />

2 Related Work<br />

2.1 Evaluations of Lexico-Semantic Networks<br />

Our literature survey revealed that, to the best of our knowledge, there have been no<br />

comprehensive efforts to evaluate WordNets or other lexico-semantic networks on<br />

general principles. [9] describes a statistical survey of WordNet v1.1.7 to study types<br />

of nodes, dimensional distribution, branching factor, depth and height. A syntactic<br />

check and usability study of the BalkaNet resource (WordNets in Eastern European<br />

languages) has been described in [10]. The creators of the common-sense knowledge<br />

base ConceptNet carried out an evaluation of their resource based on a statistical<br />

survey and human evaluation. Their results are described in [4]. [11] discuss<br />

evaluations of knowledge resources in the context of a Word Sense Disambiguation<br />

task. [12] apply this in a multi-lingual context. Apart from these, we are not aware of<br />

any other major evaluations of any lexico-semantic networks.<br />

2.2 Evaluations of Ontologies<br />

In the related field of ontologies, several evaluation efforts have been described. As<br />

lexical networks can be viewed as common-sense ontologies, a study of ontology<br />

evaluations may be useful. [13] describes an attempt at creating a formal model of an<br />

ontology with respect to specifying a given vocabulary's intended meaning. The paper<br />

provides an interesting theoretical basis for evaluations. [14] provides a classification<br />

of ontology content evaluation strategies and also provides an additional perspective<br />

on evaluation based on the “level” of appraisal (such as the lexical, syntactic, data,<br />

design levels). [15] describes some metrics which have been used in the context of<br />

ontology evaluation.



Some ontology evaluation systems have been developed and are in use. One of<br />

these is OntoMetric ([16]), a method that helps users pick an ontology for a new<br />

system. It presents a set of processes that the user should carry out to obtain the<br />

measures of suitability of existing ontologies, regarding the requirements of a<br />

particular system. The OntoClean ([17]) methodology is based on philosophical<br />

notions for a formal evaluation of taxonomical structures. It focuses on the cleaning of<br />

taxonomies. [18] describes a task-based evaluation scheme to examine ontologies<br />

with respect to three basic levels: vocabulary, taxonomy and non-taxonomic semantic<br />

relations. A score based on error rates was designed for each level of evaluation. [19]<br />

describes an ontology evaluation scheme that makes it easier for domain experts to<br />

evaluate the contents of an ontology. This scheme is called OntoLearn.<br />

We felt that none of the above methods seemed to address the core issues particular<br />

to WordNets, and hence we approached the problem by looking at synsets.<br />

3 Synset Validation<br />

3.1 Introduction<br />

Synsets are the foundations of a WordNet. A WordNet synset is constructed by<br />

putting together a set of synonyms that together define a particular sense uniquely, as<br />

given by the principles of minimality and coverage described in the previous section.<br />

This sense is explicitly indicated for human readability by a gloss. For instance, the<br />

synset {proboscis, trunk} represents the sense of “a long flexible snout as of an<br />

elephant”, as opposed to the synset {luggage compartment, automobile trunk, trunk}<br />

which is “a compartment in an automobile that carries luggage or shopping or<br />

tools”. Words with potentially multiple meanings are associated together, out of<br />

which a single sense emerges. To evaluate the quality of a synset, we began by<br />

looking at the validity of its constituent synonyms.<br />

Before the validation, the following theoretical questions must be addressed:<br />

1. What is the definition of a synonym?<br />

2. What are the necessary and sufficient conditions to determine that synonymy exists<br />

among a group of words?<br />

Intuitively, synonymy exists between two words when they share a similar sense.<br />

This also implies that one word can be replaced by its synonym in a context without<br />

any loss of meaning. In practice, most words are not perfect replacements for their<br />

synonyms, i.e. they are near-synonyms. There can be contextual, collocational and<br />

other preferences behind replacing synonyms. [20] describes attempts to<br />

mathematically describe synonymy. To the best of our knowledge, no necessary and<br />

sufficient conditions to prove that two words are synonyms of each other have been<br />

explicitly stated.



The foundation of our work is the following: we conjecture that<br />

1. if two words are synonyms, they must necessarily share one common<br />

meaning out of all the meanings they could possess.<br />

2. a sufficient condition could be showing that the words can replace each other in a<br />

context without loss of meaning.<br />

The task of synset validation has the following subtasks:<br />

1. Are the words in a synset indeed synonyms of each other?<br />

2. Are there any words which have been omitted from the synset?<br />

3. Does the combination of words indicate the required sense?<br />

In this paper, we address the first question above, i.e. given a set of<br />
words, can we verify that they are synonyms? Our literature survey revealed that<br />

though much work had been done in the automated discovery of synonyms (from<br />

corpora and dictionaries), no work had been done in automatically verifying whether<br />

two words were synonyms. Nevertheless, we began by studying some of the synonym<br />

discovery methods available.<br />

3.2 Related Work on Automatic Synset Creation<br />

All these methods are based on web and corpora mining. [21] describes a method to<br />

collect synonyms in the medical domain from the Web by first building a taxonomy<br />

of words. [22] provides an unsupervised learning method for extracting synonyms<br />

from the Web. [23] shows an interesting topic signature method to detect synonyms<br />

using document contexts and thus enrich large ontologies. Finally, [24] is a survey of<br />

different synonym discovery methods, which also proposes its own dictionary-based<br />

solution for the problem. Its dictionary based approach provides some useful hints for<br />

our own experiments in synonymy validation.<br />

3.3 Our Approach<br />

We focus only on the problem of checking whether the words in a synset can be<br />

shown to be synonyms of each other and thus correctly belong to that synset. As of<br />

now, we do not flag omissions from synsets. It should also be noted that failure to<br />
validate the presence of a word in a synset does not strongly suggest that the word is<br />
incorrectly entered in the synset; it merely raises a flag for human validation.<br />

The input to our system is a WordNet synset which provides the following<br />

information:<br />

1. The synonymous words in the synset<br />

2. The hypernym(s) of the synset<br />

3. Other linked nodes, gloss, example usages



The output consists of a verdict on each word as to whether it fits in the synset, i.e.<br />

whether it qualifies to be the synonym of other words in the synset, and hence,<br />

whether it expresses the sense represented by the synset. A block diagram of the<br />

system is shown in Fig.1.<br />

Fig. 1. Block Diagram for Synset Synonym Validation<br />

4 Our Dictionary-based Algorithm<br />

4.1 The Basic Idea<br />

In dictionaries, a word is usually defined in terms of its hypernyms or synonyms. For<br />

instance, consider definitions of the word snake, whose hypernym is reptile, and its<br />

synonyms serpent and ophidian (obtained from the website Dictionary.com [25]):<br />

snake: any of numerous limbless, scaly, elongate reptiles of the suborder<br />

Serpentes, comprising venomous and non-venomous species inhabiting tropical and<br />

temperate areas.<br />

serpent: a snake<br />

ophidian: A member of the suborder Ophidia or Serpentes; a snake.<br />

This critical observation suggests that dictionary definitions may provide useful clues<br />

for verifying synonymy.<br />

We use the following hypothesis:<br />

if a word is present in a synset, there is a dictionary definition for it which refers to<br />

its hypernym or to its synonyms from the synset.<br />

Instead of matching synonyms pair-wise, we try to validate the presence of the<br />

word in the synset using the hypernyms of the synset and the other synonyms in the<br />

synset. A given word belongs to a given synset if there exists a definition for that<br />

word, which refers to one of the given hypernym words or one of the synonyms. We



use the hypernyms and synonyms to validate other synonyms by mutual<br />

reinforcement.<br />
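The hypothesis can be sketched directly in code. In this sketch the dictionary lookup is a plain Python dict and matching is naive token overlap; a real implementation would query a machine-readable dictionary and lemmatize both definitions and literals:

```python
def validate_word(word, synset, hypernyms, definitions):
    """Accept `word` if any of its definitions mentions a hypernym of the
    synset or one of the other synonyms in the synset."""
    evidence = {h.lower() for h in hypernyms} | {
        s.lower() for s in synset if s != word
    }
    for definition in definitions.get(word, []):
        # Naive tokenization; real text would need lemmatization and
        # multi-word phrase matching.
        tokens = set(definition.lower().replace(",", " ").replace(";", " ").split())
        if evidence & tokens:
            return True
    return False

# Toy dictionary entries paraphrasing the `snake` example above.
defs = {
    "serpent": ["a snake"],
    "ophidian": ["a member of the suborder Ophidia or Serpentes; a snake"],
}
synset = ["snake", "serpent", "ophidian"]
print(validate_word("serpent", synset, ["reptile"], defs))  # True
```

A word for which no supporting definition is found (here snake, whose toy entry is missing) is not declared wrong; as stated in Section 3.3, it is only flagged for human review.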

4.2 Algorithm Description<br />

The dictionary-based algorithm consists in applying three groups of rules in order.<br />

The first group applies to each word individually, using its dictionary definitions. The<br />

second group relies on a set of words collected for the entire synset during the<br />

application of the first group. The final group consists of rules that do not use the<br />

dictionary definitions. (All definitions in this section are from the website<br />

Dictionary.com [25].)<br />

In this section, we describe the steps of the algorithm with examples; the full<br />

algorithm is stated formally in Section 4.3.<br />

Group 1<br />

Rule 1 - Hypernyms in Definitions<br />

Definitions of words for particular senses often make references to the hypernym of<br />

the concept. Finding such a definition means that the word's placement in the synset<br />

can be defended.<br />

e.g.<br />

Synset: {brass, brass instrument}<br />

Hypernym: {wind instrument, wind}<br />

Relevant Definitions:<br />

brass instrument: a musical wind instrument of brass or other metal with a cup-shaped<br />

mouthpiece, as the trombone, tuba, French horn, trumpet, or cornet.<br />

Rule 2 - Synonyms in Definitions<br />

Definitions of words also make references to fellow synonyms, thus helping to<br />

validate them.<br />

e.g.<br />

Synset: {anchor, ground tackle}<br />

Hypernym: {hook, claw}<br />

Relevant Definitions:<br />

ground tackle: equipment, as anchors, chains, or windlasses, for mooring a vessel<br />

away from a pier or other fixed moorings.<br />

Rule 3 - Reverse Synonym Definitions<br />

Definitions of synonyms may also make references to the word to be validated.<br />

e.g.<br />

Synset: {Irish Republican Army, IRA, Provisional Irish Republican Army,<br />

Provisional IRA, Provos}<br />

Hypernym: {terrorist organization, terrorist group, foreign terrorist organization,<br />

FTO}


366 J. Ramanand and Pushpak Bhattacharyya<br />

Relevant Definitions:<br />

Irish Republican Army: an underground Irish nationalist organization founded to<br />

work for Irish independence from Great Britain: declared illegal by the Irish<br />

government in 1936, but continues activity aimed at the unification of the Republic of<br />

Ireland and Northern Ireland.<br />

Provos: member of the Provisional wing of the Irish Republican Army.<br />

Here Irish Republican Army can be validated using the definition of Provos.<br />

Rules 4 and 5 - Partial Hypernyms and Synonyms in Definitions<br />

Many words in WordNet are multi-words, i.e., they are made up of more than<br />

one word. In quite a few cases, such multi-word hypernyms are not entirely present in<br />

the definitions of words, but parts of them can be found in the definition.<br />

e.g.<br />

Synset: {fibrinogen, factor I}<br />

Hypernym: {coagulation factor, clotting factor}<br />

Relevant Definitions:<br />

fibrinogen: a globulin occurring in blood and yielding fibrin in blood coagulation.<br />

Group 2<br />

Rule 6 – Bag of Words from Definitions<br />

In some cases, definitions of a word do not refer to synonyms or hypernym words.<br />

However, the definitions of two synonyms may share common words, relevant to the<br />

context of the sense. This rule captures this case.<br />

When a word is validated using Group 1 rules, the words of the validating<br />

definition are added to a collection. After applying Group 1 rules to all words in the<br />

synset, a bag of these words (from all validating definitions seen so far) is now<br />

available. For each remaining synonym yet to be validated, we look for any definition<br />

for it which contains one of the words in this bag.<br />

e.g.<br />

Synset: {serfdom, serfhood, vassalage}<br />

Hypernym: {bondage, slavery, thrall, thralldom, thraldom}<br />

Relevant Definitions<br />

serfdom: person (held in) bondage; servitude<br />

vassalage: dependence, subjection, servitude<br />

serfdom is matched on account of its hypernym bondage being present in its<br />

definition. So the Bag of Words now contains “person, bondage, servitude”.<br />

No definition of vassalage could be matched with any of the rules from 1 to 5. But<br />

Rule 6 matches the word servitude and so helps validate the word.
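Rule 6 can be sketched in a few lines (a minimal Python rendering with an assumed data model: definitions as plain strings with whitespace tokenisation; the real system compares stemmed words):<br />

```python
def rule6_validate(unvalidated, validating_defs, dictionary):
    bag = set()                       # words from validating definitions
    for d in validating_defs:
        bag.update(d.lower().split())
    hits = {}
    for w in unvalidated:
        for d in dictionary.get(w, []):
            common = bag & set(d.lower().split())
            if common:                # any shared word validates w
                hits[w] = common
                break
    return hits

# serfdom was validated by Rule 1 (hypernym "bondage" in its
# definition), so its definition words seed the bag.
validating = ["person held in bondage servitude"]
dictionary = {"vassalage": ["dependence subjection servitude"]}
print(rule6_validate(["vassalage"], validating, dictionary))
# → {'vassalage': {'servitude'}}
```

As in the serfdom/vassalage example above, vassalage is validated through the shared definition word servitude.<br />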



Group 3<br />

Rules 7 and 8 - Partial Matches of Hypernyms and Synonyms<br />

Quite a few words to be validated are multi-words. Many of these do not have<br />

definitions present in conventional dictionaries, which makes the above rules<br />

inapplicable to them. Therefore, we use the observation that, in many cases, these<br />

multi-words are variations of their synonyms or hypernyms, i.e. the multi-words share<br />

common words with them. Examples of these are synsets such as:<br />

1. {dinner theater, dinner theatre}: No definition was available for dinner theatre,<br />

possibly because of the British spelling.<br />

2. {laurel, laurel wreath, bay wreath}: No definitions for the two multi-words.<br />

3. {Taylor, Zachary Taylor, President Taylor}: No definition for the last multi-word.<br />

As can be seen above, the multi-word synonyms do share partial words. To validate<br />

such multi-words without dictionary entries, we check for the presence of partial<br />

words in their synonyms.<br />

e.g.<br />

Synset: {Taylor, Zachary Taylor, President Taylor}<br />

Hypernym: {President of the United States, United States President, President,<br />

Chief Executive}<br />

Relevant Definitions:<br />

Taylor, Zachary Taylor: (1784-1850) the 12th President of the United States from<br />

1849-1850.<br />

President Taylor: - no definition found -<br />

The first two words have definitions which are used to easily validate them. The<br />

third word has no definition, and so rules from Group 1 and 2 do not apply to it.<br />

Applying the Group 3 rules, we look for the component words in the other two<br />

synonyms. Doing this, we find “Taylor” in the first synonym, and hence validate the<br />

third word.<br />

A similar rule can be defined for a multi-word hypernym, wherein we look for the<br />

component word in the hypernym words. In this case, we would match the word<br />

“President” in the first hypernym word.<br />

We must note that, in comparison to the other rules, these rules are likely to be<br />

susceptible to erroneous decisions, and hence a match using these rules should be<br />

treated as a weak match. The reason for creating these two rules is to overcome the<br />

scarcity of definitions for such multi-words.
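Rules 7 and 8 amount to a component-word overlap check, sketched below (the returned set of shared words is an assumed interface for illustration; the system only records a weak match):<br />

```python
def partial_match(word, synonyms, hypernyms):
    """Weak match: a component word of the multi-word `word` also
    occurs in another synonym (Rule 7) or a hypernym word (Rule 8)."""
    parts = set(word.lower().split())
    for candidate in synonyms + hypernyms:
        if candidate == word:
            continue
        shared = parts & set(candidate.lower().split())
        if shared:
            return sorted(shared)     # weak match: treat with caution
    return []

print(partial_match("President Taylor",
                    ["Taylor", "Zachary Taylor"],
                    ["President of the United States", "President"]))
# → ['taylor']
```

For the Taylor synset above, the component word Taylor is found in the first synonym, yielding a (weak) validation.<br />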



4.3 Algorithm Statement<br />

Algorithm 1 – Validating WordNet synsets using a<br />

dictionary<br />

1: Input: synset S, words W in synset S, Dictionary of<br />

definitions<br />

2: For each word w belonging to W do<br />

3: Apply rules in Group 1:<br />

- 3.1: (Rule 1) Find a definition for w in the<br />

dictionary such that it contains a hypernym word h<br />

(repeat with other hypernyms if necessary)<br />

- 3.2: (Rule 2) Else, find a definition for w<br />

containing any synonym of w from the synset<br />

- 3.3: (Rule 3) Else, find a synonym's definition<br />

referring to w<br />

- 3.4: (Rule 4) (applicable to multi-words in the<br />

hypernym) Else, find a definition of w referring to a<br />

partial word from a multi-word in the hypernym<br />

- 3.5: (Rule 5) (applicable to synonyms that are<br />

multi-words) Else, find a definition for w referring to<br />

a partial word from a multi-word synonym<br />

4: Apply the rule 6 in Group 2:<br />

- 4.1: For every word m from the synset that was<br />

matched by one of the above rules, add the words in the<br />

validating definition for m to a collection of words C.<br />

- 4.2: For each word w in the synset that has not<br />

been validated, find a definition d of w such that d<br />

has a word appearing in C.<br />

5: Apply rules in Group 3 to each remaining unmatched<br />

word w:<br />

- 5.1: (Rule 7) See if a partial word from the<br />

multi-word w is found in another synonym from the synset<br />

- 5.2: (Rule 8) Else, see if a partial word from the<br />

multi-word w is found in a hypernym word h.<br />

6: end for



5 Experimental Results<br />

5.1 Setup<br />

The validation was tested on the Princeton WordNet (v2.1) noun synsets. Out of the<br />

81426 noun synsets, 39840 are synsets with more than one word – only these were<br />

given as input to the validator. This set comprised a total of 103620 words.<br />

One of the contributions of our work is the creation of a super dictionary which<br />

consists of words and their definitions constructed by automatic means from the<br />

online dictionary service Dictionary.com ([25]) (which aggregates definitions from<br />

various sources such as Random House Unabridged Dictionary, American Heritage<br />

Dictionary, etc.) Of these, definitions from Random House and American Heritage<br />

dictionaries were identified and added to the dictionary being created. English stop<br />

words were removed from the definitions, and the remaining words were stemmed<br />

using Porter's stemmer [26]. The resulting dictionary had 463487 definitions in all for<br />

a total of 49979 words (48.23% of the total number of words).<br />
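The preprocessing of the collected definitions can be sketched as follows (the stop-word list is illustrative, and a crude suffix stripper stands in for Porter's stemmer [26]):<br />

```python
# Illustrative stop-word list; the actual system uses a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "or", "and", "as", "for"}

def crude_stem(word):
    # Stand-in for Porter's stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise_definition(definition):
    # Lowercase, strip punctuation, drop stop words, then stem.
    tokens = [t.strip(".,;:()") for t in definition.lower().split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

print(normalise_definition(
    "a globulin occurring in blood and yielding fibrin in blood coagulation."))
# → ['globulin', 'occurr', 'blood', 'yield', 'fibrin', 'blood', 'coagulation']
```

Rule matching is then performed on these normalised word lists rather than the raw definition strings.<br />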

5.2 Results and Discussions<br />

Figs. 2, 3, and 4 summarise the main results obtained by running the dictionary-based<br />

validator on 39840 synsets. As shown in Fig. 2, 14844 out of the 18322 unmatched<br />

words did not have definitions in the dictionary. Therefore, there are 88776 words<br />

which either have definitions in the dictionary, or are referenced in the dictionary, or<br />

are matched by the partial rules. So, considering only these 88776 words, there are<br />

85298 matched words, i.e. a validation value of 96.08%.<br />

In about 9% of all synsets, none of the words in the synset could be verified. Of<br />

these 3660 synsets, 2952 (80%) had only 2 words in them. The primary reason was<br />

that one member of the synset was not present in the dictionary, which reduced<br />

the number of rules applicable to the other word.<br />

Failure to validate a word does not mean that the word in question is incorrectly<br />

present in the synset. Instead, it flags the need for human intercession to verify<br />

whether the word indeed has that synset's sense. The algorithm is not powerful<br />

enough to make a firm claim of erroneous placement. In an evaluation system, the<br />

validator can serve as a useful first-cut filter to reduce the number of words to be<br />

scrutinised by a human expert. In some cases, the non-matches did raise some<br />

interesting questions about the validity of a word in a synset. We discuss some<br />

examples in the next section.



Fig. 2. The Dictionary Approach: Summary of results<br />

Fig. 3. The Dictionary Approach: A synset perspective<br />

Fig. 4. The Dictionary Approach: Rule-wise summary



5.3 Case Studies<br />

(All sources for definitions in the following examples are from the website<br />

Dictionary.com [25])<br />

5.3.1 Possible True Negatives flagged by the validator<br />

The validator could not match about 18% of all words. In most of these cases, the<br />

words are indeed correctly placed (as one would expect of a resource manually<br />

created by experts) but are flagged incorrectly by the validator, as it is not yet<br />

powerful enough to match them. However, consider the following cases of words<br />

where non-matches are interesting to study.<br />

Instance 1:<br />

Synset: {visionary, illusionist, seer}<br />

Hypernym: {intellectual, intellect}<br />

Gloss: a person with unusual powers of foresight<br />

The word “illusionist” was not matched in this context. This seems to be a highly<br />

unusual sense of this word (more commonly seen in the sense of “conjuror”). None<br />

of the dictionaries consulted provided this meaning for the word.<br />

Instance 2:<br />

Synset: {bobby pin, hairgrip, grip}<br />

Hypernym: {hairpin}<br />

Gloss: a flat wire hairpin whose prongs press tightly together; used to hold bobbed<br />

hair in place<br />

It could not be established from any other lexical resource whether grip, though a<br />

similar sounding word to hairgrip, was a valid synonym for this sense. Again, this<br />

could be a usage local to some cultures, but this was not readily supported by other<br />

dictionaries.<br />

5.3.2 True Positives correctly flagged by the validator<br />

Here are examples of the validator correctly flagging matches.<br />

Instance 1<br />

Synset: {smokestack, stack}<br />

Word to be validated: smokestack<br />

Hypernym: {chimney}<br />

Relevant Definitions:<br />

smokestack: A large chimney or vertical pipe through which combustion vapors,<br />

gases, and smoke are discharged.



Instance 2<br />

Synset: {zombi, zombie, snake god}<br />

Word to be validated: snake god<br />

Hypernym: {deity, divinity, god, immortal}<br />

Relevant Definitions:<br />

zombie: a snake god worshiped in West Indian and Brazilian religious practices of<br />

African origin.<br />

5.3.3 False Negatives flagged by the validator<br />

Here are examples of the validator being unable to match words, despite definitions<br />

being present:<br />

Instance 1<br />

Synset: {segregation, separatism}<br />

Word to be validated: segregation<br />

Hypernym: {social organization, social organisation, social structure, social<br />

system, structure}<br />

Relevant Definitions:<br />

segregation: The act or practice of segregating<br />

segregation: the state or condition of being segregated<br />

Noun forms of such verbs typically refer to the act, which makes them hard to validate<br />

using other words.<br />

Instance 2<br />

Synset: {hush puppy, hushpuppy}<br />

Word to be validated: hush puppy<br />

Hypernym: {cornbread}<br />

Relevant Definitions:<br />

Hush puppy: a small, unsweetened cake or ball of cornmeal dough fried in deep fat.<br />

Establishing the similarity between cornmeal and cornbread would have been our<br />

best chance to validate this word. Currently, we are unable to do this.<br />

6 Conclusions and Future Work<br />

Our observations show that the intuitive idea behind the algorithm holds well. The<br />

algorithm is quite simple to implement. No interpretation of numbers is required; the<br />

process is just a simple test. The algorithm is heavily dependent on the depth and<br />

quality of dictionaries being used. WordNet has several words that were not present in<br />

conventional dictionaries available on the Web. Encyclopaedic entries such as<br />

Mandara (a Chadic language spoken in the Mandara mountains in Cameroon),<br />

domain-specific words, mainly from agriculture, medicine, and law, such as ziziphus<br />

jujuba (spiny tree having dark red edible fruits) and pediculosis capitis (infestation of



the scalp with lice), phrasal words such as caffiene intoxication (sic) were among<br />

those not found in the collected dictionary.<br />

Since the Princeton WordNet is manually crafted by a team of experts, we do not<br />

expect to find too many errors. However, many of the words present in the dictionary<br />

and not validated were those with rare meanings and usages. Our method makes it<br />

easier for human validators to focus on such words. This will especially be useful in<br />

validating the output of automatic WordNet creations.<br />

The algorithm cannot yet detect omissions from a synset, i.e. the algorithm does<br />

not discover potential synonyms and compare them with the existing synset.<br />

Possible future directions could be expanding the synset validation to other parts of<br />

a synset such as the gloss and relations to other synsets. The results could be<br />

summarized into a single number representing the quality of the synsets in the<br />

WordNet. The results could then be correlated with human evaluation, finally<br />

converging to a score that captures the human view of the WordNet.<br />

The problem of scarcity of definitions could be further addressed by adding more<br />

dictionaries and references to the set of sources.<br />

The presented algorithm is available only for English WordNet. However, the<br />

approach should broadly apply to other language WordNets as well. The limiting<br />

factors are the availability of dictionaries and tools like stemmers for those languages.<br />

Similarly, the algorithm could be used to verify synonym collections such as in<br />

Roget's Thesaurus and also other knowledge bases. The algorithm has been executed<br />

on noun synsets; it can also be run on synsets from other parts of speech.<br />

We see such evaluation methods becoming increasingly imperative as more and<br />

more WordNets are created by automated means.<br />

Acknowledgments<br />

We would like to express our gratitude to Prof. Om Damani (CSE, IIT Bombay) and<br />

Sagar Ranadive for their valuable suggestions towards this work.<br />

References<br />

1. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: an<br />

on-line lexical database. J. The International Journal of Lexicography 3(4), 235–244 (1990)<br />

2. Vossen, P.: EuroWordNet: a multilingual database with lexical semantic networks. Kluwer<br />

Academic Publishers (1998)<br />

3. Narayan, D., Chakrabarty, D., Pande, P., Bhattacharyya, P.: An Experience in building the<br />

Indo-WordNet - A WordNet for Hindi. In: First International Conference on Global<br />

WordNet (<strong>GWC</strong> '02) (2002)<br />

4. Liu, H., Singh, P.: Commonsense Reasoning in and over Natural Language. In: The<br />

proceedings of the 8th International Conference on Knowledge-Based Intelligent<br />

Information and Engineering Systems (2004)<br />

5. Dong, Z., Dong, Q.: An Introduction to HowNet. Available from: http://www.keenage.com<br />

6. Richardson, S., Dolan, W., Vanderwende, L.: MindNet: acquiring and structuring semantic<br />

information from text. In: 36th Annual meeting of the Association for Computational<br />

Linguistics, vol. 2, pp. 1098–1102 (1998)



7. Kipper-Schuler, K.: VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D.<br />

dissertation. University of Pennsylvania (2005)<br />

8. Baker, C. F., Fillmore, C. J., Lowe, J. B.: The Berkeley FrameNet project. In: Proceedings of<br />

the COLING-ACL (1998)<br />

9. WordNet 2.1 database statistics. Available at: http://wordnet.princeton.edu/man/wnstats.7WN<br />

10. Smrz, P.: Quality Control for WordNet Development. In: Proceedings of <strong>GWC</strong>-04, 2nd<br />

Global WordNet Conference (2004)<br />

11. Cuadros M., Rigau G.: Quality Assessment of Large-Scale Knowledge Resources. In:<br />

Proceedings of Joint SIGDAT Conference on Empirical Methods in Natural Language<br />

Processing (EMNLP'06). Sydney, Australia (2006)<br />

12. Cuadros M., Rigau G., Castillo M: Evaluating Large-scale Knowledge Resources across<br />

Languages. In: Proceedings of the International Conference on Recent Advances on Natural<br />

Language Processing (RANLP'07). Borovetz, Bulgaria (2007)<br />

13. Guarino, N.: Toward a Formal Evaluation of Ontology Quality - (Why Evaluate Ontology<br />

Technologies? Because It Works!). J. IEEE Intelligent Systems 19(4), 74–81 (2004)<br />

14. Brank, J., Grobelnik, M., Mladenic, D.: A survey of ontology evaluation techniques. In:<br />

Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD 2005) (2005)<br />

15. Maynard, D., Peters, W., Li, Y.: Metrics for Evaluation of Ontology based Information<br />

Extraction. In: EON2006 at WWW (2006)<br />

16. Hartman, J., Spyns, P., Giboin, A. et al.: D1.2.3 Methods for ontology evaluation.<br />

Deliverable for Knowledge Web Consortium (2005)<br />

17. Guarino, N., Welty, C.: Evaluating ontological decisions with OntoClean. J.<br />

Communications of the ACM 45(2), 61–65 (2002)<br />

18. Porzel, R., Malaka, R.: A Task-based Approach for Ontology Evaluation. In: ECAI<br />

Workshop on Ontology Learning and Population (2004)<br />

19. Velardi, P. et al.: Automatic Ontology Learning: Supporting a Per-Concept Evaluation by<br />

Domain Experts. In: Workshop on Ontology Learning and Population (OLP), in the 16th<br />

European Conference on Artificial Intelligence (2004)<br />

20. Edmundson, H.P., Epstein, M.: Computer Aided Research on Synonymy and Antonymy.<br />

In: Proceedings of the International Conference on Computational Linguistics (1969)<br />

21. Sanchez, D., Moreno, A.: Automatic discovery of synonyms and lexicalizations from the<br />

Web. In: Proceedings of the 8th Catalan Conference on Artificial Intelligence (2005)<br />

22. Turney, P.D.: Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In:<br />

Proceedings of the Twelfth European Conference on Machine Learning, pp. 491–502.<br />

Springer Verlag, Berlin (2001)<br />

23. Agirre, E., Ansa, O., Hovy, E., Martinez, D.: Enriching very large ontologies using the<br />

WWW. In: Proceedings of the Workshop on Ontology Construction of the European<br />

Conference of AI (ECAI-00) (2000)<br />

24. Senellart, P. P., Blondel, V. D.: Automatic discovery of similar words. In: Berry, M. W.<br />

(ed.) A Comprehensive Survey of Text Mining. Springer-Verlag (2003)<br />

25. Dictionary.Com. Available at: http://dictionary.reference.com/<br />

26. Porter's Stemmer. Available at: http://www.tartarus.org/martin/PorterStemmer/ (as of 1 July<br />

2005)


Lexical Enrichment of a Human Anatomy<br />

Ontology using WordNet<br />

Nils Reiter 1 and Paul Buitelaar 2<br />

1 Department of Computational Linguistics, Heidelberg University,<br />

Heidelberg, Germany ⋆⋆<br />

reiter@cl.uni-heidelberg.de<br />

2 Language Technology Lab & Competence Center Semantic Web, DFKI,<br />

Saarbrücken, Germany<br />

paulb@dfki.de<br />

Abstract. This paper is concerned with lexical enrichment of ontologies,<br />

i.e. how to enrich a given ontology with lexical entries derived from a<br />

semantic lexicon. We present an approach towards the integration of<br />

both types of resources, in particular for the human anatomy domain as<br />

represented by the Foundational Model of Anatomy (FMA). The paper<br />

describes our approach on combining the FMA with WordNet by use of<br />

a simple algorithm for domain-specific word sense disambiguation, which<br />

selects the most likely sense for an FMA term by computing statistical<br />

significance of synsets on a corpus of Wikipedia pages on human anatomy.<br />

The approach is evaluated on a benchmark of 50 ambiguous FMA terms<br />

with manually assigned WordNet synsets (i.e. senses).<br />

1 Introduction<br />

This paper is concerned with lexical enrichment of ontologies, i.e. how to enrich<br />

a given ontology with lexical entries derived from a semantic lexicon. The<br />

assumption here is that an ontology represents domain knowledge with less emphasis<br />

on the linguistic realizations (i.e. words) of knowledge objects, whereas a<br />

semantic lexicon such as WordNet defines lexical entries (words with their linguistic<br />

meaning and possibly morpho-syntactic features) with less emphasis on<br />

the domain knowledge associated with these.<br />

1.1 Ontologies<br />

An ontology is an explicit and formal description of the conceptualization of a<br />

domain of discourse (see e.g. Gruber [1], Guarino [2]). In its most basic form an<br />

⋆⋆ This work was done while the first author was affiliated with the Department of<br />

Computational Linguistics at Saarland University.


376 Nils Reiter and Paul Buitelaar<br />

ontology consists of a set of classes and a set of relations that describe the properties<br />

of each class. Ontologies formally define relevant knowledge in a domain of<br />

discourse and can be used to interpret data in this domain (e.g. medical data such<br />

as patient reports), to reason over knowledge that can be extracted or inferred<br />

from this data and to integrate extracted knowledge with other data or knowledge<br />

extracted elsewhere. With recent developments towards knowledge-based<br />

applications such as intelligent Question Answering, Semantic Web applications<br />

and semantic-level multimedia indexing and retrieval, the interest in large-scale<br />

ontologies has increased. Here we use a standard ontology in human anatomy, the<br />

FMA: Foundational Model of Anatomy 3 ([3]). The FMA describes the domain<br />

of human anatomy in much detail by way of class descriptions for anatomical<br />

objects and their properties. Additionally, the FMA lists terms in several languages<br />

for many classes, which makes it a lexically enriched ontology already.<br />

However, our main concern here is to extend this lexical representation further<br />

by automatically deriving synonyms from WordNet.<br />

1.2 Lexicons<br />

A lexicon describes the linguistic meaning and morpho-syntactic features of<br />

words and possibly also of more complex linguistic units such as idioms, collocations<br />

and other fixed phrases. Semantically organized lexicons such as WordNet<br />

and FrameNet define word meaning through formalized associations between<br />

words, i.e. in the form of synsets in the case of WordNet and with frames in<br />

the case of FrameNet. Although such a representation defines some semantic<br />

aspects of a word relative to other words, it does not represent any knowledge<br />

about the objects that are referred to by these words. For instance, the English<br />

noun “ball” may be represented by two synsets in WordNet ({ball, globe},<br />

{ball, dance}) each of which reflects another interpretation of this word. However,<br />

deeper knowledge about what a “ball” in the sense of a “dance” involves (a<br />

group of people, a room, music to which the group of people move according to a<br />

certain pattern, etc.) cannot be represented in this way. Frame-based definitions<br />

in FrameNet allow for such a deeper representation to some extent, but also in<br />

this case the description of word meaning is concerned with the relation between<br />

words and not so much with object classes and their properties that are referred<br />

to by these words. However, for instance in the case of BioFrameNet ([4]), an<br />

extension of FrameNet for use in the biomedical domain, such an attempt has<br />

been made and is therefore much in line with our work described here.<br />

1.3 Related Work<br />

Other related work is on word sense disambiguation (WSD) and specifically<br />

domain-specific WSD as this is a central aspect of our algorithm in selecting the<br />

3 See http://sig.biostr.washington.edu/projects/fm/AboutFM.html for more details<br />

on the FMA


Lexical Enrichment of a Human Anatomy Ontology using WordNet 377<br />

most likely sense of words occurring in FMA terms. The work presented here is<br />

based directly on [5] and similar approaches ([6, 7]). Related to this work is the<br />

assignment of domain tags to WordNet synsets ([8]), which would obviously help<br />

in the automatic assignment of the most likely sense in a given domain – as shown<br />

in [9]. An alternative to this idea is to simply extract that part of WordNet that<br />

is directly relevant to the domain of discourse ([10, 11]). However, more directly<br />

in line with our work on enriching a given ontology with lexical information<br />

derived from WordNet is the approach presented in [12]; the main difference is that<br />

we use a domain corpus as additional evidence for statistical significance of a<br />

selected word sense (i.e. synset). Finally, also some recent work on the definition<br />

of ontology-based lexicon models ([13–15]) is of (indirect) relevance to the work<br />

presented here as the derived lexical information needs to be represented in such<br />

a way that it can be easily accessed and used by NLP components as well as<br />

ontology management and reasoning tools.<br />

2 Approach<br />

Our approach to lexical enrichment of ontologies consists of a number of steps,<br />

each of which will be addressed in the remainder of this section:<br />

1. extract all terms from the term descriptions of all classes in the ontology,<br />

and look up these terms in WordNet<br />

2. for ambiguous terms: apply domain-specific WSD by ranking senses (synsets)<br />

according to statistical relevance in the domain corpus<br />

3. select most relevant synset and add the synonyms of this synset to the corresponding<br />

term representation<br />
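The three steps can be condensed into a toy sketch; here synsets_of stands in for the WordNet lookup and domain_weight for the χ²-based ranking of section 2.2 (the 6.22 and 20.65 scores below reuse the “gum”/“gingiva” values reported there, while the “chewing gum” synset and its weight are invented for illustration):<br />

```python
def enrich_term(term, synsets_of, domain_weight):
    # Step 1: look the term up; unambiguous or unknown terms need no WSD.
    synsets = synsets_of.get(term, [])
    if not synsets:
        return []                                  # term not in WordNet
    # Step 2: rank senses by summed domain relevance of their synonyms.
    best = max(synsets,
               key=lambda syn: sum(domain_weight(w) for w in syn))
    # Step 3: the winning synset's other members become new terms.
    return [w for w in best if w != term]

synsets_of = {"gum": [["gum", "chewing gum"], ["gum", "gingiva"]]}
weights = {"gum": 6.22, "gingiva": 20.65, "chewing gum": 1.0}
print(enrich_term("gum", synsets_of, lambda w: weights.get(w, 0.0)))
# → ['gingiva']
```
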

2.1 Term Extraction and WordNet Lookup<br />

Ontologies, such as the FMA, describe objects and their relations to each other.<br />

Additionally, each such object (or rather the class descriptions for such objects)<br />

may carry terminological information in one or more languages. In the FMA,<br />

terms for classes are defined in several languages, i.e. 100,000 English terms,<br />

8,000 Latin, 4,000 French, 500 Spanish and 300 German terms. Terms in the<br />

FMA can be simple, consisting of just one word, or complex multiword terms,<br />

e.g. “muscular branch of lateral branch of dorsal branch of right third posterior<br />

intercostal artery”. In our approach we considered simple as well as complex<br />

terms although only a small number of such domain-specific terms will actually<br />

occur in WordNet as will be reported below in section 3.<br />

2.2 WSD Algorithm<br />

The core of our approach is the word sense disambiguation algorithm as shown<br />

in figure 1. The algorithm iterates over every synonym of every synset of the



term in question. It calculates the χ² value of each synonym and adds them up<br />

for each synset.<br />

function getWeightForSynset(synset) {<br />

synonyms = all synonyms of synset<br />

weight = 0<br />

foreach synonym in synonyms<br />

c = chi-square(synonym)<br />

weight = weight + c<br />

end foreach<br />

return weight<br />

}<br />

s = synsets to which t belongs<br />

highest_weight = 0<br />

best_synsets = {}<br />

foreach synset in s<br />


weight = getWeightForSynset(synset)<br />

if (weight == highest_weight)<br />

best_synsets = best_synsets + { synset }<br />

else if (weight > highest_weight)<br />

highest_weight = weight<br />

best_synsets = { synset }<br />

end if<br />

end foreach<br />

return best_synsets<br />

Fig. 1. Algorithm for the sense disambiguation of the term t<br />

Using the χ²-test (see, for instance, [16, p. 169]), one can compare the frequencies<br />

of terms in different corpora. In our case, we use a reference and a<br />

domain corpus and assume that the terms occurring (relatively) more often in<br />

the domain corpus than in the reference corpus are “domain terms”, i.e., are<br />

specific to this domain. If it is a domain term, it should be defined in the ontology.<br />

χ²(t) = N · (Oᵗ₁₁Oᵗ₂₂ − Oᵗ₁₂Oᵗ₂₁)² / ((Oᵗ₁₁ + Oᵗ₁₂)(Oᵗ₁₁ + Oᵗ₂₁)(Oᵗ₁₂ + Oᵗ₂₂)(Oᵗ₂₁ + Oᵗ₂₂))   (1)<br />

χ², calculated according to formula (1), allows us to measure exactly this. Oᵗ₁₁<br />

and Oᵗ₁₂ denote the frequencies of the term t in the domain and reference corpora


Lexical Enrichment of a Human Anatomy Ontology using WordNet 379<br />

while O t 21 and O t 22 denote the frequency of any term but t in the domain and<br />

reference corpora:<br />

O t 11 = frequency of t in the domain corpus<br />

O t 12 = frequency of t in the reference corpus<br />

O t 21 = frequency of ¬t in the domain corpus<br />

O t 22 = frequency of ¬t in the reference corpus<br />

N = Added size of the two corpora<br />
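The score can be sketched in Python as follows (an illustration, not the authors' code; the counts and corpus sizes in the example are made up):

```python
# Sketch: chi-square domain-relevance score for a term, following
# formula (1). O21 and O22 are derived from the corpus sizes, since they
# count all tokens except the term itself.

def chi_square(o11, o12, domain_size, reference_size):
    """o11/o12: frequency of the term in the domain/reference corpus;
    domain_size/reference_size: total token counts of the two corpora."""
    o21 = domain_size - o11        # any term but t, domain corpus
    o22 = reference_size - o12     # any term but t, reference corpus
    n = domain_size + reference_size
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Illustrative: a term seen 137 times in a 1.3M-token domain corpus and
# 50 times in a 100M-token reference corpus.
score = chi_square(137, 50, 1_300_000, 100_000_000)
```

A term distributed identically across both corpora scores 0; the more its relative frequency in the domain corpus exceeds that in the reference corpus, the higher the score.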

The algorithm finally chooses the synset with the highest weight as the appropriate<br />

one.<br />

The term “gum”, for instance, has six noun senses with two synonyms on average.<br />
The χ² value of the synonym “gum” itself is 6.22. Since this synonym<br />
obviously occurs in every synset of the term, it makes no difference to the rating.<br />

But the synonym “gingiva”, which belongs to the second synset and is the<br />

medical term for the gums in the mouth, has a χ² value of 20.65. Adding up<br />
the domain relevance scores of the synonyms for each synset, we find that the<br />

second synset gets the highest weight and is therefore selected as the appropriate<br />

one.<br />
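The selection procedure of figure 1 can be rendered in Python roughly as follows (a sketch; `synsets_of`, `synonyms_of` and `chi_square` are hypothetical helpers standing in for the WordNet interface and the corpus statistics, and the running maximum is updated whenever a strictly higher weight is found):

```python
# Sketch of the sense disambiguation algorithm of figure 1 (hypothetical
# helper functions; not the authors' implementation).

def weight_for_synset(synset, synonyms_of, chi_square):
    # Sum the chi-square domain-relevance scores of all synonyms.
    return sum(chi_square(syn) for syn in synonyms_of(synset))

def disambiguate(term, synsets_of, synonyms_of, chi_square):
    highest_weight = 0.0
    best_synsets = set()
    for synset in synsets_of(term):
        weight = weight_for_synset(synset, synonyms_of, chi_square)
        if weight == highest_weight:
            best_synsets.add(synset)      # tie: keep all top-scoring synsets
        elif weight > highest_weight:
            highest_weight = weight
            best_synsets = {synset}       # strictly better: start a new set
    return best_synsets
```

Ties matter in practice: when no synonym of any synset occurs in the corpus, every synset scores 0 and all of them are returned.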

Relations The algorithm as shown in figure 1 uses the synonyms found in Word-<br />

Net. However, other relations that are provided by WordNet can be used as well.<br />

Figure 2 shows the improved algorithm. The main difference is that we calculate<br />

and add the weights for each synonym of each synset to which a synset of the<br />

original term is related.<br />
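The extended weighting can be sketched as follows (hypothetical helper names; `related_synsets_of` is assumed to return the synsets reachable from a synset via the chosen WordNet relations):

```python
# Sketch of the relation-extended synset weighting of figure 2 (not the
# authors' code; helper functions are assumed to wrap WordNet and the
# corpus statistics).

def weight_with_relations(synset, synonyms_of, related_synsets_of, chi_square):
    # Weight of the synset's own synonyms ...
    weight = sum(chi_square(s) for s in synonyms_of(synset))
    # ... plus the synonym weights of every related synset.
    for rsynset in related_synsets_of(synset):
        weight += sum(chi_square(s) for s in synonyms_of(rsynset))
    return weight
```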

2.3 Lexical Representation<br />

Finally, after the synsets for an ambiguous term t have been ranked according<br />

to relevance to the domain, we can select the top one or more to be included<br />

as (additional) lexical/terminological information in the ontology, i.e., the synonyms<br />

that are contained in this synset can be added as (further) terms for the<br />

ontology class c that corresponds to term t.<br />

Here, we actually propose to extend the ontology with an ontology-based<br />

lexicon format, LingInfo, which has been developed for this purpose in the context<br />

of previous work ([15]). By use of the LingInfo model we will be able to<br />

represent each synonym for t as a linguistic object l that is connected to the<br />

corresponding class c. The object l is an instance of the LingInfo class of such<br />

linguistic objects that cover the representation of the orthographic form of terms<br />

as well as relevant morpho-syntactic information, e.g. stemming, head-modifier<br />

decomposition, part-of-speech. The implementation of a LingInfo-based linguistic<br />

knowledge base for the FMA is ongoing work, but a first version of a similar



r = WordNet relations<br />
s = synsets to which t belongs<br />
highest_weight = 0<br />
best_synsets = {}<br />
foreach synset in s<br />
    weight = getWeightForSynset(synset)<br />
    related = synsets related to synset via r<br />
    foreach rsynset in related<br />
        weight += getWeightForSynset(rsynset)<br />
    end foreach<br />
    if (weight == highest_weight)<br />
        best_synsets = best_synsets + { synset }<br />
    else if (weight > highest_weight)<br />
        highest_weight = weight<br />
        best_synsets = { synset }<br />
    end if<br />
end foreach<br />
return best_synsets<br />

Fig. 2. Improved algorithm – As in figure 1 but including WordNet relations<br />

knowledge base for the football domain has been developed in the context of the<br />

SmartWeb project ([15, 17]).<br />

3 Experiment<br />

In an empirical experiment, we enrich the FMA (“Foundational Model of Anatomy”)<br />

ontology with lexical information (synonyms) derived from WordNet,<br />
using Wikipedia pages on human anatomy as the domain corpus.<br />

3.1 Data Sources<br />

Ontology: Foundational Model of Anatomy “The Foundational Model of<br />

Anatomy (FMA) ontology was developed by the Structural Informatics Group 4<br />

at the University of Washington. It contains approximately 75,000 classes and<br />

over 120,000 terms; over 2.1 million relationship instances from 168 relationship<br />

types link the FMA’s classes into a coherent symbolic model. The FMA is one<br />

of the largest computer-based knowledge sources in the biomedical sciences. The<br />

most comprehensive component of the FMA is the Anatomy taxonomy” (FMA<br />

4 http://sig.biostr.washington.edu/index.html



website), organized around the top class Anatomical Structure. “Anatomical<br />

structures include all material objects generated by the coordinated expression<br />

of groups of the organism’s own structural genes. Thus, they include biological<br />

macromolecules, cells and their parts, tissues, organs and their parts, as well as<br />

organ systems and body parts (body regions)” (FMA website). For the purpose<br />

of the experiment reported on here we used the taxonomy component of the<br />

FMA, extracted all English terms and did a lookup for each of these terms in<br />

WordNet.<br />

Semantic Lexicon: WordNet The most recent version of WordNet (3.0) was<br />

used in our experiment. As an interface to our own implementation, we use the<br />

Java WordNet interface 5 . The number of English terms (simple and complex)<br />

we were able to extract from the FMA was 120,417, of which 118,785 were not<br />

in WordNet. This left us with a set of 1,382 terms that were in WordNet but<br />

only 250 of these were actually ambiguous and therefore of interest to our experiment.<br />

Interestingly, 10 of these were in fact multiword terms. The experiment<br />

as reported below is therefore concerned with the disambiguation of these 250<br />

FMA terms, given their sense assignments in WordNet.<br />

Medical Corpus: Wikipedia Pages on Human Anatomy Our approach<br />

requires the use of a domain corpus. As the corpus for the anatomy domain, we<br />
use the Wikipedia pages from the category “Human Anatomy” and all its subordinate<br />
categories 6 . These are 7,251 single pages, containing over 4.4 million<br />

words.<br />

We removed the meta information (categories, tables of contents, weblinks,<br />

. . . ) using heuristic methods. Using part-of-speech tagging with the TreeTagger<br />
([18]), we automatically extracted all nouns from this corpus, resulting<br />

in 1.3 million noun tokens and 92,927 noun types.<br />

Reference Corpus: British National Corpus Our ranking of the domain<br />

relevance of a synset is based on comparing the frequencies of its synonyms<br />

in a domain corpus and a reference corpus. The reference corpus we use is the<br />

British National Corpus (BNC). Since we were only interested in the frequencies,<br />

we used the frequency lists provided by [19].<br />

3.2 Benchmark<br />

Our benchmark (gold standard) consists of 50 randomly selected ambiguous<br />
terms from the ontology. Four terms have been removed from the test set because<br />
none of their senses belong to the domain of human anatomy.<br />

5 http://www.mit.edu/~markaf/projects/wordnet/<br />

6 http://en.wikipedia.org/wiki/Category:Human_anatomy



Two annotators manually disambiguated them according to the domain of<br />

human anatomy. Each term is associated with one (or more) WordNet synsets.<br />

More than one synset is used due to the high granularity of WordNet. The<br />

agreement between the two annotators is, generally speaking, high. In every<br />

single case, there is an overlap in the associated synsets, i.e., for every term,<br />

there is at least one synset chosen by both annotators. If we count only a perfect<br />

match, i.e., both annotators chose exactly the same set of senses, the kappa value<br />

κ according to [20] is still κ = 0.71.<br />
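For exact-match agreement, each annotator's chosen sense set can be treated as a single categorical label and Cohen's kappa [20] computed in the usual way. A sketch with made-up labels (not the data behind the reported κ = 0.71):

```python
from collections import Counter

# Sketch: Cohen's kappa for two annotators, treating each item's full
# sense assignment as one categorical label (exact-match agreement).

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)
```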

Baseline The synsets in WordNet are sorted according to frequency. The word<br />

“jaw”, for instance, occurs more often with its first synset than with its third. It<br />

is therefore a reasonable assumption for any kind of word sense disambiguation<br />

to always pick the first sense (see, for instance, [21]). We use this simple approach<br />

as baseline for our evaluation.<br />

3.3 Evaluation<br />

The system was evaluated with respect to precision, recall and f-score. Precision<br />

is the proportion of the meanings predicted by the system which are correct.<br />

Recall is the proportion of correct meanings which are predicted by the system.<br />

Finally, the f-score is the harmonic mean of precision and recall, and is the final<br />

measure to compare the performance of systems.<br />

Table 1. Evaluation Results for the different WordNet relations<br />

Relation                                     Precision  Recall  F-Measure<br />
Baseline (first sense)                           58.69   46.56      51.93<br />
Only Synonyms                                    54.78   65.58      59.70<br />
Hypernym                                         56.52   47.10      51.38<br />
Hypernym (instance)                              53.70   63.41      63.22<br />
Hyponym                                          64.93   63.95      64.44<br />
Topic                                            56.96   67.75      61.89<br />
Holonym (part)                                   63.04   63.41      63.22<br />
Holonym (substance)                              55.87   65.58      60.34<br />
Meronym (member)                                 52.61   63.41      57.51<br />
Meronym (part)                                   58.05   68.12      62.68<br />
Meronym (substance)                              55.51   62.32      58.72<br />
All other                                        54.78   65.58      59.70<br />
Hyponym, Holonym (part), Meronym (part)          77.53   70.11      73.63<br />
Topic, Holonym (substance), Meronym (part)       61.30   70.29      65.49<br />



We calculated precision, recall and f-score separately for WordNet relations<br />

and test items. The test item “jaw”, for instance, was manually disambiguated<br />

to the first or second noun synset of the word. Our program, using the hyponymy<br />

relation, returned only the first noun synset. For this item, we calculate<br />

a precision of 100% and a recall of 50%.<br />
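Per-item scoring for the “jaw” example can be sketched as follows (hypothetical code; sense sets are represented by synset indices):

```python
# Sketch: per-item precision, recall and F-score over sets of synsets,
# as in the evaluation of the "jaw" example (gold: senses 1 and 2;
# system output with the hyponymy relation: sense 1 only).

def evaluate(predicted, gold):
    correct = len(predicted & gold)
    precision = correct / len(predicted)   # assumes a non-empty prediction
    recall = correct / len(gold)
    f_score = (2 * precision * recall / (precision + recall)
               if correct else 0.0)
    return precision, recall, f_score

p, r, f = evaluate({1}, {1, 2})   # p = 1.0, r = 0.5
```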

Table 1 shows the results averaged over all 46 test items for the different<br />

relations. The relations not shown did not give results different from those of the algorithm<br />
without using any of the relations (the results for “all other” are exactly the same<br />
as for “only synonyms”).<br />

The two lines at the bottom of the table are combinations of relations. In the<br />

first line, we use the three relations with the highest precision (and f-score, but<br />

that is a coincidence) together (hyponym, holonym (part) and meronym (part)).<br />

The last line shows the three relations with the highest recall taken together<br />

(topic, holonym (substance) and meronym (part)). Note that the meronym<br />

(part) relation is the only relation that is among the top three in both cases.<br />

3.4 Discussion<br />

Our results show – in almost any configuration – a clear improvement compared<br />

to the baseline.<br />

Using just the synonyms of WordNet and no additional relation(s), we observe<br />

an increase in recall (around 20%) and a relatively small decrease in precision<br />

(less than 5%). The increase in recall can easily be explained by the fact that<br />

our baseline takes only the first (and therefore: only one) synset – every term<br />

that is disambiguated to more than one synset already gets a recall of 50% or<br />

less. The decrease in precision can be explained by looking at the test samples.<br />

For some of the synsets, a synonym – especially when it comes to multi-word<br />

expressions – cannot be found in the corpus. This leads to the same weight<br />

for a number of synsets and thus to more selected synsets, even if the evidence<br />

does not increase. The precision decreases because among the selected synsets,<br />

there are more inappropriate ones. Or, the other way around: if an appropriate<br />

synset has no synonyms (or only synonyms that do not appear in the corpus),<br />

the precision decreases.<br />

alveolus#1: alveolus (137), air sac (0), air cell (0)<br />
alveolus#2: alveolus (137), tooth socket (0)<br />
Fig. 3. Synonyms for the synsets of “alveolus”, with domain-corpus frequencies in parentheses<br />



For the term “alveolus”, for instance, both noun synsets are annotated as<br />

appropriate in the gold standard. The baseline algorithm selects only the first<br />

synset and therefore gets a precision of 100% and a recall of 50%. Figure 3 shows<br />

the synonyms for the term “alveolus” graphically. In the configuration where<br />
we use just the WordNet synonyms, both synsets get the same weight, because<br />

the synonym alveolus appears 137 times in the domain corpus, and all other<br />

synonyms do not appear at all (not a single occurrence of “air sac”, “air cell”<br />

and “tooth socket”).<br />

This problem diminishes if WordNet relations are taken into account. By<br />

using WordNet relations, we increase the number of synonyms that we search in<br />

the corpus and thus increase the number of actually appearing synonyms.<br />

The relation that leads to the lowest recall is the hypernymy relation (47.1%).<br />

In general, one can speculate that this is due to the fact that a hypernym of a<br />

term does not necessarily lie in the same domain – and therefore receives a lower<br />

relevance ranking. Nevertheless, it may be a very general term that occurs very<br />

often, such that the low relevance score is compensated or even overruled by the<br />

high frequency.<br />

The term “plasma”, for instance, has three synsets, of which the first one<br />
is the most appropriate in our domain. Based on the synonyms only, our program<br />
returns all three synsets. But if we add the hypernymy relation, the third synset<br />
gets selected by our program. This mistake is due to the fact that this synset<br />
has the synset {state, state of matter} as one of its hypernyms, which<br />
does not have high domain relevance but occurs extremely often. The first synset of<br />
“plasma” has “extracellular fluid” as hypernym, which does not occur at all.<br />

The relations hyponymy, holonymy and meronymy clearly stay within the same<br />
domain. A term like “lip” is partially disambiguated by looking at its holonyms:<br />
“vessel” or “mouth”. Since “mouth” lies in the domain of human anatomy, its<br />
relevance score is higher than that of “vessel”.<br />

It is no surprise either that the topic relation, which assigns a category to<br />

synsets, is among the relations leading to high recall values. However, as many<br />

synsets do have a related topic, it does not contribute to precision.<br />

There is a clear benefit of using several relations together. This combination<br />

increases the number of included synonyms further than by using a single<br />

relation.<br />

4 Conclusions and Future Work<br />

We presented a domain-specific corpus-based approach to the lexical enrichment<br />

of ontologies, i.e. enriching a given ontology with lexical entries derived from a<br />

semantic lexicon such as WordNet. Our approach was empirically tested in an experiment<br />

on combining the FMA with WordNet synsets that were disambiguated<br />

by use of a corpus of Wikipedia pages on human anatomy. The approach was<br />

evaluated on a benchmark of 50 ambiguous FMA terms with manually assigned



WordNet synsets. Results show that the approach performs better than a most-frequent-sense<br />
baseline. Further refinements of the algorithm that include the use<br />

of WordNet relations such as hyponym, hypernym, meronym, etc. showed a much<br />

improved performance, which was again improved upon drastically by combining<br />

the best of these relations. In summary, we achieved good performance on the defined<br />

task with relatively cheap methods. This will allow us to use our approach<br />

in large-scale automatic enrichment of ontologies with WordNet derived lexical<br />

information, i.e. in the context of the OntoSelect ontology library and search<br />

engine 7 ([22]). In this context, lexically enriched ontologies will be represented<br />

by use of the LingInfo model for ontology-based lexicon representation ([15]).<br />

5 Acknowledgements<br />

We would like to thank Hans Hjelm of Stockholm University (Computational<br />

Linguistics Dept.) for making available the FMA term set and the Wikipedia<br />

anatomy corpus.<br />

This research has been supported in part by the THESEUS Program in the<br />

MEDICO Project, which is funded by the German Federal Ministry of Economics<br />

and Technology under the grant number 01MQ07016. The responsibility for this<br />

publication lies with the authors.<br />

References<br />

1. Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge<br />

Acquisition 5(2) (1993) 199–220<br />

2. Guarino, N.: Formal ontology and information systems. In Guarino, N., ed.: Formal<br />

ontology in information systems, IOS Press (1998) 3–15<br />

3. Rosse, C., Mejino Jr, J.: A reference ontology for biomedical informatics: the<br />

foundational model of anatomy. Journal of Biomedical Informatics 36(6) (2003)<br />

478–500<br />

4. Dolbey, A., Ellsworth, M., Scheffczyk, J.: BioFrameNet: A Domain-specific<br />

FrameNet Extension with Links to Biomedical Ontologies. In: Proceedings of the<br />

”Biomedical Ontology in Action” Workshop at KR-MED, Baltimore, MD, USA.<br />

(2006) 87–94<br />

5. Buitelaar, P., Sacaleanu, B.: Ranking and selecting synsets by domain relevance.<br />

Proceedings of WordNet and Other Lexical Resources: Applications, Extensions<br />

and Customizations, NAACL 2001 Workshop (2001)<br />

6. McCarthy, D., Koeling, R., Weeds, J., Carroll, J.: Finding predominant senses in<br />

untagged text. In: Proceedings of the 42nd Annual Meeting of the Association for<br />

Computational Linguistics. (2004) 280–287<br />

7. Koeling, R., McCarthy, D.: Sussx: WSD using Automatically Acquired Predominant<br />

Senses. In: Proceedings of the Fourth International Workshop on Semantic<br />

Evaluations, Association for Computational Linguistics (2007) 314–317<br />

7 http://olp.dfki.de/ontoselect/



8. Magnini, B., Cavaglia, G.: Integrating subject field codes into WordNet. Proceedings<br />

of LREC-2000, Second International Conference on Language Resources and<br />

Evaluation (2000) 1413–1418<br />

9. Magnini, B., Strapparava, C., Pezzulo, G., Gliozzo, A.: Using domain information<br />

for word sense disambiguation. Proceedings of SENSEVAL-2: Second International<br />

Workshop on Evaluating Word Sense Disambiguation Systems (2001) 111–114<br />

10. Cucchiarelli, A., Velardi, P.: Finding a domain-appropriate sense inventory for<br />

semantically tagging a corpus. Natural Language Engineering 4(04) (1998) 325–<br />

344<br />

11. Navigli, R., Velardi, P.: Automatic Adaptation of WordNet to Domains. In: Proceedings<br />

of 3rd International Conference on Language Resources and Evaluation-<br />

Conference (LREC) and OntoLex2002 workshop. (2002) 1023–1027<br />

12. Pazienza, M.T., Stellato, A.: An environment for semi-automatic annotation of<br />

ontological knowledge with linguistic content. In: Proceedings of the 3rd European<br />

Semantic Web Conference. (2006)<br />

13. Alexa, M., Kreissig, B., Liepert, M., Reichenberger, K., Rostek, L., Rautmann,<br />

K., Scholze-Stubenrecht, W., Stoye, S.: The Duden Ontology: An Integrated Representation<br />

of Lexical and Ontological Information. Proceedings of the OntoLex<br />

Workshop at LREC, Spain, May (2002)<br />

14. Gangemi, A., Navigli, R., Velardi, P.: The OntoWordNet Project: extension and<br />

axiomatization of conceptual relations in WordNet. In: Proceedings of ODBASE03<br />

Conference, Springer (2003)<br />

15. Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel,<br />

R., Romanelli, M., Sonntag, D., Loos, B., et al.: LingInfo: Design and Applications<br />

of a Model for the Integration of Linguistic Information in Ontologies. Proceedings<br />

of OntoLex 2006 (2006)<br />

16. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing.<br />

The MIT Press, Cambridge, Massachusetts (1999)<br />

17. Oberle, D., Ankolekar, A., Hitzler, P., Cimiano, P., Schmidt, C., Weiten, M., Loos,<br />

B., Porzel, R., Zorn, H.P., Micelli, V., Sintek, M., Kiesel, M., Mougouie, B., Vembu,<br />

S., Baumann, S., Romanelli, M., Buitelaar, P., Engel, R., Sonntag, D., Reithinger,<br />

N., Burkhardt, F., Zhou, J.: Dolce ergo sumo: On foundational and domain models<br />

in swinto. Journal of Web Semantics (accepted for publication) (forthcoming)<br />

18. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. Proceedings<br />

of the conference on New Methods in Language Processing 12 (1994)<br />

19. Leech, G., Rayson, P., Wilson, A.: Word Frequencies in Written and Spoken English:<br />

Based on the British National Corpus. Longman (2001)<br />

20. Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educational and Psychological<br />

Measurement 20(1) (1960) 37<br />

21. McCarthy, D., Koeling, R., Weeds, J., Carroll, J.: Using automatically acquired<br />

predominant senses for word sense disambiguation. In: Proceedings of the ACL<br />

SENSEVAL-3 workshop. (2004) 151–154<br />

22. Buitelaar, P., Eigner, T., Declerck, T.: OntoSelect: A Dynamic Ontology Library<br />

with Support for Ontology Selection. Proceedings of the Demo Session at the<br />

International Semantic Web Conference. Hiroshima, Japan (2004)


Arabic WordNet: Current State and Future Extensions<br />

Horacio Rodríguez 1 , David Farwell 1 , Javi Farreres 1 , Manuel Bertran 1 , Musa<br />

Alkhalifa 2 , M. Antonia Martí 2 , William Black 3 , Sabri Elkateb 3 , James Kirk 3 , Adam<br />

Pease 4 , Piek Vossen 5 , and Christiane Fellbaum 6<br />

1 Polytechnic University of Catalonia<br />

Jordi Girona, 1-3; 08034 Barcelona; Spain<br />

{horacio, farwell, farreres, mbertran}@lsi.upc.edu<br />

2 Universitat de Barcelona<br />

Despatx: 5.19 Edifici Josep Carner, Gran Via 585; 08007 Barcelona; Spain<br />

musa@thera-clic.com, amarti@ub.edu<br />

3 The University of Manchester<br />

PO Box 88, Sackville St; Manchester, M60 1QD; UK<br />

{w.black, sabri.elkateb, James.E.Kirk}@manchester.ac.uk<br />

4 Articulate Software Inc,<br />

420 College Ave; Angwin, CA 94508; USA<br />

apease@articulatesoftware.com<br />

5 Irion Technologies<br />

Delftechpark 26; 2628XH, Delft, The Netherlands<br />

piek.vossen@irion.nl<br />

6 Princeton University,<br />

Department of Psychology, Green Hall; Princeton, NJ 08544; USA<br />

fellbaum@clarity.princeton.edu<br />

Abstract. We report on the current status of the Arabic WordNet project and in<br />

particular on the contents of the database, the lexicographer and user interfaces,<br />

the Arabic WordNet browser, linking to the SUMO ontology, the Arabic word<br />

spotter, and techniques for semi-automatically extending Arabic WordNet. The<br />

central focus of the presentation is on the semi-automatic extension of Arabic<br />

WordNet using lexical and morphological rules.<br />

Keywords: Arabic NLP, Arabic WordNet, Ontology, Semi-automatic WordNet<br />

extension.<br />

1 Introduction<br />

Arabic WordNet (AWN – [1], [2], [3], inter alia) is currently under construction<br />

following a methodology developed for EuroWordNet [4]. The EuroWordNet<br />

approach maximizes compatibility across WordNets and focuses on the manual<br />

encoding of a set of base concepts, the most salient and important concepts as defined<br />

by various network-based and corpus-based criteria as reported in Rodríguez, et al [5].<br />

Like EuroWordNet, there is a straightforward mapping from Arabic WordNet (AWN)<br />

onto Princeton WordNet 2.0 (PWN – [6]). In addition to constructing a WordNet for<br />

Arabic, the AWN project aims to extend a formal specification of the senses of its



synsets using the Suggested Upper Merged Ontology (SUMO), a language-independent<br />
ontology. This representation is essentially an interlingua between all<br />

WordNets ([7], [8]) and can serve as the basis for developing semantics-based<br />

computational tools for cross-linguistic NLP applications 1 .<br />

The following discussion is divided into two main parts. We first present the<br />

current status of the Arabic WordNet and then we describe different techniques for<br />

semi-automatically extending AWN.<br />

2 Current State of Arabic WordNet<br />

2.1 Content of the Arabic WordNet Database<br />

At the time of writing Arabic WordNet consists of 9228 synsets (6252 nominal, 2260<br />

verbal, 606 adjectival, and 106 adverbial), containing 18,957 Arabic expressions. This<br />

number includes 1155 synsets that correspond to Named Entities which have been<br />

extracted automatically and are being checked by the lexicographers. Since these<br />

numbers are constantly changing, the interested reader can find the most up-to-date<br />

statistics at: http://www.lsi.upc.edu/~mbertran/arabic/awn/query/sug_statistics.php.<br />

2.2 Interfaces<br />

Two different web-based interfaces have been developed for the AWN project.<br />

Lexicographer's Web Interface (Barcelona)<br />

http://www.lsi.upc.edu/~mbertran/arabic/awn/update/synset_browse.php<br />

The lexicographer’s interface has been designed to support the task of adding,<br />

modifying, moving or deleting WordNet synsets. Its functionalities include:<br />

• listing the synsets assigned to each lexicographer (here, the lexicographer has<br />

many options to select from, including listing ‘completed synsets’ or ‘incomplete<br />

synsets’ or both),<br />

• listing synsets by English word,<br />

• listing synsets by synset offsets,<br />

• listing synsets by date of creation,<br />

• listing synsets without associated lexical items, or yet to be reviewed (to enhance<br />

validation, each lexicographer can review and comment on the others’ entries).<br />

User's Web Interface (Barcelona)<br />

http://www.lsi.upc.edu/~mbertran/arabic/awn/index.html<br />

1 To our knowledge the only previous attempt to build a wordnet for the Arabic language<br />

consisted of a set of experiments by Mona Diab [9] for attaching Arabic words to English<br />
synsets using only English WordNet and a parallel Arabic–English corpus as knowledge<br />
sources.



This interface enables the user to consult AWN and search for Arabic words, Arabic<br />

roots, Arabic synsets, English words, synset offsets for English WordNet 2.0. Search<br />

can be refined by selecting the appropriate part of speech. A virtual keyboard is also<br />

available for users who do not have access to an Arabic keyboard.<br />

2.3 WordNet to SUMO Mapping<br />

SUMO ([7], [10]) and its domain ontologies form the largest publicly available formal<br />

ontology today. It is formally defined and not dependent on a particular application.<br />

SUMO contains 1000 terms, 4000 axioms, 750 rules and is the only formal ontology<br />

that has been mapped by hand to all of the PWN synsets as well as to EuroWordNet<br />

and BalkaNet. However, because WordNet is much larger than SUMO, many links<br />

are from general SUMO terms to more specific WordNet synsets. As of this writing,<br />

there are 3772 equivalence mappings, 100,477 subsuming mappings, and 10,930<br />

mappings from a SUMO class to a WordNet instance. Most nouns map to SUMO<br />

classes, most verbs to subclasses of processes, most adjectives to subjective<br />

assessment attributes, and most adverbs to relations of manner. While instance<br />

mappings are often from very specific SUMO classes, SUMO itself only includes a<br />

few sets of instances, such as the countries of the world. SUMO and its associated<br />

domain ontologies have a total of roughly 20,000 terms and 70,000 axioms.<br />

The SUMO definition of the relevant synset can be viewed from the user’s web<br />

interface by using the SUMO Search Tool which relates PWN synsets to concepts in<br />

the SUMO ontology. To facilitate understanding of the ontology by Arabic speakers,<br />

the Sigma ontology management system [10] automatically generates Arabic<br />

paraphrases of its formal, logical axioms. SUMO has been extended with a number of<br />

concepts that correspond to words lexicalized in Arabic but not in English. They<br />

include concepts related to Arabic/Muslim cultural and religious practices and kinship<br />

relations. This is one way in which having a formal ontology provides an interlingua<br />

that is not limited by the lexicalization of any particular human language. For more<br />

information, see:<br />

http://sigmakee.cvs.sourceforge.net/*checkout*/sigmakee/KBs/ArabicCulture.kif<br />

2.4 The AWN Browser<br />

The Arabic WordNet Browser is a stand-alone application that can be run on any<br />

computer that has a Java virtual machine. In its current state, its main facilities include<br />

browsing AWN, searching for concepts in AWN, and updating AWN with the latest data<br />

from the lexicographers.<br />

Searching can be done using either English or Arabic. In Arabic, the search can be<br />

carried out using either Arabic script or Buckwalter transliteration [11] and can be for<br />

a word or root form, with the optional use of diacritics. For English, the browser<br />

supports a word-sense search alongside a graphical tree representation of PWN which<br />

allows a user to navigate via hyponym and hypernym relations between synsets. A<br />

combination of word-sense search and tree navigation enables a user to quickly and<br />

efficiently browse translations for English into Arabic.



Since users unfamiliar with Arabic cannot be expected to know how to convert an<br />

Arabic word they have copied from a Web page into an appropriate citation form, we<br />

have integrated Arabic morphological analysis into the search function, using a<br />

version of AraMorph [12]. A virtual Arabic keyboard is also accessible to enable<br />

Arabic script entry for the different search fields.<br />

SUMO ontology navigation is currently being integrated into the browser, using a<br />

tree traversal procedure similar to that for PWN. Users will be able to search or<br />

browse AWN using SUMO as the interlingual index between English and Arabic.<br />

Also under construction are Arabic tree navigation and the automatic generation of<br />

Arabic glosses. These additions will be included in the next release version of the<br />

browser.<br />

More detailed information and screen shots can be found at:<br />

http://www.globalwordnet.org/AWN/AWNBrowser.html<br />

The browser is available for downloading from Sourceforge under the General<br />

Public License (GPL) at: http://sourceforge.net/projects/awnbrowser/<br />

2.5 The Arabic Word Spotter<br />

An Arabic Word Spotter has been developed to provide the user with a tool to test<br />

AWN’s coverage by identifying those words in an Arabic web page that can be found<br />

in AWN. The word spotter can be accessed at:<br />

http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/<br />

Arabic words are searched for first in AWN and, failing that, in a few bilingual<br />

dictionaries. The procedure relies on the AraMorph stemmer and, once a match is<br />

found, a word level translation is provided. Translation of stop words is provided as<br />

well.<br />

Help and HowTos are available from:<br />

http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/help/help.php?<br />

3 Approaches to the Semi-automatic Extension of AWN<br />

Although the construction of AWN has been manual, some efforts have been made to<br />

automate part of the process using available bilingual lexical resources. Using lexical<br />

resources for the semi-automatic building of WordNets for languages other than<br />

English is not new. In some cases a substantial part of the work has been performed<br />

automatically, using PWN as source ontology and bilingual resources for proposing<br />

correlates. An early effort along these lines was carried out during the development of<br />

Spanish WordNet within the framework of the EuroWordNet project ([13], [5]). Later,<br />

the Catalan WordNet [14] and Basque WordNet [15] were developed following the<br />

same approach.<br />

Within the BalkaNet project [16] and the Hungarian WordNet project [17], this<br />

same methodology was followed. In this case, the basic approach was complemented<br />

by methods that relied on monolingual dictionaries. As an experiment with the<br />

Romanian WordNet, [18] follow a similar approach, but use additional knowledge


Arabic WordNet: Current State and Future Extensions 391<br />

sources including Magnini’s WordNet domains [19] and WordNet glosses. They use a<br />

set of metarules for combining the results of the individual heuristics and achieve<br />

91% accuracy for the 9610 synsets covered. Finally, to build both a Chinese WordNet<br />

and a Chinese-English WordNet, [20] complement their bilingual resources with<br />

information extracted from a monolingual Chinese dictionary.<br />

For AWN, we have investigated two different possible approaches. On the one hand,<br />

we produce lists of suggested Arabic translations for the different words contained in<br />

the English synsets corresponding to the set of Base Concepts. In this case the input to<br />

the lexicographical task is the English synset, its set of synonyms and their Arabic<br />

translations. On the other hand, we derive new Arabic word forms from already<br />

existing, manually built, Arabic verbal synsets using inflectional and derivational<br />

rules and produce a list of suggested English synset associations for each form. In this<br />

case the input is the Arabic verb, the set of possible derivates and the set of English<br />

synsets that would be linked to the corresponding Arabic synset. In both cases, the list<br />

of suggestions is manually validated by lexicographers.<br />

3.1 Suggested Translations<br />

For this approach, we start with a list of ⟨English word, Arabic word⟩ tuples<br />

extracted from several publicly available English/Arabic resources. The first step was<br />

to clean and standardize the entries. The available resources differ in many details.<br />

Some contain POS for each entry while others do not. Arabic words were in some<br />

cases vocalized and in others not. In some cases certain diacritics are used, such as<br />

shadda (i.e., consonant reduplication), while in others no diacritics at all appear. Some<br />

dictionaries contain the perfect tense form for verbs while others use the imperfect<br />

form. After this standardization process, we merged all the sources (using both<br />

directions of translation) into one single bilingual lexicon and then took the<br />

intersection of this lexicon with the set of Base Concept word forms. This latter set<br />

was built merging the Base Concepts of EuroWordNet, 1024 synsets, with those of<br />

Balkanet, 8516 synsets.<br />
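The merge-and-intersect step above can be sketched as follows (toy data; the real inputs are several full bilingual dictionaries, used in both translation directions, and the EuroWordNet/BalkaNet Base Concept word lists):<br />

```python
# Sketch: fold both translation directions into one bilingual lexicon,
# then keep only pairs whose English side belongs to a Base Concept synset.

def merge_lexicons(ar_to_en, en_to_ar):
    lexicon = set()
    for ar, en in ar_to_en:
        lexicon.add((ar, en))
    for en, ar in en_to_ar:          # reverse direction folded in
        lexicon.add((ar, en))
    return lexicon

def intersect_with_base_concepts(lexicon, base_concept_words):
    return {(ar, en) for (ar, en) in lexicon if en in base_concept_words}

lexicon = merge_lexicons([("درس", "study")],
                         [("study", "درس"), ("book", "كتاب")])
base = {"study", "learn"}            # toy Base Concept word forms
print(intersect_with_base_concepts(lexicon, base))  # {('درس', 'study')}
```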

Following the 8 heuristic procedures used in building the Spanish WordNet [21] as<br />

part of EuroWordNet [4], the associations between Arabic words and PWN synsets in<br />

the Arabic-English bilingual lexicon were scored. The methodology assigned a score<br />

to each association, but since the Arabic WordNet has been manually constructed, no<br />

threshold was set and all associations were provided to the lexicographer for<br />

verification. Thus, when editing an Arabic synset, the lexicographer begins with a<br />

suggested association, rather than an empty synset with only the English data to go<br />

by. Some suggestions were correct or very similar to correct ones. Others were<br />

incorrect but served to trigger an Arabic word that might otherwise have been missed.<br />

The result has been a much richer set of Arabic synsets.<br />

Initially 15,115 translations were suggested, of which only 9748 (64.5%) have<br />

been thus far checked by the lexicographers. The results show that of these, 392<br />

candidates (4.0%) were accepted without any changes, 1246 (12.8%) were accepted<br />

with minor changes (such as adding diacritics), 877 (9.0%), while good candidates,<br />

were rejected because they were identical or very similar to translations that had<br />

already been chosen by the lexicographer, and 7233 (74.2%) were rejected because


392 Horacio Rodríguez et al.<br />

they were incorrect given the gloss and examples. We will revise these results once all<br />

the Base concepts have been completed at the end of the project.<br />

At first glance, these results are not especially impressive and, as a result, we<br />

turned to an alternative approach. At the same time, it is difficult to compare these<br />

figures with results obtained for other languages because we are interested exclusively<br />

in generating suggestions for Base Concepts which are to be confirmed by<br />

lexicographers while other approaches do not have this objective. Since the words<br />

belonging to Base Concept synsets are often highly polysemous, the accuracy of<br />

predicting translations is generally lower. In addition, since we are more interested in<br />

high coverage, no filters were applied, with a corresponding drop in precision.<br />

3.2 Semi-automatic Extension of AWN Using Lexical and Morphological Rules<br />

In this section we explore an alternative methodology for the semi-automatic<br />

extension of Arabic WordNet using lexical rules as applied to existing AWN entries.<br />

This methodology takes advantage of a central characteristic of Arabic, namely<br />

that many words having a common root (i.e. a sequence of typically three consonants)<br />

have related meanings and can be derived from a base verbal form by means of a<br />

reduced set of lexical rules. Since AWN entries must be manually reviewed, our aim<br />

is once again not to automatically attach new synsets but rather to suggest new<br />

attachments and to evaluate whether these suggestions can help the lexicographer. As<br />

with the previous approach, we are more interested in achieving broad coverage than high<br />

accuracy, although an appropriate balance between these two measures is nonetheless<br />

desirable.<br />

3.2.1 Setting<br />

In the studies reported in this section, we deal only with a very limited but highly<br />

productive set of lexical rules which produce regular verbal derivative forms, regular<br />

nominal and adjectival derivative forms and, of course, inflected verbal forms.<br />

From most of the basic Arabic triliteral verbal entries, up to 9 additional verbal<br />

forms can be regularly derived as shown in Table 1. We refer to the set of lexical rules<br />

that account for these forms as Rule Set 1. They have been implemented as regular<br />

expression patterns.<br />

For instance, the basic form درس (DaRaSa, to study/to learn) has as its root DRS.<br />

The first form pattern in Table 1 applied to this root produces the original<br />

basic form (in this case simply adding diacritics). If we apply the second form<br />

pattern in Table 1 to the same root, the form درّس (DaRRaSa, to teach) is obtained.



Table 1: Patterns of Arabic regular derived forms<br />

Class | Arabic Pattern<br />
1 (Basic) | فعل<br />
2 | فعّل<br />
3 | فاعل<br />
4 | افعل<br />
5 | تفعّل<br />
6 | تفاعل<br />
7 | انفعل<br />
8 | افتعل<br />
9 | افعلّ<br />
10 | استفعل<br />
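Root-and-pattern derivation of this kind can be sketched with string templates (Buckwalter-style transliteration, with "~" marking the shadda; the templates shown are illustrative stand-ins for the project's actual regular-expression patterns, and only a few of the ten classes are listed):<br />

```python
# Illustrative form templates: each slot {0}{1}{2} receives one of the three
# root consonants. These are simplified stand-ins, not the project's rules.

FORM_TEMPLATES = {
    1: "{0}a{1}a{2}a",      # faEala   (basic form)
    2: "{0}a{1}~a{2}a",     # faE~ala  (Form II, e.g. DaRRaSa)
    3: "{0}A{1}a{2}a",      # fAEala   (Form III)
    10: "Asta{0}{1}a{2}a",  # AstafEala (Form X)
}

def derive(root, form):
    """Instantiate a form template with the three root consonants."""
    return FORM_TEMPLATES[form].format(*root)

root = ("d", "r", "s")               # the root DRS of DaRaSa 'to study'
print(derive(root, 1))   # darasa
print(derive(root, 2))   # dar~asa  -> DaRRaSa 'to teach'
```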

From any verbal form (whether basic or derived by Rule Set 1), both nominal and<br />

adjectival forms can also be generated in a highly systematic way: the verbal noun<br />

(masdar) as well as masculine and feminine active and passive participles. We refer to<br />

this set of rules as Rule Set 2. Examples include the masdar درس (DaRSun, lesson,<br />

study) from درس (DaRaSa, to study/to learn) and مدرّس (MuDaRRiSun, male teacher)<br />

from درّس (DaRRaSa, to teach).<br />

Finally, a set of morphological rules for each basic or derived verb form is applied<br />

in order to produce the full set of inflected verb forms as exemplified in Table 2.<br />

Table 2: Some inflected verbal forms (of 82 possible) for درس (DaRaSa, to learn)<br />

English form | Arabic form<br />
(he) learned | درس<br />
(I) learned | درست<br />
(I) learn | ادرس<br />
(he) learns | يدرس<br />
(we) learn | ندرس<br />
... | ...<br />

As reported below, these forms are especially useful for searching a corpus as well<br />

as in various applications. The number of different forms depends on the class of the<br />

verb but it ranges from 44 to 84 forms. Class 1, for instance, has 82 forms and, thus,<br />

requires the application of 82 different morphological rules. We refer to this set of<br />

rules as Rule Set 3.<br />

Beyond this, we aim to extend this basic approach to the derivation of additional<br />

forms, including the feminine form from any nominal masculine form (for instance,<br />

مدرّسة, MuDaRRiSatun, female teacher, from مدرّس, MuDaRRiSun, male teacher), or<br />

the regular plural forms from any nominal singular form. For instance, the regular<br />

nominative plural form is created by adding the suffix (Una) to the singular form<br />

(e.g., مدرّسون, MuDaRRiSUna, male teachers, is derived from مدرّس, MuDaRRiSun,<br />

male teacher).<br />
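The planned suffixation rules can be sketched as follows (Buckwalter-style transliteration; the suffix spellings "ap" for ta marbuta and "uwna" for the nominative plural ending -Una are illustrative assumptions):<br />

```python
# Sketch of the planned extensions: feminine and sound masculine plural
# derivation by suffixation on a Buckwalter-transliterated singular noun.

def feminine(masc_noun):
    return masc_noun + "ap"      # mudar~is -> mudar~isap (female teacher)

def sound_plural(sing_noun):
    return sing_noun + "uwna"    # mudar~is -> mudar~isuwna (male teachers)

print(feminine("mudar~is"))      # mudar~isap
print(sound_plural("mudar~is"))  # mudar~isuwna
```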



3.2.2 Central Problems to Address<br />

Implementing the ideas stated in the previous section is not straightforward. Several<br />

problems have to be addressed but perhaps the two most important are 1) filtering<br />

noise caused by over the generation of derivative verb forms and 2) mapping the<br />

newly created Arabic word forms to appropriate WordNet synsets, i.e., mapping<br />

words to their appropriate sense. Obviously not all the derivative forms generated by<br />

Rule Sets 1 and 2 are valid for any given basic verbal form in Arabic. For instance,<br />

for درس (DaRaSa, to learn) of the nine possible derivates generated by the application<br />

of Rule Set 1, shown in Table 1, only the six shown in Table 3 are valid according to<br />

[22]. Thus, some kind of filtering has to take place in order to reduce the noise<br />

wherever possible. That is to say, only the most promising candidates should be<br />

proposed to the lexicographer. In addition, once the set of candidate derivates has<br />

been built and the corresponding nominal and adjectival forms generated, we have to<br />

map all these forms to English translations and from these to the appropriate PWN<br />

synsets.<br />

Table 3: Valid derivates from درس (DaRaSa, to learn)<br />

Class | English form | Arabic form<br />
1 (basic) | to learn, to study | درس<br />
2 | to teach | درّس<br />
3 | to study (together with someone) | دارس<br />
4 | to learn with | ادرس<br />
6 | to study (carefully together) | تدارّس<br />
7 | to vanish | اندرس<br />

3.2.3 Resources<br />

The procedures described below make use of the following resources:<br />

• Princeton’s English WordNet 2.0,<br />

• Arabic WordNet (specifically the set of Arabic verbal synsets currently<br />

available),<br />

• the LOGOS database of Arabic verbs which contains 944 fully conjugated Arabic<br />

verbs (available at:<br />

http://www.logosconjugator.org/verbi_utf8/all_verbs_index_ar.html),<br />

• the NMSU bilingual Arabic-English lexicon (available at:<br />

http://crl.nmsu.edu/Resources/dictionaries/download.php?lang=Arabic),<br />

• the Arabic GigaWord Corpus (available through LDC:<br />

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02).<br />

3.2.4 Overview of the Approach<br />

Broadly speaking, the procedure we follow in generating a set of likely ⟨Arabic word, PWN synset⟩ pairs is to:<br />

1. produce an initial list of candidate word forms (as described in Section<br />

3.2.6),<br />

2. filter the less likely candidates from this list (as described in Section 3.2.7),



3. generate an initial list of candidate synset attachments (as described in<br />

Section 3.2.8),<br />

4. score the reliability of these candidates (as described in Section 3.2.9),<br />

5. manually review the candidates and include the valid associations in AWN.<br />

3.2.4.1 Building the initial set of word candidates<br />

To build the initial set of candidate word forms, we first collect a set of basic (Class 1)<br />

verb forms, such as درس (DaRaSa, to learn), from the existing 2296 verbs in AWN<br />

and transliterate them using Buckwalter encoding [11]. We next apply Rule Set 1 to<br />

generate the 9 basic derivative verb forms (whether valid or not). Then, for each of these<br />

new verb forms, we apply Rule Set 3 in order to derive the full set of possible<br />

inflected forms.<br />

3.2.4.2 Learning filters on translations<br />

In order to determine whether or not a particular possible word form is likely to turn<br />

out to be a valid word form, we build a decision tree classifier using machine learning<br />

for each of the 9 classes of derivation (i.e. Classes 2 through 10). The choice of<br />

decision trees was mainly motivated by their ease of interpretation; they provided<br />

results similar to those of AdaBoost, an alternative approach which we also<br />

tested. We used the C5.0 implementation within the Weka toolbox [23]. The<br />

software can be obtained from: http://www.cs.waikato.ac.nz/~ml/weka/index.html.<br />

The features used for learning included the following:<br />

1. the relative frequency of each inflected form for a given class of derivatives<br />

in the GigaWord Corpus,<br />

2. whether the base form appears in the NMSU dictionary or not,<br />

3. the POS tag of the base form in NMSU dictionary,<br />

4. the class attribute: TRUE (positive example) or FALSE (negative example).<br />

In order to learn a decision tree, the algorithm must be presented with both positive<br />

and negative examples. For positive examples, we used the LOGOS database (946<br />

examples), AWN (2296 examples) and the NMSU dictionary (15,654 examples).<br />

LOGOS and AWN are the most accurate but do not provide enough material. NMSU<br />

has broad coverage but is less accurate because the entries are not vocalized and lack<br />

diacritics (for some classes the lack of the shadda diacritic² is a serious problem).<br />

To build the training set, we matched each inflected form for each of the base<br />

forms (basic or derived) against the GigaWord Corpus and the NMSU dictionary in<br />

order to extract the relevant features for learning. Finally, we selected all the base<br />

forms corresponding to the word forms that occurred in the resources as positive<br />

examples, and used the remaining forms (i.e., those that do not occur in either the<br />

GigaWord Corpus or in the NMSU dictionary) as negative examples. All other forms<br />

are discarded. Table 4, for instance, shows the size of the training set used for learning<br />

the filter for Class 7.<br />

2 In Arabic, shadda is how consonant reduplication or gemination is marked. Obviously, if this<br />

diacritic is lost, the correct orthographic form of a word is affected.



Table 4: Size of training set for learning the Class 7 filter<br />

Logos AWN NMSU Total<br />

positive 8 24 1718 1750<br />

negative 70 0 4856 4926<br />

Total 78 24 6574 6676<br />

Following this general procedure, a decision tree classifier was learned for each<br />

class of derivation (in fact, only 8 filters were learned because there were too few<br />

examples for Class 9). We applied 10-fold cross-validation. The results for all the<br />

classifiers but one achieved F1 values above 99%, although in some cases the resultant<br />

decision tree consisted of only a single query on the occurrence of the base form in<br />

the NMSU dictionary (i.e. the form was accepted simply if it occurs in the NMSU<br />

dictionary).<br />
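The degenerate case just mentioned, where the learned tree reduces to a single test on NMSU membership, can be sketched as a decision stump over the attributes listed above (toy training data; the project itself used C5.0 within the Weka toolbox, not this code):<br />

```python
# A one-split "decision tree" (stump) on the in_nmsu attribute: the majority
# label on each side of the split becomes that side's prediction.

def learn_stump(examples):
    """examples: list of (rel_freq, in_nmsu, pos_tag, label)."""
    def majority(rows):
        labels = [lab for *_, lab in rows]
        return max(set(labels), key=labels.count) if labels else False
    yes = [e for e in examples if e[1]]
    no = [e for e in examples if not e[1]]
    return {True: majority(yes), False: majority(no)}

def predict(stump, rel_freq, in_nmsu, pos_tag):
    return stump[in_nmsu]

train = [(0.30, True, "V", True), (0.10, True, "V", True),
         (0.00, False, "?", False), (0.01, False, "?", False)]
stump = learn_stump(train)
print(predict(stump, 0.05, True, "V"))   # True  (form occurs in NMSU)
print(predict(stump, 0.05, False, "?"))  # False
```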

3.2.4.3 Building the list of candidate synset attachments<br />

To build a list of candidate synset attachments, we first generate a list of possible base<br />

verb forms by applying the filters described above. We then apply Rule Set 2 to each<br />

of the base verb forms to generate the set of related Arabic noun and adjective forms.<br />

Only those forms occurring in the NMSU dictionary with English equivalents occurring<br />

in PWN are retained. For each of these word forms, all the English translations from<br />

the NMSU dictionary and all their PWN synsets are collected as candidates. The result of<br />

this process is a candidate set of tuples of the form ⟨Arabic word, English word, PWN synset⟩. The final step is to assign a reliability score to each tuple.<br />
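The fan-out from filtered forms to candidate tuples can be sketched as follows (toy dictionary fragments; the real step uses the NMSU dictionary and PWN 2.0 sense inventories):<br />

```python
# Sketch: each derived form found in the bilingual dictionary fans out to
# every English translation, and each translation to every PWN synset.

def candidate_tuples(derived_forms, nmsu, pwn_senses):
    candidates = []
    for ar in derived_forms:
        for en in nmsu.get(ar, []):             # form must occur in NMSU
            for syn in pwn_senses.get(en, []):  # translation must occur in PWN
                candidates.append((ar, en, syn))
    return candidates

nmsu = {"tdrys": ["teaching"]}                  # toy Buckwalter form -> English
pwn = {"teaching": ["00834401", "00831015"]}    # toy English -> synset offsets
print(candidate_tuples(["tdrys", "xxxx"], nmsu, pwn))
# [('tdrys', 'teaching', '00834401'), ('tdrys', 'teaching', '00831015')]
```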

3.2.4.4 Scoring the candidate synset attachments<br />

Our scoring routine is based on the observation that in most cases the set of derivative<br />

forms have semantically related senses. For instance, درس (DaRaSa, to study) belongs<br />

to Class 1 and its masdar is درس (DaRSun, lesson). درّس (DaRRaSa, to teach) belongs<br />

to Class 2 and its masculine active participle is مدرّس (MuDaRRiSun, male teacher).<br />

Clearly these four words are semantically related. Therefore, if we map Arabic words<br />

to English translations and then to the corresponding PWN synsets, we can expect that<br />

the correct assignments will correspond to most semantically related synsets. In other<br />

words, the most likely ⟨Arabic word, PWN synset⟩ associations are those<br />

corresponding to the most semantically related items.<br />

There are three levels of connections to be considered³:<br />

• relations between an Arabic word and its English translations,<br />

• relations between an English word and its PWN synsets,<br />

• relations between a PWN synset and other synsets in PWN.<br />

3 The relations A base -> A i have not been considered explicitly because A base comes from an<br />

existing AWN synset and thus its association has already been established manually.



To identify the “most semantically related” associations between Arabic words and<br />

PWN synsets, we:<br />

1. collect the set of ⟨Arabic word, English word, PWN synset⟩ tuples for a<br />

given Arabic base verb form and its derivatives,<br />

2. extract the set of English synsets and identify all the existing semantic<br />

relations between these synsets in PWN⁴,<br />

3. build a graph with three levels of nodes corresponding to Arabic words,<br />

English words, and English synsets respectively and edges corresponding to<br />

the translation relation between Arabic words and English words, the<br />

membership relation between English words and PWN synsets and finally,<br />

the recovered relations between PWN synsets.<br />

These are represented in the graph in Figure 1.<br />

[Figure 1: a three-level graph with Arabic word forms (A base , A 1 , …, A n ) on one side, their English translations (E 1 , …, E m ) in the middle, and PWN synsets (S 1 , …, S p ) on the other.]<br />

Fig. 1. Example of graph of dependencies<br />
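The three-level graph of Figure 1 can be sketched as a plain adjacency structure (toy identifiers; synset relations are stored undirected, matching their use in the scoring procedure):<br />

```python
# Sketch of the tripartite graph: Arabic words -> English translations ->
# PWN synsets, plus undirected synset-synset relation edges.

from collections import defaultdict

def build_graph(translations, senses, synset_relations):
    g = defaultdict(set)
    for ar, en in translations:          # level 1: Arabic word -> English word
        g[("ar", ar)].add(("en", en))
    for en, syn in senses:               # level 2: English word -> PWN synset
        g[("en", en)].add(("syn", syn))
    for s1, s2 in synset_relations:      # level 3: synset <-> synset
        g[("syn", s1)].add(("syn", s2))
        g[("syn", s2)].add(("syn", s1))
    return g

g = build_graph([("درس", "learn"), ("درس", "study")],
                [("learn", "00580363"), ("study", "00580363")],
                [("00578275", "00587299")])
print(("en", "learn") in g[("ar", "درس")])   # True
```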

Two approaches to scoring are being examined. The first, described below, is<br />

based on a set of heuristics that use the graph structure directly while the second,<br />

more complex, maps the graph onto a Bayesian Network and applies a learning<br />

algorithm. The latter approach is the subject of ongoing research and will be described<br />

in a separate forthcoming paper.<br />

Using the graph as input, the first approach to calculating the reliability of<br />

association between Arabic word and PWN synset consists of simply applying a set of<br />

five graph traversal heuristics. The heuristics are as follows (note that in what follows,<br />

“A base ”, “A 1 ”, “A 2 ”, etc., correspond to Arabic word forms, A base being the initial<br />

verbal base form, “E”, “E 1 ”, “E 2 ”, etc. to English word forms, and “S”, “S 1 ”, “S 2 ”, etc.<br />

to PWN synsets):<br />

1. If a unique path A-E-S exists (i.e., A is only translated as E), and E is<br />

monosemous (i.e., it is associated with a single synset), then the output tuple ⟨A, S⟩ is tagged as 1. See Figure 2.<br />

4 As in the rest of the experiments reported in this paper, we have used the relations present in<br />

PWN 2.0.




Fig. 2. Graph for heuristic 1<br />

2. If multiple paths A-E 1 -S and A-E 2 -S exist (i.e., A is translated as E 1 or E 2 and<br />

both E 1 and E 2 are associated with S among other possible associations) then the<br />

output tuple ⟨A, S⟩ is tagged as 2. See Figure 3.<br />


Fig. 3. Graph for heuristic 2<br />

3. If S in A-E-S has a semantic relation to one or more synsets, S 1 , S 2 … that have<br />

already been associated with an Arabic word on the basis of either heuristic 1 or<br />

heuristic 2, then the output tuple ⟨A, S⟩ is tagged as 3. See Figure 4.<br />


Fig. 4. Graph for heuristic 3<br />

4. If S in A-E-S has some semantic relation with S 1 , S 2 … where S 1 , S 2 … belong to<br />

the set of synsets that have already been associated with related Arabic words,<br />

then the output tuple ⟨A, S⟩ is tagged as 4. In this case there is only one<br />

translation E of A but more than one synset associated with E. This heuristic can<br />

be sub-classified by the number of input edges or supporting semantic relations<br />

(1, 2, 3, ...). See Figure 5.




Fig. 5. Graph for heuristic 4<br />

5. Heuristic 5 is the same as heuristic 4 except that there are multiple translations<br />

E 1 , E 2 , … of A and, for each translation E i , there are possibly multiple associated<br />

synsets S i1 , S i2 , …. In this case the output tuple ⟨A, S⟩ is tagged as 5 and again<br />

the heuristic can be sub-classified by the number of input edges or supporting<br />

semantic relations (1, 2, 3 ...). See Figure 6.<br />


Fig. 6. Graph for heuristic 5<br />
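Heuristics 1 and 2 can be sketched directly as tests on the translation and sense mappings (toy data; the real scorer operates on the full graph of Figure 1 and also applies heuristics 3 to 5):<br />

```python
# Heuristic 1: unique monosemous path A-E-S. Heuristic 2: two different
# translations of A share the same synset S.

def heuristic_tag(arabic, translations, senses):
    """Return {synset: tag} for a word's candidate synsets (tags 1 and 2 only)."""
    en_words = translations[arabic]
    tags = {}
    if len(en_words) == 1 and len(senses[en_words[0]]) == 1:
        tags[senses[en_words[0]][0]] = 1            # unique path, monosemous E
        return tags
    seen = {}
    for e in en_words:
        for s in senses[e]:
            seen.setdefault(s, set()).add(e)
    for s, supporters in seen.items():
        if len(supporters) >= 2:                    # S reachable via E1 and E2
            tags[s] = 2
    return tags

translations = {"درس": ["learn", "study"]}
senses = {"learn": ["00580363", "00584743"], "study": ["00580363", "00623929"]}
print(heuristic_tag("درس", translations, senses))   # {'00580363': 2}
```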

3.2.5 A Detailed Example<br />

Consider once more the case of the verb درس (DaRaSa, to learn). From the 9 forms<br />

obtained by applying Rule Set 1 to the basic form, the filter accepts the Classes 2, 4<br />

and 7 (cf. Table 3 above). Here we look at the basic form and the<br />

Class 2 derivate. We begin by collecting the following tuples using the NMSU<br />

dictionary and PWN:



درس: learn: ['00580363', '00584743', '00579325', '00578275', '00801981', '00890179']: verb<br />
درس: study: ['00580363', '02104471', '00587590', '00623929', '00587299', '00681070']: verb<br />
درّس: instruct: ['00801981', '00725200', '00803912']: verb<br />
درّس: teach: ['00801981', '00264843']: verb<br />
درس: teach: ['10599680']: noun<br />
درس: study: ['05374971', '06775158', '05422945', '00608171', '04177786', '05644624', '04065428', '05450040', '09971266', '06616749']: noun<br />
درس: lesson: ['00836504', '06262123', '06198025', '00686199']: noun<br />
مدروس: studied: ['01738792', '01782596']: adjective<br />
دارس: researcher: ['09837494']: noun<br />
دارس: studying: ['06190701']: noun<br />
دارس: student: ['09970518', '09869332']: noun<br />
درّس: study: ['00580363', '02104471', '00587590', '00623929', '00587299', '00681070']: verb<br />
تدريس: teaching: ['00834401', '05811310', '00831015']: noun<br />
تدريس: instruction: ['06369463', '00831015', '00834401', '06178338']: noun<br />
تدريس: faculty: ['05325039', '07787222']: noun<br />
مدرس: school: ['07777509', '05424562', '03989548', '07776854', '14342474', '07775337', '07512364']: noun<br />
مدرس: teacher: ['09997151', '05515561']: noun<br />
مدرس: instructor: ['09997151']: noun<br />

Between the synsets identified above, the following relations hold:<br />

07776854 has as a member 07787222<br />

07787222 is a member of 07776854<br />

00801981 cause 00578275<br />

00686199 is a part of 00831015<br />

00831015 has as a part 00686199<br />

00578275 is a type of 00587299<br />

00587299 has as a type 00578275<br />

00587299 is a type of 00584743<br />

00584743 has as a type 00587299<br />

00834401 is a type of 00836504<br />

00836504 has as a type 00834401<br />

Using these relations, we build an undirected graph where nodes correspond to<br />

synsets and edges to semantic relations between synsets. Table 5 shows the 12<br />

candidate associations generated, of which 9 are deemed correct by the lexicographers.<br />

Note that no candidates have been selected on the basis of heuristic 1 or heuristic<br />

4. Note also that subclasses of heuristic 5 (rows 9 to 12) are somewhat overvalued<br />

because nodes connected by relations with inverses are counted twice.



Table 5: Candidates for Class 1 and 2 derivates of درس (DaRaSa, to learn)<br />

# | Buckwalter | POS | Synset Off. | Class | Arabic form | Lex. Judge<br />
1 | drs | verb | 580363 | 2 | درس | ok<br />
2 | drs | verb | 801981 | 2 | درس | ok<br />
3 | tdrys | noun | 834401 | 2 | تدريس | ok<br />
4 | tdrys | noun | 831015 | 2 | تدريس | ok<br />
5 | mdrs | noun | 9997151 | 2 | مدرس | ok<br />
6 | drs | noun | 836504 | 3 | درس | ok<br />
7 | drs | noun | 686199 | 3 | درس | ok<br />
8 | drs | verb | 578275 | 3 | درس | ok<br />
9 | drs | verb | 587299 | 5,5 | درس | ok<br />
10 | drs | verb | 584743 | 5,3 | درس | no<br />
11 | mdrs | noun | 7776854 | 5,3 | مدرس | no<br />
12 | tdrys | noun | 7787222 | 5,3 | تدريس | no<br />

The first row in Table 5 corresponds to the tuple ⟨درس, 580363⟩. It has been<br />

selected on the basis of heuristic 2 because the synset 580363 occurs in both:<br />

درس: to learn: ['00580363', …]<br />
درس: to study: ['00580363', …].<br />

The sixth row of Table 5 corresponds to the tuple ⟨درس, 836504⟩. In this case,<br />

heuristic 3 can be applied because in<br />

درس: lesson: ['00836504', …]<br />

the synset 00836504 is related to the synset 00834401 by a hyponymy relation:<br />

00836504 has as a type 00834401<br />

which, in turn, has been suggested on the basis of heuristic 2 (see row 3 in Table 5).<br />

Finally, consider the tuple ⟨درس, 00587299⟩ in row 9 of Table 5. This is an example<br />

of the application of heuristic 5. In<br />

درس: to study: […, '00587299', …]<br />

the synset 00587299 receives support from (among others):<br />

00578275 is a type of 00587299<br />
00584743 has as a type 00587299<br />

where 00578275 and 00584743 have been associated with other derivative forms of<br />

درس (DaRaSa, to learn), as shown in rows 8 and 10 respectively of Table 5.<br />

3.2.6 Evaluation<br />

To perform an initial evaluation of this approach, we randomly selected 10 of the<br />

2296 verbs currently in AWN that have non-null coverage and which satisfy all the<br />

requirements above. In addition, for the purpose of illustration, we added the verb<br />

درس (DaRaSa, to learn) as a known example. The process for building the candidate<br />

set of Arabic form-synset associations described in Section 3.2.4 was applied to each<br />

of the 11 basic verb forms, resulting in 11 sets of candidate tuples. The sizes in words<br />

and synsets are presented in Table 6.<br />



Table 6: Size of the candidate sets for testing<br />

Arabic form | # of words | # of synsets<br />
عَامَلَ | 107 | 190<br />
أَعْقَبَ | 71 | 77<br />
صَقَلَ | 31 | 21<br />
رَتَّبَ | 62 | 102<br />
أَخَّرَ | 19 | 9<br />
أَخْبَرَ | 80 | 105<br />
رَشَّحَ | 40 | 22<br />
غَامَرَ | 56 | 49<br />
أَشْبَعَ | 38 | 34<br />
أَخْرَجَ | 85 | 140<br />
دَرَّسَ | 57 | 51<br />

Each of the tuples was then scored following the procedure described in Section<br />

3.2.4.4. We did not introduce a threshold and so the whole list of candidates, ordered<br />

by reliability score, was evaluated by a lexicographer. The results are presented in<br />

Table 7. Here, the first column indicates the scoring heuristic applied, the second the<br />

number of instances to which it applied, the third the number of instances judged<br />

acceptable by the lexicographer, the fourth the number of instances judged<br />

unacceptable, and the fifth the percentage correct.<br />
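The percentage column in Table 7 is simply accepted candidates over applied candidates; for heuristic 2, for instance, 27 of 42 candidates were accepted:<br />

```python
# Per-heuristic accuracy as reported in Table 7: accepted / applied,
# rounded to the nearest whole percent.

def pct_correct(n_ok, n_no):
    total = n_ok + n_no
    return round(100 * n_ok / total) if total else 0

print(pct_correct(27, 15))    # 64  (heuristic 2)
print(pct_correct(135, 137))  # 50  (overall)
```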

These results are very encouraging especially when compared with the results of<br />

applying the EuroWordNet heuristics reported in Section 3.1. While the sample is<br />

clearly insufficient (for instance, there are no instances of the application of heuristic<br />

1 and too few examples of heuristic 3), with few exceptions the reliability scores<br />

follow the expected trend (heuristics 2 and 3 perform better than heuristic 4<br />

and the latter better than heuristic 5). It is also worth noting that heuristic 3, the first<br />

that relies on semantic relations between synsets in PWN, outperforms heuristic 2.<br />

However, we have not attempted to establish statistical significance because of the<br />

small size of the test set. Otherwise, an initial manual analysis of the errors shows that<br />

several are due to the lack of diacritics in the resources.<br />

Currently we are extending the coverage of the test set. We will then repeat the<br />

entire procedure using only dictionaries containing diacritics. We are also planning to<br />

refine the scoring procedure by assigning different weights to the different semantic<br />

relations between synsets. In addition, we expect to compare this approach with that<br />

based on Bayesian Networks mentioned earlier.



Table 7: Results of the evaluation of proposed Arabic word-PWN<br />

synset associations<br />

Heuristic | # | # ok | # no | % correct<br />
1 | 0 | 0 | 0 | 0<br />
2 | 42 | 27 | 15 | 64<br />
3 | 19 | 13 | 6 | 68<br />
4,1 | 0 | 0 | 0 | 0<br />
4,2 | 7 | 4 | 3 | 57<br />
4,3 | 9 | 5 | 4 | 56<br />
4,4 | 2 | 1 | 1 | 50<br />
4,5 | 2 | 1 | 1 | 50<br />
4,6 | 0 | 0 | 0 | 0<br />
4,7 | 1 | 0 | 1 | 0<br />
5,1 | 0 | 0 | 0 | 0<br />
5,2 | 63 | 32 | 31 | 51<br />
5,3 | 109 | 41 | 68 | 38<br />
5,4 | 4 | 4 | 0 | 100<br />
5,5 | 10 | 6 | 4 | 60<br />
5,6 | 1 | 1 | 0 | 100<br />
5,7 | 2 | 0 | 2 | 0<br />
5,13 | 1 | 0 | 1 | 0<br />
Total | 272 | 135 | 137 | 50<br />

4 Outlook and Conclusion<br />

We have presented the current state of Arabic WordNet and described some<br />

procedures for semi-automatically extending AWN’s coverage. On the one hand, the<br />

procedure for suggesting translations on the basis of 8 heuristics used for<br />

EuroWordNet was presented and discussed. On the other, we described a set of<br />

procedures for the semi-automatic extension of AWN using lexical and morphological<br />

rules and provided the results of their initial evaluation.<br />

We hope that work will continue on augmenting the AWN database by both<br />

manual and automatic means even after the current project ends. We welcome ideas,<br />

suggestions, and expressions of interest in contributing or collaborating on both<br />

further extension of the lexical database as well as on development of related<br />

software. Finally, we are looking forward to a wide range of NLP applications that<br />

make use of this valuable resource.


404 Horacio Rodríguez et al.<br />

Acknowledgement<br />

This work was supported by the United States Central Intelligence Agency.<br />

References<br />

1. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.:<br />

Introducing the Arabic WordNet Project. In: Proceedings of the Third International<br />

WordNet Conference (2006)<br />

2. Elkateb, S.: Design and implementation of an English Arabic dictionary/editor. PhD thesis,<br />

The University of Manchester, United Kingdom (2005)<br />

3. Elkateb, S., Black, W., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.:<br />

Building a WordNet for Arabic. In: Proceedings of the Fifth International Conference on<br />

Language Resources and Evaluation. Genoa, Italy (2006)<br />

4. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks.<br />

Dordrecht: Kluwer Academic Publishers (1998)<br />

5. Rodríguez, H., Climent, S., Vossen, P., Bloksma, L., Peters, W., Roventini, A., Bertagna, F.,<br />

Alonge, A.: The top-down strategy for building EuroWordNet: Vocabulary coverage, base<br />

concepts and top ontology. J. Computers and the Humanities, Special Issue on EuroWordNet<br />

32, 117–152 (1998)<br />

6. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA<br />

(1998)<br />

7. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In: Proceedings of FOIS 2001, pp.<br />

2–9. Ogunquit, Maine. (See also www.ontologyportal.org) (2001)<br />

8. Vossen, P.: EuroWordNet: a multilingual database of autonomous and language specific<br />

wordnets connected via an Inter-Lingual-Index. J. International Journal of Lexicography<br />

17(2), 161–173 (2004)<br />

9. Diab, M.: The Feasibility of Bootstrapping an Arabic WordNet leveraging Parallel Corpora<br />

and an English WordNet. In: Proceedings of the Arabic Language Technologies and<br />

Resources. NEMLAR, Cairo (2005)<br />

10. Pease, A.: The Sigma Ontology Development Environment. In: Working Notes of the<br />

IJCAI-2003 Workshop on Ontology and Distributed Systems. Volume 71 of CEUR<br />

Workshop Proceedings series (2003)<br />

11. Buckwalter, T.: Arabic transliteration. http://www.qamus.org/transliteration.htm. (2002)<br />

12. Brihaye, P.: AraMorph: http://www.nongnu.org/aramorph/ (2003)<br />

13. Farreres, J.: Creation of wide-coverage domain-independent ontologies. PhD thesis,<br />

Universitat Politècnica de Catalunya (2005)<br />

14. Benítez, L., Cervell, S., Escudero, G., López, M., Rigau, G., Taulé, M.: Methods and tools<br />

for building the Catalan WordNet. In: Proceedings of LREC Workshop on Language<br />

Resources for European Minority Languages (1998)<br />

15. Agirre, E., Ansa, O., Arregi, X., Arriola, J., de Ilarraza, A. D., Pociello, E., Uria, L.:<br />

Methodological issues in the building of the Basque WordNet: Quantitative and qualitative<br />

analysis. In: Proceedings of the first International WordNet Conference, 21-25 January<br />

2002. Mysore, India (2002)<br />

16. Tufis, D. (ed.): Special Issue on the Balkanet Project. Romanian Journal of Information<br />

Science and Technology Special Issue 7(1–2) (2004)<br />

17. Miháltz, M., Prószéky, G.: Results and evaluation of Hungarian nominal WordNet v1.0. In:<br />

Proceedings of the Second International WordNet Conference (<strong>GWC</strong> 2004), pp. 175–180.<br />

Masaryk University, Brno (2003)



18. Barbu, E., Barbu-Mititelu, V. B.: A case study in automatic building of wordnets. In:<br />

Proceedings of OntoLex 2005 - Ontologies and Lexical Resources (2005)<br />

19. Magnini, B., Cavaglia, G.: Integrating Subject Field Codes into WordNet. In: Gavrilidou<br />

M., Crayannis, G., Markantonatu, S., Piperidis, S., Stainhaouer, G. (eds.) Proceedings of the<br />

Second International Conference on Language Resources and Evaluation, pp. 1413–1418.<br />

Athens, Greece, 31 May–2 June 2000 (2000)<br />

20. Chen, H., Lin, C., Lin, W.: Building a Chinese-English WordNet for translingual<br />

applications. J. ACM Transactions on Asian Language Information Processing 1 (2), 103–<br />

122 (2002)<br />

21. Farreres, J., Rodríguez, H., Gibert, K.: Semiautomatic creation of taxonomies. In: SemaNet'02:<br />

Building and Using Semantic Networks, in conjunction with COLING 2002, August 31,<br />

Taipei, Taiwan (2002)<br />

22. Wehr, H.: Arabic-English Dictionary. Cowan, J.M. (ed.) (1976)<br />

23. Witten, I.H., Frank, E.: Data mining: Practical machine learning tools and techniques<br />

(second edition). Morgan Kaufmann: San Francisco, CA (2005)


Building a WordNet for Persian Verbs<br />

Masoud Rouhizadeh, Mehrnoush Shamsfard, and Mahsa A. Yarmohammadi<br />

Natural Language Processing Laboratory, Shahid Beheshti University, Tehran, Iran<br />

m.rouhizadeh@mail.sbu.ac.ir, m-shams@sub.ac.ir, m_yarmohammadi@std.sbu.ac.ir<br />

Abstract. This article is a report of an ongoing project to develop a WordNet<br />

for Persian Verbs. To build this WordNet we apply the expand approach used in<br />

EuroWordNet and BalkaNet. We are now building the core WordNet of Persian<br />

verbs by translating the verbs of BalkaNet Concept Sets 1, 2 and 3. The<br />

translation process includes automatically suggesting Persian equivalents of English<br />

synsets in our WordNet editor, followed by their manual refinement by a linguist<br />

using different dictionaries and corpora. We are also adding the frequent<br />

Persian verbs that are not included in the sets using an electronic Persian<br />

corpus. This core WordNet will be extended (semi)automatically. The most<br />

important fact about Persian verbs is that most of them are compound rather<br />

than simple. Compound verbs in Persian are formed in two major patterns:<br />

combination and incorporation. In many cases the compound verbs are<br />

semantically transparent, that is, the meaning of the compound verb is a function<br />

of the meanings of its verbal and non-verbal constituents. This suggests that<br />

many verbs in Persian WordNet can be directly connected to their non-verbal<br />

constituent in Persian WordNet and so inherit the existing relations among<br />

those words too.<br />

1 Introduction<br />

Persian is the official language of three countries and is also spoken in more than<br />

six others. There is no doubt about the necessity of basic NLP resources and tools, such<br />

as a standard lexicon, for this widely spoken language. The WordNet of Persian verbs is<br />

an ongoing project to provide a part of the Persian WordNet, a powerful tool for<br />

Persian NLP applications.<br />

The Persian verbs WordNet closely follows the lines and principles of Princeton<br />

WordNet, EuroWordNet and BalkaNet to maximize its compatibility with these<br />

WordNets and to be connected to the other WordNets in the world for cross-linguistic<br />

applications such as MT and multilingual dictionaries and thesauri. It also aims to be<br />

merged with the other existing WordNets of Persian nouns [1] and Persian adjectives [2].<br />

In this article, we first give an overview of the methodology we take, the lexical<br />

resources we are using, and the building process. Finally, we point to the most<br />

important characteristic of Persian verbs, namely that they are mostly compound verbs<br />

consisting of a verbal and a non-verbal constituent. This feature leads to a WordNet<br />

with rich cross-part-of-speech connections.



2 Methodology<br />

We are constructing the Persian Verbs WordNet according to the methods applied for<br />

EuroWordNet [3], [4], an approach widely used in many WordNets. This<br />

approach maximizes compatibility across WordNets and at the same time preserves<br />

the language specific structures of Persian. We follow the expand strategy in which a<br />

core WordNet should be developed manually and then extended (semi)automatically<br />

[5]. To develop the core WordNet of Persian verbs we are manually translating the<br />

verbs of BalkaNet Concept Sets 1, 2 and 3 (BCS1, BCS2 and BCS3) [6]. We are also<br />

adding the frequent Persian verbs that are not included in the sets using the electronic<br />

Persian corpus [7]. Adding hypernyms and the first-level hyponyms to these verb<br />

Base Concepts will result in the core WordNet of Persian verbs. This core WordNet<br />

will then be extended (semi)automatically using the available resources, e.g.<br />

monolingual and bilingual dictionaries, lexicons, ontologies, thesauri, etc.<br />

3 Building process and lexical resources<br />

In this project we are making use of a machine-readable dictionary to suggest the<br />

Persian equivalents of PWN synsets in our WordNet editor, VisDic. In the next step,<br />

the suggestions are refined, using our linguistic knowledge of English and Persian and<br />

English-Persian Millennium Dictionary [8], the most reliable English-to-Persian<br />

dictionary. Then we refer to Anvari [9], a Persian monolingual dictionary, to check<br />

the consistency and correctness of our equivalents. We also use the Persian<br />

Linguistic Database (PLDB) [7], an on-line database for contemporary<br />

(Modern) Persian. The database contains more than 16,000,000 words of all varieties<br />

of the Modern Persian language in the form of running texts. Some of the texts are<br />

annotated with grammatical, pronunciation and lemmatization tags. Special and<br />

powerful software provides different types of search and statistical listing facilities<br />

through the whole database or any selective corpus made up of a group of texts. The<br />

database is constantly improved and expanded. It provides us with a means of handling<br />

various types of texts to determine the frequency of verbs and helps us find and add<br />

the verbs that are not included in BalkaNet Concept Sets 1, 2 and 3. (At this time we<br />

have translated verbs from BCS1 and BCS2).<br />

3.1 The editor<br />

The editor we use to build our WordNet is the BalkaNet multilingual viewer and editor,<br />

VisDic [6]. It is a graphical application for viewing and editing WordNet lexical<br />

databases stored in XML format. Most of the program behavior and the dictionary<br />

design can be configured.<br />

Figure 1 shows the View tab of the VisDic editor for the verb " درس دادن " ‘to teach’.<br />

The POS, ID, synonyms, hypernyms and other WordNet relations of the selected<br />

word are shown in this tab.



Fig. 1. The View tab of the VisDic editor for the verb " درس دادن " ‘to teach’.<br />

Figure 2 shows the Edit tab of the VisDic editor for the verb " درس دادن " ‘to teach’. This<br />

tab allows editing the actual entry. There are some other buttons in this tab: the "New"<br />

button for creating a new entry with a unique key, and the "Add" and "Update" buttons to add<br />

the actual entry to the dictionary or update it.<br />

The output file generated by VisDic is a human-readable XML file. In this file,<br />

each synset is defined within a SYNSET element that includes some other inner<br />

tags.<br />
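A minimal sketch of reading such an export is shown below; the tag names and the synset ID follow the BalkaNet convention but are assumptions here, to be checked against an actual VisDic output file.

```python
# Minimal reader for a VisDic-style XML export. The tag names (SYNSET,
# ID, POS, SYNONYM, LITERAL) and the synset ID are assumptions based
# on the BalkaNet convention, not taken from an actual file.
import xml.etree.ElementTree as ET

sample = """<root>
  <SYNSET>
    <ID>ENG20-00828990-v</ID>
    <POS>v</POS>
    <SYNONYM><LITERAL>teach</LITERAL><LITERAL>instruct</LITERAL></SYNONYM>
  </SYNSET>
</root>"""

def read_synsets(xml_text):
    """Yield (id, pos, literals) for every SYNSET element."""
    root = ET.fromstring(xml_text)
    for syn in root.iter("SYNSET"):
        literals = [lit.text for lit in syn.iter("LITERAL")]
        yield syn.findtext("ID"), syn.findtext("POS"), literals

entries = list(read_synsets(sample))
```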

4 Compound verbs in Persian<br />

Persian verbs can be divided into two major morphological categories: simple and<br />

compound verbs. As the names suggest, simple verbs have simple morphological<br />

structure, the verbal constituent. Compound verbs, on the other hand, consist of a non-verbal<br />

constituent, such as a noun, adjective, past participle, prepositional phrase, or<br />

adverb, and a verbal constituent.<br />

As reported in Sadeghi [10], the maximum number of simple verbs in Persian<br />

today is only 115. This, along with other observations, such as the existence of a<br />

great number of compound verbs formed from various Arabic parts of speech<br />

and the use of all verbs newly borrowed from Western languages as compound verbs in<br />

Persian, reveals that compound-verb formation is highly productive in Persian today<br />

[11]. The number of registered compound verbs is around 2500-3000. Thus, most of<br />

the Persian verbs, including the basic ones, are compound verbs.<br />

We follow Dabir-Moghaddam’s account [11] as he suggests two major types of<br />

compound-verb formation in Persian: Combination and Incorporation.



Fig. 2. The Edit tab of VisDic editor for the verb " درس دادن " ‘to teach’



4.1 Combination<br />

In this type of compound-verb formation the non-verbal and the verbal constituent are<br />

combined in the following patterns:<br />

4.1.1 Adjective + Auxiliary<br />

delxor-shodan ‘to become annoyed’ ‘annoyed-become’<br />

4.1.2 Noun + Verb<br />

bâzi-kardan ‘to play’ ‘play-do’<br />

pas-dâdan ‘to return’ ‘back-give’<br />

dast-dâshtan ‘to be involved’ ‘hand-have’<br />

4.1.3 Prepositional Phrase + Verb<br />

be donya âmadan ‘to be born’ ‘to-world-come’<br />

4.1.4 Adverb + Verb<br />

dar yâftan ‘to perceive’ ‘in-find’<br />

4.1.5 Past Participle + Passive Auxiliary<br />

sâxte shodan ‘to be built’ ‘built-become’<br />

4.2 Incorporation<br />

In Persian, the direct object, losing its grammatical endings, can incorporate<br />

into the verb to create an intransitive compound verb, which is a conceptual whole, as<br />

shown in the following example:<br />

4.2.1<br />

a. mâ qazâ-y-e-mân-râ xor-d-im<br />

we food-our-pl.-DO eat-past-we<br />

‘We ate our food’<br />

b. mâ qazâ-xor-d-im<br />

‘We did food eating’<br />

Also, some prepositional phrases can incorporate with verbs. Here, the preposition<br />

disappears after incorporation:<br />

4.2.2<br />

a. ân-hâ be zamin xor-d-and<br />

that-pl. to ground eat-past-they<br />

‘They fell to the ground.’<br />

b. ân-hâ zamin xor-d-and<br />

‘They fell down.’<br />

As can be seen, the morphological structure of verbs in Persian is highly connected<br />

to the other parts of speech, especially nouns; and in many cases the compound verbs<br />

are semantically transparent, that is, the meaning of the resulting compound verb is



a function of the meanings of its verbal and non-verbal constituents. This suggests<br />

that many verbs in Persian WordNet can be directly connected to their non-verbal<br />

constituent in Persian WordNet, i.e. the nouns, the adjectives and the adverbs; and so<br />

inherit the existing relations among those words too; as in the verb qazâ-xordan ‘to<br />

eat’ which is connected directly to the noun qazâ ‘food’ and to its hyponyms, for<br />

instance, nâhâr ‘lunch’ in the verb nâhâr-xordan ‘to eat lunch’. Thus the Persian WordNet<br />

will be strongly connected across parts of speech.<br />
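The inheritance idea can be sketched as follows, with transliterated illustrative data rather than actual Persian WordNet content:

```python
# Sketch of deriving cross-POS links for transparent compound verbs:
# a noun and each of its hyponyms pair with a light verb, and every
# resulting compound verb is linked back to its nominal constituent.
# The transliterated data is illustrative, not Persian WordNet content.
noun_hyponyms = {"qaza": ["nahar", "sham"]}  # 'food' -> 'lunch', 'dinner'

def derive_compounds(noun, light_verb="xordan"):
    """Map compound verbs like qaza-xordan 'to eat' and
    nahar-xordan 'to eat lunch' to their nominal constituents."""
    compounds = {f"{noun}-{light_verb}": noun}
    for hypo in noun_hyponyms.get(noun, []):
        compounds[f"{hypo}-{light_verb}"] = hypo
    return compounds

links = derive_compounds("qaza")
```

Each compound verb thus inherits a link to its noun, and through it the noun's own relations.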

5 Conclusion<br />

In this article we reviewed the ongoing project on building a WordNet of<br />

Persian verbs. Considering the fact that most verbs in Persian are<br />

compounds and highly connected to the other parts of speech, the WordNets of<br />

Persian verbs, nouns, adjectives and adverbs have to be built in close coordination. The<br />

inter-dependency of verbs and the other parts of speech in Persian is an interesting<br />

feature that is not usually found in other languages.<br />

This WordNet can be evaluated in the following ways: first, we have to compare<br />

the results with three reliable bilingual dictionaries; second, some human experts check<br />

and evaluate the synsets; third, when completed, we have to use the WordNet in some<br />

applications and evaluate the results; and fourth, the WordNet has to be compared with<br />

other lexicons built with other approaches.<br />

Acknowledgements<br />

Special thanks to Dr. Ali Famian for his continuous support and ideas and to Dr.<br />

Mohammad Dabir-Moghaddam for his innovative ideas and findings on compound<br />

verbs in Persian. We are also thankful to Mr. Alireza Mokhtaripour and Dr. Mohsen<br />

Ebrahimi Moghaddam for their efforts to provide us with the electronic dictionary.<br />

References<br />

1. Keyvan, F. (ed.): Developing PersiaNet: The Persian Wordnet. In: Proceedings of the 3rd<br />

Global WordNet conference, pp. 315-318. South Korea (2006)<br />

2. Famian, A., Aghajaney, D.: Towards Building a WordNet for Persian Adjectives. In:<br />

Proceedings of the 3rd Global WordNet conference, pp. 307–308. South Korea (2006)<br />

3. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic<br />

Networks. Kluwer Academic Publishers, Dordrecht (1998)<br />

4. Vossen, P.: EuroWordNet General Document. EuroWordNet Project LE2-4003 & LE4-<br />

8328 report. University of Amsterdam (2002)<br />

5. Rodriquez, H. (ed.): The Top-Down Strategy for Building EuroWordNet: Vocabulary<br />

Coverage, Base Concepts and Top Ontology. J. Computers and the Humanities, Special<br />

Issue on EuroWordNet 32, 117–152 (1998)



6. Tufis, D. (ed.): Romanian Journal of Information Science and Technology, Special Issue<br />

on the BalkaNet Project 7(1–2) (2004)<br />

7. Assi, S. M.: Farsi Linguistic Database (FLDB). J. International Journal of Lexicography,<br />

10(3), Euralex Newsletter (1997)<br />

8. Haghshenas, A.M., Samei, H., Entekhabi, N.: Farhang Moaser English-Persian<br />

Millennium Dictionary. Farhang Moaser Publication, Tehran (1992)<br />

9. Anvari, H.: Sokhan Dictionary (2 Vol.). Sokhan Publishers, Tehran (2004)<br />

10. Sadeghi, A. A.: On denominative verbs in Persian. (article in Persian) In: Proceedings of<br />

Persian Language and the Language of Science Seminar, pp. 236-246. Iran University<br />

Press, Tehran (1993)<br />

11. Dabir-Moghaddam, M.: Compound Verbs in Persian. J. Studies in the Linguistic Sciences,<br />

27(2), 25–59 (1997)


Developing FarsNet: A Lexical Ontology for Persian<br />

Mehrnoush Shamsfard<br />

NLP Research Laboratory, Faculty of Electrical & Computer Engineering,<br />

Shahid Beheshti University, Tehran, Iran.<br />

m-shams@sbu.ac.ir<br />

Abstract. Semantic lexicons and lexical ontologies are important resources in<br />

natural language processing. They are used in various tasks and applications,<br />

especially where semantic processing is involved, such as question answering,<br />

machine translation, text understanding, information retrieval and extraction,<br />

content management, text summarization, knowledge acquisition and semantic<br />

search engines. Although there are a number of semantic lexicons for English<br />

and some other languages, Persian lacks such a complete resource to be used in<br />

NLP works. In this paper we introduce an ongoing project on developing a<br />

lexical ontology for Persian called FarsNet. It uses a combination of WordNet<br />

and FrameNet features to represent word meanings. We exploited a hybrid<br />

semi-automatic approach to acquire lexical and conceptual knowledge from<br />

resources such as WordNet, bilingual dictionaries, mono-lingual corpora and<br />

morpho-syntactic and semantic templates. FarsNet provides links between<br />

various types of words and also between words and their corresponding<br />

concepts in other ontologies.<br />

Keywords: Lexical Ontology, Semantic Lexicon, Persian, WordNet, FrameNet.<br />

1 Introduction<br />

In recent years, there has been an increasing interest in semantic processing of natural<br />

languages. Some of the essential resources to make this kind of processing possible are<br />

semantic lexicons and ontologies. A lexicon contains knowledge about words and<br />

phrases as the building blocks of language, while an ontology contains knowledge about<br />

concepts as the building blocks of human conceptualization (the world model) [1].<br />

Lexical ontologies or NL-ontologies are ontologies whose nodes are lexical units of a<br />

language. Moving from lexicons toward ontologies by representing the meaning of<br />

words by their relations to other words, results in semantic lexicons and lexical<br />

ontologies.<br />

One of the most popular lexical ontologies for English is WordNet. Princeton<br />

WordNet [2] is widely used in NLP research. It covers the English language and<br />

was first developed by Miller and colleagues in a hand-crafted way. Many other lexical<br />

ontologies (such as EuroWordNet, BalkaNet, …) have been created based on<br />

Princeton WordNet for other languages such as Dutch, Italian, Spanish, German,<br />

French, Czech and Estonian. Although there exist such semantic, lexical resources for<br />

English and some other languages, some languages such as Persian (Farsi) lack such a



semantic resource for use in NLP works. There have been some efforts to create a<br />

WordNet for the Persian language too [3,4], but no available product has been<br />

announced yet. The only available lexical resources for Persian are some lexicons<br />

containing phonological and syntactic knowledge of words (such as [5]).<br />

On the other hand, the major problems with WordNet are (1) its restricted set of relations<br />

and (2) its weak semantic knowledge about verbs. WordNet does not support cross-POS<br />

relationships and does not allow defining arbitrary relations. There is no coded<br />

information about verb arguments and their conceptual properties in WordNet.<br />

In this paper we introduce an effort to develop a lexical ontology called FarsNet for<br />

the Persian language, which overcomes the above shortcomings. We exploit a semi-automatic<br />

approach to acquire lexical and ontological knowledge from available<br />

resources and build the lexicon. FarsNet is a bilingual lexical ontology which not only<br />

represents the meaning of Persian words and phrases, but also links them to their<br />

corresponding concepts in other ontologies such as WordNet, Cyc, Sumo, etc.<br />

FarsNet aggregates the power of WordNet on nouns and the power of FrameNet on<br />

verbs.<br />

In the rest of the paper, I first discuss the construction of lexicons based on<br />

WordNet; then, after discussing the new features of FarsNet, I explain our approach in<br />

brief.<br />

2 Construction of Semantic Lexicons Based on Princeton<br />

WordNet<br />

Semantic lexicons may be generated using automatic or manual methods. The manual<br />

approach requires direct human intervention and is a time-consuming<br />

task; therefore the use of automatic methods seems more desirable. One of the<br />

major resources for creating a semantic lexicon for a language (other than English) is<br />

Princeton WordNet that was constructed for English.<br />

However, it should be noted that although concepts are represented in the form of<br />

different words in different languages, the relations between these concepts are almost the<br />

same. Therefore we may take advantage of Princeton WordNet as a main resource for<br />

the development of WordNets for different languages.<br />

The main challenges in this procedure are the lexical gaps that exist among<br />

different languages and the ambiguities produced during the translation procedures.<br />

A lexical gap arises when a word in one language has no direct counterpart in the other<br />

language and can only be translated by a group of words that convey the same<br />

meaning instead of a single word. Ambiguities result from translating polysemous<br />

words in one language to polysemous words in another when creating one<br />

WordNet from another.<br />

There are some proposed approaches to overcome these problems and build new<br />

WordNets for new languages based on Princeton WordNet. In this paper we use some<br />

of them to create some parts of FarsNet which is related to WordNet.



3 Introducing FarsNet<br />

FarsNet consists of two main parts: a semantic lexicon and a lexical ontology. Each<br />

entry in the semantic lexicon contains natural language descriptions, phonological,<br />

morphological, syntactic and semantic knowledge about a lexeme. The lexemes can<br />

participate in relations with other lexemes in the same lexicon or to entries of other<br />

lexicons and ontologies, in the ontology part. Here, the semantic lexicon is serving as<br />

a lexical index to the ontology. The ontology part contains not only the standard<br />

relations defined in WordNet but also some additional conceptual ones. FarsNet<br />

allows adding new relations between its words or concepts. We have developed an interface<br />

for FarsNet from which one can add, remove or change the entries. From this<br />

interface the user can define new relations or use the existing ones and relate words<br />

by them. It can relate words from different syntactic types together (e.g. nouns to<br />

adjectives and verbs). It can also relate a word to its corresponding concept in an<br />

existing ontology. This makes the interoperability between various resources and<br />

various languages easier.<br />

These general features are available for all types of words. In addition there are<br />

some specific features for specific POS tags too. For instance, adjectives may accept<br />

selectional restrictions. This way, in addition to features defined by WordNet, we<br />

have defined a new relation for adjectives which shows the category of nouns that<br />

can accept this word as a modifier. For example, ‘khoshmazeh’ (delicious) is usually<br />

used for edibles while ‘dana’ (wise) is used for humans. This feature, showing the<br />

selectional restrictions of Persian adjectives, helps NLP systems with disambiguation in<br />

syntactic parsing, chunking and understanding.<br />
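A minimal sketch of such a selectional-restriction check follows; the category labels and the sample nouns are illustrative assumptions, not FarsNet content:

```python
# Sketch of the selectional-restriction relation for adjectives: each
# adjective points to the category of nouns it can modify. Category
# labels and the sample nouns are illustrative assumptions.
modifies = {
    "khoshmazeh": "edible",  # 'delicious'
    "dana": "human",         # 'wise'
}
noun_category = {"sib": "edible", "mard": "human"}  # 'apple', 'man'

def acceptable(adjective, noun):
    """True if the noun's category satisfies the adjective's
    selectional restriction."""
    return noun_category.get(noun) == modifies.get(adjective)

ok = acceptable("khoshmazeh", "sib")    # 'delicious apple': plausible
odd = acceptable("khoshmazeh", "mard")  # 'delicious man': rejected
```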

On the other hand FarsNet covers the relations introduced for verbs in WordNet<br />

and also adds the number, names and conceptual characteristics of the arguments of<br />

each verb in a similar way to FrameNet. We have defined the arguments for about<br />

300 Persian verbs [6]. The next activity is to complete it for other verbs and define the<br />

feature set of each argument. For example now we know that ‘khordan’ (to eat) is a<br />

verb belonging to a verb class which needs an agent and a theme and can have an<br />

instrument, but we have not yet defined that the theme of this verb should be edible,<br />

its agent should be an animate being, and the size of its instrument is small (usually<br />

smaller than a mouth) and it may be one of spoon, fork, knife, …. This feature helps<br />

NLP systems to extract thematic roles, represent the sentence meaning and acquire<br />

knowledge from texts.<br />
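A frame record along these lines can be sketched as follows; since the text notes that the feature sets are still to be defined, the constraint values below are placeholders:

```python
# Sketch of a FrameNet-style frame record, using the argument names
# from the 'khordan' example. The constraint values are placeholders:
# the project has not yet defined the actual feature sets.
from dataclasses import dataclass, field

@dataclass
class VerbFrame:
    lemma: str
    required: list = field(default_factory=list)     # obligatory arguments
    optional: list = field(default_factory=list)     # optional arguments
    constraints: dict = field(default_factory=dict)  # argument -> features

khordan = VerbFrame(
    lemma="khordan",  # 'to eat'
    required=["agent", "theme"],
    optional=["instrument"],
    constraints={
        "theme": {"edible"},
        "agent": {"animate"},
        "instrument": {"small", "spoon/fork/knife"},
    },
)
```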

The next section will discuss the building procedure of FarsNet.<br />

4 Semi-Automatic Knowledge Acquisition for FarsNet<br />

We use an incremental approach to build FarsNet: developing a kernel and extending<br />

it in a semi-automatic way. The acquisition approach consists of the following main<br />

steps:



1- Providing initial resources<br />

2- Developing an initial lexicon based on WordNet and performing word sense<br />

disambiguation<br />

3- Extracting new knowledge (words and relations) from available resources<br />

4- Evaluation and refinement<br />

4.1 Initial Resources<br />

We have the following resources available and use them to develop FarsNet.<br />

- WordNet,<br />

- a lexicon [5] containing more than 50,000 entries with their POS tags,<br />

- a bilingual (English-Persian) dictionary,<br />

- POS-tagged corpora,<br />

- a morphological analyzer for Persian [7].<br />

4.2 Developing an Initial Lexicon Based on WordNet<br />

To develop an initial lexicon we exploited three separate approaches in parallel: (a)<br />

automatic creation of a small kernel containing just the base concepts, (b) automatic<br />

creation of an initial big lexicon containing almost everything covered by the bilingual<br />

dictionary, and (c) manual gathering of a small lexicon.<br />

For (a) we start from English base concepts and translate them to Persian,<br />

but for (b) we move in two directions, from English to Persian and from Persian to<br />

English separately to compare their results.<br />

Moving from Persian is simpler. Each Persian word will be assigned to an English<br />

synset using a Persian-English dictionary. To move from English, for each English<br />

synset, first we translate all the words in the synset using an electronic bilingual<br />

dictionary. Then we should arrange the Persian synsets by exploiting some heuristics<br />

and WSD (word sense disambiguation) methods. It is obvious that each synset has<br />

some English words and each word may have several senses and each sense may have<br />

several translations to Persian. So creating Persian synsets from English ones is not a<br />

straightforward task and each Persian word may be connected to a group of synsets in<br />

WordNet. For example the Persian word “dast” (hand) is connected to 14 synsets in<br />

WordNet. Some of them are listed below:<br />

• Hand, Manus, mitt, paw -- the (prehensile) extremity of the superior limb;<br />

• Hired hand, hand, hired man -- a hired laborer on a farm or ranch<br />

• Handwriting, hand, script -- something written by hand<br />

It can be seen that from all these 14 synsets, only the first one is a valid choice for<br />

the Persian word “dast”.<br />

Therefore it is important to identify the right sense(s) of the English word and its right<br />

translation, and to put the right sense of the translated word in the corresponding<br />

synset. We use some heuristics to find the corresponding synsets fast. For example, to



find the appropriate Persian synset for an English one, we consider word pairs in the<br />

English synset. For each word in this pair we list all synsets they appear in. If those<br />

two words appear together only in the current synset, their common Persian<br />

translations would be connected to that synset. The existence of a single common<br />

synset in fact implies the existence of a single common sense between the two words<br />

and therefore their Persian translations shall be connected to this synset.<br />

On the other hand, if a word is known to be the English equivalent of a Persian<br />

word according to the dictionary, the Persian word should at least be connected to one of<br />

the synsets that include the English word as a member. There will obviously be no<br />

ambiguities if the English word has only one sense and so appears in only one synset.<br />

In this case its translations will be added to that synset too.<br />
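The two heuristics above can be sketched as follows, with illustrative synset IDs and translations:

```python
# Sketch of the two linking heuristics. 'synsets_of' maps an English
# word to the IDs of all synsets containing it; 'fa_of' maps it to its
# Persian translations. All IDs and translations are illustrative.
from itertools import combinations

synsets_of = {"teach": {"s1", "s2"}, "instruct": {"s1"}, "learn": {"s3"}}
fa_of = {"teach": {"dars dadan", "amuxtan"},
         "instruct": {"dars dadan"},
         "learn": {"amuxtan"}}

def link_by_unique_pair(synset_id, members):
    """If two synset members co-occur in no other synset, attach their
    common Persian translations to this synset."""
    linked = set()
    for w1, w2 in combinations(members, 2):
        if synsets_of[w1] & synsets_of[w2] == {synset_id}:
            linked |= fa_of[w1] & fa_of[w2]
    return linked

def link_monosemous(word):
    """A word with a single sense links all its translations there."""
    (only_synset,) = synsets_of[word]  # fails if the word is polysemous
    return only_synset, fa_of[word]

pair_links = link_by_unique_pair("s1", ["teach", "instruct"])
mono = link_monosemous("learn")
```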

We plan to use dictionary-based WSD in the style of Lesk [8] too. In this approach<br />

we use the other English translations of the Persian word (PW) as context words.<br />

4.3 Extracting New Knowledge<br />

After creating the initial lexicon, extra words will be gathered from a tagged corpus<br />

and assigned to synsets as mentioned before.<br />

Another part of ontology learning in FarsNet is dedicated to finding some relations<br />

from corpora exploiting lexico- syntactic patterns. The patterns we have tested so far<br />

are some adaptations of Hearst’s patterns for Persian. We are going to test other<br />

templates introduced in [9] too.<br />
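As an illustration, one classic Hearst pattern in its English form ("NP such as NP, NP and NP") can be matched with a regular expression; the Persian adaptations actually used in the project are not reproduced here, so this is only a stand-in for the general technique.

```python
import re

# "X such as A, B and C" yields (X, A), (X, B), (X, C) hypernym pairs
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")

def extract_hyponyms(text):
    """Return (hypernym, hyponym) pairs matched by the pattern."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r", | and ", m.group(2))
        pairs.extend((hypernym, h) for h in hyponyms)
    return pairs
```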

4.4 Evaluation<br />

As mentioned before, we build each part of FarsNet using more than one<br />

approach. The evaluation is likewise done by two methods.<br />

In the first method a linguistic expert reviews the extracted knowledge and confirms<br />

or corrects it according to valid resources (manual evaluation). The manual<br />

evaluation of the part of lexicon built so far shows an accuracy of about 70% in the<br />

resulting Persian lexicon.<br />

In the second method we compare the results of various exploited methods on a<br />

common task to find the commonly built knowledge. For example, to confirm the<br />

inclusion hierarchies, we extract hierarchical relations from text using templates on<br />

the one hand, and derive this hierarchy from the hyponym/hypernym relations<br />

between the corresponding English synsets on the other. Comparing the results<br />

reveals the most confident knowledge, extracted by both methods. However, as it is<br />

an ongoing project, the evaluation procedures as well as some other parts are not<br />

complete yet.<br />
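This cross-method confirmation amounts to intersecting the relation sets produced by the two methods. A minimal sketch (the word pairs in any usage are illustrative, not from the actual FarsNet data):

```python
def confirm_hierarchy(pattern_pairs, projected_pairs):
    """Cross-method evaluation: a (hyponym, hypernym) pair extracted from
    text by lexico-syntactic templates counts as confident knowledge only
    if projecting the English hyponym/hypernym links yields it too."""
    confident = set(pattern_pairs) & set(projected_pairs)
    # pairs supported by only one method remain candidates for expert review
    pending = set(pattern_pairs) ^ set(projected_pairs)
    return confident, pending
```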

5 Conclusion<br />

FarsNet is an ongoing project in the NLP research laboratory of Shahid Beheshti<br />

University. The ontology development methodology we proposed for developing


418 Mehrnoush Shamsfard<br />

FarsNet requires a series of tasks to be done in parallel, after which the results are<br />

combined or compared.<br />

We have done the following parallel activities:<br />

- Manually developing a small lexicon as the kernel of FarsNet containing<br />

2500 entries [7].<br />

- Manually translating the base concepts of WordNet into Persian<br />

- Automatically finding the corresponding WordNet synsets for each entry of the<br />

syntactic lexicon using the bilingual dictionary.<br />

- Automatically building the preliminary list of potential synsets for Persian using<br />

WordNet and the above translations.<br />

- Automatically learning new words and relations from the tagged corpus.<br />

Although a base ontology has been created with 32,000 Persian synsets, there<br />

are still many things to be added. The following activities are part of our future work<br />

to continue the project:<br />

- Exploiting (linking to) FrameNet as a basis for developing the verb<br />

knowledge base of FarsNet<br />

- Completing the verb knowledge base,<br />

- Enhancing the sense disambiguation modules in the automatic translations<br />

- Designing and using more templates to extract non-taxonomic relations from<br />

text.<br />

- Working on some statistical approaches for lexical acquisition<br />

- Exploiting other methods to learn ontological knowledge<br />

- Finding a mapping between various ontologies.<br />

- Integrating the work done.<br />

References<br />

1. Shamsfard, M., Barforoush, A.A.: Learning Ontologies from Natural Language Texts.<br />

International Journal of Human-Computer Studies 60, 17–63 (2004)<br />

2. Fellbaum, C.: WordNet: An electronic lexical database. Cambridge, Mass. MIT Press (1998)<br />

3. Famian, A., Aghajaney, D.: Towards Building a WordNet for Persian Adjectives. In: 3rd<br />

Global Wordnet Conference (2007)<br />

4. Keyvan, F., Borjian, H., Kasheff, M., Fellbaum, C.: Developing PersiaNet: The Persian<br />

Wordnet. In: 3rd Global wordnet conference (2007)<br />

5. Eslami, M.: The generative lexicon. In: 2nd workshop on Persian language and computer.<br />

Tehran (2006)<br />

6. Shamsfard, M., SadrMousavi, M.: A Rule-based Semantic Role Labeling Approach for<br />

Persian Sentences. In: Second workshop on Computational Approaches to Arabic-script<br />

Languages (CAASL’2). Stanford, USA (2007)<br />

7. Shamsfard, M., Mirshahvalad, A., Pourhassan, M., Rostampour, S.: Developing basic<br />

analysers for Persian: combining morphology, syntax and semantic. In: 15th Iranian<br />

conference on Electrical Engineering. Tehran (2007)<br />

8. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a<br />

pine cone from an ice cream cone. In: Proceedings of the 5th annual international<br />

conference on Systems documentation, pp. 24–26. ACM Press (1986)<br />

9. Shamsfard, M.: Introducing Linguistic and Semantic Templates for Knowledge Extraction<br />

from Texts. In: Workshop on ontologies in text technology. Germany (2006)


KUI: Self-organizing Multi-lingual<br />

WordNet Construction Tool<br />

Virach Sornlertlamvanich 1 , Thatsanee Charoenporn 1 ,<br />

Kergrit Robkop 1 , and Hitoshi Isahara 2<br />

1<br />

Thai Computational Linguistics Lab.,<br />

NICT Asia Research Center, Pathumthani, Thailand<br />

2 National Institute of Information and Communications Technology<br />

3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan 619-0289<br />

{virach, thatsanee, krergrit}@tcllab.org, isahara@nict.go.jp<br />

Abstract. This paper describes a multi-lingual WordNet construction tool,<br />

called KUI (Knowledge Unifying Initiator), which is a knowledge user<br />

interface for online collaborative knowledge construction. KUI supports an online<br />

community in developing and discussing multi-lingual WordNets. KUI is a sort<br />

of social networking system that unifies the various discussions following the<br />

process of thinking model, i.e. initiating the topic of interest, collecting the<br />

opinions to the selected topics, localizing the opinions through the translation or<br />

customization and finally posting for public hearing to conceptualize the<br />

knowledge. The process of thinking is done under the selectional preference<br />

simulated by voting mechanism in the case that there are many alternatives. By<br />

measuring the history of participation of each member, KUI adaptively<br />

manages the reliability of each member’s opinion and vote according to the<br />

estimated ExpertScore. As a result, the multi-lingual WordNet can be created<br />

online and produce a reliable result.<br />

Keywords: Multi-lingual WordNet, KUI, ExpertScore, social networking<br />

system, information reliability.<br />

1 Introduction<br />

The construction of WordNets [1] for different languages can vary according to the<br />

availability of language resources. Some were developed from scratch, and some<br />

were developed from the combination of various existing lexical resources. Spanish<br />

and Catalan Wordnets 1 , for instance, are automatically constructed using hyponym<br />

relation, mono-lingual dictionary, bi-lingual dictionary and taxonomy [2]. Italian<br />

WordNet [3] is semi-automatically constructed from definitions in a mono-lingual<br />

dictionary, bi-lingual dictionary, and WordNet glosses. Hungarian WordNet uses bilingual<br />

dictionary, mono-lingual explanatory dictionary, and Hungarian thesaurus in<br />

the construction [4], etc.<br />

1<br />

http://www.lsi.upc.edu/~nlp/


A tool to facilitate the construction is one of the important issues related to the<br />

WordNet construction. Some of the previous efforts were devoted to developing<br />

tools such as Polaris [5], the editing and browsing tool for EuroWordNet, and VisDic [6],<br />

the XML-based multi-lingual WordNet browsing and editing tool developed by the<br />

Czech WordNet team. To facilitate online collaborative development and to attach<br />

a reliability score to the proposed word entries, we therefore proposed KUI<br />

(Knowledge Unifying Initiator) to be a Knowledge User Interface (KUI) for online<br />

collaborative construction of multi-lingual WordNets. KUI supports an online<br />

community in developing and discussing multi-lingual WordNets. KUI is a sort of<br />

social networking system that unifies the various discussions following the process of<br />

thinking model, i.e. initiating the topic of interest, collecting the opinions to the<br />

selected topics, localizing the opinions through the translation or customization and<br />

finally posting for public hearing to conceptualize the knowledge. The process of<br />

thinking is done under the selectional preference simulated by voting mechanism in<br />

the case that there are many alternatives.<br />

This paper illustrates an online tool to facilitate the multi-lingual WordNet<br />

construction by using existing resources that have only English equivalents and<br />

lexical synonyms. Since the system is open for online contribution, we need a<br />

mechanism to inform the reliability of the result. We introduce ExpertScore which<br />

can be estimated from the history of the participation of each member. The weight of<br />

each vote and opinion will be determined by the ExpertScore. The result will then be<br />

ranked according to this score to show the reliability of the opinion.<br />

The rest of this paper is organized as follows: Section 2 describes the process of<br />

managing the knowledge. Section 3 explains the design of KUI for collaborative<br />

resource development. Section 4 provides some examples of KUI for WordNet<br />

construction. And, Section 5 concludes our work.<br />

2 Process of Knowledge Development<br />

A thought is dynamically formed by a trigger, which can be an interest from inside<br />

or a proposed topic from outside. However, knowledge can be formed from such<br />

thought only when it is managed in an appropriate way. Since we are concerned with the<br />

knowledge of a community, we can characterize the knowledge that is formed by a<br />

community in the following manner.<br />

• Knowledge is managed by the knowledge users.<br />

• Knowledge is dynamically changed.<br />

• Knowledge is developed in an individual manner or a community manner.<br />

• Knowledge is both explicit and tacit.<br />

An online community environment can successfully serve the requirements of<br />

knowledge management. In this environment, the knowledge should be grouped<br />

and narrowed down into a specific domain for each group. The domain-specific<br />

group can then be managed to generate concrete knowledge after receiving<br />

consensus from the participants at any moment.<br />

Open Source software development is a model for open collaboration in the<br />

domain of software development. The openness of the development process has


successfully established one of the largest software communities, which shares its development<br />

and usage experience. The activities are dedicated to the domain of software<br />

knowledge development. SourceForge.net 2 is a platform for project-based Open<br />

Source software development. Open Source software developers deploy<br />

SourceForge.net to announce their initiatives, to call for participation, to distribute<br />

their work and to receive feedback concerning their proposed software. Developers<br />

and users are actively using SourceForge.net to communicate with each other.<br />

Adopting the concept of Open Source software development, we will possibly be<br />

able to develop a framework for domain specific knowledge development under the<br />

open community environment. Sharing and collaboration are the key features<br />

of the framework. The knowledge will finally be shared among the communities after<br />

receiving consensus from the participants at each step. To facilitate the knowledge<br />

development, we divide the process into four steps.<br />

1) Topic of interest<br />

The topic will be posted to draw the attention of the participants. The selected<br />

topics will then be further discussed in the appropriate step.<br />

2) Opinion<br />

The selected topic is posted to call for opinions from the participants in this step.<br />

An opinion poll is conducted to gauge the support for each opinion. The result of the<br />

opinion poll provides the variety of opinions that reflects the current thought of the<br />

communities together with the consensus to the opinions.<br />

3) Localization<br />

Translation is the straightforward implementation of the localization. Collaborative<br />

translation helps produce the knowledge in multiple languages in the most efficient<br />

way.<br />

4) Public-Hearing<br />

The result of discussion will be revised and confirmed by gathering opinions on<br />

the final draft of the proposal.<br />

Fig. 1 shows the process of how knowledge is developed within a community.<br />

Starting from posting 'Topic of Interest', participants express their supports by casting<br />

a vote. Upon a threshold the 'Topic of Interest' is selected for conducting a poll on<br />

'Opinion', or introducing to the community by 'Localization', or posting a draft for<br />

'Public-Hearing' to gather feedback from the community. The transition from<br />

'Opinion' to either 'Localization' or 'Public-Hearing' occurs when the 'Opinion' has a<br />

concrete view for implementation. The discussion in 'Localization' and 'Public-<br />

Hearing' is, however, interchangeable depending on the purpose of implementation: whether to<br />

adopt the knowledge to the local community or to get feedback from the community.<br />

The knowledge creation is managed in four different categories corresponding to the<br />

stage of knowledge. Each individual in the community casts a vote to rank the<br />

appropriateness of solutions at each category. The community can then form the<br />

2<br />

http://www.sourceforge.net/


community knowledge under the 'Selectional Preference' background. On the other<br />

hand, the under-threshold solutions become obsolete by nature of the 'Selectional<br />

Preference'.<br />

Fig. 1. Process of knowledge development<br />

3 Knowledge User Interface for Knowledge Unifying Initiative<br />

3.1 What is KUI?<br />

KUI is a GUI for knowledge engineering, in other words a Knowledge User Interface<br />

(KUI). It provides a web interface accessible to pre-registered members. An online<br />

registration is offered to manage an account by profiling the login participant in<br />

making contributions. A contributor can comfortably move around in the virtual space<br />

from desk to desk to participate in a particular task. A working desk can be a meeting<br />

place for collaborative work that needs discussion through the 'Chat', or it can allow a<br />

contributor to work individually by using the message slot to record their own<br />

comments. The working space can be expanded by closing the unnecessary frames so<br />

the contributor can concentrate on the task. All working topics can be statistically<br />

viewed through the provided tabs. These tabs help contributors understand KUI in<br />

terms of the current status of contributions and tasks. A knowledge<br />

community can be formed and can efficiently create the domain knowledge through<br />

the features provided by KUI. These KUI features fulfill the process of human<br />

thought to record the knowledge.<br />

KUI also provides a 'KUI look up' function for viewing the composed knowledge.<br />

It is equipped with powerful search and statistical browsing in many aspects.<br />

Moreover, the 'Chatlog' is provided to learn about the intention of the knowledge<br />

composers. We frequently want to know about the background of the solution for<br />

better understanding or to remind us of the decision, but often cannot find it. To<br />

avoid the repetition of a mistake, we systematically provide the 'Chatlog' to keep the<br />

trace of discussion or the comments to show the intention of knowledge composers.


3.2 Features of KUI<br />

• Poll-based Opinion or Public-Hearing<br />

A contributor may choose to work individually by posting an opinion e.g.<br />

localization, suggestion etc., or join a discussion desk to conduct 'Public-Hearing'<br />

with others on the selected topic. The discussion can be conducted via the provided<br />

'Chat' frame before concluding an opinion. Any opinions or suggestions are<br />

committed to voting. Opinions can differ, but majority votes determine the belief<br />

of the community. These features naturally realize the online collaborative works to<br />

create the knowledge.<br />

• Individual or Group works<br />

Thought may be formed individually or through a concentrated discussion. KUI<br />

facilitates a window for submitting an opinion and another window for submitting a<br />

chat message. Each suggestion can be cast through the 'Opinion' window marked with<br />

a degree of its confidence. By working individually, comments to a suggestion can be<br />

posted to mark its background and make it better understood. On the other hand,<br />

when working as a group, discussions among the group participants will be recorded.<br />

The discussion can be resumed at any point to avoid repeating earlier exchanges.<br />

• Record of Intention<br />

The intention of each opinion can be recalled from the recorded comments or the trace<br />

of discussions. Frequently, we have to discuss over and over a result that we<br />

have already agreed upon. Misinterpretation of the previous decision is also frequently<br />

faced when we do not record the background of decision. Record of intention is<br />

therefore necessary in the process of knowledge creation. The knowledge<br />

interpretation also refers to the record of intention to obtain a better understanding.<br />

• Selectional Preference<br />

Opinions can differ from person to person depending on the aspects of the<br />

problem. It is not always necessary to say what is right and what is wrong. Each<br />

opinion should be treated as a result of intelligent activity. However, the majority<br />

accepted opinions are preferred at the moment. Experiences could tell the preference<br />

via vote casting. The dynamic vote ranking will tell the selectional preference of<br />

the community at the moment.<br />

3.3 ExpertScore<br />

KUI heavily depends on members’ voting score to produce a reliable result.<br />

Therefore, we introduce an adjustable voting score to realize a self-organizing system.<br />

Each member is initially given a default voting score equal to one. The<br />

voting score is increased according to ExpertScore which is estimated by the value of<br />

Expertise, Contribution, and Continuity of the participation history of each member.<br />

Expertise is a composite score of the accuracy of opinion and vote, as shown in<br />

Equation 1. Contribution is a composite score of the ratio of opinion and vote posting


compared to the total, as shown in Equation 2. Continuity is a regressive function<br />

based on the assumption that the absence of participation of a member will gradually<br />

decrease its ExpertScore to one after a year (365 days) of the absence, as shown in<br />

Equation 3.<br />

Expertise = α · count(BestOpinion) / count(Opinion) + β · count(BestVote) / count(Vote) . (1)<br />

Contribution = γ · count(Opinion) / count(TotalOpinion) + ρ · count(Vote) / count(TotalVote) . (2)<br />

Continuity = 1 − (D / 365) . (3)<br />

where α + β + γ + ρ = 1, and D is the number of recent absent days (0 ≤ D ≤ 365).
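The three component scores can be sketched directly from the equations. Note that the weight values below are illustrative defaults (only the constraint α + β + γ + ρ = 1 is given), and the text does not specify how the components combine into the final ExpertScore, so no combination is shown.

```python
def expertise(best_opinions, opinions, best_votes, votes, alpha=0.4, beta=0.1):
    """Eq. (1): accuracy of a member's past opinions and votes."""
    return alpha * best_opinions / opinions + beta * best_votes / votes

def contribution(opinions, total_opinions, votes, total_votes, gamma=0.4, rho=0.1):
    """Eq. (2): a member's share of all posted opinions and votes."""
    return gamma * opinions / total_opinions + rho * votes / total_votes

def continuity(days_absent):
    """Eq. (3): decays linearly over a year (365 days) of absence."""
    return 1 - days_absent / 365
```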


We adopt the proposed criteria for automatic synset assignment for Asian<br />

languages, which have limited language resources. Based on the result of the above<br />

synset assignment algorithm, we provide KUI (Knowledge Unifying Initiator) [13],<br />

[14] to establish online collaborative work in refining the WordNets.<br />

KUI allows registered members, including language experts, to revise and vote on the<br />

synset assignment. The system manages the synset assignment according to the<br />

preferred score obtained from the revision process. The revision history is reflected in the<br />

ExpertScore of each participant, and the reliability of the result is based on the<br />

summation of the ExpertScore of the contributions to each record. In case of multiple<br />

mapping, the record with the highest score is selected to report the mapping result.<br />

As a result, the community WordNets will be accomplished and exported into the<br />

original form of WordNet database. Via the synset ID assigned in the WordNet, the<br />

system can generate a cross-language WordNet result. Through this effort, an initial<br />

version of Asian WordNet can be established.<br />

Table 1 shows a record of WordNet displayed for translation in KUI interface.<br />

The English entry, together with its part-of-speech, synset, and gloss, is provided if it exists.<br />

The members will examine the assigned lexical entry and either vote for it or propose<br />

a new translation.<br />

Table 1. A record of WordNet.<br />

Car<br />

[Options]<br />

POS : NOUN<br />

Synset : auto, automobile, machine, motorcar<br />

Gloss : a motor vehicle with four wheels; usually propelled<br />

by an internal combustion engine;<br />

Fig. 2. KUI Participation page.


Fig. 2 illustrates the translation page of KUI 3 . In the working area, the login<br />

member can participate in proposing a new translation or vote for the preferred<br />

translation to revise the synset assignment. Statistics of the progress as well as many<br />

useful functions, such as item search, record jump, chat, and a list of online participants, are<br />

also provided. KUI is actively facilitating members in revising the Asian WordNet<br />

database.<br />

Fig. 3. KUI Lookup page.<br />

Fig. 3 illustrates the lookup page of KUI. The returned result of a keyword lookup<br />

is sorted according to the best translated word of each language. The best translated<br />

word is determined by the highest vote score. As a result, the user can consult the<br />

WordNet to obtain a list of equivalent words of the same sense sorted by the<br />

languages. The ExpertScore provided in KUI will help select the best translation of<br />

each word.<br />

5 Conclusion<br />

KUI is a platform for composing knowledge in the Open Source style. A contributor<br />

can naturally follow the process of knowledge development that includes posting in<br />

'Topic of interest', 'Opinion', 'Localization' and 'Public-Hearing'. The posted items are<br />

committed to voting to perform the selectional preference within the community. The<br />

results will be ranked according to the vote preference estimated by the ExpertScore<br />

for the purpose of managing the multiple results. 'Chatlog' is kept to indicate the<br />

record of intention of knowledge composers. A contributor may participate in KUI<br />

individually or join a discussion group to compose the knowledge. We are expecting<br />

KUI to be a Knowledge User Interface for composing the knowledge in the Open<br />

Source style under the monitoring of the community. The statistically based, visualized<br />

3<br />

http://www.tcllab.org/kui/


'KUI look up' is also provided for the efficient consultation of the knowledge. We<br />

introduce KUI for Asian WordNet development. The ExpertScore efficiently ranks<br />

the results especially in the case where there is more than one equivalent.<br />

References<br />

1. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass<br />

(1998)<br />

2. Atserias, J., Clement, S., Farreres, X., Rigau, G., Rodríguez, H.: Combining Multiple<br />

Methods for the Automatic Construction of Multilingual Word-Nets. In: Proceedings of the<br />

International Conference on Recent Advances in Natural Language Processing, Bulgaria (1997)<br />

3. Magnini, B., Strapparava, C., Ciravegna, F., Pianta, E.: A Project for the Construction of an<br />

Italian Lexical Knowledge Base in the Framework of WordNet. IRST Technical Report #<br />

9406-15 (1994)<br />

4. Proszeky, G., Mihaltz, M.: Semi-Automatic Development of the Hungarian WordNet. In:<br />

Proceedings of LREC2002. Spain (2002)<br />

5. Louw, M.: Polaris User’s Guide. Technical report. Belgium (1998)<br />

6. Horák, A., Smrž, P.: New Features of Wordnet Editor VisDic. Romanian Journal of<br />

Information Science and Technology. 7(1–2), 201–213 (2004)<br />

7. Choi, K. S.: CoreNet: Chinese-Japanese-Korean wordnet with shared semantic hierarchy. In:<br />

Proceedings of Natural Language Processing and Knowledge Engineering. Beijing (2003)<br />

8. Choi, K. S., Bae, H. S., Kang, W., Lee, J., Kim, E., Kim, H., Kim, D., Song, Y., Shin, H.:<br />

Korean-Chinese-Japanese Multilingual Wordnet with Shared Semantic Hierarchy. In:<br />

Proceedings of LREC2004. Portugal (2004)<br />

9. Kaji, H., Watanabe, M.: Automatic Construction of Japanese WordNet. In: Proceedings of<br />

LREC2006. Italy (2006)<br />

10. Korlex: Korean WordNet. Korean Language Processing Lab, Pusan National University,<br />

2007. Available at http://164.125.65.68/ (2006)<br />

11. Huang, C. R.: Chinese Wordnet. Academia Sinica. Available at<br />

http://bow.sinica.edu.tw/wn/ (2007)<br />

12. Hindi Wordnet: Available at http://www.cfilt.iitb.ac.in/wordnet/webhwn/ (2007)<br />

13. Sornlertlamvanich, V.: KUI: The OSS-Styled Knowledge Development System. In:<br />

Proceedings of the 7th AOSS Symposium. Malaysia (2006)<br />

14. Sornlertlamvanich, V., Charoenporn, T., Robkop, K., Isahara, H.: Collaborative Platform for<br />

Multilingual Resource Development and Intercultural Communication. In: Proceedings of<br />

the First International Workshop on Intercultural Collaboration (IWIC2007), LNCS4568,<br />

pp. 91–102 (2007)


Extraction of Selectional Preferences for French using a<br />

Mapping from EuroWordNet to the<br />

Suggested Upper Merged Ontology<br />

Dennis Spohr<br />

Institut für Linguistik/Romanistik<br />

Universität Stuttgart<br />

Stuttgart, Germany<br />

dennis.spohr@ling.uni-stuttgart.de<br />

Abstract. This paper presents an approach to extracting selectional preferences<br />

of French verbal predicates with respect to the ontological types of their arguments.<br />

Selectional preference is calculated on the basis of Resnik’s measure of selectional<br />

association between a predicate and the class of its argument [1]. However, instead<br />

of using WordNet synsets to express sortal restrictions (cf. [2]), we employ<br />

conceptual classes of the Suggested Upper Merged Ontology (SUMO; [3]) that<br />

have been automatically mapped to synsets of the French EuroWordNet [4] in a<br />

generic way that is in principle applicable to all WordNets which are linked to the<br />

Inter-Lingual-Index.<br />

1 Introduction<br />

Lexical-semantic NLP and with it semantic lexicons have become increasingly important<br />

over the last decades, and the contribution of (Euro-)WordNet [5, 4] and FrameNet [6]<br />

within this field is of course so fundamental and well-known that it need not be discussed<br />

here. However, recent years have further seen a strong tendency towards interfacing such<br />

resources with knowledge bases or taxonomies of general knowledge, both commonly<br />

referred to as ontologies. Well-known examples of such efforts are e.g. [7], who linked<br />

EuroWordNet’s Inter-Lingual-Index to a number of base concepts and a top ontology<br />

as integral part of the EuroWordNet project, [8] who mapped Princeton WordNet to the<br />

Suggested Upper Merged Ontology (SUMO), and [9] who linked FrameNet and SUMO.<br />

Moreover, the recent Global WordNet Grid 1 is pursuing such efforts on a considerable<br />

scale to create mappings from SUMO to all existing WordNets.<br />

One of the main reasons why such approaches are so important is that while resources<br />

like (Euro-)WordNet and FrameNet attempt to model lexical-semantic knowledge, ontologies<br />

try to mediate common knowledge or knowledge of the world. Therefore, linking<br />

these two types of resources may be able to bridge the gap between language-dependent<br />

lexical knowledge and language-independent facts or statements about the world.<br />

1 http://www.globalwordnet.org/gwa/gwa_grid.htm


Such statements appear to have a more universal character, and this is what makes<br />

combinations of ontological and lexical-semantic resources interesting for the formulation<br />

of selectional restrictions or preferences. We believe that a statement like “X prefers<br />

subjects of type Human or CognitiveAgent” is – from a meta-linguistic perspective –<br />

more informative than saying “X prefers the subjects {human_1, individual_1, mortal_1,<br />

person_1, someone_1, soul_1} or . . . ”. In this paper, we present a general methodology<br />

for mapping EuroWordNets to the SUMO ontology by using both an existing mapping<br />

from Princeton WordNet 1.6 to SUMO [8] and the linking of the EuroWordNets to<br />

the Inter-Lingual-Index [7]. We apply our methodology to the French EuroWordNet<br />

and extract sortal selectional preferences that are calculated on the basis of an established<br />

measure of selectional association between a predicate and the classes of its<br />

argument [1]. Section 2 of this paper introduces some background on WN and SUMO,<br />

and Resnik’s approach to selectional preference extraction. In Section 3, we will present<br />

our methodology for mapping EWN to SUMO, and Section 4 gives details on how we<br />

extract selectional preferences based on this mapping. After an evaluation of the mapping<br />

methodology and selectional preference extraction, we conclude in Section 6 and briefly<br />

discuss ways to apply and further extend our approach.<br />

2 Background<br />

2.1 WordNet and the Suggested Upper Merged Ontology<br />

In this section, we will briefly discuss work that has been done on linking WordNet to<br />

the Suggested Upper Merged Ontology. [8] have created such a mapping for version 1.6<br />

of WordNet, and have in subsequent years released new mappings for each new version<br />

of WordNet, with the latest release in summer 2007 for WN3.0. In creating their linking,<br />

[8] have decided to use the following three mapping relations: synonymy (equivalence,<br />

indicated by ’=’; cf. Section 3.2), hypernymy (subclass-superclass relation, indicated by<br />

’+’), and instantiation (indicated by ’@’). In contrast to the approach presented in their<br />

paper, we try to create mappings automatically, though relying heavily on their manual<br />

preparatory work.<br />

As was mentioned in the introduction, the Global WordNet Grid initiative, which was<br />

launched in early 2006, is trying to provide WordNet-SUMO mappings for all existing<br />

WordNets. The current state, as of late 2007, comprises mappings for 5,000 English base<br />

concepts, as well as for the Spanish and Catalan WordNets.<br />

2.2 Selectional Preference Extraction<br />

The measure we use for the calculation of selectional preference is that of [1] 2 , who<br />

uses the notion of relative entropy known from information theory [11]. The strength of<br />

selectional preference S R (p) of a predicate p with respect to a grammatical relation R is<br />

2 See [10] for a recent survey of several other approaches to selectional preference acquisition.


430 Dennis Spohr<br />

defined as follows.<br />

SELECTIONAL PREFERENCE STRENGTH:<br />

S_R(p) = Σ_c Pr(c|p) · log ( Pr(c|p) / Pr(c) )<br />

The better Pr(c) approximates Pr(c|p), the closer log ( Pr(c|p) / Pr(c) ) is to 0, i.e. the less influence p has on its argument, and therefore the weaker its selectional preference.<br />

The selectional preference strength is on the one hand an indicator as to “how much<br />

information [. . . ] predicate p provides about the conceptual class of its argument” ([1]:<br />

p. 53). On the other hand, it is used for normalising the selectional preference values<br />

of a predicate, in order to be able to compare the values of different predicates: a predicate<br />

that is generally weak in showing preferences will thus receive a higher value<br />

if it really shows a preference for a particular conceptual argument class. Selectional<br />

preference for a particular conceptual class is calculated in the form of selectional association<br />

A R (p, c) between p and the class c of its argument. Its definition is given below.<br />

SELECTIONAL ASSOCIATION:<br />

A_R(p, c) = (1 / S_R(p)) · Pr(c|p) · log ( Pr(c|p) / Pr(c) )<br />
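These two definitions can be implemented directly from co-occurrence counts. The following sketch (in Python; the function and variable names are ours, not from the paper) assumes that joint counts of (predicate, class) pairs for one grammatical relation R have already been collected:<br />

```python
from collections import defaultdict
from math import log

def preference_model(pair_counts):
    """Compute Resnik's selectional preference strength S_R(p) and
    selectional association A_R(p, c) from joint counts of
    (predicate, argument class) pairs for one grammatical relation R.

    pair_counts: dict mapping (p, c) -> frequency (possibly fractional,
    after distributing counts over a word's classes)."""
    total = sum(pair_counts.values())
    p_totals = defaultdict(float)   # marginal counts per predicate
    c_totals = defaultdict(float)   # marginal counts per class
    for (p, c), n in pair_counts.items():
        p_totals[p] += n
        c_totals[c] += n

    strength = defaultdict(float)   # S_R(p): sum of the KL terms
    assoc = {}                      # A_R(p, c), unnormalised at first
    for (p, c), n in pair_counts.items():
        pr_c = c_totals[c] / total          # prior Pr(c)
        pr_c_given_p = n / p_totals[p]      # posterior Pr(c|p)
        term = pr_c_given_p * log(pr_c_given_p / pr_c)
        strength[p] += term
        assoc[(p, c)] = term
    # normalise each association by the predicate's preference strength
    # (assumes S_R(p) > 0, i.e. the posterior differs from the prior)
    for pair in assoc:
        assoc[pair] /= strength[pair[0]]
    return strength, assoc
```

By construction the normalised associations of a predicate sum to 1, which is what makes values comparable across predicates of different overall preference strength.<br />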

As Resnik points out, the fact that text corpora have usually not been annotated<br />

with explicit and unambiguous conceptual classes requires some sort of distribution<br />

of frequencies among the possible conceptual classes of a word. The following formula<br />

calculates the frequency of predicate p and class c.<br />

FREQUENCY PROPAGATION:<br />

freq_R(p, c) ≈ Σ_{w ∈ c} count_R(p, w) / classes(w)<br />

This means that the actual frequency count of a word w, which stands in relation R<br />

(e.g. verb-object) to p, is distributed equally among the classes c which w is a member<br />

of. In a hierarchical resource such as SUMO, this also has the effect of propagating the<br />

freq value up the hierarchy: if w is a member of class c, it is, of course, also a member<br />

of the superclasses of c, and thus freq(p, c) is also added to all the superclasses of c.<br />
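A direct way to realise this distribution and upward propagation is sketched below (our own illustration; `classes_of` and `superclasses_of` stand for lookups into the word-class membership lists and the class hierarchy, which here are simply passed in as functions):<br />

```python
from collections import defaultdict

def propagate_frequencies(arg_counts, classes_of, superclasses_of):
    """Distribute argument counts over conceptual classes and propagate
    them up the hierarchy, following the FREQUENCY PROPAGATION formula.

    arg_counts:      dict word -> count_R(p, w) for one predicate p
    classes_of:      function word -> list of classes w belongs to
    superclasses_of: function class -> list of direct superclasses
    """
    freq = defaultdict(float)
    for w, n in arg_counts.items():
        classes = classes_of(w)
        if not classes:
            continue
        share = n / len(classes)        # divide by classes(w)
        for c in classes:
            _add_upwards(freq, c, share, superclasses_of)
    return freq

def _add_upwards(freq, c, amount, superclasses_of):
    # add the share to c and, recursively, to all of its superclasses
    freq[c] += amount
    for s in superclasses_of(c):
        _add_upwards(freq, s, amount, superclasses_of)
```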

3 Mapping EuroWordNet to SUMO<br />

In this section, we will present how the French EuroWordNet has been mapped onto<br />

conceptual classes of the Suggested Upper Merged Ontology. The general methodology<br />

of creating the mapping to the French EuroWordNet is described in the following


Extraction of Selectional Preferences for French... 431<br />

subsection. Although a quite recent mapping to version 3.0 of WordNet exists, we<br />

decided to use the very first mapping – namely that of WordNet version 1.6 – as a starting<br />

point. The reasons for doing so mainly concern the sensemaps between different versions<br />

of WordNet, and are explained in detail in Section 3.2 below.<br />

3.1 General Methodology<br />

As was just mentioned, we use the mapping of SUMO to version 1.6 of WordNet in order<br />

to link the French EuroWordNet to SUMO. The French EWN itself – as is the case with<br />

all EuroWordNets – is linked to the Inter-Lingual-Index, a set of concepts that is intended<br />

to be largely language-independent (cf. [7]). A crucial prerequisite for our approach<br />

to function is that the identifiers of entities in the Inter-Lingual-Index correspond to<br />

synset identifiers in version 1.5 of WordNet. For example, entity 00058624-n of<br />

the Inter-Lingual-Index, which is glossed by “the launching of a rocket under its own<br />

power”, corresponds to synset {décollage_1,lancement_d’une_fusée_1} in<br />

the French EWN and to {blastoff_1,rocket_firing_1,rocket_launching_1,shoot_1}<br />

in WN1.5. Starting from these observations, i.e. the mapping of<br />

SUMO to WN1.6 and the linking of the French EWN to the Inter-Lingual-Index (≈<br />

WN1.5), the only remaining task is to move from WN1.5 to WN1.6. In order<br />

to do this, we can avail ourselves of the sensemap files that came with the 1.6 release<br />

of WordNet, which indicate the changes from WN1.5 to WN1.6. Ignoring particular<br />

mapping issues for the moment (see Section 3.2 below), the resulting EuroWordNet<br />

entries look like the one shown in Figure 1. The structure is based on the format suggested<br />

by the Global WordNet Grid. The whole mapping process is summarised in Figure 2.<br />
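Conceptually, the whole chain amounts to composing three lookup tables. The sketch below (with hypothetical table names and toy data, not the actual resource files) illustrates the composition; the SUMO assignments used in the test are illustrative only:<br />

```python
def ewn_synset_to_sumo(ewn_id, ewn_to_ili, sensemap_15_to_16, wn16_to_sumo):
    """Follow the mapping chain:
    French EWN synset -> ILI (= WN1.5 synset id) -> WN1.6 synset id(s)
    -> SUMO class(es). A WN1.5 synset may have been split in WN1.6, so
    the result is a list of (SUMO class, relation) pairs."""
    ili_id = ewn_to_ili[ewn_id]                   # EWN -> Inter-Lingual-Index
    wn16_ids = sensemap_15_to_16.get(ili_id, [])  # WN1.5 -> WN1.6 (1..n ids)
    mappings = []
    for wn16_id in wn16_ids:
        sumo_class, relation = wn16_to_sumo[wn16_id]  # e.g. ("Shooting", "=")
        if (sumo_class, relation) not in mappings:    # keep distinct classes
            mappings.append((sumo_class, relation))
    return mappings
```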

3.2 WordNet Sensemaps<br />

Whenever updates of WordNet are released, the updated version comes with files<br />

that, among others, indicate changes in the structure of the synsets. For example,<br />

synset 00058624-n from above has been split in the step from WN1.5 to WN1.6:<br />

{shoot_1} is now a member of synset 00078261-n, {blastoff_1} of synset<br />

00065319-n, and {rocket_firing_1, rocket_launching_1} of synset<br />

00065148-n. Therefore, version 1.6 contains new synsets that did not exist in WN1.5,<br />

and further cases in which a synset is reorganised such that some of its items belong to<br />

different synsets in the updated version. The primary problem for the task of mapping<br />

such instances comes from the fact that individual members of a synset do not have<br />

unique identifiers themselves, but only the synset as a whole 3 . Therefore, when a synset<br />

has been split, it is not possible to automatically determine the correct position at which<br />

the synset has to be split in a different language, or even whether it has to be split at all.<br />

Moreover, each update comes with a large number of such changes, and therefore using<br />

3 This is, of course, not a problem of the WordNet approach, but rather of the fact that there is no<br />

one-to-one mapping between languages.



the most recent mapping between SUMO and WN3.0, which is without a doubt desirable,<br />

would multiply the inaccuracies in the mapping right from the start. Just imagine a case<br />

where a synset has been split e.g. from WN1.5 to WN1.6, and the new synset is then<br />

split again when going to WN1.7, and so on.<br />

[XML entry; recoverable content: part of speech n; literals organisme (sense 1), forme de vie (sense 1), être (sense 2), vie (sense 11); synset identifiers 00002728-n and 00002403-n; SUMO class Organism with mapping relation ’=’; gloss “any living entity”]<br />

Fig. 1. Sample EuroWordNet entry of synset 00002728-n ({organisme_1, forme_de_vie_1, être_2, vie_11}) after the mapping<br />

The decision that was made for cases like these is to assign to the original synset two<br />

(or more if necessary) SUMO classes: first the one that has been mapped to this synset,<br />

and second the ones to which the new (or relevant existing) synsets have been mapped in<br />

WN1.6. The justification of this decision is based on the assumption that on a level as<br />

abstract as that of SUMO conceptual classes, a “slight” reorganisation of the synsets and<br />

some of their items should not lead to significant conceptual clashes, as this would imply<br />

that grave errors had been made when putting the respective senses into one synset in<br />

the first place. In Figure 3 below, which depicts the entry of synset 00058624-n after<br />

the mapping, we see that the two SUMO classes that have been assigned to this synset<br />

do at least remotely fit the senses: more specific than Impelling and Motion, and<br />

equivalent to Shooting. Of course, a qualitative evaluation is needed to determine the



[Diagram; recoverable content: the French EuroWordNet 1.0 is linked to the Inter-Lingual-Index by the EWN-ILI mapping (Vossen, 1998); the sense mapping from WordNet 1.5 to 1.6 links the Inter-Lingual-Index to Princeton WordNet 1.6; the WordNet-SUMO mapping (Niles and Pease, 2003) links Princeton WordNet 1.6 to SUMO; composing these links yields the new EWN-SUMO mapping]<br />

Fig. 2. Process of mapping the French EuroWordNet to SUMO (clockwise from top left)<br />

degree of inaccuracy that is introduced. However, such an evaluation would rely heavily on manual inspection and has therefore not yet been carried out.<br />

4 Extraction of Selectional Preferences<br />

4.1 Corpus extraction<br />

The (potential) nominal arguments of the verbal predicates have been extracted from a<br />

portion of more than 350 million tokens from the French Agence France-Presse corpus<br />

licensed by the Linguistic Data Consortium 4 . The corpus has been part-of-speech tagged<br />

using the French TreeTagger parameter files [12] and has been stored in the widely-used<br />

Corpus Workbench format [13]. Figure 4 below shows the CQP query that extracted<br />

potential direct objects of ’manger’.<br />

We have decided to use a quite rigid syntactic structure, and therefore the query<br />

contains both the potential subject and direct object although only one of them is focussed<br />

on at a time. Lines 1-4 in Figure 4 represent the subject position – with the head of the<br />

subject noun phrase at the end of line 1 –, and the direct object is described in lines<br />

10-12. The verbal predicate, in this case ’manger’, is shown in line 6. The results of this<br />

query, when grouped e.g. by object, look like the following (see Table 1).<br />

4 http://www.ldc.upenn.edu/



[XML entry (excerpt); recoverable content: synset identifiers 00058624-n and 00058381-n; SUMO classes Impelling (relation ’+’), Motion (relation ’+’) and Shooting (relation ’=’)]<br />

Fig. 3. Part of the EuroWordNet entry of synset 00058624-n ({décollage_1, lancement_d’une_fusée_1}) after the mapping<br />

1 [pos="DET:(ART|POS)"]? [pos="AD(V|J)"]{0,3} [pos="N(A|O)M"]<br />
2 [pos="DET:(ART|POS)"]? [pos="AD(J|V)"]{0,3}<br />
3 ([pos="PRP.*"] [pos="DET:(ART|POS)"]? [pos="AD(V|J)"]?<br />
4 [pos="N(A|O)M"] [pos="AD(V|J)"]?){0,3}<br />
5 [pos="VER.*"]{0,2} [pos="ADV"]{0,2} [lemma="avoir|faire"]?<br />
6 [lemma="manger" & pos!="VER:ppre"]<br />
7 [pos="ADV" & lemma!="que"]{0,2}<br />
8 [pos="DET:(ART|POS)"]? [pos="AD(V|J)" & lemma!="que"]{0,3}<br />
9 [pos="NUM"]?<br />
10 [pos="N(A|O)M" & lemma!="(lundi|mardi|mercredi|jeudi|<br />
11 vendredi|samedi|dimanche|janvier|février|mars|avril|mai|<br />
12 juin|juillet|août|septembre|octobre|novembre|décembre)"];<br />

Fig. 4. CQP query extracting direct objects of ’manger’<br />



Table 1. Results of the query in Figure 4 after grouping by object<br />

Word Frequency Word Frequency<br />

pain (’bread’) 16 revenu (’revenue’) 2<br />

enfant (’child’) 8 chose (’thing’) 2<br />

plat (’dish’) 4 nourriture (’nutrition’) 2<br />

glace (’ice’) 4 partie (’part’) 2<br />

poisson (’fish’) 3 méchoui (≈ “Arabian dish”) 1<br />

chapeau (’hat’) 3 victuaille (’comestible’) 1<br />

cœur (’heart’) 3 pélican (’pelican’) 1<br />

steak (’steak’) 2 vipère (’viper’) 1<br />

poussin (’poult’) 2 raisin (’grape’) 1<br />

abat (’innards’) 2 cervelle (’brains’) 1<br />

singe (’monkey’) 2 sandwich (’sandwich’) 1<br />

soupe (’soup’) 2 hamburger (’hamburger’) 1<br />

feuille (’leaf’) 2 christmas (’christmas’) 1<br />

4.2 Storage and Retrieval<br />

Before we calculated the selectional preferences, we converted the file containing the<br />

SUMO-EuroWordNet mappings to OWL (Web Ontology Language; cf. [14]). We have<br />

further created a “class only” OWL version of SUMO based on the XML version of<br />

SUMO that is distributed with the KSMSA ontology browser 5 (version 1.0.9.1.1). The<br />

reasons for not using the OWL version available from the SUMO project site 6 are (i)<br />

that it is difficult to process by the “standard” ontology editing tool Protégé [15], which<br />

is mainly due to the fact that SUMO was originally written in the far more expressive<br />

Suggested Upper Ontology Knowledge Interchange Format (SUO-KIF 7 ) and contains,<br />

e.g., entities which are one-place predicates and two-place predicates at the same time<br />

and therefore occur in both class and property hierarchies, and (ii) that processing a<br />

class hierarchy for frequency propagation is far more straightforward and intuitive than<br />

processing a mixed hierarchy (see below). Therefore, if a synset had been mapped onto a<br />

SUMO concept in an instance relation (cf. ’@’ in Section 2.1 above), it was still created<br />

as an OWL class with a subclass relation to the SUMO concept. We believe that the<br />

cognitive differences between the instantiation and hypernymy relations (cf. [8]) can be<br />

neglected for this purpose. The two files (SUMO and the EWN-SUMO mapping) were<br />

then stored as an RDFS database in the Sesame Framework [16]. The main reason for<br />

doing all this is that we thus have the benefit of using OWL’s – and of course RDF’s –<br />

built-in subsumption and inheritance mechanism, which is very advantageous since the<br />

frequencies for the calculation of selectional preferences have to be propagated along the<br />

5 http://virtual.cvut.cz/ksmsaWeb/browser/title/<br />

6 http://www.ontologyportal.org/<br />

7 http://suo.ieee.org/SUO/KIF/suo-kif.html



hierarchy (cf. Section 2.2). A further benefit is that we are thus able to use the Protégé<br />

OWL API 8 and the Sesame API 9 in order to perform the propagation of frequencies.<br />

4.3 Calculation of Selectional Preferences<br />

In order to calculate the selectional association between the verbal predicate and its<br />

arguments, it is necessary to first calculate prior values, i.e. to propagate the frequencies<br />

of all arguments irrespective of the verbal predicate up the SUMO hierarchy. For each<br />

word in the list (cf. Table 1), the synsets it belongs to are looked up in the database.<br />

If it belongs to more than one synset, which is typically the case, then its frequency<br />

is divided by the number of readings (cf. Section 2.2 above). After that, for each of<br />

these synsets, first its equivalent SUMO classes are extracted, and then the frequency is<br />

propagated up the hierarchy along the direct superclass relationship. In case of multiple<br />

inheritance, i.e. one class having more than one direct superclass, the frequency is divided<br />

by the number of direct superclasses, similar to what has already been explained for the<br />

different readings of a word. The result is a structure in which every SUMO class has an<br />

associated prior value. As was already mentioned in Section 2.2, the same is done in<br />

order to determine the posterior values for the words occurring as arguments of a given<br />

verbal predicate, and these are then compared with the prior values. Section 5.2 below<br />

shows and discusses results for two French verbal predicates.<br />
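The propagation just described can be sketched as follows (our illustration; note that, unlike the simple propagation of Section 2.2, frequency mass is here split over a word's readings and, in case of multiple inheritance, over the direct superclasses; splitting equally over a synset's several SUMO classes is our own simplifying assumption):<br />

```python
from collections import defaultdict

def propagate_priors(word_counts, synsets_of, sumo_classes_of, direct_supers):
    """Propagate argument frequencies up the SUMO class hierarchy:
    counts are split over a word's readings, then over its equivalent
    SUMO classes, and at each step of the upward walk over the direct
    superclasses."""
    prior = defaultdict(float)

    def walk_up(cls, amount):
        prior[cls] += amount
        supers = direct_supers(cls)
        if supers:
            share = amount / len(supers)   # split on multiple inheritance
            for s in supers:
                walk_up(s, share)

    for word, n in word_counts.items():
        synsets = synsets_of(word)
        if not synsets:
            continue
        per_reading = n / len(synsets)     # split over the word's readings
        for synset in synsets:
            classes = sumo_classes_of(synset)
            for cls in classes:
                walk_up(cls, per_reading / len(classes))
    return prior
```

Running the same procedure on the arguments of a single predicate yields the posterior values that are compared against these priors.<br />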

5 Evaluation<br />

5.1 Evaluation of SUMO Mapping<br />

Table 2 below shows the results of the mapping procedure. Lines 1-3 in the table display<br />

the total number of synsets in the French EuroWordNet, as well as the numbers of those<br />

which have or have not received a SUMO mapping. Of those 22,351 synsets which have<br />

been assigned a SUMO class (cf. lines 4-6), 98.54% have been assigned exactly one class,<br />

whereas 0.96% have been mapped to two and 0.50% to three or more SUMO classes 10 .<br />

In line 8 we see that almost 55% of the synsets that have been assigned SUMO classes<br />

occurred in multiple sensemaps, but were all mapped onto synsets belonging to the same<br />

SUMO class, while only 1.46% were mapped onto two or more SUMO classes (cf. line<br />

9). This means that only 1.46% are in principle able to cause “conceptual clashes” when<br />

retaining the strategy presented in Section 3.2. Table 3 displays the 20 most frequent<br />

SUMO classes that have been mapped to synsets in the French EuroWordNet.<br />

8 http://protege.stanford.edu/plugins/owl/api/<br />

9 http://www.openrdf.org/doc/sesame/users/ch07.html<br />

10 One synset even received 16 SUMO classes. This was due to the fact that the English synset<br />

contained the highly polysemous ’cut’, which was split into 26 new synsets in the step from<br />

WN1.5 to WN1.6. The fact that this number is reduced to 16 mappings shows that many of<br />

them are still covered by the same conceptual class in SUMO.



Table 2. Number of SUMO mappings according to different types<br />

Type<br />

Frequency<br />

abs rel<br />

1 Synsets in French EWN 22,745 100.00%<br />

2 . . . with SUMO mapping 22,351 98.27%<br />

3 . . . without SUMO mapping 394 1.73%<br />

Of those with SUMO mapping<br />

4 . . . with one mapping 22,026 98.54%<br />

5 . . . with two mappings 214 0.96%<br />

6 . . . with three or more mappings 111 0.50%<br />

7 . . . with only one sensemap 9,739 43.57%<br />

8 . . . with more than one sensemap 12,287 54.97%<br />

but only one SUMO class<br />

9 . . . with more than one sensemap 325 1.46%<br />

and more than one SUMO class<br />

Table 3. Distribution of the top 20 assigned SUMO classes<br />

Type Frequency 11<br />

abs rel<br />

SubjectiveAssessmentAttribute 1,293 5.78%<br />

Device 1,088 4.87%<br />

Artifact 689 3.08%<br />

Motion 583 2.61%<br />

OccupationalRole 555 2.48%<br />

Communication 478 2.14%<br />

Human 460 2.06%<br />

Food 441 1.97%<br />

SocialRole 404 1.81%<br />

Process 379 1.70%<br />

IntentionalProcess 361 1.62%<br />

IntentionalPsychologicalProcess 276 1.23%<br />

Text 247 1.11%<br />

City 246 1.10%<br />

StationaryArtifact 243 1.09%<br />

NormativeAttribute 238 1.06%<br />

EmotionalState 227 1.02%<br />

Clothing 223 1.00%<br />

DiseaseOrSyndrome 220 0.98%<br />

FloweringPlant 205 0.92%



5.2 Evaluation of Preference Extraction<br />

In the following, we will discuss the results for the selectional preference extraction of<br />

direct objects of ’lire’ (’read’) and ’manger’ (’eat’). These words were chosen because<br />

we believe them to show strong selectional preferences as far as their direct objects are<br />

concerned. Thus they may serve as proof-of-concept cases for our approach.<br />

The selectional preference strength S_obj(lire), i.e. the preference strength of ’lire’ with respect to the direct object relation (cf. Section 2.2), is 1.37296, whereas S_obj(manger) is 3.46397. This means that ’manger’ generally shows a stronger preference with respect to its direct object than ’lire’. The effect of this is that if ’lire’ shows a preference for a<br />

particular SUMO class, this preference will weigh more than a preference of ’manger’,<br />

since it is generally weaker wrt. preferential behaviour. This is due to the fact that the<br />

value of the selectional association between a predicate p and the class c of its argument<br />

(cf. Section 2.2 above) is normalised by the selectional preference strength of p.<br />

Table 4 shows that ’lire’ has a strong preference for objects of type “Text”, and<br />

further preferences for “ContentBearingPhysical” and “LinguisticExpression”. After<br />

these three items, the figures indicate a bigger gap to the next entity. As far as ’manger’<br />

is concerned, it shows a very strong preference for direct objects of type “Food”, with<br />

the second best class (“SelfConnectedObject”) reaching just over half of the score for<br />

“Food”. Looking at these results, it is fair to say that they do match our intuitions.<br />

6 Conclusion<br />

We have presented a generic method for mapping EuroWordNets to the Suggested Upper<br />

Merged Ontology and have shown its application to the French EuroWordNet. The<br />

mapping procedure builds on existing work on SUMO and version 1.6 of Princeton<br />

WordNet [8], EuroWordNet’s Inter-Lingual-Index [7] and WordNet’s sensemap files.<br />

The resulting mapping was used in the calculation of selectional preferences of French<br />

verbal predicates with respect to nominal arguments. Preference extraction within an<br />

experimental setup shows promising results for the French verbs ’manger’ (’eat’) and<br />

’lire’ (’read’).<br />

In the future, we intend to carry out a qualitative evaluation on a larger scale, both for<br />

the mapping procedure (cf. Section 3.2) and the extraction of selectional preferences. The<br />

ultimate goal is to use the extracted selectional preferences for word sense disambiguation<br />

of verbal predicates as well as their arguments, and we will work on this in the near<br />

future. Moreover, we plan to consider extracting pairs of subjects and objects in order<br />

to calculate preferences of a direct object given the subject and vice versa. Finally, it<br />

would be interesting to see how the mapping methodology performs when applied to<br />

11 The frequency indicates the number of synsets which have been mapped directly onto the<br />

respective SUMO class, so no accumulation of frequency counts along the SUMO hierarchy<br />

was made, since that would, of course, leave the top 20 slots in the table to the top 20 nodes in<br />

the hierarchy. A synset such as 00058624-n (cf. examples above), which has been mapped<br />

onto three different SUMO classes, counts for each of these classes.



EuroWordNets other than French, provided that they are linked to the Inter-Lingual-Index<br />

as well. We do, however, expect our methodology to be generic enough to be applied to<br />

other languages without any major issues.<br />

Table 4. Selectional preferences of ’lire’ (’read’) and ’manger’ (’eat’) wrt. direct objects<br />

SUMO concept c A_obj(lire, c)<br />

Text 0.3868<br />

ContentBearingPhysical 0.2548<br />

LinguisticExpression 0.2431<br />

Disseminating 0.1259<br />

Communication 0.1083<br />

Stating 0.0840<br />

Word 0.0611<br />

Noun 0.0601<br />

Artifact 0.0582<br />

ContentBearingProcess 0.0541<br />

OccupationalRole 0.0498<br />

LinguisticCommunication 0.0409<br />

CorpuscularObject 0.0377<br />

name 0.0368<br />

Book 0.0311<br />

Proposition 0.0295<br />

SelfConnectedObject 0.0285<br />

Physical 0.0225<br />

FamilyGroup 0.0218<br />

destination 0.0156<br />

SUMO concept c A_obj(manger, c)<br />

Food 0.2179<br />

SelfConnectedObject 0.1253<br />

NonFullyFormed 0.0779<br />

Object 0.0745<br />

DevelopmentalAttribute 0.0662<br />

Meat 0.0599<br />

Animal 0.0453<br />

Organism 0.0403<br />

OrganicObject 0.0402<br />

Vertebrate 0.0382<br />

FruitOrVegetable 0.0368<br />

WarmBloodedVertebrate 0.0295<br />

Arachnid 0.0258<br />

Monkey 0.0239<br />

BodySubstance 0.0222<br />

Mammal 0.0216<br />

AnatomicalStructure 0.0201<br />

Fish 0.0190<br />

CorpuscularObject 0.0187<br />

PlantAnatomicalStructure 0.0184<br />

Acknowledgements<br />

The research described in this work has been carried out as part of the project ’Polysemy<br />

in a Conceptual System’ (project B5 of SFB 732) and was funded by grants from the<br />

German Research Foundation. I should like to thank Adam Pease, Achim Stein, Piek<br />

Vossen, Sabine Schulte im Walde and Christian Hying for their valuable comments<br />

and suggestions at the outset of this work, as well as the two anonymous reviewers for<br />

helping to improve the structure and content of the paper.<br />

References<br />

1. Resnik, P.: Selectional preference and sense disambiguation. In: Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, DC (1997) 52–57<br />

2. Li, H., Abe, N.: Generalizing Case Frames using a Thesaurus and the MDL Principle.<br />

Computational Linguistics 24(2) (1998) 217–244<br />

3. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In Welty, C., Smith, B., eds.:<br />

Proceedings of the 2nd International Conference on Formal Ontology in Information Systems<br />

(FOIS-2001), Ogunquit, ME (2001)<br />

4. Vossen, P., ed.: EuroWordNet: A Multilingual Database with Lexical Semantic Networks.<br />

Kluwer Academic Publishers (1998)<br />

5. Fellbaum, C., ed.: WordNet: An Electronic Lexical Database. MIT Press (1998)<br />

6. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of<br />

the ACL/COLING, Montreal (1998)<br />

7. Vossen, P., Bloksma, L., Rodriguez, H., Climent, S., Calzolari, N., Roventini, A., Bertagna, F.,<br />

Alonge, A., Peters, W.: The EuroWordNet Base Concepts and Top Ontology. (1998)<br />

8. Niles, I., Pease, A.: Linking Lexicons and Ontologies: Mapping WordNet to the Suggested<br />

Upper Merged Ontology. In: Proceedings of the 2003 International Conference on Information<br />

and Knowledge Engineering (IKE ’03), Las Vegas, NV (2003)<br />

9. Scheffczyk, J., Pease, A., Ellsworth, M.: Linking FrameNet to the SUMO Ontology. In:<br />

Proceedings of the 4th International Conference on Formal Ontology in Information Systems<br />

(FOIS-2006), Baltimore, MD (2006)<br />

10. Schulte im Walde, S.: The Induction of Verb Frames and Verb Classes from Corpora. In<br />

Lüdeling, A., Kytö, M., eds.: Corpus Linguistics. An International Handbook. Handbooks of<br />

Linguistics and Communication Science. Mouton de Gruyter, Berlin (To appear)<br />

11. Kullback, S., Leibler, R.A.: On information and sufficiency. Annals of Mathematical Statistics<br />

22 (1951) 79–86<br />

12. Stein, A., Schmid, H.: Etiquetage morphologique de textes français avec un arbre de décisions.<br />

Traitement automatique des langues 36(1-2) (1995) 23–35<br />

13. Christ, O.: A modular and flexible architecture for an integrated corpus query system. In:<br />

Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX<br />

’94), Budapest (1994)<br />

14. Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider,<br />

P.F., Stein, L.A.: OWL Web Ontology Language Reference. Technical report (2004)<br />

15. Knublauch, H., Musen, M.A., Rector, A.L.: Editing description logic ontologies with the<br />

Protégé OWL plugin. In: Proceedings of DL 2004, Whistler, BC (2004)<br />

16. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for storing<br />

and querying RDF and RDF Schema. In: Proceedings of the 1st International Semantic Web<br />

Conference (ISWC ’02), Sardinia (2002)


Romanian WordNet: Current State, New Applications<br />

and Prospects<br />

Dan Tufiş, Radu Ion, Luigi Bozianu, Alexandru Ceauşu, and Dan Ştefănescu<br />

Romanian Academy Research Institute for Artificial Intelligence<br />

13, Calea 13 Septembrie, 050711, Bucharest 5, Romania<br />

{tufis, radu, bozi, aceausu, danstef}@racai.ro<br />

1 Introduction<br />

The development of the Romanian WordNet began in 2001 within the framework of<br />

the European project BalkaNet which aimed at building core WordNets for 5 new<br />

Balkan languages: Bulgarian, Greek, Romanian, Serbian and Turkish. The philosophy<br />

of the BalkaNet architecture was similar to EuroWordNet [1, 2]. As in EuroWordNet,<br />

in BalkaNet the concepts considered highly relevant for the Balkan languages (and<br />

not only) were identified and called BalkaNet Base Concepts. These are classified in<br />

three increasing size sets (BCS1, BCS2 and BCS3). Altogether BCS1, BCS2 and<br />

BCS3 contain 8516 concepts that were lexicalized in each of the BalkaNet WordNets.<br />

The monolingual WordNets had to have their synsets aligned to the translation<br />

equivalent synsets of the Princeton WordNet (PWN). The BCS1, BCS2 and BCS3<br />

were adopted as core WordNets for several other WordNet projects such as Hungarian<br />

[3], Slovene [4], Arabic [5, 6], and many others.<br />

At the end of the BalkaNet project (August 2004) the Romanian WordNet contained almost 18,000 synsets, conceptually aligned to Princeton WordNet 2.0 and<br />

through it to the synsets of all the BalkaNet WordNets. In [7], a detailed account is given of the status of the core Ro-WordNet as well as of the tools we used for its development.<br />

After the BalkaNet project ended we, like many other project partners, continued to update the Romanian WordNet, and here we describe its latest developments and a<br />

few of the projects in which Ro-WordNet, Princeton WordNet or some of its<br />

BalkaNet companions were of crucial importance.<br />

2 The Ongoing Ro-WordNet Project and its Current Status<br />

The Ro-WordNet is an ongoing effort, in progress for six years now and likely to continue for several more. However, due to the development methodology<br />

adopted in BalkaNet project, the intermediate WordNets could be used in various<br />

other projects (word sense disambiguation, word alignment, bilingual lexical<br />

knowledge acquisition, multilingual collocation extraction, cross-lingual question<br />

answering, machine translation etc.).



Recently we started the development of an English-Romanian MT system for the legal language of the type contained in the JRC-Acquis multilingual parallel corpus [8]<br />

and of a cross-lingual question answering system in open domains [9, 10]. For these<br />

projects, heavily relying on the aligned Ro-En WordNets, we extracted a series of<br />

high frequency Romanian nouns and verbs not present in Ro-WordNet but occurring<br />

in JRC-Acquis corpus and in the Romanian pages of Wikipedia and proceeded at their<br />

incorporation in Ro-WordNet. The methodology and tools were essentially the same<br />

as described in [11], except that the dictionaries embedded into the WNBuilder and<br />

WNCorrect were significantly enlarged.<br />

The two basic development principles of the BalkaNet methodology, namely the Hierarchy Preservation Principle (HPP) and the Conceptual Density Principle (CDP), were strictly observed. For the sake of self-containment, we restate them here.<br />

Hierarchy Preservation Principle<br />

If in the hierarchy of language L1 the synset M2 is a hyponym of synset M1 (M2 H^m M1), and the translation equivalents in L2 for M1 and M2 are N1 and N2 respectively, then in the hierarchy of language L2, N2 should be a hyponym of synset N1 (N2 H^n N1). Here H^m and H^n represent chains of m and n hierarchical relations between the respective synsets (compositions of hypernymy relations).<br />

Conceptual Density Principle (noun and verb synsets)<br />

Once a nominal or verbal concept (i.e. an ILI concept that in PWN is<br />

realized as a synset of nouns or as a synset of verbs) was selected to be<br />

included in Ro-WordNet, all its direct and indirect ancestors (i.e. all ILI<br />

concepts corresponding to the PWN synsets, up to the top of the hierarchies)<br />

should also be included in Ro-WordNet.<br />

By observing the HPP, the lexicographers were relieved of the task of establishing the semantic relations for the synsets of the Ro-WordNet. The hypernym relations as well<br />

as the other semantic relations were imported automatically from the PWN. Compliance with the CDP ensures that no dangling synsets, which would be harmful for taxonomic reasoning, are created.<br />
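The CDP requirement can be enforced mechanically by closing the selected concept set under the hypernymy relation. A minimal sketch (our own illustration; it assumes each concept has at most one direct hypernym, stored in a dictionary, and the concept names are toy data):<br />

```python
def cdp_closure(selected, hypernym_of):
    """Return the set of ILI concepts that must be included in the
    wordnet for the Conceptual Density Principle to hold: every
    selected concept plus all of its direct and indirect ancestors."""
    required = set()
    for concept in selected:
        node = concept
        while node is not None and node not in required:
            required.add(node)
            node = hypernym_of.get(node)   # None at the top of a hierarchy
    return required
```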

The tables below give a quantitative summary of the Romanian WordNet at the time of writing (September 2007). As these statistics change every month, updated information should be checked at http://nlp.racai.ro/Ro-wordnet.statistics. The Ro-WordNet is currently mapped onto several versions of Princeton WordNet: PWN1.7.1, PWN2.0 and PWN2.1. A mapping onto the latest version, PWN3.0, is also under consideration. However, all our current projects are based on the PWN2.0 mapping and, in the following, unless stated otherwise, by PWN we mean PWN2.0.


Romanian WordNet: Current State, New Applications and Prospects 443<br />

Table 1. POS distribution of the synsets in the Romanian WordNet.

Noun synsets   Verb synsets   Adj. synsets   Adv. synsets   Total
33151          8929           851            834            43765

Table 2. Internal relations used in the Romanian WordNet.

hypernym           42794    category_domain         2668
near_antonym        2438    also_see                 586
holo_part           3531    subevent                 335
similar_to           899    holo_portion             327
verb_group          1404    causes                   171
holo_member         1300    be_in_state              570
DOMAINS classes      165    SUMO&MILO categories    1836
objective synsets  34164    subjective synsets      9601

As one can see from Table 2, the synsets in Ro-WordNet have attached, via PWN,<br />

DOMAINS-3.1 [12], SUMO&MILO [13, 14] and SentiWordNet [15] labels.<br />

The DOMAINS labeling (http://wndomains.itc.it/) uses Dewey Decimal<br />

Classification codes and the 115425 PWN synsets are classified into 168 distinct<br />

classes (domains).<br />

The SUMO&MILO upper and mid-level ontology is the largest freely available ontology today (http://www.ontologyportal.org/). It is accompanied by more than 20 domain ontologies, which altogether contain about 20,000 concepts and 60,000 axioms. These are formally defined and do not depend on a particular application. Their attractiveness for the NLP community comes from the fact that SUMO, MILO and the associated domain ontologies were mapped onto Princeton WordNet. SUMO and MILO contain 1107 and 1582 concepts, respectively. Out of these, 844 SUMO concepts and 1582 MILO concepts were used to label almost all the synsets in PWN. Additionally, 215 concepts from specific domain ontologies were used to label the rest of the synsets in PWN (instances).

SentiWordNet [15] adds subjectivity annotations to the PWN synsets. Its basic assumptions are that words have graded polarities along the orthogonal Subjective-Objective (SO) and Positive-Negative (PN) axes, and that the SO and PN polarities depend on the various senses of a given word (context). The word senses in a synset are associated with a triple P (positive subjectivity), N (negative subjectivity) and O (objective), such that the values of these attributes sum to 1. For instance, sense 2 of the word nightmare (a terrifying or deeply upsetting dream) is marked up with the values P:0.0, N:0.25 and O:0.75, signifying that the word denotes to a large extent an objective thing with a definite negative subjective polarity.
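A minimal sketch of that triple convention (the function name and the dominance rule are our own illustration, not part of SentiWordNet):

```python
def polarity(p, n, o, eps=1e-9):
    """Classify a (P, N, O) subjectivity triple; the three values
    must sum to 1, as in SentiWordNet."""
    assert abs(p + n + o - 1.0) < eps, "triple must sum to 1"
    if o >= 0.5:
        return "objective"
    return "positive" if p > n else "negative"

# The nightmare(2) triple from the text: largely objective,
# with a definite negative subjective component.
print(polarity(0.0, 0.25, 0.75))  # -> objective
```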

Due to the BalkaNet methodology adopted for the development of the monolingual WordNets, most of the DOMAINS, SUMO and MILO conceptual labels in PWN are represented in our Ro-WordNet (see Table 3).


444 Dan Tufiş, Radu Ion, Luigi Bozianu, Alexandru Ceauşu, and Dan Ştefănescu<br />

Table 3. The ontological labeling (DOMAINS, SUMO, MILO, etc.) in Ro-WordNet vs. PWN.

LABELS              PWN    Ro-WordNet
DOMAINS-3.1         168    165
SUMO                844    781
MILO                949    882
Domain ontologies   215    173

The BalkaNet-compliant XML encoding of a synset, including the new subjectivity annotations, is exemplified in Figure 1.

<SYNSET>
  <ID>ENG20-05435872-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>nightmare<SENSE>2</SENSE></LITERAL>
  </SYNONYM>
  <ILR>ENG20-05435381-n<TYPE>hypernym</TYPE></ILR>
  <DEF>a terrifying or deeply upsetting dream</DEF>
  <SUMO>PsychologicalProcess<TYPE>+</TYPE></SUMO>
  <DOMAIN>factotum</DOMAIN>
  <SENTIWN><P>0.0</P><N>0.25</N><O>0.75</O></SENTIWN>
</SYNSET>

<SYNSET>
  <ID>ENG20-05435872-n</ID>
  <POS>n</POS>
  <SYNONYM>
    <LITERAL>coşmar<SENSE>1</SENSE></LITERAL>
  </SYNONYM>
  <DEF>Vis urât, cu senzaţii de apăsare şi de înăbuşire</DEF>
  <ILR>ENG20-05435381-n<TYPE>hypernym</TYPE></ILR>
  <DOMAIN>factotum</DOMAIN>
  <SUMO>PsychologicalProcess<TYPE>+</TYPE></SUMO>
  <SENTIWN><P>0.0</P><N>0.25</N><O>0.75</O></SENTIWN>
</SYNSET>

Fig. 1. Encoding of two EQ-synonym synsets in PWN and Ro-WordNet.<br />

The visualization of the synsets in Figure 1, by means of the VisDic editor (http://nlp.fi.muni.cz/projekty/visdic/) [16], is shown in Figure 2.



Fig. 2. VISDIC synchronized view of PWN and Ro-WordNet.<br />

The Ro-WordNet can be browsed via a web interface implemented in our language web services platform (see Figure 3). Although currently only browsing is implemented, the Ro-WordNet web service will later include search facilities accessible via standard web services technologies (SOAP/WSDL/UDDI), such as the distance between two word senses, translation equivalents for one or more senses, semantically related word senses, etc.

3 Recent applications of the Ro-WordNet<br />

In previous papers [17, 18] we demonstrated that difficult processes such as word sense disambiguation and word alignment of parallel corpora can reach very high accuracy when aligned WordNets are available. Various other researchers have shown the invaluable support of aligned WordNets in improving the quality of machine translation. In this section we discuss some new applications of the Ro-En pair of WordNets, whose performance strongly argues for continuing the Ro-WordNet development effort.



Fig. 3. Web interface to Ro-WordNet browser.<br />

3.1 WordNet as an important resource to monolingual WSD<br />

The WordNet concept has practically revolutionized the way a WSD application is designed. The explicit semantic structure of WordNet enables WSD application writers to use the semantic relations between synsets as a form of primitive reasoning when establishing the senses of the words in a text. The hypernymy relation in particular has provided the much-needed mechanism for generalizing word senses, allowing machine learning methods to be deployed for WSD. It can safely be stated that WordNet has pushed the very nature of WSD algorithms in the direction of true semantic processing.

In [19] we presented an unsupervised WSD algorithm whose disambiguation philosophy is entirely based on the WordNet architecture. The idea of the algorithm is to combine the paradigmatic information provided by WordNet with the contextual information of the word, in both the training and the disambiguation phases. The context of a word is given by its dependency relations with neighboring words (which are not necessarily adjacent). In [20] we introduced the concept of a meaning attraction model as the theoretical basis for our monolingual WSD algorithm.

In the training phase, we estimate the measure of the meaning attraction between dependency-related words of a sentence. Given two dependency-related words W_a and W_b, each with its associated WordNet synset identifiers², the meaning attraction between synset id_i of word W_a and synset id_j of word W_b is a function of the frequency counts of the pairs <id_i, id_j>, <id_i, *> and <*, id_j> collected from the entire training corpus. As meaning attraction functions we chose DICE, Log-Likelihood and Pointwise Mutual Information, all of which can be computed from the pair frequencies described above. Consider for instance the examples in Figure 4.

Fig. 4. Two examples of dependency pairs with the relevant information for learning.<br />

Both “recommended” and “suggested” are in the same synset, with the id 00071572. Also, “class” and “course” are in the same synset, with the id 00831838. This means that the pair <00071572, 00831838> receives a count of 2 from these two examples, as opposed to any other pair from the Cartesian products, which is seen only once. This translates into a preference for the meaning association “mentioned as worthy of acceptance” and “education imparted in a series of lessons or class meetings”, which may not be correct in all contexts but is part of the natural learning bias of the training algorithm.
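A toy sketch of this counting scheme (hypothetical synset ids and counts; the paper's exact normalisations of DICE, Log-Likelihood and PMI may differ):

```python
import math
from collections import Counter

# Dependency-linked synset-id pairs harvested from a toy corpus;
# the two Figure-4 examples contribute the same pair twice.
pairs = [("00071572", "00831838"),
         ("00071572", "00831838"),
         ("00071572", "07654321"),
         ("01234567", "00831838")]

joint = Counter(pairs)
left = Counter(a for a, _ in pairs)     # <id_i, *> counts
right = Counter(b for _, b in pairs)    # <*, id_j> counts
total = len(pairs)

def dice(a, b):
    return 2.0 * joint[(a, b)] / (left[a] + right[b])

def pmi(a, b):
    p_ab = joint[(a, b)] / total
    p_a, p_b = left[a] / total, right[b] / total
    return math.log2(p_ab / (p_a * p_b))

print(round(dice("00071572", "00831838"), 3))  # -> 0.667
```

With these counts the doubly-seen pair clearly outscores the singletons, which is the preference effect described above.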

The synonymy lexical-semantic relation is just the first means of generalization in the learning phase. Another, more powerful one is given by the semantic relations graph encoded in WordNet. In Figure 4 we simply used the synsets' ids for computing frequencies, but their number is far too large to give us reliable counts. We therefore make use of the hypernym hierarchies for nouns and verbs to generalize the meanings that are learned, without introducing ambiguities: for a given synset id we select the uppermost hypernym that subsumes only one meaning of the word.
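The generalization step can be sketched as follows (toy hierarchy and sense labels, purely illustrative):

```python
# For a word with several senses, walk up from a given sense's synset
# and keep the highest ancestor that subsumes no other sense of the
# same word -- generalising counts without introducing ambiguity.
# Toy hierarchy; real code would query the WordNet database.

def chain(s, hyp):
    out = [s]
    while s in hyp:
        s = hyp[s]
        out.append(s)
    return out

def uppermost_unambiguous(sense, other_senses, hyp):
    blocked = set()
    for o in other_senses:
        blocked.update(chain(o, hyp))
    best = sense
    for anc in chain(sense, hyp):
        if anc in blocked:
            break
        best = anc
    return best

hyp = {"bank.n.1": "institution", "institution": "organization",
       "organization": "entity",
       "bank.n.2": "slope", "slope": "entity"}
print(uppermost_unambiguous("bank.n.1", ["bank.n.2"], hyp))
# -> organization
```

Here "entity" would subsume both senses of "bank", so the walk stops one level below it.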

This WSD algorithm recently participated in the 4th Semantic Evaluation Forum, SEMEVAL 2007, on the English All-Words Coarse- and Fine-Grained tasks, where it attained the top performance among the unsupervised systems. Because it is language independent, it has also been applied, with encouraging results, to Romanian using the Romanian WordNet. The test corpus was the Romanian SemCor, a controlled translation of the English version of the corpus. The test set comprised 48392 meaning-annotated content word occurrences, and for different meaning attraction functions and combinations of their results, the best F-measure was 59.269%.

² By synset identifier, we understand the offset of the synset in the WordNet database. Knowing this ID and the word, we can extract the sense number of that word in the respective synset.



3.2 Romanian WordNet and Cross-Language QA<br />

The Romanian WordNet and its translation equivalence links to the Princeton WordNet have been used as a general-purpose translation lexicon in the CLEF 2006 Romanian-to-English question answering track [9]. The task required asking questions in Romanian and finding the answers in an English text collection. For this task, the question analysis (focus/topic identification, answer type, keyword detection, query formulation, etc.) was done in Romanian, while the rest of the process (text searching and answer extraction) was done in English.

Our approach was to generate the query for the text-searching engine in Romanian and then to translate every key element of the query (topic, focus, keywords) into English without modifying the query structure. Since we do not have a Romanian-to-English translation system, and because neither the question nor the text collection was word sense disambiguated, for every key element of the query we selected from the Romanian WordNet all the synsets in which it appeared. Then, for every synset in the latter list, we extracted all English literals of the corresponding English synset, producing a list of all possible translation equivalents for the source Romanian word. Finally, we ordered this list by the frequency of its elements computed from the English text collection and selected the first 3 elements as translation equivalents of the Romanian word. While this translation method does not ensure a correct translation of each source Romanian word of the initial question, it is good enough for the search engine to return a set of documents in which the correct answer can eventually be identified. The evaluation of the recall for the IR part of the QA system [10] was close to 80%; its major weakness lay not in the translation part but in the identification of the keywords subject to translation.
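The selection procedure can be sketched as follows (the dictionaries are hypothetical toy stand-ins for the aligned wordnet databases and for the frequency index of the English collection):

```python
from collections import Counter

# Hypothetical toy data: Ro synsets per word, the Ro->En synset
# alignment, English literals per synset, and target-corpus counts.
ro_synsets = {"revista": ["ro-1", "ro-2"]}
aligned = {"ro-1": "en-1", "ro-2": "en-2"}
en_literals = {"en-1": ["magazine", "journal"],
               "en-2": ["review", "magazine"]}
corpus_freq = Counter(magazine=120, journal=300, review=80)

def translation_equivalents(word, k=3):
    """All English literals of the synsets aligned with the word's
    Romanian synsets, ranked by target-collection frequency."""
    candidates = set()
    for ro_id in ro_synsets.get(word, []):
        candidates.update(en_literals[aligned[ro_id]])
    ranked = sorted(candidates, key=lambda w: -corpus_freq[w])
    return ranked[:k]

print(translation_equivalents("revista"))
# -> ['journal', 'magazine', 'review']
```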

3.3 Machine Translation Development Kit<br />

The aligned Ro-En WordNets have been incorporated into our MT development kit, which comprises tokenization, tagging, chunking, dependency linking, word alignment and WSD based on the word alignment and the respective WordNets. The interface of the MTKit platform allows editing the word alignment and the word sense disambiguation, importing annotations from one language to the other, and a friendly visualization of all the preprocessing steps in both languages. Figure 5 shows a snapshot of the MTKit interface. One can see the word alignment of translation unit no. 16 from a document (nou-jrc42002595) contained in the JRC-Acquis multilingual parallel corpus. The right-hand windows display the morpho-lexical information attached to the word selected in the central window (journal). The upper-right window displays the POS-tag, the lemma, the orthographic form and the WordNet sense number. The windows below it display the relevant WordNet information as well as the SUMO&MILO label pertaining to the corresponding sense number. The lowest right window displays the appropriate WordNet gloss and SUMO documentation.



Fig. 5. MTKit interface.<br />

3.4 Opinion analysis<br />

One of the hottest research topics nowadays is subjectivity web mining, with many applications in opinionated question answering, product review analysis, personal and institutional decision making, etc. The recent release of SentiWordNet [15] has allowed the automatic import of the subjectivity annotations from PWN into any WordNet aligned with it. Thus, it became possible to develop subjectivity analysis programs for various languages equipped with a WordNet aligned to PWN.

We made some preliminary experiments with a naive opinion sentence classifier [21]. It simply sums up the O, P and N scores for each word in a sentence. For the words in the chunks immediately following a valence shifter, until the next valence shifter, the O, P and N scores are modified so that the new values are: O_new = 1 - O_old, P_new = P_old * O_old / (P_old + N_old) and N_new = N_old * O_old / (P_old + N_old). Taking advantage of the 1984 Romanian-English parallel corpus, which is word aligned and word sense disambiguated in both languages, we applied our naive opinion sentence classifier to the English original sentences and their Romanian translations, and OpinionFinder [22] to the English original sentences. Since the WordNet opinion annotations are the same in aligned PWN and Ro-WordNet synsets, it was obvious that our opinion classifier would give similar results for the two languages. So, in the end, we compared the classifications made by our opinion classifier and OpinionFinder on a set of English sentences. From the total of 6411 sentences in the 1984 corpus, 954 were selected for which both internal classifiers of OpinionFinder agreed in judging the respective sentences as subjective (see [22] for details), as in the sentence below:

The stuff was like nitric_acid, and moreover, in swallowing it one had the sensation of being hit on the back of the head with a rubber club.

We manually analyzed the 20 top-certainty sentences from the 954 selected ones, extracted the valence shifters they contained, and dry-ran the naive classifier described above, using the subjectivity values from SentiWordNet. When the O value for a sentence was smaller than 0.5, we arbitrarily decided that it was subjective. All 20 sentences were thus classified as subjective. For the same sentence³, chunked and WSDed as below, the naive opinion classifier computed the following scores: P:0.063; N:0.563; O:0.375.

[The stuff(1)] was(1) like(3) [nitric_acid(1)], and moreover(1), [in swallowing(1) it] one had(1) the sensation(1) of [being hit(4)] [on the back_of_the_head(1)] [with a rubber(1) club(3)].

While the threshold value and the final P, N and O values may be debatable, the main idea is that one can use the SentiWordNet annotation of the synsets in a WordNet for a language L, aligned to PWN, to pursue subjectivity mining in arbitrary texts in language L.
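A sketch of such a naive classifier is given below. The toy lexicon, the shifter list and the sentence-level decision rule (average O below 0.5 means subjective) follow the description above but are otherwise our own simplifications.

```python
SHIFTERS = {"not", "hardly", "like"}   # toy valence-shifter list

def shift(p, n, o):
    """Valence-shifter update: O' = 1 - O_old, with the remaining
    subjective mass split in proportion to the old P and N."""
    if p + n == 0.0:
        return 0.0, 0.0, 1.0 - o
    return p * o / (p + n), n * o / (p + n), 1.0 - o

def classify(words, lexicon, threshold=0.5):
    """Sum (P, N, O) over the sentence, toggling the shifted state
    at each valence shifter; subjective when average O < threshold."""
    tp = tn = to = 0.0
    shifted = False
    for w in words:
        if w in SHIFTERS:
            shifted = not shifted
            continue
        p, n, o = lexicon.get(w, (0.0, 0.0, 1.0))
        if shifted:
            p, n, o = shift(p, n, o)
        tp, tn, to = tp + p, tn + n, to + o
    return "subjective" if to / (tp + tn + to) < threshold else "objective"

lexicon = {"terrifying": (0.0, 0.625, 0.375), "dream": (0.1, 0.1, 0.8)}
print(classify(["a", "terrifying", "dream"], lexicon))  # -> objective
```

Unknown words default to a fully objective (0, 0, 1) triple, which pulls short sentences toward the objective class unless strongly polar words outweigh them.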

4 Conclusions and further work<br />

The development of Ro-WordNet is a continuous project that tries to keep up with new releases of the Princeton WordNet. Its coverage increases steadily (approximately 10,000 synsets per year for the last three years), with the choice of new synsets driven by the applications built on the basis of Ro-WordNet. Since PWN was aimed at covering general language, specific domain applications are very likely to require terms not covered by Princeton WordNet. In such cases, where available, several multilingual thesauri (EUROVOC - http://europa.eu/eurovoc/, IATE - http://iate.europa.eu/iatediff/about_IATE.html, etc.) can complement the use of WordNets. Besides further augmenting the Ro-WordNet, we plan to develop an environment where various multilingual aligned lexical resources (WordNets, framenets, thesauri, parallel corpora) can be used in a consistent but transparent way for a multitude of multilingual applications.

³ The underlined words represent valence shifters, and the square brackets delimit chunks as determined by our chunker; the numbers following the words represent their PWN sense numbers.



Acknowledgements<br />

The work reported here was supported by the Romanian Academy program "Multilingual Acquisition and Use of Lexical Knowledge", the ROTEL project (CEEX No. 29-E136-2005) and the SIR-RESDEC project (PNCDI2, 4th Programme, No. D1.1-0.0.7), the last two granted by the National Authority for Scientific Research. We are grateful to the many colleagues who contributed or continue to contribute to the development of Ro-WordNet, with special mentions for Cătălin Mihăilă, Margareta Manu Magda and Verginica Mititelu.

References<br />

1. Vossen, P. (ed.): A Multilingual Database with Lexical Semantic Networks. Kluwer<br />

Academic Publishers, Dordrecht (1998)<br />

2. Rodriguez, H., Climent, S., Vossen, P., Bloksma, L., Peters, W., Alonge, A., Bertagna, F.,<br />

Roventini, A.: The Top-Down Strategy for Building EuroWordNet: Vocabulary Coverage,<br />

Base Concepts and Top Ontology. J. Computers and the Humanities 32 (2-3), 117-152<br />

(1998)<br />

3. Miháltz, M., Prószéky, G.: Results and evaluation of Hungarian nominal wordnet v1.0. In:<br />

Proceedings of the Second International Wordnet Conference (<strong>GWC</strong> 2004), pp. 175–180.<br />

Masaryk University, Brno (2003)<br />

4. Erjavec, T., Fišer, D.: Building Slovene WordNet. In: Proceedings of the 5th Language Resources and Evaluation Conference, LREC 2006, 22–28 May 2006. Genoa, Italy (2006)

5. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.:<br />

Introducing the Arabic WordNet Project. In: Sojka, P., Choi, K.S., Fellbaum, C., Vossen, P.<br />

(eds.) Proceedings of the third Global Wordnet Conference, Jeju Island, 2006, pp. 295–299<br />

(2006)<br />

6. Elkateb, S., Black, W., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.: Building a WordNet for Arabic. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation. Genoa, Italy (2006)

7. Tufiş D., Cristea, D., Stamou, S.: BalkaNet: Aims, Methods, Results and Perspectives: A<br />

General Overview. J. Romanian Journal on Information Science and Technology, Special<br />

Issue on BalkaNet, Romanian Academy, 7(2-3) (2004a)<br />

8. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D.: The JRC-<br />

Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the<br />

5 th LREC Conference, Genoa, Italy, 22-28 May, 2006, pp. 2142-2147, ISBN 2-9517408-2-4,<br />

EAN 9782951740822 (2006)<br />

9. Puşcasu, G., Iftene, A., Pistol, I., Trandabăţ, D., Tufiş, D., Ceauşu, A., Ştefănescu, D., Ion,<br />

R., Orăşan, C., Dornescu, I., Moruz, A., Cristea, D.: Developing a Question Answering<br />

System for the Romanian-English Track at CLEF 2006. In: Peters, C., Clough, P., Gey, F.C.,<br />

Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) LNCS Lecture<br />

Notes in Computer Science, ISBN: 978-3-540-74998-1, pp. 385–394. Springer-Verlag<br />

(2007)<br />

10. Tufiş, D, Ştefănescu, D., Ion, R., Ceauşu, A.: RACAI’s Question Answering System at<br />

QA@CLEF 2007. CLEF2007 Workshop, p. 15., September, 2007. Budapest, Hungary<br />

(2007)<br />

11. Tufiş, D., Barbu, E., Mititelu, V., Ion, R., Bozianu, L.: The Romanian Wordnet. J.<br />

Romanian Journal on Information Science and Technology, Special Issue on BalkaNet,<br />

Romanian Academy, 7(2-3) (2004b)



12. Bentivogli, L, Forner, P., Magnini, B., Pianta, E.: Revising WordNet Domains Hierarchy:<br />

Semantics, Coverage, and Balancing. In: Proceedings of COLING 2004 Workshop on<br />

"Multilingual Linguistic Resources", pp. 101–108. Geneva, Switzerland, August 28, 2004<br />

(2004)<br />

13. Niles, I., Pease, A. Towards a Standard Upper Ontology. In: Proceedings of the 2nd<br />

International Conference on Formal Ontology in Information Systems (FOIS-2001).<br />

Ogunquit, Maine, October 17–19, 2001 (2001)<br />

14. Niles, I. Pease, A.: Linking Lexicons and Ontologies: Mapping WordNet to the Suggested<br />

Upper Model Ontology. In: Proceedings of the 2003 International Conference on<br />

Information and Knowledge Engineering. Las Vegas, USA (2003)<br />

15. Esuli, A., Sebastiani, F.: SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In: Proceedings of LREC 2006, 22–28 May 2006. Genoa, Italy (2006)

16. Horák, A., Smrž, P.: New Features of Wordnet Editor VisDic. J. Romanian Journal of<br />

Information Science and Technology 7(2-3) (2004)<br />

17. Tufiş, D., Ion, R., Ide, N.: Fine-Grained Word Sense Disambiguation Based on Parallel<br />

Corpora, Word Alignment, Word Clustering and Aligned Wordnets. In: Proceedings of the<br />

20 th International Conference on Computational Linguistics, COLING2004, pp. 1312–1318.<br />

Geneva (2004d)<br />

18. Tufiş, D., Ion, R., Ceauşu, Al., Ştefănescu, D.: Combined Aligners. In: Proceeding of the<br />

ACL2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine<br />

Translation and Beyond”, pp. 107–110. Ann Arbor, Michigan, June, 2005 (2005)<br />

19. Ion, R.: Word Sense Disambiguation Methods Applied to English and Romanian. (in<br />

Romanian). PhD thesis. Romanian Academy, Bucharest (2007)<br />

20. Ion, R., Tufiş, D.: Meaning Affinity Models. In: Proceedings of the 4th International<br />

Workshop on Semantic Evaluations, SemEval-2007, p. 6. Prague, Czech Republic, June 23–<br />

24 2007, ACL 2007 (2007)<br />

21. Tufiş, D., Ion, R.: Cross lingual and cross cultural textual encoding of opinions and<br />

sentiments. Tutorial at Eurolan 2007: "Semantics, Opinion and Sentiment in Text" Iaşi, July<br />

23–August 3, 2007 (2007)<br />

22. Wilson, T., Hoffmann, P., Somasundaran, S., Kessler, J., Wiebe, J., Choi, Y., Cardie, C.,<br />

Riloff, E., Patwardhan, S.: OpinionFinder: A system for subjectivity analysis. In:<br />

Proceedings of HLT/EMNLP 2005 Demonstration Abstracts, pp. 34–35. Vancouver,<br />

October 2005 (2005)<br />

23. Fellbaum C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998)<br />

24. Magnini, B., Cavaglià, G.: Integrating Subject Field Codes into WordNet. In: Gavrilidou,<br />

M., Crayannis, G., Markantonatu, S., Piperidis, S., Stainhaouer, G. (eds.): Proceedings of<br />

LREC-2000, Second International Conference on Language Resources and Evaluation, pp.<br />

1413–1418. Athens, Greece, 31 May–2 June, 2000 (2000)<br />

25. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to<br />

WordNet: An On-Line Lexical Database. J. International Journal of Lexicography 3(4),<br />

235–244 (1990)<br />

26. Tufiş, D., Cristea, D.: Methodological issues in building the Romanian Wordnet and<br />

consistency checks in Balkanet. In: Proceedings of LREC2002 Workshop on Wordnet<br />

Structures and Standardisation, pp. 35–41. Las Palmas, Spain (2002)<br />

27. Tufiş, D., Ion, R., Barbu, E., Mititelu, V.: Cross-Lingual Validation of Wordnets. In:<br />

Proceedings of the 2 nd International Wordnet Conference, pp. 332–340. Brno (2004c)<br />

28. Tufiş, D., Mititelu, V., Bozianu, L., Mihăilă, C.: Romanian WordNet: New Developments<br />

and Applications. In: Proceedings of the 3 rd Conference of the Global WordNet Association,<br />

pp. 337–344. Seogwipo, Jeju, Republic of Korea, January 22–26, 2006, ISBN 80-210-3915-<br />

9 (2006)<br />

29. Tufiş, D., Barbu, E.: A Methodology and Associated Tools for Building Interlingual<br />

Wordnets. In: Proceedings of the 4 th LREC Conference, pp. 1067–1070. Lisbon (2004)


Enriching WordNet with Folk Knowledge<br />

and Stereotypes<br />

Tony Veale 1 and Yanfen Hao 1<br />

1<br />

School of Computer Science and Informatics, University College Dublin, Dublin, Ireland<br />

{Tony.Veale, Yanfen.Hao}@UCD.ie<br />

Abstract. The knowledge that is needed to understand everyday language is not<br />

necessarily the knowledge one finds in an encyclopedia or dictionary. Much of<br />

this is “folk” knowledge, based on stereotypes and culturally-inherited<br />

associations that do not hold in all situations, or which may, strictly speaking,<br />

be false. We can open a linguistic window onto this knowledge through simile,<br />

since explicit similes make use of highly evocative and inference-rich concepts<br />

to ground comparisons and make the unfamiliar seem familiar. This paper<br />

describes a means of enriching WordNet with commonly ascribed cultural<br />

properties by mining explicit similes of the form "as ADJ as a NOUN" from the<br />

internet. We also show how these properties can be leveraged, through further<br />

web search, into rich frame structures for the most evocative WordNet<br />

concepts.<br />

Keywords: simile, folk knowledge, frame representation.<br />

1 Introduction<br />

Many of the beliefs that one uses to reason about everyday entities and events are neither strictly true nor even logically consistent. Rather, people appear to rely on a large body of folk knowledge in the form of stereotypes, clichés and other prototype-centric structures (e.g., see [1]). These prototypes comprise the landmarks of our

conceptual space against which other, less familiar concepts can be compared and<br />

defined. For instance, people readily employ the animal concepts Snake, Bear, Bull,<br />

Wolf, Gorilla and Shark in everyday conversation without ever having had first-hand<br />

experience of these entities. Nonetheless, our culture equips us with enough folk<br />

knowledge of these highly evocative concepts to use them as dense short-hands for all<br />

manner of behaviours and property complexes. Snakes, for example, embody the<br />

notions of treachery, slipperiness, cunning and charm (as well as a host of other,<br />

related properties) in a single, visually-charged package. To compare someone to a<br />

snake is to suggest that many of these properties are present in that person, and thus,<br />

one would do well to treat that person as one would treat a real snake.<br />

Descriptors like “snake”, “shark” and “wolf” find a great deal of traction in<br />

everyday conversation because they are “dense descriptors” – they convey a great<br />

deal of useful information in a simple and concise way. The information imparted is<br />

open-ended, so that a listener may take meaning X from the description when it is<br />

initially used (e.g., that a given person is treacherous) and meaning X+Y (e.g., that


454 Tony Veale and Yanfen Hao<br />

this person is both treacherous and charming) in a later, more informed context. But the information imparted is rarely of the kind one finds in a dictionary or encyclopaedia, or in a resource like WordNet [2], because it neither contributes to the definition of the given concept nor is necessarily true of that concept. Insofar as WordNet is used to make sense of real texts by real, culturally-grounded speakers, it can be enriched considerably by the addition of such stereotypical knowledge. But where can this knowledge be found and exploited?

In “A Christmas Carol”, Dickens [3] notes that “the wisdom of our ancestors is in<br />

the simile; and my unhallowed hands shall not disturb it, or the Country’s done for”<br />

(chapter 1, page 1). In other words, folk knowledge is passed down through a culture<br />

via language, most often in specific linguistic forms. The simile, as noted by Dickens,<br />

is one common vehicle for folk wisdom, one that uses explicit syntactic means (unlike<br />

metaphor; see [4]) to mark out those concepts that are most useful as landmarks for<br />

linguistic description. Similes do not always convey truths that are universally true, or<br />

indeed, even literally true (e.g., bowling balls are not literally bald). Rather, similes<br />

hinge on properties that are possessed by prototypical or stereotypical members of a<br />

category (see [5]), even if most members of the category do not also possess them. As<br />

a source of knowledge, similes combine received wisdom, prejudice and oversimplifying<br />

idealism in equal measure. As such, similes reveal knowledge that is<br />

pragmatically useful but of a kind that one is unlikely to ever acquire from a<br />

dictionary (or, indeed, from WordNet). Although a simpler rhetorical device than<br />

metaphor, we have much to learn about language and its underlying conceptual<br />

structure by a comprehensive study of real similes in the wild (see [6]), not least about<br />

the recurring vehicle categories that signpost this space (see [7]).<br />

In this paper we describe a means through which we can enrich WordNet with stereotypical folk knowledge from similes that are mined from the text of the world-wide web. We describe the Google-based mining process in section 2, before

describing how the acquired knowledge is sense-linked to WordNet in section 3. In<br />

section 4 we describe on-going work to elaborate this property-rich knowledge into<br />

more complex frame-representations, before providing an empirical evaluation of the<br />

basic properties in section 5. The paper concludes with thoughts on future work in<br />

section 6.<br />

2 Acquiring Knowledge from Simile<br />

As in the study reported in [6], we employ the Google search engine as a retrieval<br />

mechanism for accessing relevant web content. However, the scale of the current<br />

exploration requires that retrieval of similes be fully automated, and this automation is<br />

facilitated both by the Google API and its support for the wildcard term *. In essence,<br />

we consider here only partial explicit similes conforming to the pattern “as ADJ as<br />

a|an NOUN”, in an attempt to collect all of the salient values of ADJ for a given value<br />

of NOUN. We do not expect to identify and retrieve all similes mentioned on the<br />

world-wide-web, but to gather a large, representative sample of the most commonly<br />

used.


Enriching WordNet with Folk Knowledge and Stereotypes 455<br />

To do this, we first extract a list of antonymous adjectives, such as “hot” or “cold”,<br />

from WordNet [2], the intuition being that explicit similes will tend to exploit<br />

properties that occupy an exemplary point on a scale. For every adjective ADJ on this<br />

list, we send the query “as ADJ as *” to Google and scan the first 200 snippets<br />

returned for different noun values for the wildcard *. From each set of snippets we<br />

can ascertain the relative frequencies of different noun values for ADJ. The complete<br />

set of nouns extracted in this way is then used to drive a second phase of the search.<br />

In this phase, the query “as * as a NOUN” is used to collect similes that may have<br />

lain beyond the 200-snippet horizon of the original search, or that hinge on adjectives<br />

not included on the original list. Together, both phases collect a wide-ranging series<br />

of core samples (of 200 hits each) from across the web, yielding a set of 74,704 simile<br />

instances (of 42,618 unique types) relating 3769 different adjectives to 9286 different<br />

nouns.<br />
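The snippet-scanning step of this harvest can be sketched as follows. This is a minimal illustration rather than the authors' actual system: the Google API used in the paper is long retired, so the sketch assumes the result snippets are already in hand, and the `extract_similes` helper and its single-word noun pattern are our own simplifications (the full process also handles compound nouns and the second, adjective-wildcard phase).

```python
import re
from collections import Counter

# Pattern for partial explicit similes "as ADJ as a|an NOUN".
# Single-word ADJ and NOUN only -- a simplification for illustration.
SIMILE_PAT = re.compile(r"\bas\s+(\w+)\s+as\s+an?\s+(\w+)", re.IGNORECASE)

def extract_similes(snippets):
    """Count (adjective, noun) simile pairs found in retrieved text snippets."""
    counts = Counter()
    for snippet in snippets:
        for adj, noun in SIMILE_PAT.findall(snippet):
            counts[(adj.lower(), noun.lower())] += 1
    return counts

# Toy snippets standing in for the 200 Google results per query.
snippets = [
    "He was as bald as a coot by thirty.",
    "The road ran as straight as an arrow.",
    "Still as bald as a coot, he never wore a hat.",
]
print(extract_similes(snippets))
# Counter({('bald', 'coot'): 2, ('straight', 'arrow'): 1})
```

Relative frequencies of noun values per adjective then fall directly out of the counter, which is what drives the second search phase.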

2.1 Simile Annotation<br />

Many of these similes are not sufficiently well-formed for our purposes. In some<br />

cases, the noun value forms part of a larger noun phrase: it may be the modifier of a<br />

compound noun (as in “bread lover”), or the head of complex noun phrase (such as<br />

“gang of thieves”). In the former case, the compound is used if it corresponds to a<br />

compound term in WordNet and thus constitutes a single lexical unit; if not, or in the<br />

latter case, the simile is rejected. Other similes are simply too contextual or underspecified<br />

to function well in a null context, so if one must read the original document<br />

to make sense of the simile, it is rejected. More surprisingly, perhaps, a substantial<br />

number of the retrieved similes are ironic, in which the literal meaning of the simile is<br />

contrary to the meaning dictated by common sense. For instance, “as hairy as a<br />

bowling ball” (found once) is an ironic way of saying “as hairless as a bowling ball”<br />

(also found just once). Many ironies can only be recognized using world (as opposed<br />

to word) knowledge, such as “as sober as a Kennedy” and “as tanned as an Irishman”.<br />

In addition, some similes hinge on a new, humorous sense of the adjective, as in “as<br />

fruitless as a butcher-shop” (since the latter contains no fruits) and “as pointless as a<br />

beach-ball” (since the latter has no points).<br />

Given the creativity involved in these constructions, one cannot imagine a reliable<br />

automatic filter to safely identify bona-fide similes. For this reason, the filtering task<br />

was performed by human judges, who annotated 30,991 of these simile instances (for<br />

12,259 unique adjective/noun pairings) as non-ironic and meaningful in a null<br />

context; these similes relate a set of 2635 adjectives to a set of 4061 different nouns.<br />

In addition, the judges also annotated 4685 simile instances (of 2798 types) as ironic;<br />

these similes relate 936 adjectives to a set of 1417 nouns. Perhaps surprisingly, ironic<br />

pairings account for over 13% of all annotated simile instances and over 20% of all<br />

annotated simile types.


456 Tony Veale and Yanfen Hao<br />

3 Establishing Links to WordNet<br />

It is important to know which sense of a noun is described by a simile if an accurate<br />

conceptual picture is to be constructed. For instance, “as stiff as a zombie” might refer<br />

either to a reanimated corpse or to an alcoholic cocktail (both are senses of “zombie”<br />

in WordNet, and drinks can be “stiff” too). Sense disambiguation is especially<br />

important if we hope to derive meaningful correlations from property co-occurrences;<br />

for instance, zombies are described in web similes as exemplars of not just stiffness,<br />

but of coldness, slowness and emotionlessness. If such co-occurrences are observed<br />

often enough, a cognitive agent might usefully infer a causal relationship among pairs<br />

of properties.<br />

Disambiguation is trivial for nouns with just a single sense in WordNet. For nouns<br />

with two or more fine-grained senses that are all taxonomically close, such as<br />

“gladiator” (two senses: a boxer and a combatant), we consider each sense to be a<br />

suitable target. In some cases, the WordNet gloss for a particular sense will actually<br />

mention the adjective of the simile, and so this sense is chosen. In all other cases, we<br />

employ a strategy of mutual disambiguation to relate the noun vehicle in each simile<br />

to a specific sense in WordNet. Two similes “as ADJ as NOUN_1” and “as ADJ as<br />

NOUN_2” are mutually disambiguating if NOUN_1 and NOUN_2 are synonyms in<br />

WordNet, or if some sense of NOUN_1 is a hypernym or hyponym of some sense of<br />

NOUN_2 in WordNet. For instance, the adjective “scary” is used to describe both the<br />

noun “rattler” and the noun “rattlesnake” in bona-fide (non-ironic) similes; since these<br />

nouns share a sense, we can assume that the intended sense of “rattler” is that of a<br />

dangerous snake rather than a child’s toy. Similarly, the adjective “brittle” is used to<br />

describe both saltines and crackers, suggesting that it is the bread sense of “cracker”<br />

rather than the hacker, firework or hillbilly senses (all in WordNet) that is intended.<br />
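This mutual-disambiguation heuristic can be sketched against a toy sense inventory. The inventory below is entirely invented for illustration (the real system consults WordNet's synsets and hypernym links directly, and the sense identifiers here only mimic WordNet's naming style):

```python
# Toy sense inventory standing in for WordNet: each sense has an id and a
# list of hypernym sense-ids. All entries here are illustrative inventions.
SENSES = {
    "rattler": [
        {"id": "rattlesnake.n.01", "hypernyms": ["snake.n.01"]},
        {"id": "rattle.n.04", "hypernyms": ["toy.n.01"]},   # the child's toy
    ],
    "rattlesnake": [
        {"id": "rattlesnake.n.01", "hypernyms": ["snake.n.01"]},
    ],
}

def mutually_disambiguate(noun1, noun2):
    """Return the sense-id of noun1 that shares a synset with, or is a
    hypernym/hyponym of, some sense of noun2 (the section 3 heuristic)."""
    for s1 in SENSES.get(noun1, []):
        for s2 in SENSES.get(noun2, []):
            if s1["id"] == s2["id"]:            # shared synset: synonyms
                return s1["id"]
            if s1["id"] in s2["hypernyms"] or s2["id"] in s1["hypernyms"]:
                return s1["id"]
    return None

print(mutually_disambiguate("rattler", "rattlesnake"))  # rattlesnake.n.01
```

Because “scary rattler” and “scary rattlesnake” both occur in bona-fide similes, the shared sense wins over the toy sense, exactly as described above.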

These heuristics allow us to automatically disambiguate 10,378 bona-fide simile<br />

types (85% of those annotated), yielding a mapping of 2124 adjectives to 3778<br />

different WordNet senses. Likewise, 77% (or 2164) of the simile types annotated as<br />

ironic are disambiguated automatically. A remarkable stability is observed in the<br />

alignment of noun vehicles to WordNet senses: 100% of the ironic vehicles always<br />

denote the same sense, no matter the adjective involved, while 96% of bona-fide<br />

vehicles always denote the same sense. This stability suggests two conclusions: the<br />

disambiguation process is consistent and accurate; but more intriguingly, only one<br />

coarse-grained sense of any word is likely to be sufficiently exemplary of some<br />

property to be useful as a simile vehicle.<br />

4 Acquiring Frame Representations<br />

Each bona-fide simile contributes a different salient property to the representation of a<br />

vehicle concept. In our data, one half (49%) of all bona-fide vehicle nouns occur in<br />

two or more similes, while one third occur in three or more and one fifth occur in four<br />

or more. The most frequently used figurative vehicles can have many more;<br />

“snowflake”, for instance, is ascribed over 30 properties in our database, including: white, pure,



fresh, beautiful, natural, intricate, delicate, identifiable, fragile, light, dainty, frail,<br />

weak, sweet, precious, quiet, cold, soft, clean, detailed, fleeting, unique, singular,<br />

distinctive and lacy.<br />

Because the same adjectival properties are associated with multiple vehicles, the<br />

resulting property graph allows different vehicles to be perceived as similar by virtue<br />

of these shared properties. For instance, Ninja and Mime are deemed similar by virtue<br />

of the shared property silent, while Artist and Surgeon are similar by virtue of the<br />

properties skilled, sensitive and delicate. Nonetheless, it can be claimed that the property<br />

level is simply too shallow to allow for nuanced similarity judgements. For instance,<br />

are ninjas and mimes silent in the same way? Both surgeons and bloodhounds are<br />

prototypes of sensitivity, but the former has sensitive hands while the latter has a<br />

sensitive nose. To put these properties in context, we need to know the specific facet<br />

of each concept that is modified, so that sensible comparisons can be made. In effect,<br />

we need to move from a simple property-ascription representation to a richer,<br />

frame:slot:filler representation. In such a scheme, the property sensitive is a typical<br />

filler for the hands slot of Surgeon and the nose slot of Bloodhound, thereby<br />

disallowing any mis-matched comparisons.<br />
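The contrast between flat property overlap and slot-aware frame matching can be made concrete with the Surgeon/Bloodhound example. The toy frames below are our own, loosely based on the properties named in the text:

```python
# Toy frames: facet -> set of ascribed properties. Invented for illustration.
surgeon    = {"hands": {"sensitive", "skilled"}, "eye": {"keen"}}
bloodhound = {"nose": {"sensitive"}, "gait": {"loping"}}

def flat_properties(frame):
    """Collapse a frame to its bag of properties (the shallow view)."""
    return set().union(*frame.values())

def frame_overlap(f1, f2):
    """Slot-aware comparison: a property only matches on the same facet."""
    return {(slot, p) for slot in f1.keys() & f2.keys()
                      for p in f1[slot] & f2[slot]}

print(flat_properties(surgeon) & flat_properties(bloodhound))  # {'sensitive'}
print(frame_overlap(surgeon, bloodhound))                      # set()
```

The flat view deems surgeons and bloodhounds similar via the shared property sensitive, while the slot-aware view correctly blocks the comparison, since one is sensitive of hand and the other of nose.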

This process of frame construction can also be largely automated via targeted web search.<br />

For every bona-fide simile-type “as ADJ as a Noun_vehicle” (all 10,378 of<br />

them that have been WordNet-linked in section 3), we automatically generate the<br />

web-query “the ADJ * of a Noun_vehicle” and harvest the top 200 results from<br />

Google. From these snippets, we then extract all noun values of the wildcard *. In<br />

many cases, these noun values are precisely the conceptual facets we desire for a<br />

culturally-accurate and nuanced representation, ranging from hands for Surgeon to<br />

roar for Lion to eye for Hawk. The frequency of these values also allows us to create<br />

a textured representation for each concept, so that e.g., both hands and eye are notable<br />

facets for surgeon, but the latter is higher ranked. However, this web-pattern also<br />

yields a non-trivial amount of noise: while “the proud strut of a peacock” is very<br />

revealing about the concept Peacock, the snippet “the proud owner of a peacock” is<br />

not. Quite simply, we seek to fill intrinsic facets of a concept like hands, eye, gait<br />

and strut that contribute to the folk definition of the concept, while ignoring extrinsic<br />

and contingent facets such as owner, husband, brother and so on.<br />
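Facet harvesting over the “the ADJ * of a NOUN” pattern can be sketched as below, again assuming the snippets are already retrieved. The `EXTRINSIC` stop list stands in for the human editing pass described next; all names and snippets are our own illustrations:

```python
import re
from collections import Counter

# Pattern "the ADJ FACET of a|an NOUN"; single-word slots for simplicity.
FACET_PAT = re.compile(r"\bthe\s+(\w+)\s+(\w+)\s+of\s+an?\s+(\w+)", re.IGNORECASE)

# Stand-in for the hand-edited map that removes extrinsic, contingent facets.
EXTRINSIC = {"owner", "husband", "brother", "father", "wife"}

def harvest_facets(snippets, adj, noun):
    """Count intrinsic facet nouns F in snippets matching 'the ADJ F of a NOUN'."""
    facets = Counter()
    for snippet in snippets:
        for a, facet, n in FACET_PAT.findall(snippet):
            if a.lower() == adj and n.lower() == noun \
                    and facet.lower() not in EXTRINSIC:
                facets[facet.lower()] += 1
    return facets

snippets = [
    "Everyone admired the proud strut of a peacock in full display.",
    "He was the proud owner of a peacock.",           # extrinsic: filtered
    "Again the proud strut of a peacock crossed the lawn.",
]
print(harvest_facets(snippets, "proud", "peacock"))   # Counter({'strut': 2})
```

The facet frequencies then provide the ranking that gives each frame its texture.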

One can look to specific abstractions in WordNet – such as {trait} – to serve as a<br />

filter on the facet-nouns that are extracted, but such a simple filter would be unduly<br />

coarse. Instead, we consider all facet-nouns, but generalize the WordNet vehicle senses<br />

to which they are attached, to create a high-level mapping of vehicle types<br />

(such as Person, Animal, Implement, Substance, etc.) to facets (such as hands, eye,<br />

sparkle, father, etc.). This high-level (and considerably more compressed) map is then<br />

human-edited, to remove any facets that are unrevealing or simply inappropriate for the<br />

WordNet vehicle type. In this editing process (which requires about one man-day),<br />

contingent facets such as father, wife, etc. are quickly identified and removed.



peacock<br />

Has_feather: brilliant<br />

Has_plumage: extravagant<br />

Has_strut: proud<br />

Has_tail: elegant<br />

Has_display: colorful<br />

Has_manner: stately<br />

Has_appearance: beautiful<br />

lion<br />

Has_eyes: fierce<br />

Has_teeth: ferocious<br />

Has_gait: majestic<br />

Has_strength: magnificent<br />

Has_roar: threatening<br />

Has_soul: noble<br />

Has_heart: courageous<br />

Fig. 1. The acquired Frame:slot:filler representations for Peacock and Lion.<br />

As can be seen in the examples of Lion and Peacock in Figure 1, the slot:filler<br />

pairs that are acquired for each concept do indeed reflect the most relevant cultural<br />

associations for these concepts. Moreover, there is a great deal of anthropomorphic<br />

rationalization of an almost poetic nature about these representations, of the kind that<br />

is instantly recognizable to native speakers of a language but which one would be<br />

hard pressed to find in a conventional dictionary (except insofar as some lexical<br />

concepts may give rise to additional word senses, such as “peacock” for a proud and<br />

flashily dressed person).<br />

Overall, frame representations of this kind are acquired for 2218 different WordNet<br />

noun senses, yielding a combined total of 16,960 slot:filler pairings (or an average of<br />

8 slot:filler pairs per frame). As the examples of Figure 1 demonstrate, these frames<br />

provide a level of representational finesse that greatly enriches the basic property<br />

descriptions yielded by similes alone. To answer an earlier question then, mimes and<br />

ninjas are now similar by virtue of each possessing the slot:filler Has_art: silent. But<br />

as this and other examples suggest, the introduction of finely discriminating frame<br />

structures can decrease a system’s ability to recognize similarity, if comparable slots<br />

or fillers are given different names. In Figure 1, for instance, a human can easily<br />

recognize that Has_strut:proud and Has_gait:majestic are similar properties, but to a<br />

computer they can appear to be very different ideas. WordNet can play a significant role in<br />

reconciling these superficial differences in structure (e.g., by recognizing the obvious<br />

relationship between strut and gait), while corpus-based co-occurrence models can<br />

reveal the comparable nature of proud and majestic. This work, however, is outside<br />

the scope of the current paper and is the subject of future development and research.<br />

5 Empirical Evaluation<br />

If similes are indeed a good place to mine the most salient properties of WordNet’s<br />

lexical concepts, we should expect the set of properties for each concept to accurately<br />

predict how that concept is perceived as a whole. For instance, humans – unlike



computers – do not generally adopt a dispassionate view of ideas, but rather tend to<br />

associate certain positive or negative feelings, or affective values, with particular<br />

ideas. Unsavoury activities, people and substances generally possess a negative affect,<br />

while pleasant activities and people possess a positive affect. Whissell [8] uses<br />

human-assigned ratings to reduce the notion of affect to a single numeric dimension,<br />

to produce a dictionary of affect that associates a numeric value in the range 1.0 (most<br />

unpleasant) to 3.0 (most pleasant) with over 8000 words across a range of syntactic<br />

categories (including adjectives, verbs and nouns). So to the extent that the adjectival<br />

properties yielded by processing similes paint an accurate picture of each noun<br />

vehicle, we should be able to predict the affective rating of each vehicle via a<br />

weighted average of the affective ratings of the adjectival properties ascribed to these<br />

vehicles (i.e., where the affect of each adjective contributes to the estimated affect of<br />

a noun in proportion to its frequency of co-occurrence with that noun in our web-derived<br />

simile data). More specifically, we should expect ratings estimated via these<br />

simile-derived properties to exhibit a strong correlation with the independent ratings<br />

of Whissell’s dictionary.<br />
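The frequency-weighted affect prediction can be sketched as follows. The adjective frequencies and the 1.0-3.0 ratings below are invented for illustration; they are not Whissell's actual values:

```python
def predicted_affect(adjective_counts, affect):
    """Frequency-weighted mean of adjective affect ratings on Whissell's
    1.0 (most unpleasant) to 3.0 (most pleasant) scale."""
    total = weighted = 0.0
    for adj, freq in adjective_counts.items():
        if adj in affect:          # skip adjectives absent from the dictionary
            weighted += affect[adj] * freq
            total += freq
    return weighted / total if total else None

# Invented example: simile frequencies for "snowflake" with made-up ratings.
snowflake = {"pure": 3, "delicate": 2, "cold": 1}
ratings = {"pure": 2.8, "delicate": 2.4, "cold": 1.6}
print(round(predicted_affect(snowflake, ratings), 2))  # 2.47
```

Correlating such predictions against the dictionary's independent noun ratings is then a standard Pearson computation over all vehicle nouns.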

To determine whether similes do offer the clearest perspective on a concept’s most<br />

salient properties, we calculate and compare this correlation using the following data<br />

sets:<br />

a) Adjectives derived from annotated bona-fide (non-ironic) similes of section 2.1.<br />

b) Adjectives derived from all annotated similes (both ironic and non-ironic).<br />

c) Adjectives derived from ironic similes only.<br />

d) All adjectives used to modify the given vehicle noun in a large corpus. We use<br />

over 2 gigabytes of text from the online encyclopaedia Wikipedia as our corpus.<br />

e) All adjectives used to describe the given vehicle noun in any of the WordNet text<br />

glosses for that noun. For instance, WordNet defines Espresso as “strong black<br />

coffee made …” so this gloss yields the properties strong and black for Espresso.<br />

Predictions of affective rating were made from each of these data sources and then<br />

correlated with the ratings reported in Whissell’s dictionary of affect using a two-tailed<br />

Pearson test (p < 0.01). As expected, property sets derived from bona-fide<br />

similes only (A) yielded the best correlation (+0.514) while properties derived from<br />

ironic similes only (C) yielded the worst (-0.243); a middling correlation coefficient<br />

of 0.347 was found for all similes together, reflecting the fact that bona-fide<br />

similes outnumber ironic similes by a ratio of 4 to 1. A weaker correlation of 0.15 was<br />

found using the corpus-derived adjectival modifiers for each noun (D); while this data<br />

provides far richer property sets for each noun vehicle (e.g., far richer than those<br />

offered by the simile database), these properties merely reflect potential rather than<br />

intrinsic properties of each noun and so do not reveal what is most salient about a<br />

vehicle concept. More surprisingly, perhaps, property sets derived from WordNet<br />

glosses (E) are also poorly predictive, yielding a correlation with Whissell’s affect<br />

ratings of just 0.278.



While it is true that the WordNet-derived properties in (E) are not sense-specific,<br />

so that properties from all senses of a noun are conflated into a single property set for<br />

that noun, this should not have dramatic effects on predictions of affective rating.<br />

Instead, if one sense of a word acquires a negative connotation, then following what is<br />

often called “Gresham’s law of language” [9], the “bad meanings should drive out the<br />

good” so that the word as a whole becomes tainted. Rather, it may be that the<br />

adjectival properties used to form noun definitions in WordNet are simply not the<br />

most salient properties of those nouns. To test this hypothesis, we conducted a second<br />

experiment wherein we automatically generated similes for each of the 63,935 unique<br />

adjective-noun associations extracted from WordNet glosses, e.g., “as strong as<br />

espresso”, “as Swiss as Emmenthal” and “as lively as a Tarantella”, and counted how<br />

many of these manufactured similes can be found on the web, again using Google’s<br />

API.<br />

We find that only 3.6% of these artificial similes have attested uses on the web.<br />

From this meagre result we can conclude that: a) few nouns are considered<br />

sufficiently exemplary of some property to serve as a meaningful vehicle in a figure<br />

of speech; b) the properties used to describe concepts in the glosses of general<br />

purpose resources like WordNet are not always the properties that best reflect how<br />

humans actually think about, and use, these concepts. Of course, the truth is most<br />

likely to lie somewhere between these two alternatives. The space of potential similes<br />

is doubtless much larger than that currently found on the web, and many of the<br />

similes generated from WordNet are probably quite meaningful and apt. However,<br />

even WordNet-based similes that can be found on the web are of a different character<br />

to those that populate our database of annotated web-similes, and only 9% of the web-attested<br />

WordNet similes (or 0.32% overall) also reside in this database. Thus, most<br />

(> 90%) of the web-attested WordNet similes must lie outside the 200-hit horizon of<br />

the acquisition process described in section 2, and so are less frequent (or used in less<br />

authoritative pages) than our acquired similes.<br />

6 Conclusion<br />

In this paper we have presented an approach to enriching WordNet with the cultural<br />

associations that pervade our everyday use of language yet which one rarely finds in<br />

authoritative linguistic resources like dictionaries and encyclopaedias. Our means of<br />

acquiring these associations – via explicit similes that are mined from the internet –<br />

has several important consequences for our enrichment scheme. First, we acquire<br />

associations that are neither necessarily true nor necessarily consistent with each other,<br />

but which people happily assume to be true and consistent for purposes of habitual<br />

reasoning. Second, a large-scale mining effort allows us to identify the most<br />

frequently used vehicles of comparison, and thus, the landmarks of our shared<br />

conceptual space that are most deserving of enrichment in WordNet. Third, we<br />

identify the most salient properties of these landmarks, also frequency weighted, as<br />

well as the most notable conceptual facets of these landmarks. Interestingly, these<br />

combinations of facets and properties (i.e., slot:filler pairings) have a poetic quality<br />

that can, in future work, be exploited in the automatic natural-language generation of



creative descriptions.<br />

Despite these benefits, our continued reference to the notion of “culture” may seem<br />

misplaced given our focus on English-language similes and an English-language<br />

WordNet. Nonetheless, we see this work as a platform from which to explore the<br />

cultural diversity of ontological categorizations, and to this end, we are currently<br />

planning to replicate this approach for Chinese and Korean. In the case of Chinese,<br />

we intend the enrichment process to apply to the Chinese-English lexical ontology of<br />

HowNet [10]. To see how similes reflect different biases in different cultures,<br />

consider that of the 12,259 unique adjective/noun pairings judged as bona-fide (non-ironic)<br />

in section 2.1., only 2,440 (or 20%) have a Chinese translation that can also be<br />

found on the web (where translation is performed using the bilingual HowNet). The<br />

replication rate for the ironic similes of section 2.1. is even lower, at 5%, reflecting<br />

the fact that ironic comparisons are more creatively ad-hoc and less culturally<br />

entrenched than non-ironic similes. We can thus expect that the mining of Chinese<br />

texts on the web will yield a set of similes – and thus conceptual descriptions (both<br />

properties and frames) – that substantially differs from the English-language set<br />

described here, to enrich HowNet in an altogether different, culturally-specific way.<br />

References<br />

1. Lakoff, G.: Women, fire and dangerous things. Chicago University Press (1987)<br />

2. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge,<br />

MA (1998)<br />

3. Dickens, C.: A Christmas Carol. Puffin Books, Middlesex, UK (1843/1984)<br />

4. Hanks, P.: The syntagmatics of metaphor. International Journal of Lexicography 17(3) (2004)<br />

5. Ortony, A.: Beyond literal similarity. Psychological Review 86, 161–180 (1979)<br />

6. Roncero, C., Kennedy, J. M., Smyth, R.: Similes on the internet have explanations.<br />

Psychonomic Bulletin and Review 13(1), 74–77 (2006)<br />

7. Veale, T., Hao, Y.: Making Lexical Ontologies Functional and Context-Sensitive. In:<br />

Proceedings of ACL 2007, the 45th Annual Meeting of the Association for Computational<br />

Linguistics, pp. 57–64. Prague, Czech Republic (2007)<br />

8. Whissell, C.: The dictionary of affect in language. In: Plutchik, R., Kellerman, H. (eds.)<br />

Emotion: Theory and research, pp. 113–131. Harcourt Brace, New York (1989)<br />

9. Rawson, H.: A Dictionary of Euphemisms and Other Doublespeak. Crown Publishers, New<br />

York (1995)<br />

10. Dong, Z., Dong, Q.: HowNet and the Computation of Meaning. World Scientific,<br />

Singapore (2006)


Comparing WordNet Relations to Lexical Functions<br />

Veronika Vincze, Attila Almási, and Dóra Szauter<br />

Hungarian Academy of Sciences (MTA) and University of Szeged (SZTE),<br />

Research Group on Artificial Intelligence,<br />

H-6720 Szeged, Aradi vértanúk tere 1., Hungary.<br />

vinczev@inf.u-szeged.hu, vizipal@gmail.com, szauter.dora@freemail.hu<br />

Abstract. In this paper, basic relations of WordNet and EuroWordNet are<br />

revisited and reconsidered from the viewpoint of lexical functions, that is,<br />

formalized semantic relations. Definitions of lexical functions and those of<br />

WordNet relations are contrasted and analyzed. The relation near_antonym is<br />

found to cover two different semantic relations. Thus, it is suggested that this<br />

relation should be divided into two new relations: conversive and antonym. The<br />

coding of derivational morphology can also be improved by introducing new<br />

relations that encode not only morphological but semantic derivations as well.<br />

Finally, some new semantic relations based on lexical functions are also<br />

proposed.<br />

Keywords: semantic relations, lexical functions, synonymy, antonymy,<br />

holonymy, derivation<br />

1 Introduction<br />

WordNet is a lexical database in which words and lexical units are organized in terms<br />

of their meanings and these clusters of words are linked to each other through<br />

different semantic and lexical relations. Among the many possible lexical and<br />

semantic relations, it is synonymy, hypernymy and antonymy that are of special<br />

importance in the construction of WordNet; however, other relations are also encoded<br />

in the database. In this paper, basic relations of WordNet are revisited and<br />

reconsidered from the viewpoint of lexical functions, that is, formalized semantic<br />

relations [1]. We will contrast the definitions of lexical functions and those of<br />

WordNet relations and we will show how this comparison can be fruitfully applied in<br />

the lexicographical practice of WordNet. We will pay special attention to antonymy;<br />

however, other relations such as synonymy, holonymy, hypernymy and different<br />

derivations are also analysed in detail. Finally, we will give some hints for some new<br />

relations that are listed among lexical functions but are not yet applied in WordNet.



2 Relations between Words in WordNet<br />

Dictionaries are usually structured on the basis of word forms: words are<br />

alphabetically listed in the dictionary, and their meanings are given one after the<br />

other. However, the most innovative aspect of WordNet is that lexical information is<br />

organized in terms of meaning, that is, a synset (the basic unit of WordNet) contains<br />

words which have approximately the same meaning. Thus, it is synonymy that<br />

functions as the essential principle in the construction of WordNet [2].<br />

There are two types of relations among words in WordNet. On the one hand,<br />

semantic relations can be found between concepts; in other words, it is not the form of<br />

the word that counts but its meaning. Such relations include hyponymy and<br />

meronymy. On the other hand, lexical relations are related to word forms, for<br />

instance, synonymy, antonymy and different morphological relations belong to this<br />

group [2]. Thus, the inner structure of WordNet is based on a specific lexical relation,<br />

namely, synonymy.<br />

3 Lexical Functions<br />

The theory of lexical functions was born within the framework of Meaning-Text<br />

Theory (the model is described in detail in e.g. [1, 3, 4, 5, 6, 7, 8, 9, 10]). The most<br />

important theoretical innovation of this model is the theory of lexical functions, which<br />

is universal: with the help of lexical functions, all relations between lexemes of a<br />

given language can be described – a lexeme is a word in one of its meanings. This<br />

theory has been thoroughly applied to different languages such as Russian [7], French<br />

[8], English [9, 10] or German [9, 10] and, occasionally to other languages such as<br />

Hungarian [11, 12].<br />

Lexical functions have the form f(x) = y, where f is the lexical function itself, x<br />

stands for the argument of the function and y is the value of the function. The<br />

argument of the lexical function is a lexeme, while its value is another lexeme or a set<br />

of lexemes. A given lexical function always expresses the same semanto-syntactic<br />

relation, that is, the relation between an argument and the value of the lexical function<br />

is the same as the relation between another argument and value of the same lexical<br />

function. Thus, lexical functions express semantic relations between lexemes [1].<br />

4 Relations Used in WordNet Compared to Lexical Functions<br />

In the following, lexical functions and WordNet relations are contrasted. First,<br />

semantic relations such as hypernymy and meronymy are discussed, then lexical and<br />

semanto-lexical relations (synonymy and antonymy) are analysed. Finally, derivations<br />

are also presented.



4.1 Hypernymy and Hyponymy<br />

In Meaning-Text Theory, it is the lexical function Gener (generic) that expresses<br />

hypernymic relations between words [1]. Some illustrative examples:<br />

Gener(gas) = substance<br />

Gener(wardrobe) = furniture<br />

This relation completely corresponds to hypernym used in WordNet since both<br />

relations give a more generic term for the word. Thus, it is possible to encode<br />

taxonomic relations (such as hypernymy) with the help of lexical functions [13]. The<br />

abovementioned examples are present in WordNet, too:<br />

{substance:1, matter:1} is hypernym of {fluid:2}, which is hypernym of {gas:2}<br />

{furniture:1, piece of furniture:1, article of furniture:1} is hypernym of<br />

{wardrobe:1, closet:3, press:6}<br />

That is, in the first case, the application of lexical functions provided a hypernym<br />

situated one level higher than the one present in WordNet, but it is still true that it is a<br />

hypernym of the original word. In the second case, however, the two versions are in<br />

complete accordance: they provide the same (direct) hypernym for the same word.<br />

Hyponymy can be captured by the lexical function Spec (specific) (proposed in<br />

[14]). By definition, Spec is the inverse function of Gener, thus, it yields less general,<br />

i.e. more specific terms (that is, hyponyms) for the word:<br />

Spec(furniture) = wardrobe, table, chair, desk, bed etc.<br />

In WordNet, the synset {furniture:1, piece of furniture:1, article of furniture:1} has<br />

no fewer than 24 hyponyms including {table:3}, {chest of drawers:1, chest:3,<br />

bureau:2, dresser:1} and {bed:1}. Thus, similarly to the case of hypernymy, we can<br />

state that the lexical function Spec and the WordNet relation hyponym are equivalent.<br />
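The equivalence between Gener/Spec and WordNet's hypernym/hyponym relations can be sketched with a toy taxonomy. The hand-built table below is our own illustration of the furniture example; a real implementation would read WordNet's hypernym pointers instead:

```python
# Toy taxonomy standing in for WordNet's direct hypernym links.
HYPERNYM = {"wardrobe": "furniture", "table": "furniture",
            "bed": "furniture", "gas": "fluid", "fluid": "substance"}

def gener(lexeme):
    """Gener(x): the direct hypernym, e.g. Gener(wardrobe) = furniture."""
    return HYPERNYM.get(lexeme)

def spec(lexeme):
    """Spec(x): the inverse of Gener, i.e. the direct hyponyms of x."""
    return sorted(w for w, h in HYPERNYM.items() if h == lexeme)

print(gener("wardrobe"))   # furniture
print(spec("furniture"))   # ['bed', 'table', 'wardrobe']
```

Note that Gener may also land one level higher than WordNet's direct hypernym, as in the gas/fluid/substance case discussed above.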

4.2 Meronymy and Holonymy<br />

In WordNet, holonymy is encoded by three different relations, and in EuroWordNet<br />

there are two more relations besides those. First, holo_part indicates that a thing is a<br />

component part of another thing, second, holo_member indicates that a thing or<br />

person is a member of a group, and third, holo_portion refers to the stuff that a thing<br />

is made from [15]; however, this relation links a whole and a portion of the whole in<br />

EuroWordNet [16]. Fourth, holo_madeof encodes the stuff a thing is made from in<br />

EuroWordNet, and fifth, holo_location indicates a thing that is situated within another<br />

place [16].<br />

As for the first relation, its inverse is mero_part, which corresponds to the<br />

lexical function Part proposed in [17]:<br />

Part(label) = bar code<br />


Comparing WordNet Relations to Lexical Functions 465<br />

The lexical function Mult (collective) yields either the group which the word is a<br />

member of or a bigger quantity of the entity referred to by the word [1]:<br />

Mult(ship) = fleet<br />

Mult(sheep) = flock<br />

The inverse function of Mult is Sing, that is, it expresses a member of a group or a<br />

minimal unit of a thing:<br />

Sing(fleet) = ship<br />

Sing(crew) = seaman<br />

Sing(bread) = slice<br />

In WordNet, these relations are shown in the following way:<br />

{fleet:3} is holo_member of {ship:1}<br />

{sheep:1} is mero_member of {flock:5}<br />

In EuroWordNet, the last one is encoded by the relation holo_portion:<br />

{bread:1} is holo_portion of {piece:8, slice:2}<br />

Thus, the functions Mult and Sing overlap with the WordNet relations<br />

holo_member and mero_member, respectively. Sing can also encode mero_portion in<br />

EuroWordNet. However, other relations such as holo_portion in WordNet and<br />

holo_madeof and holo_location in EuroWordNet are not (yet) encoded in the system<br />

of lexical functions.<br />
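The overlap between Mult/Sing and the member-holonymy relations can be sketched in the same way as for Gener/Spec. The membership table below is invented for illustration:

```python
# Illustrative sketch (invented data): Mult and Sing as inverse lexical
# functions, mirrored by the holo_member / mero_member relation pair.

MEMBER_OF = {   # mero_member read as member -> group
    "ship": "fleet",
    "sheep": "flock",
    "seaman": "crew",
}

def mult(word):
    """Mult: the collective that the word is a member of, if any."""
    return MEMBER_OF.get(word)

def sing(word):
    """Sing, the inverse of Mult: a single member of the collective."""
    for member, group in MEMBER_OF.items():
        if group == word:
            return member
    return None

print(mult("ship"))    # fleet
print(sing("fleet"))   # ship
```

As the text notes, relations like holo_portion, holo_madeof, and holo_location would need additional lexical functions and are deliberately absent from this sketch.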

4.3 Synonymy<br />

The most basic relation in WordNet is synonymy. Among lexical functions, it is Syn<br />

that expresses this relation. In the case of the lexical function Syn, emphasis is put on<br />

quasi-synonyms, that is, the value of this function can be – besides total synonyms – a<br />

partial (or quasi-)synonym as well. For instance:<br />

Syn(bicycle) = bike, cycle, wheel<br />

The relation encoded by this lexical function appears in two forms in WordNet,<br />

and there is an additional form in EuroWordNet [16]. First, synonymy is the<br />

organizing principle behind the structure of WordNet, thus, between the literals of a<br />

synset the relation of synonymy holds by definition (that is, without any explicit<br />

reference to this relation). The above-mentioned example is shown in WordNet in the<br />

following format:<br />

{bicycle:1, bike:2, wheel:6, cycle:6}


466 Veronika Vincze, Attila Almási, and Dóra Szauter<br />

Second, in the case of adjectives, the relation similar_to expresses that the meanings<br />
of two adjectives are similar though they do not belong to the same synset [17]. An<br />

example:<br />

{heavy:2} is similar_to {harsh:4}<br />

{heavy:2} is similar_to {thick:8}<br />

Third, EuroWordNet makes use of the relation near_synonym, which stands<br />

between synsets whose meanings are similar but not similar enough to be included in<br />

the same synset [16], for instance:<br />

{device:1} is near_synonym of {tool:1}<br />

4.4 Antonymy<br />

Antonymy proved to be hard to define since “[t]he antonym of a word x is sometimes<br />

not-x but not always” [2]. Conceptually, people can easily find the antonym of a<br />

word: they usually give the antonym of a word as a response in word association tests.<br />

However, words with similar meanings (especially adjectives and adverbs) do not<br />

always have the same antonym. For instance, heavy and light are considered to be<br />
antonyms, and weighty and weightless are also antonyms; nevertheless, heavy and<br />
weightless are hardly seen as antonyms [18].<br />

Thus, antonymy seems to behave in two different ways: on the one hand, it is a<br />

semantic relation between word meanings, and, on the other hand, it is a lexical<br />

relation between word forms, since the antonym of an adjective or an adverb is mostly<br />

produced morphologically (with the addition of a negative prefix). The organization<br />

of adverbs and adjectives in WordNet reflects this ambiguity: there is an antonymy<br />

relation stated only between “real” antonyms (that is, conceptual opposites that are<br />

lexical pairs) by means of the relation near_antonym 1 , and antonymy between<br />

indirect antonyms is expressed in WordNet only through inheritance [18].<br />

In the theory of lexical functions, however, conceptual antonyms are not<br />

distinguished from lexical antonyms, that is, antonymy is considered to be a semantic<br />

relation rather than a lexical one. Antonymy (that is, the lexical function Anti) is<br />

defined in the following way [1]: "la lexie L 1 est un antonyme de la lexie L 2 si et<br />

seulement si leurs signifiés sont identiques sauf que la négation se trouve « au sein »<br />

d’un des deux signifiés ” [lexeme L 1 is an antonym of lexeme L 2 if and only if their<br />

meanings are identical except that negation is present « within » one of the meanings<br />

– translation is ours]. Some examples are given here:<br />

1 In the case of adjectives of Italian WordNet, antonymy is further divided into two<br />

subrelations: complementary_antonymy (if a word holds, then its opposite is excluded, e.g.<br />

dead and alive) and gradable_antonymy (words referring to gradable properties, e.g. big and<br />

small). Besides, the underspecified relation antonymy also survives for cases when the nature<br />

of the opposition is unclear [19]. In the present discussion, however, only the general relation<br />

antonymy is examined thoroughly.<br />

Anti(despair) = hope<br />

Anti(construct) = destroy<br />

Anti(respect) = disrespect<br />

There is another lexical function, Conv (conversive), that expresses a relation<br />

similar to antonymy: the semantic content of the argument and the value of Conv are<br />

identical; however, the actants, that is, the participants of the situation described, are<br />

reversed, which is indicated by index numbers [1] such as in the following examples:<br />

Conv 21 (frighten) = fear (Death frightens me vs. I fear death)<br />

Conv 3214 (buy) = sell (John bought a pair of shoes from Mary for $25 vs. Mary sold<br />

a pair of shoes to John for $25)<br />

It is important to emphasize that the lexical functions Anti and Conv differ from<br />

each other: Conv often yields a lexeme which seems to be a quasi-antonym of the<br />

original word, however, it is not the case since the antonym of the word is provided<br />

by the application of Anti. The following examples nicely illustrate the difference<br />

between Anti and Conv:<br />

Conv 31 (send) = receive (Peter sent a letter to John vs. John received a letter from<br />

Peter)<br />

Anti(send) = intercept (to cause that the letter does not arrive)<br />

Conv 21 (equal) = equal (1000 metres are equal to 1 kilometre vs. 1 kilometre is<br />

equal to 1000 metres)<br />

Anti(equal) = unequal<br />

It is quite clear from the examples above that the semantic content of the two<br />

lexical functions is different, since their application to the same word provides different<br />

values. This is especially striking in the second case: equal is its own conversive,<br />

however, it cannot be its own antonym.<br />
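The defining property of Conv, that the situation stays the same while the actants are permuted, can be sketched directly. The frame data and function names below are our own illustration of the index notation, not an implementation of Meaning-Text Theory:

```python
# Sketch: Conv as a permutation of actants. The index in Conv21 or
# Conv3214 says which actant of the original word fills each slot of
# the converse word. Data is invented for illustration.

CONV = {
    # word: (converse word, actant permutation)
    "frighten": ("fear", (2, 1)),       # Conv21
    "buy": ("sell", (3, 2, 1, 4)),      # Conv3214
}

def conv(word, actants):
    """Apply Conv: switch to the converse lexeme, permuting its actants."""
    converse, perm = CONV[word]
    return converse, tuple(actants[i - 1] for i in perm)

# "Death frightens me" -> "I fear death"
print(conv("frighten", ("death", "me")))
# "John bought shoes from Mary for $25" -> "Mary sold shoes to John for $25"
print(conv("buy", ("John", "shoes", "Mary", "$25")))
```

The semantic content is untouched; only the mapping of participants to syntactic slots changes, which is precisely why Conv must not be conflated with Anti.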

Nevertheless, in WordNet, these two relations are not differentiated: they are<br />

covered by the relation near_antonym (or, in Italian WordNet, antonymy,<br />

grad_antonymy and comp_antonymy). Some of the “real” antonyms are not listed at all, or<br />

some conversives are given as (near-)antonyms. Here we provide a selection of synset<br />

pairs that are connected through the relation near_antonym in WordNet:<br />

{give:3} is near_antonym of {take:8}<br />

{sell:1} is near_antonym of {buy:1, purchase:1}<br />

{rise:16, come up:10, uprise:5, ascend:6} is near_antonym of {set:10, go down:7,<br />

go under:2}<br />

{hire:1, engage:3, employ:2} is near_antonym of {fire:4, give notice:1, can:2,<br />

dismiss:4, give the axe:1, send away:2, sack:2, force out:2, give the sack:1,<br />

terminate:4}<br />

{get off:1} is near_antonym of {board:1, get on:2}<br />

{man:1, adult male:1}is near_antonym of {woman:1, adult female:1}<br />

{foe:2, enemy:4} is near_antonym of {ally:2, friend:2}<br />

{wife:1, married woman:1} is near_antonym of {husband:1, hubby:1, married<br />

man:1}<br />

{parent:1} is near_antonym of {child:2, kid:4}<br />

It can be seen from the examples that there is no unique well-defined<br />

relation that holds between all members of pairs connected by the relation<br />

near_antonym. Basically, these antonym pairs can be divided into two groups: in one<br />

group we can find pairs that are opposites of each other in the sense that the words of<br />

the pair cannot be applied in the same situation, that is, the members of the pair are<br />

mutually exclusive (for instance, if you get on a bus, you cannot get off the bus at the<br />

same time). However, in the other group, there are pairs whose members necessarily<br />

coexist – as an example, if there exists a wife, then there must be a husband, too.<br />

Thus, we suggest that the relation near_antonym (and antonymy) should be divided<br />

into two new relations: antonym and conversive. On the one hand, antonym would<br />

function similarly to the lexical function Anti, that is, it would form a link between<br />
synsets whose meanings differ from each other only with respect to an inner<br />

negation. On the other hand, synsets connected to each other through the relation<br />

conversive would describe the same situation or refer to the same action but from a<br />

different perspective: another participant of the situation becomes more important,<br />

thus, a new aspect is emphasized – just like the lexical function Conv does.<br />

The above-mentioned examples can be categorized into an Anti-group and a Conv-group.<br />

This is shown here:<br />

Conv 31 (sell) = buy, purchase<br />

Conv 31 (give) = take<br />

Conv 21 (parent) = child, kid<br />

Conv 21 (wife) = husband<br />

Anti(get on) = get off<br />

Anti(man) = woman<br />

Anti(rise) = set<br />

Anti(enemy) = friend<br />

Anti(employ) = fire<br />

We also mention that there are some words having both conversive and antonym<br />

pairs. These are the most illustrative examples of the necessity of splitting the original<br />

near_antonym relation into two relations: conversive and antonym. Besides the<br />

examples of equal and receive (given above), another case is provided here:<br />

Conv 21 (spouse) = spouse (that is, spouse is its own conversive)<br />

Anti(spouse) = lover (that is, someone who acts similarly to a spouse towards<br />

someone who is not his or her spouse)<br />

However, in WordNet, {spouse:1, partner:1, married person:1, mate:4, better<br />

half:1} and {lover:3} are not connected to each other in any way. They share their<br />

hypernym but no antonymy relation holds between them. According to our proposed<br />

relations, these synsets should be represented in the following way:<br />

{spouse:1, partner:1, married person:1, mate:4, better half:1} is antonym of<br />

{lover:3}<br />

{spouse:1, partner:1, married person:1, mate:4, better half:1} is conversive of<br />

{spouse:1, partner:1, married person:1, mate:4, better half:1}<br />

In the same way, the synsets {wife:1, married woman:1} and {husband:1, hubby:1,<br />

married man:1} should also be linked to {mistress:1, kept woman:1, fancy woman:2}<br />

and {fancy man:2, paramour:1} with the relation antonym, respectively. However,<br />

they are each other’s conversive.<br />
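The proposed split can be modelled as two symmetric relations over synsets; representing each pair as an unordered set even captures the case of a synset being its own conversive. The simplified synset labels below are our own sketch:

```python
# Sketch of the proposed split of near_antonym into two symmetric
# relations, antonym and conversive. Synset labels are simplified.

antonym = {
    frozenset({"spouse", "lover"}),
    frozenset({"get on", "get off"}),
    frozenset({"man", "woman"}),
}
conversive = {
    frozenset({"buy", "sell"}),
    frozenset({"parent", "child"}),
    frozenset({"spouse"}),   # spouse is its own conversive
}

def related(rel, a, b):
    """True if synsets a and b stand in the (symmetric) relation rel."""
    return frozenset({a, b}) in rel

print(related(antonym, "spouse", "lover"))      # True
print(related(conversive, "spouse", "spouse"))  # True
print(related(antonym, "spouse", "spouse"))     # False
```

Because frozenset({"spouse", "spouse"}) collapses to a singleton, the self-conversive reading falls out of the representation for free, while self-antonymy remains impossible unless explicitly encoded.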

4.5 Derivation<br />

Certain morphological relations between word forms – namely, instances of<br />

derivational morphology – are also encoded in WordNet by means of the relations<br />

eng_derivative and derived. These relations differ from the previously mentioned<br />

ones in an important aspect: they do not hold between all members of the two synsets<br />

linked to each other. It is usually one literal in the synset that serves as the basic form<br />

for the derivation, and, on the other hand, the derived form can also have some<br />

synonyms within its own synset. Thus, morphological relations hold between<br />
word forms rather than between synsets.<br />

In WordNet, we can find examples of nominal, verbal, adjectival and adverbial<br />

derivations as well although the nature of the derivation (that is, nominal, verbal etc.)<br />

is not explicitly stated:<br />

{quantify:2, measure:2} -->> {quantification:2}<br />

{energy:1} -->> {excite:3, energize:2, energise:1}<br />

{membrane:2, tissue layer:1} -->> {membranous:1}<br />

{real:1, existent:2} -->> {actually:1, really:1}<br />

However, these derivations are not differentiated with respect to the semantic<br />

connection that exists between the original word and the derived one. To put it in<br />

another way, it is indicated that a certain morphological relation holds between the<br />

words but the semantic nature of this relation is left underspecified. Obviously, the<br />

definitions of the words contain pieces of information from which the semantic<br />

relation can be calculated but when looking for a special type of word or words<br />

having specific grammatical features (for instance, agents or patients), it is<br />
time-consuming to look up every single synset that is connected to the original one through<br />

the relation eng_derivative or derived in order to find the necessary ones.<br />

As can be expected, there are lexical functions which – instead of changing the<br />

semantic content of the word – change the syntactic features of the word, in other<br />

words, they preserve the semantic content but change the part-of-speech of the word<br />

by derivation. Lexical functions S 0 , V 0 , A 0 and Adv 0 nominalize, verbalize,<br />

adjectivize and adverbialize the original word, respectively [1]:<br />

S 0 (present) = presentation<br />

V 0 (verbal) = verbalize<br />

A 0 (beauty) = beautiful<br />

Adv 0 (hard) = hard<br />

Other lexical functions specify a participant in the situation described by a verb.<br />

For instance, the lexical function S 1 gives the agent while the patient is provided by S 2<br />

and S 3 generates another participant who is involved in the situation:<br />

S 1 (write) = author<br />

S 2 (speak) = speech<br />

S 3 (speak) = addressee<br />

Lexical functions belonging to the latter type usually represent derivations that are<br />

not considered to be productive or systematic; however, the syntacto-semantic<br />

relation between the two lexical units is evident [1].<br />
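Encoding these derivational lexical functions as typed links makes the search problem described below trivial. The table structure and lookup helper are our own sketch; the examples are taken from the text:

```python
# Hypothetical sketch: derivational lexical functions as typed links
# between word forms (examples from the text; structure is our own).

DERIVATION = {
    ("present", "S0"): "presentation",   # nominalization
    ("verbal", "V0"): "verbalize",       # verbalization
    ("beauty", "A0"): "beautiful",       # adjectivization
    ("write", "S1"): "author",           # agent
    ("speak", "S2"): "speech",           # patient
    ("speak", "S3"): "addressee",        # further participant
}

def derive(word, lf):
    """Look up the value of a derivational lexical function, if encoded."""
    return DERIVATION.get((word, lf))

# With typed links, all agent nouns can be found directly instead of
# scanning every untyped eng_derivative link:
agents = sorted(v for (w, lf), v in DERIVATION.items() if lf == "S1")
print(derive("write", "S1"))   # author
print(agents)                  # ['author']
```

An untyped eng_derivative link would force a reader (or program) to inspect each gloss; the typed variant answers "which derivatives are agents?" in a single filtered pass.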

As the comparison of lexical functions and WordNet relations concerning<br />

derivational morphology reveals, lexical functions offer a more detailed analysis of<br />

derivational relations than WordNet relations in their present form do. Thus, we<br />

propose that WordNet relations encoding derivational morphology should be<br />

enhanced in order to provide a more precise and accurate network of words from a<br />

derivational point of view as well. With the introduction of relations S 0 , V 0 , A 0 and<br />

Adv 0 , instances of derivational morphology would be easier to detect in<br />

WordNet. On the other hand, the application of the relations S 1 , S 2 and S 3 would make<br />

it possible to search for semantic derivations in WordNet. Finally, both innovations<br />

would be of use in second language acquisition (for WordNet used in second<br />

language teaching, see e.g. [20]).<br />

To conclude the comparison of lexical functions and WordNet relations, we<br />

provide a table summarizing the parallels between the two systems:<br />

Table 1. Parallels between WordNet relations and lexical functions.<br />

| Relation | WordNet, EuroWordNet | Lexical function |
|---|---|---|
| synonymy | synset, similar_to, near_synonym | Syn |
| antonymy | near_antonym, antonymy | Anti, Conv |
| hypernymy | hypernym | Gener |
| hyponymy | hyponym | Spec |
| holonymy | holo_member | Mult |
| meronymy | mero_member, mero_portion (in EWN), mero_part | Sing, Part |
| derivational morphology | eng_derivative, derived | S 0 , V 0 , A 0 , Adv 0 |

4.6 New Relations<br />

In this section we discuss some semantic relations encoded by lexical functions that<br />

have no equivalent in WordNet. However, we think that their addition would<br />

contribute to the future development of WordNet in a useful way.<br />

A semantic relation that is not yet present in WordNet is formalized with the help<br />

of the lexical function Cap: it signals the leader or boss of something [1]. For<br />

instance:<br />

Cap(university) = chancellor<br />

Cap(ship) = captain<br />

Cap(school) = headmaster<br />

In WordNet, this semantic relation could be marked in the following way:<br />

{master:7, captain:4, sea captain:1, skipper:2} is leader of {ship:1}<br />

Another lexical function that can be applied in WordNet, too, is Equip, which<br />

refers to the staff of something [1]. Some illustrations:<br />

Equip(theatre) = company<br />

Equip(ship) = crew<br />

This relation can be encoded in WordNet in this way:<br />

{crew:1} is staff of {ship:1}<br />

A third relation that we propose links nouns and verbs, namely, it is the verb<br />

having the sense of producing the typical sound of the noun. This lexical function is<br />

called Son [1]:<br />

Son(dog) = bark<br />

Son(pig) = grunt<br />

In WordNet, the newly proposed relation sound can stand for this relation:<br />

{grunt:1} sound of {hog:3, pig:1, grunter:2, squealer:2, Sus scrofa:2}<br />
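The three proposed relations, Cap, Equip, and Son, can all be encoded uniformly as typed links between synsets. The encoding below is our own sketch, reusing the examples from the text:

```python
# Illustrative encoding (our own) of the proposed relations leader (Cap),
# staff (Equip), and sound (Son) as typed links between synsets.

NEW_RELATIONS = {
    ("ship", "leader"): "captain",    # Cap(ship) = captain
    ("ship", "staff"): "crew",        # Equip(ship) = crew
    ("pig", "sound"): "grunt",        # Son(pig) = grunt
    ("dog", "sound"): "bark",         # Son(dog) = bark
}

def lookup(synset, relation):
    """Return the target of a typed relation from a synset, if encoded."""
    return NEW_RELATIONS.get((synset, relation))

print(lookup("ship", "leader"))   # captain
print(lookup("pig", "sound"))     # grunt
```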

In sections 4.4 and 4.5 we already proposed the splitting of the relation<br />

near_antonym into conversive and antonym and the introduction of relations encoding<br />

derivational connections. Thus, we refrain from repeating our argumentation here.<br />

Instead, we summarize the newly proposed semantic relations in the following table:<br />

Table 2. Proposed semantic relations.<br />

| Proposed relation | WordNet | Lexical function |
|---|---|---|
| conversion | conversive | Conv |
| antonymy | antonym | Anti |
| leadership | leader | Cap |
| staff | staff | Equip |
| typical sound | sound | Son |

5 Conclusion<br />

In this paper, we compared some semantic and lexical relations used in WordNet and<br />

EuroWordNet and the corresponding lexical functions in Meaning-Text Theory. We<br />

found that in the case of synonymy, hypernymy and holonymy, lexical functions and<br />

WordNet relations are equivalent. However, we found that the relation near_antonym,<br />

that is, the one encoding antonymy in WordNet covers two different semantic<br />

relations. Thus, we suggested that this relation should be divided into two new<br />

relations: conversive and antonym. The coding of derivational morphology can also<br />

be improved by introducing new relations that encode not only morphological but<br />

semantic derivations as well. Finally, we proposed some new semantic relations that<br />

may contribute to the further development of the complex but colourful network of<br />

words found in WordNet.<br />

Acknowledgements<br />

Thanks are due to our two reviewers, Antonietta Alonge and Kadri Vider for their<br />

useful comments and remarks, which helped us improve the quality of this article.<br />

References<br />

1. Mel'čuk, I., Clas, A., Polguère, A.: Introduction à la lexicologie explicative et combinatoire.<br />

Duculot: Louvain-la-Neuve (1995)<br />

2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: an<br />

On-line Lexical Database. J. International Journal of Lexicography 3(4), 235–244 (1990)<br />

3. Mel'čuk, I.: Esquisse d'un modèle linguistique du type "Sens-Texte". In: Problèmes actuels<br />

en psycholinguistique. Colloques inter. du CNRS, nº 206. pp. 291–317. CNRS, Paris (1974)<br />

4. Mel'čuk, I.: Semantic Primitives from the Viewpoint of the Meaning-Text Linguistic Theory.<br />

J. Quaderni di Semantica 10(1), 65-102 (1989)<br />

5. Mel'čuk, I.: Lexical Functions: A Tool for the Description of Lexical Relations in the<br />

Lexicon. In: Wanner, L. (ed.) Lexical Functions in Lexicography and Natural Language<br />

Processing, pp. 37–102. Benjamins, Amsterdam (1996)<br />

6. Mel'čuk, I.: Collocations and Lexical Functions. In: Cowie, A. P. (ed.) Phraseology. Theory,<br />

Analysis, and Applications, pp. 23–53. Clarendon Press, Oxford (1998)<br />

7. Mel'čuk, I., Žolkovskij, A.: Explanatory Combinatorial Dictionary of Modern Russian.<br />

Wiener Slawistischer Almanach, Vienna (1984)<br />

8. Mel'čuk, I. et al.: Dictionnaire explicatif et combinatoire du français contemporain:<br />

Recherches lexico-sémantiques I–IV. Presses de l'Université de Montréal, Montréal (1984,<br />

1988, 1992, 1999)<br />

9. Wanner, L. (ed.): Selected Lexical and Grammatical Issues in the Meaning-Text Theory. In<br />

honour of Igor Mel'čuk. Benjamins, Amsterdam (2007)<br />

10. Wanner, L. (ed.): Recent Trends in Meaning-Text Theory. Benjamins, Amsterdam (1997)<br />

11. Répási Gy., Székely G.: Lexikográfiai előtanulmány a fokozó értelmű szavak és<br />

szókapcsolatok szótárához [A lexicographic pilot study on the dictionary of intensifying<br />

words and collocations]. J. Modern Nyelvoktatás 4(2–3), 89–95 (1998)<br />

12. Székely G.: A fokozó értelmű szókapcsolatok magyar és német szótára. [Hungarian –<br />

German dictionary of intensifying collocations]. Tinta Könyvkiadó, Budapest (2003)<br />

13. Dancette, J., L'Homme, M.-C.: The Gate to Knowledge in a Multilingual Specialized<br />

Dictionary: Using Lexical Functions for Taxonomic and Partitive Relations. In: EURALEX<br />

2002 Proceedings, pp. 597–606. Copenhagen (2002)<br />

14. Grimes, J. E.: Inverse Lexical Functions. In: Steele, J. (ed.) Meaning-Text Theory:<br />

Linguistics, Lexicography and Implications, pp. 350–364. Ottawa University Press, Ottawa<br />

(1990)<br />

15. Miller, G. A.: Nouns in WordNet: A Lexical Inheritance System. J. International Journal of<br />

Lexicography 3(4), 245–264 (1990)<br />

16. Alonge, A., Bloksma, L., Calzolari, N., Castellon, I., Marti, T., Peters, W., Vossen P.: The<br />

Linguistic Design of the EuroWordNet Database. J. Computers and the Humanities. Special<br />

Issue on EuroWordNet, 32(2–3), 91–115 (1998)<br />

17. Fontenelle, T.: Turning a Bilingual Dictionary into a Lexical-Semantic Database. Max<br />

Niemeyer, Tübingen (1997)<br />

18. Fellbaum, C., Gross, D., Miller, K.: Adjectives in WordNet. J. International Journal of<br />

Lexicography 3(4), 265–277 (1990)<br />

19. Alonge, A., Bertagna, F., Calzolari, N., Roventini, A., Zampolli, A.: Encoding Information<br />

on Adjectives in a Lexical-Semantic Net for Computational Applications. In: Proceedings of<br />

NAACL 2000, pp. 42–49. Seattle (2000)<br />

20. Hu, X., Graesser, A. C.: Using WordNet and latent semantic analysis to evaluate the<br />

conversational contributions of learners in tutorial dialogue. In: Proceedings of ICCE'98. 2.,<br />

pp. 337–341. Higher Education Press, Beijing (1998)


KYOTO: A System for Mining, Structuring, and<br />

Distributing Knowledge Across Languages and Cultures<br />

Piek Vossen 1,2 , Eneko Agirre 3 , Nicoletta Calzolari 4 , Christiane Fellbaum 5,6 , Shu-Kai Hsieh 7 , Chu-Ren Huang 8 , Hitoshi Isahara 9 , Kyoko Kanzaki 9 , Andrea Marchetti 10 , Monica Monachini 4 , Federico Neri 11 , Remo Raffaelli 11 , German Rigau 3 , Maurizio Tesconi 10 , and Joop VanGent 2<br />

1 Faculteit der Letteren, Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081HV Amsterdam, Netherlands; p.vossen@let.vu.nl<br />
2 Irion Technologies, Delftechpark 26, 2628XH Delft, Netherlands; {piek.vossen, gent}@irion.nl<br />
3 IXA NLP group, University of the Basque Country, Manuel Lardizabal 1, Donostia, Basque Country; {e.agirre, g.rigau}@ehu.es<br />
4 Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche, Via Moruzzi 1, 56124 Pisa, Italy; nicoletta.calzolari@ilc.cnr.it, monica.monachini@ilc.cnr.it<br />
5 Berlin-Brandenburg Academy of Sciences, Berlin, Germany<br />
6 Princeton University, Princeton, USA<br />
7 National Taiwan Normal University, Republic of China<br />
8 Academia Sinica, Taipei, Republic of China<br />
9 NICT, Kyoto, Japan<br />
10 Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Via Moruzzi 1, 56124 Pisa, Italy; andrea.marchetti@iit.cnr.it, maurizio.tesconi@iit.cnr.it<br />
11 Synthema, Pisa, Italy<br />

Abstract. We outline work to be carried out within the framework of an<br />

impending EC project. The goal is to construct a language-independent<br />

information system for a specific domain (environment/ecology) anchored in a<br />

language-independent ontology that is linked to WordNets in several languages.<br />

For each language, information extraction and identification of lexicalized<br />

concepts with ontological entries will be done by text miners ("Kybots"). The<br />

mapping of language-specific lexemes to the ontology allows for cross-linguistic<br />

identification and translation of equivalent terms. The infrastructure developed<br />

within this project will enable long-range knowledge sharing and transfer to<br />

many languages and cultures, addressing the need for global and uniform<br />

transition of knowledge beyond the domain of ecology and environment<br />

addressed here.<br />

Keywords: Global WordNet Grid, ontologies and WordNets, multilinguality,<br />

semantic indexing and search, text mining.


KYOTO: A System for Mining, Structuring, and Distributing… 475<br />

1 Introduction<br />

Economic globalization brings challenges and the need for new solutions that can<br />

serve all countries. Timely examples are environmental issues related to rapid growth<br />

and economic developments such as global warming. The universality of these<br />

problems and the search for solutions require that information and communication be<br />

supported across a wide range of languages and cultures. Specifically, a system is<br />

needed that can gather and represent in a uniform way distributed information that is<br />

structured and expressed differently across languages. Such a system should<br />

furthermore allow both experts and laymen to access information in their own<br />

language and without recourse to cultural background knowledge.<br />

Addressing sudden and unpredictable environmental disasters (fires, floods,<br />

epidemics, etc.) requires immediate decisions and actions relying on information that<br />

may not be available locally. Moreover, the sharing and transfer of knowledge are<br />

essential for sustainable growth and long-term development. In both cases, it is<br />

important that information and experience are not only distributed to assist with local<br />

emergencies but are universally re-usable. In these settings, natural language is the<br />

most ubiquitous and flexible interface between users – especially non-experts – and<br />

information systems.<br />

The goal of "Knowledge-Yielding Ontologies for Transition-Based Organization"<br />

(KYOTO) is, first, to develop a content enabling system that provides deep semantic<br />

search. KYOTO will cover access to a broad range of multimedia data from a large<br />

number of sources in a variety of culturally diverse languages. The data will be<br />

accessible to both experts and the general public on a global scale.<br />

KYOTO is funded under project number 211423 in the 7th Framework Programme in<br />
the area of Digital Libraries: FP7-ICT-2007-1, Objective ICT-2007.4.2: Intelligent<br />
Content and Semantics. It will start in early 2008 and last three years. The consortium<br />

consists of research institutes, companies and environmental organizations: Vrije<br />

Universiteit Amsterdam (Amsterdam, The Netherlands), Consiglio Nazionale delle<br />

Ricerche (Pisa, Italy), Berlin-Brandenburg Academy of Sciences and Humanities<br />
(Berlin, Germany), University of the Basque Country (Donostia, Basque Country),<br />
Academia Sinica (Taipei, Taiwan), National Institute of Information and<br />
Communications Technology (Kyoto, Japan), Irion Technologies (Delft, The<br />
Netherlands), Synthema (Pisa, Italy), European Centre for Nature Conservation<br />
(Tilburg, The Netherlands), World Wide Fund for Nature (Zeist, The Netherlands),<br />
Masaryk University (Brno, Czech Republic). In total, 364 person-months of work are involved.<br />

The partners from Taiwan and Japan are funded by national grants.<br />

2 The KYOTO System: Overview<br />

KYOTO is a generic system offering knowledge transition and information across<br />

different target groups, crossing linguistic, cultural, and geographic boundaries.<br />

Initially developed for the environmental domain, KYOTO will be usable in any<br />

knowledge domain for mining, organizing, and distributing information on a global<br />

scale in both European and non-European languages.


476 Piek Vossen et al.<br />

KYOTO's principal components are an ontology linked to WordNets in a broad<br />

range of languages (Basque, Chinese, Dutch, English, Italian, Japanese, Spanish),<br />

linguistic text miners, a Wiki environment for supporting and maintaining the system,<br />

and a portal for the environment domain that allows for deep semantic searches.<br />

Concept extraction and data mining are applied through a chain of semantic<br />

processors ("Kybots") that share a common ground and knowledge base and re-use<br />

the knowledge for different languages and for particular domains.<br />

Information access is provided through a cross-lingual user-friendly interface that<br />

allows for high-precision search and information dialogues for a variety of data from<br />

wide-spread sources in a range of different languages. This is made possible through a<br />

customizable, shared ontology that is linked to various WordNets and that guarantees<br />

a uniform interpretation for diverse types of information from different sources and<br />

languages.<br />

The system can be maintained and kept up to date by specialists in the field using<br />

an open Wiki platform for ontology maintenance and WordNet extension.<br />

[Figure 1: diagram of the KYOTO architecture, linking users (citizens, governors, companies, environmental organizations, experts), the domain Wiki, the universal ontology (top, middle, and domain layers, e.g. water, CO2, pollution, emission), the WordNets, and the capture, concept-mining, fact-mining, indexing, and dialogue/search modules over documents, URLs, and images.]<br />

Fig. 1. System architecture<br />

Figure 1 gives an overview of the complete system. In this schema, information<br />

stored in various media and languages, distributed over different locations, is<br />

collected through a Capture module and stored in a uniform XML representation. For<br />

each language, concept miners are applied to derive concepts that occur in the textual<br />

data and compare these with the given WordNets for the different languages. The<br />

WordNets provide a mapping to a single shared ontology. Both the WordNets and the<br />

ontology can be modified and edited in a special Wiki environment by the people in a<br />

community; in the present project, these will be specialists in the environment<br />

domain. Encoding of knowledge and WordNets for a domain will result in more


KYOTO: A System for Mining, Structuring, and Distributing… 477<br />

precise and effective mining of information and data through fact mining by the so-called<br />

Kybots. Kybots will be able to detect specific patterns and relations in text<br />

because of the concepts and constraints coded by the experts. These relations are<br />

added to the XML representation of the captured text. An indexing module then<br />

creates the indexes for different databases and data types that can be accessed by the<br />

users through a text search interface or possibly dialogue systems. The users can be<br />

the same environmental organisations, and/or governments and citizens.<br />
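As a toy illustration of this chain (capture, concept mining, fact mining, indexing), consider the sketch below; every name and data structure here is invented for exposition and does not reflect the project's actual API.

```python
# Illustrative sketch of the KYOTO processing chain (all names hypothetical).
def capture(source_text):
    """Capture module: wrap raw input in a uniform representation."""
    return {"text": source_text, "concepts": [], "facts": []}

def mine_concepts(doc, wordnet):
    """Concept miner: map surface terms to wordnet-derived concepts."""
    doc["concepts"] = [wordnet[t] for t in doc["text"].split() if t in wordnet]
    return doc

def mine_facts(doc, kybot_patterns):
    """Kybots: detect relations over the extracted concepts."""
    for relation, (c1, c2) in kybot_patterns.items():
        if c1 in doc["concepts"] and c2 in doc["concepts"]:
            doc["facts"].append((c1, relation, c2))
    return doc

def index(doc):
    """Indexing module: expose the detected facts for semantic search."""
    return {fact: doc["text"] for fact in doc["facts"]}

# Toy run over a single captured sentence.
wn = {"CO2": "substance-CO2", "emission": "process-emission"}
patterns = {"undergoes": ("substance-CO2", "process-emission")}
doc = mine_facts(mine_concepts(capture("CO2 emission rises"), wn), patterns)
idx = index(doc)
```

The point of the sketch is only the division of labour: each stage reads and enriches the same shared representation, so stages can be swapped per language without changing the rest of the chain.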

In the next sections, we will discuss KYOTO's major components in more detail.<br />

3 The Ontology<br />

The ontology, where knowledge of concepts is formally encoded, consists of three<br />

layers. The top layer is based on existing top level ontologies, among them SUMO<br />

[10, 11], DOLCE [8] and the MEANING Top Concept Ontology [3]. We will<br />

investigate which ontology is the best basis for our purpose and can also be shared<br />

across the diverse languages and cultures. If necessary, ontology fragments or<br />

elements can be shared or a selection will be made. We do not expect major<br />

differences in the fundamental semantic organisation of the different languages.<br />

Recent studies, for example, show that the Chinese radical system and character<br />

compounding tend to be based on the same qualia distinctions as in the Generative<br />

Lexicon [4, 5].<br />

The middle layer will be derived from existing WordNets, where concepts are<br />

mapped to lexical units. The ontology's mid-level must be developed such that it<br />

connects domain terms and concepts to the top-level. We define all the high-level and<br />

mid-level concepts that are needed to accommodate the information in the<br />

environmental domain. Knowledge is implemented at the most generic level to<br />

maximize re-usability yet precisely enough to yield useful constraints in detecting<br />

relations. Within the domain, we extend the ontologies to cover all necessary concepts<br />

and applicable, sharable relations.<br />
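The layering described above can be illustrated with a toy subclass hierarchy; the concept names and the flat dictionary encoding are invented for exposition, not taken from the actual ontology.

```python
# Hypothetical sketch of the three-layer ontology as a subclass hierarchy.
SUBCLASS = {
    # domain layer (environment terms)
    "water-pollution": "pollution",
    "CO2": "gas",
    # middle layer (derived from existing wordnets)
    "pollution": "process",
    "gas": "substance",
    # top layer (SUMO/DOLCE-style categories)
    "process": "physical",
    "substance": "physical",
}

def top_category(concept):
    """Walk subclass links from a domain term up to the top layer."""
    while concept in SUBCLASS:
        concept = SUBCLASS[concept]
    return concept
```

The middle layer is what makes the design work: domain terms never attach directly to the top level, so knowledge encoded at the middle level is re-usable across domains.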

The domain terms are extracted semi-automatically from the source documents or<br />

manually created through a Domain Wiki. The Domain Wiki allows experts to modify<br />

and extend the domain level of the ontology and extend the WordNets accordingly. It<br />

enables community-based resource building, which will lead to increased, shared<br />

understanding of the domain and at the same time result in the formalization of this<br />

knowledge, so that it can be used by an automatic system.<br />

This resource will build on the Multilingual Central Repository (MCR) knowledge<br />

base [1] developed in the MEANING project [12]. Currently, the MCR consistently<br />

integrates more than 1.6 million semantic links among concepts. Moreover, the<br />

current MCR has been enriched with about 460,000 semantic and ontological<br />

properties [2]: Base Concepts and Top Concept Ontology [3], WordNet Domains [7],<br />

Suggested Upper Merged Ontology (SUMO) [10], providing ontological coherence to<br />

all the uploaded WordNets.<br />

Extensions to WordNets and the ontology will be propagated through appropriate<br />

sharing protocols, developed exploiting LeXFlow, a framework for rapid prototyping<br />

of cooperative applications for managing lexical resources (XFlow [9] and LeXFlow


478 Piek Vossen et al.<br />

[13, 14, 15, 16]). The shared ontology guarantees a uniform interpretation layer for<br />

the diverse information from different sources and languages. At the lowest level of<br />

the ontology, we expect that abstract constraints and structures can be hidden for the<br />

users but can still be used to prevent fundamental errors, e.g. creating a concrete<br />

concept for an adjective. The Wiki users should focus on formulating conditions and<br />

specifications that they understand without having to worry about the linguistic and<br />

knowledge engineering aspects. They can discuss these specifications within their<br />

community to reach consensus and provide proper labels in each language.<br />

4 Kybots<br />

Once the ontological anchoring is established, it will be possible to build text mining<br />

software that is able to detect semantic relations and propositions. Data miners, so-called<br />

Kybots (Knowledge-yielding robots), can be defined using constraints among<br />

relations at a generic ontological level. These logical expressions need to be<br />

implemented in each language by mapping the conceptual constraint onto linguistic<br />

patterns. A collection of Kybots created in this way can be used to extract the relevant<br />

knowledge from textual sources represented in a variety of media and genres and<br />

across different languages and cultures. Kybots will represent such knowledge in a<br />

uniform and standardized XML format, compatible with WWW specifications for<br />

knowledge representation such as RDF and OWL.<br />

Kybots will be developed to cover users' questions and answers as well as generic<br />

concepts and relations occurring in any domain, such as named-entities, locations,<br />

time-points, etc. Kybots are primarily defined at a generic level to maximize reusability<br />

and inter-operability. We develop the Kybots that are necessary for the<br />

selected domain but the system can easily be extended and ported to other domains.<br />

The Kybots will operate on a morpho-syntactic and semantic encoding level that<br />

will be the same across all the languages. Every group will use existing linguistic<br />

processors or develop additional ones as needed to provide a basic linguistic<br />

analysis, which involves: tokenization, segmentation, morpho-syntactic tagging,<br />

lemmatization and basic syntactic parsing. Each of these processes can be different<br />

but the XML encoding of the output will be the same. This will guarantee that Kybots<br />

can be applied to the output of text in different languages in a uniform way. We will<br />

use existing and freely available software as much as possible for this process. Note that<br />

the linguistic expression rules of ontological patterns in a specific Kybot are to be<br />

defined on the basis of the common output encoding of the linguistic processors.<br />

Likewise, they can share specifications of linguistic expressions insofar as the relations<br />

are expressed in the same way in these languages.<br />
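The idea of one conceptual pattern implemented by language-specific expression rules can be sketched as follows. The pattern format and the regular-expression rules are hypothetical simplifications; the actual Kybots operate on the uniform XML encoding of the linguistic processors, not on raw strings.

```python
import re

# A hypothetical Kybot: one ontological relation, per-language surface rules.
kybot = {
    "relation": "causes(Process, Process)",
    "expression_rules": {
        "en": r"(\w+) causes (\w+)",
        "nl": r"(\w+) veroorzaakt (\w+)",
    },
}

def apply_kybot(kybot, text, lang):
    """Run the language-specific rule and emit uniform fact tuples."""
    rule = kybot["expression_rules"][lang]
    return [(a, "causes", b) for a, b in re.findall(rule, text)]
```

Because the relation is stated once at the ontological level, adding a language only means adding one expression rule; the extracted facts stay in the same uniform format.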

5 Indexing, Searching, and Interfacing<br />

The extracted knowledge and information is indexed by an existing search system that<br />

can handle fast semantic search across languages. It uses so-called contextual<br />

conceptual indexes, which means that occurrences of concepts in text are interpreted



by their co-occurrence with other concepts within a linguistically defined context,<br />

such as a noun phrase or sentence. The co-occurrence patterns of concepts can be<br />

specified in various ways, possibly based on semantic relations that are defined in the<br />

logical expressions. Thus, the system yields different results for searches for polluting<br />

substance and polluted substance, because these involve different semantic relations<br />

between the same concepts. By mapping a query to concepts and relations, very<br />

precise matches can be generated without losing the scalability and robustness found<br />

in regular search engines that rely on string matching and context windows.<br />
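A contextual conceptual index of this kind can be sketched as a mapping from (concept, relation, concept) triples to documents; the relation labels below are illustrative only, but they show why "polluting substance" and "polluted substance" retrieve different documents.

```python
# Sketch of a contextual conceptual index (format invented for illustration):
# concepts are stored together with the semantic relation they bear in context.
index = {}

def add_occurrence(doc_id, concept, relation, other):
    """Record that `concept` stands in `relation` to `other` in a document."""
    index.setdefault((concept, relation, other), set()).add(doc_id)

def search(concept, relation, other):
    """Retrieve the documents matching a concept-relation-concept query."""
    return index.get((concept, relation, other), set())

# "polluting substance" vs "polluted substance": same two concepts,
# different semantic relation, hence different results.
add_occurrence("doc1", "substance", "agent-of", "pollution")
add_occurrence("doc2", "substance", "patient-of", "pollution")
```

A plain string-matching engine would conflate the two queries; indexing the relation alongside the concepts keeps them apart.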

Reasoning over facts and ontological structures will make it possible to handle<br />

diverse and more complex types of questions. Cross-linguistic and cross-cultural<br />

understanding is safeguarded through the ontological anchoring of language via<br />

WordNets and text miners.<br />

6 The Wiki Environment<br />

The Wiki environment enables domain experts to easily extend and manage the<br />

ontology and the WordNets in a distributed context, to constantly reflect the<br />

continuous growth and changes of the data they describe. It has the characteristics<br />

typical of a generic Wiki engine:<br />

• a Web-based, highly interactive interface, tailored to domain experts who are not<br />

familiar with the underlying complex data model (the ontology plus WordNets of<br />

different languages);<br />

• tools to support collaborative editing and consensus achievement such as<br />

discussion forums and lists of recent updates;<br />

• automatic acquisition of information from external Web resources (e.g.<br />

Wikipedia);<br />

• rollback mechanism: each change to the content is versioned;<br />

• search functions providing the possibility to define different search patterns (synset<br />

search, textual search and so on);<br />

• role-based user management.<br />

In addition, the Wiki engine manages the underlying complex data model of the<br />

ontology and the WordNets so as to keep it consistent: this is achieved through the<br />

definition of appropriate sharing protocols. For instance, when a new domain term<br />

such as water pollution is inserted into a language-specific WordNet by a domain<br />

expert, a new entry, referred to as dummy entry because of the incompleteness of the<br />

information represented, will be automatically created and added to the ontology and<br />

in the remaining WordNets. The Wiki environment will list all dummy entries still to<br />

be filled in, notifying domain experts so that these entries can be completely<br />

defined and integrated into KYOTO's ontological and lexical resources. In this<br />

context, English can be used as the common ground language in order to support the<br />

extension process and the propagation of changes among the different WordNets and<br />

the ontology.
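The dummy-entry protocol can be sketched as follows; the data structures are illustrative and do not reflect the Wiki engine's actual model.

```python
# Sketch of dummy-entry propagation when a domain expert inserts a term
# into one language-specific wordnet (structures invented for illustration).
def insert_term(term, source_lang, wordnets, ontology):
    wordnets[source_lang][term] = {"status": "defined"}
    ontology[term] = {"status": "dummy"}          # placeholder in the ontology
    for lang, wn in wordnets.items():
        if lang != source_lang:
            wn[term] = {"status": "dummy"}        # placeholder to be filled in

def pending_dummies(wordnets):
    """List the entries still to be completed by domain experts."""
    return [(lang, t) for lang, wn in wordnets.items()
            for t, entry in wn.items() if entry["status"] == "dummy"]
```

A usage example: inserting "water pollution" into the English wordnet leaves a dummy in the Dutch wordnet and in the ontology, which the Wiki then lists for completion.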



7 Sharing<br />

Knowledge sharing is a central aspect of the KYOTO system and occurs on multiple<br />

levels.<br />

7.1 Sharing and Re-Use of Generic Knowledge<br />

Sharing of generic ontological knowledge in the domain takes place mainly through<br />

subclass relations. We collect all the relevant terms in each language for the domain<br />

and add them to the general ontology. Possibly, these concepts can be imported from<br />

a specific WordNet and "ontologized." It will be important to specify exactly the<br />

ontological status of the terms. Only disjoint types need to be added [6]. For example,<br />

CO2 is a type of substance, whereas greenhouse gases do not represent a different<br />

type of gas or substance but refer to substances that play a specific role in specific<br />

circumstances. In so far as new definitions and axioms need to be specified, they can<br />

be added for the specific subtypes in the domain. However, this is only necessary if<br />

the related information also needs to be mined from the text and is not already<br />

covered by the generic miners. Next, the generic and domain knowledge is shared<br />

among all participating languages through the mapping of the different WordNets to<br />

the ontology.<br />

Extension to different domains is possible though not within the scope of the<br />

current project.<br />
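The type/role distinction for CO2 versus greenhouse gas can be made concrete with a small sketch; the representation is invented here purely for illustration.

```python
# Illustrative encoding of the type/role distinction from the text:
# CO2 is a disjoint subtype of substance, while "greenhouse gas" is a
# role that existing substances play in certain circumstances.
TYPES = {"CO2": "substance", "methane": "substance"}
ROLES = {"greenhouse-gas": {"CO2", "methane"}}

def is_type(concept):
    """Is this concept a disjoint type to be added to the ontology?"""
    return concept in TYPES

def plays_role(concept, role):
    """Does this type play the given role, without being a new subtype?"""
    return concept in ROLES.get(role, set())
```

The design choice this mirrors is that only genuine types extend the subclass hierarchy, which keeps the ontology small and the mined facts consistent.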

7.2 Sharing and Re-Use of Generic Kybots<br />

The sharing of Kybots is more subtle. For example, concentrations of substances,<br />

causal relations between processes or conditional states for processes can be stated as<br />

general conceptual patterns using a simple logical expression. Within a specific<br />

domain, any of these relations and conditions could be detected in the textual data by<br />

just using these general patterns. For instance, people usually do not use special words in a<br />

language to refer to the causal relation itself but they use general words such as<br />

"cause" or "factor". Since any causal relation may hold among processes and/or states,<br />

they can also hold in the environmental domain. Certain valid conditions can be<br />

specified in addition to the general ones, as they are relevant for the users. For<br />

example, CO2 emissions can be derived from a certain process involving certain<br />

amounts of the substance CO2, but critical levels can be defined in the text miner as a<br />

conceptual constraint. Furthermore, we may want to limit the ambiguity of<br />

interpretation that arises at the generic levels to only one interpretation at the domain<br />

level; it is currently an open question to what extent generic patterns can be used or<br />

need to be tuned.<br />

Each language group can build a Kybot, capturing a particular relation. A given<br />

logical expression that underlies the Kybot of another language can be re-used, or a<br />

new pattern can be formulated for a language and a generic universal pattern derived<br />

from it. We foresee a system where the text miners can load any set of Kybots in<br />

combination with the ontology, a set of WordNets and expression rules in each



language. Each Kybot, a textual XML file, contains a logical expression with<br />

constraints from the ontology (either the general ontology or a domain instantiation).<br />

Through the WordNets and the expression rules, the text miner knows how to detect a<br />

pattern in running text for each specific language. In this way, logical patterns can be<br />

shared across languages and across domains.<br />

A Kybot can likewise be developed by a group in one language and taken up by<br />

another group to apply it to another language. Consider the case where a generic<br />

linguistic text miner is formulated for Dutch, based on Dutch words and expressions.<br />

This Kybot is projected to the ontology via the Dutch WordNet, becoming a generic<br />

ontological expression which relates two ontological classes: a Substance to a<br />

Process. This expression may be extended to a domain, where it is applied to CO2 and<br />

CO2 emissions. Next, the Spanish group can load the domain specific expression and<br />

transform it into a Spanish Kybot that can be applied to a domain text in Spanish. To<br />

turn an ontological expression into a Kybot, language expressions rules and functions<br />

need to be provided. This process can be applied to all the participating languages,<br />

where the basic knowledge is shared.<br />
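The projection of a pattern from one language to the ontology and back into another language can be sketched as follows; the wordnet-to-ontology mappings are toy examples, not entries of the real resources.

```python
# Sketch of projecting a Kybot between languages via the shared ontology
# (mappings and pattern format invented for illustration).
NL_WN = {"stof": "Substance", "proces": "Process"}
ES_WN = {"sustancia": "Substance", "proceso": "Process"}

def to_ontology(pattern_words, wordnet):
    """Lift a language-specific pattern to ontological classes."""
    return [wordnet[w] for w in pattern_words]

def to_language(classes, wordnet):
    """Express an ontological pattern in another language."""
    inverse = {cls: word for word, cls in wordnet.items()}
    return [inverse[cls] for cls in classes]

# Dutch pattern -> generic ontological expression -> Spanish pattern
generic = to_ontology(["stof", "proces"], NL_WN)
spanish = to_language(generic, ES_WN)
```

The generic expression relating a Substance to a Process is what is actually shared; each group only supplies the mapping for its own language.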

7.3 Cross-Linguistic Sharing of Ontologies<br />

KYOTO will thus generate Kybots in each language that go back to a shared ontology<br />

and shared logical expressions. Thus, KYOTO can be seen as a sophisticated platform<br />

for anchoring and grounding meaning within a social community, where meaning is<br />

expressed and conceived differently across many languages and cultures. It also<br />

immediately makes this shared knowledge operational so that factual knowledge can<br />

be mined from unstructured text in domains. KYOTO supports interoperability and<br />

sharing across these communities since much knowledge can be re-used in any other<br />

domain, and the ontologies support both generic and domain-specific knowledge.<br />

8 Evaluation<br />

The KYOTO system is evaluated in various ways:<br />

1. WordNets and ontologies are evaluated across linguistic partners;<br />

2. Language and ontology experts will use the Wiki system to build the basic<br />

ontology and WordNet layers needed for the extension to the domain;<br />

3. Domain experts will use the top layer and middle layer of WordNets and<br />

ontologies plus the Wiki system to encode the knowledge in their domains and<br />

reach consensus;<br />

4. The system is tested by integration in a retrieval system.<br />

Cross-linguistic re-use and agreement on the semantic organization is the prime<br />

evaluation of the architecture and the system. Proposals for concepts are verified by<br />

other WordNet builders and need to be agreed across the languages and cultures. The<br />

same is done by domain experts in their domain, except that they do not need to<br />



discuss the technical conceptual issues. Both groups will extensively use the Wiki<br />

environment to reach agreements and consensus.<br />

The application driven evaluation will use a baseline evaluation that uses the<br />

current indexing and retrieval system and the multilingual WordNet database. The<br />

knowledge in KYOTO will lead to more advanced indexes in those cases where Kybots<br />

have been able to detect the relations in the text. These will lead to more precision in<br />

the indexes and also make it possible to detect complex queries for these relations.<br />

The performance of the system will be evaluated with respect to the baseline systems.<br />

This will be done in two ways:<br />

1. using an overall benchmark system that runs a fixed set of queries on the different<br />

indexes and compares the results;<br />

2. using end-user scenarios and interviews carried out on different indexes by test<br />

persons.<br />

The questions and queries are selected to show the capabilities of deep semantic<br />

processing. They will be harvested from current portals in the environmental domain.<br />

Finally, we plan to give public access to the databases (ontologies and WordNets)<br />

and to the retrieval system through the project website. Visitors are invited to try the<br />

system and give feedback.<br />

9 Summary and Outlook<br />

KYOTO will represent a unique platform for knowledge sharing across languages and<br />

cultures, providing a strong content-based standardisation for the future that<br />

enables worldwide communication.<br />

KYOTO will advance the state-of-the-art in semantic processing because it is a<br />

unique collaboration that bridges technologies across semantic web technologies,<br />

WordNet development and acquisition, data and knowledge mining and information<br />

retrieval.<br />

On top of the systems and data described earlier, we will build a Wiki environment<br />

that will allow communities to maintain the knowledge and information, without<br />

expert knowledge of ontologies, knowledge engineering and language technology.<br />

The system can be used by other groups and for other domains. Through simple and<br />

clear interfaces that exploit the generic knowledge and check the underlying<br />

structures, users can reach semantic agreement on the definition and interpretation of<br />

crucial notions in their domain. The agreed knowledge can be taken up by generic<br />

Kybots that can then detect possible relations on the basis of this knowledge in text<br />

that will be indexed and made searchable. All knowledge resources in KYOTO will<br />

be public and open source (GPL). This applies to the ontology and the WordNets<br />

mapped to the ontology. The GPL condition also applies to the data miners in each<br />

language, the DEB servers, the LeXFlow API and the Wiki environments. Any<br />

research group should be able to further develop the system, to integrate their own<br />

language and/or to apply it to any other domain.



Acknowledgement<br />

The work described here is funded by the European Community's 7th Framework Programme.<br />

References<br />

1. Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., Vossen, P.: The<br />

MEANING Multilingual Central Repository. In: Proceedings of the Second International<br />

WordNet Conference-<strong>GWC</strong> 2004. 23–30 January 2004, Brno, Czech Republic. ISBN 80-<br />

210-3302-9 (2004)<br />

2. Atserias, J., Climent, S., Rigau, G.: Towards the MEANING Top Ontology: Sources of<br />

Ontological Meaning. LREC’04. ISBN 2-9517408-1-6. Lisboa (2004)<br />

3. Atserias, J., Climent, S., Moré, J., Rigau, G.: A proposal for a Shallow Ontologization of<br />

WordNet. In: Proceedings of the 21st Annual Meeting of the Sociedad Española para el<br />

Procesamiento del Lenguaje Natural, SEPLN’05. Granada, España. Procesamiento del<br />

Lenguaje Natural 35, 161–167. ISSN: 1135-5948 (2005)<br />

4. Chou, Y-.M., Huang C.R.: Hantology - A Linguistic Resource for Chinese Language<br />

Processing and Studying. In: Proceedings of the Fifth International Conference on Language<br />

Resources and Evaluation (LREC 2006). Genoa, Italy (2006)<br />

5. Chou, Y.M., Hsieh, S.K., Huang, C.R.: Hanzi Grid: Toward a Knowledge Infrastructure for<br />

Chinese Character-Based Cultures. In: Ishida, T., Fussell, S.R., Vossen, P.T.J.M. (eds.)<br />

Intercultural Collaboration I. Lecture Notes in Computer Science. Springer-Verlag (2007)<br />

6. Fellbaum, C.,Vossen, P.: Connecting the Universal to the Specific: Towards the Global Grid.<br />

In: Proceedings of the First International Workshop on Intercultural Communication.<br />

Reprinted in: Ishida, T., Fussell, S. R. and Vossen, P. (eds.) Intercultural Collaboration: First<br />

International Workshop. Lecture Notes in Computer Science 4568, 1–16. Springer, New<br />

York (2007)<br />

7. Magnini, B., Cavaglia, G.: Integrating Subject Field Codes into WordNet. In Gavrilidou, M.,<br />

Carayannis, G., Markantonatu, S., Piperidis, S., Stainhaouer, G. (eds.) Proceedings of<br />

LREC-2000, Second International Conference on Language Resources and Evaluation,<br />

Athens, Greece, 31 May- 2 June 2000, pp. 1413–1418 (2000)<br />

8. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb Deliverable<br />

D18 Ontology Library, IST Project 2001-33052 WonderWeb: Ontology Infrastructure for<br />

the Semantic Web Laboratory For Applied Ontology - ISTC-CNR. Trento (2003)<br />

9. Marchetti, A., Tesconi, M., Ronzano, F., Rosella, M., Bertagna, F., Monachini, M., Soria, C.,<br />

Calzolari, N., Huang, C.R., Hsieh, S.K.: Towards an Architecture for the Global WordNet<br />

Initiative. In: Proceedings of the 3rd Italian Semantic Web Workshop Semantic Web<br />

Applications and Perspectives (SWAP 2006), Pisa, Italy, 18-20 December, 2006 (2006)<br />

10. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In: Welty, C., Smith, B. (eds.)<br />

Proceedings of the 2nd International Conference on Formal Ontology in Information<br />

Systems (FOIS-2001), Ogunquit, Maine, October 17-19, 2001 (2001)<br />

11. Pease, A.: The Sigma Ontology Development Environment. In: Working Notes of the<br />

IJCAI-2003 Workshop on Ontology and Distributed Systems. Proceedings of CEUR 71<br />

(2003)<br />

12. Rigau G., Magnini B., Agirre E., Vossen. P., Carroll, J.: MEANING: A Roadmap to<br />

Knowledge Technologies. Proceedings of COLING Workshop. A Roadmap for<br />

Computational Linguistics. Taipei, Taiwan (2002)<br />

13. Tesconi, M., Marchetti, A., Bertagna, F., Monachini, M., Soria, C., Calzolari, N.: LeXFlow:<br />

a framework for cross-fertilization of computational lexicons. In: Proceedings of



COLING/ACL 2006 Interactive Presentation Session, 17-21 July 2006 Sydney, Australia<br />

(2006)<br />

14. Tesconi, M., Marchetti, A., Bertagna, F., Monachini, M., Soria, C., Calzolari, N.: LeXFlow:<br />

a Prototype Supporting Collaborative Lexicon Development and Cross-fertilization. In:<br />

Intercultural Collaboration, First International Workshop, IWIC 2007, Demo and Poster<br />

session, Kyoto, Japan (2007)<br />

15. Soria, C., Tesconi, M., Bertagna, F., Calzolari, N., Marchetti, A., Monachini, M.: Moving<br />

to dynamic computational lexicons with LeXFlow. In: Proceedings of LREC 2006, 22-28 May<br />

2006, Genoa, Italy (2006)<br />

16. Soria, C., Tesconi, M., Marchetti, A., Bertagna, F., Monachini, M., Huang, C.R., Calzolari,<br />

N.: Towards agent-based cross-lingual interoperability of distributed lexical resources. In:<br />

Proceedings of COLING-ACL Workshop on Multilingual Lexical Resources and<br />

Interoperability, 22-23 July 2006, Sydney, Australia (2006)<br />

Relevant URLs<br />

XML and Databases, Ronald Bourret: http://www.rpbourret.com/index.htm<br />

Wiki engines: http://c2.com/cgi/wiki?WikiEngines<br />

Global WordNet Association: http://www.globalwordnet.org<br />

Princeton WordNet: http://wordnet@cogsci.princeton.edu<br />

Chinese WordNet: http://bow.sinica.edu.tw<br />

Italian WordNet: http://www.ilc.cnr.it/iwndb_php/<br />

LeXFlow: http://xmlgroup.iit.cnr.it:8888/xflow/login<br />

Multilingual WordNet Service: http://xmlgroup.iit.cnr.it:88/exist/wordnet/wordnet/<br />

MCR: http://adimen.si.ehu.es/cgi-bin/wei5/public/wei.consult.perl


The Cornetto Database: Architecture and Alignment<br />

Issues of Combining Lexical Units,<br />

Synsets and an Ontology<br />

Piek Vossen 1,2, Isa Maks 1, Roxane Segers 1,<br />

Hennie van der Vliet 1, and Hetty van Zutphen 2<br />

1 Faculteit der Letteren, Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV<br />

Amsterdam, The Netherlands<br />

{p.vossen, e.maks, r.segers, hd.vandervliet}@let.vu.nl<br />

2 Irion Technologies, Delftechpark 26, 2628 XH, Delft, The Netherlands<br />

hetty.van.zuthphen@irion.nl<br />

Abstract. Cornetto is a two-year Stevin project (project number STE05039) in<br />

which a lexical semantic database is built that combines WordNet with<br />

Framenet-like information for Dutch. The combination of the two lexical<br />

resources (the Dutch WordNet and the Referentie Bestand Nederlands) will<br />

result in a much richer relational database that may improve natural language<br />

processing (NLP) technologies, such as word sense-disambiguation, and<br />

language-generation systems. In addition to merging the Dutch lexicons, the<br />

database is also mapped to a formal ontology to provide a more solid semantic<br />

backbone. Since the database represents different traditions and perspectives of<br />

semantic organization, a key issue in the project is the alignment of concepts<br />

across the resources. This paper discusses our methodology to first<br />

automatically align the word meanings and secondly to manually revise the<br />

most critical cases.<br />

Keywords: WordNet, synsets, lexical units, frames, ontologies, automatic<br />

alignment<br />

1 Introduction<br />

Cornetto is a two-year Stevin project (project number STE05039) in which a lexical<br />

semantic database is built that combines WordNet with Framenet-like information for<br />

Dutch. In addition, the database is also mapped to a formal ontology to provide a<br />

more solid semantic backbone. The combination of the lexical resources will result in<br />

a much richer relational database that may improve natural language processing<br />

(NLP) technologies, such as word sense-disambiguation, and language-generation<br />

systems. The database will be filled with data from the Dutch WordNet [18] and the<br />

Referentie Bestand Nederlands [10]. The Dutch WordNet (DWN) is similar to the<br />

Princeton WordNet for English, and the Referentie Bestand (RBN) includes frame-like<br />

information as in FrameNet plus other information on the combinatoric behaviour<br />

of word meanings. RBN has corpus-based examples and rich morpho-syntactic



structures with complementation information. It furthermore contains many multi-word<br />

expressions, ranging from free through partly fixed to frozen expressions.<br />

An important aspect of combining the resources is the alignment of the semantic<br />

structures. In the case of RBN these are lexical units (LUs) and in the case of DWN<br />

these are synsets. Various heuristics have been developed to do an automatic<br />

alignment. Following automatic alignment of RBN and DWN, this initial version of<br />

the Cornetto database will be further extended both automatically and manually. The<br />

resulting data structure is stored in a database that keeps separate collections for<br />

lexical units (mainly derived from RBN), for the synsets (derived from DWN) and for<br />

a formal ontology: SUMO/MILO plus extensions [15]. These 3 semantic resources<br />

represent different viewpoints and layers of linguistic, conceptual information. The<br />

alignment of the viewpoints is stored in a separate mapping table. The database is<br />

itself set up so that the formal semantic definition of meaning can be tightened for<br />

lexical units and synsets by exploiting the semantic framework of the ontology. At the<br />

same time, we want to maintain the flexibility to have a wide coverage for a complete<br />

lexicon and to encode additional linguistic information. The resulting resource will be<br />

made freely available for research in the form of an XML database.<br />
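As an illustration of what such an alignment heuristic might look like (this is not one of the project's actual heuristics), one could score candidate DWN synsets by word overlap between the RBN lexical unit's definition and the synset's member words plus gloss.

```python
# Toy alignment heuristic (illustrative only): align an RBN lexical unit
# to the DWN synset whose members and gloss best overlap its definition.
def overlap(a, b):
    """Count content words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def align(lu_definition, synsets):
    """Pick the best-matching synset for a lexical unit."""
    def score(synset):
        return overlap(lu_definition,
                       " ".join(synset["words"]) + " " + synset["gloss"])
    return max(synsets, key=score)

# Hypothetical candidates for Dutch "bank" (institution vs. river bank).
synsets = [
    {"id": "d_n-1", "words": ["bank"], "gloss": "financiële instelling"},
    {"id": "d_n-2", "words": ["bank", "oever"], "gloss": "rand van een rivier"},
]
best = align("instelling voor financiële diensten", synsets)
```

Real heuristics would combine several such signals (translations, hypernyms, domain labels), which is why the paper reserves manual revision for the most critical cases.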

Combining two lexical semantic databases with different organizational principles<br />

offers the possibility to study the relations between these perspectives on a large<br />

scale. However, it also makes it more difficult to align the two databases and to come<br />

to a unified view on the lexical semantic organization and the sense distinctions of the<br />

Dutch vocabulary. In this paper, we discuss the alignment issues. In section 2, we first<br />

give an overview of the structure of the database. Section 3 describes the approach<br />

and results of the automatic alignment. Section 4 discusses the manual work of<br />

checking and improving the automatic process. This work mainly involves comparing<br />

the LUs from RBN with the synset structure of DWN. Finally, in section 5, we<br />

discuss the relation between synsets and the ontology.<br />

2 Architecture of the Database<br />

The Cornetto database (CDB) consists of 3 main data collections:<br />

- Collection of Lexical Units, mainly derived from the RBN<br />

- Collection of Synsets, mainly derived from DWN<br />

- Collection of Terms and axioms, mainly derived from SUMO and MILO<br />

Both DWN and RBN are semantically based lexical resources. RBN uses a<br />

traditional structure of form-meaning pairs, so-called Lexical Units [3]. Lexical Units<br />

are word senses in the lexical semantic tradition. They contain all the necessary<br />

linguistic knowledge that is needed to properly use the word in a language. Word<br />

meanings that are synonyms are separate structures (records) in RBN. They have their<br />

own specification of information, including morpho-syntax and semantics. DWN is<br />

organized around the notion of Synsets. Synsets are concepts as defined by Miller and<br />

Fellbaum [4, 12, 13] in a relational model of meaning. They are mainly conceptual<br />

units strictly related to the lexicalization pattern of a language. Concepts are defined



by lexical semantic relations. 1 Typically in WordNet, information is provided for the<br />

synset as a whole and not for the individual word meanings. For example, in WordNet<br />

the synset has a single gloss but the different lexical units in RBN each have their<br />

own definition. From a WordNet point of view, the definitions of lexical units that<br />

belong to the same synset should thus semantically be compatible or synonymous.<br />

Outside the lexicon, an ontology will provide a third layer of meaning. The Terms<br />

in an ontology represent the distinct types in a formal representation of knowledge.<br />

Terms can be combined in a knowledge representation language to form expressions<br />

of axioms. In principle, meaning is defined in the ontology independently of language<br />

but according to the principles of logic. In Cornetto, the ontology represents an<br />

independent anchoring of the relational meaning in WordNet. The ontology is a<br />

formal framework that can be used to constrain and validate the implicit semantic<br />

statements of the lexical semantic structures, both the lexical units and the synsets. In<br />

addition, the ontology provides a mapping of a vocabulary to a formal representation<br />

that can be used to develop semantic web applications.<br />

In addition to the 3 data collections, a separate table of so-called Cornetto<br />

Identifiers (CIDs) is provided. These identifiers contain the relations between the<br />

lexical units and the synsets in the CDB, but also links to the original word senses and<br />

synsets in the RBN and DWN. In Figure 1, a single CID record is shown that contains<br />

the following fields:<br />

C_form = form of the word in Cornetto<br />

C_seq = the sequence of sense number in Cornetto<br />

C_lu_id = the identifier of the lexical unit in Cornetto<br />

C_syn_id = the identifier of the synset in Cornetto<br />

R_lu_id = the identifier of the lexical unit in RBN from which it was derived<br />

R_seq_nr = the original sequence number or sense number in RBN<br />

D_lu_id = the identifier of the synonym in DWN<br />

D_syn_id = the identifier of the synset in DWN from which it was derived<br />

D_seq_nr = the original sequence number or sense number in DWN<br />
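The CID record described above can be sketched as a simple data structure. This is a minimal illustration: the Python class and field types are assumptions, while the values are those of the band example in Figure 1.

```python
from dataclasses import dataclass

# A sketch of one Cornetto Identifier (CID) record using the fields
# listed above. The Python class is illustrative; the actual database
# stores these as table records, not Python objects.
@dataclass
class CID:
    c_form: str    # form of the word in Cornetto
    c_seq: int     # sense number in Cornetto
    c_lu_id: int   # lexical unit identifier in Cornetto
    c_syn_id: int  # synset identifier in Cornetto
    r_lu_id: int   # lexical unit in RBN from which it was derived
    r_seq_nr: int  # original sense number in RBN
    d_lu_id: int   # synonym in DWN from which it was derived
    d_syn_id: int  # synset in DWN from which it was derived
    d_seq_nr: int  # original sense number in DWN

# The record for band#1 (music group) as shown in Figure 1:
band = CID("band", 1, 5345, 9884, 4234, 1, 7366, 2456, 3)
```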

Figure 1 shows an overview of the different data structures and their relations. The<br />

different data can be divided into 3 layers of resources, from top to bottom:<br />

▪ The RBN and DWN (at the top): the original database from which the data are<br />

derived;<br />

▪ The Cornetto database (CDB): the ultimate database that will be built;<br />

▪ External resources: any other resource to which the CDB will be linked, such as<br />

the Princeton WordNet, WordNets through the Global WordNet Association,<br />

WordNet domains, ontologies, corpora, etc.<br />

The center of the CDB is formed by the table of CIDs. The CIDs tie together the<br />

separate collections of LUs and Synsets but also represent the pointers to the word<br />

meaning and synsets in the original databases: RBN and DWN and their mapping<br />

relation. As you can see in this example, the identifiers of the record match the<br />

1 For Cornetto, the semantic relations from EuroWordNet are taken as a starting point [18].<br />


488 Piek Vossen et al.<br />

original identifiers of synsets and lexical units in the original databases. The CIDs are<br />

just administrative records. The Cornetto data itself are stored in the collection of LUs<br />

and the collection of Synsets.<br />

[Figure: three layers of resources. At the top, the original databases: the Referentie Bestand Nederlands (RBN) and the Dutch Wordnet (DWN). In the middle, the Cornetto Database (CDB), in which the table of Cornetto Identifiers (CIDs) ties together the Collection of Lexical Units, the Collection of Synsets and the Collection of Terms & Axioms; the CID record for band#1 (muziekgezelschap, Term MusicGroup) illustrates the linking. At the bottom, external resources: SUMO, MILO, the Princeton Wordnet, the Czech, German, Korean, Spanish, French and Arabic Wordnets, and Wordnet Domains.]<br />

Fig. 1. Data collections in the Cornetto Database.<br />

The LUs will contain semantic frame representations. The frame elements may<br />

have co-indexes with Synsets from the WordNet and/or with Terms from the<br />

ontology. This means that any semantic constraints in the frame representation can<br />

directly be related to the semantics in the other collections. Any explicit semantic<br />

relation that is expressed through a frame structure in the LU can also be represented<br />

as a conceptual semantic relation between Synsets in the WordNet database. The<br />

Synsets in the WordNet are represented as a collection of synonyms, where each<br />

synonym is directly related to a specific LU. The conceptual relations between<br />

Synsets are backed-up by a mapping to the ontology. This can be in the form of an<br />

equivalence relation or a subsumption relation to a Term or an expression in a<br />

knowledge representation language. Finally, a separate equivalence relation is<br />

provided to one or more synsets in the Princeton WordNet.<br />

The Cornetto database provides unique opportunities for innovative NLP<br />

applications. The LUs contain combinatoric information and the synsets place these<br />

words within a semantic network. Figure 2 shows an example of this combination for<br />

several meanings of the word band: with meanings as musical band, as a tube or tire<br />

filled with air, a magnetic band, and a relationship. The semantic network position of



the word is depicted in separate WordNet fragments, relating the meanings to<br />

hypernyms, hyponyms and other related concepts. Above each fragment, we list the<br />

framelike combinatoric information that is given in RBN for these different meanings.<br />

A musical band is started and performs; a tube or tire is inflated, can leak, can blow, or<br />

can be fixed, etc. Each of these examples not only illustrates a typical conceptual<br />

usage or interaction but also the particular wording of it in Dutch. From these<br />

combinations, Dutch speakers immediately know what meaning of the word band<br />

applies. These typical examples can be used for the disambiguation of occurrences in<br />

text. Moreover, the same contexts can also be used for other words related to these<br />

meanings. We can easily extend the examples of band as a tire/tube to the hyponyms<br />

fietsband (bike tire) and autoband (car tire) and the examples of band as a<br />

relationship to the hypernyms verhouding (affair) and relatie (relation).<br />
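The extension of combinatoric examples to hyponyms described above could be implemented along these lines. This is a toy sketch: the data structures and the simple string substitution are assumptions, not the Cornetto implementation.

```python
# Combinatoric examples attached to a meaning are reused for its
# hyponyms by substituting the hyponym for the headword.
HYPONYMS = {"band#2": ["fietsband", "autoband"]}
EXAMPLES = {"band#2": ["een lekke band", "de band oppompen"]}

def extend_examples(sense):
    extended = {}
    for hypo in HYPONYMS.get(sense, []):
        extended[hypo] = [ex.replace("band", hypo)
                          for ex in EXAMPLES[sense]]
    return extended

ext = extend_examples("band#2")
# yields e.g. "een lekke fietsband" and "de autoband oppompen"
```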

[Figure: four Wordnet fragments for the meanings of band, each showing hypernyms, hyponyms and other related concepts, with the frame-like RBN combinatoric information listed above each fragment. band#1 (band): hypernym muziekgezelschap (music group), hyponyms jazzband (jazz band) and popgroep (pop group); combinatorics: in een band spelen (to play in a band), een band oprichten (to start a band), de band speelt (the band plays). band#2 (tire): hypernyms voorwerp (object) and ring (ring), hyponyms fietsband (bike tire), autoband (car tire), binnenband (inner tire), buitenband (outer tire), zwemband (tire for swimming); combinatorics: de band oppompen (to pump air in a tire), een band plakken (to fix a hole in a tire), een lekke band (a flat tire), de band springt (the tire explodes). band#3/geluidsband (audio tape): hypernyms informatiedrager (data carrier) and geluidsdrager (audio carrier), hyponym cassettebandje (audio cassette); combinatorics: de band starten (to start a tape), op de band opnemen (to record on a tape), de band afspelen (to play from a tape). band#5 (bond): hypernyms toestand (state), relatie (relation) and verhouding (relation), hyponyms familieband (family bond), bloedband (blood bond), moederband (mother bond); combinatorics: een goede/sterke band (a good, strong bond), de banden verbreken (to break all bonds), een band hebben met iemand (to have a bond with s.o.).]<br />

Fig. 2. Combinatorics and semantics combined.<br />

Another example, where combinatorics and semantic network relations are<br />

combined, relates to drinks. In Dutch, the preparation of drinks is usually referred to<br />

by the general verb maken (to prepare). However, in the case of koffie (coffee) and<br />

thee (tea), another specific verb is used: zetten. So, you typically use the phrases<br />

koffie zetten and thee zetten (to make coffee or tea) but you use the standard phrase<br />

limonade maken (to make lemonade) in Dutch. This example illustrates that<br />

conceptual combinations and constraints that are encoded in the WordNet or the<br />

ontology do not explain the proper and most intuitive way of phrasing relations. The<br />

benefits of combining resources in this way are, however, only possible if the word<br />

meanings representing concepts are properly aligned in the database. This is<br />

discussed in the next sections.



3 Automatically Aligning RBN with DWN<br />

To create the initial database, the word meanings in the Referentie Bestand<br />

Nederlands (RBN) and the Dutch part of EuroWordNet (DWN) have been<br />

automatically aligned. The word koffie for example has 2 word meanings in RBN<br />

(drink and beans) and 4 word meanings in DWN (drink, bush, powder and beans).<br />

This can result in 4, 5, or 6 distinct meanings in the Cornetto database depending on<br />

the degree of matching across these meanings. This alignment is different from<br />

aligning WordNet synsets because RBN is not structured in synsets. For measuring<br />

the match, we used all the semantic information that was available. Since DWN<br />

originates from the Van Dale database VLIS, we could use the definitions and domain<br />

labels from that database. The domain labels from RBN and VLIS have been aligned<br />

separately by first cleaning up the labels manually (e.g., pol and politiek can be<br />

merged) and then measuring the overlap in vocabulary associated with each domain.<br />

The overlap was expressed using a correlation figure for each domain in the matrix<br />

with each other domain. Domain labels across DWN and RBN do not require an exact<br />

match. Instead, the scores of the correlation matrix can be used for associating them.<br />
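The domain-label matching step can be sketched as follows. The overlap measure (here a Jaccard ratio over the vocabulary associated with each label) and the toy data are assumptions, since the exact correlation figure used is not specified above.

```python
# Sketch of the domain-label correlation matrix: every pair of labels
# across the two resources is scored by the overlap of the vocabulary
# associated with each label (a Jaccard ratio here; the actual
# correlation measure used in Cornetto may differ).
def domain_matrix(rbn_domains, vlis_domains):
    matrix = {}
    for r_label, r_words in rbn_domains.items():
        for v_label, v_words in vlis_domains.items():
            union = r_words | v_words
            matrix[(r_label, v_label)] = (
                len(r_words & v_words) / len(union) if union else 0.0
            )
    return matrix

# Toy example: "politiek" (RBN) and "pol" (VLIS) share vocabulary and
# therefore correlate, even though the labels do not match exactly.
rbn = {"politiek": {"minister", "kabinet", "verkiezing"}}
vlis = {"pol": {"minister", "kabinet", "senaat"},
        "muziek": {"band", "jazz"}}
m = domain_matrix(rbn, vlis)
```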

Overlap of definitions was based on the overlapping normalized content words<br />

relative to the total number of content words. For other features, such as part-of-speech,<br />

we manually defined the relations across the resources.<br />
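The definition-overlap measure can be sketched like this. The stop-word list and the normalization are placeholders for whatever preprocessing was actually used, and the sample definitions of koffie are invented for illustration.

```python
# Definition overlap as described above: shared normalized content
# words relative to the total number of content words. The stop-word
# set stands in for non-content words; the real normalization is richer.
STOP = {"de", "het", "een", "van", "en", "of", "die", "dat", "te", "in"}

def content_words(definition):
    return {w.strip(".,;") for w in definition.lower().split()} - STOP

def definition_overlap(def_a, def_b):
    a, b = content_words(def_a), content_words(def_b)
    total = len(a | b)
    return len(a & b) / total if total else 0.0

# Two hypothetical definitions of koffie (drink sense):
score = definition_overlap(
    "drank van gebrande koffiebonen",
    "warme drank gezet van gemalen koffiebonen")
```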

We only consider a possible match between words with the same orthographic<br />

form and the same part-of-speech. The strategies used to determine which word<br />

meanings can be aligned are:<br />

1. The word has one meaning and no synonyms in both RBN and DWN<br />

2. The word has one meaning in both RBN and DWN<br />

3. The word has one meaning in RBN and more than one meaning in DWN<br />

4. The word has one meaning in DWN and more in RBN<br />

5. If the broader term (BT) of a set of words is linked, all words which are under that<br />

BT in the semantic hierarchy and which have the same form are linked<br />

6. If some narrower term (NT) in the semantic hierarchy is related, siblings of that NT<br />

that have the same form are also linked.<br />

7. Word meanings that have a linked domain are linked<br />

8. Word meanings with definitions in which one in every three content words is the<br />

same (there must be more than one match) are linked.<br />

Each of these heuristics will result in a score for all possible mappings between<br />

word meanings. In the case of koffie, we thus will have 8 possible matches. The<br />

number of links found per strategy is shown in Table 1. To weigh the heuristics, we<br />

manually evaluated each heuristic. Of the results of each strategy, a sample was<br />

made of 100 records. Each sample was checked by 8 persons (6 staff and 2 students).<br />

For each record, the word form, part-of-speech and the definition was shown for both<br />

RBN and DWN (taken from VLIS). The testers had to determine whether the<br />

definitions described the same meaning of the word or not. The results of the tests<br />

were averaged, resulting in a percentage of items which were considered good links.<br />

The averages per strategy are shown in Table 1.



Table 1. Results for aligning strategies<br />

Strategy                                  Conf.  Dev.  Factor  Links    %<br />

1: 1 RBN & 1 DWN meaning, no synonyms      97.1   4.9    3      9936   8.1%<br />

2: 1 RBN & 1 DWN meaning                   88.5   8.6    3     25366  20.8%<br />

3: 1 RBN & >1 DWN meaning                  53.9   8.1    1     22892  18.7%<br />

4: >1 RBN & 1 DWN meaning                  68.2  17.2    1      1357   1.1%<br />

5: overlapping hyperonym word              85.3  23.3    2      7305   6.0%<br />

6: overlapping hyponyms                    74.6  22.1    2     21691  17.7%<br />

7: overlapping domain-clusters             70.2  15.5    2     11008   9.0%<br />

8: overlapping definition words            91.6   7.8    3     22664  18.5%<br />

The minimal precision is 53.9 and the highest precision is 97.1. Fortunately, the<br />

low precision heuristics also have a low recall. On the basis of these results, the<br />

strategies were ranked: some were considered very good, some were considered<br />

average, and some were considered relatively poor. The ranking factors per strategy<br />

are:<br />

• Strategies 1, 2 and 8 get factor 3<br />

• Strategies 5, 6 and 7 get factor 2<br />

• Strategies 3 and 4 get factor 1<br />

A factor 3 means that it counts 3 times as strongly as factor 1. It is thus considered<br />

to be a better indication of a link than factor 2 and factor 1, where factor 1 is the<br />

weakest score. The ranking factor is used to determine the score of a link. The score<br />

of the link is determined by the number of strategies that apply and the ranking factor<br />

of the strategies. In total, 136K linking records are stored in the Cornetto database.<br />

Within the database, only the highest scoring links are used to connect WordNet<br />

meanings to synsets. There are 58K top-scoring links, representing 41K word<br />

meanings. In total 47K different RBN word meanings were linked, and 48K different<br />

VLIS/DWN word meanings. 19K word meanings from RBN were not linked, as well<br />

as 59K word meanings from VLIS/DWN. Note that we considered here the complete<br />

VLIS database instead of DWN. The original DWN database represented about 60%<br />

of the total VLIS database. VLIS synsets that are not part of DWN can still be useful<br />

for RBN, as long as they ultimately get connected to the synset hierarchy of DWN.
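The scoring scheme of this section can be sketched as follows. The ranking factors are those given above; the per-pair bookkeeping, the sense identifiers, and the chosen strategy numbers in the example are illustrative assumptions.

```python
# Each candidate RBN-DWN sense pair collects the strategies (1-8) that
# apply; its score is the sum of the ranking factors of those
# strategies, and only the top-scoring links are kept in the database.
FACTOR = {1: 3, 2: 3, 8: 3,   # very good strategies
          5: 2, 6: 2, 7: 2,   # average strategies
          3: 1, 4: 1}         # relatively poor strategies

def link_score(strategies):
    return sum(FACTOR[s] for s in strategies)

def best_links(candidates):
    # candidates: {(rbn_sense, dwn_sense): [applicable strategy numbers]}
    scored = {pair: link_score(s) for pair, s in candidates.items()}
    top = max(scored.values())
    return [pair for pair, score in scored.items() if score == top]

# Hypothetical candidates for the drink sense of koffie:
links = best_links({("koffie:1", "koffie:1"): [7, 8],
                    ("koffie:1", "koffie:3"): [3]})
```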



4 Manually Aligning RBN with DWN<br />

The next alignment step is a manual process that consists of editing low-scoring<br />

and non-existing links between lexical units and synsets. We identified four major<br />

groups of problematic cases and defined editing guidelines for them, which will be<br />

presented in the following sections. Many of the low-scoring links turned out to be,<br />

not unexpectedly, links between lexical units and synsets of very frequent and highly<br />

polysemous words (sections 4.1 and 4.2). Many of the non-links, i.e. links between a<br />

synset and an automatically created and therefore empty lexical unit or vice versa,<br />

turned out to be between adjective synsets and lexical units (section 4.3). The fourth<br />

group, the multiword expressions, is different from the others, since for these<br />

automatic alignment could only be performed for few cases (section 4.4).<br />

4.1 Frequent polysemous verbs and nouns<br />

The low-scoring links within the group of verb synsets and lexical units and within<br />

the group of noun synsets and lexical units are largely due to differences in the<br />

underlying principles of meaning discrimination, which play an important role in the<br />

alignment of synsets and lexical units. We selected the 1000<br />

most frequent verbs in Dutch as a set to manually verify. For nouns, we defined a<br />

similar set of 1800 words that are most polysemous (4 or more word meanings). The<br />

matching of nouns is relatively straightforward and the manual process consists<br />

mainly of correcting the choices or cases where different meanings are given in the<br />

two resources. In the latter case, we either create a new synset or add the word to an<br />

existing synset as a synonym or we provide the information in the lexical unit that is<br />

lacking. Mappings for verbs are more complicated as will be explained below.<br />

Characteristic for the verbal LUs is that they contain detailed information on<br />

verbal complementation, event structure and combinatoric properties. For the verb<br />

behandelen (to treat), the complementation patterns are:<br />

▪ np: iemand behandelen (to treat someone)<br />

▪ np, pp: iemand aan/voor/tegen/met iets behandelen (to treat someone for/with/… something)<br />

In the representation of complementation patterns, all possible patterns are<br />

encoded. This may lead to a lot of patterns, but the result is a very explicit description<br />

of the syntactic behavior of the LU. As a rule, each pattern is worked out as an<br />

example in the combinatoric information. The corresponding event structure of<br />

behandelen contains the information that:<br />

▪ this meaning of behandelen is an action verb.<br />

▪ the subject np is the agent<br />

▪ the object-np is the patient<br />

▪ an optional pp-complement with met (with) is the instrument<br />

▪ an optional pp-complement with aan/voor/tegen (for/with/against) is the theme



In the Dutch WordNet, these complements and roles are reflected in semantic<br />

relations:<br />

▪ [causes] [v] genezen:2, beteren:1, herstellen:1 (to recover)<br />

▪ [involved_agent] [n] arts:1; dokter:1 (doctor)<br />

▪ [involved_patient] [n] zieke:1; patiënt:1 (patient)<br />

▪ [involved_instrument] [n] hart-longmachine:1 (heart-lung machine)<br />

▪ [involved_instrument] [n] mitella:1, draagdoek:1 (sling)<br />

▪ [involved_instrument] [n] geneesmiddel:1; medicijn:1 (medicine)<br />

etc.<br />

As long as there is a one-to-one mapping from LUs and synsets, the features of the<br />

two resources will probably match. However, difficulties arise when the mapping is<br />

not one-to-one. Frequent verbs are often very polysemous. The RBN, as the source of<br />

the LUs, tries to deal with polysemy in a systematic and efficient way. The synsets are<br />

however much more detailed on different readings. As a result, in many cases there<br />

are more synsets than LUs. In combination with the detailed information on<br />

complementation, event structure and lexical relation, this results in interesting (and<br />

time consuming!) editing problems.<br />

A typical example of an economically created LU in combination with a detailed<br />

synset is aflopen (to come to an end, to go off (an alarm bell), to flow down, to run<br />

down, to slope down, etc.). Input to the alignment were seven LUs and 13 synsets.<br />

Much of the asymmetry was caused by the fact that one of the LUs represents one<br />

basic and comprehensive meaning: to walk to, to walk from, to walk alongside<br />

something or someone. In DWN these are all different meanings, with different<br />

synsets. This is the result of describing lexical meaning by synsets; these three<br />

readings of aflopen obviously have a lot in common, but they match with different<br />

synonyms. Aligning the LUs and synsets leads to splitting the LUs and may lead to<br />

subtle changes in the complementation patterns, event structure and certainly to<br />

adapting and extending the combinatoric information. Sometimes the LUs are more<br />

detailed. In that case a synset must be split, which of course gives rise to changes in<br />

all related synsets and to new sets of lexical relations.<br />

In everyday editing of frequent verbs it is often a problem to find out the exact<br />

meaning of a verb in a synset. This is certainly the case for isolated meanings without<br />

synonyms, forming a synset on their own, but also for frequent verbs with other<br />

frequent verb meanings in the synset. It does not help to know that afspelen (to take<br />

place) is in a synset with passeren, spelen and geschieden (to happen, take place,<br />

occur), all being ambiguous in the same way. These puzzles can often be solved by<br />

keeping a close watch on the lexical relations; especially instrument-relations are<br />

often of great help in disambiguating. However, it will be clear that alignment in the<br />

case of frequent verbs is hardly ever a matter of just confirming a suggestion for a<br />

mapping.



4.2 Nouns and semantic shifts<br />

As is mentioned above, there are some differences in the lexicographical approach<br />

between the DWN and RBN resource for Cornetto. One important aspect is the<br />

economical distribution of LUs in the RBN, compared to the more extensive<br />

distribution of synsets. With regard to the nouns, this dissimilarity is mainly caused<br />

by the use of semantic shifts in the RBN.<br />

A semantic shift can be defined as an aspect of a meaning that is closely<br />

connected to the central meaning. A shift can thus be seen as an extension of a<br />

meaning. Like in the RBN, the extension is not explicitly given but indicated, whereas<br />

DWN follows another approach to explicitly list these meanings. The RBN uses the<br />

semantic shift for groups of words that show the same semantic behavior. In the case<br />

of artikel (article) we find a LU with a shift that predicts that besides ‘text’, an artikel<br />

can also be an Artifact. This shift from Non-Dynamic to Artifact is also consistently<br />

found in LUs like reprint and script. There are about 30 different defined<br />

types of shifts that can occur in verbs, adjectives and nouns, like Process → Action in<br />

verbs and Dynamic → Non-dynamic in nouns. Due to the difference in approach, we<br />

expect that the matching of LUs from RBN to synonyms in DWN is more likely to be<br />

incorrect for all words labeled with a shift in RBN. We therefore decided to manually<br />

verify all the mappings for shifts. The vast majority of 4500 LUs with a semantic shift<br />

is found in nouns, on which we have decided to concentrate the manual work.<br />

Because of the difference in approach, the DWN resource will have an extra<br />

synset for the meaning that is implied with a shift in the LU. If not, the presence of a<br />

shift might be a reason to create a new synset. This makes editing the LUs with a<br />

semantic shift a successful strategy to improve and extend the Cornetto database.<br />

Editing an LU with a shift, however, does not only mean splitting it and aligning it<br />

with the corresponding synset. The two resources sometimes show subtle differences in<br />

their description of a meaning, or a meaning happens to be missing in one of the<br />

resources. This means that if we want to edit the shift cases properly, we need to edit<br />

entries that contain an LU with a shift, and not just only the shift cases. This approach<br />

means that we aim at editing about 15,000 LUs and synsets, since most of the entries<br />

with a semantic shift are polysemous or will be so after editing. For these and some<br />

other edit related issues and decisions, we keep an edit log that will result in a final<br />

editing guideline.<br />

All of this can be demonstrated by the word bekendmaking (announcement) that<br />

has one LU with a shift in RBN from Dynamic to Non-dynamic. This means that (in<br />

Dutch) an announcement can be a process and the result of this process. In DWN, we<br />

find a synset for each of these aspects, stating that the first one is a subclass of the<br />

SUMO term ‘Communicating’, and the second one is equivalent to ‘Statement’. We<br />

can see this as a good argument to split the LU and define the difference in terms of<br />

the definition and the semantic relations. In almost all of the dynamic and non-dynamic<br />

cases, we use the following scheme to specify the relations and differences<br />

between both synsets and LUs (fig. 3 and 4):



Dynamic X<br />

  LU resume: The X-ing<br />

  LU combinatorics/example: (…)<br />

  Synset semantic relation 1: HAS_HYPERONYM ‘Y’<br />

  Synset semantic relation 2: XPOS_NEAR_SYNONYM ‘X-ing’<br />

Non-dynamic X<br />

  LU resume: (…)<br />

  LU combinatorics/example: (…)<br />

  Synset semantic relation 1: HAS_HYPERONYM ‘X’<br />

  Synset semantic relation 2: ROLE, CAUSE, ROLE_RESULT, etc.<br />

Fig. 3. Schemes for editing nouns with a dynamic/non-dynamic shift.<br />

In the case of ‘announcement’, this scheme can be filled for Dutch like this (fig. 4):<br />

Dynamic X: announcement<br />

  LU resume: ‘the announcing’<br />

  LU combinatorics/example: -<br />

  HAS_HYPERONYM: statement (dynamic in Dutch)<br />

  XPOS_NEAR_SYN: announcing<br />

Non-dynamic X: announcement<br />

  LU resume: ‘something that has been announced’<br />

  LU combinatorics/example: -<br />

  HAS_HYPERONYM: message<br />

  ROLE_RESULT: announcing<br />

Fig. 4. An editing example for a noun with a dynamic/non-dynamic shift.<br />

The main advantage of editing shifts is the expansion and enrichment of the<br />

database. By creating a new LU for a synset we can add essential combinatory<br />

information and example sentences. When we add a new synset for a LU, we create<br />

new semantic relations, thus enriching the existing semantic structure of DWN. By<br />

editing clusters of the same shift type as e.g. dynamic → non-dynamic, we can ensure<br />

consistency at the same time. Note that the label shift will be kept in both LUs: in the<br />

original LU from the RBN and in the new LU which is the explicit meaning of the<br />

shift. In this way, we can always reconstruct the original RBN approach to store a<br />

single condensed meaning, or use the fact that there is a metonymic relation between<br />

these LUs. Furthermore, we express that there is a tight relation between these<br />

synsets.



4.3 Adjectives and fuzzy synsets<br />

A considerable part of the adjectives is not successfully aligned by the automatic<br />

alignment procedures. This is especially due to the fact that adjective synsets have<br />

few semantic relations, typically lacking hypernyms and hyponyms. Consequently, the<br />

automatic alignment strategies which involve broader and narrower terms are not<br />

applicable in these cases.<br />

Another problematic aspect of the adjective synsets is the fact that the<br />

automatically formed DWN adjective synsets are not – unlike the noun and verb<br />

synsets – edited and corrected manually. As a result, DWN adjective synsets have the<br />

following two characteristics:<br />

▪ They are rather large and fuzzy, often including words which are semantically<br />

related but not really synonymous, e.g. synset A: [dol, gek, dwaas, gaga (mad,<br />

crazy, foolish), achterlijk, gestoord (retarded, disturbed)].<br />

The synset needs to be split up into at least two new synsets: A1 [dol, gek,<br />

dwaas, gaga] ‘behaving irrationally’ and A2 [gestoord, achterlijk] ‘affected with<br />

insanity’.<br />

▪ They are often quite similar to each other, e.g. synset B: [dol, dwaas, maf (mad,<br />

crazy, foolish), idioot (idiotic), krankzinnig (mad, insane), ...].<br />

Although synset A includes other synonyms than synset B, the two are quite<br />

similar with respect to their meanings. They need to be partly merged into a new<br />

synset C [dol, dwaas, maf, gaga] ‘behaving irrationally’, as is illustrated below (example<br />

1).<br />

Of course, RBN’s lexical units, with their numerous corpus-based examples, can be<br />

helpful in solving these problems. However, as already mentioned, the<br />

systematic and efficient way of word sense discrimination is often not consistent with<br />

the WordNet approach. For example, the following lexical unit kort (short) shows that<br />

the RBN does not always take into consideration possible synonym or hypernym<br />

relations.<br />

Ex. 1. RBN kort (short).<br />

LU: kort (short)<br />

  Resume: of time and length<br />

  Syntax: attr/pred<br />

  Combinatorics: (1) een korte dag (a short day), (2) een korte vakantie (a short holiday), (3) een korte broek (short trousers), (4) kort haar (short hair)<br />
In this case, the LU needs to be split into two LUs, distinguishing one temporal sense<br />

(with the combinations (1) and (2)) and one spatial sense (with the combinations (3)<br />

and (4)). Thus DWN’s semantic relations can be aligned correctly to the LUs<br />

(example 2):



Ex. 2. DWN kort (short).<br />

Synset ‘of time’: kort, kortdurend, kortstondig; semantic relations: Antonym: lang [1], langdurig (for a long period of time)<br />

Synset ‘of length’: kort; semantic relations: Near-synonym: klein (small); Antonym: lang [2] (long, of relatively great length)<br />

To be able to deal in a systematic way with these problems, we introduced the use<br />

of a semantic classification system for adjectives (Hundsnurscher & Splett,<br />

GermaNet). The classification regards the relation between the adjective and the<br />

modified noun. Adjectives are split up into 70 semantic classes which are organized in<br />

15 main classes. In addition to this class, we also encode the ‘semantic orientation’<br />

indicating a positive (+), negative (-) or neutral ( ) connotation of the involved<br />

adjectives. Since the semantic class and the semantic orientation hold for all<br />

synonyms within the synset, it is encoded at the level of the synset.<br />
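Encoded at the level of the synset, such a classified adjective synset might look like this. The record is illustrative (the key names are assumptions); the content is that of synset C of the dol example below.

```python
# An adjective synset with its semantic class (one of ~70 classes in
# 15 main classes) and its semantic orientation, both encoded at the
# level of the synset; the dictionary keys are illustrative.
synset_c = {
    "synonyms": ["dol", "gek", "maf", "dwaas", "gaga", "geflipt"],
    "semantic_class": "CHARACTER/BEHAVIOUR",
    "orientation": "-",  # "+" positive, "-" negative, " " neutral
}
```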

The following example presents the aligned version – after editing both LUs and<br />

synsets – of the word dol (crazy, fond). We distinguished three LUs and aligned them<br />

to synsets A, B and C respectively (example 3).<br />

Ex. 3. dol (LUs and Synsets).<br />

LU 1: with a strong liking for; syntax: predicative, fixed preposition ‘op’ (on); combinatorics: dol op kinderen (fond of children), dol op chocola (fond of chocolate); aligned to synset A<br />

LU 2: offering fun and gaiety; syntax: attr/pred; combinatorics: een dolle avond (a merry evening); aligned to synset B<br />

LU 3: behaving irrationally; syntax: attr/pred; combinatorics: het is genoeg om dol van te worden (it is enough to drive you crazy); aligned to synset C<br />

Synset A: dol, verzot, gek, verrukt; semantic classification: CHARACTER/BEHAVIOUR; orientation: +<br />

Synset B: dol, uitgelaten, jolig (crazy, jolly); semantic classification: MOOD; orientation: +<br />

Synset C: dol, gek, maf, dwaas, gaga, geflipt (crazy, foolish); semantic classification: CHARACTER/BEHAVIOUR; orientation: -<br />


498 Piek Vossen et al.<br />

4.4 Multiword units<br />

Special attention is paid to the encoding and alignment of multiword units. The combinatoric information in the Cornetto Database is classified into the following types: (1) free illustrative examples, (2) grammatical collocations, (3) pragmatic formulae, (4) transparent lexical collocations, (5) semi-transparent lexical collocations, (6) idioms and (7) proverbs. In RBN these combinations were not included in the macrostructure but given within the microstructure of the meaning of one particular word contained in the expression. One of the objectives of Cornetto is to introduce part of them, namely the fixed combinations with reduced semantic (and often syntactic) transparency, into the macrostructure, thus making it possible to align them with a synset and, via the synset, with the ontology. We focus on combinations that have reduced semantic (and often syntactic) transparency and reduced or absent compositionality. The following three types meet the criterion set for this new group:<br />

▪ Idioms: expressions with reduced or absent semantic transparency (e.g. stoken in een goed huwelijk (drive a wedge between two people), een rare snijboon (an odd person)).<br />

▪ Proverbs: completely frozen sentences.<br />

▪ Semi-transparent lexical collocations: lexical collocations in which one of the component words has a more specific or less literal meaning than its basic meaning, so that the whole combination has reduced semantic transparency (systematische catalogus (systematic catalogue), open breuk (compound fracture), enkelvoudige breuk (simple fracture)).<br />

Idioms and proverbs will be aligned with synsets exclusively by hand. The alignment of the semi-transparent lexical collocations with the synset hierarchy will be performed semi-automatically: in most cases the synset that includes the head of the NP (systematische catalogus) will be the hypernym synset of the multiword unit.<br />

With regard to their semantic description, multiword units are regarded as a sequence of words that act as a single unit. Examples (4) and (5) illustrate the encoding of a lexical collocation and an idiom respectively. The description focuses on the semantics of the whole expression: each entry consists of a canonical form, its syntactic category, a textual form (if applicable), a lexicographic definition, information regarding its use if needed, and one or more examples of the construction in context. The link to the synset is realised by a pointer to a cid-entry (c_cid_id), and links to the individual words of the combination are realised by pointers to single-word lexical units (c_lu_id). Morpho-syntactic information relative to the individual words is included in the description of those particular words. The pointers to the individual words are pointers to lexical units. This may seem, and sometimes is, at odds with the non-compositionality of the multiword units. However, many multiword units are only semi-transparent, and their syntactic and semantic behaviour is often related to that of their individual parts.
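A minimal sketch of such a multiword entry, with hypothetical field names and identifiers modelled on the c_cid_id/c_lu_id pointers described above (not the actual Cornetto record format):

```python
from dataclasses import dataclass, field

@dataclass
class MultiwordUnit:
    """The semantics describe the whole expression, while the component
    pointers link back to the single-word lexical units it contains."""
    canonical_form: str
    category: str          # syntactic category of the whole unit
    subtype: str           # e.g. "lexical collocation", "idiom"
    definition: str
    c_cid_id: str          # pointer to the cid-entry (synset side)
    c_lu_ids: list = field(default_factory=list)  # component LU pointers

# Ex. 4 as a record (the identifiers are illustrative, not real ids):
blinde_muur = MultiwordUnit(
    canonical_form="blinde muur",
    category="NP",
    subtype="lexical collocation",
    definition="muur zonder ramen of deuren",
    c_cid_id="cid:blinde_muur",
    c_lu_ids=["lu:muur(N)", "lu:blind(A)"],
)

print(blinde_muur.c_lu_ids)  # ['lu:muur(N)', 'lu:blind(A)']
```

The design choice mirrors the text: the whole-unit semantics live on the entry itself, while the LU pointers keep the (often only partial) link to the component words available.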



Ex. 4. Multiword unit blinde muur (blank wall).<br />
Canonical form: blinde muur (NP)<br />
Sy-subtype: lexical collocation<br />
Meaning description: muur zonder ramen of deuren (a wall unbroken by windows or other openings)<br />
C_LU_ID: muur (N) (wall)<br />
C_LU_ID: blind (A) (blind)<br />
Synset: [blinde muur] (blank wall)<br />
Hypernym: [muur] (wall)<br />
OntologicalType: StationaryArtifact (an artifact that has a fixed spatial location)<br />

Ex. 5. Multiword unit roomser dan de paus (more Catholic than the Pope).<br />
CanonicalForm: roomser dan de paus (AdjP)<br />
Sy-subtype: idiom<br />
Sem-meaningdescription: overdreven principieel (extremely principled)<br />
Prag-Connotation: pejorative<br />
C_LU_ID: rooms (A) (Catholic)<br />
C_LU_ID: paus (N) (pope)<br />
Synset: roomser dan de paus<br />
Hypernym: principieel (principled), beginselvast (consistent)<br />
OntologicalType: TraitAttribute<br />

5 Aligning synsets with ontology terms<br />

A new relation is the mapping from the synset to the ontology. The ontology is seen<br />

as an independent anchoring of concepts to some formal representation that can be<br />

used for reasoning. Within the ontology, Terms are defined as disjoint Types,<br />

organized in a Type hierarchy where:<br />

▪ a Type represents a class of entities that share the same essential properties.<br />

▪ Instances of a Type belong to only a single Type; Types are thus disjoint (an entity cannot be both a cat and a dog).<br />

Terms can further be combined in a knowledge representation language to form axiom expressions (an entity can be a watchdog and a bulldog), namely the Knowledge



Interchange Format (KIF), based on first-order predicate calculus and primitive elements.<br />

Following the OntoClean method [6, 7], identity criteria can be used to determine<br />

the set of disjunct Types. These identity criteria determine the essential properties of<br />

entities that are instances of these concepts:<br />

▪ Rigidity: to what extent are properties of an entity true in all or most worlds?<br />

E.g., a man is always a person but may bear a Role like student only temporarily.<br />

Thus manhood is a rigid property while studenthood is anti-rigid.<br />

▪ Essence: which properties of entities are essential? For example, shape is an<br />

essential property of vase but not an essential property of the clay it is made of.<br />

▪ Unicity: which entities represent a whole and which entities are parts of these<br />

wholes? An ocean or river represents a whole but the water it contains does not.<br />

The identity criteria are based on certain fundamental requirements. These include<br />

that the ontology is descriptive and reflects human cognition, perception, cultural<br />

imprints and social conventions [21].<br />

The work of Guarino and Welty [6, 7] has demonstrated that the WordNet<br />

hierarchy, when viewed as an ontology, can be improved and reduced. For example,<br />

roles such as AGENTS of processes are anti-rigid. They do not represent disjunct<br />

types in the ontology and complicate the hierarchy. As an example, consider the<br />

hyponyms of dog in WordNet, which include both types (races) like poodle,<br />

Newfoundland, and German shepherd, but also roles like lapdog, watchdog and<br />

herding dog. “Germanshepherdhood” is a rigid property, and a German shepherd will<br />

never be a Newfoundland or a poodle. But German shepherds may be herding dogs.<br />

The ontology would only list the rigid types of dogs (dog races): Canine =><br />

PoodleDog; NewfoundlandDog; GermanShepherdDog, etc.<br />

The lexicon of a language may then contain words that are simply names for these types, and other words that do not represent new types but represent roles (and other conceptualizations of types). For example, English poodle, Dutch poedel and Japanese pudoru will become simple names for the ontology type: ⇔ (instance x PoodleDog).<br />

On the other hand, English watchdog, the Dutch word waakhond and the Japanese<br />

word banken will be related through a KIF expression that does not involve new<br />

ontological types: ⇔ ((instance x Canine) and (role x GuardingProcess)), where we<br />

assume that GuardingProcess is defined as a process in the hierarchy as well. The fact<br />

that the same expression can be used for all three words indicates equivalence<br />

across the three languages.<br />

In a similar way, we can use the notions of Essence and Unicity to determine<br />

which concepts are justifiably included in the type hierarchy and which ones are<br />

dependent on such types. If a language has a word to denote a lump of clay (e.g. in<br />

Dutch kleibrok denotes an irregularly shaped chunk of clay), this word will not be<br />

represented by a type in the ontology because the concept it expresses does not satisfy<br />

the Essence criterion. Similarly, a word like river water (Dutch rivierwater) is not<br />

represented by a type in the ontology as it does not satisfy Unicity; such words are<br />

dependent on valid types. Satisfying the rigidity criterion, for example, is a condition<br />

for type status.



From this basic starting point, we can derive two types of mappings from synsets to the ontology [5, 19]:<br />
▪ Synsets represent disjunct types of concepts, where they are defined as: a. names of Terms; b. subclasses of Terms, in case the equivalent class is not provided by the ontology;<br />
▪ Synsets represent non-rigid conceptualizations, which are defined through a KIF expression.<br />

When we look at the different dogs in the Dutch WordNet, we see three types of hyponyms:<br />
▪ bokser; corgi; loboor; mopshond; pekinees; pointer; spaniel (all dog races)<br />
▪ pup (puppy); reu (male dog); teef (bitch)<br />
▪ bastaard (mongrel); straathond (street dog); blindengeleidehond (guide dog for blind people); bullebijter (nasty dog); diensthond (police dog); gashond (dog for detecting gas leaks); jachthond (hunting dog); lawinehond (avalanche dog); schoothondje (lap dog); waakhond (watch dog)<br />

The first group are names for dog races that are clearly rigid and disjunct. They<br />

represent names for Terms. The second group are words for male/female and baby<br />

dogs. They can be encoded in the same way as man, woman and child for humans.<br />

The third group refers to dogs with certain non-rigid attributes. They will thus not<br />

represent names for types but are related to the ontology by a mapping to the term<br />

Canine and the attribute that applies.<br />

The KIF expressions are currently restricted to triplets consisting of the relation<br />

name, a first argument and a second argument. The default operator of the triplets is<br />

AND, and we assume default existential quantification of any of the variables,<br />

specified as a value of the arguments. Furthermore, we follow the convention to use a<br />

zero symbol as the variable that corresponds to the denotation of the synset being<br />

defined and any other integer for other denotations. Finally, we use the symbol ⇔ for<br />

full equivalence (bidirectional subsumption). In the case of partial subsumption, we<br />

use the symbol ⇒, meaning that the KIF expression is more general than the meaning<br />

of the synset. If no symbol is specified, we assume an exhaustive definition by the<br />

KIF expression. The symbol ⇔ applies by default.<br />

The following simplified expression can then be found in the Cornetto database for the non-rigid synset {waakhond} (watchdog): (instance, 0, Canine) (instance, 1, GuardingProcess) (role, 0, 1). This should be read as follows:<br />

⇔ The expression exhaustively defines the synset<br />

(instance, 0, Canine)<br />

Any referent of an expression with this synset as the head is also an<br />

instance of the type Canine (the special status of the zero variable),<br />

AND<br />

There exists an instance of the type Canine 0,<br />

AND<br />

(instance, 1, GuardingProcess)



There exists an instance, 1, of the type GuardingProcess,<br />

AND<br />

(role, 0 ,1)<br />

The entity 0 has a role relation with the entity 1.<br />
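The reading rules above can be sketched as a small converter from the triplet notation to a KIF-style formula. This is a simplified illustration under the stated conventions (default AND, existential quantification of the integer variables, 0 denoting the synset being defined), not the actual Cornetto exporter, and the variable naming ?x0, ?x1 is an assumption:

```python
def triplets_to_kif(triplets):
    """Render relation triplets (relation, arg1, arg2) as one KIF-like
    expression: integer arguments become existentially quantified
    variables and all triplets are conjoined with 'and'."""
    variables = sorted({arg for t in triplets for arg in t[1:]
                        if isinstance(arg, int)})
    clauses = " ".join(
        "({} {})".format(t[0], " ".join(
            "?x{}".format(arg) if isinstance(arg, int) else str(arg)
            for arg in t[1:]))
        for t in triplets)
    return "(exists ({}) (and {}))".format(
        " ".join("?x{}".format(v) for v in variables), clauses)

# The {waakhond} (watchdog) mapping from the text:
waakhond = [("instance", 0, "Canine"),
            ("instance", 1, "GuardingProcess"),
            ("role", 0, 1)]
print(triplets_to_kif(waakhond))
# → (exists (?x0 ?x1) (and (instance ?x0 Canine)
#    (instance ?x1 GuardingProcess) (role ?x0 ?x1)))
```

Variable 0 corresponds to the denotation of the synset itself, so any referent of {waakhond} is an instance of Canine standing in a role relation to some GuardingProcess.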

Other expressions that we use are:<br />

Bokser (+, 0, Canine)<br />

The synset {bokser} is a rigid concept which is a subclass of the type<br />

Canine<br />

hond (=, 0, Canine)<br />

The synset {hond} is a Dutch name for the rigid type Canine<br />

The latter two relations are mainly imported from the SUMO mappings to the English WordNet. In the case of {bokser}, the relation is manually added because it is a dog race that is not in the English WordNet.<br />

Another case of mixed hyponyms are words for water. In the Dutch WordNet there are over 40 words that can be used to refer to water in specific circumstances or with specific attributes. In SUMO, water is a CompoundSubstance, just like other molecules. We can thus expect the synset of water in Dutch to match directly to Water in SUMO, just as zand matches Sand. However, water has three major meanings in the Dutch WordNet: water as a liquid, water as a chemical substance and a water area, while there are only two concepts in SUMO: Water as the CompoundSubstance and WaterArea. SUMO has no concept for water in its liquid form, even though this is the most common concept for most people. Most of the hyponyms of water in the Dutch WordNet are linked to the liquid. To map them properly to the ontology, we must therefore first map water as a liquid. This can be done by assigning the Attribute Liquid to the concept of Water as a CompoundSubstance:<br />

(exists (?W ?L)<br />
(and<br />
(instance ?W Water)<br />
(instance ?L LiquidState)<br />
(hasAttributeinstance ?W ?L)))<br />

In the Cornetto database, this complex KIF expression is represented by the<br />

slightly simpler relation triplets:<br />

(instance, 0, Water)<br />

(instance, 1, LiquidState)<br />

(hasAttributeinstance, 0, 1)<br />

The hyponyms of water in the Dutch WordNet can further be divided into three groups:



▪ Water used for a purpose: theewater (for making tea), koffiewater (for making coffee), bluswater (for extinguishing fire), scheerwater (for shaving), afwaswater (for cleaning dishes), waswater (for washing), badwater (for bathing), koelwater (for cooling), spoelwater (for flushing), drinkwater (for drinking)<br />
▪ Water occurring somewhere or originating from somewhere: putwater (in a well), slootwater (in a ditch), welwater (out of a spring), leidingwater, gemeentepils, kraanwater (out of the tap), gootwater (in the kitchen sink or gutter), grachtwater (in a canal), kwelwater (coming from underneath a dike), grondwater (in the ground), buiswater (on a ship)<br />
▪ Water being the result of a process: pompwater (being pumped away), smeltwater, dooiwater (melting snow and ice), afvalwater (waste water), condens, condensatiewater, condenswater (from condensation), lekwater (leaking water), regenwater (rain water), spuiwater (being drained for water maintenance)<br />

Figure 6 shows some of the mapping expressions used to relate these synsets to the ontology:<br />

theewater (tea water): (instance, 0, Water) (instance, 1, Human) (instance, 2, Making) (instance, 3, Tea) (agent, 1, 2) (resource, 0, 2) (result, 3, 2)<br />
bluswater (water for extinguishing fire): (instance, 0, Water) (instance, 1, Human) (instance, 2, Extinguishing) (instrument, 0, 2) (agent, 1, 2)<br />
putwater (water at the bottom of a well): (instance, 0, Water) (instance, 1, MineOrWell) (located, 0, 1)<br />
slootwater (water in a ditch): (instance, 0, Water) (instance, 1, SmallStaticWaterArea) (part, 0, 1)<br />
drinkwater (drinking water): (instance, 0, Water) (instance, 1, Drinking) (resource, 0, 1) (capability, 0, 1)<br />
leidingwater, gemeentepils, kraanwater (water out of the tap): (instance, 0, Water) (instance, 1, Faucet) (instance, 2, Removing) (origin, 1, 2) (patient, 0, 2)<br />
Fig. 6. KIF-like mapping expressions for some hyponyms of the Dutch water.<br />
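Stored as plain triplet lists, such mappings can be queried directly. The sketch below (the storage format and function name are assumptions) transcribes three of the Fig. 6 mappings and collects the ontology Types each word's mapping refers to, showing how all the hyponyms share the single type Water:

```python
# Triplet mappings for a few water hyponyms, transcribed from Fig. 6.
mappings = {
    "theewater": [("instance", 0, "Water"), ("instance", 1, "Human"),
                  ("instance", 2, "Making"), ("instance", 3, "Tea"),
                  ("agent", 1, 2), ("resource", 0, 2), ("result", 3, 2)],
    "bluswater": [("instance", 0, "Water"), ("instance", 1, "Human"),
                  ("instance", 2, "Extinguishing"),
                  ("instrument", 0, 2), ("agent", 1, 2)],
    "putwater":  [("instance", 0, "Water"),
                  ("instance", 1, "MineOrWell"), ("located", 0, 1)],
}

def ontology_types(word):
    """Collect the ontology Types a word's mapping refers to,
    i.e. the second argument of every 'instance' triplet."""
    return {b for (rel, a, b) in mappings[word] if rel == "instance"}

# Every hyponym maps onto Water plus situation-specific types,
# so the ontology itself stays compact.
for word in mappings:
    assert "Water" in ontology_types(word)
print(ontology_types("putwater"))  # the set {'Water', 'MineOrWell'}
```

This illustrates the point made below: the non-rigid water words stay in the lexicon, while only Water and the situation types they combine with need to exist in the ontology.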

Through the complex mappings of non-rigid synsets to the ontology, the latter can remain compact and strict. Note that the distinction between rigid and non-rigid does not downgrade the relevance or value of the non-rigid concepts. On the contrary, the non-rigid concepts are often more common and relevant in many situations. In the Cornetto database, we want to make the distinction between the ontology and the lexicon clearer. This means that rigid properties are defined in the ontology and non-rigid properties in the lexicon. The value of their semantics is however equal and can formally be used by combining the ontology and the lexicon.<br />

The work on the ontology is mainly carried out manually. The mappings of the synsets to SUMO/MILO are primarily imported through the equivalence relations to the English WordNet. We used the SUMO-WordNet mapping provided at http://www.ontologyportal.org/, dated April 2006. If there is more than one equivalence mapping with the English WordNet, this may result in many-to-one mappings from SUMO to the synset. The mappings are manually revised by traversing the Dutch WordNet hierarchy top-down, so that priority is given to the most essential synsets. Furthermore, we will revise all synsets with a large number of equivalence relations or with low-scoring equivalence relations. Finally, we also plan to clarify the synset-type relations for large sets of co-hyponyms, as shown above for water. This work is still in progress. We do not expect it to be completed for all synsets in this 2-year project with limited funding, but we hope that a discussion on this topic can be started by working out the specification for a number of synsets and concepts.<br />

6 Conclusion<br />

In this paper, we presented the Cornetto project, which combines three different semantic resources in a single database. Such a database presents unique opportunities to study different perspectives on meaning on a large scale and to define the relations between the different ways of defining meaning more strictly. We discussed the methodology of automatically and manually aligning the resources and some of the differences in encoding word-concept relations that we came across. The work on Cornetto is still ongoing and will be completed in the summer of 2008. The database and more information can be found at:<br />
http://www.let.vu.nl/onderzoek/projectsites/cornetto/start.htm<br />

Acknowledgments<br />

This research has been funded by the Netherlands Organisation for Scientific Research (NWO) via the STEVIN programme for stimulating language and speech technology in Flanders and The Netherlands.<br />

References<br />

1. Copestake, A., Briscoe, T.: Lexical operations in a unification-based framework. In:<br />

Pustejovsky, J. and Bergler, S. (eds.) Lexical semantics and knowledge representation.



Proceedings of the first SIGLEX Workshop, Berkeley, pp. 101–119. Springer-Verlag, Berlin<br />

(1992)<br />

2. Copestake, A.: Representing Lexical Polysemy. In: Klavans, J. (ed.) Representation and<br />

Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 21–26.<br />

Menlo Park, California (2003)<br />

3. Cruse, D.: Lexical semantics. University Press, Cambridge (1986)<br />

4. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge MA<br />

(1998)<br />

5. Fellbaum, C., Vossen. P.: Connecting the Universal to the Specific: Towards the Global<br />

Grid. In: Proceedings of the First International Workshop on Intercultural Communication.<br />

Reprinted in: Ishida, T., Fussell, S. R. and Vossen, P. (eds.) Intercultural Collaboration: First<br />

International Workshop. Lecture Notes in Computer Science 4568, 1–16. Springer, New<br />

York (2007)<br />

6. Guarino, N., Welty, C.: Identity and subsumption. In: Green, R., Bean, C., Myaeng, S. (eds.)<br />

The Semantics of Relationships: an Interdisciplinary Perspective. Kluwer, Dordrecht (2002)<br />

7. Guarino, N., Welty, C.: Evaluating Ontological Decisions with OntoClean. J.<br />

Communications of the ACM 45(2), 61–65 (2002)<br />

8. Gruber, T.R.: A translation approach to portable ontologies. J. Knowledge Acquisition 5(2),<br />

199–220 (1993)<br />

9. Horák, A., Pala, K., Rambousek, A., Povolný, M.: DEBVisDic – First Version of New Client-Server WordNet Browsing and Editing Tool. In: Proceedings of the Third International WordNet Conference (GWC-06). Jeju Island, Korea (2006)<br />

10. Maks, I., Martin, W., Meerseman, H. de: RBN Manual. Vrije Universiteit Amsterdam<br />

(1999)<br />

11. Magnini, B., Cavaglià, G.: Integrating subject field codes into WordNet. In: Proceedings of<br />

the Second International Conference Language Resources and Evaluation Conference<br />

(LREC), pp. 1413–1418. Athens (2000)<br />

12. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet:<br />

An On-line lexical Database. J. International Journal of Lexicography 3/4, 235–244 (1990)<br />

13. Miller, G. A., Fellbaum, C.: Semantic Networks of English. J. Cognition, special issue,<br />

197–229 (1991)<br />

14. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In: Proceedings of FOIS 2, pp. 2–<br />

9. Ogunquit, Maine (2001)<br />

15. Niles, I., Pease, A.: Linking Lexicons and Ontologies: Mapping WordNet to the Suggested<br />

Upper Merged Ontology. In: Proceedings of the International Conference on Information<br />

and Knowledge Engineering. Las Vegas, Nevada (2003)<br />

16. Niles, I., Terry, A.: The MILO: A general-purpose, mid-level ontology. In: Proceedings of<br />

the International Conference on Information and Knowledge Engineering. Las Vegas,<br />

Nevada (2004)<br />

17. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge MA (1995)<br />

18. Vossen, P. (ed.): EuroWordNet: a multilingual database with lexical semantic networks for<br />

European Languages. Kluwer, Dordrecht (1998)<br />

19. Vossen, P., Fellbaum, C.: Universals and idiosyncrasies in multilingual wordnets. In: Boas, H. (ed.) Multilingual Lexical Resources. De Gruyter, Berlin (to appear)<br />

20. Vliet, H.D. van der: The Referentie Bestand Nederlands as a multi-purpose lexical<br />

database. J. International Journal of Lexicography (2007) (forthcoming)<br />

21. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb Deliverable<br />

D18 Ontology Library, IST Project 2001-33052 WonderWeb: Ontology Infrastructure for<br />

the Semantic Web Laboratory For Applied Ontology - ISTC-CNR. Trento (2003)


CWN-Viz : Semantic Relation Visualization<br />

in Chinese WordNet<br />

Ming-Wei Xu 1, Jia-Fei Hong 2, Shu-Kai Hsieh 3, and Chu-Ren Huang 1<br />
1 Institute of Linguistics, Academia Sinica, No. 128, Section 2, Academia Road 115, Taipei, Taiwan R.O.C.<br />
2 Graduate Institute of Linguistics, National Taiwan University, No. 1, Sec. 4, Roosevelt Road 106, Taipei, Taiwan R.O.C.<br />
3 Department of English, National Taiwan Normal University, No. 162, Section 1, He-ping East Road 106, Taipei, Taiwan R.O.C.<br />

javanxu@gmail.com<br />

{jiafei, churen}@gate.sinica.edu.tw<br />

shukai@gmail.com<br />

Abstract. This paper reports our recent work on using visualization to present semantic relations in Chinese WordNet. We designed a visualization interface, named CWN-Viz, based on "TouchGraph". There are three important design features of this visualization interface: First, visualization is driven by the word form, the most intuitive lexical search unit in Chinese. Second, CWN-Viz allows visualization of bilingual semantic relations by incorporating Sinica BOW (Bilingual Ontological WordNet) information. Third, the semantic distance of each relation is calculated and used in both clustering and visualization.<br />

Keywords: Chinese Lexical Knowledgebase, Chinese WordNet, Semantic<br />

Relation, Visualization.<br />

1 Introduction<br />

George Miller [1] proposed that synonym sets can be used to anchor the representation of lexical concepts and to describe the (mental) lexicon. This was the original motivation for constructing WordNet. Recently, many research teams have dealt with semantic relations using the WordNet knowledge base. With the bilingual Chinese-English WordNet Sinica BOW [14], the Chinese WordNet Group at Academia Sinica has been working on dividing and analyzing Chinese lemmas, senses and their semantic relations.<br />

Taking semantic relations as the foundation, we address the reliability of the Chinese WordNet and related problems. On these data, we put the emphasis on presenting a visualization system that shows the semantic relations of all senses. Finally, we make use of a distance-calculation principle to cluster related senses and present several groupings of similar synonyms.


CWN-Viz : Semantic Relation Visualization in Chinese WordNet 507<br />

Based on these viewpoints, we obtain several groupings of similar synonyms, and more such groupings can then be created automatically. At the same time, we can build a much larger Chinese semantic relation network.<br />

2 Chinese WordNet<br />

The relationship between language and meaning has been one of the problems people have thought about ever since human languages and cultures began. The "word" is the minimal meaning unit of human languages. Dividing and expressing the meaning of a word, and the interaction between accessing senses and expressing knowledge, are among the most fundamental research questions. Sense division and expression need to be established on the basis of a complete set of lexical semantic theories and on the basic frames of an ontology. In the Institute of Linguistics at Academia Sinica, under the direction of Chu-Ren Huang, the Chinese WordNet Group (CWN Group) has been working on research called "Chinese Meaning and Sense." This research provides explicit data by analyzing Chinese lexical senses manually.<br />

Huang [2] proposed the criteria and operational guidelines for the process of dividing lexical senses. These criteria are also the basis for constructing a Chinese sense knowledgebase and codifying the Dictionary of Sense Discrimination. The entries in the Dictionary of Sense Discrimination can be single words, two-word or multi-word expressions, and are limited to common words in modern Chinese. As shown in Fig. 1, this dictionary lists the complete information of each entry, including the phonetic symbols (Pinyin and National Phonetic Alphabet), sense definition, corresponding synset, part-of-speech (POS), example sentences and explanatory notes.<br />

WordNet was the first application that integrated all the different linguistic elements, such as sense, synonyms (synset), semantic relations and examples, and an online interface has been designed for its current version, WN 3.0. Here, we compare the data structures of WN 3.0 and Chinese WordNet.<br />

As shown in Fig. 2, in WN each lexical entry simply lists its sense, synonyms (synset), semantic relations with other English synsets and some examples. As shown in Fig. 3, the data structure of Chinese WordNet is divided into the parts Chinese lexicon, Chinese lexical knowledge, and links to related language resources. In the section on Chinese lexical knowledge, similarly to WN, each entry is accompanied by its sense, domain, English synset and synset number from WN 1.6, semantic relations with other lexical items and example sentences, although common lexical items do not belong to any domain. The synset is used to link to the other English resources in the system. The uniqueness and value of Chinese WordNet lie in presenting the results of sense division and linking them with English resources.


508 Ming-Wei Xu, Jia-Fei Hong, Shu-Kai Hsieh, and Chu-Ren Huang<br />

Fig. 1. Example of Chinese Lexical Lemma.<br />
Fig. 2. Data Structure of WN.



Fig. 3. Data Structure of CWN.<br />

In other words, Chinese WordNet provides cross-lingual search functionality because it integrates varied English resources such as WN and SUMO (Suggested Upper Merged Ontology). Therefore, users can easily compare the conceptual differences between Chinese and English via Chinese WordNet.<br />

Normally, a text-search system can only retrieve target documents that include the query words in their content; it cannot search by word senses or other relevant information about the words. Such a search function obviously cannot fulfill the requirements of linguistic research. Experience from previous relevant linguistic research shows that it is possible to collect a great variety of lexical information that is considered necessary for linguistic research purposes. This research is based on analyzing Chinese lexical items. After careful analysis and research, each Chinese lexical item is accompanied by information about its lemma, sense, domain, sense definition, semantic relations, synset, example sentences, explanatory notes and so on. Such careful analysis helps to preserve lexical knowledge systematically and to fulfill the different needs of relevant linguistic research.<br />

From the beginning of 2003 to the beginning of September 2007, the CWN Group analyzed a total of 6,653 lemmas and identified 16,693 senses. In order to refer clearly to the data from the cumulative knowledgebase, in 2005 we started to use "editions" to divide the content of Chinese WordNet. The data accumulated up to the end of 2003 formed the first edition. The research results accumulated up to the end of 2004 formed the second edition. The third edition comprised the data accumulated up to the end of 2005 and was presented to the public in April 2006. The fourth edition, covering our sense division up to the end of 2006, was published in April 2007.<br />

3 Visualization<br />

In this study, we follow a well-accepted design paradigm to create a working prototype of a visualization suite for Chinese WordNet that presents the sense divisions of the data, together with the ability to focus on specific synsets of interest and obtain details.<br />

Previously, information visualization techniques were applied in computer science and biology to show the relational structure of large data sets. Ware [3] suggests five advantages of effective information visualization:<br />

1) Comprehension: Visualization provides an ability to comprehend huge<br />

amounts of data.<br />

2) Perception: Visualization reveals properties of the data that were not<br />

anticipated.<br />

3) Quality control: Visualization makes problems in the data immediately<br />

apparent.<br />

4) Detail + Context: Visualization facilitates understanding of small scale<br />

features in the context of the large scale picture of the data.<br />

5) Interpretation: Visualization supports hypothesis formation, leading to<br />

further investigation.<br />

Recently, these techniques have gradually been applied in NLP, for example in the Visual Thesaurus [15] and WordNet Explorer [4]. These systems, however, focus on usage and show only partial WordNet information; few are designed to show the relation between lemmas and senses completely.<br />

Following a similar design, we construct a visualization interface, named CWN-Viz, based on TouchGraph, an open source graph layout system that has been extended and adapted to suit our requirements. The interface can show all the lemmas, senses, and semantic relations recorded for a word form in Chinese WordNet. An important design feature of this interface is the iconic representation of semantic distance. We propose a set of principles to measure the distance of each semantic relation by calculating the distances between the nodes connected by that relation. The distance information is used to cluster related information and is presented in the visualization interface, as shown in Figs. 4 to 6:


CWN-Viz : Semantic Relation Visualization in Chinese WordNet 511<br />

Fig. 4 The basic visualization construction.<br />

Fig. 5 Semantic relations of visualization construction.



Fig. 6 The interface for semantic relations of visualization construction.<br />

4 Analysis<br />

Based on the CWN analysis and visualization construction above, we developed a calculation principle for clustering different lemmas and senses by their semantic relations. First, we take the keyword root as the center node and extend the graph two levels. The first-level nodes, such as A, B, C, and D, serve as sub-roots, each expanding into a sub-tree. By counting the semantic relations between these sub-trees, we can evaluate a relationship score for each pair of sub-trees (Fig. 7) and present the scores as a matrix for each cluster (Table 1). Finally, we repeatedly select the pair of nodes with the highest number of semantic relations until all nodes have been selected. Consequently, we obtain Cluster 1 as [A, B] and Cluster 2 as [C, D].



Fig. 7 The clusters of semantic relations of visualization construction.<br />

Table 1. The calculating matrix for each cluster.<br />

(number of semantic relations between each pair of sub-trees)<br />

A-B: 1, A-C: 0, A-D: 1<br />

B-C: 2, B-D: 1<br />

C-D: 4<br />
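The greedy selection described in Section 4 can be sketched in a few lines of code. This is an illustrative reconstruction of the principle, not the actual CWN-Viz implementation; the node names and relation counts follow Table 1.<br />

```python
# Greedy clustering of sub-trees by semantic-relation counts -- a sketch
# reconstructing the principle of Section 4, not the actual CWN-Viz code.
# 'relations' holds the pairwise counts from Table 1.
relations = {
    ("A", "B"): 1, ("A", "C"): 0, ("A", "D"): 1,
    ("B", "C"): 2, ("B", "D"): 1, ("C", "D"): 4,
}

def cluster(nodes, relations):
    """Repeatedly pair the unclustered nodes with the most relations."""
    remaining = set(nodes)
    clusters = []
    # Visit pairs in decreasing order of relation count.
    for (a, b), _count in sorted(relations.items(), key=lambda kv: -kv[1]):
        if a in remaining and b in remaining:
            clusters.append([a, b])
            remaining -= {a, b}
    clusters.extend([n] for n in sorted(remaining))  # leftover singletons
    return clusters

print(cluster(["A", "B", "C", "D"], relations))
# With the counts above, C-D (4 relations) is paired first, then A-B.
```

With Table 1's counts this yields exactly the clusters [C, D] and [A, B] reported in the text.<br />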

Following these constructs, we applied the model to the CWN analysis; the results are shown below:



Fig. 8 The visualization of semantic relations for Zheng4.<br />

We name this visualization system CWN-Viz, and we have turned the CWN analysis and visualization construction into a website, available at http://cwn.ling.sinica.edu.tw/cwnviz/. The website offers several resources, including the Chinese WordNet analysis, Sinica BOW, the Sinica Corpus, WordNet versions 1.6 and 1.7, elementary school textbooks, and the Chinese Dictionary of the Ministry of Education. So far, the website contains 3,000 lemmas, 5,568 senses, 1,260 facets, 9,828 nodes, and 11,978 relations.<br />

Here we briefly explain the notation used in Fig. 8 and Fig. 9:<br />

正 zheng3 'right'<br />

正 2: the second lemma of 正<br />

正 2(0620): the sixth sense of 正 2<br />

美麗 mei3 li4 'beautiful'<br />

美麗 (0101): the first meaning facet of the first sense of 美麗



In the interface, distinct node icons mark lemmas, senses, and facets. Blue lines represent synonyms, red lines antonyms, green lines hypernyms or hyponyms, and yellow lines near-synonyms.<br />

On this website, users can look up words and check their semantic relations, and can select different resources to display different content. The website is shown in Fig. 9; if we select the Chinese WordNet analysis or Sinica BOW as the content resource, we obtain views such as those in Fig. 10 and Fig. 11.<br />

Fig. 9 The website of visualization construction.



Fig. 10 The visualization of Chinese WordNet analysis for keyword.<br />

Fig. 11 The visualization of Sinica BOW for keyword.



5 Conclusion<br />

WordNet is by nature a representation of the network of lexical relations in the mental lexicon. It is not surprising that many previous works have attempted to make this relational network more explicit and easier to understand and process with visualization tools. Works such as VisDic [5] and the University of Toronto's WordNet Explorer have all made substantial contributions to this line of study. Our current study aims to integrate sharable tools from the visualization field with WordNet studies. We also design the tool so that cross-lingual lexical semantic relations can be visualized in parallel. We hope that such a tool will facilitate linguists' use and understanding of wordnets, as well as international initiatives such as the Global WordNet Grid. One area that has not yet been explored sufficiently is the calculation of semantic distances from visualized relations. With the analysis, calculation principle, and clustering described above, we can obtain visualizations such as those in Figs. 8 to 11. Ideally, we will create more related clusters and show their semantic relations. Further, we want to display related clusters in different colors and link related senses with different line styles in the interface, and to provide more details for each sense. Fig. 12 presents our prototype for this study.<br />

Fig. 12 The interface for semantic relations of visualization with sense division.



References<br />

1. Miller, G.; Beckwith, R.; Fellbaum, C.; Gross, D., Miller, K. J.: Introduction to WordNet:<br />

an on-line lexical database. J. International Journal of Lexicography (1990)<br />

2. Huang, C.R., Chen, C., Weng, C.X., Lee, H.P., Chen, Y.X., Chen, K.J.: The Sinica Sense<br />

Management System: Design and Implementation. J. Computational Linguistics and<br />

Chinese Language Processing 10(4). (2005)<br />

3. Ware, C.: Information Visualization: Perception for Design. Morgan Kaufmann (2000)<br />

4. Collins, C.: WordNet Explorer: Applying Visualization Principles to Lexical Semantics.<br />

Technical report, KMDI, University of Toronto (2006)<br />

5. Pavelek, T., Pala, K.: VisDic - A New Tool for WordNet Editing. Central Institute of Indian Languages, Mysore, India (2002)<br />

6. Ahrens, K., Chang, L.L., Chen, K.J., Huang, C.R.: Meaning Representation and Meaning<br />

Instantiation for Chinese Nominals. J. International Journal of Computational Linguistics<br />

and Chinese Language Processing 3(1), 45–60 (1998)<br />

7. Ahrens, K.: Timing Issues in Lexical Ambiguity Resolution. In: Nakayama, J. (ed.) Sentence<br />

Processing in East Asian Languages, pp.1–26. CSLI, Stanford (2002)<br />

8. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database and Some of its Applications,<br />

MIT Press (1998b)<br />

9. Fellbaum, C.: The Organization of Verbs and Verb Concepts in a Semantic Net Predicative<br />

Forms in Natural Language and in Lexical Knowledge Bases, pp. 93–109. Kluwer,<br />

Dordrecht, Holland (1998a)<br />

10. Huang, C.R., Kilgarriff, A., Wu, Y., Chiu, C.M., Smith, S., Rychly, P., Bai, M.H., Chen,<br />

K.J.: Chinese Sketch Engine and the Extraction of Collocations. Presented at the Fourth<br />

SigHan Workshop on Chinese Language Processing. October 14–15. Jeju, Korea (2005)<br />

11. Huang, C.R., Chang, R.Y., Lee, S.B.: Sinica BOW (Bilingual Ontological WordNet): Integration of bilingual WordNet and SUMO. In: The 4th International Conference on Language Resources and Evaluation (LREC2004), Lisbon, Portugal, 26–28 May (2004)<br />

12. Huang, C.R., Ahrens, K., Chang, L.L., Chen, K.J., Liu, M.C., Tsai, M.C.: The module-attribute representation of verbal semantics: From semantics to argument structure. J.<br />

Computational Linguistics and Chinese Language Processing. Special Issue on Chinese<br />

Verbal Semantics 5(1), 19–46 (2000)<br />

13. Huang, C.R., Tseng, I.J.E., Tsai, D.B.S., Murphy, B.: Cross-lingual Portability of Semantic<br />

relations: Bootstrapping Chinese WordNet with English WordNet Relations. J. Languages<br />

and Linguistics 4(3), 509–532 (2003)<br />

Resources<br />

14. Sinica BOW: Academia Sinica Bilingual Ontological WordNet<br />

http://BOW.sinica.edu.tw<br />

15. ThinkMap. 2005. Thinkmap visual thesaurus.<br />

http://www.visualthesaurus.com.<br />

16. Chinese WordNet<br />

http://cwn.ling.sinica.edu.tw/<br />

17. Sinica Corpus: Academia Sinica Balanced Corpus of Mandarin Chinese<br />

http://www.sinica.edu.tw/SinicaCorpus/<br />

18. TouchGraph.<br />

http://www.touchgraph.com/index.html



19. WordNet<br />

http://wordnet.princeton.edu/<br />

20. WordNet Explorer (Prototype)<br />

http://www.cs.toronto.edu/~ccollins/wordnetexplorer/index.html#controlpanel


Using WordNet in Extracting the Final Answer from<br />

Retrieved Documents in a Question Answering System<br />

Mahsa A. Yarmohammadi, Mehrnoush Shamsfard,<br />

Mahshid A. Yarmohammadi, and Masoud Rouhizadeh<br />

Natural Language Processing Laboratory, Shahid Beheshti University, Tehran, Iran<br />

m_yarmohammadi@std.sbu.ac.ir, m-shams@sbu.ac.ir,<br />

yarmohammadi@modares.ac.ir, m.rouhizadeh@mail.sbu.ac.ir<br />

Abstract. In this project we propose a model for the answer extraction component of a question answering system called SBUQA. Methods that extract answers based only on keywords ignore many acceptable answers to the question. Therefore, in our proposed system we exploit methods for meaning extension of the question and the candidate answers, and also make use of an ontology (WordNet). To represent the question and the candidate answers and compare them to each other, we use LFG (Lexical Functional Grammar), a meaning-based grammar that analyzes sentences at a deeper level than syntactic parsing, and obtain the f-structures of the sentences. We recognize the appropriate f-structure pattern of the question and, based on it, the f-structure patterns of the answers. The answer pattern is then matched against the pattern of the candidate answer by the proposed matching method, called extended unification of f-structures. Finally, the sentences that acquire the minimum score required to be offered to the user are selected; the answer clause is identified in them and displayed to the user in descending order of score.<br />

Keywords: Question answering systems, Answer extraction, Information<br />

retrieval, Natural language processing<br />

1 Introduction<br />

Information Retrieval (IR) systems are systems in which the user enters his/her query in the form of separate keywords, and the search engine retrieves all the related documents from its knowledge base in a limited time. Most retrieved documents are only syntactically, and not semantically, related to the user query.<br />

Users need exact and accurate information and do not want to waste time reading all the retrieved documents to find the answer; IR systems are not sufficient for this purpose. Hence, a new kind of IR system, Question Answering (QA) systems, appeared in the late 1970s and early 1980s. In these systems, the user asks a natural language question with no restriction on its syntax or semantics. The system is responsible for finding an exact, short, and complete answer in the shortest possible time. To do this, a QA system applies both IR and NLP techniques. In this article, we propose a method for the answer extraction component of a question answering system.


Using WordNet in Extracting the Final Answer from Retrieved… 521<br />

Methods that extract answers based only on keywords ignore many acceptable answers to the question. Therefore, in our proposed system we adopt methods for meaning extension of the question and the candidate answers, and also make use of an ontology (WordNet). To match the question and the candidate answers, we use Lexical Functional Grammar, exploiting the benefits of its functional-structure representation, and propose a unification algorithm. The question answering system we have designed and implemented for this purpose is named SBUQA 1 .<br />

In the following sections, we first introduce the overall architecture of QA systems, Lexical Functional Grammar, and the advantages of using this grammar in QA systems. Then we present SBUQA. Finally, we describe the implementation and evaluation of SBUQA and mention future work.<br />

2 QA Systems Architecture<br />

A question answering system based on searching among a set of documents is composed of three main components [3]:<br />

1) Question processing, which converts the question given in natural language into a query (or queries) to be used by the information retrieval component.<br />

2) Document retrieval (the search engine), which retrieves related documents, i.e. documents that probably contain the answer, based on the input query.<br />

3) Answer extraction, which extracts the sentence, expression, or text fragment containing the answer from the retrieved documents.<br />

A survey of existing question answering systems reveals that they all include these three components but use different methods to implement them.<br />
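The three components can be sketched as a minimal pipeline. The function bodies below are illustrative stubs under simple bag-of-words assumptions, not a description of any particular system:<br />

```python
# A minimal sketch of the three-component QA architecture described
# above; the component bodies are illustrative stubs.
STOPWORDS = {"who", "what", "when", "where", "which", "was", "is", "the"}

def process_question(question):
    """Component 1: convert the natural-language question into a query."""
    words = question.lower().rstrip("?").split()
    return [w for w in words if w not in STOPWORDS]

def retrieve_documents(query, corpus):
    """Component 2: retrieve documents sharing at least one query term."""
    return [d for d in corpus if set(query) & set(d.lower().split())]

def extract_answer(question, corpus):
    """Component 3: pick the retrieved sentence with the most overlap."""
    query = process_question(question)
    docs = retrieve_documents(query, corpus)
    return max(docs, default=None,
               key=lambda d: len(set(query) & set(d.lower().split())))

corpus = ["George Washington was born in Virginia.",
          "Paris is the capital of France."]
print(extract_answer("Where was George Washington born?", corpus))
# -> George Washington was born in Virginia.
```

SBUQA replaces the naive keyword overlap in component 3 with f-structure matching, as described in Section 4.<br />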

3 Lexical Functional Grammar<br />

Lexical Functional Grammar (LFG) is a meaning-based formalism which analyzes sentences at a deeper level than syntactic parsing [1].<br />

LFG views language as being made up of multiple dimensions of structure. The primary structures that have figured in LFG research are the structure of syntactic constituents (c-structure) and the representation of grammatical functions (f-structure). For example, in the sentence "The old woman eats the falafel", the c-structure analysis is that this is a sentence made up of two pieces, a noun phrase (NP) and a verb phrase (VP). The VP is itself made up of two pieces, a verb (V) and another NP. The NPs are also analyzed into their parts. Finally, the bottom of the structure is composed of the words out of which the sentence is constructed. The f-structure analysis, on the other hand, treats the sentence as being composed of attributes, which include features such as number and tense, and functional units such as subject, predicate, or object. This type of analysis is useful in that it is a more abstract representation of linguistic information than a parse tree structure. In addition, long-distance dependencies, which are very common in interrogative sentences and fact-seeking questions, are resolved in order to obtain a complete and correct f-structure<br />

1<br />

Shahid Beheshti University Question Answering system



analysis. This makes LFG analysis useful for QA tasks because it identifies the focus of the question and also which functional role (e.g. subject or object) the focus can fulfill. LFG analysis also provides valuable information for the detailed interpretation of complex questions, which can potentially form a significant component in answering them correctly.<br />

In our proposed system, we use LFG f-structure for representing and matching the<br />

question and its candidate answers.<br />
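As an illustration of the representation used, the f-structure of the example sentence from Section 3 can be modeled as a nested attribute-value structure. This is a simplified sketch; a real LFG parser produces considerably richer output:<br />

```python
# A simplified f-structure for "The old woman eats the falafel",
# modeled as a nested attribute-value dictionary (illustrative only).
f_structure = {
    "PRED": "eat<SUBJ, OBJ>",
    "TENSE": "present",
    "SUBJ": {"PRED": "woman", "NUM": "sg", "ADJ": [{"PRED": "old"}]},
    "OBJ": {"PRED": "falafel", "NUM": "sg"},
}

def get_path(fs, path):
    """Follow a chain of attributes, e.g. ('SUBJ', 'PRED') -> 'woman'."""
    for attr in path:
        fs = fs[attr]
    return fs

print(get_path(f_structure, ("SUBJ", "PRED")))  # woman
print(get_path(f_structure, ("OBJ", "NUM")))    # sg
```

Matching and unification then operate on these attribute paths rather than on surface word order.<br />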

4 SBUQA System<br />

From an overall view, the third component of a QA system gets the user question and a set of retrieved text documents as input, and shows the user the answer(s) extracted from the document set as output. The document set is composed of the text documents output by the second component of the QA system (the search engine). Figure 1 shows the architecture of SBUQA.<br />

[Figure 1 depicts the processing flow: the input question and the documents retrieved by the search engine enter the system; the question f-structure is built; the documents are preprocessed and the f-structures of their sentences are built; the question f-structure is matched with the defined question templates and an answer-instance f-structure is built from the matched template; answers are scored by extended unification of the answer and answer-instance f-structures; finally, the answers are sorted by score and the answer phrase is highlighted.]<br />

Figure 1. The architecture of SBUQA<br />

The components of the system and the relationships between them are described below.



4.1 Getting Question and Building its f-structure<br />

The system gets the user's natural language question and sends it to the LFG parser to build its f-structure representation. This representation is one of the inputs of the “Building f-structures of Answer Instances” component.<br />

4.2 Document Preprocessing and Building its Sentences f-structures<br />

Documents retrieved by the search engine are saved as text files in the system's document bank. These documents are preprocessed with the JAVARAP 2 tool, so that sentences are separated and pronouns are replaced by their referents. Then the sentences of each document are sent to the LFG parser to be represented as f-structures (called fs_C). These representations are one of the inputs of the “Extended Unification Algorithm and Answer Scoring” component.<br />

4.3 Building f-structures of Answer Instances<br />

We have defined some templates, represented in f-structure format, for wh-questions (called fs_TQ). The question f-structure is compared with the fs_TQ templates and matched with one of them, say X. For each fs_TQ, we have defined one or more answer templates represented in f-structure format (called fs_TA). The fs_TA templates of X (the fs_TQ matching the user question) are filled with question keywords to make answer instances (called fs_A). These instances are the other input of the answer scoring component. Question and answer templates are described in Section 4.5.<br />

4.4 Extended Unification Algorithm and Answer Scoring<br />

The fs_C of each document sentence (input from the document preprocessing component) is compared with the fs_A instances (input from the “Building f-structures of Answer Instances” component). The comparison is done by the extended unification algorithm introduced in Section 4.6, and the sentences are scored. Finally, sentences acquiring a score above a defined threshold are selected, ordered by their scores, and shown to the user.<br />

4.5 Question and Answer Templates<br />

We define templates for questions and related answers based on the following categorization of English sentences [2]:<br />

1. Active sentences with a transitive verb, containing a subject, a verb, and an optional object.<br />

2. Active sentences with an intransitive verb, containing a subject and a verb.<br />

2<br />

http://www.comp.nus.edu.sg/~qiul/NLPTools/JavaRAP.html



3. Passive sentences, containing an adverb that represents the omitted subject, a verb, and an object.<br />

4. Sentences containing a copula.<br />

Each of the above sentence types can contain complements or adverbs.<br />

We tried to define question templates for wh-questions so that they cover all standard forms of questions. We divided the five types of wh-questions into two groups: one containing who, where, and when questions, and the other containing what and which questions. Then, considering the four forms described above, we defined the following templates for wh-questions:<br />
defined the following templates for wh questions:<br />

4.5.1 Who, Where, and When Questions<br />

The following four templates, numbered I to IV, are the question templates for forms 1 and 2 (active sentences). The FOCUS property indicates the type of wh-question. The PRED property indicates the main verb of the sentence, TENSE the tense of the verb, OBJ the object, ADJ the adjuncts (especially adverbs), SUBJ the subject, and XCOMP the complement. The MODAL property in template II indicates that the sentence contains a modal verb.<br />

Template I is used for active interrogatives that contain only the main verb, and template II covers active interrogatives that contain a modal verb in addition to the main verb.<br />

Templates III and IV cover interrogatives that use the auxiliary verb do or have.
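Matching a question f-structure against such a template can be sketched as follows. The attribute names follow the description above, but the template contents and the matching code are an illustrative reconstruction, not the SBUQA implementation:<br />

```python
# Sketch of matching a question f-structure against template I (an
# active interrogative with only a main verb). ANY marks an open slot;
# the template contents are illustrative reconstructions.
ANY = object()

TEMPLATE_I = {"FOCUS": ANY, "PRED": ANY, "TENSE": ANY, "SUBJ": ANY}

def matches(template, fs):
    """All template attributes must be present; ANY accepts any value."""
    return all(attr in fs and (value is ANY or fs[attr] == value)
               for attr, value in template.items())

question_fs = {"FOCUS": "who", "PRED": "murder", "TENSE": "past",
               "SUBJ": "who", "OBJ": "Jefferson"}

print(matches(TEMPLATE_I, question_fs))  # True
```

Once a question matches a template, its keywords are copied into the corresponding answer template to build an answer instance.<br />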



The template for passive interrogatives (form 3) is template V; the PASSIVE property with value + marks a passive sentence. The template for copula interrogatives (form 4) is template VI.<br />

For each of the defined question templates (fs_TQ), one or more answer templates (fs_TA) are defined. As mentioned in Section 4.3, if the user question matches one of the fs_TQ templates, its fs_TA templates are filled with the words of the question to make answer instances, which are used in the extended unification algorithm with the sentences of the candidate answers.<br />

The templates for answers to active interrogatives, matching forms 1 and 2, are as follows:<br />

Answer template I_A is defined based on question template I_Q. Likewise, answer template II_A is defined for question template II_Q, III_A for III_Q, and IV_A for IV_Q.



A question in active form can be answered by a passive sentence, so the template for a passive answer (V_A) is added to the answer templates for forms 1 and 2. The answer templates for questions matching form 3 (passive sentences) are as follows:<br />

Answer template V_A is defined based on question template V_Q. Here it is possible for a question in passive form to have an answer in active form; hence the four answer templates for active sentences (I_A, II_A, III_A, IV_A) are also added to the answer templates for form 3.<br />

The answer template for questions matching form 4 (copula) is as follows; it is defined based on question template VI_Q.<br />

4.5.2 What and Which Questions<br />

The templates offered for who, where, and when questions are all applicable to what and which questions. In addition, some further templates can be defined for these two question types. If a word (or expression) appears after the word which or what, it is actually the topic of the question, and the expression<br />

is replaced by<br />

in all of the previous templates.<br />

The answer templates for what and which questions are the same as those defined for who, where, and when questions.



4.6 Extended f-Structure Unification<br />

Answer extraction is the result of unifying the f-structure of the candidate answer with the answer instance (generated from the question). Experiments show that a unification strategy based on exact matching of values is not sufficiently flexible [5]. For example, under exact matching the sentence "Benjamin killed Jefferson." is not an answer to the question "Who murdered Jefferson?". In our proposed system, we consider approximate and semantic matching in addition to exact, keyword-based matching. Approximate matching is performed by ontology-based extended comparison between the different parts of the question template and the candidate answer template (including subj, obj, adjunct, verb, etc.) and comparison of their types.<br />

In our unification algorithm, by slot we mean the various parts of the templates (including subj, obj, adjunct, verb, etc.) together with their types, and by filler we mean the values (instances) that fill the slots.<br />

To determine the level of matching between fs_A and fs_C, we propose a hierarchical pattern based on exact matching, approximate matching, or no matching between the slots and fillers of the two structures. The levels for scoring the candidate answer based on the matching of fs_A and fs_C are as follows:<br />

A) All slots of fs_A exist in fs_C, and<br />

- exact matching of the fillers;<br />

- approximate matching of the fillers;<br />

- no matching of the fillers.<br />

B) All slots of fs_A exist in fs_C, plus additional slots in fs_C, and<br />

- exact matching of the fillers;<br />

- approximate matching of the fillers;<br />

- no matching of the fillers.<br />

C) Some (not all) slots of fs_A exist in fs_C, and<br />

- exact matching of the fillers;<br />

- approximate matching of the fillers;<br />

- no matching of the fillers.<br />

D) Some slots of fs_A exist in fs_C, plus additional slots in fs_C, and<br />

- exact matching of the fillers;<br />

- approximate matching of the fillers;<br />

- no matching of the fillers.<br />

For approximate matching of the fillers, the following hierarchical pattern is defined:<br />

Approximate matching of fillers of type verb:<br />

- the value in fs_C is a synonym of the value in fs_A;<br />

- the value in fs_C is a troponym of the value in fs_A;<br />

- the value in fs_A is a troponym of the value in fs_C;<br />

- the value in fs_A is a hypernym of the value in fs_C;<br />

- the value in fs_C is a hypernym of the value in fs_A.<br />

Approximate matching of fillers of the other parts of the sentence (obj, subj, adjunct):<br />

- the value in fs_C is a synonym of the value in fs_A;



- the value in fs_A is a hypernym of the value in fs_C;<br />

- the value in fs_C is a hypernym of the value in fs_A;<br />

- the value in fs_C is a meronym of the value in fs_A;<br />

- the value in fs_C is a holonym of the value in fs_A.<br />
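The filler comparison above can be sketched as follows. A tiny hand-coded relation table stands in for WordNet (in SBUQA the lookups go through JWNL), and only the synonym and hypernym cases are shown; all entries are illustrative:<br />

```python
# Sketch of approximate filler matching. A tiny hand-coded table stands
# in for WordNet; only synonym and hypernym lookups are shown, and all
# entries are illustrative.
SYNONYMS = {frozenset({"kill", "murder"})}
HYPERNYMS = {("person", "politician")}  # (hypernym, hyponym) pairs

def fillers_match(fs_a_value, fs_c_value, slot_type):
    """Return 'exact', 'approximate', or None for a pair of fillers."""
    if fs_a_value == fs_c_value:
        return "exact"
    if frozenset({fs_a_value, fs_c_value}) in SYNONYMS:
        return "approximate"
    # For non-verb slots, hypernym/hyponym pairs also match approximately.
    if slot_type != "verb" and (
            (fs_a_value, fs_c_value) in HYPERNYMS or
            (fs_c_value, fs_a_value) in HYPERNYMS):
        return "approximate"
    return None

# "Benjamin killed Jefferson." vs. the answer instance built from
# "Who murdered Jefferson?": the verb fillers unify approximately.
print(fillers_match("murder", "kill", "verb"))         # approximate
print(fillers_match("Jefferson", "Jefferson", "obj"))  # exact
```

The full hierarchy above additionally distinguishes troponyms, meronyms, and holonyms, and feeds each match kind into the level scoring of A) to D).<br />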

5 SBUQA Implementation & Evaluation<br />

SBUQA is implemented in the Java programming language as a Java applet and was developed in the Oracle JDeveloper 10g IDE. The software is composed of several functions and built-in or user-defined libraries. One of the most important libraries used is JWNL 3 , the Java API for WordNet.<br />

The software package interacts with available tools such as the probabilistic LFG f-structure parser developed by the National Centre for Language Technology (NCLT) at Dublin City University, and the JAVARAP anaphora resolution tool used in the document preprocessing component.<br />

The user enters his/her question via the user interface. The user can select the option "Find answer from the document set" to limit the answer search to a previously prepared document set. To prepare this document set, the user question is given to the Google search engine and some online question answering systems (AnswerBus, Start, Ask, …), and the first documents retrieved by these systems are saved as text files in the system's document base. We have not designed a graphical user interface for this process yet, so the user must prepare the document set manually. If the user selects the second option, "Find answer from the following input", a text area opens where the user can type a text, and the final answer is searched for within this text.<br />

After entering the question and choosing the source, the user clicks the Ask button and the process of searching for and extracting the final answer starts. Afterwards, the possible extracted answer(s) are displayed in the "Possible Answers" section in decreasing order of their assigned scores.<br />

A sample run of the software for the question "Where was George Washington born?" is shown in Figure 2.<br />

To evaluate the performance of the proposed system in finding the final answer, we selected 10 questions (two of each type: where, who, what, when, and which) from the TREC question set. These questions were selected to cover various kinds of question templates. For each question, the sentences of documents retrieved by the Google search engine and by the AnswerBus, Start, and Ask online question answering systems were extracted. The level of matching of these sentences against the answer templates was determined using the implemented tool. If a sentence matched one of the templates, the answer part was extracted from it using the tool, and its correctness or incorrectness was determined.<br />

Based on this evaluation, the precision of matching level A is 0.78, of level B 0.67, of level C 0.50, and of level D 0.33. The recall of the system is 0.54.<br />

3<br />

Java WordNet Library



Figure 2. A sample of running the software for a question<br />

Also, each test question was evaluated with several QA-system metrics: First Hit Success (FHS), First Answer Reciprocal Rank (FARR), First Answer Reciprocal Word Rank (FARWR), Total Reciprocal Rank (TRR), and Total Reciprocal Word Rank (TRWR).<br />
<br />
Possible values of FHS, FARR, FARWR, and TRWR range from 0 to 1, and the ideal value in an errorless QA system is 1. Possible values of TRR range from 0 to ∞ (it has no upper bound); a greater TRR value indicates that more correct answers were extracted. Evaluating the system with these metrics gives 0.78 for FHS and FARWR, 0.82 for FARR, 1.33 for TRR, and 0.72 for TRWR, which are good (above-average) values.<br />
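For concreteness, the rank-based metrics can be sketched in code. The paper gives no formulas, so the definitions below follow the commonly used formulations and should be read as assumptions; an answer list is represented by correctness flags in rank order (the word-rank variants FARWR/TRWR would additionally use word positions, omitted here):<br />

```java
import java.util.List;

// Sketch of the rank-based QA metrics used in the evaluation.
// Definitions are the standard formulations (assumed, not taken from the paper):
//   FHS  = 1 if the top-ranked answer is correct, else 0
//   FARR = 1 / rank of the first correct answer (0 if none)
//   TRR  = sum of 1/rank over every correct answer
public class QaMetrics {
    public static double fhs(List<Boolean> correct) {
        return (!correct.isEmpty() && correct.get(0)) ? 1.0 : 0.0;
    }

    public static double farr(List<Boolean> correct) {
        for (int i = 0; i < correct.size(); i++)
            if (correct.get(i)) return 1.0 / (i + 1);
        return 0.0;
    }

    public static double trr(List<Boolean> correct) {
        double sum = 0.0;
        for (int i = 0; i < correct.size(); i++)
            if (correct.get(i)) sum += 1.0 / (i + 1);
        return sum;
    }

    public static void main(String[] args) {
        // Ranked answers for one question: correct, wrong, correct
        List<Boolean> run = List.of(true, false, true);
        System.out.println(fhs(run));  // 1.0
        System.out.println(farr(run)); // 1.0
        System.out.println(trr(run));  // 1 + 1/3 ≈ 1.33
    }
}
```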

6 Conclusion and Future Work<br />
<br />
The SBUQA system, proposed as the third component of question-answering systems, operates on the f-structures of the question and of the candidate answers, using extended unification based on an ontology (WordNet). According to the evaluation measures for question-answering systems, SBUQA achieves good (above-average) performance in retrieving the final answer.<br />
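The extended unification step can be illustrated with a minimal sketch. This is not the authors' implementation: the synonym table below is an invented stand-in for WordNet, and the rule shown is simply that two f-structure slot values unify if they are identical or share a synset:<br />

```java
import java.util.Map;

// Minimal sketch of ontology-extended unification over f-structure slot values.
// The SYNSET table is a toy stand-in for WordNet; a real system would
// query the lexical database instead.
public class ExtendedUnification {
    // Hypothetical synonym sets: each word mapped to a synset id
    static final Map<String, Integer> SYNSET = Map.of(
        "born", 1, "birth", 1,
        "city", 2, "town", 2
    );

    // Plain unification: slot values must be identical.
    static boolean unify(String a, String b) {
        return a.equals(b);
    }

    // Extended unification: values also unify if the ontology places
    // them in the same synset.
    static boolean unifyExtended(String a, String b) {
        if (unify(a, b)) return true;
        Integer sa = SYNSET.get(a), sb = SYNSET.get(b);
        return sa != null && sa.equals(sb);
    }

    public static void main(String[] args) {
        System.out.println(unifyExtended("city", "town")); // true
        System.out.println(unifyExtended("city", "born")); // false
    }
}
```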

530 Mahsa A. Yarmohammadi et al.<br />
<br />
The proposed system is designed for open-domain wh-questions; further extensions can cover yes/no questions and other question types. The f-structure goes beyond shallow, language-dependent representations: although languages differ in their shallow representations, they can be represented by the same (or very similar) syntactic (and semantic) slot-value structures. This feature of f-structure makes it possible to apply the algorithms introduced here to other languages, including Persian. Implementing the system for Persian is not currently possible because of the lack of usable, available tools for processing Persian (such as a parser and a WordNet ontology for Persian), but we consider this a future extension of the system.<br />

References<br />

1. Yarmohamadi, M. A.: Organization and Retrieving Web Information Using Automatic<br />

Conceptualization and Annotation of Web Pages. MS dissertation, Computer Engineering<br />

Department, Faculty of Engineering, Tarbiat Modarres University, Tehran, Iran (2006)<br />

2. Dabir-Moghadam, M.: Theoretical Linguistics: Emergence and Development of Generative Grammar. 2nd edition. Samt Publication, Tehran, Iran (2004)<br />

3. Eshragh, F., Sarabi, Z.: Question Answering Systems. BS dissertation. Electrical &<br />

Computer Engineering Department, Shahid Beheshti University, Tehran, Iran (2006)<br />

4. Judge, J., Guo, Y., Jones, G. J.: An Analysis of Question Processing of English and Chinese<br />

for the NTCIR 5 Cross-Language Question Answering Task. In: Proceedings of NTCIR-5<br />

Workshop Meeting. Tokyo, Japan (2005)<br />

5. Lin, J. J., Katz, B.: Question answering from the web using knowledge annotation and<br />

knowledge mining techniques. In: Proceedings of the ACM Int. Conf. on Information and<br />

Knowledge Management (CIKM) (2003)<br />

6. Von-Wun, S., Hsiang-Yuan, Y., Shis_Neng, L., Wen-Ching, C.: Ontology-based knowledge<br />

extraction from semantic annotated biological literature. In: The Ninth Conference on<br />

Artificial Intelligence and Applications (2004)<br />

7. Molla, D., Van Zaanen, M.: AnswerFinder at TREC 2005. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005). Gaithersburg, Maryland, United States (2005)<br />

8. Kil, J. H., Lloyd, L., Skiena, S.: Question Answering with Lydia. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005). Gaithersburg, Maryland, United States (2005)


Towards the Construction of a Comprehensive<br />

Arabic WordNet<br />

Hamza Zidoum<br />

Sultan Qaboos University<br />

Department of Computer Science Po Box 36, Al Khod 123 Oman<br />

zidoum@squ.edu.om<br />

Abstract. Arabic is a Semitic language spoken by millions of people in more than 20 countries. However, not much work has been done on online dictionaries or lexical resources for it. WordNet is an example of a lexical resource that has not yet been developed for Arabic. WordNet, a lexical database developed at Princeton University, came to life 15 years ago and has since proved widely successful and extremely useful for today's natural language processing needs. Accordingly, the motivation for developing an Arabic WordNet has become strong. In this paper we tackle some of the challenges inherent in constructing an Arabic lexical reference system. The paper reviews solutions adopted in existing WordNets and presents justifications for adopting the Arabic WordNet (AWN) philosophy. We address the nominal part of Arabic WordNet (nouns as a part of speech) as the first step towards the construction of a comprehensive Arabic WordNet.<br />

Keywords: WordNet, Synsets, Arabic Processing, Lexicon<br />

1 Introduction<br />

WordNet is an online lexical reference system, which groups words into sets of<br />

synonyms and records the various semantic relations between these synonym sets. It<br />

has become an important aspect of NLP and computational linguistics [1]. Many<br />

WordNets were constructed for different languages [11, 13, 14, 15, 16, 19, 20, 22, 24,<br />

29, and 30].<br />

WordNet design is inspired by current psycholinguistic theories of human lexical<br />

memory [30]. It groups words into sets of synonyms called synsets, which are the<br />

basic building blocks of WordNet [30]. A synset is simply a set of words that express<br />

the same meaning in at least one context [1, 30]. WordNet also provides short<br />

definitions, and records the various semantic relations between these synonym sets<br />

[6]. Nouns, verbs and adjectives are organized into synonym sets, each representing<br />

one underlying lexical concept [1, 2, 3, and 4]. The lexical database is a hierarchy that<br />

can be searched upward or downward with equal speed. WordNet is a lexical<br />

inheritance system [2]. The success of WordNet is largely due to its accessibility,<br />

quality and potential in terms of NLP. WordNet was successfully applied in machine<br />

translation, information retrieval, document classification, image retrieval, and



conceptual identification [10]. Different WordNets can be aligned, enabling translation between languages, i.e. machine translation. Information retrieval can be improved by exploiting lexical relations in query-answering systems. Since WordNet links semantically related words, semantic annotation and classification of texts and documents are possible [16]. The Visual Thesaurus is a dictionary and thesaurus with an interesting interface, and an excellent way of learning English vocabulary and understanding how English words link together. It contains 145,000 words and 115,000 meanings and shows 16 kinds of semantic relationships; the user can also hear the pronunciation of a word in a British or an American accent. Once the user enters a word, it is kept in the center and all related words surround it; the user can click on a word to bring it to the center, roll over a word to learn more about it, and print the output chart [17]. Another interesting application is READER: a person reading a text can click on a word, which is linked to the WordNet lexical database, and read its meaning in the given context [21]. An English dictionary, thesaurus and word-finder program called WordWeb was developed on top of the Princeton WordNet database. It shows synonyms, antonyms, types and parts of a word, and has the advantage of integrating a dictionary and a thesaurus, unlike similar programs where the two are separate [5].<br />

As these applications show, a WordNet is indispensable for any language that aims to take part in today's ever-evolving NLP applications. Surprisingly, relatively few efforts have been made to develop an original Arabic WordNet, and filling this gap is a challenging, non-trivial task. This project aims at constructing a Nominal Arabic WordNet for Modern Standard Arabic, which will be the starting point for a complete Arabic WordNet. Our goal is an Arabic WordNet freely distributed to the community. Our objectives are (1) to produce an Arabic WordNet containing nouns, verbs and adjectives, (2) to collect basic lexical data from available resources, and (3) to create a set of computer programs that accept users' queries and display the output to them.<br />
<br />
This paper presents Micro Arabic WordNet, a first step towards developing a complete Arabic WordNet. We concentrate here on implementing the system with a subset of nouns; the other parts of speech, e.g. verbs and adjectives, are left as future work.<br />

The next section describes the Arabic language and its characteristics. Section 3 introduces the system requirements and specifications, general approaches for constructing WordNets, details of other WordNets, the reasons for adopting the Arabic WordNet (AWN) philosophy, and the challenges involved. Section 4 explains the different aspects of the system design, sketching the system architecture, data-flow diagrams, entity-relationship diagram, data structures and interface design. Section 5 lists the sample data used in the database, presents our test cases and our observations on them, and provides the performance tests. Finally, we include a cross-check validation of the requirements and discuss the work still to be done on the system.


Towards the Construction of a Comprehensive Arabic WordNet 533<br />

2 Arabic WordNet<br />

Semitic languages are a family of languages spoken by people of the Middle East and North and East Africa; they form a subfamily of the Afro-Asiatic languages. Examples of Semitic languages are Arabic, Amharic (spoken in north-central Ethiopia), Tigrinya (also in Ethiopia), and Hebrew. The name "Semitic" comes from Shem, the son of Noah. Some characteristics of Semitic languages are:<br />
<br />
(1) Word order is Verb Subject Object (VSO),<br />
<br />
(2) Grammatical numbers are singular, dual and plural, and<br />
<br />
(3) Words originate from a stem, also called the "root".<br />

Unlike English, Arabic script is written from right to left. Diacritics are indicated above or below the letters in Arabic words. Arabic morphemes are often formed by insertion between the consonants of a root form; roots are verbal and have the form (f 'a l). Arabic is cursive, written horizontally from right to left, with 28 consonants, and it is the only Semitic language with "broken plurals".<br />
<br />
It is important to state that the most distinctive feature of this work is the insistence on maintaining language-specific concepts and the intention of developing an Arabic WordNet which exhibits the language's richness rather than being driven by other incentives, such as national security concerns.<br />

In the field of lexical semantics, terms such as 'word', which we would usually define as "the blocks from which sentences are made" [30], are defined differently. It is therefore necessary to define such terms in order to comprehend the concepts that follow.<br />

2.1 Word<br />

A word is an association between a lexicalized concept and an utterance (or inscription) that plays a syntactic role [1]. For clarity, "word form" refers to the physical utterance or inscription and "word meaning" to the lexicalized concept. Associations between word forms and word meanings are many:many: some word forms have several meanings (polysemy), and some word meanings can be expressed by several word forms (synonymy) [1].<br />

Fig. 1. Synonymy and Polysemy
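This many:many association can be sketched as a pair of maps, one from word forms to meanings and one from meanings back to forms; all forms and synset ids below are invented for illustration:<br />

```java
import java.util.List;
import java.util.Map;

// Sketch of the many:many association between word forms and word meanings.
// Synset ids and example forms are invented.
public class LexicalMatrix {
    // Polysemy: one form maps to several meanings (synset ids)
    static final Map<String, List<Integer>> FORM_TO_MEANINGS = Map.of(
        "bank", List.of(101, 102),   // edge of a river; financial institution
        "shore", List.of(101)
    );

    static boolean isPolysemous(String form) {
        return FORM_TO_MEANINGS.getOrDefault(form, List.of()).size() > 1;
    }

    // Synonymy: two forms are synonyms if they share at least one meaning.
    static boolean areSynonyms(String a, String b) {
        return FORM_TO_MEANINGS.getOrDefault(a, List.of()).stream()
                .anyMatch(FORM_TO_MEANINGS.getOrDefault(b, List.of())::contains);
    }

    public static void main(String[] args) {
        System.out.println(isPolysemous("bank"));         // true
        System.out.println(areSynonyms("bank", "shore")); // true
    }
}
```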



2.2 Semantic Relations<br />

Semantic relations are very important in lexical semantics; however, prior to the appearance of WordNet they were only implicit in conventional dictionaries [28]. They are now explicit in WordNet and are the source of WordNet's richness. Semantic relations link synsets and words. Before listing them, an important distinction must be put forward: that between lexical semantic relations (Table 1) and conceptual semantic relations (Table 2). The former hold between word forms, e.g. synonymy and antonymy, whereas the latter hold between synsets, e.g. hyponymy and meronymy [30].<br />

Table 1. Lexical Semantic Relations<br />
<br />
Relation | Relation in Arabic | Definition | Example | Type<br />
Synonymy | الترادف | Similarity of meaning; two expressions are synonymous if the substitution of one for the other never changes the truth value of a sentence in which the substitution is made. | Location and place | Lexical<br />
Antonymy | التضاد | The antonym of a word x is sometimes not-x, but not always. | Rich and poor | Lexical<br />
<br />
Table 2. Conceptual Semantic Relations<br />
<br />
Relation | Relation in Arabic | Definition | Example | Type<br />
Hyponymy/Hypernymy | انضواء/احتواء | IS-A relation; a hyponym inherits all the features of the more generic concept and adds at least one feature that distinguishes it from its superordinate and from any other hyponyms of that superordinate. | Maple and tree | Semantic<br />
Meronymy/Holonymy | جزء/كل | HAS-A relation; part-whole relation. | Finger and hand | Semantic<br />
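The lexical/conceptual distinction maps naturally onto two record types, one relating word forms and one relating synsets. The sketch below is illustrative only; the relation names and synset ids are assumptions:<br />

```java
import java.util.ArrayList;
import java.util.List;

// Sketch distinguishing lexical relations (between word forms) from
// conceptual relations (between synsets), mirroring Tables 1 and 2.
// All names and ids are invented for illustration.
public class Relations {
    record LexicalRelation(String type, String wordFrom, String wordTo) {}
    record ConceptualRelation(String type, int synsetFrom, int synsetTo) {}

    public static void main(String[] args) {
        List<LexicalRelation> lexical = new ArrayList<>();
        lexical.add(new LexicalRelation("antonymy", "rich", "poor"));

        List<ConceptualRelation> conceptual = new ArrayList<>();
        conceptual.add(new ConceptualRelation("hyponymy", 201, 200)); // maple IS-A tree
        conceptual.add(new ConceptualRelation("meronymy", 301, 300)); // finger PART-OF hand

        System.out.println(lexical.get(0).type());   // antonymy
        System.out.println(conceptual.size());       // 2
    }
}
```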

In [24] the author proposed "Bootstrapping an Arabic WordNet Leveraging Parallel Corpora and an English WordNet". She studied the feasibility of projecting the meaning definitions of English words onto their Arabic translations, and concluded that it is feasible to automatically bootstrap an Arabic WordNet taxonomy, given less-than-perfect translations and alignments, by leveraging existing English resources. The results were encouraging, being similar to those of the researchers who built EuroWordNet.



Supported by the United States Central Intelligence Agency, a group of researchers, some of whom were involved in the construction of other WordNets such as Princeton WordNet and EuroWordNet, decided to undertake the task of developing an Arabic WordNet [18], for reasons such as Arabic being spoken in more than 20 countries and its vital interest to US national security [19]. The project is still under construction and is due to finish in 2007 [19].<br />

3 General approach for constructing WordNet<br />

There are two main strategies for building WordNets [19]: (1) the expand approach: translate English (Princeton) WordNet synsets into another language and take over the structure. This is the easier and more efficient method, and its outcome is structurally compatible with the English WordNet, but the vocabulary is biased by PWN. (2) The merge approach: create an independent WordNet in another language and align it with the English WordNet by generating the appropriate translations. This is more complex and requires much more work and effort, but language-specific patterns can be maintained; the resulting structure differs from that of the English WordNet.<br />

Since Arabic is a totally different language from English, the expand approach is clearly not appropriate; moreover, it is undesirable for the Arabic WordNet to be biased by the English WordNet. We therefore use the merge approach, since Arabic-specific patterns can be maintained. The Arabic WordNet being developed in [19] is centered on enabling future machine translation between Arabic and other languages, which justifies the use of tools such as SUMO. Some aspects have been adopted in our project because SUMO, for instance, is good software engineering practice (it increases the number of users). However, it is necessary to state that the most distinctive feature of our project is the insistence on maintaining language-specific concepts and the intention of developing an Arabic WordNet which exhibits the language's richness rather than being driven by other incentives.<br />

The user interface specification can be described as follows:<br />
<br />
Input:<br />
<br />
• Arabic<br />
<br />
• Noun<br />
<br />
• Singular<br />
<br />
Processing:<br />
<br />
• Search for the word in the lexical database (AWN); display synset and gloss<br />
<br />
• Display relations on the user's demand<br />
<br />
Output:<br />
<br />
• Synset of the word in Arabic<br />
<br />
• Gloss of the synset<br />
<br />
• The different relations from the current synset, i.e. synonyms, antonyms, hyponyms and hypernyms.
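The specified flow (enter word, find its lemma, find the lemma's synsets, display each gloss, relations on demand) can be sketched with an in-memory stand-in for the Arabic_wordnet database. The lemma and synset ids echo the field names in the data-flow diagrams, but the data itself is invented (Latin placeholders stand in for Arabic nouns):<br />

```java
import java.util.List;
import java.util.Map;

// Sketch of the user lookup flow: word -> lemma_id -> synset_ids ->
// glosses. In-memory maps stand in for the Arabic_wordnet database;
// all data is invented.
public class Lookup {
    static final Map<String, Integer> LEMMA_ID = Map.of("pen", 1);
    static final Map<Integer, List<Integer>> LEMMA_SYNSETS = Map.of(1, List.of(10, 11));
    static final Map<Integer, String> GLOSS = Map.of(
        10, "an instrument for writing",
        11, "an enclosure for animals"
    );

    // Returns the glosses of all synsets containing the word,
    // or an empty list when the word is not in the database.
    static List<String> lookUp(String word) {
        Integer lemma = LEMMA_ID.get(word);
        if (lemma == null) return List.of();  // "word not found"
        return LEMMA_SYNSETS.get(lemma).stream().map(GLOSS::get).toList();
    }

    public static void main(String[] args) {
        System.out.println(lookUp("pen")); // two glosses
        System.out.println(lookUp("xyz")); // []
    }
}
```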



WordNet, being a lexical database, attempts to approximate the lexicon of a native speaker [30]. The mental lexicon, i.e. the knowledge that a native speaker has about a language, is highly dense in connectivity: there are many associations between words. Constant additions of relations are therefore needed to improve the connectivity of a WordNet. This requires intensive research to discover relations that are not commonly used, since these have a lower priority of inclusion in the database. Moreover, according to [30], "one of the central problems of lexical semantics is to make explicit the relations between lexicalized concepts". A lexicalized concept is a concept that can be expressed by a word [28].<br />

One challenge in this project specific to Arabic is that Arabic texts today tend to be written without diacritics, leaving the task of disambiguation to the reader's natural mental ability, a task that is very complicated to implement on a computer.<br />

For example, the form (كتاب) could be intended to mean either (كُتَّاب) "kuttab", 'a group of writers', or (كِتَاب) "kitab", 'a book'. This example clearly demonstrates how missing diacritic marks compound lexical ambiguity.<br />
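The effect can be reproduced programmatically: stripping the Arabic diacritic code points (U+064B–U+0652) collapses the two distinct lemmas onto one bare orthographic form. A minimal sketch:<br />

```java
// Demonstrates how removing diacritics (tashkeel, U+064B–U+0652)
// collapses distinct Arabic lemmas onto the same bare form.
public class Diacritics {
    static String strip(String s) {
        return s.replaceAll("[\u064B-\u0652]", "");
    }

    public static void main(String[] args) {
        String kuttab = "\u0643\u064F\u062A\u0651\u064E\u0627\u0628"; // kuttab, 'a group of writers'
        String kitab  = "\u0643\u0650\u062A\u064E\u0627\u0628";       // kitab, 'a book'
        // Both reduce to the same undiacritized form (كتاب)
        System.out.println(strip(kuttab).equals(strip(kitab))); // true
    }
}
```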

Finally, it is a frequent criticism that much of the infrastructure for computational linguistics research or for the development of practical applications is lacking. The low involvement of Arab linguists compounds the challenge, driving some researchers to alternative means which can degrade the desired quality of the outcome.<br />

4 System Architecture<br />

The system has two main components: (1) a User System Component, and (2) a Lexicographer System Component.<br />

Fig. 2. System Architecture<br />

The latter is necessary because of the lack of electronic Arabic resources and of available lexicographers needed to populate the Arabic WordNet. The user system component retrieves information from the Arabic_wordnet database; it addresses the needs of the normal user, who is interested in finding the different synsets and relations for a given word.



Fig. 3. Data Flow Diagram of displaying overview (level 2)<br />

Fig. 4. Data Flow Diagram of displaying synset's relation (level 2)



The lexicographer system component stores information in the Arabic_wordnet database. It serves the lexicographer (linguist), who basically inserts new synsets and relations. Data flow starts when the lexicographer chooses an action: either adding a synset or editing an existing one. Dashed arrows are optional paths the lexicographer may follow.<br />

Fig. 5. Lexicographer System Component



Fig. 6. Arabic WordNet Browser – click noun button<br />

5 System Implementation and Validation<br />

In this project we used MySQL as the database management system, for several reasons. MySQL is widely used by many large companies for keeping thousands or even millions of records, and since there are thousands of words in a language, software able to handle such a large number of records is needed. MySQL is also web accessible, and querying it is straightforward. MySQL Query Browser is free software that gives a MySQL database an interface similar to Microsoft Access, and it also checks users' queries for correctness.<br />
<br />
The programming language used is Java, accessed through SunOne. Java is platform independent, allows the creation of attractive graphical interfaces, and handles Unicode characters. As we manipulate Arabic words, ASCII characters do not suffice for our purpose; Unicode characters are the solution for writing and reading Arabic text. Being an object-oriented language, Java provides a good structure and interface for the system; it is also popular for its rich library, which facilitates string manipulation.<br />

5.1 Testing<br />

Testing data was carefully chosen to cover all test cases. The test cases are shown in Table 3.



Table 3. Test Cases<br />
<br />
Test case | Covered in System | Proved<br />
Input is correct | Yes | Yes<br />
Input is wrong | Yes | Yes<br />
Input is correct and relation exists | Yes | Yes<br />
Input is correct and relation does not exist | Yes | Yes<br />

Table 4. Test Cases Results<br />
<br />
Input | Expected synsets output | Real synsets output | Expected related synsets output | Real related synsets output<br />
عين + علاقة جزء | 1. مادة، جوهر، عين (أصول الشئ التي منها يتكون الشئ); 2. عين، مقلة، طرف، بصر (عضو الإبصار للإنسان وغيره من الحيوان); 3. عين، نفس، مصيبة (نفس الشئ); 4. عين، ينبوع، نبع (ينبوع الماء ينبع من الأرض ويجري); 5. عين، جاسوس (الشخص الذي يطلع على الأخبار السرية) | Same five synsets | عين، مقلة، طرف، بصر (عضو الإبصار للإنسان وغيره من الحيوان) | Same<br />
عيت | الكلمة غير موجودة | الكلمة غير موجودة | - | -<br />

As the results in Table 4 show, the functionality of the system handles all cases and all errors as designed.<br />

5.2 Performance<br />

In this section, we will discuss the approximation of data retrieval in our system.<br />

Another test was made to approximate processing times on Intel(R) Pentium(R) 4<br />

CPU 3.20 GHz 3.19 GHz, 0.99 GB of RAM. The results using the timer in the source<br />

code are summarized in table5.<br />

Table 5. Performance test using java timer<br />

Number of test case Retrieval time of synsets Retrieval time of relations<br />

Case 1 15 milliseconds 78 milliseconds<br />

Case 2 0 milliseconds -<br />

Case 3 16 milliseconds 31 milliseconds<br />

Case 4 16 milliseconds 15 milliseconds<br />

The results using the timer in MySQL Query Browser are given in Table 6.<br />

Table 6. Performance test using MySQL timer<br />

Number of test case Retrieval time of synsets Retrieval time of relations<br />

Case 1 0.0043 seconds 0.0050 seconds<br />

Case 2 0.0096 seconds -<br />

Case 3 0.0125 seconds 0.0454 seconds<br />

Case 4 0.0125 seconds 0.0107 seconds
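The in-code timings were presumably obtained by bracketing each retrieval call with the system clock; a minimal sketch of that measurement pattern (the timed operation here is a placeholder, not the actual database query):<br />

```java
// Minimal pattern for in-code timing measurements: bracket the
// operation with the system clock and report the difference.
public class Timing {
    static long timeMillis(Runnable op) {
        long start = System.currentTimeMillis();
        op.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        // Placeholder standing in for a synset-retrieval call
        long elapsed = timeMillis(() -> {
            for (int i = 0; i < 1000; i++) Math.sqrt(i);
        });
        System.out.println(elapsed + " milliseconds");
    }
}
```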



5.3 Future Work<br />

Our current system tackles singular nominal input. In the future, we plan to implement verbs and adjectives as well as plural input; there will be separate database tables for verbs and adjectives. An algorithm will also be designed to generate the singular form of a given plural word, since only singular forms will be stored in the database. Moreover, to make the system comprehensive, we plan to integrate a morphological analyzer component that generates the sound form of a word from an inflected or derived form.<br />
<br />
Another issue to be tackled in the future is the problem of diacritics, a problem unique to Arabic: lemmas that share the same orthographical representation when stripped of diacritic marks will have to be disambiguated when the user searches for one of them.<br />
<br />
To enrich our database as much as possible, we also intend to cover Classical Arabic in addition to the Modern Standard Arabic currently covered. For additional functionality specific to Semitic languages, it would be convenient to search by roots or to display the words derived from the same root. An important plan is the ongoing upgrade of the system to a web application; to facilitate this step we have used open-source tools (MySQL and Java), so developing the application as a servlet or an applet is not a big challenge. The operating system will be Linux, with Apache as the server. Notably, the domains arabicwordnet.com, arabicwordnet.org and arabicwordnet.net have been registered.<br />
<br />
Beyond the current text-based system, we look forward to implementing a graphical Arabic WordNet, in which synsets and relations can be represented as hierarchies, trees or even radial diagrams.<br />

6 Conclusion<br />

We have realized, after researching the possibility of collecting validated data, that it is extremely difficult to populate the database because of the lack of machine-readable dictionaries and available lexicographers. We therefore decided to include a lexicographer interface in addition to the originally intended user interface. We have also built the system in a structure that enhances scalability. After upgrading the system to a web application (as discussed in the previous section), we will contact lexicographers in universities around the world to contribute to the construction of the Arabic WordNet. In the analysis phase we collected and analyzed user, lexicographer and system requirements, and developed a general idea of the system architecture. The design phase included the system architecture, data-flow diagrams, entity-relationship diagram, data structures and interface design. The implementation phase discussed the different tools used in implementing the system; in this phase we also defined the different functions and stated their pseudocode. The testing phase included database data as well as testing data, and a discussion; we also tested the performance of the system and stated the statistics. We anticipate the realization of a comprehensive Arabic WordNet once it is published on the web (the current project).<br />

References<br />

1. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An<br />

On-line Lexical Database. J. International Journal of Lexicography 3(4), 235–244 (1990)<br />

2. Miller, G.A.: Nouns in WordNet: A Lexical Inheritance System. J. International Journal of<br />

Lexicography 3(4), 245–264 (1990)<br />

3. Fellbaum, C., Gross, D., Miller, K.: Adjectives in WordNet. J. International Journal of Lexicography 3(4), 265–277 (1990)<br />

4. Fellbaum, C.: English Verbs as a Semantic Net. J. International Journal of Lexicography 3(4), 278–301 (1990)<br />

5. WordWeb, International English Thesaurus and Dictionary for Windows.<br />

http://wordweb.info<br />

6. Vancouver Webpages (undated): WordNet – Definition of Terms. Online, viewed March 2006. http://vancouver-webpages.com/wordnet/terms.html<br />

7. Online Dictionary, Encyclopedia and Thesaurus. http://www.thefreedictionary.com/WordNet<br />

8. Fathom: The Source for Online Learning. Play With Words on the Web.<br />

http://www.fathom.com/feature/1140/<br />

9. The Global Wordnet Association (GWA). http://www.globalwordnet.org/<br />

10. Morato, J.,. Marzal, M..A., Llorens, J., Moreiro, J.: WordNet Applications. In: <strong>GWC</strong>'2004,<br />

Proceedings, pp. 270–278 (2004)<br />

11. Mihaltz, M., Proszeky, G.: Results and Evaluation of Hungarian Nominal WordNet v1.0.<br />

In: <strong>GWC</strong> 2004, Proceedings, pp. 175–180 (2004)<br />

12. Princeton University. (undated). Princeton tool tops dictionary.<br />

http://www.princeton.edu/pr/pwb/01/1203/1c.shtml<br />

13. The Institute for Logic, Language and Computation. EuroWordNet.<br />

http://www.illc.uva.nl/EuroWordNet/<br />

14. RussNet Project. http://www.phil.pu.ru/depts/12/RN/<br />

15. MultiWordNet. http://multiwordnet.itc.it/english/home.php<br />

16. Wintner, S., Yona, S.: Resources for Processing Hebrew.<br />

http://www.cs.cmu.edu/~alavie/Sem-MT-wshp/Wintner+Yona_presentation.pdf (2003,<br />

Sep.)<br />

17. Visual Thesaurus. http://www.darwinmag.com/read/buzz/column.html?ArticleID=576<br />

18. Elkateb, S., Black, W., Rodriguez, H, Alkhalifa, M., Vossen, P., Pease, A.,Fellbaum, C.:<br />

Building a WordNet for Arabic. In: Proc. LREC'2006 (2006)<br />

19. Black, W., Elkateb, S., Rodriguez, H, Alkhalifa, M., Vossen, P., Pease, A., Fellbaum, C.:<br />

Introducing the Arabic WordNet Project. In: Proc. of the Third International WordNet<br />

Conference (2006)<br />

20. Elkateb, S., Black, B.: Arabic, some relevant characteristics. Presentation.<br />

21. Educational Uses of WordNet. READER: A Lexical Aid.<br />

http://wordnet.princeton.edu/reader<br />

22. BalkaNet. http://www.ceid.upatras.gr/Balkanet<br />

23. Vossen, P.: Building Methodology. Presentation.<br />

24. Diab, M.T.: Feasibility of Bootstrapping an Arabic WordNet Leveraging Parallel Corpora<br />

and an English WordNet. In: Proceedings of the Arabic Language Technologies and<br />

Resources. NEMLAR, Cairo (2004)<br />

25. Abu-Absi, S.: THE ARABIC LANGUAGE. Glossary of linguistic terms.<br />

http://www.sil.org/linguistics/GlossaryOfLinguisticTerms


544 Hamza Zidoum<br />

26. Modern Standard Arabic. http://fizzylogic.com/users/bulbul/lmp/profiles/modern-standardarabic.html<br />

27. Arabic Overview. http://fizzylogic.com/users/bulbul/lmp/profiles/arabicoverview.html#orthography<br />

28. Niles, I., Pease, A.: Linking Lexicons and Ontologies: Mapping WordNet to the Suggested<br />

Upper Merged Ontology. In: Proceedings of the IEEE International Conference on<br />

Information and Knowledge Engineering, pp 412–416 (2003)<br />

29. Vossen P.: EuroWordNet General Documentation. Version 3. WordNet Statistics.<br />

http://wordnet.princeton.edu/man/wnstats.7WN<br />

30. Fellbaum, C.(ed.): WordNet: An Electronic Lexical Database. The MIT Press, 445 pp.<br />

(1998)


Author List

Agić, Ž. 349
Agirre, E. 474
Alkhalifa, M. 387
Almási, A. 462
Álvez, J. 3
Angioni, M. 21
Atserias, J. 3
Azarova, I. V. 35
Balkova, V. 44
Barbu, E. 56
Bekavac, B. 349
Bertran, M. 387
Bhattacharyya, P. 321, 360
Bijankhan, M. 297
Black, W. 387
Bosch, S. 74, 269
Bozianu, L. 441
Broda, B. 162
Buitelaar, P. 375
Butnariu, C. 91
Calzolari, N. 474
Carrera, J. 3
Ceauşu, A. 441
Charoenporn, T. 101, 419
Clark, P. 111
Climent, S. 3
Cramer, I. 120, 178
Csirik, J. 311
Demontis, R. 21
Deriu, M. 21
Derwojedowa, M. 162
Elkateb, S. 387
Farreres, J. 387
Farwell, D. 387
Fellbaum, C. 74, 111, 269, 387, 474
Finthammer, M. 120, 178
Fišer, D. 185
Gyarmati, Á. 254
Hao, Y. 453
Hatvani, Cs. 311
Hobbs, J. 111
Hong, J. 506
Horák, A. 194, 200
Hotani, C. 209
Hsiao, P. 220
Hsieh, S. 209, 474, 506
Huang, C. 209, 220, 474, 506
Ion, R. 441
Isahara, H. 101, 419, 474
Jaimai, P. 101
Kahusk, N. 334
Kalele, S. 321
Kanzaki, K. 474
Ke, X. 220
Kerner, K. 229
Kirk, J. 387
Koeva, S. 239
Kopra, M. 321
Krstev, C. 239
Kunze, C. 281
Kuo, T. 209
Kuti, J. 254, 311
le Roux, J. 269
Lemnitzer, L. 281
Lüngen, H. 281
Maks, I. 485
Mansoory, N. 297
Marchetti, A. 474
Marina, A. S. 35
Martí, M. A. 387
Mbame, N. 304
Melo, G. 147
Miháltz, M. 311
Mohanty, R. K. 321
Mokarat, C. 101
Monachini, M. 474
Moropa, K. 269
Neri, F. 474
Nimb, S. 339
Oliver, A. 3
Orav, H. 334
Pala, K. 74, 194
Pandey, P. 321
Parm, S. 334
Pease, A. 387
Pedersen, B. S. 339
Piasecki, M. 162
Poesio, M. 56
Prószéky, G. 311
Raffaelli, I. 349
Raffaelli, R. 474
Ramanand, J. 360
Rambousek, A. 194, 200
Reiter, N. 375
Rigau, G. 3, 474
Riza, H. 101
Robkop, K. 419
Rodríguez, H. 387
Rouhizadeh, M. 406, 520
Segers, R. 485
Shamsfard, M. 406, 413, 520
Sharma, A. 321
Sinopalnikova, A. A. 35
Sornlertlamvanich, V. 101, 419
Spohr, D. 428
Ştefănescu, D. 441
Storrer, A. 281
Su, I. 209, 220
Sukhonogov, A. 44
Szarvas, Gy. 311
Szauter, D. 462
Szpakowicz, S. 162
Tadić, M. 349
Tesconi, M. 474
Tufiş, D. 441
Tuveri, F. 21
Vajda, P. 254
VanGent, J. 474
Váradi, T. 311
Varasdi, K. 254
Veale, T. 91, 453
Vider, K. 334
Vincze, V. 462
Vitas, D. 239
Vliet, H. 485
Vossen, P. 200, 387, 474, 485
Weikum, G. 147
Xu, M. 506
Yablonsky, S. 44
Yarmohammadi, M. A. 406, 520
Zawisławska, M. 162
Zidoum, H. 531
Zutphen, H. 485
