GENETER, GENETER Plus and INTERLEX Toolkit: A system for terminological and general vocabulary management

Pedro Luis Díez Orzas, Universidad Alfonso X
André Le Meur, Université du Rennes
José Simón Granda, Universidad de Alcalá de Henares-Tecnolingua
Abstract

Today, a considerable number of electronic terminological databases are available through the Internet or on CD-ROMs. The diversity of formats, structures and data models makes it very difficult to develop management and publishing tools that could be shared by all or most of the different lexical and terminological resources.

GENETER has been developed at the University of Rennes 2. It is a modular and general standard format designed to be used as an intermediate format for the representation of terminological data. Due to its flexibility and capacity it has been proposed to become an ISO standard. This generic format is intended to facilitate dissemination of data on networks and their reuse in local applications. It is based on the idea of a conceptual model described in SGML, a "meta DTD", capable of holding data which come from different models. This model is "closed", "predictable" and independent from the source models. Programs can therefore be built on such a structure in order to produce different views of the data, like HTML, and outputs (specific import format, for example). This format can represent various terminological structures according to the principles defined in ISO 704 (Principles and Methods of Terminology), ISO 1087 (Terminology - Vocabulary) and ISO 10241 (International terminology standards - Creation and presentation). It is used to allow automatic processing of those data elements defined in these standards.

In the INTERLEX Project (MLIS-103), former experience in semantic networks and lexicographical tools has been used to create a dynamic database that shares the principles of GENETER and will gradually become compatible with it. GENETER PLUS is a relational database model that allocates every piece of information from the DTD in a given field, within a given level. G+ has four levels conveniently interlinked through a set of cross-reference tables and their corresponding indices.
These levels allow the handling of lexical and terminological descriptors, translation, ambiguity representation, polysemy and synonymy, other lexical and semantic relationships, as well as the performance of other lexicographical tasks. We use the conceptual approach to lexical and terminological representation, which is not only useful for completing the information for the existing entries (e.g. adding tags, attributes or examples) but is also absolutely necessary for more sophisticated data processing: creating one bi-directional conceptual dictionary from two bilingual alphabetical dictionaries, creating one multilingual dictionary containing new translation pairs from several bi-directional dictionaries, and merging several multilingual terminological dictionaries.

All these processes (conceptual unification, multilingual and bilingual merging, and regular maintenance) need manual verification. Some of the manual and automatic tasks are already supported by features and functions of the INTERLEX Toolkit.
1. Introduction

Electronic publishing relies on parallel R&D explorations in the area of Computational Linguistics and other related fields. The treatment of lexical reference information follows many different approaches, formats and structures, each with its own needs for data representation and exploitation. The wide range of terminological and general dictionary formats, structures and purposes is an obvious obstacle for publishers and developers of such resources who want access to linguistic and non-linguistic technologies. The rapid evolution and growth of terminology and the multilingual reality of European society mean that it is extremely important to facilitate access to these technologies for the maximum number of researchers, authors, editors and publishers of lexical and terminological works. Some of the needs for the development of terminological information are:
The answers are not simple solutions: theoretical, formal, technical, organisational and political aspects are only some of the elements involved. The concept of "standard" then becomes crucial at this crossroads of disciplines. In the present work we will show our tentative contributions to a better lexical and terminological data management and publication strategy, providing formalisms, technologies and applications.
2. GENETER: a standard format for terminological data interchange

Terminological databases and bilingual or multilingual dictionaries used by translators, technical writers and language planners have varied structures, which constitute an obstacle to the exchange and network distribution of data. GENETER offers a generic model for mapping these data into a common format. This model makes it possible to define the processing needed for the distribution and reuse of such data. The availability of these standards for the presentation of terminological and lexicographical data enables the processing, distribution and sharing of these data from different sources and their reuse in local or other applications.

GENETER is an intermediate format for the representation of terminological and lexicographical data. This model can represent various terminological structures according to the principles defined in ISO 704 (Principles and Methods of Terminology) and ISO 1087 (Terminology - Vocabulary). It is used to automatically process the data categories defined in these standards, and it uses the ISO 12620 designation of data categories. It also deals with the structures of technical multilingual dictionary projects. In order to facilitate its use on networks, this model uses the HTML standard for the representation of textual data.

The GENETER model is based on the idea of a generic format that can be represented either by an entity-relationship model or by a "meta-DTD" in SGML. Such a model is "closed", "predictable" and independent of the source models.
Figure 1: Diagram showing principles involved

Each source file is converted to GENETER format using a specific procedure (Converters: C). A Derivation Converter (independent of source files) can produce a representation format (HTML) or an importation format (EURODICAUTOM, TERMIUM, etc.).

2.1 Brief description of GENETER format

The model specifies nine types of GENETER objects. The first two describe linguistic objects: terminological entries (composed of linguistic data, administrative information, descriptive elements and information about relations between objects and elements), and lexicographical entries. The other elements are secondary objects: bibliographical information, HTML pages, description of people, description of corporate bodies, elements of documentary language, binary data (image or sound) and other (free) objects.

A terminological entry is composed of an optional block of language independent information (Language Independent Level - <LIL>) and one or more blocks of language dependent information (Language Dependent Level - <LDL>). The LIL level is made up of information that is independent of linguistic representations of a notion in a language. For a given language, the structure of the LDL level has three parts: a level that is independent of linguistic representations, a level that is related to a linguistic representation (Term Level - <TL>) and a level that is linked to a component of linguistic representation (Term Component Level - <TCL>).

Figure 2: Diagram showing structure of a terminological entry

Data categories are organised into Groups:
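The nesting of a terminological entry described in section 2.1 (an optional LIL block, one or more LDL blocks, each holding terms and their components) can be pictured with a small data-structure sketch. This is only an illustration in Python: the class names, the field names and the sample "definition" category are assumptions made for the example and are not taken from the GENETER DTD.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TermComponent:
    """Term Component Level (<TCL>): one component of a term."""
    text: str


@dataclass
class Term:
    """Term Level (<TL>): one linguistic representation of the notion."""
    text: str
    components: List[TermComponent] = field(default_factory=list)


@dataclass
class LanguageBlock:
    """Language Dependent Level (<LDL>) for one language."""
    language: str
    # Information for this language that does not depend on a particular
    # term (the keys used here are hypothetical).
    info: Dict[str, str] = field(default_factory=dict)
    terms: List[Term] = field(default_factory=list)


@dataclass
class TerminologicalEntry:
    """An optional <LIL> block plus one or more <LDL> blocks."""
    ldl: List[LanguageBlock]
    lil: Optional[Dict[str, str]] = None

    def __post_init__(self) -> None:
        # The text requires at least one language dependent block.
        if not self.ldl:
            raise ValueError("an entry needs at least one LDL block")


entry = TerminologicalEntry(
    ldl=[
        LanguageBlock(
            language="es",
            info={"definition": "(hypothetical definition text)"},
            terms=[Term("base de datos",
                        [TermComponent(w) for w in "base de datos".split()])],
        ),
        LanguageBlock(language="en", terms=[Term("database")]),
    ]
)
```

Because the structure is closed and predictable, a program that walks `entry.ldl`, then `terms`, then `components` can produce any derived view without knowing the source format.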
2.2 GENETER: a possible ISO standard for the year 2000

GENETER was presented at the TC37/SC3 meeting in 1997 (Copenhagen). A first draft was submitted for comments (Stockholm 1998). A New Work Item proposal will be voted on in Berlin in August 1999. Over the past two years, GENETER has also been used by two MLIS projects of the European Community for the dissemination of linguistic data on the Internet:
There is a close co-operation with two other MLIS projects: NORDTERM (five Nordic countries) and INTERLEX, which is presented in this paper.
3. INTERLEX TOOLKIT and GENETER PLUS

In the INTERLEX Project (MLIS-103), former experience in semantic networks and lexicographical tools has been used to create a dynamic database that shares the principles of GENETER and will gradually become compatible with it. The necessary expertise is provided by the academic and industrial partners in the different areas covered. The Universidad Alfonso X El Sabio (Madrid, Spain) contributed to the design and development of the lexicographic and semantic tools and the web site. Tecnolingua, a leading Spanish software company in Computational Linguistics, has developed the software, tools and web site. The project also unites a number of well-known European dictionary publishers, namely Editorial Everest (Spain), Everest Editora (Portugal) and La Maison du Dictionnaire (France). These publishers will make their general and terminological, bilingual and multilingual resources available for exploitation on the Internet for professionals, students and all those interested in translation and the equivalence of languages.

INTERLEX will make available shared general and terminological dictionaries, currently for English, Spanish, German, French, Italian and Portuguese. The project will contribute to new ways of accessing bilingual and multilingual dictionaries on the Internet, where the user can benefit from the continual updating of these resources, which can be enhanced by the incorporation of new titles and publishers in the future.

We also want to introduce those involved in publishing and using lexical resources to three advantages of using this method. Firstly, it contributes to a better understanding of how the mechanisms used to improve and label bilingual and multilingual dictionaries operate in a database which has fast access engines.
Secondly, a methodology is being established which adapts the shared resources into a common format in order to exploit bilingual and multilingual dictionaries. Finally, the project is contributing to the study of commercial approaches to exploiting lexical resources on the Internet, and particularly the evolving relationship between the user/customer and the developer/provider.

In addition to the benefits outlined above, the project will also develop technological and methodological advances in the field, such as a database management system for terminological and lexical tasks which will be compatible with GENETER. The basic idea is to load lexical and terminological data, previously converted into GENETER format by a DTD, into a dynamic database for data storage and management. The database model is called GENETER PLUS (G+), and it works with different tools for dictionary management, creation and maintenance. Any lexicographical and terminological data which can interact with a DTD for GENETER will be ready to use G+ and the INTERLEX Toolkit.

GENETER PLUS is based on a relational database model that allocates every piece of information from the DTD in a given field within a given level. G+ has four levels conveniently interlinked through a set of cross-reference tables and their corresponding indices. These levels allow the handling of lexical and terminological descriptors, translation, ambiguity representation, polysemy and synonymy, and other lexical and semantic relationships, as well as the performance of other lexicographical tasks. All the levels are connected, and the higher levels ensure the relationship among the connected sublevels. For example, a multiword term will have one record in the Term level, which is connected to several records in the Word level.
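The multiword example above can be sketched as two relational tables joined by a cross-reference table. The sketch uses Python's standard `sqlite3` module; the table names, column names and the sample terms are assumptions chosen for illustration and do not reproduce the actual G+ schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, form TEXT NOT NULL);
CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT NOT NULL UNIQUE);
-- Cross-reference table linking the Term level to the Word level.
CREATE TABLE term_word (
    term_id  INTEGER NOT NULL REFERENCES term(term_id),
    word_id  INTEGER NOT NULL REFERENCES word(word_id),
    position INTEGER NOT NULL,
    PRIMARY KEY (term_id, position)
);
-- Index for the bottom-up direction (from a word to its terms).
CREATE INDEX idx_term_word_word ON term_word(word_id);
""")


def add_term(form):
    """Store one Term record and link it to one Word record per word."""
    term_id = conn.execute(
        "INSERT INTO term (form) VALUES (?)", (form,)).lastrowid
    for position, word in enumerate(form.split()):
        conn.execute("INSERT OR IGNORE INTO word (form) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT word_id FROM word WHERE form = ?", (word,)).fetchone()[0]
        conn.execute("INSERT INTO term_word VALUES (?, ?, ?)",
                     (term_id, word_id, position))
    return term_id


def terms_containing(word):
    """Return every term that contains the given word."""
    rows = conn.execute(
        """SELECT DISTINCT t.form
           FROM term t
           JOIN term_word tw ON tw.term_id = t.term_id
           JOIN word w ON w.word_id = tw.word_id
           WHERE w.form = ?
           ORDER BY t.form""", (word,))
    return [row[0] for row in rows]


add_term("data base")
add_term("data interchange format")
```

With this layout, `terms_containing("data")` reaches both terms through the shared Word record, which is the kind of access from a single word to the full expression that the Word level is meant to provide.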
3.1 GENETER PLUS

For the INTERLEX project, an overall model of representation was required that could potentially handle any possible terminological or lexicographical structure with reinforced search and retrieval capacities. To achieve these aims, the original format has been extended in two ways: firstly, by adding new fields to the existing GENETER levels, including grammatical (morphological, syntactic and pragmatic) information that is usually found in lexicographical references; and secondly, by adding a new 'Word' or 'Lexeme Level' that ensures access to any term through each one of the words in it.

The final structure has four levels. Three of them are adapted from a previous version of GENETER, and the fourth has been included to allow access from a single word to the expression. These levels become gradually more language-specific as we go down the scale from the most abstract (language-independent) level to the most specific (language-dependent). The conceptual level is thus absolutely abstract (language-independent) whilst the lexical level is completely language-dependent. A simplified version of the structure of this database is shown in the following table:
Table 1: GENETER PLUS structure

This structure is then implemented in INTERLEX tools as a set of several separate groups of tables. One group, entirely language-independent, corresponds to the conceptual level. Related to this group, there is a set of tables for each language. The relations between two different languages are not direct but go via the conceptual level. In this scenario, multilingualism and translation are performed through the Concept Level (which matches a sense in a language with the equivalent sense in another language). Several Term Level records pointing to the same Sense Level record represent different synonyms, while polysemy is indicated by the fact that a Sense Level record is pointing to several Term Level records.

Figure 3: Multilevel representation and indexing

For each language, all these levels are connected through an intricate system of bi-directional indices, which ensure the possibility of top-down as well as bottom-up explorations of the database. It is thus possible to find and retrieve information regarding the lexemes that make up a given term for a certain concept (top-down retrieval) or, conversely, given a lexeme, to retrieve all associated concepts with the corresponding senses and terms in any other language (bottom-up search).

Figure 4: INTERLEX GENETER PLUS tool

In addition to all these features, the most outstanding characteristic of the system is its capacity to incorporate information from general bilingual dictionaries into the extended terminological format that GENETER PLUS provides. Lexicographical information related to the entry is stored in the Lexeme Level tables. Information referring to different synonyms is stored in separate records at the Term level, as already stated, and different acceptations are represented by different senses.
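The top-down and bottom-up explorations described above can be illustrated with a pair of in-memory indices. This is a minimal Python sketch under simplifying assumptions: each (concept, language) pair stands in for one Sense Level record, lexemes are obtained by splitting terms on spaces, and the concept identifiers and sample terms are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy data: (concept id, language, term) triples.
RECORDS = [
    ("C1", "en", "data base"),
    ("C1", "es", "base de datos"),
    ("C2", "en", "military base"),
    ("C2", "es", "base militar"),
]

# Downward index: concept -> language -> terms.
terms_by_concept = defaultdict(lambda: defaultdict(list))
# Upward index: (language, lexeme) -> concepts.
concepts_by_lexeme = defaultdict(set)

for concept, lang, term in RECORDS:
    terms_by_concept[concept][lang].append(term)
    for lexeme in term.split():
        concepts_by_lexeme[(lang, lexeme)].add(concept)


def top_down(concept, lang):
    """Lexemes that make up the terms of a concept in one language."""
    terms = terms_by_concept[concept][lang]
    return sorted({lexeme for term in terms for lexeme in term.split()})


def bottom_up(lang, lexeme):
    """Concepts reached from a lexeme, with their terms in every language."""
    found = concepts_by_lexeme.get((lang, lexeme), set())
    return {
        concept: {l: list(ts) for l, ts in terms_by_concept[concept].items()}
        for concept in sorted(found)
    }
```

Here `top_down("C1", "es")` lists the lexemes of the Spanish term for concept C1, while `bottom_up("en", "base")` starts from one English lexeme and returns both concepts together with their Spanish terms. Translation always passes through the concept, never directly between languages.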
This way, all the elements of information that come linearly arranged in a conventional dictionary are here ranked in a set of hierarchical levels without loss of information and with the additional advantage of a much more flexible retrieval system.

Some of the fields have been included in an attempt to foresee data source requirements other than general or terminological dictionaries which will be in demand in the future. This is the case of the REL-TARGET field (at the Concept Level) whose purpose is to include information (a code) about related concepts. Other fields may appear to be redundant (this is the case of Sense Number, for example), but we have decided to include them in order to be able to perform several checks.

Finally, this hierarchical structure can be easily mapped onto almost any other potential structure: running text, tagged text (HTML, SGML, TEI, etc.) or any other relational or documentary database. The data can then be maintained in the GENETER PLUS format and be exported to any of these formats when required.

3.2 Overview of the INTERLEX data handling process

A considerable number of lexical dictionaries are organised alphabetically, while many terminological dictionaries are organised conceptually. The range of formats is as wide as the number of publishers, or even the number of dictionaries. This situation makes it difficult to communicate between different formats and share tools. The traditional alphabetical ordering seems to be simpler for storage and handling for publishing, but completing, enriching and maintaining entries is tedious and expensive. On the other hand, a concept-based organisation of vocabulary (lexical or terminological) requires more effort to store and handle the data, but it facilitates other a priori much more complex tasks, like semiautomatic completion of synonym groups, merging of different dictionaries or new tagging.
In a conceptually organised lexical and terminological database, one entry is a concept to which one sense per language is associated; to each of these senses, a group of terms, which constitutes a synonym group or synset, is linked. Thus, for example, every time a new tag needs to be added for a given sense, we only have to tag it once, at the level of the sense, and it will be distributed among all the terms in all the languages (where applicable) for that acceptation in the alphabetical view or printing of the data. The full process can be illustrated as shown in the figure below.

Figure 5: INTERLEX Tools work flow

We illustrate two different starting points of the resource organisation. In order to be ready for further processes, alphabetically-organised dictionaries must be reorganised in such a way that all the synonyms in both languages, in bilingual dictionaries, are grouped into synonym sets, a process that we refer to as conceptual unification. GENETER has been extended to cover lexical information in an alphabetical organisation, and that feature could be used in the future when full compatibility is ensured. Filtering is carried out from the original format into the Lexical GENETER format, and from this to the so-called unification. The unification process involves collecting all the synonym acceptations which are distributed among the alphabetically listed entries (building synsets), keeping the translation equivalences between the synsets or concepts instead of between words and terms. The following figures illustrate the usual structure of alphabetical bilingual dictionaries and some examples of the mismatches between the two directions of the bilingual dictionaries, which are barriers to the collection of synsets or unification.
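The mismatches between the two directions of a bilingual dictionary can be made concrete with a short sketch. It is not the INTERLEX Conceptual Unification Algorithm, whose details are not given here; it only compares the translation pairs found in each direction, assuming a simplified input where each headword maps to a flat list of translations. The function names and the sample English-Spanish data are hypothetical.

```python
def translation_pairs(dictionary, reverse=False):
    """Flatten a headword -> translations mapping into a set of pairs.

    With reverse=True the pair is flipped, so that pairs from both
    directions are expressed as (language A word, language B word).
    """
    pairs = set()
    for headword, translations in dictionary.items():
        for translation in translations:
            pairs.add((translation, headword) if reverse
                      else (headword, translation))
    return pairs


def find_mismatches(a_to_b, b_to_a):
    """Report pairs present in only one direction, plus the completed set."""
    forward = translation_pairs(a_to_b)
    backward = translation_pairs(b_to_a, reverse=True)
    return {
        "only_a_to_b": sorted(forward - backward),
        "only_b_to_a": sorted(backward - forward),
        "completed": sorted(forward | backward),
    }


# Hypothetical sample data for the two directions.
en_es = {"car": ["coche", "automóvil"], "automobile": ["automóvil"]}
es_en = {"coche": ["car"], "automóvil": ["car"]}

report = find_mismatches(en_es, es_en)
```

In this example the pair ("automobile", "automóvil") occurs only in the English-Spanish direction, so it appears under `only_a_to_b`. A lexicographer would still need to verify each such pair by hand before it is added to the other direction.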
The second complex process is the merging of bilingual and multilingual resources. These two types of resources need to be treated separately, since in the merging of multilingual resources we compare more than one language, checking all possible overlaps and identifying which entries belong to the same concept, increasing the number of languages or synonyms as necessary.

Figure 8: Merging multilingual conceptual dictionaries

Regarding bilingual resources, the merging process requires a connection point or bridge to be crossed. If the two bilingual dictionaries do not share a language, any automation of the merging process will have to use some other resource which can provide the linkage. However, publishers normally include their national language as one of the languages in each pair of their bilingual dictionary collections. This language can be used as a common language, that is, as a pivot language or core language that allows for the union of the rest of the languages. In the following picture we show how the two directions of bilingual dictionaries are unified to become bi-directional bilingual dictionaries (conceptually organised), and these are merged by matching the shared or core language.

Figure 9: Merging process through a core language

As a side effect, the unification process increases the number of entries of the dictionaries automatically by incorporating all the mismatched items (if A can be translated by B, B can be translated by A). The merging process produces similar effects. The difference is that the translations of the mismatched items from the two core language sets must be added by hand.

We have seen that the conceptual approach to lexical and terminological representation is not only useful for completing the information for the existing entries (like adding tags, attributes or examples), but is absolutely necessary for performing more sophisticated data processing, which can be summarised as follows.
One bi-directional conceptual dictionary can be created from two bilingual alphabetic dictionaries. The first task we consider highly complex is to convert the two directions of a bilingual dictionary.

One multilingual dictionary or database containing new translation pairs can be created from several bi-directional dictionaries. This can be considered as the final goal of the process. It is during this process that the biggest gains can be found. Publishers could save a lot of time by adopting this approach as opposed to building up the same new pairs by traditional methods.

Several multilingual terminological dictionaries can be merged, which ensures maintenance of the newly created translation pairs as well as of new multilingual dictionaries to be added or checked.

All these processes (conceptual unification, multilingual and bilingual merging, and regular maintenance) need manual verification. Both manual and automatic tasks are supported by features and functions integrated in the INTERLEX Toolkit. The INTERLEX Toolkit is now under development. Some of the steps of the automatic data handling process, such as conceptual unification, have already been achieved and the results are currently being verified by hand using the toolkit.

3.3 INTERLEX TOOLKIT 1.0

With regard to the problem of handling terminological and lexicographical information in a database from a purely theoretical perspective, three main tasks can be considered essential: 1) defining the model of description to be used, 2) importing existing data into such a model and 3) maintaining the data within it. In our case, the model (G+) has already been explained. The maintenance tools, which are still under development, depend mostly on the requirements of the owners, on the one hand, and of the potential users, on the other. Seemingly, importing existing data is the most trivial of the tasks. Nevertheless, as experience has taught us, this task may become annoyingly cumbersome.
There are two possible starting points: the original data may be either conceptually organised or alphabetically organised. In the first case, any publisher could build a DTD to convert the data straight from its format into GENETER or, conversely, from GENETER into its original format to plug in/out the toolkit. In the case of alphabetically arranged data, either the format is first adapted outside the toolkit to use a GENETER DTD, or an 'ad hoc' filter should be built to transfer the data into GENETER PLUS.

In practice, further problems are frequently encountered. Even in the most favourable case, that of conceptually organised glossaries, the number of gaps in certain languages and the discrepancies between data formats make it advisable to write filters that transfer the data into the final model from scratch. This is the strategy that we have adopted with INTERLEX.

For the time being, as work is still in progress, the operation of the INTERLEX toolkit responds essentially to the requirements of the INTERLEX Project MLIS-103, in which data from the publishers Editorial Everest (ES) and Everest Editora (PT) (five bilingual general dictionaries in all) and La Maison du Dictionnaire (FR) (eight multilingual terminological dictionaries) are processed. Due to the diversity of sources, the tasks have been organised into the following blocks.
For the first block, a set of INTERLEX Dictionary Filters has been built that transfers data from their original format into a common temporary structure that allows gaps to be filled in and concepts to be unified in a more efficient way. These filters perform a set of consistency checks that detect and correct a good number of errors. Thus, once filtered, the data are much more accurate than they were at their source.

For the second block of tasks, two different tools have been implemented. The first one, known as the LMD Tool Kit, deals with conceptually organised data, whilst the other, called EVES Tool Kit, deals with alphabetically organised dictionaries. The LMD Tool Kit in turn incorporates several other features such as:
This tool is accompanied by a Merging Tool that cross-checks the data in several dictionaries and languages and then merges them into a single database. The EVES Tool Kit consists mostly of what we refer to as the INTERLEX Conceptual Unification Algorithm, which copes with the automatic unification and reorganisation of alphabetically-organised dictionaries into conceptually-organised dictionaries. This algorithm checks bilingual dictionaries in both directions, defining concepts, correcting mistakes and detecting inconsistencies between them. The algorithm adds in many acceptations and synonyms which occur in one direction but which have no correlate in the other, thus helping to complete both directions. Besides this automatic algorithm, the toolkit includes similar editing capacities to the LMD Tool Kit. The toolkit will enable lexicographers to review, check, correct and complete each of the dictionaries. Once all the data sets have been merged, checked and completed, they will eventually be exported to the GENETER PLUS format. The INTERLEX GENETER Plus Maintenance Tool will be in charge of data maintenance and will also allow data to be transferred back to the original proprietary formats as well as to several other standard formats (ANSI, HTML, SGML, GENETER, etc.) either with or without encryption. In addition, the tables can be exported to any of the most popular corporate databases (Oracle, DB2, SQL Server, and Informix) to allow storage and retrieval in a standard Internet Server.
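The merging of two bi-directional dictionaries through a core language, as outlined in section 3.2, can be sketched as follows. This is an illustration only, not the INTERLEX Merging Tool: it assumes each concept is a mapping from a language code to a set of terms, and it uses a deliberately naive matching rule (two concepts match when their core-language term sets overlap). The function name and the sample data are hypothetical.

```python
def merge_through_core(first, second, core):
    """Merge two conceptual dictionaries that share the language `core`.

    Returns (merged, unmatched): concepts joined through the core
    language, and concepts left over for manual completion.
    """
    merged, unmatched = [], []
    used = set()  # positions in `second` that have already been matched
    for concept in first:
        match = None
        for position, other in enumerate(second):
            # Naive rule: share at least one term in the core language.
            if position not in used and concept[core] & other[core]:
                match = position
                break
        if match is None:
            unmatched.append(concept)
            continue
        used.add(match)
        combined = {lang: set(terms) for lang, terms in concept.items()}
        for lang, terms in second[match].items():
            combined.setdefault(lang, set()).update(terms)
        merged.append(combined)
    unmatched.extend(c for i, c in enumerate(second) if i not in used)
    return merged, unmatched


# Hypothetical Spanish-English and Spanish-French conceptual dictionaries.
es_en = [
    {"es": {"coche", "automóvil"}, "en": {"car"}},
    {"es": {"banco"}, "en": {"bank"}},
]
es_fr = [
    {"es": {"coche"}, "fr": {"voiture"}},
    {"es": {"silla"}, "fr": {"chaise"}},
]

merged, unmatched = merge_through_core(es_en, es_fr, core="es")
```

The merged concept yields a new English-French pair (car, voiture) that neither source dictionary contained, while the unmatched concepts are exactly the items whose missing translations must be added by hand. Because a shared core-language term does not guarantee a shared sense, every automatic match of this kind still needs manual verification.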
4. Conclusion

We can conclude that GENETER, GENETER PLUS and the INTERLEX Toolkit may make a significant contribution to the improvement of publishing tasks by automating time-consuming tasks such as recovering old (non-electronic) dictionaries or checking data integrity and coherence. The adoption of GENETER as a standard format will benefit publishers of multilingual reference works.

The full maintenance and development of multilingual dictionaries demands a flexible database that allows data to be exported to and imported from different electronic publishing formats. We regard the GENETER PLUS-based INTERLEX toolkit as a plug-in/plug-out system: using these tools for certain sophisticated processes by no means implies leaving aside in-house proprietary formats, tools and technology. Data can simply be transferred to and from specific formats, using GENETER as an intermediate format and the INTERLEX tools as intermediate tools.

We are aware that much still remains to be done in the domains of computational lexicography and terminology in order to fully automate the most annoying tasks. However, with the adoption of the GENETER PLUS model and the tools we have implemented so far, we strongly believe that we have taken some beneficial steps in that direction. Investigations in the domain of automatic semantic network generation have also been undertaken, which places us in a good position to tackle the promising task of merging several bilingual dictionaries into a single multilingual database. The modules implemented so far will also make it relatively simple to quickly develop new 'ad hoc' filters for other potential structures in the future. This in turn implies that, from now on, we will not only be maintaining the existing databases, but also reinforcing them with new glossaries and dictionaries, given that most of the groundwork has already been carried out.