Medical terminology in a translation environment

Christophe Declercq, Erik Snoeijers
Department of Translators and Interpreters, Katholieke Vlaamse Hogeschool, Antwerp, Belgium
Abstract
Keywords: memory-based translation, corpus annotation, corpus compilation, term bank construction, medical terminology, parallel corpora, knowledge management, AI implementations, Natural Language Processing (NLP) tools
I. Introduction

The main objective of our project was to demonstrate that an interactive memory-based translation system can improve the translation process substantially in terms of speed, quality, and terminological and stylistic consistency. The pivotal part of our project is the development of an annotated bilingual term bank of medical terminology in a format (TRADOS MultiTerm) which is compatible with a translation memory. This term bank will be integrated into TRADOS Translator's Workbench. Through research into the possibilities of semi-automatically constructing terminology lists and term banks on the basis of electronic corpora, we also aim to test and develop a semi-automatic, self-learning system for corpus analysis.
Fig. 1: project scheme

The system comprises a terminology database and a translation memory (i.e., a database of previously translated text units), coupled with a set of tools (taggers and alignment programs) that process the given text material, in this case medical texts. Subsequent aims of the project are to determine the advantages and drawbacks of memory-based systems, to come up with suggestions for remedying their shortcomings, and to expand and export the productivity evaluation of the system. In our project, the corpus will be used for NLP research, but at the same time it represents a realistic view of what translation offices are supplied with in a specific domain. It consists of parallel, aligned texts in various language combinations, and can be divided into three parts [1]. The core of our text material comprises over 2,100 medication leaflets in Dutch and French, totalling roughly 4 million words for each language. In addition, we have two subcorpora containing text pairs from translation offices which cover a wide range of topics and language pairs. The research described in this paper has been carried out on the first part of the corpus.
II. Translation quality and quantity

The main interest of translation offices and professional translators is to reduce the amount of time and money spent on translation while producing large volumes of high-quality output. When translating technical or highly specialised texts, however, the translator is confronted with a number of problems. With regard to domain-specific knowledge and terminology, a translator needs to invest in terminology databases and keep them up to date. The main issues in modern terminology relate to the trustworthiness, complexity and availability of the resources. Even if sufficiently authoritative paper resources can be located, the process of converting them into machine-readable form remains time-consuming. Electronic databases, on the other hand, are less costly to process, but their quality may be harder to establish. Moreover, some resources are only available commercially, and the problem remains that translators are not subject field experts and subject specialists are not terminologists. It should be one of the inherent capacities of modern technology to improve the distribution, accuracy and quality of terminology, and therefore of the translation process itself. The key issue in terminology and knowledge management is the re-usability of information and knowledge. Translation problems can partly be overcome by using technology taken from NLP and AI: text recognition, parsing and tagging, text alignment, terminology extraction, etc. One technological innovation that can help both translators and domain specialists is the translation memory. It stores previously translated material and uses these past translations to propose a solution for each new translation problem by means of 'fuzzy matching'. TRADOS Translator's Workbench can also be built up by introducing pre-aligned texts and their translations into the translation memory. Though the processes involved are not strictly the concern of the field of terminology, they share some of the problems that can be encountered in information compilation.
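To make the idea of 'fuzzy matching' concrete, the following minimal sketch shows a translation-memory lookup. The memory contents, similarity threshold and function name are our own illustration; commercial systems such as TRADOS use far more sophisticated (and proprietary) matching and indexing.

```python
# Minimal sketch of translation-memory fuzzy matching over a toy in-memory
# store; segments and threshold are invented examples.
from difflib import SequenceMatcher

# Hypothetical memory: source segments mapped to their stored translations.
memory = {
    "Shake the bottle well before use.": "Bien agiter le flacon avant usage.",
    "Keep out of reach of children.": "Tenir hors de portée des enfants.",
}

def fuzzy_lookup(segment, threshold=0.75):
    """Return the stored translation whose source segment is most similar
    to the new segment, provided the similarity exceeds the threshold."""
    best_score, best_pair = 0.0, None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    return (best_pair, best_score) if best_score >= threshold else (None, best_score)

match, score = fuzzy_lookup("Shake the bottle before use.")
if match:
    print(f"{score:.0%} match: {match[1]}")  # proposed translation for the translator to edit
```

The threshold embodies the trade-off discussed above: set too low, the system proposes misleading matches; set too high, useful past work is never re-used.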
III. Complex Terminology

The restrictions and requirements of the future application environment form the basis on which a terminologist can construct a database. In a translation environment, there is a need for rapidly accessible, high-quality term banks as well as vast information banks, which can cut down the time-consuming work in translation. Low-cost compilation and maintenance of a term bank should result in a quick return on investment. Of course, a translator is required to deal systematically with complex terminology in an LSP text, but there are also significant restrictions of time and money (Galinski and Budin 1993:213-214). The delicate balance between efficiency and quality has become a hot issue in terminology management. In many cases the translator-terminologist is not a subject field expert, and there are few sources, if any, that provide a list of terms or a term bank. In ideal circumstances, a terminologist will collect resources, make term lists, develop conceptual systems and search for translations. Pragmatically, the terminologist will describe all terms not known by non-specialists (Wright 1997:19). Pearson (1998:12-40), on the other hand, explains that neither a word's subject specificity nor its familiarity to non-specialists will suffice as a criterion for regarding it as a term. Pearson furthermore points to the impossibility of recognising a term intuitively. The mode of communication in which words are used can, however, be an indication that they are terms, and in LSP texts this is most likely.
IV. Corpus Assembly: source text selection

A translator-terminologist who is not a subject field expert will find his/her resource material in contexts. The selection of contexts used to build a corpus is subject to a number of criteria: only experts can verify whether or not a text is representative of a particular domain; the publication itself provides an indication of the value of the terminology that appears in it; and, finally, the reliability of both text and author, linguistically as well as professionally, helps to determine the appropriateness of a text as a source for terminology. In the project, we have a corpus of specialised texts which take the form of medication leaflets for medical staff. We may assume that they comply with the three criteria. Nevertheless, for the remainder of the corpus, a wide-ranging collection of texts including interviews, legal contracts and patient information, some of these criteria can be questioned. Given that experts have written these articles and texts, we assume the samples to be representative of the text type, even though the medical industry's text production has undeniably become more popular and commercial in tone. The material is taken from specialised journals and publications. The professional abilities of an author can only be judged with regard to his/her reputation, and to determine this we have to rely on other specialists. The linguistic abilities of subject field specialists, on the other hand, are something a translator can easily determine; many translation jobs have been refused or cancelled by translators due to the poor quality of the source text. Working with electronically available texts is less labour-intensive than working with printed material. Converting printed texts into machine-readable form takes time and money. Although numerous articles are available on-line and on CD-ROM, most of the source texts to which translators have access are in printed form. Sometimes the graphic quality of these texts is so poor that they cannot be converted into electronic form by means of scanning and optical character recognition (OCR). Furthermore, the reliability of OCR software is not yet high enough for texts to be converted without careful proofreading.
V. Term Extraction

One of the natural advantages a translator has over a domain expert is his/her linguistic background. He/she does, however, have to rely on a corpus of specialist texts. Often the texts in the parallel corpus will provide only a partial (contextual) mapping of terms. Secondary resources can therefore aid the translator in identifying common elements between the translation and the source term. Putting several sources together increases the reliability of establishing equivalence between terms in different languages, as the sketch below illustrates.

5.1. Medical terminology

Medical terminology is largely documented in classification systems, which can often only be accessed by specialist doctors and which require (substantial) payment. The usefulness of such an investment depends on the quality of the database and on the translation office's financial situation.
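The corpus-based side of establishing equivalence can be sketched as follows: candidate translations for a source term are ranked by how often they co-occur in the target side of aligned segment pairs, after which a second resource (a dictionary or one of the classification systems discussed below) confirms the equivalence. The segment pairs and terms here are invented examples, not project data.

```python
# Illustrative sketch: ranking translation candidates for a source term by
# co-occurrence across aligned segment pairs (invented toy data).
from collections import Counter

aligned_pairs = [
    ("bijsluiter voor hoofdpijn en koorts", "notice pour céphalée et fièvre"),
    ("bij koorts raadpleeg een arts", "en cas de fièvre consultez un médecin"),
    ("koorts en misselijkheid", "fièvre et nausées"),
]

def candidate_translations(source_term, pairs):
    """Count target-side words occurring in segments whose source side
    contains the term; frequently co-occurring words are candidates."""
    counts = Counter()
    for src, tgt in pairs:
        if source_term in src.split():
            counts.update(tgt.split())
    return counts.most_common(5)

print(candidate_translations("koorts", aligned_pairs))
# 'fièvre' surfaces at the top; a secondary resource then confirms the pairing.
```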
Fig. 2: chapters in ICD-10 (Hirs 1993: 227-228)

A study of these classification systems for medical data (Chute 1996) showed that the separate systems did not describe the concept systems accurately. The research material dates back to 1993, however, and may already be outdated. The standard system, the International Classification of Diseases, has been revised by the World Health Organisation since its sixth version; the current revision is ICD-10. Because ICD-10 starts from diagnoses, substantial specialist knowledge is required to describe the organisation of the concepts. Language-independent coding maps the concepts. The goal of ICD coding is to disseminate data on diseases and health care, and to calculate refunds from health insurance agencies. Despite the restricted field (medical statistics), ICD remains a compromise between nomenclature and classification, and between uniformity and usability. The codes in these classifications describe diseases, procedures, bodily systems, substances, instruments, etc. They partially meet the demand for standardisation in reporting and comparing medical data in particular, and the need for terminological standardisation as a whole. In addition to ICD-10, various other systems coexist and are used by different organisations [2]. The American Food and Drug Administration, for instance, requires the Coding Symbols for Thesaurus of Adverse Reaction Terms in reports on medication side effects. A WHO directive addressing the same domain in parts of Europe and Japan uses the WHO Adverse Reaction Thesaurus. Because we assume that a database can be constructed without the assistance of a domain expert, especially in a professional environment, these classification systems provided one basis for our term bank. In addition, written sources are consulted to fill the holes remaining in the concept structure.

5.2. Chemical nomenclature

Names for chemical substances can diverge, and terms can overlap even where two clearly different substances are involved. Each nomenclature system has its limitations: some substances can be described by more than one name, others need more than one name to be positioned accurately in a concept system. Various books, mainly monolingual, have been published on the subject, but it seems more useful to observe existing systems for medication and therapeutic substances. These can be found, for instance, in Garlot (1991) and Dorian (1990). Merrit (1997:222-223) holds the nomenclature recommended by the International Union of Pure and Applied Chemistry (IUPAC) to be a standard. The WHO suggests the use of International Non-proprietary Names which, due to non-systematic naming in this cumulative list, are often combined with the nomenclature of the International Standardisation Organisation and the Chemical Substance Name Selection Manuals, as in the European Customs Inventory of Chemical Substances. Combining coding systems, medical as well as chemical, can add to the value of a term bank. Working with this corpus, it is clear that manually searching for contexts and candidate terms takes up valuable time. A translator can therefore turn to electronic tools for help with corpus analysis, which can be subdivided into pre-processing and automatic annotation of the text, and searching for and extracting relevant material.
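As a small illustration of how such language-independent coding works, the sketch below maps an ICD-10 code to its chapter by range comparison on the three-character category. The chapter ranges shown follow the published ICD-10 chapter structure, but the table is deliberately incomplete and the function is our own.

```python
# Hedged sketch of ICD-10's language-independent coding: a code such as
# 'J45' falls in a letter-digit range that identifies its chapter. Only a
# few of the chapters are listed here.
ICD10_CHAPTERS = [
    (("A00", "B99"), "Certain infectious and parasitic diseases"),
    (("C00", "D48"), "Neoplasms"),
    (("I00", "I99"), "Diseases of the circulatory system"),
    (("J00", "J99"), "Diseases of the respiratory system"),
]

def chapter_of(code):
    """Map an ICD-10 code to its chapter by lexicographic range comparison
    on the three-character category (letter + two digits)."""
    category = code[:3].upper()
    for (low, high), title in ICD10_CHAPTERS:
        if low <= category <= high:
            return title
    return "unknown (chapter table incomplete in this sketch)"

print(chapter_of("J45"))  # asthma -> "Diseases of the respiratory system"
```

Attaching such codes to term-bank entries is what makes a medical term bank comparable across languages and across the coexisting classification systems.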
VI. Term bank construction

Bearing in mind the principles mentioned above, we constructed a reference term bank using MultiTerm from TRADOS. Our decision was based on the market situation in 1997, when the European Commission decided to implement that system for its translation activities. Despite some drawbacks, the program has been adopted by a large number of users. In configuring the database, the translator-terminologist must find a balance between efficiency and completeness. In translation, there is a need for a system that maximises the use of a database; in MultiTerm this is reflected in a maximum use of index fields. One of the drawbacks is that the consistency of the input depends on the terminologist, although TRADOS has furnished a number of input models. There is also a need to spend a minimum amount of time on terminology work. The attribute fields in MultiTerm require less input effort, but they are harder to query. Sources are confined to specialised monolingual dictionaries and a few multilingual sources (e.g. Dorian and Garlot). The corpus is the prime source for the term bank, but brings with it some problems. Other resources have been found on the Internet [3]. We confined ourselves to using non-commercial products, which often merely required that an agreement be signed. One such product was the Unified Medical Language System (UMLS) Knowledge Resources, which consist of a Metathesaurus, a semantic network and a specialist lexicon. The Metathesaurus contains semantic information about biomedical concepts, their various names, and their inter-relationships. It is built from thesauri, classifications, coding systems and lists of controlled terms that are developed and maintained by many different organisations. The Semantic Network is a network of the general categories, or semantic types, to which all concepts in the Metathesaurus have been assigned. The specialist lexicon contains syntactic information about biomedical terms and will eventually cover the majority of component terms in the concept names present in the Metathesaurus. A number of lexical programs are distributed with the UMLS Knowledge Sources for use with the lexicon and the Metathesaurus (UMLS 1999). Another resource for medical terminology comes from the same source: Medical Subject Headings (MeSH), the National Library of Medicine's (NLM) controlled vocabulary thesaurus. The MeSH thesaurus is used by the NLM to index articles from leading biomedical journals for the MEDLINE database and for other databases. A quite different type of resource found on the World Wide Web is a multilingual glossary of technical and popular medical terms in nine European languages (English, French, German, Dutch, Spanish, Portuguese, Italian, Greek and Danish) by the Heymans Institute of Pharmacology.
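The index/attribute distinction can be modelled as a simple data structure: index fields hold the per-language terms on which fast queries run, while attribute fields carry free-form information that is cheaper to enter but harder to query. The sketch below is an illustrative model with invented field names, not the actual MultiTerm schema.

```python
# Minimal sketch of a term-bank entry with index vs attribute fields;
# all names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    # Index fields: one term per language, optimised for lookup.
    index: dict                                      # e.g. {"nl": "koorts", "fr": "fièvre"}
    # Attribute fields: cheap to fill in, harder to query systematically.
    attributes: dict = field(default_factory=dict)   # e.g. {"domain": ..., "source": ...}

term_bank = [
    TermEntry(index={"nl": "koorts", "fr": "fièvre", "en": "fever"},
              attributes={"domain": "symptomatology", "source": "UMLS"}),
]

def lookup(term, lang, bank):
    """Query on an index field: exact match on the term in one language."""
    return [e for e in bank if e.index.get(lang) == term]

for entry in lookup("koorts", "nl", term_bank):
    print(entry.index["fr"], "|", entry.attributes)
```

The design choice mirrors the trade-off in the text: every field promoted to an index speeds up retrieval but increases the input effort and the demands on the terminologist's consistency.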
VII. Alignments and bilingual corpora

The idea of storing previously translated texts in a memory is an old one. A translation memory is based upon a model in which a source text and its translation are combined to form a parallel corpus. The translation memory can be built up in advance by pre-aligning texts and their translations and storing them in the memory. Alignment produces a parallel environment containing a source text and its translation; hence a bilingual corpus, or bitext [4], is created. The mutual compositional relation of a text segment and its translation is made explicit. An alignment is based upon a statistical comparison: the translation of source text x is situated around target text y. Alignments thus form the structural depictions of translation analyses, just as parse trees do for grammatical analyses. The alignment can be performed automatically, but the criterion in most programs is restricted to punctuation and paragraph markers. Extensive manual revision is needed to render the texts suitable for inclusion in a translation memory. The quality of an alignment depends on the nature of the items on which the segmentation is based. A minimalist (statistical) alignment is an alignment of a text into paragraphs ('gross alignment'). A maximalist alignment relies on an approach whereby the bitext is divided into the smallest possible pairs without losing the overall meaning. More recent developments treat the bitext at the level of words and/or characters. An alignment filters important components out of a translation. Next to two language-specific components (the language models), pair-specific features become apparent, i.e. the contrastive component (the correspondence model). Both monolingual components act in analysis mode, and the language-specific representations they produce are fed into the correspondence model, which connects them in a simple bitext image in which the translational similarities are made explicit. As a consequence this remains a natural model, however rule-based or corpus-based the techniques involved may be. While rule-based methods are suited to the development of 'deep' models in specific domains, probabilistic methods are most likely to be used to develop less thorough models which are perfectly capable of making fairly acceptable partial analyses of non-domain-bound translations. Although the most accurate alignment results are produced by programs that use anchors and pointers, these still remain only partial alignments. The major part of the corresponding information, however, lies at a level smaller than the sentence. It should come as no surprise, then, that sentence length still remains the dividing marker in the establishment of alignments. Though the most recent uses of parallel corpora focus on bringing translations into bitext at a level beneath the sentence (e.g. word or character), it appears that the 1:1 balance is disturbed once segmentation takes place beyond that limit. For this project, we have thus far tested a number of alignment programs. The first was Atril's Déjà Vu Database Maintenance Align. Déjà Vu is a CAT (computer-aided translation) package with different tools that enable the translator to work faster and more accurately. It consists of a memory database which stores all translated texts, and a terminology tool called Termwatch, which is a fully integrated terminology system [5]. Déjà Vu also contains the File Alignment Wizard.
Using that alignment tool, all previously translated texts can be aligned and stored in the translation memory. We also tested TRADOS WinAlign, a package which resembles Database Maintenance Align but offers additional possibilities. WinAlign comes with the Translator's Workbench; it synchronises source texts from earlier translation projects with their translations and automatically supplements translation memories. In collaboration with Language and Computing, our department 'Language and Computer' has aligned a corpus at sentence level using a Borland alignment tool. The corpus consists of over 2,100 medical instruction leaflets in Dutch and French; the text material in both languages comprises about 8 million words. The alignment takes place at three levels: first the names of the drugs are compared, then all subtitles of each drug lemma, and finally the subsequent paragraphs. There were no problems at all when the names and titles were compared. Only when the actual bulk of text was examined did many difficulties arise. First of all, the Borland alignment tool is a statistical program which places translated segments into the bitext on the basis of punctuation (EOL) [6]; the whole corpus was therefore aligned at sentence level. Secondly, information in some instruction leaflets seemed to be missing. It could be the task of the translator/aligner to obtain further information on a scientific basis, but the Borland tool did not allow the corpus to be supplemented in this way. Thirdly, when aligning translations, all the problems typical of translation are encountered: omissions, additions, shifts in sentence structure, etc., are complications that alignment tools are not yet able to deal with.
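To show what sentence-level statistical alignment amounts to, the following sketch implements a strongly simplified length-based aligner in the spirit of Gale and Church (1991): a dynamic programme over 1:1, 1:0, 0:1, 2:1 and 1:2 pairings, scored by character-length ratio. Real implementations use a probabilistic cost function and anchors; the plain ratio penalty here is our simplification.

```python
# Simplified length-based sentence alignment (toy version of the
# Gale-Church idea); sentences and penalty values are illustrative.
def align(src, tgt):
    INF = float("inf")

    def cost(s_chars, t_chars):
        if s_chars == 0 or t_chars == 0:
            return 3.0                      # flat penalty for omission/addition
        return abs(s_chars / t_chars - 1.0) # 1:1 length ratio is cheapest

    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # sentence counts consumed
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    c = cost(sum(len(s) for s in src[i:i + di]),
                             sum(len(t) for t in tgt[j:j + dj]))
                    if best[i][j] + c < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + c
                        back[i + di][j + dj] = (i, j)
    # Recover the aligned segment pairs by walking the backpointers.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj = back[i][j]
        pairs.append((" ".join(src[pi:i]), " ".join(tgt[pj:j])))
        i, j = pi, pj
    return list(reversed(pairs))

print(align(["Agiter avant usage.", "Tenir au frais."],
            ["Schudden voor gebruik.", "Koel bewaren."]))
```

Because the only evidence is segment length, such an aligner is blind to exactly the omissions, additions and structural shifts listed above, which is why manual revision remains indispensable.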
VIII. Corpus annotation and tagging

The productivity of an aligned translation memory contrasts with the productivity of term banks: there are no direct links from the translated units to the term bank itself. Automating the transfer of translated items throws up three difficulties. Firstly, it is hard to find criteria which can be used to identify a term as such. Secondly, it is not easy to isolate the translation of a particular word within the translated sentence; to do so, the system would have to be based upon in-depth knowledge of the syntax of both source language and target language. Thirdly, it is not enough merely to create a list of translated terms; this list should be annotated with information on word class, gender, number, morphological behaviour, syntax and semantics. Clearly it would be time-consuming to process any text with regard to all of the above-mentioned aspects. The solution to the problem lies in the development of state-of-the-art NLP tools that automatically attach tagged information to each term. However, this process cannot be performed on a fully automated basis; it has to be manually edited and, as such, can be called semi-automatic. This type of work has to be supervised by an expert terminologist. Corpus annotation is the practice of adding interpretative, especially linguistic, information to a text corpus by adding coded information to the electronic representation of the text itself. An unannotated corpus appears in its raw state of plain text, whereas annotated corpora have been enhanced with various types of linguistic information. It is therefore not surprising that the utility of the corpus increases once it has been annotated: it is no longer a body of text in which linguistic information is merely implicit, but one which may be considered a repository of linguistic information, the implicit having been made explicit through the process of annotation. Several kinds of tagged information can be represented by different types of annotation. Morpho-syntactic annotation is the annotation of the grammatical class of each word token in a text, also referred to as 'grammatical tagging' or 'part-of-speech (POS) tagging'. Syntactic annotation is the annotation of the structure of sentences, e.g. by means of a phrase-structure parse or dependency parse. Parsing involves bringing basic morpho-syntactic categories into high-level syntactic relationships with one another; it is probably the most commonly encountered form of corpus annotation after part-of-speech tagging, and parsed corpora are sometimes known as treebanks. One of the tagging schemes we tested was the WinBrill tagger for French. WinBrill [7] is a Windows program based on the tagger developed by Eric Brill and adapted by the Institut National de la Langue Française (INaLF) of the Centre National de la Recherche Scientifique. The WinBrill tagger for French is an extensive, memory-based program: the data files containing the lexicon, bigrams, and lexical and contextual rules add up to nearly 20 megabytes, and it is on the basis of these data that the tool tags the selected source file. WinBrill uses 47 tags.
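The Brill approach can be illustrated in miniature: every word first receives its most frequent tag from a lexicon, after which contextual rules patch systematic errors. The lexicon, the rule and the coarse tag names below are invented for the example and bear no relation to the actual 47-tag French data files.

```python
# Toy illustration of Brill-style transformation-based tagging:
# lexicon-based initial guesses, then contextual correction rules.
LEXICON = {"le": "DET", "la": "DET", "patient": "NOUN",
           "prend": "VERB", "dose": "NOUN"}

# Contextual rule: (from_tag, to_tag, required tag of previous word),
# e.g. retag a default NOUN as ADJ when it directly follows a NOUN.
RULES = [("NOUN", "ADJ", "NOUN")]

def tag(words):
    # Initial state: most frequent tag from the lexicon, NOUN as fallback.
    tags = [LEXICON.get(w.lower(), "NOUN") for w in words]
    # Transformation phase: apply each contextual rule left to right.
    for i in range(1, len(words)):
        for frm, to, prev in RULES:
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return list(zip(words, tags))

print(tag("Le patient prend la dose prescrite".split()))
# 'prescrite' is unknown, defaults to NOUN, and is then corrected to ADJ
# by the contextual rule because it follows the noun 'dose'.
```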
The development of such systems has three merits. First of all, research results obtained in the domain of NLP will be transferred to the growing market of practical applications. Secondly, offering integrated annotated term banks will reduce the barriers to using aligned translation memories and will assist the spread of these translation environments. Thirdly, the cost of using term banks will decrease.
CONCLUSION

Our presentation has outlined a research project in medical terminology management. The use of translation memories in combination with alignment tools and term extraction has proved to be a useful instrument for improving the output of translation offices. With the increasing ability of computers to process annotated corpora, the impact of terminology on translation is still growing. First of all, terminology can be used as a basis for the pre-acceptance of a source text, because more than 70% of possible quality issues are related to the usage of terms. Secondly, the availability of efficient terminology tools and high-quality terminological content can easily reduce the work of a translator by 50%, since all information is directly accessible. Thirdly, the quality of a translator's work will also increase, since he/she can use the terminology databases as a knowledge base. However, a number of problems remain. There is a lack of terminological resources, and the quality of the available terminology is questionable. There are no standardised sets of criteria; those that exist overlap and display inconsistencies. Serious efforts are needed to overcome the lack of professional terminologists and of funding for terminological research. The construction of a set of translation tools can assist translators/terminologists considerably. A document-generated dictionary, for instance, could result in very restricted, up-to-date and useful terminology sets (this will be one outcome of our efforts to select medical phraseology on the basis of tags and to link it with the medical nomenclature) [8]. One can conclude that many translation problems and localisation difficulties can be overcome by implementing an integrated system of translation memory and terminology management tools.
References

BROWN, P., J. Lai & R. Mercer (1991): "Aligning Sentences in Parallel Corpora". In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, California.
CHUTE, Christopher et al. (1996): "The Content Coverage of Clinical Classification Systems". In: Journal of the American Medical Informatics Association 3: 224-233.
DORIAN, Angelo Francis (ed.) (1990): Elsevier's Encyclopaedic Dictionary of Medicine. Part D: Therapeutic Substances. Amsterdam: Elsevier.
DUBUC, Robert & Andy Lauriston (1997): "Terms and Contexts". In: Wright, Sue Ellen & Gerhard Budin (eds.): Handbook of Terminology Management. Volume 1: Basic Aspects of Terminology Management. Amsterdam: John Benjamins.
GALE, W. & K. Church (1991): "A Program for Aligning Sentences in Bilingual Corpora". In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley.
GALINSKI, Christian & Gerhard Budin (1993): "New Trends in Translation-Oriented Terminology Management". In: Wright, Sue Ellen & Leland D. Wright (eds.): Scientific and Technical Translation. American Translators Association Scholarly Monograph Series: Volume VI. Amsterdam: John Benjamins.
GARLOT, Christian, Gilles Aulagner & Jean Calop (eds.) (1991): Dictionnaire Européen des Médicaments et leurs équivalents. Paris: SEMP Editions.
Heymans Institute of Pharmacology, University of Gent, Belgium: Multilingual Glossary of Technical and Popular Medical Terms in Nine European Languages.
HIRS, William (1993): "The Use of Terminological Principles and Methods in Medicine". In: Sonneveld, Helmi B. & Kurt L. Loening (eds.): Terminology. Applications in Interdisciplinary Communication. Amsterdam: John Benjamins.
ISABELLE, P. (1991): La bi-textualité: vers une nouvelle génération d'aides à la traduction et la terminologie. Québec: Laval.
ISABELLE, P. (1992): "Bi-textual Aids for Translators". In: Proceedings of the Eighth Annual Conference of the UW Centre for the New OED and Text Research. University of Waterloo, Waterloo, Canada.
ISABELLE, P., M. Dymetman et al. (1993): "Translation Analysis and Translation Automation". In: Proceedings of TMI-93. Kyoto.
MERRIT, Joy E. & Byron J. Bossenbroek (1997): "Basic Resources for Assigning Chemical Names within the Field of Chemical Nomenclature". In: Wright, Sue Ellen & Gerhard Budin (eds.): Handbook of Terminology Management. Volume 1: Basic Aspects of Terminology Management. Amsterdam: John Benjamins.
National Library of Medicine: Medical Subject Headings and Unified Medical Language System.
PEARSON, Jennifer (1998): Terms in Context. Amsterdam: John Benjamins.
SIMARD, M. & P. Plamondon (1996): "Bilingual Sentence Alignment: Balancing Robustness and Accuracy". In: Proceedings of AMTA-96. Montreal.
UMLS (1999): Unified Medical Language System. U.S. Department of Health and Human Services, National Institutes of Health, National Library of Medicine.
WEßEL, Renate (1995): "Trados: MultiTerm for Windows. User Report". In: Terminology in Advanced Microcomputer Applications: Proceedings of the 3rd TermNet Symposium; Recent Advances and User Reports. Wien: TermNet.
WRIGHT, Sue Ellen & Gerhard Budin (eds.) (1997): Handbook of Terminology Management. Volume 1: Basic Aspects of Terminology Management. Amsterdam: John Benjamins.
WRIGHT, Sue Ellen & Leland D. Wright (1997): "Terminology Management for Technical Translation". In: Wright, Sue Ellen & Gerhard Budin (eds.): Handbook of Terminology Management. Volume 1: Basic Aspects of Terminology Management. Amsterdam: John Benjamins.
[1] Our corpus also contains a fourth part, but this has no direct link with medical terminology. At a later stage of the project, research will be carried out on a vast collection of car manuals in SGML format provided by General Motors.
[2] With regard to medical standardisation, we would like to cite Werner Ceusters, director of Language and Computing, a small but ambitious company specialising in medical language technology. L&C recently (20 April 1999) received an award from the Flemish government for its promising and innovative nature. When the classification systems were mentioned, Mr Ceusters said: "The medical colleagues are not especially fond of coding their language. Medical terminology is far too rich to let it fit in a formal embodiment. On top of that the most important classification systems, like ICD, ICPC, Read and SNOMED, are not congruent." De Artsenkrant, Multimedia, Brussel, 27 April 1999.
[3] We must strongly emphasise that Internet resources may not always be of high quality. Thorough research on the data and the authors is essential.
[4] Pierre Isabelle uses the term bitext to denote a pair of texts (a source text and its translation) related by explicit translation similarities (Isabelle 1991:2).
[5] Termwatch is a program that allows users to access terminology databases from almost any Windows application. Key combinations can be defined and the selected word can be looked up in the database.
[6] According to Gale and Church (1991), the accuracy of an alignment program based upon sentence length (with a focus on punctuation, i.e. end of line or EOL) is about 96%.
[7] Eric Brill has his own homepage: http://www.cs.jhu.edu/~brill/home.html
[8] For more details, please refer to the website of Language and Computing: http://www.LandC.be