Natural language – Language resources – Semantic Web
Standardisation of methodologies of content creation makes content re-usable and interoperable
Christian Galinski, Infoterm
Abstract
Recently more and more aspects of the ‘economics
of language’ (i.e. primarily the costs of
using language in specialized/professional
written communication) have been identified.
Unification / standardisation / harmonisation
of methodologies provides the most important clues
for cost reduction, and at the same time for the
improved quality of communication. The Technical
Committee ISO/TC 37 “Terminology and other
language resources” of the International
Organization for Standardization (ISO) is contributing
to ICT development by preparing standards and
other documents with rules as well as guidelines
for, among others, harmonised metadata, unified
principles and methods for data modelling, standardised
meta-models, etc. This became necessary, among other
things, due to the following considerations:
- terminology is – especially in
speech and text – embedded in or combined
with LRs,
- new information and communication technology
(ICT) developments – especially mobile
content, e-business, mobile commerce, etc. –
increasingly require the integration or combination
of all kinds of content (incl. LRs),
- LRs (including terminology) increasingly
have to be treated as multilingual, multimedia
and multimodal from the outset.
Increasingly system designers and developers,
therefore, recognise that only more refined data
models (in terms of a higher degree of granularity
and a higher degree of international unification
and harmonisation) can enable information and
knowledge management in the organisation to cope
with the above-mentioned cost situation. A higher
degree of methodology standardisation
with respect to LRs is a prerequisite for achieving
satisfactory solutions for information and knowledge
management based on content management in the
enterprise.
More and more aspects of the ‘economics
of language’ (i.e. primarily the costs of
the use of language in specialized/professional
communication) have been identified. Since communication
consumes time or transaction effort in some way
or other, costs are incurred continuously. Some
are not yet measurable; others have become measurable.
This applies to
- ‘natural’ inter-personal communication,
  - whether in oral form or in written form,
  - whether in general purpose language (GPL) or in special purpose language (SPL),
- man-machine communication,
- communication in language between computers.
Of course the objective is not to avoid communication,
but to render communication more efficient and
effective at places, in environments, at times,
where and when it is necessary or useful. This
is the role of content management
(handling language resources as major elements
of information and knowledge management). Unification
/ standardisation / harmonisation of methodology
provides the most important clues for cost reduction
in content management, and at the same time for
the improved quality of communication.
This refers in particular to the unification/standardisation/harmonisation
of methods concerning language resources (LRs)
for the sake of content management, and may in
some cases also refer to the data as well as data
structures themselves. During the last couple
of years the Technical Committee ISO/TC 37 “Terminology
and other language resources” of the International
Organization for Standardization (ISO) has opened
its scope towards language resources in general.
This was due, among others, to the following considerations:
- terminology is – especially in speech
and text – embedded in or combined with
LRs,
- new information and communication technology
(ICT) developments – especially mobile
content, e-business, mobile commerce, etc. –
increasingly require the integration or combination
of all kinds of content (incl. LRs),
- LRs (including terminology) increasingly have
to be treated as multilingual, multimedia and
multimodal from the outset.
From e-content to m-content
Everything which represents information
or knowledge for whatever purpose is content.
A recent Andersen study for the European Commission
(Feb. 2002) identifies, among others, the following
transaction-centric and content-centric kinds
of m-content or m-content services which are already
emerging (left-hand column), to which enhanced future
kinds of m-content and m-content services could
be added (right-hand column, by the author):

m-content (already emerging) → content based on language resources – multilingual, multimedia, multimodal
- mobile general news → mobile public information
- mobile transport information → mobile transport and delivery information
- mobile financial data → mobile financial information
- mobile games → MT services for professionals
- mobile edutainment → mobile learning and training
- mobile music → mobile composing
- mobile transaction services → mobile B2B services
- mobile directories → mobile directories for professionals
- mobile adult information → mobile information for professionals
At present the creation of those kinds of content
which are based on LRs is still too slow, too
expensive, mostly not good enough, and rarely
comes with a guarantee of correctness. By using the
Internet more effectively – e.g. by using
it for net-based distributed co-operative content
creation with new methods of content management,
by involving many more experts and even users
as potential creators of content – the cost
of content creation can be decreased dramatically,
while at the same time improving considerably
the quality of the content thus created. ISO/TC
37 “Terminology and other language resources”
is contributing to this development by preparing
standards and other documents with rules as well
as guidelines for
- harmonised metadata,
- unified principles and methods for data modelling,
- standardised meta-models.
This kind of methodology standardisation not
only enhances the performance of content creation,
but also ensures the re-usability of data (for
other environments, other purposes, different
uses and over time) as well as interoperability
of data structures.
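As a rough illustration of why harmonised metadata makes data re-usable and interoperable, consider two tools that store the same terminological information under different field names; once both publish a mapping onto a shared set of data-category names, their records can be converted mechanically. This is only a sketch: all field and category names below are invented for illustration and are not taken from an actual ISO/TC 37 standard.

```python
# Sketch: mapping tool-specific field names onto shared (harmonised)
# data-category names, so records from different tools become
# interchangeable. All names here are illustrative assumptions.

# Two tools store the same information under different labels.
record_tool_a = {"Benennung": "Schraube", "Sprache": "de", "Quelle": "Katalog 7"}
record_tool_b = {"term": "bolt", "lang": "en", "src": "catalogue 7"}

# Each tool publishes a mapping to the shared category names.
to_shared_a = {"Benennung": "term", "Sprache": "language", "Quelle": "source"}
to_shared_b = {"term": "term", "lang": "language", "src": "source"}

def harmonise(record, mapping):
    """Rename tool-specific fields to the shared data-category names."""
    return {mapping[key]: value for key, value in record.items()}

shared_a = harmonise(record_tool_a, to_shared_a)
shared_b = harmonise(record_tool_b, to_shared_b)

# Both records now use the same keys and can be merged or compared.
assert shared_a.keys() == shared_b.keys()
```

Without the shared category names, every pair of tools would need its own ad-hoc converter; with them, each tool needs only one mapping.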
The Semantic Web
In a letter to “Business Week”
(April 8, 2002) Tim Berners-Lee (MIT, the father
of the “Semantic Web” conception)
denies that the WWW will be replaced by the Semantic
Web, with the following arguments:
“The WWW contains documents intended
for human consumption, and those
intended for machine processing.
The Semantic Web will enhance the latter. The
Semantic Web will not understand human language
... The Semantic Web is about machine languages:
well-defined, mathematical, boring, but processable.
Data, not poetry.”
thus indicating that he is widely misunderstood
or misinterpreted.
These remarks also point in the direction of
how language use in the information and knowledge
society in general and in future e-business (comprising
the whole range of e-commerce, e-procurement,
e-content, etc. to m-commerce) will develop: highly
harmonised terminology combined with factual data
and common language elements need to be provided
in a form which is:
- presumably nearer to human language usage in B2C,
- presumably nearer to machine languages in B2B.
What is new in this connection is that these
machine languages will also be multilingual in
terms of human language use. Besides, they will
be multimodal and multimedia from the outset.
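The “data, not poetry” idea, combined with multilinguality, can be reduced to a minimal sketch: well-defined statements about a language-independent identifier, with human-readable labels attached per language. The identifiers and predicate names below are invented for illustration, not drawn from any actual vocabulary.

```python
# Sketch: machine-processable, multilingual statements in the spirit of
# the Semantic Web quote above. The identifiers and predicates
# ("product:4711", "hasLabel", "madeOf") are illustrative assumptions.
triples = [
    ("product:4711", "hasLabel", ("en", "threaded bolt")),
    ("product:4711", "hasLabel", ("de", "Gewindebolzen")),
    ("product:4711", "hasLabel", ("fr", "boulon fileté")),
    ("product:4711", "madeOf",   "material:steel"),
]

def labels_for(subject, language):
    """Return all labels of a subject in the requested language."""
    return [obj[1] for (s, p, obj) in triples
            if s == subject and p == "hasLabel" and obj[0] == language]
```

The machine never “understands” the labels; it only manipulates well-defined data, while human-language strings ride along as language-tagged values.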
Standardisation of LR related
aspects
Standardisation as a rule is a highly co-operative
endeavour carried out in a very democratic way
involving industry experts, public administrators,
researchers and consumers. The standardisation
of methodology concerning LR related metadata,
principles and methods for data modelling and
meta-models will inevitably necessitate and result
in a higher degree of granularity of database
design and data modelling at the field level.
This probably will also lay the basis for resolving
a whole array of existing problems with respect
to:
- sources of information,
- the history of the evolution of individual pieces of information,
- details on whatever kind of usage,
- restrictions on individual applications, etc.,
thus arriving at a higher level of:
- data/information source indication (as a prerequisite for copyright management),
- automatic or computer-assisted validation (supporting quality management),
- tracing the ‘history’ of every data item (thus coping with the diachronic development of content and the intricacies of version control),
- data safety and security management,
- monitoring methods for collaborative work (with a view to interactive and dynamic content management and information/knowledge management), etc.
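Field-level granularity of the kind listed above means that every individual datum, not just the record as a whole, carries its own administrative metadata. A minimal sketch, with field names that are illustrative assumptions rather than standardised categories:

```python
# Sketch: a datum that carries its own source, usage restrictions and
# change history, enabling copyright management, validation and
# version tracing at field level. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Datum:
    value: str
    source: str                 # prerequisite for copyright management
    restrictions: str = ""      # e.g. "internal use only"
    history: List[str] = field(default_factory=list)  # diachronic record

    def update(self, new_value: str, note: str) -> None:
        """Change the value, keeping the old one in the history."""
        self.history.append(f"{self.value} ({note})")
        self.value = new_value

term = Datum(value="steel bolt", source="product catalogue v3")
term.update("aluminium bolt", "material changed 2002")
# The datum now carries its full provenance with it.
```

Because the provenance travels with each field, a downstream system can validate, license-check or roll back a single data element without touching the rest of the record.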
The resulting standards or guidelines mainly aim
at improved content re-use and interoperability
under a global mark-up, global usability and global
design philosophy. The development from an information
society into a global knowledge society cannot
occur without technical-industrial standards as
well as methodology standards. Parallel to the
standardisation efforts, activities are undertaken
to establish content infrastructures for content
creation and distribution, which also supports
UNESCO’s efforts for the universal availability
of knowledge and universal access to information
in cyberspace. Combining ICT solutions (some under
an open source philosophy) with language and knowledge
engineering approaches, as well as with terminological
methods would even allow for a symbiosis between
the needs of developing communities for advanced
methods and tools on the one hand, and the needs
of technologically and economically advanced communities
for inexpensive knowledge organisation and content
creation on the other hand.
The cost of language in
the enterprise
Until recently a concrete method to calculate
the cost of ‘language’, in order to
be in a position to argue the usefulness or even
the need to invest in ‘infrastructural’
measures with respect to corporate language in general
and terminology management in particular, was
lacking. This usefulness/need to invest in language
and knowledge infrastructures concerns not only
so-called word workers (such as scientific authors,
technical documentalists, technical writers/editors,
specialised journalists, specialised translators,
localizers, terminologists, etc., who prominently
use ‘words’ in their professional activities
based on communication in written form), but
all professionals who deal with information and
knowledge (i.e. any ‘knowledge worker’)
in their work.
Examples of the ‘catastrophic’
consequences of deficient language use abound.
But in the eyes of decision makers this ‘anecdotal
evidence’ only creates uneasiness, because
these ‘negative examples’ do not help
to find systematic solutions to the underlying
problem: how to ensure the quality (especially
consistency and coherence) of corporate language
and knowledge as part of a ‘strategic survival
strategy’ in increasingly competitive
markets. Besides, they do not point to any systemic
approach, either with respect to measures for
avoiding such ‘catastrophes’ in the future,
or in the direction of arriving at a ‘measurable’
cost-saving effect. Only the latter would turn
the negative argument of unavoidable ‘effort=investment=cost’
into the positive argument of overall ‘cost-saving’.
E-business – especially in combination
with mobile computing resulting in m-commerce
– is probably going to change the organisation
and operation of enterprises and their business
quite radically in the near future. Enterprises
and other organisations/institutions will be forced
not only to link hitherto separated systems to
each other, but to really ‘integrate’
all data processing systems of the organisation.
At this point, at the latest, the full degree of variation
in language usage within the organisation will
become apparent. It is quite clear that this divergence,
inconsistency and incoherence not only carries the
uncomfortable potential for ‘catastrophes’
due to misunderstandings, but also results in
constantly recurring costs in terms of loss of
time, etc. The fact that computers will have to
talk to and understand each other in language via
virtual marketplaces in future e-business will
aggravate this problem. Therefore, a much higher
degree of unambiguity in language usage –
and first of all in the terminology used –
will be indispensable in the near future.
In order to be able to conceive a calculation
method for the cost of language usage in the organisation,
it is necessary
- to analyse language from the point of view of ‘language resources’, which comprise
  - (marked-up or tagged) text corpora,
  - speech corpora,
  - grammar models,
  - lexicographical data,
  - terminological data,
- to identify ‘units’ occurring in (spoken or written man-man, man-machine and machine-machine) communication which can be put in relation to ‘transaction’ efforts (consuming time or funds).
This provides a clue, for instance, to estimate
or even calculate the costs of words and terms
across all documentation in conjunction with product
descriptions in an enterprise. An American consultancy
firm and knowledge management software developer
arrived at USD 0.23 for a word in each of its
occurrences in technical documentation. If a term
is used
- 10 times in a document,
- in documents for 4 models of a product,
- translated into 7 languages,
- in several formats of the same document,
- stored on several media,
this results in costs exceeding USD 160.00.
This further multiplies with every
- additional model developed,
- further medium used for storage,
- additional language used for localisation.
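The arithmetic behind the USD 160.00 figure can be sketched as follows. The per-occurrence cost and the counts for uses, models and languages come from the text above; the number of formats is an assumption, since the text only says “several”.

```python
# Sketch of the word-cost arithmetic above. USD 0.23 per occurrence,
# 10 uses, 4 models and 7 languages are from the text; 3 formats is an
# assumption (the text only says "several formats").
COST_PER_OCCURRENCE = 0.23  # USD per word occurrence

occurrences = 10   # times the term is used in one document
models = 4         # product models documented
languages = 7      # translation languages
formats = 3        # assumed

total = COST_PER_OCCURRENCE * occurrences * models * languages * formats
# Under these assumptions the total is USD 193.20,
# i.e. already well over the USD 160.00 mentioned above.
assert total > 160.00
```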
Unless the enterprise has a central
directory, register or index of all terms used
in all documentation, the cost of a global exchange
of a word or term in an item of a product catalogue,
e.g.
from “fastened by a steel 3-1/2 threaded bolt”
to “fastened by an aluminium 3-1/2 threaded bolt”,
across documentation on 5 related models in
4 languages in 3 formats would be USD 138.00,
compared to USD 9.20 with an appropriate
information/knowledge system in place. In e-business
in Europe today this lack of appropriate tools
already adds up to more than 1 billion USD, with
a tendency to double every year in the years
to come.
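The USD 138.00 figure is consistent with the same USD 0.23 per-occurrence rate if one assumes 10 occurrences of the term per document, as in the earlier example; that assumption is a reconstruction, not stated in the text, and the USD 9.20 figure for a system with a central index is simply quoted as given.

```python
# Sketch reconstructing the USD 138.00 cost of a global term exchange.
# USD 0.23 per occurrence, 5 models, 4 languages and 3 formats are from
# the text; 10 occurrences per document is an assumption carried over
# from the earlier example.
COST_PER_OCCURRENCE = 0.23  # USD

models, languages, formats = 5, 4, 3
occurrences_per_document = 10  # assumed

documents = models * languages * formats  # 60 affected documents
total = documents * occurrences_per_document * COST_PER_OCCURRENCE
# Under these assumptions the total matches the USD 138.00 quoted above.
```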
The above accounts only for the immediately calculable
costs of word units in written documentation,
not taking into account the positive effects on
- product liability,
- quality assurance,
- internal training and external user training,
- corporate identity, etc.,
which a firmer grip on ‘corporate language’
and terminology might bring about.
Traditional and new content
creation and data modelling
Traditional content creation:
- by one subject-field expert
- by one LR expert (the LR expert can serve as consultant or as project manager)
- by a group of experts (subject-field experts, or specialised LR experts with subject-field expertise, or a mixed expert group: a majority of experts assisted by one or a few terminologists, or a majority of terminologists assisted by one or a few experts)

New methods of content creation:
- net-based distributed co-operative work to establish content databases, including terminological and other language/knowledge resources
- additional features: (semi)automatic validation, copyright management
- --> all users are potential creators of data
- --> economies of scale in content creation

Traditional data modelling → Enhanced data modelling:
- mono-purpose → multi-purpose and multi-functional
- textual data → graphical symbols, formulae, etc.; images and other visual representations; multimedia, multimodal
- LR data categories → additional ontology data categories; higher degree of granularity
- data elements repeatable by language and within language → other kinds of repeatability (e.g. register); qualifiers, attributes, properties, etc.; statistics, validation, copyright management, etc.
- language-independent approach → multimedia and multimodality (incl. non-linguistic representations)
- by subject-field experts or LR experts with subject-field expertise → by anybody according to level of expertise (--> sophisticated access right management)
- traditional systematic/semi-systematic approach → using also other kinds of systematic approaches
- conventional DB management → sophisticated database management methodology (based on metadata approaches for distributed DBs)
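The “repeatability by language and within language” aspect of enhanced data modelling can be illustrated with a small nested structure: one concept, several language sections, several terms per language, each with its own qualifiers. The field names are illustrative assumptions, not taken from an ISO/TC 37 standard.

```python
# Sketch: an entry whose data elements are repeatable by language and
# within a language (several terms per language, each with a qualifier
# such as register). Field names are illustrative assumptions.
entry = {
    "concept_id": "C-0042",
    "languages": {                       # repeatable by language
        "en": [                          # repeatable within the language
            {"term": "bolt", "register": "neutral"},
            {"term": "threaded bolt", "register": "technical"},
        ],
        "de": [
            {"term": "Bolzen", "register": "neutral"},
        ],
    },
}

def terms(entry, language):
    """All terms recorded for a concept in one language."""
    return [t["term"] for t in entry["languages"].get(language, [])]
```

This higher granularity is what makes the same entry usable for translation, indexing and validation alike, instead of being tied to one purpose.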
Hardware costs are decreasing year by year –
and gradually hardware components are nearing the
time-honoured ideal of ‘plug-and-play’.
Software is still far too expensive – not
in terms of purchase, but in terms of the necessary
adaptation and continuous upgrading. But the increasing
emergence of open-source software will reduce costs
in the long run. High-quality content creation and
maintenance, however, is the biggest cost factor!
In analogy to ISO’s OSI model we need something
like an OCI (Open Content Interoperability) model
– which in fact is the vision of ISO/TC 37.
A higher degree of standardisation of methodology
with respect to LRs is a prerequisite for achieving
satisfactory solutions for information and knowledge
management based on content management in the
enterprise. Increasingly system designers and
developers recognise that only more refined data
models (in terms of a higher degree of granularity
and a higher degree of international unification
and harmonisation) can enable content management
in the organisation to cope with the above-described
cost situation.
References
Andersen. Digital content for global mobile
services. Final report. Luxembourg: CEC, 2002
Andersen. Digital content for global mobile
services. Executive summary. Luxembourg: CEC,
2002
ANNEXE

ISO/TC 37 “Terminology and other language resources”
(PWI “Basic principles of multilingual product classification for e-commerce”)

ISO/TC 37/SC 1 “Principles and methods”
- WG 2 “Vocabulary of terminology”
- WG 3 “Principles, methods and concept systems”
- *WG 4 “Terminology of socio-linguistic applications”

ISO/TC 37/SC 2 “Terminography and lexicography”
- WG 1 “Language coding”
- WG 2 “Terminography”
- WG 3 “Lexicography”
- WG 4 “Source identification for language resources”

ISO/TC 37/SC 3 “Computer applications in terminology”
- WG 1 “Data elements”
- WG 2 “Vocabulary”
- WG 3 “Data interchange”
- WG 4 “Database management”

ISO/TC 37/SC 4 “Language resource management”
- WG 1 “Basic descriptors and mechanisms for language resources”
- *WG 2 “Representation schemes”
- *WG 3 “Multilingual text representation”
- *WG 4 “Lexical database”
- *WG 5 “Workflow of language resource management”

*planned
131, rue du Bac - F-75007 Paris
T: (33 1) 45 49 60 62 / F: (33 1) 45 44 45 97