Natural language – Language resources – Semantic Web
Standardisation of methodologies of content creation makes content re-usable and interoperable
Christian Galinski, Infoterm
Abstract
Recently more and more aspects of the ‘economics
of language’ (i.e. primarily the costs of
using language in specialized/professional
written communication) have been identified.
Unification / standardisation / harmonisation
of methodologies provides the most important clues
for cost reduction, and at the same time for the
improved quality of communication. The Technical
Committee ISO/TC 37 “Terminology and other
language resources” of the International
Organization for Standardization (ISO) is contributing
to ICT development by preparing standards and
other documents with rules as well as guidelines
for, among others, harmonised metadata, unified
principles and methods for data modelling, standardised
meta-models, etc. This became necessary, among other
things, due to the following considerations:
- terminology is – especially in
speech and text – embedded in or combined
with LRs,
- new information and communication technology
(ICT) developments – especially mobile
content, e-business, mobile commerce, etc. –
increasingly require the integration or combination
of all kinds of content (incl. LRs),
- LRs (including terminology) increasingly
have to be treated as multilingual, multimedia
and multimodal from the outset.
Increasingly system designers and developers,
therefore, recognise that only more refined data
models (in terms of a higher degree of granularity
and a higher degree of international unification
and harmonisation) can enable information and
knowledge management in the organisation to cope
with the above-mentioned cost situation. A higher
degree of methodology standardisation
with respect to LRs is a prerequisite for achieving
satisfactory solutions for information and knowledge
management based on content management in the
enterprise.
More and more aspects of the ‘economics
of language’ (i.e. primarily the costs of
the use of language in specialized/professional
communication) have been identified. Since communication
consumes time or transaction effort in some way
or other, costs are incurred continuously. Some
are not yet measurable; others have become measurable.
This applies to
- ‘natural’ inter-personal communication,
  - whether in oral form or in written form,
  - whether in general purpose language (GPL) or in special purpose language (SPL),
- man-machine communication,
- communication in language between computers.
Of course the objective is not to avoid communication,
but to render communication more efficient and
effective at places, in environments, at times,
where and when it is necessary or useful. This
is the role of content management
(handling language resources as major elements
of information and knowledge management). Unification
/ standardisation / harmonisation of methodology
provides the most important clues for cost reduction
in content management, and at the same time for
the improved quality of communication.
This refers in particular to the unification/standardisation/harmonisation
of methods concerning language resources (LRs)
for the sake of content management, and may in
some cases also refer to the data as well as data
structures themselves. During the last couple
of years the Technical Committee ISO/TC 37 “Terminology
and other language resources” of the International
Organization for Standardization (ISO) has opened
its scope towards language resources in general.
This was due, among others, to the following considerations:
- terminology is – especially in speech
and text – embedded in or combined with
LRs,
- new information and communication technology
(ICT) developments – especially mobile
content, e-business, mobile commerce, etc. –
increasingly require the integration or combination
of all kinds of content (incl. LRs),
- LRs (including terminology) increasingly have
to be treated as multilingual, multimedia and
multimodal from the outset.
From e-content to m-content
Everything which represents information
or knowledge for whatever purpose is content.
A recent Andersen study for the European Commission
(Feb. 2002) identifies, among others, the following
transaction-centric and content-centric kinds
of m-content or m-content services which are already
emerging (left-hand column), to which enhanced future
kinds of m-content and m-content services could
be added (right-hand column, by the author):

m-content (already emerging) → content based on language resources – multilingual, multimedia, multimodal
- mobile general news → mobile public information
- mobile transport information → mobile transport and delivery information
- mobile financial data → mobile financial information
- mobile games → MT services for professionals
- mobile edutainment → mobile learning and training
- mobile music → mobile composing
- mobile transaction services → mobile B2B services
- mobile directories → mobile directories for professionals
- mobile adult information → mobile information for professionals
At present the creation of those kinds of content
which are based on LRs is still too slow, too
expensive, mostly not good enough, and rarely
comes with a guarantee of correctness. By using the
Internet more effectively – e.g. by using
it for net-based distributed co-operative content
creation with new methods of content management,
by involving many more experts and even users
as potential creators of content – the cost
of content creation can be decreased dramatically,
while at the same time improving considerably
the quality of the content thus created. ISO/TC
37 “Terminology and other language resources”
is contributing to this development by preparing
standards and other documents with rules as well
as guidelines for
- harmonised metadata,
- unified principles and methods for data modelling,
- standardised meta-models.
This kind of methodology standardisation not
only enhances the performance of content creation,
but also ensures the re-usability of data (for
other environments, other purposes, different
uses and over time) as well as interoperability
of data structures.
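As a rough illustration of why harmonised metadata makes data re-usable and interoperable, consider two tools that store the same terminological information under different field names; once both publish a mapping onto a shared set of data-category names, their records can be converted mechanically. This is only a sketch: all field and category names below are invented for illustration and are not taken from an actual ISO/TC 37 standard.

```python
# Sketch: mapping tool-specific field names onto shared (harmonised)
# data-category names, so records from different tools become
# interchangeable. All names here are illustrative assumptions.

# Two tools store the same information under different labels.
record_tool_a = {"Benennung": "Schraube", "Sprache": "de", "Quelle": "Katalog 7"}
record_tool_b = {"term": "bolt", "lang": "en", "src": "catalogue 7"}

# Each tool publishes a mapping to the shared category names.
to_shared_a = {"Benennung": "term", "Sprache": "language", "Quelle": "source"}
to_shared_b = {"term": "term", "lang": "language", "src": "source"}

def harmonise(record, mapping):
    """Rename tool-specific fields to the shared data-category names."""
    return {mapping[key]: value for key, value in record.items()}

shared_a = harmonise(record_tool_a, to_shared_a)
shared_b = harmonise(record_tool_b, to_shared_b)

# Both records now use the same keys and can be merged or compared.
assert shared_a.keys() == shared_b.keys()
```

Without the shared category names, every pair of tools would need its own ad-hoc converter; with them, each tool needs only one mapping.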
The Semantic Web
In a letter to “Business Week”
(April 8, 2002) Tim Berners-Lee (MIT, the father
of the “Semantic Web” conception)
denies that the WWW will be replaced by the Semantic
Web, with the following arguments:
“The WWW contains documents intended
for human consumption, and those
intended for machine processing.
The Semantic Web will enhance the latter. The
Semantic Web will not understand human language
... The Semantic Web is about machine languages:
well-defined, mathematical, boring, but processable.
Data, not poetry.”
thus indicating that he is widely misunderstood
or misinterpreted.
These remarks also point in the direction of
how language use in the information and knowledge
society in general and in future e-business (comprising
the whole range of e-commerce, e-procurement,
e-content, etc. to m-commerce) will develop: highly
harmonised terminology combined with factual data
and common language elements need to be provided
in a form which is:
- presumably nearer to human language usage in B2C,
- presumably nearer to machine languages in B2B.
What is new in this connection is that these
machine languages will also be multilingual in
terms of human language use. Besides, they will
be multimodal and multimedia from the outset.
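The “data, not poetry” idea, combined with multilinguality, can be reduced to a minimal sketch: well-defined statements about a language-independent identifier, with human-readable labels attached per language. The identifiers and predicate names below are invented for illustration, not drawn from any actual vocabulary.

```python
# Sketch: machine-processable, multilingual statements in the spirit of
# the Semantic Web quote above. The identifiers and predicates
# ("product:4711", "hasLabel", "madeOf") are illustrative assumptions.
triples = [
    ("product:4711", "hasLabel", ("en", "threaded bolt")),
    ("product:4711", "hasLabel", ("de", "Gewindebolzen")),
    ("product:4711", "hasLabel", ("fr", "boulon fileté")),
    ("product:4711", "madeOf",   "material:steel"),
]

def labels_for(subject, language):
    """Return all labels of a subject in the requested language."""
    return [obj[1] for (s, p, obj) in triples
            if s == subject and p == "hasLabel" and obj[0] == language]
```

The machine never “understands” the labels; it only manipulates well-defined data, while human-language strings ride along as language-tagged values.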
Standardisation of LR related
aspects
Standardisation as a rule is a highly co-operative
endeavour carried out in a very democratic way
involving industry experts, public administrators,
researchers and consumers. The standardisation
of methodology concerning LR related metadata,
principles and methods for data modelling and
meta-models will inevitably necessitate and result
in a higher degree of granularity of database
design and data modelling at the field level.
This probably will also lay the basis for resolving
a whole array of existing problems with respect
to:
- sources of information,
- the history of the evolution of individual pieces of information,
- details on whatever kind of usage,
- restrictions on individual applications, etc.,
thus arriving at a higher level of:
- data/information source indication (as a prerequisite for copyright management),
- automatic or computer-assisted validation (supporting quality management),
- tracing the ‘history’ of every data item (thus coping with the diachronic development of content and the intricacies of version control),
- data safety and security management,
- monitoring methods for collaborative work (with a view to interactive and dynamic content management and information/knowledge management), etc.
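Field-level granularity of the kind listed above means that every individual datum, not just the record as a whole, carries its own administrative metadata. A minimal sketch, with field names that are illustrative assumptions rather than standardised categories:

```python
# Sketch: a datum that carries its own source, usage restrictions and
# change history, enabling copyright management, validation and
# version tracing at field level. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Datum:
    value: str
    source: str                 # prerequisite for copyright management
    restrictions: str = ""      # e.g. "internal use only"
    history: List[str] = field(default_factory=list)  # diachronic record

    def update(self, new_value: str, note: str) -> None:
        """Change the value, keeping the old one in the history."""
        self.history.append(f"{self.value} ({note})")
        self.value = new_value

term = Datum(value="steel bolt", source="product catalogue v3")
term.update("aluminium bolt", "material changed 2002")
# The datum now carries its full provenance with it.
```

Because the provenance travels with each field, a downstream system can validate, license-check or roll back a single data element without touching the rest of the record.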
The resulting standards or guidelines mainly aim
at improved content re-use and interoperability
under a global mark-up, global usability and global
design philosophy. The development from an information
society into a global knowledge society cannot
occur without technical-industrial standards as
well as methodology standards. Parallel to the
standardisation efforts, activities are undertaken
to establish content infrastructures for content
creation and distribution, which also supports
UNESCO’s efforts for the universal availability
of knowledge and universal access to information
in cyberspace. Combining ICT solutions (some under
an open source philosophy) with language and knowledge
engineering approaches, as well as with terminological
methods would even allow for a symbiosis between
the needs of developing communities for advanced
methods and tools on the one hand, and the needs
of technologically and economically advanced communities
for inexpensive knowledge organisation and content
creation on the other hand.
The cost of language in
the enterprise
Until recently a concrete method to calculate
the cost of ‘language’, in order to
be in a position to argue the usefulness or even
the need to invest in ‘infrastructural’
measures with respect to corporate language in general
and terminology management in particular, was
lacking. This usefulness/need to invest in language
and knowledge infrastructures concerns not only
so-called word workers (such as scientific authors,
technical documentalists, technical writers/editors,
specialised journalists, specialised translators,
localizers, terminologists, etc., who prominently
use ‘words’ in their professional activities
based on communication in written form), but
all professionals who deal with information and
knowledge (i.e. any ‘knowledge worker’)
in their work.
Examples of the ‘catastrophic’
consequences of deficient language use abound.
But in the eyes of decision makers this ‘anecdotal
evidence’ only creates uneasiness, because
these ‘negative examples’ do not help
to find systematic solutions to the underlying
problem: how to ensure the quality (especially
consistency and coherence) of corporate language
and knowledge as part of a ‘strategic survival
strategy’ in increasingly competitive
markets. Besides, they do not point to any systemic
approach, either with respect to measures for
avoiding such ‘catastrophes’ in the future,
or in the direction of arriving at a ‘measurable’
cost-saving effect. Only the latter would turn
the negative argument of unavoidable ‘effort=investment=cost’
into the positive argument of overall ‘cost-saving’.
E-business – especially in combination
with mobile computing resulting in m-commerce
– is probably going to change the organisation
and operation of enterprises and their business
quite radically in the near future. Enterprises
and other organisations/institutions will be forced
not only to link hitherto separated systems to
each other, but to really ‘integrate’
all data processing systems of the organisation.
At this point, at the latest, the full degree of variation
in language usage within the organisation will
become apparent. It is quite clear that this divergence,
inconsistency and incoherence not only carries the
uncomfortable potential for ‘catastrophes’
due to misunderstandings, but also results in
constantly recurring costs in terms of loss of
time, etc. The fact that computers will have to
talk to and understand each other in language via
virtual marketplaces in future e-business will
aggravate this problem. Therefore, a much higher
degree of unambiguity in language usage –
and first of all in the terminology used –
will be indispensable in the near future.
In order to be able to conceive a calculation
method for the cost of language usage in the organisation,
it is necessary
- to analyse language from the point of view of ‘language resources’, which comprise
  - (marked-up or tagged) text corpora,
  - speech corpora,
  - grammar models,
  - lexicographical data,
  - terminological data,
- to identify ‘units’ occurring in (spoken or written man-man, man-machine and machine-machine) communication which can be put in relation to ‘transaction’ efforts (consuming time or funds).
This provides a clue, for instance, to estimate
or even calculate the costs of words and terms
across all documentation in conjunction with product
descriptions in an enterprise. An American consultancy
firm and knowledge management software developer
arrived at USD 0.23 for a word in each of its
occurrences in technical documentation. If a term
is used
- 10 times in a document,
- in documents for 4 models of a product,
- translated into 7 languages,
- in several formats of the same document,
- stored on several media,
this results in costs exceeding USD 160.00.
This further multiplies with every
- additional model developed,
- further medium used for storage,
- additional language used for localisation.
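The arithmetic behind the USD 160.00 figure can be sketched as follows. The per-occurrence cost and the counts for uses, models and languages come from the text above; the number of formats is an assumption, since the text only says “several”.

```python
# Sketch of the word-cost arithmetic above. USD 0.23 per occurrence,
# 10 uses, 4 models and 7 languages are from the text; 3 formats is an
# assumption (the text only says "several formats").
COST_PER_OCCURRENCE = 0.23  # USD per word occurrence

occurrences = 10   # times the term is used in one document
models = 4         # product models documented
languages = 7      # translation languages
formats = 3        # assumed

total = COST_PER_OCCURRENCE * occurrences * models * languages * formats
# Under these assumptions the total is USD 193.20,
# i.e. already well over the USD 160.00 mentioned above.
assert total > 160.00
```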
Unless the enterprise has a central
directory, register or index of all terms used
in all documentation, the cost of a global exchange
of a word or term in an item of a product catalogue,
e.g.
from “fastened by a steel 3-1/2 threaded bolt”
to “fastened by an aluminium 3-1/2 threaded bolt”,
across documentation on 5 related models in
4 languages in 3 formats would be USD 138.00,
compared to USD 9.20 with an appropriate
information/knowledge system in place. In e-business
in Europe today this lack of appropriate tools
already adds up to more than 1 billion USD, with
a tendency to double every year in the years
to come.
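The USD 138.00 figure is consistent with the same USD 0.23 per-occurrence rate if one assumes 10 occurrences of the term per document, as in the earlier example; that assumption is a reconstruction, not stated in the text, and the USD 9.20 figure for a system with a central index is simply quoted as given.

```python
# Sketch reconstructing the USD 138.00 cost of a global term exchange.
# USD 0.23 per occurrence, 5 models, 4 languages and 3 formats are from
# the text; 10 occurrences per document is an assumption carried over
# from the earlier example.
COST_PER_OCCURRENCE = 0.23  # USD

models, languages, formats = 5, 4, 3
occurrences_per_document = 10  # assumed

documents = models * languages * formats  # 60 affected documents
total = documents * occurrences_per_document * COST_PER_OCCURRENCE
# Under these assumptions the total matches the USD 138.00 quoted above.
```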
The above accounts only for the immediately calculable
costs of word units in written documentation,
not taking into account the positive effects on
- product liability,
- quality assurance,
- internal training and external user training,
- corporate identity, etc.,
which a firmer grip on ‘corporate language’
and terminology might bring about.
Traditional and new content
creation and data modelling
Traditional content creation:
- by one subject-field expert
- by one LR expert (the LR expert can serve as consultant or as project manager)
- by a group of experts (subject-field experts, or specialised LR experts with subject-field expertise, or a mixed expert group: a majority of experts assisted by one or a few terminologists, or a majority of terminologists assisted by one or a few experts)

New methods of content creation:
- net-based distributed co-operative work to establish content databases, including terminological and other language/knowledge resources
- additional features: (semi)automatic validation, copyright management
- --> all users are potential creators of data
- --> economies of scale in content creation

Traditional data modelling → Enhanced data modelling:
- mono-purpose → multi-purpose and multi-functional
- textual data → graphical symbols, formulae, etc.; images and other visual representations; multimedia, multimodal
- LR data categories → additional ontology data categories; higher degree of granularity
- data elements repeatable by language and within language → other kinds of repeatability (e.g. register); qualifiers, attributes, properties, etc.; statistics, validation, copyright management, etc.
- language-independent approach → multimedia and multimodality (incl. non-linguistic representations)
- by subject-field experts or LR experts with subject-field expertise → by anybody according to level of expertise (--> sophisticated access right management)
- traditional systematic/semi-systematic approach → using also other kinds of systematic approaches
- conventional DB management → sophisticated database management methodology (based on metadata approaches for distributed DBs)
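The “repeatability by language and within language” aspect of enhanced data modelling can be illustrated with a small nested structure: one concept, several language sections, several terms per language, each with its own qualifiers. The field names are illustrative assumptions, not taken from an ISO/TC 37 standard.

```python
# Sketch: an entry whose data elements are repeatable by language and
# within a language (several terms per language, each with a qualifier
# such as register). Field names are illustrative assumptions.
entry = {
    "concept_id": "C-0042",
    "languages": {                       # repeatable by language
        "en": [                          # repeatable within the language
            {"term": "bolt", "register": "neutral"},
            {"term": "threaded bolt", "register": "technical"},
        ],
        "de": [
            {"term": "Bolzen", "register": "neutral"},
        ],
    },
}

def terms(entry, language):
    """All terms recorded for a concept in one language."""
    return [t["term"] for t in entry["languages"].get(language, [])]
```

This higher granularity is what makes the same entry usable for translation, indexing and validation alike, instead of being tied to one purpose.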
Hardware costs are decreasing year by year –
and gradually hardware components are nearing the
time-honoured ideal of ‘plug-and-play’.
Software is still far too expensive – not
in terms of purchase, but in terms of the necessary
adaptation and continuous upgrading. But the increasing
emergence of open-source software will reduce costs
in the long run. High-quality content creation and
maintenance, however, is the biggest cost factor!
In analogy to ISO’s OSI model we need something
like an OCI (Open Content Interoperability) model
– which in fact is the vision of ISO/TC 37.
A higher degree of standardisation of methodology
with respect to LRs is a prerequisite for achieving
satisfactory solutions for information and knowledge
management based on content management in the
enterprise. Increasingly system designers and
developers recognise that only more refined data
models (in terms of a higher degree of granularity
and a higher degree of international unification
and harmonisation) can enable content management
in the organisation to cope with the above-described
cost situation.
References
Andersen. Digital content for global mobile
services. Final report. Luxembourg: CEC, 2002
Andersen. Digital content for global mobile
services. Executive summary. Luxembourg: CEC,
2002
ANNEXE

ISO/TC 37 “Terminology and other language resources”
(PWI “Basic principles of multilingual product classification for e-commerce”)

ISO/TC 37/SC 1 “Principles and methods”
- WG 2 “Vocabulary of terminology”
- WG 3 “Principles, methods and concept systems”
- *WG 4 “Terminology of socio-linguistic applications”

ISO/TC 37/SC 2 “Terminography and lexicography”
- WG 1 “Language coding”
- WG 2 “Terminography”
- WG 3 “Lexicography”
- WG 4 “Source identification for language resources”

ISO/TC 37/SC 3 “Computer applications in terminology”
- WG 1 “Data elements”
- WG 2 “Vocabulary”
- WG 3 “Data interchange”
- WG 4 “Database management”

ISO/TC 37/SC 4 “Language resource management”
- WG 1 “Basic descriptors and mechanisms for language resources”
- *WG 2 “Representation schemes”
- *WG 3 “Multilingual text representation”
- *WG 4 “Lexical database”
- *WG 5 “Workflow of language resource management”

*planned
131, rue du Bac - F-75007 Paris
T: (33 1) 45 49 60 62 / F: (33 1) 45 44 45 97