E.NISSAN

Computer generated alternative coinages:
an automated ranking model for their psychosemantic
transparency
Ephraim Nissan
School of Computing and Mathematical Sciences,
University of Greenwich

Abstract:

ONOMATURGE, an expert system in word formation, is specifically devised for evaluating the quality of neologisms. In fact, it incorporates a quantitative model of how psychosemantically trans-parent candidate neologisms are. Arguably, terminology theory can benefit from such attempts at capturing such characteristics which models of morphological derivation or compounding per se can-not account for.

1 Introduction

ONOMATURGE is an expert system for word formation. The original implementation of ONOMATURGE [15, 16, 17, 19, 20, 21, 23] is for Hebrew, a language for which extensive work has been done from the 1960s developing natural-language processing or information-retrieval tools [12]. Other computational tools in the same domain for the same language also exist [10, 11, 24, 8, 6, 7]. What makes ONOMATURGE special, and of interest for other lan-guages, too, is in that it is specifically concerned with forming candidate neologisms, its de-velopment bearing in mind some of the criteria of terminologists rather than just theoretical morphologists, which is the case, instead, of interesting formal models such as [3, 27, 2].

From an input simplified definition of a concept for which a neologism is desired, ONOMATURGE

- generates candidate coinages, based on a word-formation knowledge-base for a given natural language, and moreover

- ranks the list of coinages so generated according to a computed expectation of how clearly the cognition of native or fluent speakers of the language would identify the sense of the neologism.

The latter capability is made possible by a quantitative model of psychosemantic transpar-ency, of which the paper presents the formulae in detail, and which in turn exploits:

- quantitative knowledge about the relative centrality of sememes or acceptations in extant lexical entries, or in sets thereof sharing a derivational base, or in extant derivational mor-phemes;

- quantitative knowledge about the relative centrality of given lexical or derivational mor-phemes for a given sememe.

Admittedly, corpus linguistics is the proper means for obtaining such precomputed estimates as inputs for the model. We evaluate why ONOMATURGE, without directly resorting to full-text corpora, nevertheless displayed proficiency in psychosemantic ranking.

Terms in isolation versus terms introduced in context involve different mechanisms for sense decipherment. It is argued that the psychosemantic transparency modelling approach presented here is portable across languages. Moreover, suitability for neologized technical terminology is safeguarded if the input definitions enucleate, emphasize or at least include those semantic traits (or semantic descriptors of pragmatic traits) which set the respective lexical concept apart from other, extant or co-innovated lexical concepts (whether already captured by a one-word or compounded term, or not). Yet, if the lexical knowledge base is customized for the given technical domain, its sublexicon, and the lexicon of its ergolect, then those discriminating cues can arguable be amenable to be computed automatically within the quantitative psychosemantic transparency model for coinages which are targeted to the given professional body of users.

2 The structure of ONOMATURGE

We are going to keep the description and discussion of ONOMATURGE minimal, in this paper, in order to focus then on the model of psychosemantic transparency of the candidate neolo-gisms it generates. ONOMATURGE was coded in the Lisp programming language. The control flow of the expert system starts by handling the simplified, sketchy definition given as input, which is reduced to a set of pairs of concepts. Then, the pool of lexical items as well as deri-vation patterns which can express each concept is retrieved, along with information about which other acceptations each such item is also capable of expressing. Next, several attempts at coinage are carried out side by side, and for each attempt, a score of expected transparency is computed. Finally, a ranked list of candidates (possible words) is yielded as output.

Lexical entries, in ONOMATURGE, are represented in frames (with no procedural at-tachments), structured as nested database relations, and coded as a hierarchy nested levels of parentheses, with several structured attributes. In the following, 'frame' will refer to this kind of data-structure. Identifiers of Hebrew word-formation rules are retrieved from a particular subtree of attributes in the lexical frame of the particular acceptation (i.e., word-sense) of a Hebrew term, being the particular sense that the derivational (or compounding) rule itself is intended to convey, in the current session. Each derivational rule has a procedural module associated, for applying it to the given lexical root by selecting the appropriate allomorph. Certain parameters of the lexical and morphological items resorted to, in a word-formation attempt, are manipulated by the control of ONOMATURGE, and a numeric score is calculated in order to evaluate how "good" each candidate neologism would seem to native (or conversant) speakers of Hebrew.

In Semitic languages, including Hebrew, the base for derivation is a lexical root being an ordered sequence of consonants only - usually, three of them (which we symbolize by the meta-characters P, ,, L), sometimes four - and wordformation rules most often impose a given order of vowels and consonants. Formation rules that are based on affix concatenation are not as productive, in Hebrew, as nonconcatenative derivation rules. These have already been discussed and exemplified in previous sections. The AI approach adopted in the archi-tecture of ONOMATURGE makes it an open system, with higher-level heuristics (e.g., some component handling lexical metaphor) enabled to eventually use the morphological word-formation primitives. A phonemic transcription of Hebrew is adopted throughout the system and its lexical database.

Frames in the knowledge-base of ONOMATURGE are associated with lexical entries, and sometimes with formation-rules as well. These frames are trees of properties, that if viewed as database relations, are structured according to a nested schema of attributes - resembling, though not identical with, relation schemata in the so-called nested relations approach to re-lational database management systems [25]. Frames in the knowledge-base of ONOMATURGE are managed by means of RAFFAELLO, a toolbox for representation and retrieval that was de-veloped in the framework of the same project, but also found other applications. RAFFAELLO is invoked by the control programs of ONOMATURGE whenever values should be retrieved from a frame.

Here is a description of the control structure of ONOMATURGE. Through the dialogue-track of a menu-system, or directly, MULTIMICRODEFINITIONS of the expert system is entered, with a list of "microdefinitions" drawn out of the definition of the concept that should be ex-pressed by a term (possibly, a neologism).

A microdefinition is a pair {W^p_i ; W^q_j)}, where each element, e.g., W^p_i is a pair composed of a lexical entry L_i (which may be a term, root, or lexical compound, provided it owns a frame) and of a conventional unambiguous identifier, D^p ; of an acceptation of L_i. The identifier stands for a semantic concept, and is coded as a string patterned after an English term or compound, only it is made conventionally unambiguous.

Downstream of the MULTIMICRODEFINITIONS module in the architecture of control, the SYNONYMYMANAGER module expands the input list of microdefinitions: such microdefini-tions that are variants of original ones are generated, by replacing terms in each, with "synonyms" (that is, near-synonyms "close enough"). If L_i is a term (instead of a root), and the semantics of the root of L_i is close enough, i.e., there would not be too drastic a loss of information if the root is replaced for the term, then this replacement takes place indeed and a new microdefinition is generated with D (L_i) replacing L_i. The same applies to L_j. Possibly, also metonymic (e.g., the whole for the part) or analogical formations may intervene, here, but, whereas a module was envisaged for handling lexical metaphor, it was not incorporated in the implemented system.

Beyond SYNONYMYMANAGER, further down the stream of control, the MORPHOLOGICALCOINER receives a microdefinition {S^p_i ; S^q_j} out of the extended microdefi-nition-list. In the coinage trial that MORPHOLOGICALCOINER performs, S^p_i / {L_i ; D_p} contributes L_i for use as being the base for the derivation (or a lexical item as an element in the generation of a new lexical compound), while S^q_j / {L_j ; D_q}provides the lexical concept, D_q, that should be conveyed by the formation rule adopted. Preferably, a derivation rule should be selected that conveys the semantic concept D_q. Short of that, a lexical compound should be formed, one of whose elements should convey D_q.

Next, identifiers of rules expressing the meaning of S^q_j are retrieved. The derivational pattern, P_x of each rule R^y_x / {P_x ; D_y} is likely to be polysemous, i.e., P_x may also convey some sense different from D_y. If we are looking for such formation rules that would convey concept D_q, then the various choices introduce some ambiguity, which according to numeric scores provided in the frames should eventually enable the evaluation of how clear the candidate neologism is, by applying a function. Anyway, once P_x is selected, then a procedure as-sociated (coded in either FRANZ LISP, or the UNIX SHELL), is applied, which takes as argu-ment the root yielded by the other element in the respective microdefinition. There are differ-ent classes of lexical roots in Hebrew, and according to phonological rules, an appropriate allomorph is selected when applying the derivational morpheme to the given root.

In the frame of the lexical entry L_j, the subtree of attributes owned by the acceptation S^q_j is identified by D^q. A module whose name is SELECTRULES accesses the subtree, whence it retrieves the list of names of those rules R^q_x that may express D^q, and filters out those rules whose pattern P_x may not apply to L_i. For example, if L_i is a root (thus, a sequence of conso-nants with no vowels, as the language is Hebrew and the morphology is nonconcatenative), then P_x should be a free-place formula (a derivation pattern). A suffix could not apply to a root with only consonants and no vowels. In contrast, if P_x is a suffix, L_i should be the stem of a word (thus including both consonants and vowels).

If the match succeeds, P_x is applied to L_i, and a candidate coin is generated. P[x] is ap-plied to LE[i], and a candidate coin is generated. Besides, a numeric score is calculated to ex-press its estimated clarity in terms of psychosemantic transparency. A list of such candidate neologism is generated during a session with ONOMATURGE, and the output is ranked based on the clarity scores.

Suppose we want a neologism coined, to indicate "a match" (for lighting). Historically, the neologism /gafrur/ [gaf'rur] (a diminutive formed out of the root for 'sulphur') won over a few other candidates, and in the sample session, ONOMATURGE generates /gaprur/ (as though it didn't already exist) along with several other possible terms.

Let the list of microdefinitions be: [1]

{ { { gorem   , causing } , { %e$~   ,   fire           } } ,
   { { mak$ir , tool         } , { %e$~   ,   fire           } } ,
   { { qaTan~ , small      } , { goprit ,   sulphur } }      }

Phonemic transcriptions of Hebrew terms are accompanied by a conventional English-like identifier of the relevant acceptation. (The terms are processed as in the given phonemic transcription, i.e., /gorem, ?e$ , mak$ir, qaTan , goprit/, but their actual pronunciation is [go'rem, 'eš, max'šir, ka'tan, gof'rit].)

ONOMATURGE accesses the frames associated with the Hebrew terms included in each microdefinition, and (close near-)synonyms of the terms as per the given acceptations are re-trieved. The synonym-list of each input term is augmented with the roots stems of the original term or of its synonyms, if the loss of meaning that occurs by spoiling the word down to its stem is not too severe (the entity of the loss may be ascertained by consulting numeric values that sometimes are associated with stems in the frame of the considered word). If found, such formation-rules that can express the considered meaning of the given term, are also retrieved from the frame of the term.

For example, let us focus on the last microdefinition. Diminutive formation rules, asso-ciated with the first pair, i.e., with {qaTan~ , small}, include the suffix /-on/, and free-place formulae such as /Pa,LuL/, /P,aL/, and so forth (where P, , and L are the meta-radicals, i.e., are variables that represent, respectively, the first, the second, and the third consonant in any triliteral root that is given to the derivation rule as argument).

Those rules are applied, among the other trials, to the word /goprit/ or to its root; in-deed, /goprit/ appears in the second pair,{goprit , sulphur }, in the microdefinition considered. The suffix is applied to the word /goprit/ itself (yielding /gopriton/), while the free-place for-mulae are applied to the root gpr, and possibly also to a newly formed secondary root, gprt, derived from the noun /goprit/ by deleting the vowels from /goprit/.

These applications of rules yield, among the other results:

-on               (goprit)    >      /gopriton/
P^aL^aL   (gpr)          >     /gparpar/
Pa^LuL     (gpr)          >     /gaprur/

---and even /kab~ariton/, from a Talmudic Aramaic synonym, /kab~arita/, of the pres-ent-day standard Hebrew /goprit/ for 'sulphur'--- which are not necessarily brilliant candidate neologisms. Derivatives or even compounds formed out of root or terms for 'to light' are also generated, and historically such terms were originally proposed for 'match', even though in the material culture of the 20th century they are not specific enough. Indeed, eventually /mac~it/ [ma'tsit] came to denote a "(cigarette) lighter".

Such formation rules that express causation are associated with the pair {gorem , caus-ing}, of the first microdefinition, whereas such rules that generate nouns which denote some instrument are associated instead with the pair {mak$ir , tool}, of the second microdefinition. When those rules are applied to the second pair, i.e.,{%e$~ , fire}, of those two microdefini-tions, a different gamut of candidate coins is obtained. Several semantically related roots (dlq, lhb, yct, etc.) are found in the frame of /%e$ / as associated with sense "fire". Those roots are transmitted as arguments to the ergative and instrumental formation rules mentioned, and coins are generated and output. The quality of those "possible words", if taken to be candi-date neologisms, is uneven. For example, /kab~ariton/ is not transparent, among the other things, because a present-day speaker of Hebrew is unlikely to think of the Aramaic term for 'sulphur', and /kab~ir/ for 'huge' would be a likelier guess. The suffix for the diminutive, /-on/, is widespread in the lexicon and still highly productive, but not just for the diminutive. (Productive derivation patterns are those who are still available for forming new terms, as op-posed to having been in use in some historical period, possibly with ample representation in the historical lexicon, but are no longer resorted to to derive new terms.)

3 The frames of ONOMATURGE

A lexical frame, in ONOMATURGE, embodies a lexical object, and also some information about semantic and other properties of the various senses (i.e., acceptations) that lexical item can convey. Some of the lexical features depend on specific senses of the term represented, whereas some other features are global. Attribute MORPHOLOGICAL_CATEGORY is global. Its values can be: noun, verb, adjective, etc., possibly accompanied by information on gender and also on number, as some terms are used only in the plural (pluralia tantum). Also global is the information about how to inflect the term, i.e., an identification of the inflection class. It is only rarely, in Hebrew, that the inflection class depends on the specific sense. It is the case, indeed, of the feminine noun /Sapa/ [sa'fa], which has different plural forms according to the relevant sense: 'lip' vs. 'language'. (This phenomenon, rare in Hebrew, is common in Dutch.) It is then possible to nest the property about inflection inside the property-subtrees of the sin-gle acceptations instead of the usual practice, in the lexical frames of ONOMATURGE, of plac-ing attribute morphological category at a global level, under the root of the tree, which is the term /Sapa/ itself. Also stylistic information, and data on the origin of the term, or on its reg-ister of use (e.g., standard, slangish, poetic, vulgar, etc.), sometimes are global, and some-times are sense-specific: they depend on the intersection of the given term, and the given se-mantic concept as expressed by it.

Information about how widespread of well-understood a term is, as expressing a par-ticular sense, is included under suitable attributes in the property-subtrees rooted in the vari-ous acceptations, and identified by an English-like identifier (as English terms are ambigu-ous, if necessary we define the identifying string as made of an English term with some more information). Acceptation-subtrees include semantic descriptions, under attributes MEANING and CONNOTATIONS, as well as, when available, lists (under attribute RULES) of identifiers of formationrules (suffixes, etc.) that can convey the specific semantic concept (for example, in English, the suffixes -ify and -ize can be used to convey into verbs a sense (a verbal aspect) of causation, but -ize can also convey other senses).

Let us go back to the morphological-category property. Implementationally, we can have the name for the property replaced by a synonym of the attribute, e.g., MORPHOLOGICAL_CATEGORY by MORPHO_CAT. The Hebrew name for 'road' and 'way' is /derk/ ['derex] (pronounce: dérekh); the term is sometimes used as masculine, and sometimes as feminine. We can implement this in the formulation:

(MORPHO_CAT (noun feminine singular) (noun masculine singular) )

or

(MORPHO_CAT ( noun (feminine masculine) ) )

by assuming the singular by default, whereas the name for 'face' is /panim/, a noun in the plural that also can be masculine or feminine. One can enrich the representation by speci-fying in which historical strata of the Hebrew language the one or the other gender prevailed for the given term (Biblical vs. Tannaitic, Medieval and Modern), as well as point to list of idiomatic use in those expressions that carry the gender from the stratum in which they origi-nated. (Abstracting from the purpose of ONOMATURGE, a lexical database for text-analysis has to be able to cope with language variation in the textual corpus its input is presumed to come, whereas in systems for text-generation, usually standard modern language is desired as output, so we should prescribe /derk/ in the feminine, as such is present-day standard Hebrew use).

In the frame of Hebrew /kebS/ ['keves], there is only one acceptation represented, as the only sense of the term is 'sheep'. (A simplified version of this frame is shown in Code 1.) On the other hand, there are also other terms that mean (almost) the same. Perfect synonyms may exist in the context of a given text, but not as generally stated in the lexicon; among those terms that mean "a sheep", /kebS/ is the most straightforward to think of; we express this knowledge by stating an ordinal number, 1, as value of a particular attribute, relative fre-quency among near-synonyms. near-synonym /raHel/ is specifically applied to an adult fe-male sheep (the exact English semantically equivalent term is 'ewe'), and by modern percep-tion is confined to literary use (it is known especially from the Bible, but the language of the passages where the term occurs, originally was rather colloquial); the noun /raHel/ is most often met with as in its other sense, of a first name: 'Rachel'.

Code 1

Now, let us introduce the notion (which is standard among semanticians) of marked vs. unmarked term. The usual way we generically refer to dogs is by the term dog, that specifi-cally is the name for a male dog, but that applies to dogs independently from gender (we mean the biological gender of the animal, not the morphological gender of the term): the term is, therefore, unmarked. Instead, a bitch is always a female: the term is marked. /raHel/ is a marked feminine for a sheep, whereas /kebS/ is unmarked, though specifically it is feminine. In a subtree of properties associated with the particular near-synonym /raHel/ in the frame of /kebS/ (which does not excludes that /raHel/ has a frame instance on its own, providing fur-ther information), we state differences with respect to /kebS/: according to a certain scale, we subjectively define a distance in frequency of use (relative frequency is stated as 4 for /raHel/ as opposed to 1 for the term /kebS/; value 1 is meant as an ordinal number, standing for "the most frequent"); we circumscribe the register of use of the term /raHel/ to literary contexts, which is information on the term (as opposed to the concept) from a stylistic viewpoint, and we state also the semantic differences of the concept expressed by /raHel/, by stating +adult and +female as values of attribute SEMANTIC_RELATIONSHIP. The plus sign, in the style of semanticians, indicates these particular semantic features hold (whereas a minus sign would indicate their negation holds: the feature is treated as a binary variable). However, /kebS/, the term for 'sheep', is usually understood as an adult female sheep, by default. Thus, to correctly state the difference of /raHel/ with respect to /kebS/, we state that /raHel/ is a marked term for those features: by an implementational convention, we write this as +female_:_marked and +adult_:_marked. It wouldn't be as clear, were we to state, instead of the former, +gender_specific (the latter expressing the gender markedness explained above), as value for attribute SEMANTIC_RELATIONSHIP. Indeed, it does not specify the particular gender, and, more importantly, it would fail to capture the markedness versus unmarkedness opposition, because it is untrue that /kebS/ does not specify the gender of the animal: our first pick, un-less we are told otherwise, is that /kebS/, i.e., 'sheep', is about a /raHel/, i.e., 'ewe'.

Code 2 shows the subframe (i.e., a node in the tree embodied by the data structure), in-side the frame of /kebS/ for 'sheep', that represents the relation to terms and semantic con-cepts that possibly replace /kebS/ in context. New syntactic features from the full-fledged rep-resentation-language are exemplified. Inside a given frame-instance, reference to a particular node in the tree of properties (i.e., to a subframe) is made possible by the defining a label standing for that subframe, by nesting a LABEL property in its upper level. Here, by the label @lamb we refer to the region of the frame so circumscribed in the comment (after the semi-colons, which starts comments in LISP). Instead, lamb as value of attribute keyword is an identifier of a semantic concept: in the frame of /Tale/, it is the identifier of the acceptation "lamb". The identifier lamb_or_kid is employed internally, for convenience, to capture the sense of Hebrew /Se/ (because English lacks a specific equivalent word: English 'kid' has a more restricted sense, as being a name for a non-adult goat, like Hebrew /gdi/; and a more general sense, as a name for non-adult mammals in general). Under attribute SUPERORDINATE, we state, by means of a "proportion", how the concept denoted by /Se/ is more general than the concept denoted by /Tale/ and owning the subframe labelled as @lamb as shown in Code 2. The advantage of labels is that they allow to refer to composite concepts that are not lexical concepts, and therefore own no frame-instance and no permanent status in the universe of concepts expressed by the lexicon. Although these are implementational details, they bear significance to the expressive power of the language; true, syntactically it is just about point-ers in a data structure, but this allows to support, on the representational level, composition of fleeting concepts: it is as convenient as anaphora in natural languages, that is, to put it plainly, as indexing by pronouns such as 'he' or 'it'.

Code 2

There is one more important point to be mentioned, concerning the discussion of the representation. On the implementational level, in the near-synonyms subtree (rooted in attrib-ute NEAR-SYNONYMS) of /kebS/, if we list just the Hebrew term /raHel/, then we are assuming that, in the frame of /raHel/ (if there is one), the corresponding acceptation is identified by the keyword sheep (like /kebS/ in its own frame); instead, if we point by the doubleton ((IS ra-Hel) (KEYWORD ewe)), then we are indicating that this other keyword should be used on access in the frame of /raHel/. This complicates access functions. On the representational level, this corresponds to allowing connections between different lexical concepts, that is, between different concepts that are expressed by terms ('sheep' and 'ewe': a lexical concept is such a concept that is expressed by a term, as opposed to fleeting concepts that occur just once), stated as being near-synonyms, which is a kind of short semantic distance. For more completeness, the contexts of replaceability could be indicated.

Code 3 shows an augmented representation of semantic qualities, with respect to what we have seen in Code 1. The following semantic properties of sheep are nested inside the composed property QUALITIES_OF_CONCEPT, that in turn is found under (denotational) mean-ing, namely, IS_A animal, IS vegetarian, and PRODUCES wool. In the particular implemen-tation for the purposes of ONOMATURGE, - as the representation of the lexical entries and of semantics is inside the same frames - in those properties (attribute/value pairs) that point to a semantic concept as value, this value consists of a doubleton, composed of a Hebrew term (the name of a frame instance) and of the English-like string that identifies the proper accep-tation. For example, one property of "sheep" in the lexical database of ONOMATURGE is: (PRODUCES ((FRAME cemr) (KEYWORD wool)). Instead, in a multilingual Semitic-language lexicon project that was undertaken on the same principles as the lexical database of ONOMATURGE, semantics was separated from the lexicon, and it is enough to point to the English name of the semantic concept alone. One of the reasons the choice made in ONOMATURGE makes sense for that project is that sometimes it is the intersection of a sense and a term, not the sense alone, that carries certain connotations or symbolism: English 'dove' and 'pigeon' are (near-synonyms, just as in Italian colomba and piccione are the same, only piccione is more prosaic and normally would not be resorted to as a symbol of peace. Also, the symbology of sheep in Hebrew is not exactly as in English, where a peculiar con-notated term exists, 'sheepish', that denotes somebody timid and embarrassed. In other lan-guages, other qualities are typified by the concept "sheep" and by the respective terms for it. Instead, in the multilingual and multicultural lexical representation project that was under-taken as a sequel of ONOMATURGE, assertions about term/concept associations are still possi-ble, but as we don't privilege one given language, we often need several descriptors of lan-guage, dialect, historical phase, etc., and just as we did not conglomerate terms and concepts in the same frame, so we did not feel the need for "regularizing" references among entries by sticking to the term/concept doubleton format. Thus, implementation syntax reflects repre-sentational criteria, and also ergonomic criteria of representation in terms of flexibility and convenience.

Code 3

If representation as in Code 3 is richer than the corresponding part of the frame in Code 1, it is not because of the intersection of terms and senses, explained before, but because of the way we roughhew the description of semantic qualities. Let us consider the property of sheep, of producing wool. This is not an entire single quality. In fact, the property is accom-panied by further facets (implemented as parentheses), that we term metaproperties:

(      (PRODUCES ( (FRAME cemr ) (KEYWORD wool ) ) )
        (SPECIFICITY 0.7 )
        (OBLIGATORINESS 0.5 )
)

Consider semantic qualities of the concept "sheep": with the IS_A property, there seems to be no problem, as every sheep is necessarily an animal (unless 'sheep' is taken to mean an im-age of a sheep, such as a picture or a statue, that could be dealt with in terms of metonymy, i.e., sense as extended to a semantically adiacent concept, or of metaphor, i.e., the application of the name or of the context of discussion of a given concept to another concept sharing a given quality of a set of qualities). Instead, a given sheep could be affected by an anomaly because of which it yields no wool. Otherwise, it could happen to have wool, but nobody to cut and exploit that wool (producing wool is a cultural role of sheep, not intrinsic in the ani-mal). In other terms, we have to find some way to indicate whether exceptions can be admit-ted, for the given property. The OBLIGATORINESS facet, inserted in the description of single qualities, accounts for this, by means of a quantitative degree between zero and one.

On the other hand, certain properties can be very typical of a certain concept.The speci-ficity numeric degree (in the range between zero and one) accounts for this. The fact we state, for specificity SPECIFICITY, the value 0.2 for the color property in the frame of the term for 'sulphur', means that sulphur is one out of the first five orange things a given person, who coded the frame, would subjectively think of. Note that typicality is not automatically re-flected in terms of symbology: objective peculiarity may happen not to be known or noticed by a given culture, or not to be reputed there as deserving to become a symbol. Thus, the connotational property SYMBOLIZES can be found in a single quality under connotation, and is not directly inferred from all those single qualities where the specificity degree exceeds a certain threshold. Other metaproperties can account for how much a concept is known or re-puted important in certain cultural contexts, including technical contexts [18, 22]).

For the purposes of ONOMATURGE, it is convenient to enumerate acceptations. How-ever, some senses are close nuances: pinpointing these was avoided, in the lexical database of ONOMATURGE, as this would mislead the expert system's scoring functions that evaluate the clarity of candidate neologisms by reckoning that the more the lexical and derivational entries employed are polysemous, the less the coined term is likely to be clear, as the more mislead-ing semantic cues could be spotted in the word-formation instance. If we are to consider lexi-cal semantics in general, there is a fundamental problem with considering the various accep-tations as discrete. As Putnam [26] puts it,

The problem of a word's having more than one sense is standardly handled by treating each of the senses as a different word (or rather, by treating the word as if it carried invisible subscripts, thus: 'rabbit 1 ' - animal of a cer-tain kind; 'rabbit 2 ' - coward; and as if 'rabbit 1 ' and 'rabbit 2 ' or what-ever were different words entirely). This again involves two very severe ide-alizations (at least two, that is): supposing that words have discretely many senses, and supposing that the entire repertoire od senses is fixed once and for all. Paul Ziff has recently [i.e., in 1972: [29]] investigated the extent to which both of these suppositions distort the actual situation in natural language.

The approach to acceptations in the lexical database of ONOMATURGE is synchronic. For ex-ample, it matters that a given acceptation is the central sememe, but it doesn't matter that a given acceptation is the primary sememe from which etymologists derive the other senses. The historical dimension is not taken into account; but it wouldn't be difficult to include such attributes and constraints that would compel a coinage session to employ only Biblical He-brew lexical and morphological elements, or, instead, to opt for the Tannaitic Hebrew histori-cal stratum.

No practical need was felt, to associate separate frames with different lexemes. The distance among meanings was not felt to be an abiding criterion, other than when clustering the acceptations into lexemes, nor was there a choice of introducing a subdivision within lex-emes above the subdivision into acceptations. Yet, closeness should be a reducing corrective factor in the evaluation of ambiguity. Such a consideration militates against representing sub-tle nuances in separate acceptations, in the frames of ONOMATURGE, at least to the extent that the given level of semantic subdivision is accessed by the dynamic scoring component of the psychosemantic transparency model. One could be tempted into trying to quantitatively model semantic distance (in [19, Section 3.5.2], hypergraphs were considered for that pur-pose), but let it suffice to say that given the state of research in lexical semantics, and Bogu-raev and Pustejowsky's cogent arguments, one should rather proceed in that direction.

Mere enumeration of a discrete number of acceptations, each with its own definition and selectional restrictions, has come indeed under criticism by Boguraev and Pustejovsky [4, 5]. They point out [4] that

- Selectional restrictions tend not to be flexible enough; e.g., the word fast as in a fast dance means "that moves quickly"; as in a fast typist it means "that performs some act quickly"; as in a fast book it means "that takes very little time"; whereas creative use may elude this enumeration: this way, fast as in a fast motorway means that vehicles moving thereby are able to sustain high speed.

- There tends to be permeability among word senses (e.g., "John baked the potato" is about a change of state, whereas "John baked the cake" is about a cake coming into being; [4] quotes this example from [1]).

Therefore, Boguraev and Pustejovsky replace exhaustive enumeration of a predeter-mined number of acceptations, with a "dynamic model of the lexicon", that, to resolve ambi-guity in text understanding, involves not only a matching of features, but also a generative process of semantic generalization at the lexical level, and a mechanism for composing the semantics of the lexical entries at the level of phrases (i.e., semantic compositionality). In some respects, this involves considering nouns not just as passive arguments, but as verb functions, to flexibilize semantic composition, in view of "the open-ended nature of word-combinations and their associated meanings." Different acceptations are conflated into a sin-gle meta-entry.

4 Psychosemantic transparency: the model's statics

This section describes how ONOMATURGE handles the evaluation of how transparent and clear word-formation is, for a given coinage attempt. There are various kinds of ambiguity that af-fect word formation but will not be discussed in the relatively short compass of this paper. Importantly, the effect of context is not taken into account, as coinages here are considered in isolation. Yet, it would be possible, to some extent, to take into account the expectation that a technical neologism will be used in the literature (or in the spoken ergolect) of its specific technical domain, by restricting the range of competing acceptations of the lexical elements employed in the given coinage attempt, to such that are expected to occur in the given kind of communication. Nevertheless, even that wouldn't do, when it comes to a variety of phenom-ena of ambiguity affecting word-formation. A major category is bracketing paradoxes, for which the reader is referred to, e.g., [28, 13]. There are more kinds of ambiguity.

In the frames of ONOMATURGE, there are such attributes that subserve the scoring model. Among the attributes in the lexical frames (as well as in the frames possibly associ-ated with the procedures of word-formation rules), there are such numeric properties that en-able the control component of ONOMATURGE to calculate a scoring function expressing a rough estimate of the psychosemantic clarity of the output of an attempt to coin a new term. A fundamental caveat is called for. Numeric properties represent a subjective estimate of sali-ence, which quite improperly may be though of as sort of a frequency. Yet, this must not be mistaken for word frequencies, which is an established domain with far more rigorous proce-dures, that are not necessarily relevant for our present purposes. Here is a translation of a pas-sage quoted (by kind permission) from an internal memo by Choueka, Laufer, and Weil [9]; these lines are essential for appreciating the limitations of an introspective acquisition method of numeric scores, if these are to be interpreted as frequencies (which should be calculated, instead, by analyzing online textual corpora):

Even if the unscholarly and untrue claim were voiced again, that the mere intuition of linguists and educators could provide a list of basic terms, yet we would still need an online corpus in order to make such discoveries that intuition is clearly not capable of. Intuition does not work, when frequencies have to be determined of the acceptations of polysemous expressions, espe-cially such that their meaning varies according to the register and to the broad semantic context: including also the acceptations of free-place for-mulae, [...].

And then:

Even if we were to admit that all of those acceptations are discovered other than by searching a corpus, could we ever dare to guess the relative fre-quency of each single acceptation? No linguist could ever try to estimate the frequency of every acceptation, and the importance thereof in the various phases of learning the language.

Keeping that point in mind, and also that we are not actually interested in frequency but in some vague quantification of prominence, here is a description of which numeric values the control component of ONOMATURGE retrieves from its knowledge-base to enable the SCORING module to calculate heuristic scores of clarity for the new terms coined. From the frame of L_i, the SCORING module of ONOMATURGE retrieves:

- the AURO score (Actual Use Relevance Ordinal), i.e., a numeric score expressing a sub-jective estimate of the salience of the given acceptation, S^p_i, in ordinary cognition (rather than the actual frequency in use) versus other acceptations of L_i ;

- the AURC score (Actual Use Relevance Cardinal), i.e., a numeric score expressing a subjective estimate of the salience of the given acceptation, S^p_i ,of the given lexical entry, L_i, relatively to the "entire" lexicon (a very tentative definition);

- a normativeness score of L_i or S^p_i, in order to penalize such lexical items that are disliked by institutional language planning (e.g., slangish or other substandard terms or senses thereof).

The expert system also needs some indication of how prominent alternative morphological devices are, with respect to each other and to the rest (see in [14] a different perspective on "Productivity and frequency in morphological classes").

From the frame of the word-formation pattern P_x ; the SCORING module of ONOMATURGE retrieves:

- the AURO score of R^q_x (intuitively assigned salience, or pseudo-frequency, versus other acceptations of P_x);

- the AURC score of R^q_x (intuitively assigned absolute salience or pseudofrequency, i.e., with respect to the "entire" present-day set of productive word-formation rules in the language at hand, i.e., Hebrew);

- a normativeness score of P_x or R^q_x (for penalizing substandard formation devices).

From the subtree of the acceptation S^q_j, inside the frame of L i ; the SCORING module of ONOMATURGE retrieves the RO score (Relevance Ordinal) of the rule R^q_x. This score ex-presses how often the rule is instantiated as expressing the meaning D_q, in the present-day lexicon of the language, versus the pseudofrequency of instantiation of the other rules that also may express D_q.

Moreover, in the procedure associated with the word-formation pattern P_x, sporadically a numeric score is associated with a selected allomorph, just in case the allomorphs are (which is properly a contradiction in terms) not mutually exclusive "combinatorial variants" for the given root (i.e., the given phonological context). This rarely happens, owing to histori-cal reasons, yet, it can be exploited sometimes as a degree of liberty.

Moreover, other numeric scores are retrieved for possible use, to account for the fact that, say, in the microdefinition being processed, a term that was stated in the input of the ses-sion has possibly been replaced with its root (thus possibly introducing more ambiguity); or, then, for the fact that the SYNONYMYMANAGER component has replaced a term from the input with some (near-)synonym retrieved from its lexical frame. The numeric score is used for in-troducing a penalty in the computed score, to account for the presumed departure from the specific meaning intended.

A function (heuristically defined) computes the output score of each candidate coin, out of the parameters listed above. To make the notation simpler, with just one subscript, let L stand for the lexical entry, R for the word-formation rule, and A for a particular semantic concept, which is possibly expressed by the given L or R ; this being indicated, in the nota-tion, by inserting -between A and either L or R - an "and" operator, i.e., v (for our purposes, it is also admissible to speak of an "intersection").

As seen lexical entry, L_i, is expected to have a nested relation associated, which some-what improperly we called "frame", and that is structured as a tree. We have already seen that the acceptations each have a subtree of properties associated. The acceptation is, more gener-ally, a semantic concept A_j, independently of L_i. In the particular case that A_j is expressed by term L_i ; we write this as L_ivA_j. As seen, sometimes a word-formation pattern - or device: let us indicate it more generally by R_k - also has a nested relation associated. This is desirable, even though in practice, usually not always implemented, in ONOMATURGE.

Here is how we express the sense of the AURO score of, respectively, a lexical entry or a word-formation rule:

where the horizontal line means "with respect to". Here is, instead, AURC:

The formula representing RO is as follows:

which stands for the salience (or perceived spread) of the given rule/acceptation intersection, as compared to the intersection of that acceptation with the "entire" ruleset.

5 The model's dynamics: heuristic scoring functions

The main contribution to the output of generate appraisal, the overall scoring function in ONOMATURGE is given by:

root_contribution_in_terms_of_diffusion =
              AURO_contribution (root_AURO_value)
        +   AURC_contribution (root_AURC_value)
        +   AURO_difference_contribution (root, AURO_value)

These three addends are based, respectively,

- on the value of AURO as retrieved from the subframe of the relevant acceptation in the frame owned by the root,

- on the value of AURC as retrieved in the subframe of the same acceptation, and

- on the set of AURO degrees of all of the acceptations found in the frame owned by the root.

We incorporate the contribution of the RO parameter, that reflects the way a given lexical sense can be conveyed by a given derivation rule, out of several rules that also can convey the same meaning (the rarer the rule as for that sense, the higher the value of RO):

root_diffusion_contribution_and_RO =
root_contribution_in_terms_of_diffusion
+ RO_contribution (RO_of_applied_rule)

where

2
RO_contribution = 4 * (40 - RO )

Purists tend to condemn slangish terms or usage, and if we are to allow deference to such concerns, we have to find some way to reflect the contribution of normativeness. Actually, in the expert system, NORMATIVENESS is defined as a numerical attribute whose value is in the closed interval [0,1]. Provisionally, the function root and RO normativeness filter returns just the value of normativeness as found in the frame of the root; the default value (if the norma-tiveness attribute, or even the frame instance, is not found), is 1, that is, it is assumed assume that here is no normativistic opposition to the use of the considered root.

root_and_RO_contribution =
root_diffusion_contribution_and_RO
@ root_and_RO_normativeness_filter (root, root_acceptation_keyword)

where the operation indicated by the symbol @ is either multiplication or division, according to whether the value of attribute root_diffusion_contribution_and_RO is a non-negative or negative number, respectively. The purpose of this is to enforce the normativeness filter as a penalty on low-normativeness roots.

The role of the function root_and_RO_normativeness_filter is to retrieve from the frame of the frame, and to multiply, the values of attribute normativeness

- as possibly found inside the subframe of the particular acceptation concerned,

- or at the top level of the frame (i.e., the normativeness of the lexical entry itself),

- or, possibly, in the subframe of the lexeme containing the subframe of the relevant ac-ceptation (as it should be allowed to state normativeness as associated with lexemes, in-stead or along with such an indication associated with acceptations).

Whenever no normativeness degree is found in the frame at the top level, or in the accepta-tion chunk, or in the lexeme chunk, then the respective degree is assumed, by default, to be 1. Thus, the function root_and_RO_normativeness_filter always returns a real number belong-ing to the closed interval [0,1], and smaller or equal to the smaller of the three retrieved nor-mativeness degrees listed above.

Now, let us define arguments we introduced in the above formulae:

2
AURO_contribution = 4 * (40 - RO )

Instead, AURC_contribution has been defined segmentwise, roughly as a compromise be-tween a linear and a parabolic curve.

As to AURO_difference_contribution, it is due to the difference between successive values of AURO inside the same frame instance: the smaller the difference, the more confu-sion is likely to arise, so we penalize small differences in AURO values with respect to the value of AURO associated with the relevant acceptation.

AURO_difference_contribution_penalizes zero or small distances between AURO values of acceptations, when one of the acceptations is the one currently considered. Besides, that function penalizes lexical entries with a high number of acceptations. If the frame includes N acceptations (where N > 1 ), but only M values of AURO can be retrieved (where M < N ), we assume the worst case, as if there was repetition of equal values in the AURO poset (i.e., partially ordered set), and such penalty is important especially if it is the relevant acceptation that has no AURO value associated. Indeed, in such cases we assume that rare acceptations cannot be easily identified, and that acceptations whose AURO value is missing, are confu-sion-prone. The AURO poset of the frame is sorted. To the penalty, both differences contrib-ute: of the AURO value associated with the relevant acceptation, with respect to the immedi-ately greater AURO value, and with respect to the immediately smaller AURO value.

If the relevant acceptation has the best (i.e., smallest) AURO value in the poset, then contribution_of_forward_distance_is set to zero. Else, if AURO_difference_with_antecedent > 1 ; then

contribution_of_forward_distance =

2
3 * AURO_difference_with_antecedent

If the relevant acceptation has the worst (i.e., either nil, or the largest) AURO value in the poset, then contribution_of_backward_distance is set to zero. Else, if AURO_difference_with_successor < 1, then

contribution_of_backward_distance =

2
3 * AURO_difference_with_successor

As to the global AURO difference contribution, it is given by the formula:

AURO_difference_contribution =
          2 * (poset_cardinality - 1)
       - 3 * (position_in_poset_of_relevant_AURO - 1)
     - 10 * (how_many_repeated_AURO_values - 1)
+     contribution_of_forward_distance
+     contribution_of_backward_distance

where

- position_in_poset_of_relevant_AURO is the ordinal, inside the AURO poset, of the AURO value of the relevant acceptation, and

- how_many_repeated_AURO_values is the number of acceptations that have the same AURO value as the relevant acceptation.

The influence of meaning-aggregation levels of scoring must be pointed out. AURO and AURC are considered as associated with acceptations of the root (and possibly of the rule, if a frame for the rule is also found). That means that - for the purposes of checking those scores - we recognize acceptations as being the only relevant unit of meaning inside possible meaning-aggregation hierarchies. We don't distinguish between lexemes and acceptations as being different levels of aggregation, for the purposes of scoring; just acceptations are con-sidered, and possible intermediate aggregations of meaning upon the acceptation level are ignored.

Apart from the RO score, how do derivational rules contribute to the estimation of the clarity of a candidate neologism, in ONOMATURGE? The following two formulae are an equivalent, for rules, of formulae we gave for the contribution of roots:

rule_contribution_in_terms_of_diffusion =
rule_AURO_contribution (rule_AURO_value)
+ rule_AURC_contribution (rule_AURC_value)
+ rule_AURO_difference_contribution (rule, rule_AURO_value)

Also normativeness considerations can contribute, in the same way as described for the root:

polysemy_and_normativeness_contribution_of_the_rule =
rule_contribution_in_terms_of_diffusion
@ rule_normativeness_filter (rule, rule_acceptation_keyword)

where the operation indicated by the symbol @ is either multiplication or division, according to whether the value of the attribute rule_contribution_in_terms_of_diffusion is non-negative or negative, respectively. (This enables to penalize, e.g., slangish suffixes that purists reject).

To reflect the fact roots are the ones that catch the eye (or the ear), when we try to ana-lyze a term, whereas the contribution of the derivational rule is somewhat secondary, the out-put score is given by:

weighted_score =
( 2 * root_and_RO_contribution
+ 1 * polysemy_and_normativeness_contribution_of_the_rule ) / 3

Moreover, the choice was made to penalize compounds, by giving them a lower score than such terms that are obtained by derivation. Actually, compounds tend to be clearer than de-rivatives, but oftentimes in Hebrew derivatives tend to be considered to be more elegant (even though this statement could be disputed, when considered more carefully).

References

1. B.T. Atkins, J. Kegl and B. Levin, "Anatomy of a Verb Entry." Journal of Lexicographic Research, 1(1).

2. O. Bat-El, "Phonology and Word Structure in Modern Hebrew." Ph.D. Dissertation (Linguistics). University of California, Los Angeles, 1989.

3. O. Bat-El, "Parasitic Metrification in the Modern Hebrew Stress System." The Linguistic Review, 10(3), 189--210 (1993).

4. B. Bogu raev and J. Pustejovsky, "Lexical Ambiguity and the Role of Knowledge Representation in Lexicon Design." Technical Report, Lexical Systems Group, IBM T.J. Watson Research Center, Yorktown Heights, N.Y., 1990.

5. B. Boguraev and J. Pustejovsky, The Generative Lexicon. Bradford / The MIT Press, Cambridge, Mass., 1995.

6. Y. Choueka, "Automatic Grammatical Analysis of the Hebrew Verb." Proceedings of the 2nd National Conference of the Information Processing Association (IPA) of Israel, Rehovot. IPA, Jerusalem, 1966, pp. 49--67. In Hebrew. English abstract: The Fi-nite String, 4 (1967).

7. Y. Choueka, "Pick-a-Word, or an On-Line Program for Automatic Analysis and Synthesis of Hebrew Words." Proceedings of the International Conference on Literary and Linguistic Computing (ALLC'79), Kibbutz Shfayim, Israel, 1979.

8. Y. Choueka and M. Shapiro, "Machine Analysis of Hebrew Morphology: Potentiali-ties and Achievements." Leshonenu, 27/28. The Academy of the Hebrew Language, Jerusalem, 1964, pp. 354--372. In Hebrew. English abstract: The Finite String, 3 (1966).

9. Y. Choueka, A. Laufer and Weil, "A Corpus of Contemporary Hebrew and the Fre-quencies of the Words in It." (Hebrew.) Internal Memo, Information Retrieval Labo-ratory, Department of Mathematics and Computer Science, Bar-Ilan University, Ramat-Gan, Israel, 1991.

10. M. Ephratt, "Root-Pattern Array: Main Tool for generating Hebrew Words." (He-brew). School of Advanced Studies, The Hebrew University, Jerusalem, 1984/5.

11. M. Ephratt, "Semantic Properties of the Root-Array Pattern." Computers and Translation, 3(3/4), 215--236 (1988).

12. I. Lancaster and E. Nissan, "Hebrew." Sec. 15.24.6 in I. Lancaster, ed., The Humanities Computing Yearbook 1989--90: A Comprehensive Guide to Software and Other Resources. Clarendon Press, Oxford, 1991, pp. 292--300.

13. M. Light, "Taking the Paradoxes Out of Bracketing in Morphology." In Proceedings of the Second Formal Linguistics Society of Mid-America Conference (FLSM'91), 1991.

14. C.L. Moder, "Productivity and Frequency in Morphological Classes." In S. Choi, D. Devitt, W. Janis, T. McCoy and Z.-s. Zhang, eds., Proceedings of the Eastern States Conference in Linguistics (ESCOL'85), SUNY Buffalo, Oct. 3--5, 1985. Publ.: The Ohio State University, 1986; distrib.: CLC Publications, Cornell University, Ithaca, N.Y.

15. E. Nissan, "Could an Expert System Perform What Schoenberg Couldn't for Moses? Word-Coinage in the Bible's Tongue: ONOMATURGE, a Lexical Mint." Proceedings of COGNITIVA'85, Paris, France, June 1985. CESTA, Paris, 1985, Vol. 1, pp. 95--100,

16. E. Nissan, "On the Architecture of ONOMATURGE, An Expert System Inventing Neologisms." Proceedings of the 12th Conference of the Association for Literary and Linguistic Computing (ALLC), Nice, France, 1985. Champion-Slatkine, Geneva and Paris, 1985, Vol. 2, pp. 671--680.

17. E. Nissan, "ONOMATURGE: An Expert System for Word-Formation and Morpho-Semantic Clarity Evaluation." Part I: "The Task in Perspective, and a Model of the Dynamics of the System." Part II: "The Statics of the System. The Representation from the General Viewpoint of Knowledge-Bases for Terminology." In H. Czap and C. Galinski, eds., Terminology and Knowledge Engineering. (Proceedings of the First International Conference, Trier, Germany, 1987. Indeks Verlag, Frankfurt, 1987, pp. 167--176 and 177--189.

18. E. Nissan, "Exception-Admissibility and Typicality in Proto-Representations." In H. Czap and C. Galinski, eds., Terminology and Knowledge Engineering. (Proceedings of the First International Conference, Trier, Germany, 1987. Indeks Verlag, Frankfurt, 1987, pp. 253--267.

19. E. Nissan, "ONOMATURGE: An Expert System in Word-Formation." Ph.D. Disser-tation (Computer Science), Ben-Gurion University of the Negev, Beer-Sheva, Israel, 1988. 3 vols. (600 pp.). In English. Project awarded the 1988 IPA Award in Computer Science.

20. E. Nissan, "Derivational Knowledge and the Common Sense of Coping With the In-completeness of Lexical Knowledge." In J. Lopes Alves, ed., Information Technology & Society: Theory, Uses, Impacts. Associação Portuguesa para o Desenvolvimento das Comunicações (APDC), & Sociedade Portuguesa de Filosofia (SPF), Lisbon, 1992, pp. 462--477.

21. E. Nissan, "ONOMATURGE: An Artificial Intelligence Tool and Paradigm for Sup-porting National and Native Language Fostering Policies." AI & Society: The Journal of Human-Centred Systems and Machine Intelligence, 5(3), 202--217 (1991).

22. E. Nissan, "Meanings, Expression, and Prototypes." Pragmatics and Cognition, 3(2), pp. 317--364 (1995).

23. E. Nissan, "The Lexical Mint." (Hebrew). Hebrew Linguistics, 36, 39--49 (1992).

24. U. Ornan, "Machinery for Hebrew Word Formation." In M.C. Golumbic, ed., Advances in Artificial Intelligence: Natural Language and Knowledge-based Systems. Springer-Verlag, Berlin, 1990, pp. 75--93.

25. M.Z. Özsoyoglu, ed., Special issue on Nested Relations, The IEEE Data Engineering Bulletin, 11(3). IEEE, 1988.

26. H. Putnam, "The Meaning of 'Meaning'". In H. Putnam, ed., Mind, Language and Reality. Philosophical Papers, Vol. 2. Cambridge University Press, Cambridge, U.K., 1975.

27. Y. Sharvit, "Issues in the Phonology and Morphology of the Modern Hebrew Verbal System: An Optimality Theoretic Analysis." In M. Hovav and A. Mittwoch, eds., Proceedings of the 10th Annual Conference of the Israel Association for Theoretical Linguistics, University of Haifa, Haifa, and Ben-Gurion University of the Negev, Beer-Sheva, 1994. CLC Publishing, Cornell University, Ithaca, N.Y., 1995; distrib.: N. Francez, Dept. of Computer Science, The Technion, Haifa, Israel.

28. A. Spencer, "Bracketing Paradoxes and the English Lexicon." Language, 64(4), 663--682 (1988).

29. P. Ziff, "Understanding Understanding". Cornell University Press, Ithaca, N.Y., 1972.

[1] By either % or ?, the coup de glotte is represented, whereas $ stands for the consonant �, and T for the phoneme t. As to the tilde character, it doubles the consonant that precedes it. Such doubling is phonemic, even though some current pronunciations do not actually dou-ble the consonant phonetically.

_retour à la page principale_