The Th(IC)2 Initiative: Corpus-Based Thesaurus
Construction for Indexing WWW Documents
Nathalie Aussenac-Gilles* and Didier Bourigault**
* Université Toulouse 3, Institut de Recherche en Informatique de Toulouse (IRIT), 118 route de Narbonne, 31062 Toulouse Cedex 4 (F) - aussenac@irit.fr
** Université Toulouse Le Mirail, Etudes et Recherches en Syntaxe et Sémantique (ERSS), Maison de la recherche, 5 allées Antonio Machado, 31048 Toulouse Cedex (F) - didier.bourigault@univ-tlse2.fr
Abstract. This working paper reports on the early stages of our contribution to the
Th(IC)2 project, in which, together with other French research teams, we want to
test and demonstrate the interest of corpus analysis methods to design domain
knowledge models. The project should lead to produce a thesaurus in French about
KE research. The main stages of the method that we apply to this experiment are (a)
setting up a corpus, (b) selecting, adapting and combining the use of relevant NLP
tools, (c) interpreting and validating their results, from which terms, lexical
relations or classes are extracted, and finally (d) structuring them into a semantic
network. We present the LEXTER system used to automatically extract from a
corpus a list of term candidates that could later be considered as descriptors. We
also comment upon the validation protocol that we set up: it relies on an interface
via the Internet and on the involvement of the French KE community.
1 The Th(IC)2 Initiative
1.1 A contribution to the (KA)2 initiative
The Th(IC)2 project is an initiative of the French TIA1 special interest group. With this project, some French researchers in Knowledge Engineering (KE) intend to contribute to the (KA)2 project2 [4]. Initiated in 1998, the (KA)2 initiative aims at building an ontology that would be used by researchers in the domain of KE in order to index their own web pages with “semantic tags” corresponding to concepts in this ontology. In its current state, the (KA)2 ontology contains the knowledge necessary to describe the administrative organisation of research in the field, but few items related to the content of the research itself. The target of the Th(IC)2 contribution is to enrich the part of the (KA)2 ontology dedicated to the description of research topics in the KE community.
1 The TIA special interest group (http://www.biomath.jussieu.fr/TIA/) is a research group in Linguistics, NLP and AI concerned with text-based acquisition of ontological and terminological resources. The authors, as well as the members of the TIA group, thank the "Direction Générale à la Langue Française" (DGLF) for supporting the Th(IC)2 project.
2 http://www.aifb.uni-karlsruhe.de/WBS/broker/KA2
With a larger scope, our methodological proposals can prove relevant in the broader context of designing community web portals.
The first purpose of the Th(IC)2 project is to build a thesaurus in French which will describe how KE research develops in the French-speaking area, with its specificities and strengths. Indeed, we will first draw up a state of the art of the research topics that are currently addressed in the French KE community. This must be done before it can be included into a broader description. This thesaurus will have a conventional structure: a set of descriptors referring to research topics will be organised in a taxonomy and connected via synonymy and “see also” links. The correspondence between this thesaurus and the (KA)2 formal ontology will be established in a second stage.
1.2 Using corpus-based methods to build a thesaurus
The overall process proposed by the promoters of the (KA)2 project is to use tools, methods and languages developed by the Knowledge Acquisition (KA) community in order to build the ontology. This recursive prerequisite explains the square in (KA)2. In the same spirit, the TIA group wants to test and demonstrate the interest of some KE results, and particularly those resorting to corpus analysis methods. A new trend appeared recently, derived from a major evolution of Terminology [3]. It resorts to both acquisition tools based on linguistics and browsing and modelling tools with links between models and texts. This evolution is due to new corpus-oriented Natural Language Processing (NLP) tools whose efficiency has increased thanks to valuable collaborations between linguists, terminologists and knowledge engineers. This trend is clearly less ambitious than the automatic transfer approach: NLP tools are viewed as aids for knowledge engineers who select and combine their results to develop a model.
The tools and methods developed by the TIA group members should be useful for ontology design. We assume that a thesaurus is a kind of lexico-conceptual resource similar to ontologies, at least enough to resort to the same corpus-based techniques to design them. Comparing thesauri and ontologies is one of the issues that could be made clearer thanks to this project.
This paper describes one of the experiments carried out within the TIA group to build this thesaurus. The group set up a general method [3] that defines a framework within which knowledge engineers select and adapt the relevant tools for the application at hand, according to the documents and expertise available, the corpus language and the kind of resources to build. The main stages of this method are (a) setting up a corpus, (b) selecting, adapting and combining the use of the relevant NLP tools, (c) interpreting and validating their results, from which terms, lexical relations or classes are extracted, and finally (d) structuring them into a semantic network. This working paper reports on a particular experiment that illustrates most of these stages:
1. A first corpus as representative as possible of research activities within the French
speaking KE community is set up (section 2).
2. The LEXTER system is used to automatically extract from this corpus a list of term
candidates that could later be considered as descriptors (section 3).
3. A validation protocol is defined: the single term list is automatically subdivided into sub-lists according to the number of texts comprised in the original corpus; these sub-lists are validated through an interface via the Internet (section 4).
Further stages (section 5) include the selection of terms and their organisation into a thesaurus that is then structured with the help of additional tools. Finally, the French KE community will be asked to validate the whole.
2 Building a reference corpus
The TIA group used all available criteria to set up an exhaustive and representative corpus. To this end, the corpus gathers many documents produced in the domain and distributed as follows:
- 32 descriptions of laboratories or teams working in the field of KE ("AFIA sub-corpus") published in a special report on KE in the 34th issue of the Bulletin de l'Association Française d'Intelligence Artificielle. Each description (of an average size of 975 words) shortly outlines the main directions of investigation of a team or laboratory, its main results, collaborations and publications.
- 35 papers from a recently edited book on KE ("LIVRIC sub-corpus") [8]. This book collects a selection of papers from the proceedings of the French conferences in KE (IC) that were organised between 1995 and 1998. The average size of the papers is 5 095 words. Most of the topics addressed by research in KE at this time are quite well represented.
                                      AFIA sub-corpus          LIVRIC sub-corpus
Document type                         Laboratory descriptions  Scientific papers
Number of documents                   32                       35
Average number of words per document  975                      5 095
Total number of words                 31 212                   178 336

Table 1: Some figures about the reference corpus of the Th(IC)2 project
3 Extracting term candidates with LEXTER
A preliminary selection of terms is performed using LEXTER, which is a term extractor
[6] [7]. The input of LEXTER is an unambiguously tagged corpus. The output is a network of term candidates, that is words or sequences of words which are likely to be chosen as entries in a thesaurus or concept labels in an ontology. The extraction process is composed of two main steps.
1. Shallow parsing techniques implemented in the Splitting module detect morpho-syntactical patterns that cannot be parts of terminological noun phrases, and that are therefore likely to indicate noun phrase boundaries. In order to process correctly some problematic splittings, such as coordinations, attributive past participles and ambiguous preposition + determiner sequences, the system acquires and uses corpus-based selection restrictions of adjectives and nouns. Ultimately, the Splitting module produces a set of text sequences, mostly noun phrases, which we refer to as Maximal-Length Noun Phrases (henceforth MLNP).
2. The Parsing module recursively decomposes the maximal-length noun phrases into two syntactic constituents: a constituent in head position (e.g. 'model' in the noun phrase 'conceptual model') and a constituent in expansion position (e.g. 'conceptual' in the same noun phrase). The Parsing module exploits rules in order to extract two subgroups from each MLNP, one in head position and the other in expansion position. Most MLNP sequences are ambiguous: two (or more) binary decompositions may compete, corresponding to several possibilities of prepositional phrase or adjective attachment. Disambiguation is performed by a corpus-based method that relies on endogenous learning procedures.
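As an illustration of this two-step process, the following Python sketch mimics the Splitting and Parsing modules on an English toy sentence. The boundary-word list, the right-headed bracketing and the absence of attachment disambiguation are simplifying assumptions: LEXTER itself works on tagged French text, where the head is on the left ('modèle conceptuel'), and resolves competing decompositions by endogenous learning.

```python
import re

# Assumed boundary words: tokens that, in this toy grammar, cannot occur
# inside a terminological noun phrase (determiners, verbs, conjunctions).
BOUNDARIES = {"the", "a", "an", "is", "are", "uses", "and", "we", "for"}

def split_mlnp(text):
    """Splitting module (step 1): cut the token stream at boundary words
    and keep the Maximal-Length Noun Phrases (MLNP) that remain."""
    phrase, mlnps = [], []
    for tok in re.findall(r"[a-z']+", text.lower()):
        if tok in BOUNDARIES:
            if phrase:
                mlnps.append(tuple(phrase))
            phrase = []
        else:
            phrase.append(tok)
    if phrase:
        mlnps.append(tuple(phrase))
    return mlnps

def decompose(phrase):
    """Parsing module (step 2): recursively split an MLNP into an
    expansion constituent and a head constituent.  We assume English
    right-headed compounds and a single left-branching bracketing,
    e.g. [[problem solving] method]."""
    pairs = []
    while len(phrase) > 1:
        expansion, head = phrase[:-1], phrase[-1:]
        pairs.append((expansion, head))
        phrase = expansion  # recurse into the multi-word expansion subgroup
    return pairs

mlnps = split_mlnp("the conceptual model uses a problem solving method")
# mlnps -> [('conceptual', 'model'), ('problem', 'solving', 'method')]
```

Each (expansion, head) pair corresponds to one syntactic decomposition of a term candidate; real MLNP ambiguity would require choosing among several such bracketings.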
Term candidate (French)            English gloss                        Freq.
modèle conceptuel                  conceptual model                     135
résolution de problème             problem solving                      121
ingénierie de la connaissance      knowledge engineering                120
acquisition des connaissances      knowledge acquisition                106
système d'information              information system                   106
connaissance du domaine            domain knowledge                      92
candidat terme                     term candidate                        63
système à base de connaissances    knowledge-based system                56
génie logiciel                     software engineering                  55
modélisation de la connaissance    knowledge modelling                   50
base de données                    database                              47
logique de description             description logic                     46
aide à la décision                 computer-supported decision making    46
modèle d'expertise                 expertise model                       45
structure prédicative              predicative structure                 44
points de vue                      point of view                         43
ingénieur de la connaissance       knowledge engineer                    41
mesure de similarité               similarity measure                    39
modèle générique                   generic model                         39
graphe conceptuel                  conceptual graph                      38
type de connaissance               knowledge type                        38
méthode de résolution de problème  problem solving method                37
travail coopératif                 co-operative work                     37
représentation de la connaissance  knowledge representation              36
gestion de la connaissance         knowledge management                  33
fouille de donnée                  data mining                           33
niveau d'abstraction               abstraction level                     33
contexte partagé                   shared context                        32
langage de modélisation            modelling language                    32
méthode de résolution              resolution method                     32
ontologie de l'expertise           expertise ontology                    32
acquisition de connaissances       knowledge acquisition                 31
appel d'offre                      call for proposals                    29
processus de conception            design process                        29
mémoire d'entreprise               corporate memory                      28
mot clé                            key word                              28
fonction test                      test function                         27
management par projet              project management                    27
modèle de raisonnement             reasoning model                       27
cycle de vie                       life cycle                            26
espace de connaissances            knowledge space                       25
domaine d'application              application domain                    25
système expert                     expert system                         25
base de connaissance               knowledge base                        24
système informatique               computer system                       24
langage de représentation          representation language               23
unité linguistique                 linguistic unit                       23
relation sémantique                semantic relation                     23
premier temps                      first stage                           23
haut niveau                        high level                            22
base de cas                        case base                             22
modèle de connaissances            knowledge model                       22
système coopératif                 co-operative system                   22
processus d'acquisition            acquisition process                   22
primitive de modélisation          modelling primitive                   21
dossier médical                    medical file                          20
relation causale                   causal relation                       20
primitive conceptuelle             conceptual primitive                  20
niveau connaissance                knowledge level                       20
type de document                   document type                         20

Table 2: The most frequent term candidates in the Th(IC)2 corpus
The subgroups generated by the Parsing module, together with the MLNP extracted by the Splitting module, are the term candidates produced by LEXTER. This set of term candidates is represented as a network: each multi-word term candidate is connected to its head constituent and to its expansion constituent by syntactic decomposition links. Building the network is especially important for the purpose of term acquisition. LEXTER has been used in many applications aiming at gathering lexical and/or conceptual resources, such as terminological knowledge bases, ontologies, thesauri, etc. [6], [1].
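A minimal sketch of such a network can be given in Python, assuming a flat space-joined string representation for candidates and invented triples echoing Table 2 (LEXTER's actual data structures are not described here):

```python
def build_network(decompositions):
    """Build the term-candidate network from (candidate, head, expansion)
    string triples: each multi-word candidate is linked to its head and
    expansion constituents by syntactic decomposition links."""
    return {cand: {"head": head, "expansion": exp}
            for cand, head, exp in decompositions}

def candidates_with_head(net, head):
    """Group the candidates sharing a head constituent, one way in which
    the network supports term acquisition (e.g. all kinds of 'model')."""
    return sorted(c for c, links in net.items() if links["head"] == head)

# Invented triples; English glosses are used for legibility.
network = build_network([
    ("conceptual model", "model", "conceptual"),
    ("generic model", "model", "generic"),
    ("problem solving method", "method", "problem solving"),
])
# candidates_with_head(network, "model")
#   -> ['conceptual model', 'generic model']
```

Grouping candidates through shared heads or expansions is what makes the network, rather than a flat list, valuable for term acquisition.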
In this experiment, the number of term candidates extracted by LEXTER from the Th(IC)2 corpus is given in table 3 and the most frequent term candidates are listed in table 2.
                           freq = 1    freq > 1    Total
Number of term candidates  17 189      3 879       21 068

Table 3: Number of term candidates extracted by LEXTER from the Th(IC)2 corpus
4 Evaluation protocol
4.1 Generating sub-lists of term candidates for individual validation
The most frequent term candidates appear to be relevant descriptors, and thus must be considered as valid entries in the thesaurus. However, this simple numeric criterion is not powerful enough to select without any error or omission a set of descriptors that will cover the whole range of research activities in KE in a precise and exhaustive manner. Some term candidates with a low frequency should be considered. So the validation process should bear on the entire list of extracted term candidates.
Given the very large size of this list, it is hard to imagine that a small number of persons would undertake the validation of the entire list. It is doubtful that such a group would have the competence and time required to check the whole domain and corpus. Moreover, this thesaurus will not be used to massively index large document bases, but rather as a precise map of the KE domain intended as a reference document for researchers. This is why we have set up a collective and manual validation process: we ask every researcher to validate the term candidates extracted from his/her own texts.
In order to make this individual validation possible, we have decomposed the list of term candidates into as many sub-lists as there are documents in the corpus.
- For each document in the LIVRIC sub-corpus, we have selected those candidate terms occurring at least twice in the document, or only once in the document and at least once in another document from the LIVRIC sub-corpus. The average number of term candidates in the sub-lists is 81.
- For each document in the AFIA sub-corpus, we have selected those candidate terms occurring at least twice in the document. The average number of term candidates in the sub-lists is 48.
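The two selection rules above can be sketched as follows; the document identifiers and frequency tables are invented for illustration:

```python
def sublist(doc_id, doc_freq, sub_corpus):
    """Select the term candidates submitted to the author of `doc_id`.
    `doc_freq` maps each document of the sub-corpus to a {term: frequency}
    dict.  AFIA rule: keep terms occurring at least twice; LIVRIC rule:
    additionally keep a term occurring once if it also occurs in another
    document of the sub-corpus."""
    selected = set()
    for term, f in doc_freq[doc_id].items():
        if f >= 2:
            selected.add(term)
        elif sub_corpus == "LIVRIC" and f == 1:
            if any(term in doc_freq[other]
                   for other in doc_freq if other != doc_id):
                selected.add(term)
    return sorted(selected)

# Invented toy frequencies for two LIVRIC documents.
doc_freq = {
    "livric-01": {"conceptual model": 3, "shared context": 1, "toy term": 1},
    "livric-02": {"shared context": 2},
}
# sublist("livric-01", doc_freq, "LIVRIC") keeps 'conceptual model'
# (frequency 3) and 'shared context' (a hapax, but also in livric-02);
# 'toy term' occurs nowhere else and is dropped.
```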
This validation protocol requires involving all the researchers concerned as authors. We consider this participation as very beneficial. Firstly, it is a very enriching experiment for an author: he gets a picture of his document in a form both unusual for him and familiar enough to be interpreted. Secondly, we assume that, in line with the (KA)2 project promoters, the success of an experiment like the Th(IC)2 project strongly depends on the strong involvement of the community members. They should not only be users of the thesaurus, but should take part in the early stages of its design ("Do not ask what the community can do for you. Ask what you can do for the community!").
4.2 A validation interface on the web
To implement this collaborative validation process, we designed a web interface through which the authors can access and validate the sub-list of term candidates built up from their text. A snapshot of the validation interface is given on figure 1.
Figure 1 : A snapshot of the validation interface.
At this stage, the main difficulty was to formulate precise validation procedures so that any author would validate the list of term candidates “in the same spirit”. We have led many experiments in which specialists were asked to validate lists of term candidates. One of the main lessons learned from these experiments is that decision making is heavily dependent on the goal of the task, that is, the type of lexical and/or conceptual resource that is under concern. Roughly speaking, starting from the same list of term candidates, the set of selected terms will not be the same whether the validated terms are to be integrated as descriptors in a thesaurus used by an automatic indexing system, or as concept labels in an ontology used by a knowledge-based system. For this reason, we will first explain to the authors what the main goal of the Th(IC)2 project is (that is, building a thesaurus for the KE community). We will then ask them not to index the document from which term candidates were extracted, but to select term candidates according to their relevance and usefulness to characterise their own research within the field of KE.

5 Further stages
5.1 Expertise-based cross-validation by the community
The next step planned in the project is to launch the validation process by soliciting members of the teams described in the AFIA sub-corpus and authors of papers of the LIVRIC sub-corpus. We will then synthesise all the results and build an initial list of descriptors by gathering all the term candidates that were selected by at least one author. During this stage, we will also have to gather synonym terms, to add simple terms that could help organise more complex ones, and to get to a consensual view by comparing the various lists. The resulting list will serve as a bootstrap for further work.
5.2 From term lists to a thesaurus
The main task will then be to structure this list with conventional thesaurus links [3]. This task will rely on two main approaches, carried out in parallel:
- A corpus-based bottom-up analysis using the results of natural language processing tools such as term extractors, relation extractors, clustering tools, etc. Links may indeed be revealed by term use in texts. Good means to identify these links may be to browse term occurrences, which may be costly, to look for co-occurring terms, or to extract lexical relationships. We plan to use a lexical relation extractor, Caméléon [9], to check the links related to the selected terms, and to explore domain-specific relations. By this means, additional information will be available to decide how to organise the thesaurus descriptors into a taxonomy. Lexical relations are also good inputs to precisely describe domain concepts. This analysis is likely to provide the lower-level layers of the thesaurus.
- A top-down approach based on our expertise in the domain's global organisation as a research field. In short, this will lead us to define the high-level layers of the thesaurus, which will organise the lower-level layers previously mentioned. More precisely, expertise is very useful to directly get to the right interpretation of any textual item and to avoid further text investigations. It is also likely to shortcut some references to texts when trying to differentiate descriptors from one another. Moreover, most of the high-level structuring descriptors are not in the texts and must be acquired from domain experts. Although this process may seem very pragmatic and intuitive, our goal is to make explicit as many modelling rules as possible.
A modelling tool, such as Terminae [5] or Géditerm [2], will help to store, to browse, to structure and to describe the terms, their relations and their definitions.
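As an illustration of the bottom-up side of this analysis, the sketch below counts terms co-occurring within the same text, a simple cue for candidate "see also" links. The toy corpus and term list are invented, and Caméléon's actual pattern-based relation extraction is not modelled here.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(docs, terms):
    """Count, for each document, the unordered pairs of known terms that
    appear together; frequent pairs suggest thesaurus links to review."""
    pairs = Counter()
    for text in docs:
        present = sorted(t for t in terms if t in text)
        pairs.update(combinations(present, 2))
    return pairs

docs = [
    "the conceptual model supports knowledge acquisition",
    "knowledge acquisition produces a conceptual model of expertise",
    "ontology design needs a conceptual model",
]
terms = {"conceptual model", "knowledge acquisition", "ontology"}
# cooccurrences(docs, terms).most_common(1)
#   -> [(('conceptual model', 'knowledge acquisition'), 2)]
```

Pairs with high counts would then be examined by hand, or passed to a relation extractor, before becoming actual thesaurus links.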
6 Discussion
Beyond the possible contribution to the (KA)2 project, this experiment raises two major issues for the KE community:
- What are the qualitative and quantitative benefits (in terms of design cost and time, domain coverage, quality of the final knowledge structure…) of a corpus-analysis-based approach?
- What are the right structuring and formalisation levels for an efficient indexing of researchers' web pages? Is it worth undertaking the design of a formal ontology with very well-defined links, or is a thesaurus enough?

References
1. Assadi H. (1998) Construction of a regional ontology from texts and its use within a documentary system. In Formal Ontology in Information Systems (FOIS'98), Guarino N. (Ed.), Frontiers in Artificial Intelligence and Applications, Vol. 46, pp 236-249. IOS Press, Amsterdam.
2. Aussenac-Gilles N. (1999) GEDITERM, un logiciel de gestion de bases de connaissances terminologiques. Terminologies Nouvelles n°19, pp 111-123.
3. Aussenac-Gilles N., Biébow B. & Szulman S. (2000) Revisiting Ontology Design: a methodology based on corpus analysis. In Proc. of the 12th Int. Conference on Knowledge Engineering and Knowledge Management (EKAW'2000), Juan-les-Pins (F), Oct. 2000.
4. Benjamins R., Fensel D. & Decker S. (1999) (KA)2: Building ontologies for the Internet: A Mid-term Report. International Journal of Human Computer Studies, 51(3):687.
5. Biébow B. & Szulman S. (1999) TERMINAE: A linguistic-based tool for the building of a domain ontology. In Proc. of the 11th European Workshop on Knowledge Acquisition, Modelling and Management (EKAW'99), Dagstuhl Castle, Germany, pp 49-66.
6. Bourigault D. (1995) LEXTER, a Terminology Extraction Software for Knowledge Acquisition from Texts. In Proceedings of the 9th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'95), Banff, Canada.
7. Bourigault D., Gonzalez-Mullier I. & Gros C. (1996) LEXTER, a Natural Language Tool for Terminology Extraction. In Proceedings of the 7th EURALEX International Congress, Göteborg, Sweden, pp 771-779.
8. Charlet J., Zacklad M., Kassel & Bourigault D. (2000) Ingénierie des Connaissances : évolutions récentes et nouveaux défis. Paris: Eyrolles.
9. Séguéla P. & Aussenac-Gilles N. (1999) Extraction de relations sémantiques entre termes et enrichissement de modèles du domaine. In Actes de IC'99 (Conférence Française d'Ingénierie des Connaissances), Paris, pp 79-88.