Recent posts
Semantic Assistants: SOA for Text Mining
With the rapidly growing amount of information available, employees spend an ever-increasing proportion of their time searching for the right information. Information overload has become a serious threat to productivity. We address this challenge with a service-oriented architecture that integrates semantic natural language processing services into desktop applications.
Beyond Information Silos — An Omnipresent Approach to Software Evolution
Abstract
Nowadays, software development and maintenance are highly distributed processes that involve a multitude of supporting tools and resources. Knowledge relevant for a particular software maintenance task is typically dispersed over a wide range of artifacts in different representational formats and at different abstraction levels, resulting in isolated 'information silos'. An increasing number of task-specific software tools aim to support developers, but this often results in additional challenges, as not every project member can be familiar with every tool and its applicability for a given problem. Furthermore, historical knowledge about successfully performed modifications is lost, since only the result is recorded in versioning systems, but not how a developer arrived at the solution. In this research, we introduce conceptual models for the software domain that go beyond existing program and tool models, by including maintenance processes and their constituents. The models are supported by a pro-active, ambient, knowledge-based environment that integrates users, tasks, tools, and resources, as well as processes and history-specific information. Given this ambient environment, we demonstrate how maintainers can be supported with contextual guidance during typical maintenance tasks through the use of ontology queries and reasoning services.
Converting a Historical Architecture Encyclopedia into a Semantic Knowledge Base
Proceedings of the Workshop New Challenges for NLP Frameworks (NLPFrameworks 2010)
Background
Natural language processing frameworks like GATE and UIMA have significantly changed the way NLP applications are designed, developed, and deployed. Features such as component-based design, test-driven development, and resource meta-descriptions now routinely provide higher robustness, better reusability, faster deployment, and improved scalability. They have become the staple of both NLP research and industrial application, fostering a new generation of NLP users and developers.
These are the proceedings of the workshop New Challenges for NLP Frameworks (NLPFrameworks 2010), held in conjunction with LREC 2010, which brought together users and developers of major NLP frameworks.
Ontology-Based Extraction and Summarization of Protein Mutation Impact Information
Introduction
Poster at BioNLP 2010: Ontology-Based Extraction and Summarization of Protein Mutation Impact Information
NLP methods for extracting mutation information from the bibliome have become an important new research area within bio-NLP, as manually curated databases, like the Protein Mutant Database (PMD) (Kawabata et al., 1999), cannot keep up with the rapid pace of mutation research. However, while significant progress has been made with respect to mutation detection, the automated extraction of the impacts of these mutations has so far not been targeted. In this paper, we describe the first work to automatically summarize impact information from protein mutations. Our approach is based on populating an OWL-DL ontology with impact information, which can then be queried to provide structured information, including a summary.
Automatic Quality Assessment of Source Code Comments: The JavadocMiner
Abstract
An important software engineering artefact used by developers and maintainers to assist in software comprehension and maintenance is source code documentation. It provides insights that help software engineers to effectively perform their tasks, and therefore ensuring the quality of the documentation is extremely important. Inline documentation is at the forefront of explaining a programmer's original intentions for a given implementation. Since this documentation is written in natural language, ensuring its quality needs to be performed manually. In this paper, we present an effective and automated approach for assessing the quality of inline documentation using a set of heuristics, targeting both quality of language and consistency between source code and its comments. We apply our tool to the different modules of two open source applications (ArgoUML and Eclipse), and correlate the results returned by the analysis with bug defects reported for the individual modules in order to determine connections between documentation and code quality.
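To make the idea of a consistency heuristic between source code and its comments concrete, here is a minimal Python sketch that compares the `@param` tags of a Javadoc comment with the parameters in a method signature. This is an illustration only: the function name `param_mismatches`, the sample method, and the choice of this particular check are assumptions, and the abstract does not state that the JavadocMiner implements this heuristic in this way.

```python
import re


def param_mismatches(javadoc: str, signature: str):
    """Return (undocumented, stale) parameter names for one Java method.

    Hypothetical helper for illustration. It splits the parameter list on
    commas, so generic types such as Map<String, Integer> are not handled.
    """
    documented = set(re.findall(r"@param\s+(\w+)", javadoc))
    # Take the text between the parentheses of the signature.
    inner = signature[signature.index("(") + 1 : signature.rindex(")")]
    declared = set()
    for part in inner.split(","):
        tokens = part.split()
        if tokens:
            declared.add(tokens[-1])  # last token is the parameter name
    undocumented = sorted(declared - documented)  # in code, not in comment
    stale = sorted(documented - declared)         # in comment, not in code
    return undocumented, stale


# Assumed sample input: the comment documents a parameter that was renamed.
javadoc = """/**
 * Copies a file.
 * @param source the file to read
 * @param target the file to write
 */"""
signature = "void copy(File source, File dest, boolean overwrite)"

undocumented, stale = param_mismatches(javadoc, signature)
```

Here `undocumented` lists parameters missing from the comment and `stale` lists documented names that no longer appear in the signature. Counts of such mismatches per module could then be compared with defect counts.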
Semantic Content Access using Domain-Independent NLP Ontologies
Abstract
We present a lightweight, user-centred approach for document navigation and analysis that is based on an ontology of text mining results. This allows us to bring the result of existing text mining pipelines directly to end users. Our approach is domain-independent and relies on existing NLP analysis tasks such as automatic multi-document summarization, clustering, question-answering, and opinion mining. Users can interactively trigger semantic processing services for tasks such as analyzing product reviews, daily news, or other document sets.
Leverage of OWL-DL axioms in a Contact Centre for Technical Product Support
Abstract
Real-time access to complex knowledge is a business driver in the contact centre environment. In this paper, we outline a knowledge sharing paradigm for the domain of telecom technical product support, in which a desktop client annotates named entities in technical documents with canonical names, class names, or relevant class axioms, derived from an ontology by means of a web services framework. We describe the system and its core components: an OWL-DL telecom hardware ontology, an ontological natural language processing pipeline, an ontology axiom extractor, and the Semantic Assistants framework.
Flexible Ontology Population from Text: The OwlExporter
Abstract
Ontology population from text is becoming increasingly important for NLP applications. Ontologies in OWL format provide a standardized means of modeling, querying, and reasoning over large knowledge bases. Populated from natural language texts, they offer significant advantages over traditional export formats, such as plain XML. The development of text analysis systems has been greatly facilitated by modern NLP frameworks, such as the General Architecture for Text Engineering (GATE). However, ontology population is not currently supported by a standard component. We developed a GATE resource called the OwlExporter that makes it easy to map existing NLP analysis pipelines to OWL ontologies, thereby allowing language engineers to create ontology population systems without requiring extensive knowledge of ontology APIs. A particular feature of our approach is the concurrent population and linking of a domain- and NLP-ontology, including NLP-specific features such as safe reasoning over coreference chains.
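The general idea of mapping pipeline annotations to ontology individuals can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the OwlExporter's implementation and not the GATE API: the `CLASS_MAP` table, the `ex:` prefix, the `populate` function, and the sample annotations are all hypothetical, and the output is a plain list of triples rather than an OWL file.

```python
# Hypothetical mapping from NLP annotation types to ontology class names.
CLASS_MAP = {"Person": "ex:Person", "Organization": "ex:Organization"}


def populate(annotations, class_map=CLASS_MAP):
    """Turn (annotation_type, text) pairs into subject-predicate-object triples."""
    triples = []
    for index, (ann_type, text) in enumerate(annotations):
        owl_class = class_map.get(ann_type)
        if owl_class is None:
            continue  # annotation type has no ontology counterpart
        # One individual per mapped annotation, named after its type and position.
        individual = f"ex:{ann_type.lower()}_{index}"
        triples.append((individual, "rdf:type", owl_class))
        triples.append((individual, "rdfs:label", text))
    return triples


# Assumed sample pipeline output; "Token" is deliberately left unmapped.
annotations = [("Person", "Ada Lovelace"), ("Token", "the"), ("Organization", "ACM")]
triples = populate(annotations)
```

Keeping the mapping in a table separate from the export loop mirrors the goal stated above: a language engineer edits the mapping, not ontology API code.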
Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet
Abstract
Source code contains a large amount of natural language text, particularly in the form of comments, which makes it an emerging target of text analysis techniques. Due to the mix with program code, it is difficult to process source code comments directly within NLP frameworks such as GATE. Within this work we present an effective means for generating a corpus using information found in source code and in-line documentation, by developing a custom doclet for the Javadoc tool. The generated corpus uses a schema that is easily processed by NLP applications, which allows language engineers to focus their efforts on text analysis tasks, like automatic quality control of source code comments. The SSLDoclet is available as open source software.
Predicate-Argument EXtractor (PAX)
Abstract
In this paper, we describe the open source GATE component PAX for extracting predicate-argument structures (PASs). PASs are used in various contexts to represent relations within a sentence structure. Different "semantic" parsers extract relational information from sentences, but no common format exists for storing this information. Our predicate-argument extractor component (PAX) takes the annotations generated by selected parsers and transforms the parsers' results to predicate-argument structures represented as triples (subject-verb-object). This allows downstream components in an analysis pipeline to process PAS triples independently of the deployed parser, as well as to combine the results from several parsers within a single pipeline.
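As a rough illustration of this normalization step, the following Python sketch converts the output of two parsers into one shared subject-verb-object triple type. The two input formats and all names (`Triple`, `from_dependency_edges`, `from_role_frame`) are hypothetical and chosen for this example. They are not taken from PAX, from GATE, or from any specific parser.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Parser-independent predicate-argument structure."""
    subject: Optional[str]
    verb: str
    obj: Optional[str]


def from_dependency_edges(edges):
    """Hypothetical parser A: a list of (head, relation, dependent) edges."""
    arguments = {}
    for head, relation, dependent in edges:
        subj, obj = arguments.get(head, (None, None))
        if relation == "nsubj":
            subj = dependent
        elif relation == "dobj":
            obj = dependent
        else:
            continue  # ignore relations that are not core arguments
        arguments[head] = (subj, obj)
    return [Triple(s, verb, o) for verb, (s, o) in arguments.items()]


def from_role_frame(frame):
    """Hypothetical parser B: a dict with 'predicate', 'ARG0', and 'ARG1'."""
    return [Triple(frame.get("ARG0"), frame["predicate"], frame.get("ARG1"))]


# Assumed sample outputs of the two parsers for the same sentence.
edges = [
    ("extracts", "nsubj", "PAX"),
    ("extracts", "dobj", "structures"),
    ("extracts", "det", "the"),
]
frame = {"predicate": "extracts", "ARG0": "PAX", "ARG1": "structures"}

# Both converters yield the same representation, so results can be merged.
merged = from_dependency_edges(edges) + from_role_frame(frame)
```

Downstream code only ever sees `Triple` values, which is what lets it stay independent of the parser that produced them.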
Believe It or Not: Solving the TAC 2009 Textual Entailment Tasks through an Artificial Believer System
Abstract
The Text Analysis Conference (TAC) 2009 competition featured a new textual entailment search task, which extends the 2008 textual entailment task. The goal is to find information in a set of documents that is entailed by a given statement. Rather than designing a system specifically for this task, we investigated the adaptation of an existing artificial believer system to solve it. The results show that this is indeed possible, and furthermore that it allows us to recast the existing, divergent tasks of textual entailment and automatic summarization under a common umbrella.
A Quality Perspective of Evolvability Using Semantic Analysis
Abstract
Software development and maintenance are highly distributed processes that involve a multitude of supporting tools and resources. Knowledge relevant to these resources is typically dispersed over a wide range of artifacts, representation formats, and abstraction levels. In order to stay competitive, organizations are often required to assess and provide evidence that their software meets the expected requirements. In our research, we focus on assessing non-functional quality requirements, specifically evolvability, through semantic modeling of relevant software artifacts. We introduce our SE-Advisor that supports the integration of knowledge resources typically found in software ecosystems by providing a unified ontological representation. We further illustrate how our SE-Advisor takes advantage of this unified representation to support the analysis and assessment of different types of quality attributes related to the evolvability of software ecosystems.
A Belief Revision Approach to Textual Entailment Recognition
Abstract
An artificial believer has to recognize textual entailment to categorize beliefs. We describe our system – the Fuzzy Believer system – and its application to the TAC/RTE three-way task.
ERSS at TAC 2008
Abstract
An Automatically Generated Summary
ERSS 2008 attempted to rectify certain issues of ERSS 2007. The improvements to readability, however, are not reflected in significant score increases, and in fact the system fell in the overall ranking. While we have not concluded our analysis, we present some preliminary observations here.
Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients
Abstract
Today's knowledge workers have to spend a large amount of time and manual effort on creating, analyzing, and modifying textual content. While more advanced semantically-oriented analysis techniques have been developed in recent years, they have not yet found their way into commonly used desktop clients, be they generic (e.g., word processors, email clients) or domain-specific (e.g., software IDEs, biological tools). Instead of forcing the user to leave his current context and use an external application, we propose a "Semantic Assistants" approach, where semantic analysis services relevant for the user's current task are offered directly within a desktop application. Our approach relies on an OWL ontology model for context and service information and integrates external natural language processing (NLP) pipelines through W3C Web services.
Story-driven Approach to Software Evolution

Abstract
From a maintenance perspective, only software that is well understood can evolve in a controlled and high-quality manner. Software evolution itself is a knowledge-driven process that requires the use and integration of different knowledge resources. The authors present a formal representation of an existing process model to support the evolution of software systems by representing knowledge resources and the process model using a common representation based on ontologies and description logics. This formal representation supports the use of reasoning services across different knowledge resources, allowing for the inference of explicit and implicit relations among them. Furthermore, an interactive story metaphor is introduced to guide maintainers during their software evolution activities and to model the interactions between the users, knowledge resources and process model.
Ontological Approach for the Semantic Recovery of Traceability Links between Software Artifacts

Abstract
Traceability links provide support for software engineers in understanding relations and dependencies among software artefacts created during the software development process. The authors focus on re-establishing traceability links between existing source code and documentation to support software maintenance. They present a novel approach that addresses this issue by creating formal ontological representations for both documentation and source code artefacts. Their approach recovers traceability links at the semantic level, utilising structural and semantic information found in various software artefacts. These linked ontologies are supported by ontology reasoners to allow the inference of implicit relations among these software artefacts.
A General Architecture for Connecting NLP Frameworks and Desktop Clients using Web Services

Abstract
Despite impressive advances in the development of generic NLP frameworks, content-specific text mining algorithms, and NLP services, little progress has been made in enhancing existing end-user clients with text analysis capabilities. To overcome this software engineering gap between desktop environments and text analysis frameworks, we developed an open service-oriented architecture, based on Semantic Web ontologies and W3C Web services, which makes it possible to easily integrate any NLP service into client applications.
Semantic Technologies in System Maintenance (STSM 2008)

Abstract
This paper gives a brief overview of the International Workshop on Semantic Technologies in System Maintenance. It describes a number of semantic technologies (e.g., ontologies, text mining, and knowledge integration techniques) and identifies diverse tasks in software maintenance where the use of semantic technologies can be beneficial, such as traceability, system comprehension, software artifact analysis, and information integration.
Enhancing the OpenOffice.org Word Processor with Natural Language Processing Capabilities

Abstract
Today's knowledge workers are often overwhelmed by the vast amount of readily available natural language documents that are potentially relevant for a given task. Natural language processing (NLP) and text mining techniques can deliver automated analysis support, but they are often not integrated into commonly used desktop clients, such as word processors. We present a plug-in for the OpenOffice.org word processor Writer that provides access to any kind of NLP analysis service mediated through a service-oriented architecture. Semantic Assistants can now provide services such as information extraction, question-answering, index generation, or automatic summarization directly within an end user's application.
Professional Activities
I have been involved in a number of review and event organization activities.
New Job, New Website
Submitted by rene on Sat, 2008-05-31 19:00.
A Semantic Wiki Approach to Cultural Heritage Data Management
Abstract
Providing access to cultural heritage data beyond book digitization and information retrieval projects is important for delivering advanced semantic support to end users, in order to address their specific needs. We introduce a separation of concerns for heritage data management by explicitly defining different user groups and analyzing their particular requirements. Based on this analysis, we developed a comprehensive system architecture for accessing, annotating, and querying textual historic data. Novel features are the deployment of a Wiki user interface, natural language processing services for end users, metadata generation in OWL ontology format, SPARQL queries on textual data, and the integration of external clients through Web Services. We illustrate these ideas with the management of a historic encyclopedia of architecture.
Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles

Abstract
Reported speech, both direct and indirect, is an important indicator of evidentiality in traditional newspaper texts, but also increasingly in the new media that rely heavily on citation and quotation of previous postings, as for instance in blogs or newsgroups. This paper details the basic processing steps for reported speech analysis and reports on the performance of an implementation in the form of a GATE resource.

