Welcome to my personal homepage. Here you can find information on my current research as well as my publications and other activities. More research-related information is published on semanticsoftware.info, where you can also contact me. For the socially networked, I'm also on Google+, LinkedIn, Xing and Twitter.
Motivation: Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.
Results: We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
Availability: The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger.
Intelligent Software Development Environments: Integrating Natural Language Processing with the Eclipse Platform
Software engineers need to be able to create, modify, and analyze knowledge stored in software artifacts. A significant number of these artifacts contain natural language, such as version control commit messages, source code comments, or bug reports. Integrated software development environments (IDEs) are widely used, but they are concerned only with structured software artifacts: they do not offer support for analyzing unstructured natural language and relating this knowledge to the source code. We present an integration of natural language processing capabilities into the Eclipse framework, a widely used software IDE. It allows users to execute NLP analysis pipelines through the Semantic Assistants framework, a service-oriented architecture for brokering NLP services based on GATE. We demonstrate a number of semantic analysis services helpful in software engineering tasks, and evaluate one task in detail, the quality analysis of source code comments.
Integrating Wiki Systems, Natural Language Processing, and Semantic Technologies for Cultural Heritage Data Management
Modern documents can easily be structured and augmented to have the characteristics of a semantic knowledge base. Many older documents may also hold a trove of knowledge that would deserve to be organized as such a knowledge base. In this chapter, we show that modern semantic technologies offer the means to make these heritage documents accessible by transforming them into a semantic knowledge base. Using techniques from natural language processing and Semantic Computing, we automatically populate an ontology. Additionally, all content is made accessible in a user-friendly Wiki interface, combining original text with NLP-derived metadata and adding annotation capabilities for collaborative use. All these functions are combined into a single, cohesive system architecture that addresses the different requirements from end users, software engineering aspects, and knowledge discovery paradigms. The ideas were implemented and tested with a volume from the historic Encyclopedia of Architecture and a number of different user groups.
Mutation impact extraction is a hitherto unaccomplished task in state-of-the-art mutation extraction systems. Protein mutations and their impacts on protein properties are hidden in scientific literature, making them poorly accessible for protein engineers and inaccessible for phenotype-prediction systems that currently depend on manually curated genomic variation databases.
We present the first rule-based approach for the extraction of mutation impacts on protein properties, categorizing their directionality as positive, negative or neutral. Furthermore, protein and mutation mentions are grounded to their respective UniProtKB IDs, and selected protein properties, namely protein functions, are grounded to concepts found in the Gene Ontology. The extracted entities are used to populate an OWL-DL Mutation Impact ontology, facilitating complex querying for mutation impacts using SPARQL. We illustrate retrieval of proteins and mutant sequences for a given direction of impact on specific protein properties. Moreover, we provide programmatic access to the data through semantic web services using the SADI (Semantic Automated Discovery and Integration) framework.
We address the problem of access to legacy mutation data in unstructured form through the creation of novel mutation impact extraction methods, which are evaluated on a corpus of full-text articles on haloalkane dehalogenases, tagged by domain experts. Our approaches show state-of-the-art levels of precision and recall for Mutation Grounding and a respectable level of precision but lower recall for the task of Mutant-Impact relation extraction. The system is deployed using text mining and semantic web technologies with the goal of publishing to a broad spectrum of consumers.
Natural language processing frameworks like GATE and UIMA have significantly changed the way NLP applications are designed, developed, and deployed. Features such as component-based design, test-driven development, and resource meta-descriptions now routinely provide higher robustness, better reusability, faster deployment, and improved scalability. They have become the staple of both NLP research and industrial application, fostering a new generation of NLP users and developers.
These are the proceedings of the workshop New Challenges for NLP Frameworks (NLPFrameworks 2010), held in conjunction with LREC 2010, which brought together users and developers of major NLP frameworks.
NLP methods for extracting mutation information from the bibliome have become an important new research area within bio-NLP, as manually curated databases, like the Protein Mutant Database (PMD) (Kawabata et al., 1999), cannot keep up with the rapid pace of mutation research. However, while significant progress has been made with respect to mutation detection, the automated extraction of the impacts of these mutations has so far not been targeted. In this paper, we describe the first work to automatically summarize impact information from protein mutations. Our approach is based on populating an OWL-DL ontology with impact information, which can then be queried to provide structured information, including a summary.
We present a lightweight, user-centred approach for document navigation and analysis that is based on an ontology of text mining results. This allows us to bring the result of existing text mining pipelines directly to end users. Our approach is domain-independent and relies on existing NLP analysis tasks such as automatic multi-document summarization, clustering, question-answering, and opinion mining. Users can interactively trigger semantic processing services for tasks such as analyzing product reviews, daily news, or other document sets.
An important software engineering artefact used by developers and maintainers to assist in software comprehension and maintenance is source code documentation. It provides insights that help software engineers to effectively perform their tasks, and therefore ensuring the quality of the documentation is extremely important. Inline documentation is at the forefront of explaining a programmer's original intentions for a given implementation. Since this documentation is written in natural language, ensuring its quality needs to be performed manually. In this paper, we present an effective and automated approach for assessing the quality of inline documentation using a set of heuristics, targeting both quality of language and consistency between source code and its comments. We apply our tool to the different modules of two open source applications (ArgoUML and Eclipse), and correlate the results returned by the analysis with bug defects reported for the individual modules in order to determine connections between documentation and code quality.
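To make the idea of a code/comment consistency heuristic concrete, here is a minimal Python sketch. It is an illustrative assumption, not one of the heuristics used in the paper: the function names and the scoring rule (the fraction of a method's parameter names that its comment mentions) are hypothetical.

```python
import re
from typing import Dict, List


def mentioned_parameters(comment: str, parameters: List[str]) -> Dict[str, bool]:
    """For each parameter name, report whether the comment mentions it."""
    # Collect identifier-like tokens so that "width" does not match "widths".
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", comment))
    return {name: name in tokens for name in parameters}


def consistency_score(comment: str, parameters: List[str]) -> float:
    """Hypothetical score: fraction of parameters documented in the comment."""
    if not parameters:
        # Nothing to document, so the comment cannot be inconsistent.
        return 1.0
    mentioned = mentioned_parameters(comment, parameters)
    return sum(mentioned.values()) / len(parameters)


if __name__ == "__main__":
    doc = "Computes the area. @param width the width in pixels"
    print(consistency_score(doc, ["width", "height"]))
```

A low score flags a comment that may have drifted from the method signature. A real tool would combine several such checks with language-quality measures.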
Real-time access to complex knowledge is a business driver in the contact centre environment. In this paper, we outline, for the domain of telecom technical product support, a knowledge sharing paradigm in which a desktop client annotates named entities in technical documents with canonical names, class names or relevant class axioms, derived from an ontology by means of a web services framework. We describe the system and its core components: an OWL-DL telecom hardware ontology, an ontological natural language processing pipeline, an ontology axiom extractor, and the Semantic Assistants framework.