GATE Components
Predicate-Argument EXtractor (PAX)
Abstract
[Screenshot of MultiPAX results]
In this paper, we describe the open source GATE component PAX for extracting predicate-argument structures (PASs). PASs are used in various contexts to represent relations within a sentence structure. Different "semantic" parsers extract relational information from sentences, but there is no common format for storing this information. Our predicate-argument extractor component (PAX) takes the annotations generated by selected parsers and transforms the parsers' results into predicate-argument structures represented as triples (subject-verb-object). This allows downstream components in an analysis pipeline to process PAS triples independently of the deployed parser, and to combine the results from several parsers within a single pipeline.
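To make the idea of a parser-independent triple concrete, here is a minimal sketch in Python. It is not PAX's actual data model or API (PAX is a GATE component and works on GATE annotations); the names PasTriple, merge_by_triple, parser_a, and parser_b are hypothetical and chosen only for illustration. The sketch shows how subject-verb-object triples from several parsers could be stored in one common form and combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PasTriple:
    """Hypothetical parser-independent predicate-argument triple."""
    subject: str
    verb: str
    obj: str


def merge_by_triple(results):
    """Map each triple to the sorted names of the parsers that produced it.

    `results` is a dict from a parser name to the list of triples
    derived from that parser's output (an assumed input format).
    """
    sources = defaultdict(set)
    for parser_name, triples in results.items():
        for triple in triples:
            sources[triple].add(parser_name)
    return {triple: sorted(names) for triple, names in sources.items()}


# Two made-up parsers that agree on one triple and differ on another.
results = {
    "parser_a": [PasTriple("PAX", "extracts", "structures")],
    "parser_b": [
        PasTriple("PAX", "extracts", "structures"),
        PasTriple("parsers", "produce", "annotations"),
    ],
}
merged = merge_by_triple(results)
```

Because the triples are plain immutable values, a downstream step can compare or count them without knowing which parser produced them; the merged mapping simply records the agreement between parsers.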
Flexible Ontology Population from Text: The OwlExporter
Abstract
Ontology population from text is becoming increasingly important for NLP applications. Ontologies in OWL format provide a standardized means of modeling, querying, and reasoning over large knowledge bases. Populated from natural language texts, they offer significant advantages over traditional export formats, such as plain XML. The development of text analysis systems has been greatly facilitated by modern NLP frameworks, such as the General Architecture for Text Engineering (GATE). However, ontology population is not currently supported by a standard component. We developed a GATE resource called the OwlExporter that makes it easy to map existing NLP analysis pipelines to OWL ontologies, thereby allowing language engineers to create ontology population systems without requiring extensive knowledge of ontology APIs. A particular feature of our approach is the concurrent population and linking of a domain ontology and an NLP ontology, including NLP-specific features such as safe reasoning over coreference chains.
Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles

Abstract
Reported speech, both direct and indirect, is an important indicator of evidentiality in traditional newspaper texts, and increasingly also in new media that rely heavily on citation and quotation of previous postings, such as blogs or newsgroups. This paper details the basic processing steps for reported speech analysis and reports on the performance of an implementation in the form of a GATE resource.
A Self-Learning Context-Aware Lemmatizer for German

Abstract
Accurate lemmatization of German nouns mandates the use of a lexicon. Comprehensive lexicons, however, are expensive to build and maintain. We present a self-learning lemmatizer capable of automatically creating a full-form lexicon by processing German documents.
Durm German Lemmatizer v1.0 Released
Submitted by rene on Thu, 2007-05-31 08:59.
I'm happy to announce the first public release of our free/open source Durm Lemmatization System for the German language.
The release comes with source code, binaries, documentation, resources (German lexicon, Case Tagger probabilities), and manually annotated texts from the German Wikipedia for evaluation.
Multi-lingual Noun Phrase Chunker Updated
Submitted by rene on Sat, 2007-03-03 09:08.
I just posted a small update to my multi-lingual noun phrase chunker (MuNPEx) for GATE.
Changes in v0.2 are:
o preliminary Spanish support (see below)
o renamed from "NPE" to "MuNPEx" in a blatant attempt at Googlewhacking
o small cleanups
o now comes with a sample NE transducer for number markup to improve chunking
Supported languages are now English, German, French, and Spanish (beta).
