Ogmios: a scalable NLP platform for annotating large web document collections

While NLP tools are now widely available, their use can be problematic
because of the heterogeneity of their input and output formats, the
varying granularity of the information they provide, the difficulty of
processing large numbers of documents in a reasonable time, and the
difficulty of tuning them to a domain. To address these problems, we propose a
configurable platform combining NLP tools to enrich very large
collections of French and English specialised documents. The platform
is a modular and tunable framework. Each module carries out one
annotation step using existing NLP tools: named entity recognition,
sentence and word segmentation, lemmatisation, POS tagging, term
tagging or parsing. A module can be tuned to a domain by adding
specific resources. Linguistic annotations are recorded in a
stand-off XML format. To manage very large collections of documents,
we focus on the robustness of the annotation process by distributing
it across several machines.
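
As an illustration of the modular, stand-off design described above, the
following minimal Python sketch chains two toy annotation modules and
serialises their output as XML that points into the unmodified text by
character offsets. It is not the Ogmios implementation: the element and
attribute names, the `Annotation` class, and the regular-expression
segmenters are all assumptions made for this example, standing in for
the real NLP tools and the actual XML schema.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    layer: str  # hypothetical layer name, e.g. "sentence" or "token"
    start: int  # character offset into the source text, inclusive
    end: int    # character offset into the source text, exclusive


def segment_sentences(text):
    # Toy module: a sentence runs up to the next '.', '!' or '?'.
    pattern = r"[^.!?\s][^.!?]*[.!?]?"
    return [Annotation("sentence", m.start(), m.end())
            for m in re.finditer(pattern, text)]


def segment_words(text):
    # Toy module: a token is a run of word characters or one punctuation mark.
    return [Annotation("token", m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def run_pipeline(text, modules):
    # Each module is an independent step that only adds annotations;
    # the source text itself is never rewritten.
    annotations = []
    for module in modules:
        annotations.extend(module(text))
    return annotations


def to_standoff_xml(annotations):
    # Stand-off serialisation: annotations refer to the text by offsets
    # instead of being embedded in it. Element names are illustrative only.
    root = ET.Element("annotations")
    for a in annotations:
        ET.SubElement(root, "annotation", layer=a.layer,
                      start=str(a.start), end=str(a.end))
    return ET.tostring(root, encoding="unicode")


text = "Ogmios annotates documents. It scales."
annotations = run_pipeline(text, [segment_sentences, segment_words])
print(to_standoff_xml(annotations))
```

Because every module reads the same text and only appends offset-based
annotations, a step can be swapped or tuned without touching the
others, which is the property the modular design relies on.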

In the ALVIS project (www.alvis.info/alvis), we tested the
scalability of the platform on two collections, 55,329 biomedical web
documents (107 million words) and 47,393 Search Engine News (13
million words), using 20 computers. The collections were
annotated up to the term tagging step in 35 hours and 3
hours, respectively.
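
The figures above can be turned into rough average throughputs. The
short Python sketch below does the arithmetic, assuming that the
reported times are wall-clock times, that all 20 computers were used
for both collections, and that the load was spread evenly; the function
name is introduced here for illustration only.

```python
def words_per_machine_hour(total_words, hours, machines):
    # Average throughput per machine, assuming an even load (an assumption,
    # not something the reported figures establish).
    return total_words / hours / machines


biomedical = words_per_machine_hour(107_000_000, 35, 20)
news = words_per_machine_hour(13_000_000, 3, 20)
print(f"biomedical: {biomedical:,.0f} words per machine-hour")
print(f"news:       {news:,.0f} words per machine-hour")
```

Under these assumptions the averages come to roughly 153,000 and
217,000 words per machine-hour, respectively.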
