GeMTeX - German Medical Text Corpus

The main goal of the GeMTeX project is to create a large annotated corpus of German medical texts from routine patient care. Documents from prospectively consenting patients are to be extracted from the electronic health records (EPA) of six university hospitals. In a concerted effort, annotated text corpora will be generated and provided with deep annotations in multiple dimensions. After anonymization, these documents can be shared, creating new resources for research and development. Advances in clinical Natural Language Processing (NLP) depend crucially on specially trained language models, which in turn require authentic clinical documents. The GeMTeX joint project thus addresses two major hurdles that have so far hindered the development of clinical language models: the accessibility of data and their annotation. The annotated text documents and the models will be made publicly available via the Central Library for Medicine (ZBMED) and the DFG-funded project NFDI4Health.

Further information

Contact: Dr. Tobias Brix, Sarah Riepenhausen, Physician

Funding Reference Number: 01ZZ2314K