RNNTagger and WebLicht

Using the Middle High German RNNTagger in WebLicht

Part-of-Speech tagging and lemmatization of Middle High German texts using WebLicht

Visit the URL https://weblicht.sfs.uni-tuebingen.de/weblicht/ in your favourite browser
Log into WebLicht via your university account
Click on the "Start" button
Click on the "Browse" button
Choose the file you want to annotate and click on "open". (The file must be encoded in UTF-8.)
Choose "German" as the "Language"
Click on the "OK" button
Click on the "Advanced Mode" button
Choose "SfS: To TCF Converter" with a double click
Set a tick mark in the box "development" by clicking on it
Choose "SfS: RNN Tagger" with a double click
Click on the button "Run Tools" below. (The processing will take a few seconds.)
The processing pipeline consists of the uploaded text file, the TCF converter, and the RNNTagger and is shown at the bottom of the webpage.
Clicking on the tree symbol between "↓" and "i" below the RNNTagger box of the pipeline will open a new browser tab where the results are displayed. (You have to allow WebLicht to open pop-up windows at this point.)
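
The upload step above requires a UTF-8 encoded file. The following is a minimal Python sketch for checking a file before uploading it and, if necessary, writing a UTF-8 copy. It is not part of WebLicht or RNNTagger. The function names and the fallback encoding "latin-1" are assumptions made for illustration; replace the fallback with the encoding your file actually uses.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(src: Path, dst: Path, fallback_encoding: str = "latin-1") -> bool:
    """Write a UTF-8 copy of src to dst.

    Returns True if the file had to be re-encoded, False if it was
    already valid UTF-8. The fallback encoding is an assumption: it is
    only correct if the source file really uses that encoding.
    """
    raw = src.read_bytes()
    if is_utf8(raw):
        dst.write_bytes(raw)
        return False
    # Decode with the assumed legacy encoding and re-encode as UTF-8.
    dst.write_bytes(raw.decode(fallback_encoding).encode("utf-8"))
    return True
```

For example, to_utf8(Path("text.txt"), Path("text-utf8.txt")) writes a copy that can then be selected in the "Browse" dialog. Both file names are placeholders.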

On the result page you will find

a query window for complex queries using TIGERSearch
a window "Visualization", which is less relevant here
a window "Table View" showing the analysis of a sentence in multi-column format
a window "Sentence in context" where you can select a sentence to be displayed above.

The "Table View" window includes a button "Save" where you can select the option "Save tokens as CSV" in order to store the results in a CSV file.

You can use LibreOffice to convert the comma-separated CSV file into a tab-separated file: Open the result file in LibreOffice, choose "Save as" in the pull-down menu "File", check the box "Edit filter settings", and store the file in "Text CSV" format. In the next window, uncheck the box "Fixed column width", select "Tabulator" as "field separator", and click on the "OK" button.
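
As an alternative to the LibreOffice conversion described above, the same conversion can be sketched in a few lines of Python using the standard csv module. This is a sketch under assumptions, not an official tool: it assumes that the exported file is UTF-8 encoded and uses ordinary comma separation with double-quote quoting. The function name and the file names are hypothetical.

```python
import csv


def csv_to_tsv(src_path: str, dst_path: str) -> None:
    """Copy a comma-separated file to a tab-separated file."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)  # default delimiter is the comma
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            writer.writerow(row)
```

A call such as csv_to_tsv("result.csv", "result.tsv") reads the saved export and writes the tab-separated version next to it. Fields that contain a tab character are quoted by the writer, so no columns are merged or split.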

Sentence Boundaries

Sentence boundary detection is unreliable for Middle High German texts because they often lack unambiguous sentence-final punctuation.

Lemmatization

The lemmatizer has been trained on the Middle High German reference corpus ReM and follows the same conventions: The basis for the lemmatization is Early Middle High German. Unlike the well-known Lexer list, the lemmas show neither final-obstruent devoicing (Auslautverhärtung) nor degemination (Geminatenkürzung). The "Umlaut"-e is represented as "è".

Please send questions, comments, suggestions and bug reports to Helmut Schmid at LastName@cis.lmu.de.