Using the Middle High German RNNTagger in WebLicht
Part-of-Speech tagging and lemmatization of Middle High German
texts using WebLicht
- Visit https://weblicht.sfs.uni-tuebingen.de/weblicht/ in your favourite browser
- Log into WebLicht via your university account
- Click on the "Start" button
- Click on the "Browse" button
- Choose the file you want to annotate and click on "open". (The file must be encoded in UTF-8.)
- Choose "German" as the "Language"
- Click on the "OK" button
- Click on the "Advanced Mode" button
- Choose "SfS: To TCF Converter" with a double click
- Tick the box "development" by clicking on it
- Choose "SfS: RNN Tagger" with a double click
- Click on the button "Run Tools" below. (The processing will take a few seconds.)
- The processing pipeline, shown at the bottom of the web page, now
consists of the uploaded text file, the TCF converter, and the RNNTagger.
- Clicking on the tree symbol between "↓" and "i" below the
RNNTagger box of the pipeline will open a new browser tab where the
results are displayed. (You have to allow WebLicht to open pop-up
windows at this point.)
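The UTF-8 requirement for the uploaded file can be checked before the upload. The following is a minimal sketch, not part of WebLicht: the function names are hypothetical, and the fallback encoding "latin-1" is only an assumption about what a non-UTF-8 file might use. Check the actual encoding of your file before relying on it.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def ensure_utf8(src: Path, dst: Path, fallback_encoding: str = "latin-1") -> bool:
    """Write a UTF-8 copy of src to dst.

    Returns True if src was already valid UTF-8. Otherwise the content is
    decoded with fallback_encoding (an assumption, adjust as needed)
    and re-encoded as UTF-8.
    """
    data = src.read_bytes()
    already_utf8 = is_utf8(data)
    text = data.decode("utf-8" if already_utf8 else fallback_encoding)
    dst.write_bytes(text.encode("utf-8"))
    return already_utf8


# Example call (file names are placeholders):
# ensure_utf8(Path("mytext.txt"), Path("mytext.utf8.txt"))
```

If the function returns False, inspect the converted file by eye before uploading it, since a wrong fallback encoding produces garbled special characters rather than an error.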
On the result page you will find:
- a query window for complex queries using TIGERSearch
- a less relevant window "Visualization"
- a window "Table View" showing the analysis of a sentence in multi-column format
- a window "Sentence in context" where you can select a sentence
to be displayed above.
- The "Table View" window includes a "Save" button with the option
"Save tokens as CSV", which stores the results in a CSV file. You can
use LibreOffice to convert the comma-separated file into a
tab-separated one: open the result file in LibreOffice, choose
"Save as" in the "File" menu, check the box "Edit filter settings",
and store the file in "Text CSV" format. In the next window, uncheck
the box "Fixed column width", select "Tab" as the "field separator",
and click on the "OK" button.
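The conversion from comma-separated to tab-separated format can also be scripted instead of going through LibreOffice. The sketch below uses Python's standard csv module and assumes that the exported file is UTF-8 encoded and uses ordinary comma separation with double-quote quoting. The function name and the file names are placeholders.

```python
import csv


def csv_to_tsv(src_path: str, dst_path: str) -> None:
    """Copy a comma-separated file to a tab-separated file, row by row."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)  # default dialect: comma-separated
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerows(reader)


# Example call (file names are placeholders):
# csv_to_tsv("result.csv", "result.tsv")
```

Using the csv module rather than a plain search-and-replace keeps tokens that themselves contain a comma intact.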
Sentence Boundaries
The detection of sentence boundaries does not work very well for
Middle High German texts because they often lack unambiguous
sentence-final punctuation.
Lemmatization
The lemmatizer was trained on the Middle High German Reference
Corpus (ReM) and follows the same conventions: the basis for the
lemmatization is Early Middle High German. Unlike the well-known
Lexer list, the lemmas show neither final-obstruent devoicing
(Auslautverhärtung) nor degemination (Geminatenkürzung). The
"Umlaut"-e is represented as "è".
Please send questions, comments, suggestions and bug reports to Helmut
Schmid at LastName@cis.lmu.de.