Udmurt corpus

The page you are currently viewing is a web interface for the pilot version of the Udmurt corpus. The corpus currently contains about 7.3 million tokens. The texts have been automatically annotated with a morphological analyzer; about 88% of tokens have morphological annotation. There is no disambiguation in the corpus, i.e. each token is annotated with all possible analyses, regardless of its context. The corpus was last updated on March 6, 2016.
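To illustrate what annotation without disambiguation means in practice, here is a minimal sketch in Python. It is not the analyzer used for the corpus: the lexicon, the tag labels and the function names are all hypothetical, and the toy data is English rather than Udmurt. Each token simply receives every analysis listed for its form, and tokens unknown to the lexicon stay unannotated, which is what lowers the coverage figure.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (English forms, invented tags); not corpus data.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["DET"],
    "dog": ["N,sg"],
    "saw": ["N,sg", "V,pst"],  # ambiguous form: both analyses are kept
}

Annotated = List[Tuple[str, List[str]]]


def annotate(tokens: List[str]) -> Annotated:
    """Attach every known analysis to each token, ignoring context."""
    return [(tok, TOY_LEXICON.get(tok.lower(), [])) for tok in tokens]


def coverage(annotated: Annotated) -> float:
    """Share of tokens that received at least one analysis."""
    if not annotated:
        return 0.0
    analysed = sum(1 for _, analyses in annotated if analyses)
    return analysed / len(annotated)


annotated = annotate(["The", "dog", "saw", "Zorp"])
for token, analyses in annotated:
    print(token, analyses)
print("coverage:", coverage(annotated))
```

In this sketch "saw" keeps both of its analyses even though the context suggests a verb, and "Zorp" gets none, so three of the four tokens count as annotated.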

Most tokens in the current version of the corpus (91%) come from newspaper texts written in 2007–2015. The rest of the texts include blogs (6%) and nonfiction (the New Testament, Wikipedia articles and essays, 3%). Eventually we intend to increase the number and diversity of texts and to add fiction to the corpus. Most of the texts are written in standard orthography (Cyrillic script with diacritics), but a handful of texts lack the diacritics. If needed, such texts can be excluded from search with the subcorpus selection tool. To type Cyrillic letters and letters with diacritics, you can use the virtual keyboard (the button at the end of the query textbox).

Participants

Maria Medvedeva
Timofey Arkhangelskiy
(HSE School of Linguistics)

Web interface

This corpus uses the search platform of the Eastern Armenian National Corpus (EANC). You can read about making search queries on the EANC help page.