Corpora and Tools
Corpora of the Russian Federation
-
Buryat Corpus
The Buryat corpus is intended for everyone who is interested in the language and culture of the Buryat people: specialists in Mongolian studies, linguists, scholars, teachers, as well as writers, journalists, and librarians. The corpus includes over 2,200,000 words of written text.
-
Kalmyk Corpus
The Kalmyk Corpus is for everyone who is interested in Kalmyk language and culture: linguists, specialists in Mongolian and/or Kalmyk studies, teachers, and editors of Kalmyk dictionaries and grammars. The corpus is also suitable for typological research in linguistics. The corpus contains 800,000 words of written text.
-
Tatar National Corpus “Tugan tel”
The Tatar corpus “Tugan tel” consists of texts that represent modern standard Tatar. It contains more than 26,000,000 words of written text.
-
Udmurt corpus
The Udmurt corpus currently contains around 7,300,000 tokens. The texts of the corpus have been automatically annotated with a morphological analyser. Approximately 88% of tokens are grammatically annotated.
-
Bashkir Poetic Corpus
This corpus contains over 1,800,000 tokens. There are approximately 450,000 lines of verse and more than 17,000 poems written by 101 poets. The works in the corpus consist of poems by Bashkir poets of the 20th and early 21st centuries.
Corpora of Russian
-
Russian Learner Corpus
The Russian Learner Corpus (RLC) is a collection of texts produced by two categories of non-standard speakers of Russian: learners of Russian as a foreign language and heritage speakers of Russian with different dominant languages.
-
Corpus of Russian Student Texts
The Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students of different universities. Currently, the size of the corpus is about 3,100,000 tokens. The texts have several types of annotation (metatextual, morphological, and error annotation) that facilitate searching in the corpus.
Corpora of other languages
-
Albanian corpus
The Albanian National Corpus currently consists of around 16,700,000 tokens. These fiction and journalistic texts are morphologically annotated and are available to users.
-
Corpus of Modern Greek
The annotated corpus with a built-in search engine currently includes 35,700,000 tokens.
-
The Corpus of Modern Yiddish
The Corpus of Modern Yiddish (CMY) is a comprehensive linguistic database of annotated texts in Yiddish comprising 4,000,000 tokens.
-
Annotated Corpus of Luwian Texts
A pilot version of the Annotated Corpus of Luwian Texts (ACLT). It contains Iron Age Luwian hieroglyphic texts as well as Bronze Age cuneiform texts.
-
Almaty corpus of Kazakh
The corpus size is currently around 2,000,000 tokens. The texts have been automatically annotated with a morphological analyser. Approximately 86% of word forms in the corpus have grammatical annotation.
-
Mongolian Corpus
The modern Mongolian corpus contains 1,160,000 tokens. The corpus is intended for researchers of Mongolian and for typologists.
-
Amharic Corpus
The corpus size so far is around 23,000,000 tokens. The texts of the corpus have been automatically annotated with a part-of-speech tagger. The corpus is disambiguated, i.e. each token is annotated with only one appropriate grammatical form.
-
HSE Thai Corpus
A corpus of modern texts written in the Thai language. It contains 50,000,000 . The texts were collected from various Thai websites (mostly news websites).
-
Romani corpus
The corpus contains approximately 600,000 tokens. The texts in the corpus were published in the USSR in the 1920s and 1930s.
Tools for Russian
-
United dictionary of synonyms
A large synonym database for Russian, aggregated from five dictionaries, with advanced search capabilities.
-
United dictionary of antonyms
A large antonym database for Russian, aggregated from four dictionaries.
-
MyStem+
A collection of morphological taggers for Russian. Every tagger was trained on the disambiguated part of the Russian National Corpus (ca. 6 million tokens). Trained models and online demo versions are available on the website. Work on the project is in progress.
-
Syntactic parser
A parser with syntactic tree output in CoNLL format. The parser can either be run online on the website or downloaded.
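As a rough illustration of how such output could be consumed, the sketch below reads CoNLL-style text into sentences and lists the dependency edges. The description above does not say which CoNLL variant the parser emits, so the 10-column, tab-separated CoNLL-X layout, the function names, and the sample tags are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumed column layout (CoNLL-X style); the parser's real output may differ.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-style text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # skip comment lines, if any
            continue
        current.append(dict(zip(COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (head form, relation, dependent form); head id 0 maps to ROOT."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(forms.get(tok["head"], "ROOT"), tok["deprel"], tok["form"])
            for tok in sentence]


# Hypothetical sample; the tags and relation labels are invented for illustration.
SAMPLE = (
    "1\tМама\tмама\tS\tS\t_\t2\tsubj\t_\t_\n"
    "2\tмыла\tмыть\tV\tV\t_\t0\troot\t_\t_\n"
    "3\tраму\tрама\tS\tS\t_\t2\tobj\t_\t_\n"
)

for edge in dependency_edges(read_conll(SAMPLE)[0]):
    print(edge)
```

If the downloaded parser emits a different column set, only the COLUMNS list should need adjusting.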
-
Old Russian Dictionary of the 11th-17th centuries
A database with search across the volumes of the Old Russian Dictionary of the 11th-17th centuries.
-
Transliterator for old Russian orthography
A web application that converts text in old Russian orthography into modern orthography.
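The kind of conversion involved can be sketched as follows. This is a minimal sketch, not the application's actual rule set: it assumes the input uses pre-1918 spelling and handles only four replaced letters and the word-final hard sign. The name modernize and the mapping table are hypothetical.

```python
import re

# Assumed letter mapping based on the 1918 spelling reform.
# Unicode escapes are used so the code points are unambiguous.
LETTER_MAP = str.maketrans({
    "\u0463": "\u0435",  # yat (lowercase)       -> е
    "\u0462": "\u0415",  # yat (uppercase)       -> Е
    "\u0456": "\u0438",  # decimal i (lowercase) -> и
    "\u0406": "\u0418",  # decimal i (uppercase) -> И
    "\u0473": "\u0444",  # fita (lowercase)      -> ф
    "\u0472": "\u0424",  # fita (uppercase)      -> Ф
    "\u0475": "\u0438",  # izhitsa (lowercase)   -> и
    "\u0474": "\u0418",  # izhitsa (uppercase)   -> И
})

# A hard sign at the very end of a word; hard signs inside a word are kept.
FINAL_HARD_SIGN = re.compile(r"(?<=\w)[\u044a\u042a]\b")


def modernize(text: str) -> str:
    """Drop word-final hard signs, then replace the obsolete letters."""
    text = FINAL_HARD_SIGN.sub("", text)
    return text.translate(LETTER_MAP)


print(modernize("л\u0463съ"))     # word with yat and a final hard sign
print(modernize("Росс\u0456я"))   # word with decimal i
```

A full converter would also need rules for old adjective endings and other morphological spellings, which this sketch leaves out.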
-
Sentinet
A database of Russian adjectives with sentiment markup.
-
Sentinet: The Game
The game was created to gather new data for Sentinet.
-
Metaphoric verb meanings
Annotated examples of metaphoric and non-metaphoric uses of 10 Russian verbs. The data can be viewed online or downloaded.
-
Russian Chunker
Text chunking for the Russian language. It provides a partial syntactic structure of a sentence. Chunking is used as a preliminary step towards full parsing or as a feature in complex machine learning systems.
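To make the idea of a partial syntactic structure concrete, here is a small rule-based sketch that groups part-of-speech-tagged tokens into noun-phrase chunks. It is not the method used by the chunker described above. The tag names (ADJ, NOUN, VERB), the single ADJ* NOUN+ rule, and the function name are assumptions chosen for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set here is hypothetical


def chunk_noun_phrases(tagged: List[Token]) -> List[List[str]]:
    """Group runs matching ADJ* NOUN+ into noun-phrase chunks."""
    chunks: List[List[str]] = []
    current: List[str] = []
    seen_noun = False
    for word, tag in tagged:
        if tag == "ADJ" and not seen_noun:
            current.append(word)       # adjective before the noun
        elif tag == "NOUN":
            current.append(word)
            seen_noun = True
        else:
            if seen_noun:              # close a finished chunk
                chunks.append(current)
            current, seen_noun = [], False
            if tag == "ADJ":           # adjective after a noun starts a new chunk
                current.append(word)
    if seen_noun:
        chunks.append(current)
    return chunks


sentence = [("большая", "ADJ"), ("собака", "NOUN"), ("увидела", "VERB"),
            ("маленькую", "ADJ"), ("кошку", "NOUN")]
print(chunk_noun_phrases(sentence))
```

The verb is left outside any chunk, which is what makes the analysis partial: only the phrases covered by a rule are bracketed, and no full tree is built.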
-
Heritage texts annotation
Automated annotation of errors in texts written by heritage speakers of Russian.
-
Style training system
The system provides students with tools to improve their self-editing skills for academic writing. All exercises are based on real texts.
-
Facebook of the Past
The project contains letters of Russian writers. It supports searching by a particular date as well as by a person's role in history.
Tools for other languages
-
Minority languages of Russia
A text collection of minority languages of Russia with some statistical data.
-
Database of animal sound verbs
A typological database of animal sound verbs. It contains data from over 20 languages from various language families.
-
Database of synonyms
The aim of this project is to compare theoretical and computational approaches to research on synonyms and semantic fields. In this database, lexicographic and statistical vector-based approaches are used to describe these phenomena.
-
Yiddish transliterator
The converter has two main functions: Yiddish text normalisation and transliteration from Hebrew script to Latin script.
-
Typological Database of Qualitative Features
This tool was created for lexical-typology research. The database contains information on the lexicalisation of adjective semantic fields in different languages (like 'sharp' -- 'blunt', 'empty' -- 'full', 'thick' -- 'thin', etc.). On the practical side, the database may serve as a multilingual dictionary which provides an extensive description of the difference between individual words. On the theoretical side, the database allows generalisation of cross-linguistic patterns of polysemy and semantic change.
-
Vocabulary of Adyghe idioms in Russia
This atlas contains a map that shows all Adyghe-speaking villages and a list of Adyghe idioms spread across the territory of the Russian Federation.
-
Numerals of the American languages
A set of maps with the distribution of numeral features in the American linguistic area. These features are base, limit and adding markers.
-
Diacritics restorer
This web application helps to restore missing diacritics in texts in regional and minority languages of Russia.
-
Sun Tzu's Art of War
Sun Tzu's "The Art of War" is one of the most famous ancient Chinese treatises about the art of strategy and politics. On the website, you can find an electronic version of the text (in Chinese) with notes and translation.