Romani corpus

This is the home page of the Russian Romani Corpus, which currently contains approximately 720,000 tokens. At the moment, the corpus contains only texts published in the USSR in the 1920s and 1930s. It includes all original texts (both fiction and press), as well as a handful of translated texts (fiction, non-fiction and press). The corpus is still under development. Users can currently make lexical and grammatical queries in the corpus. In the future, we plan to enlarge the corpus by including the rest of the texts published in the 1920s and 1930s, as well as by adding texts collected during fieldwork conducted by the authors of the corpus. We also plan to improve the quality of the morphological annotation, enlarge our grammatical dictionary and perform disambiguation.

The corpus is being developed by K. Kozhanov (Moscow), S. Oskolskaya (St. Petersburg), M. Oslon (Moscow), A. Tenser (Helsinki) and T. Arkhangelskiy (Moscow).

The corpus was annotated using UniParser, an automated morphological annotation tool developed by T. Arkhangelskiy. The corpus uses the search platform of the Eastern Armenian National Corpus (EANC). You can read about making search queries on the EANC help page.

The corpus has been developed with the support of RFBR grant no. mol_a 14-06-31038, “Development of a Russian Romani Corpus”, headed by K. Kozhanov in 2014–2015.