Corpus of Russian Student Texts

About

Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students of different universities. Currently, the size of the corpus is about 3.1 million tokens. The texts are annotated in several ways (metatextual annotation, morphological annotation and error markup), which enables users to perform many types of search in the corpus.

To learn more about how to use the corpus and the search request language, please consult the help page.

The Corpus of Russian Student Texts is a comprehensive reference system intended for researchers, teachers, students, and anyone interested in modern Russian grammar and in current processes in the lexis, morphology and syntax of modern Russian.

Types of texts

The texts represented in the corpus were written by students of bachelor's and master's programmes at different universities. The main types of texts are as follows: course papers, term papers, bachelor's and master's theses, essays, abstracts, reports, summaries, autobiographies, and paragraphs (small texts for different purposes: home assignments, answers to various questions, etc.).

The corpus provides information about the academic year, semester or module when each text was written and the field of knowledge to which the text relates. The field of knowledge may not coincide with the student's academic major. For instance, if a linguist writes an essay on history, we indicate his/her major (linguistics) and the subject of the work (history).

Authors of texts

The corpus includes texts written by students of the following academic majors: economics, sociology, political science, law, psychology, journalism, linguistics, history, philology, logistics, mathematics, philosophy. Generally, the corpus provides information about the gender and age of the author, and also about the academic year (1st year bachelor, 2nd year master, etc.). Some texts also carry information about the region where the author lived until the age of 18 and whether the author is a bilingual speaker.

Annotation

As in any other corpus intended for linguistic and sociological research, the texts of the Corpus of Russian Student Texts are supplemented with linguistic and metalinguistic information, or markup. Linguistic markup includes morphological markup and error markup. Currently, the texts of the corpus contain metatextual markup and morphological markup.

Below is more information about the layers of markup.

Metatextual markup

Metatextual markup contains metalinguistic information about the text and its author.

Author-meta

Linguist, economist, sociologist, law student, etc.
The student's academic major is always indicated: this is an important part of sociolinguistic information, which is of value for methodological research.

M, F
Information about the gender of a student is given in the majority of cases; however, sometimes it cannot be retrieved. For instance, if a student has signed his/her paper Jakovenko M., the field «gender» is not filled in. After the metatextual data is collected, the name of the author is deleted.

21, 22, etc.
The age of the author is indicated where it is known. While creating the corpus, we did not intend to collect accurate information about age. Most Russian students enter university right after leaving school; therefore, the approximate age of a student is easily inferred from the year of study (see Text-meta).

Ukrainian, Armenian, German, etc.
This includes information about whether the author has a mother tongue other than Russian. Generally, this is a language spoken in the family or the language of the community where the author lived before entering university.

Ukraine, Armenia, Germany, etc.
This includes information about the region where the author lived before he/she entered university. This information matters for sociolinguistic research. The region is indicated if known.

Text-meta

2003-2004, 2011-2012, etc.
Generally, metatextual markup has information about the academic year when the text was written. This information is needed for methodological research.

1st year spec, 2nd year bach, 1st year mas, etc.
This is information about the year of study during which the text was created. For instance, 1st year spec means the first of five years of the specialist program, 2nd year bach means the second of four years of the bachelor's program, etc.

1 semester, 2 module, etc.
This is information about the semester, or the module (for students on the module system), when the text was created.

Course paper, bachelor's or master's thesis, abstract, essay, report, summary, autobiography, paragraph, etc.
The type of an academic text is always indicated. Paragraphs are small texts for different purposes: home assignments, answers to various questions, etc.

Linguistics, economics, sociology, law, etc.
While studying at university, a student writes texts related to his/her academic major as well as texts in other academic fields. Generally, linguists write course papers and theses on linguistics and economists write them on economics; however, even for term papers there may be exceptions. As for the rest of the texts, the field of knowledge to which they relate does not always coincide with the student's major. For instance, a linguist may write a paper on history, an economist may write one on sociology, etc.
Many short texts were written in the Academic Writing or Speech Culture courses and may not be attributed to any particular field of knowledge. Such texts are marked with the label «various topics».
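The author and text fields described above can be pictured as a simple record. The sketch below is illustrative only: the names AuthorMeta, TextMeta and parse_study_year are hypothetical and do not reflect the corpus's actual storage format or search interface. It also shows one possible way to unpack a year-of-study label such as "2nd year bach".

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Program abbreviations from the year-of-study labels, with the durations
# stated in the text (specialist: five years, bachelor's: four years).
# The master's duration is not stated there, so it is left as None.
PROGRAMS = {
    "spec": ("specialist", 5),
    "bach": ("bachelor", 4),
    "mas": ("master", None),
}


@dataclass
class AuthorMeta:
    """Hypothetical record for the author-related fields."""
    major: str                                 # always indicated
    gender: Optional[str] = None               # "M" or "F"; None if not retrievable
    age: Optional[int] = None                  # filled in only when known
    other_mother_tongue: Optional[str] = None  # e.g. "Ukrainian"
    region: Optional[str] = None               # e.g. "Armenia"; filled in if known


@dataclass
class TextMeta:
    """Hypothetical record for the text-related fields."""
    text_type: str                        # always indicated, e.g. "essay"
    field: str                            # e.g. "history" or "various topics"
    academic_year: Optional[str] = None   # e.g. "2011-2012"
    study_year: Optional[str] = None      # e.g. "2nd year bach"
    term: Optional[str] = None            # e.g. "1 semester" or "2 module"


def parse_study_year(label: str) -> Tuple[int, str]:
    """Split a label such as '2nd year bach' into (2, 'bachelor')."""
    match = re.fullmatch(r"(\d+)(?:st|nd|rd|th) year (spec|bach|mas)", label.strip())
    if match is None:
        raise ValueError(f"unrecognised year-of-study label: {label!r}")
    year = int(match.group(1))
    program, length = PROGRAMS[match.group(2)]
    if year < 1 or (length is not None and year > length):
        raise ValueError(f"year {year} is out of range for the {program} program")
    return year, program


# A linguist writing an essay on history: major and field differ.
author = AuthorMeta(major="linguistics", gender="F")
text = TextMeta(text_type="essay", field="history", study_year="2nd year bach")
print(parse_study_year(text.study_year))
```

Keeping the author's major and the text's field as separate attributes mirrors the distinction drawn above between what a student majors in and what a particular text is about.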

Morphological markup

Morphological markup contains information about morphological forms and meanings (parts of speech, gender, number, case, mood, etc.). Morphological markup is required for searching words, word forms and constructions.

For the morphological markup of the Corpus of Russian Student Texts we have used the same tag set that was integrated into the Russian National Corpus (RNC). The full list of tags is available on the RNC website. The morphological markup is carried out automatically with the help of the morphological analyzer MYSTEM.

Error markup

The multilayered system of error markup is currently being tested, after which the texts will be annotated manually. The error markup system includes:

(a) linguistic error type (errors in lexis, grammar or discourse);
(b) cause of an error.

Linguistic error types presented in (a) are divided into subtypes as follows:

lexical (lexical and word formation errors),
stylistic (official style, colloquial style, other stylistic deviance),
grammatical (agreement, government, coordination errors, errors in comparative constructions, in adverbial participial phrases, in complex sentences – in sentential arguments and relative clauses; errors in pronouns including anaphora and reflexives; reciprocals, nominal, verbal and adjectival inflection – errors in number, person, gender, case, verbal and adjectival inflectional forms),
discursive (parcellation, topicalization, wrong use of link-words, mixing of direct and indirect speech, wrong word order, logical errors).

Among the causes of mistakes presented in (b) one can point out contaminations of constructions and typos as the most transparent and the most common causes. In the future, the process of annotating the corpus texts may reveal other types of causes of mistakes.

It is important to stress that the suggested error classification is multilayered:

Firstly, for each linguistic error its linguistic type and, if possible, cause, have to be defined.
Secondly, an error may be attributed to different linguistic types. For instance, an error may be both lexical and stylistic at the same time.
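The two properties above (one or more linguistic types per error, plus an optional cause) can be sketched as a small data structure. This is an illustration under stated assumptions, not the corpus's actual annotation format: the class name ErrorAnnotation, its fields and the label strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Top-level linguistic error types named in the classification above.
ERROR_TYPES = frozenset({"lexical", "stylistic", "grammatical", "discursive"})


@dataclass(frozen=True)
class ErrorAnnotation:
    """Hypothetical record for one annotated error."""
    span: str                    # the erroneous fragment of the text
    types: FrozenSet[str]        # one or more linguistic error types
    cause: Optional[str] = None  # e.g. "typo"; None when no cause is identified

    def __post_init__(self):
        # Every error must receive at least one linguistic type.
        if not self.types:
            raise ValueError("at least one linguistic error type is required")
        unknown = set(self.types) - ERROR_TYPES
        if unknown:
            raise ValueError(f"unknown error type(s): {sorted(unknown)}")
        # The cause is deliberately not validated against a fixed list,
        # because further causes may emerge during annotation.


# An error that is both lexical and stylistic, with no cause identified.
annotation = ErrorAnnotation(
    span="<erroneous fragment>",
    types=frozenset({"lexical", "stylistic"}),
)
print(sorted(annotation.types), annotation.cause)
```

Storing the types as a set rather than a single value is what lets one error carry several linguistic labels at once.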

Corpus users

The Corpus of Russian Student Texts is intended for linguistic researchers, teachers of the literary Russian language and of Russian academic writing, and students.

To researchers, this resource is useful because it helps, on the one hand, to systematize the peculiarities of academic texts of novice authors. On the other hand, the Corpus of Russian Student Texts is an invaluable resource for the study of errors made by speakers of Russian, as well as for investigation of grammar, lexis, stylistics, sociolinguistics, and psycholinguistics.

To teachers of the literary Russian language and of Russian academic writing, the resource helps to reveal the main problems that a student faces while trying to master a new register of his/her first language. The methodological value of the resource is that teachers can now develop various exercises and linguistic training software on its basis.

To students of different majors, the resource offers an opportunity to familiarize themselves with typical language errors and their classification, as well as an opportunity to practise correcting them. Such exercises are meant to help students make fewer such errors in their own writing.

Team

The corpus was created by the Linguistic Laboratory of Corpus Technologies of National Research University Higher School of Economics.

Chiefs
Natalya Aleksandrovna Zevahina
Ekaterina Vladimirovna Rahilina
Yuliya Mihaylovna Kuvshinskaya

Developers
Timofey Aleksandrovich Arhangelskiy
Evgeniy Glazunov
Elmira Gayazovna Mustakimova

Team:

HSE School of Linguistics Staff
(who contributed to the text markup, research and development of CoRST)

Yana Emilevna Ahapkina
Svetlana Dzhakupova
Olesya Kisselev
Anna Iosifovna Levinzon
Aleksandr Borisovich Letuchiy
Anna Dmitrievna Plisetskaya

HSE School of Linguistics Students
(who contributed to the text markup, research and development of CoRST)

Vlada Aleksandrova
Olga Astapenko
Yuliya Badryizlova
Margarita Bobrova
Tatyana Bolgina
Mariya Bocharova
Anna Vishenkova
Ekaterina Voloshina
Evgeniy Glazunov
Nikita Gordeev
Nadezhda Grigoreva
Dasha Demidova
Andrey Zaytsev
Mariya Zarifyan
Veronika Zyikova
Violetta Ivanova
Sergey Kolesnikov
Anna Kondrateva
Anastasiya Kostyanitsyina
Polina Kudryavtseva
Mariya Kuznetsova
Olga Kultepina
Alina Ladyigina
Nastya Makarovich
Darya Maksimova
Ekaterina Matyuhina
Ekaterina Mozhelyanskaya
Nikol Morozova
Mariya Myislina
Mariya Ob'edkova
Olga Osinovskaya
Vasilisa Osipova
Irina Panteleeva
Alina Prohorova
Svetlana Puzhaeva
Olga Ramzaytseva
Kseniya Romanova
Sasha Salamatina
Asya Simonyan
Anna Simonyan
Polina Sonina
Ekaterina Taktasheva
Mariya Teryohina
Mariya Fedorova
Anna Tsyizova
Natalya Chukicheva
Alina Shaymardanova
Yana Shevchenko

Publications

Projects

2021

Aleksandra Serova
Variability in Argument Structure

2017

Nadezhda Grigor'eva
Errors in Voice and Argument Alternations: Corpus and Experimental Study

Anastasiya Makarovich
Analysis of the Errors in Government in Contemporary Russian

Nadezhda Grigor'eva
Non-standard realisations of sya-constructions in Russian

Anastasiya Makarovich
Non-standard realisations of argument structures in Russian

2016

Olga Ramzaytseva
Automatic parsing of errors in comparative constructions in the Corpus of Russian Student Texts

Svetlana Puzhaeva
Faktoryi, vliyayuschie na narushenie koreferentsii deeprichastnyih oborotov [Factors affecting coreference violations in adverbial participial phrases]

2015

Anna Vishenkova
Comparative constructions with double marker of comparison in non-standard Russian: Corpus study

Svetlana Puzhaeva
Construction Blending in the Corpus of Russian Student Texts

2014

Natalya Zevahina, Svetlana Dzhakupova
Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers

Natalya Zevahina, Svetlana Dzhakupova, Anna Vishenkova
Meta-linguistic comparative constructions in Russian and cross-linguistically (2014 to present)

Publications

2021

Kuvshinskaya Yu. M., Aksenova A. A. Punktuatsiya v predlozheniyah s soyuzom "to est" kak otrazhenie razvitiya diskursivnyih funktsiy // V kn.: Yazyik i metod: Russkiy yazyik v lingvisticheskih issledovaniyah XXI veka Vol. 7. Kraków : Wydawnictwo Uniwersytetu Jagiellońskiego, 2021.

2020

Klimov A., Kopotev M., Zevakhina N., Kilina M., Grillandi A., Elizaveta N., Sidorova A. Assessing novice writing against the Corpus of Academic Texts, in: 14 TaLC - Teaching and Language Corpora Conference, 2020. P. 44-45.

2019

Klimov A., Toldova S., Kopotev M., Zevakhina N., Dmitrieva A. D., Kisselev O., Baranchikova A., Fedorova M. CAT&kittens: a corpus-based text-analytic tool for Russian academic writing, in: SlaviCorp 2018 Book of Abstracts. Charles University, 2018. P. 22-25.

Kuvshinskaya Yu. M. Sovremennyie tendentsii v distributivnom upotreblenii russkih suschestvitelnyih (po korpusnyim dannyim) // V kn.: Russkaya grammatika: aktivnyie protsessyi v yazyike i rechi. Yar. : RIO YaGPU, 2019. S. 405-415.

Kuvshinskaya Y. M. Russian indefinite pronoun kakoj-libo: non-standard usage and changes in the semantics // Jazykovedny Casopis. 2019. No. 2. P. 225-233.

2018

Grigoreva N. A., Zevahina N. A., Kuvshinskaya Yu. M. Nestandartnyie strategii upotrebleniya vozvratnyih glagolov: prichinyi i yazyikovyie mehanizmyi // V kn.: Fortunatovskie chteniya v Karelii. Sbornik dokladov Mezhdunarodnoy nauchnoy konferentsii (10-12 sentyabrya 2018 goda, Petrozavodsk) Ch. 1. Petrozavodsk : Izdatelstvo PetrGU, 2018. S. 72-74.

2016

Kuvshinskaya Yu. M. Studencheskie rechevyie oshibki i aktualnyie tendentsii v razvitii russkogo yazyika // V kn.: Leo philologiae. Festshrift v chest 70-letiya Lva Iosifovicha Soboleva / Pod obsch. red.: A. A. Bonch-Osmolovskaya, M. A. Kucherskaya, K. M. Polivanov, A. A. Zubov, Mayofis Mariya Lvovna. M. : [b.i.], 2016. S. 157-178.

2015

Puzhaeva S. Yu., Zevahina N. A., Dzhakupova S. S. Kontaminatsiya konstruktsiy v rechi nestandartnyih russkogovoryaschih na materiale korpusa russkih uchebnyih tekstov // V kn.: Trudyi Mezhdunarodnoy nauchnoy konferentsii "Korpusnaya lingvistika-2015". SPb. : Izdatelstvo SPbGU, 2015. S. 390-397.

Zevakhina N., Dzhakupova S. Corpus of Russian student texts: design and prospects, in: Trudyi 21-y Mezhdunarodnoy konferentsii po kompyuternoy lingvistike "Dialog". Izd-vo RGGU, 2015.

Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova. (2015). Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. In Proceedings of the 6th International Conference “Corpus Linguistics-2015”, 390-397. Saint-Petersburg. (in Russian)

Zevakhina, Natalia and Svetlana Dzhakupova. Corpus of Russian student texts: design and prospects. In Proceedings of the 21st International Conference on Computational Linguistics “Dialog”. Moscow, 2015.

Zevakhina N., Dzhakupova S. Russian metalinguistic comparatives: a functional perspective. In Working papers by NRU HSE. Series WP BRP "Linguistics". 2015. No. 39.

2014

Svetlana Dzhakupova and Natalia Zevakhina. Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers // Ahti Nikunlassi, Ekaterina Protassova (eds.) Slavica Helsingiensia 45 Instrumentarium of Linguistics: Language errors and multilingualism. Helsinki: University of Helsinki, 2014, pp. 35—49. (in Russian)

Presentations

2021

Kuvshinskaya Yulia
Rabota nad oshibkami: oshibki v uchebnyih tekstah kak instrument obucheniya i predmet studencheskih issledovaniy/ International conference "Yazyik i metodyi ego opisaniya", RSUH
28 January 2021, Russian Federation, Moscow

Kuvshinskaya Y.M., Aksenova A.A.
Sovremennyie tendentsii v upotreblenii konnektora "to est": korpusnoe issledovanie / V Mezhdunarodnyiy simpozium "Russian Grammar: System – Language Usage – Language Variation"
22–24 September 2021, Germany, Potsdam

2020

Kuvshinskaya Yulia and Natalia Zevakhina
Non-standard participles in Russian: a corpus study. The 15th Annual Slavic Linguistics Society Meeting. Indiana University
4-6 September 2020, US (Online presentation due to Covid-19)

2019

Kuvshinskaya Yulia
Mestoimenie KAKOY-LIBO: nestandartnyie upotrebleniya i sovremennyiy uzus (po materialam Korpusa russkih uchebnyih tekstov i NKRYa)// VI Mezhdunarodnaya nauchnaya konferentsiya "Kultura russkoy rechi" (Grotovskie chteniya)
21–23 February 2019, Russian Federation, Moscow

Kuvshinskaya Yulia
Sovremennyie tendentsii v distributivnom upotreblenii russkih suschestvitelnyih (po korpusnyim dannyim)// Russkaya grammatika: aktivnyie protsessyi v yazyike i rechi
17–19 May 2019, Russian Federation, Yaroslavl

2018

Makarovich Anastasiya, Natalia Zevakhina and Yulia Kuvshinskaya
Deviations in argument structures of Russian predicates: Corpus evidence. The 51st Annual Meeting of the Societas Linguistica Europaea.
September 2018, Tallinn

Kuvshinskaya Yulia
Non-standard usage of participles in Russian written texts: a corpus study // International Conference on Russian Studies.
20–22 June 2018, Spain, Barcelona

Kuvshinskaya Yulia, Nadezhda Grigor’eva and Natalia Zevakhina
Non-standard strategies for using sya-verbs: facts and explanations. International conference in honour of Filipp Fortunatov (Paper accepted for presentation, in Russian)
September 2018, Russian Federation, Petrozavodsk

Makarovich Anastasiya, Natalia Zevakhina and Yulia Kuvshinskaya
Deviations in argument structures of Russian predicates: Corpus evidence. SLE Conference
September 2018, Tallinn

Kuvshinskaya Yulia and Natalia Zevakhina
Non-standard usage of participles in Russian written texts: a corpus study. International Conference on Russian Studies (Paper accepted for presentation, in Russian)
June 2018, Spain, Barcelona.

2016

Zevakhina Natalia, Svetlana Dzhakupova, Svetlana Puzhaeva, Elmira Mustakimova, Olga Ramzaytseva
Research on the basis of the Corpus of Russian Student Texts. Conference “Grammatical Processes in synchrony and diachrony”. Institute for the Russian Language
31 May 2016, Russian Federation, Moscow.

2015

Anna Vishenkova, Natalia Zevakhina, Svetlana Dzhakupova
Morphosyntax and Semantics of Russian Metalinguistic Comparatives. 14th annual conference of the Slavic Cognitive Linguistics Association "Crossing boundaries: taking a cognitive scientific perspective on Slavic languages and linguistics". Universities of Sheffield and Oxford
9–13 December 2015, UK

Dzhakupova, Svetlana, Elmira Mustakimova, and Natalia Zevakhina
Corpus of Russian Student Texts: goals, annotation, and perspectives. Corpus Linguistics 2015 conference
21–24 July 2015, UK, Lancaster

Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova
Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. The 6th International Conference “Corpus Linguistics-2015”
22–26 June 2015, Russian Federation, Saint-Petersburg

Zevakhina, Natalia and Svetlana Dzhakupova
Corpus of Russian student texts: design and prospects. The 21st International Conference on Computational Linguistics “Dialog”
27–30 May 2015, Russian Federation, Moscow

2014

Natalia Zevakhina and Svetlana Dzhakupova
Russian in the mirror of the Corpus of Russian Student Texts. The Institute of Slavic Studies of the Russian Academy of Sciences
16 December 2014, Russian Federation, Moscow

Natalia Zevakhina and Svetlana Dzhakupova
Russian Metalinguistic Comparatives: Towards the Typology // The 11th Conference on Typology and Grammar for Young Researchers. Institute for Linguistic Studies, Russian Academy of Sciences
27–29 November 2014, Russian Federation, Saint-Petersburg

License

The contents of this site are licensed under a Creative Commons Attribution 4.0 International License.
