CoRST

Corpus of Russian Student Texts


About

Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students of different universities. Currently, the size of the corpus is about 3.1 million tokens. The texts are annotated in several ways (metatextual annotation, morphological annotation and error markup), which enables users to perform many types of search in the corpus.

To learn more about how to use the corpus and the search request language, please consult the help page.

The Corpus of Russian Student Texts is a reference resource intended for researchers, teachers, students, and anyone interested in modern Russian grammar and in current processes in the lexis, morphology and syntax of modern Russian.


Types of texts

The texts represented in the corpus were written by students of bachelor's and master's programmes at different universities. The main types of texts are as follows: course papers, term papers, bachelor's and master's theses, essays, abstracts, reports, summaries, autobiographies, and paragraphs (small texts for different purposes: home assignments, answers to various questions, etc.).

The corpus provides information about the academic year, semester or module when each text was written and the field of knowledge to which the text relates. The field of knowledge may not coincide with the student's academic major. For instance, if a linguist writes an essay on history, we indicate his/her major (linguistics) and the subject of the work (history).


Authors of texts

The corpus includes texts written by students of the following academic majors: economics, sociology, political science, law, psychology, journalism, linguistics, history, philology, logistics, mathematics, philosophy. Generally, the corpus provides information about the gender and age of the author, and also about the academic year (1st year bachelor, 2nd year master, etc.). Some texts also carry information about the region where the author lived until the age of 18 and whether the author is a bilingual speaker.


Annotation

As in any other corpus intended for linguistic and sociological research, the texts of the Corpus of Russian Student Texts are supplemented with linguistic and metalinguistic information, or markup. Linguistic markup includes morphological markup and error markup. Currently, the texts of the corpus contain metatextual markup and morphological markup.

Below is more information about the layers of markup.


Metatextual markup

Metatextual markup contains metalinguistic information about the text and its author.

Author-meta

Linguist, economist, sociologist, law student, etc.
The student's academic major is always indicated: this is an important part of sociolinguistic information, which is of value for methodological research.

M, F
Information about the gender of a student is given in the majority of cases; however, sometimes it cannot be retrieved. For instance, if a student has signed his/her paper Jakovenko M., the field «gender» is left empty. After the metatextual data is collected, the name of the author is deleted.

21, 22, etc.
The age of the author is indicated if necessary. While creating the corpus, we did not intend to collect accurate information about age. Most Russian students enter university right after leaving school; therefore, the approximate age of a student can easily be inferred from the year of study (see Text-meta).

Ukrainian, Armenian, German, etc.
This includes information about whether the author has any other mother tongue, besides Russian. Generally, this is a language which is spoken in the family or a language of the community where the author lived before entering the university.

Ukraine, Armenia, Germany, etc.
This includes information about the region of the author's residency before he/she entered the university. This information matters for sociolinguistic research. The region is indicated if known.




Text-meta

2003-2004, 2011-2012, etc.
Generally, metatextual markup has information about the academic year when the text was written. This information is needed for methodological research.

1st year spec, 2nd year bach, 1st year mas, etc.
This is information about the year of studying during which the text was created. For instance, 1st year spec means the first of five years of specialist program, 2nd year bach means the second of four years of bachelor's program, etc.

1 semester, 2 module, etc.
Information about the semester, or the module (for students with the module system), when the text was created.

Course paper, bachelor's or master's thesis, abstract, essay, report, summary, autobiography, paragraph, etc.
The type of an academic text is always indicated. Paragraphs are small texts for different purposes: home assignments, answers to various questions, etc.

Linguistics, economics, sociology, law, etc.
While studying at university, a student writes texts related to his/her academic major as well as to other academic fields. Generally, linguists write course papers and theses on linguistics and economists on economics; however, even for term papers there may be exceptions. For other types of texts, the field of knowledge does not always coincide with the student's major. For instance, a linguist may write a paper on history, an economist on sociology, etc.
Many short texts were written in the Academic Writing or Speech Culture courses and may not be attributable to any particular field of knowledge. Such texts are marked with the label «various topics».
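
The author and text metadata described above can be pictured as one record per text. The following Python sketch is only an illustration, not the corpus's actual storage format: all class, field and function names are hypothetical. The age helper additionally assumes that a student enters university at about 18 and that a master's programme follows a four-year bachelor's programme.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record layout; the real corpus may store metadata differently.
@dataclass
class AuthorMeta:
    major: str                                   # always indicated
    gender: Optional[str] = None                 # "M", "F", or None when unknown
    age: Optional[int] = None                    # not always collected
    other_native_language: Optional[str] = None  # e.g. "Armenian"
    region: Optional[str] = None                 # residence before university


@dataclass
class TextMeta:
    text_type: str                        # always indicated, e.g. "essay"
    field_of_knowledge: str               # may differ from the author's major
    academic_year: Optional[str] = None   # e.g. "2011-2012"
    year_of_study: Optional[str] = None   # e.g. "2nd year bach"
    term: Optional[str] = None            # e.g. "1 semester" or "2 module"


# Assumption: years of study completed before each programme starts.
PROGRAMME_OFFSET = {"spec": 0, "bach": 0, "mas": 4}
ASSUMED_ENTRY_AGE = 18  # assumption, not a figure stated by the corpus


def approximate_age(year_of_study: str, entry_age: int = ASSUMED_ENTRY_AGE) -> int:
    """Estimate an author's age from a label such as '2nd year bach'."""
    ordinal, _, programme = year_of_study.split()
    year = int(ordinal[:-2])  # strip the 'st' / 'nd' / 'rd' / 'th' suffix
    return entry_age + PROGRAMME_OFFSET[programme] + (year - 1)


# A linguist writing an essay on history: major and field of knowledge differ.
author = AuthorMeta(major="linguistics", gender="F")
text = TextMeta(text_type="essay", field_of_knowledge="history",
                year_of_study="2nd year bach")
```

Keeping optional fields as `None` mirrors the fact that gender, age, language and region are recorded only when they are known.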


Morphological markup

Morphological markup contains information about morphological forms and meanings (parts of speech, gender, number, case, mood, etc.). Morphological markup is required for searching words, word forms and constructions.

For the morphological markup of the Corpus of Russian Student Texts we have used the same tag set that was integrated into the Russian National Corpus (RNC). The full list of tags is available on the RNC website. The morphological markup is carried out automatically with the help of the morphological analyzer MYSTEM.
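
To illustrate how morphological markup supports searching for words and word forms, here is a minimal Python sketch over a hand-annotated toy sentence. It does not reproduce the corpus's search engine or MYSTEM's output format: the `Token` structure and `find_tokens` function are hypothetical, and the tag names only imitate RNC-style tags and should be checked against the RNC tag list.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Token(NamedTuple):
    form: str             # word form as it appears in the text
    lemma: str            # dictionary form
    tags: FrozenSet[str]  # morphological tags (illustrative, RNC-like)


def find_tokens(tokens: Iterable[Token],
                lemma: Optional[str] = None,
                required_tags: Iterable[str] = ()) -> List[Token]:
    """Return tokens that match the lemma (if given) and carry all required tags."""
    wanted = frozenset(required_tags)
    return [t for t in tokens
            if (lemma is None or t.lemma == lemma) and wanted <= t.tags]


# Toy sentence "студенты пишут тексты" ("students write texts"), annotated by hand.
SENTENCE = [
    Token("студенты", "студент", frozenset({"S", "m", "anim", "nom", "pl"})),
    Token("пишут", "писать", frozenset({"V", "ipf", "indic", "praes", "3p", "pl"})),
    Token("тексты", "текст", frozenset({"S", "m", "inan", "acc", "pl"})),
]

plural_nouns = find_tokens(SENTENCE, required_tags={"S", "pl"})
forms_of_write = find_tokens(SENTENCE, lemma="писать")
```

Searching by lemma returns every inflected form of a word, while searching by tags returns every token with the requested grammatical features.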


Error markup

The multilayered system of error markup is currently being tested, after which the texts will be annotated manually. The error markup system includes:

  a. linguistic error type (errors in lexis, grammar or discourse);
  b. cause of an error.

Linguistic error types presented in (a) are divided into subtypes as follows:

  • lexical (lexical and word formation errors),
  • stylistic (official style, colloquial style, other stylistic deviance),
  • grammatical (agreement, government, coordination errors, errors in comparative constructions, in adverbial participial phrases, in complex sentences – in sentential arguments and relative clauses; errors in pronouns including anaphora and reflexives; reciprocals, nominal, verbal and adjectival inflection – errors in number, person, gender, case, verbal and adjectival inflectional forms),
  • discursive (parcellation, topicalization, wrong use of link-words, mixing of direct and indirect speech, wrong word order, logical errors).

Among the causes of mistakes presented in (b) one can point out contaminations of constructions and typos as the most transparent and the most common causes. In the future, the process of annotating the corpus texts may reveal other types of causes of mistakes.

It is important to stress that the suggested error classification is multilayered:

  1. Firstly, for each linguistic error its linguistic type and, if possible, cause, have to be defined.
  2. Secondly, an error may be attributed to different linguistic types. For instance, an error may be both lexical and stylistic at the same time.
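
The multilayered classification can be sketched as a small data structure in which one error carries a set of linguistic types plus an optional cause. This Python sketch is an assumption about how such an annotation might be represented, not the corpus's actual markup format; the class name, field names and the placeholder fragment are hypothetical, while the type labels follow the list above.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Top-level linguistic error types named in the classification above.
ERROR_TYPES = {"lexical", "stylistic", "grammatical", "discursive"}


@dataclass
class ErrorAnnotation:
    fragment: str                # the marked-up span of text
    types: Set[str]              # one or more linguistic error types
    cause: Optional[str] = None  # e.g. "typo"; None when the cause is unclear

    def __post_init__(self) -> None:
        # Every error needs at least one linguistic type.
        if not self.types:
            raise ValueError("at least one linguistic error type is required")
        unknown = self.types - ERROR_TYPES
        if unknown:
            raise ValueError(f"unknown error type(s): {sorted(unknown)}")
        # The cause is left unvalidated: annotation may reveal new causes.


# An error that is both lexical and stylistic, with no cause identified.
example = ErrorAnnotation(fragment="<annotated fragment>",
                          types={"lexical", "stylistic"})
```

Storing the types as a set rather than a single value is what allows one error to belong to several linguistic types at once.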

Corpus users

The Corpus of Russian Student Texts is intended for linguistic researchers, teachers of literary Russian and of Russian academic writing, and students.

To researchers, this resource is useful because it helps, on the one hand, to systematize the peculiarities of academic texts of novice authors. On the other hand, the Corpus of Russian Student Texts is an invaluable resource for the study of errors made by speakers of Russian, as well as for investigation of grammar, lexis, stylistics, sociolinguistics, and psycholinguistics.

To teachers of literary Russian and of Russian academic writing, the resource shows the main problems that students face when trying to master a new register of their first language. Its methodological value is that teachers can develop various exercises and linguistic training software on its basis.

To students of different majors, the resource offers an opportunity to familiarize themselves with typical language errors and their classification, and to practise correcting them. Such exercises are intended to reduce these errors in the students' own writing.


Team

The corpus was created by the Linguistic Laboratory of Corpus Technologies of the National Research University Higher School of Economics.

Chiefs
Natalya Aleksandrovna Zevahina
Ekaterina Vladimirovna Rahilina
Yuliya Mihaylovna Kuvshinskaya


Developers
Timofey Aleksandrovich Arhangelskiy
Evgeniy Glazunov
Elmira Gayazovna Mustakimova


Team:

  • HSE School of Linguistics Staff
    (who contributed to the text markup, research and development of CoRST)

  • Yana Emilevna Ahapkina
    Svetlana Dzhakupova
    Olesya Kisselev
    Anna Iosifovna Levinzon
    Aleksandr Borisovich Letuchiy
    Anna Dmitrievna Plisetskaya

  • HSE School of Linguistics Students
    (who contributed to the text markup, research and development of CoRST)

  • Vlada Aleksandrova
    Olga Astapenko
    Yuliya Badryizlova
    Margarita Bobrova
    Tatyana Bolgina
    Mariya Bocharova
    Anna Vishenkova
    Ekaterina Voloshina
    Evgeniy Glazunov
    Nikita Gordeev
    Nadezhda Grigoreva
    Dasha Demidova
    Andrey Zaytsev
    Mariya Zarifyan
    Veronika Zyikova
    Violetta Ivanova
    Sergey Kolesnikov
    Anna Kondrateva
    Anastasiya Kostyanitsyina
    Polina Kudryavtseva
    Mariya Kuznetsova
    Olga Kultepina
    Alina Ladyigina
    Nastya Makarovich
    Darya Maksimova
    Ekaterina Matyuhina
    Ekaterina Mozhelyanskaya
    Nikol Morozova
    Mariya Myislina
    Mariya Ob'edkova
    Olga Osinovskaya
    Vasilisa Osipova
    Irina Panteleeva
    Alina Prohorova
    Svetlana Puzhaeva
    Olga Ramzaytseva
    Kseniya Romanova
    Sasha Salamatina
    Asya Simonyan
    Anna Simonyan
    Polina Sonina
    Ekaterina Taktasheva
    Mariya Teryohina
    Mariya Fedorova
    Anna Tsyizova
    Natalya Chukicheva
    Alina Shaymardanova
    Yana Shevchenko


    Publications


    Projects


      2021

    • Aleksandra Serova
      Variability in Argument Structure


      2017

    • Nadezhda Grigor'eva
      Errors in Voice and Argument Alternations: Corpus and Experimental Study

    • Anastasiya Makarovich
      Analysis of the Errors in Government in Contemporary Russian

    • Nadezhda Grigor'eva
      Non-standard realisations of sya-constructions in Russian

    • Anastasiya Makarovich
      Non-standard realisations of argument structures in Russian


      2016

    • Olga Ramzaytseva
      Automatic parsing of errors in comparative constructions in the Corpus of Russian Student Texts

    • Svetlana Puzhaeva
      Factors influencing coreference violations in adverbial participial phrases


      2015

    • Anna Vishenkova
      Comparative constructions with double marker of comparison in non-standard Russian: Corpus study

    • Svetlana Puzhaeva
      Construction Blending in the Corpus of Russian Student Texts


      2014

    • Natalya Zevahina, Svetlana Dzhakupova
      Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers

    • Natalya Zevahina, Svetlana Dzhakupova, Anna Vishenkova
      Meta-linguistic comparative constructions in Russian and cross-linguistically (2014 to present)

    Publications


      2021

    • Kuvshinskaya Yu. M., Aksenova A. A. Punktuatsiya v predlozheniyah s soyuzom "to est" kak otrazhenie razvitiya diskursivnyih funktsiy // V kn.: Yazyik i metod: Russkiy yazyik v lingvisticheskih issledovaniyah XXI veka Vol. 7. Kraków : Wydawnictwo Uniwersytetu Jagiellońskiego, 2021.


      2020

    • Klimov A., Kopotev M., Zevakhina N., Kilina M., Grillandi A., Elizaveta N., Sidorova A. Assessing novice writing against the Corpus of Academic Texts, in: 14 TaLC - Teaching and Language Corpora Conference. 2020. P. 44-45.


      2019

    • Klimov A., Toldova S., Kopotev M., Zevakhina N., Dmitrieva A. D., Kisselev O., Baranchikova A., Fedorova M. CAT&kittens: a corpus-based text-analytic tool for Russian academic writing, in: SlaviCorp 2018 Book of Abstracts. Charles University, 2018. P. 22-25.

    • Kuvshinskaya Yu. M. Sovremennyie tendentsii v distributivnom upotreblenii russkih suschestvitelnyih (po korpusnyim dannyim) // V kn.: Russkaya grammatika: aktivnyie protsessyi v yazyike i rechi. Yar. : RIO YaGPU, 2019. S. 405-415.

    • Kuvshinskaya Y. M. Russian indefinite pronoun kakoj-libo: non-standard usage and changes in the semantics // Jazykovedny Casopis. 2019. No. 2. P. 225-233.


      2018

    • Grigoreva N. A., Zevahina N. A., Kuvshinskaya Yu. M. Nestandartnyie strategii upotrebleniya vozvratnyih glagolov: prichinyi i yazyikovyie mehanizmyi // V kn.: Fortunatovskie chteniya v Karelii. Sbornik dokladov Mezhdunarodnoy nauchnoy konferentsii (10-12 sentyabrya 2018 goda, Petrozavodsk) Ch. 1. Petrozavodsk : Izdatelstvo PetrGU, 2018. S. 72-74.


      2016

    • Kuvshinskaya Yu. M. Studencheskie rechevyie oshibki i aktualnyie tendentsii v razvitii russkogo yazyika // V kn.: Leo philologiae. Festshrift v chest 70-letiya Lva Iosifovicha Soboleva / Pod obsch. red.: A. A. Bonch-Osmolovskaya, M. A. Kucherskaya, K. M. Polivanov, A. A. Zubov, Mayofis Mariya Lvovna. M. : [b.i.], 2016. S. 157-178.


      2015

    • Puzhaeva S. Yu., Zevahina N. A., Dzhakupova S. S. Kontaminatsiya konstruktsiy v rechi nestandartnyih russkogovoryaschih na materiale korpusa russkih uchebnyih tekstov // V kn.: Trudyi Mezhdunarodnoy nauchnoy konferentsii "Korpusnaya lingvistika-2015". SPb. : Izdatelstvo SPbGU, 2015. S. 390-397.

    • Zevakhina N., Dzhakupova S. Corpus of Russian student texts: design and prospects, in: Trudyi 21-y Mezhdunarodnoy konferentsii po kompyuternoy lingvistike "Dialog". Izd-vo RGGU, 2015.

    • Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova. (2015). Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. In Proceedings of the 6th International Conference “Corpus Linguistics-2015”, 390-397. Saint-Petersburg. (in Russian)

    • Zevakhina, Natalia and Svetlana Dzhakupova. Corpus of Russian student texts: design and prospects. In Proceedings of the 21st International Conference on Computational Linguistics “Dialog”. Moscow, 2015.

    • Zevakhina N., Dzhakupova S. Russian metalinguistic comparatives: a functional perspective. In Working papers by NRU HSE. Series WP BRP "Linguistics". 2015. No. 39.


      2014

    • Svetlana Dzhakupova and Natalia Zevakhina. Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers // Ahti Nikunlassi, Ekaterina Protassova (eds.) Slavica Helsingiensia 45 Instrumentarium of Linguistics: Language errors and multilingualism. Helsinki: University of Helsinki, 2014, pp. 35—49. (in Russian)


    Presentations



      2021

    • Kuvshinskaya Yulia
      Rabota nad oshibkami: oshibki v uchebnyih tekstah kak instrument obucheniya i predmet studencheskih issledovaniy/ International conference "Yazyik i metodyi ego opisaniya", RSUH
      28 January 2021, Russian Federation, Moscow

    • Kuvshinskaya Y.M., Aksenova A.A.
      Sovremennyie tendentsii v upotreblenii konnektora "to est": korpusnoe issledovanie / V Mezhdunarodnyiy simpozium "Russian Grammar: System – Language Usage – Language Variation"
      22–24 September 2021, Germany, Potsdam


      2020

    • Kuvshinskaya Yulia and Natalia Zevakhina
      Non-standard participles in Russian: a corpus study. The 15th Annual Slavic Linguistics Society Meeting. Indiana University
      4-6 September 2020, US (Online presentation due to Covid-19)


      2019

    • Kuvshinskaya Yulia
      Mestoimenie KAKOY-LIBO: nestandartnyie upotrebleniya i sovremennyiy uzus (po materialam Korpusa russkih uchebnyih tekstov i NKRYa)// VI Mezhdunarodnaya nauchnaya konferentsiya "Kultura russkoy rechi" (Grotovskie chteniya)
      21–23 February 2019, Russian Federation, Moscow

    • Kuvshinskaya Yulia
      Sovremennyie tendentsii v distributivnom upotreblenii russkih suschestvitelnyih (po korpusnyim dannyim)// Russkaya grammatika: aktivnyie protsessyi v yazyike i rechi
      17–19 May 2019, Russian Federation, Yaroslavl


      2018

    • Makarovich Anastasiya, Natalia Zevakhina and Yulia Kuvshinskaya
      Deviations in argument structures of Russian predicates: Corpus evidence. The 51st Annual Meeting of the Societas Linguistica Europaea.
      September 2018, Tallinn

    • Kuvshinskaya Yulia
      Non-standard usage of participles in Russian written texts: a corpus study // International Conference on Russian Studies.
      20–22 June 2018, Spain, Barcelona

    • Kuvshinskaya Yulia, Nadezhda Grigor’eva and Natalia Zevakhina
      Non-standard strategies for using sya-verbs: facts and explanations. International conference in honour of Filipp Fortunatov (Paper accepted for presentation, in Russian)
      September 2018, Russian Federation, Petrozavodsk

    • Kuvshinskaya Yulia and Natalia Zevakhina
      Non-standard usage of participles in Russian written texts: a corpus study. International Conference on Russian Studies (Paper accepted for presentation, in Russian)
      June 2018, Spain, Barcelona.


      2016

    • Zevakhina Natalia, Svetlana Dzhakupova, Svetlana Puzhaeva, Elmira Mustakimova, Olga Ramzaytseva
      Research on the basis of the Corpus of Russian Student Texts. Conference “Grammatical Processes in synchrony and diachrony”. Institute for the Russian Language
      31 May 2016, Russian Federation, Moscow.


      2015

    • Anna Vishenkova, Natalia Zevakhina, Svetlana Dzhakupova
      Morphosyntax and Semantics of Russian Metalinguistic Comparatives. 14th annual conference of the Slavic Cognitive Linguistics Association "Crossing boundaries: taking a cognitive scientific perspective on Slavic languages and linguistics". Universities of Sheffield and Oxford
      9–13 December 2015, UK

    • Dzhakupova, Svetlana, Elmira Mustakimova, and Natalia Zevakhina
      Corpus of Russian Student Texts: goals, annotation, and perspectives. Corpus Linguistics 2015 conference
      21–24 July 2015, UK, Lancaster

    • Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova
      Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. The 6th International Conference “Corpus Linguistics-2015”
      22–26 June 2015, Russian Federation, Saint-Petersburg

    • Zevakhina, Natalia and Svetlana Dzhakupova
      Corpus of Russian student texts: design and prospects. The 21st International Conference on Computational Linguistics “Dialog”
      27–30 May 2015, Russian Federation, Moscow


      2014

    • Natalia Zevakhina and Svetlana Dzhakupova
      Russian in the mirror of the Corpus of Russian Student Texts. The Institute of Slavic Studies of the Russian Academy of Sciences
      16 December 2014, Russian Federation, Moscow

    • Natalia Zevakhina and Svetlana Dzhakupova
      Russian Metalinguistic Comparatives: Towards the Typology // The 11th Conference on Typology and Grammar for Young Researchers. Institute for Linguistic Studies, Russian Academy of Science
      27–29 November 2014, Russian Federation, Saint-Petersburg


    Links


    License

    The contents of this site are licensed under a Creative Commons Attribution 4.0 International License.