CoRST

Corpus of Russian Student Texts


About

Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students of different universities. Currently, the size of the corpus is about 3.1 million tokens. The texts are annotated in several ways (metatextual annotation, morphological annotation and error markup), which enables users to perform many types of search in the corpus.

To learn more about how to use the corpus and the search request language, please consult the help page.

Corpus of Russian Student Texts is a comprehensive reference system intended for researchers, teachers, students, as well as everyone interested in the problems of modern Russian grammar, current processes in lexis, morphology and syntax of modern Russian.


Types of texts

The texts represented in the corpus were written by students of bachelor's and master's programmes at different universities. The main types of texts are as follows: course papers, term papers, bachelor's and master's theses, essays, abstracts, reports, summaries, autobiographies, and paragraphs (small texts for different purposes: home assignments, answers to various questions, etc.).

The corpus provides information about the academic year, semester or module when each text was written and the field of knowledge to which the text relates. The field of knowledge may not coincide with the student's academic major. For instance, if a linguist writes an essay on history, we indicate his/her major (linguistics) and the subject of the work (history).


Authors of texts

The corpus includes texts written by students of the following academic majors: economics, sociology, political science, law, psychology, journalism, linguistics, history, philology, logistics, mathematics, philosophy. Generally, the corpus provides information about the gender and age of the author, and also about the academic year (1st year bachelor, 2nd year master, etc.). Some texts also carry information about the region where the author lived until the age of 18 and whether the author is a bilingual speaker.


Annotation

As in any other corpus intended for linguistic and sociological research, the texts of the Corpus of Russian Student Texts are supplemented with linguistic and metalinguistic information, or markup. Linguistic markup includes morphological markup and error markup. Currently, the texts of the corpus contain metatextual markup and morphological markup.

Below is more information about the layers of markup.


Metatextual markup

Metatextual markup contains metalinguistic information about the text and its author.

Author-meta

Linguist, economist, sociologist, law student, etc.
The student's academic major is always indicated: this is an important part of sociolinguistic information, which is of value for methodological research.

M, F
Information about the gender of a student is given in the majority of cases; however, sometimes it cannot be retrieved. For instance, if a student has signed his/her paper Jakovenko M., the field «gender» is not filled in. After the metatextual data is collected, the name of the author is deleted.

21, 22, etc.
The age of the author is indicated if necessary. While creating the corpus, we did not intend to collect accurate information about age. Most Russian students enter university right after leaving school; therefore, the approximate age of a student can easily be inferred from the year of study (see Metadata related to the text).

Ukrainian, Armenian, German, etc.
This indicates whether the author has a mother tongue other than Russian. Generally, this is a language spoken in the family or a language of the community where the author lived before entering the university.

Ukraine, Armenia, Germany, etc.
This includes information about the region of the author's residency before he/she entered the university. This information matters for sociolinguistic research. The region is indicated if known.




Text-meta

2003-2004, 2011-2012, etc.
Generally, metatextual markup has information about the academic year when the text was written. This information is needed for methodological research.

1st year spec, 2nd year bach, 1st year mas, etc.
This is information about the year of studying during which the text was created. For instance, 1st year spec means the first of five years of specialist program, 2nd year bach means the second of four years of bachelor's program, etc.

1 semester, 2 module, etc.
Information about the semester, or the module (for students with the module system), when the text was created.

Course paper, bachelor's or master's thesis, abstract, essay, report, summary, autobiography, paragraph, etc.
The type of an academic text is always indicated. Paragraphs are small texts for different purposes: home assignments, answers to various questions, etc.

Linguistics, economics, sociology, law, etc.
While studying at the university, a student writes texts both in his/her academic major and in other academic fields. Generally, linguists write course papers and theses on linguistics and economists on economics, although even for term papers there may be exceptions. For the rest of the texts, the field of knowledge to which they relate does not always coincide with the student's major. For instance, a linguist may write a paper on history, or an economist a paper on sociology.
Many short texts were written in the Academic Writing or Speech Culture courses and may not be attributable to any particular field of knowledge. Such texts are marked with the label «various topics».
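
The two groups of metadata fields described above can be pictured as a pair of record types. The following is only an illustrative sketch in Python: the class names, field names and the parse_study_year helper are hypothetical and do not reflect the corpus's actual storage format. Fields that the description says are always indicated (major, text type) are required; the others are optional.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Programme abbreviations as used in labels such as "2nd year bach".
# The mapping itself is an assumption made for this sketch.
PROGRAMMES = {"spec": "specialist", "bach": "bachelor", "mas": "master"}


@dataclass
class AuthorMeta:
    major: str                                   # always indicated
    gender: Optional[str] = None                 # "M" or "F"; None if not retrievable
    age: Optional[int] = None                    # indicated only if necessary
    other_native_language: Optional[str] = None  # e.g. "Armenian"
    region: Optional[str] = None                 # indicated if known


@dataclass
class TextMeta:
    text_type: str                               # always indicated, e.g. "essay"
    field: str                                   # e.g. "history" or "various topics"
    academic_year: Optional[str] = None          # e.g. "2011-2012"
    study_year: Optional[str] = None             # e.g. "2nd year bach"
    term: Optional[str] = None                   # e.g. "1 semester", "2 module"


def parse_study_year(label: str) -> Tuple[int, str]:
    """Split a label such as '2nd year bach' into (2, 'bachelor')."""
    match = re.fullmatch(r"(\d+)(?:st|nd|rd|th) year (\w+)", label.strip())
    if match is None or match.group(2) not in PROGRAMMES:
        raise ValueError(f"unrecognised study-year label: {label!r}")
    return int(match.group(1)), PROGRAMMES[match.group(2)]


# A linguist writing an essay on history: major and field differ.
author = AuthorMeta(major="linguistics", gender="F")
text = TextMeta(text_type="essay", field="history", study_year="2nd year bach")
year, programme = parse_study_year(text.study_year)
```

Keeping the major on the author record and the field on the text record mirrors the point made above that the two need not coincide.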


Morphological markup

Morphological markup contains information about morphological forms and meanings (parts of speech, gender, number, case, mood, etc.). Morphological markup is required for searching words, word forms and constructions.

For the morphological markup of the Corpus of Russian Student Texts we have used the same tag set that was integrated into the Russian National Corpus (RNC). The full list of tags is available on the RNC website. The morphological markup is carried out automatically with the help of the morphological analyzer MYSTEM.


Error markup

The multilayered system of error markup is currently being tested, after which the texts will be annotated manually. The error markup system includes:

  1. linguistic error type (errors in lexis, grammar or discourse);
  2. cause of an error.

Linguistic error types presented in (1) are divided into subtypes as follows:

  • lexical (lexical and word formation errors),
  • stylistic (official style, colloquial style, other stylistic deviance),
  • grammatical (agreement, government, coordination errors, errors in comparative constructions, in adverbial participial phrases, in complex sentences – in sentential arguments and relative clauses; errors in pronouns including anaphora and reflexives; reciprocals, nominal, verbal and adjectival inflection – errors in number, person, gender, case, verbal and adjectival inflectional forms),
  • discursive (parcellation, topicalization, wrong use of link-words, mixing of direct and indirect speech, wrong word order, logical errors).

Among the causes of errors presented in (2), contamination of constructions and typos stand out as the most transparent and the most common. In the future, the process of annotating the corpus texts may reveal other types of causes.

It is important to stress that the suggested error classification is multilayered:

  1. Firstly, for each linguistic error its linguistic type and, if possible, cause, have to be defined.
  2. Secondly, an error may be attributed to different linguistic types. For instance, an error may be both lexical and stylistic at the same time.
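
The multilayered classification can be sketched as a small data structure in which one error carries a set of linguistic types plus an optional cause. This Python sketch is purely illustrative: the ErrorAnnotation class, its field names and the label strings are hypothetical and are not the markup format used in the corpus.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Top-level linguistic error types and the two causes named in the text.
# The exact label strings are assumptions made for this sketch.
ERROR_TYPES = frozenset({"lexical", "stylistic", "grammatical", "discursive"})
CAUSES = frozenset({"construction contamination", "typo"})


@dataclass(frozen=True)
class ErrorAnnotation:
    fragment: str                 # the annotated stretch of text
    types: FrozenSet[str]         # one or more linguistic error types
    cause: Optional[str] = None   # filled in only when it can be defined

    def __post_init__(self):
        # Layer 1: every error needs at least one known linguistic type.
        if not self.types:
            raise ValueError("at least one linguistic error type is required")
        unknown = set(self.types) - ERROR_TYPES
        if unknown:
            raise ValueError(f"unknown error types: {sorted(unknown)}")
        # Layer 2: the cause is optional, but must be a known one if given.
        if self.cause is not None and self.cause not in CAUSES:
            raise ValueError(f"unknown cause: {self.cause!r}")


# An error that is both lexical and stylistic, with no cause identified.
# The fragment is a placeholder, not a real corpus example.
annotation = ErrorAnnotation(
    fragment="<fragment>",
    types=frozenset({"lexical", "stylistic"}),
)
```

Using a set for the types rather than a single value is what lets one error belong to several linguistic types at once.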

Corpus users

The Corpus of Russian Student Texts is intended for research linguists, teachers of literary Russian and of Russian academic writing, and students.

To researchers, this resource is useful because it helps, on the one hand, to systematize the peculiarities of academic texts of novice authors. On the other hand, the Corpus of Russian Student Texts is an invaluable resource for the study of errors made by speakers of Russian, as well as for investigation of grammar, lexis, stylistics, sociolinguistics, and psycholinguistics.

To teachers of literary Russian and of Russian academic writing, the resource helps to see the main problems that a student faces while trying to master a new register of his/her first language. The methodological value of the resource is that teachers can develop various exercises and linguistic training software on its basis.

To students of different majors, the resource offers an opportunity to familiarize themselves with typical language errors and their classification, as well as an opportunity to correct them. Such exercises are intended to help students make fewer of these errors in their own writing.


Team

The corpus was created by the Linguistic Laboratory of Corpus Technologies of National Research University Higher School of Economics:

Chiefs: Ekaterina Rakhilina, Natalia Zevakhina, Svetlana Dzhakupova, Yulia Kuvshinskaya

Developers: Timofey Arkhangelskiy, Elmira Mustakimova

Team:
Yana Akhapkina
Tatyana Bolgina
Natalia Chukicheva
Anna Kondratyeva
Yulia Kuvshinskaya
Olga Kul'tepina
Mariya Kuznetsova
Alexander Letuchiy
Anna Levinzon
Maria Ob'yedkova
Irina Panteleeva
Anna Plissetskaya
Svetlana Puzhaeva
Olga Ramzaytseva
Alina Shaymardanova
Anna Tsyzova
Anna Vishenkova
Maria Zarifyan


Publications


Projects

  • Natalia Zevakhina, Svetlana Dzhakupova, Anna Vishenkova. "Meta-linguistic comparative constructions in Russian and cross-linguistically" (2014 to present)
  • Natalia Zevakhina, Svetlana Dzhakupova. "Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers" (2014)

     Including students' course projects:
  • Svetlana Puzhaeva. "Factors which determine the lack of coreference in Russian converbs" (master's thesis, 2016)
  • Olga Ramzaytseva. "Automatic parsing of errors in comparative constructions in the Corpus of Russian Student Texts" (master, 1st year, 2016)
  • Svetlana Puzhaeva. "Construction Blending in the Corpus of Russian Student Texts" (master, 1st year, 2015)
  • Anna Vishenkova. "Comparative constructions with double marker of comparison in non-standard Russian: a corpus study" (bachelor, 2nd year, 2015)

Publications

  • Zevakhina N., Dzhakupova S. Russian metalinguistic comparatives: a functional perspective. In Working papers by NRU HSE. Series WP BRP "Linguistics". 2015. No. 39.
  • Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova. (2015). Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. In Proceedings of the 6th International Conference “Corpus Linguistics-2015”, 390-397. Saint-Petersburg. (in Russian)
  • Zevakhina, Natalia and Svetlana Dzhakupova. Corpus of Russian student texts: design and prospects. In Proceedings of the 21st International Conference on Computational Linguistics “Dialog”. Moscow, 2015.
  • Svetlana Dzhakupova and Natalia Zevakhina. Case (non-)coincidence in elliptical coordinated constructions: Learner texts of Russian native speakers // Ahti Nikunlassi, Ekaterina Protassova (eds.) Slavica Helsingiensia 45 Instrumentarium of Linguistics: Language errors and multilingualism. Helsinki: University of Helsinki, 2014, pp. 35-49. (in Russian)

Presentations

  • Anna Vishenkova, Zevakhina, Natalia, Svetlana Dzhakupova. Morphosyntax and Semantics of Russian Metalinguistic Comparatives. 14th annual conference of the Slavic Cognitive Linguistics Association "Crossing boundaries: taking a cognitive scientific perspective on Slavic languages and linguistics". Universities of Sheffield and Oxford, UK, 9 - 13 December 2015.
  • Dzhakupova, Svetlana, Elmira Mustakimova, and Natalia Zevakhina. Corpus of Russian Student Texts: goals, annotation, and perspectives. Corpus Linguistics 2015 conference. Lancaster, UK, 21-24.07.2015.
  • Puzhaeva Svetlana, Natalia Zevakhina, and Svetlana Dzhakupova. Construction blending in non-standard variants of Russian in the Corpus of Russian Student Texts. The 6th International Conference “Corpus Linguistics-2015”. Saint-Petersburg, 22-26.06.2015.
  • Zevakhina, Natalia and Svetlana Dzhakupova. Corpus of Russian student texts: design and prospects. The 21st International Conference on Computational Linguistics “Dialog”. Moscow, 27-30.05.2015.
  • Natalia Zevakhina and Svetlana Dzhakupova. Russian in the mirror of the Corpus of Russian Student Texts. The Institute of Slavic Studies of the Russian Academy of Sciences, Moscow, 16.12.2014.
  • Natalia Zevakhina and Svetlana Dzhakupova. Russian Metalinguistic Comparatives: Towards the Typology // The 11th Conference on Typology and Grammar for Young Researchers. Institute for Linguistic Studies, Russian Academy of Science, Saint-Petersburg, Russia. 27-29.11.2014.

Links