Corpus of Russian Student Texts (CoRST) is a collection of Russian texts written by students of different universities. Currently, the size of the corpus is about 3.1 million tokens. The texts are annotated in several ways (metatextual annotation, morphological annotation and error markup), which enables users to perform many types of search in the corpus.
To learn more about how to use the corpus and the search request language, please consult the help page.
Corpus of Russian Student Texts is a comprehensive reference system intended for researchers, teachers, students, as well as everyone interested in the problems of modern Russian grammar, current processes in lexis, morphology and syntax of modern Russian.
The texts in the corpus were written by students of bachelor's and master's programmes at different universities. The main types of texts are as follows: course papers, term papers, bachelor's and master's theses, essays, abstracts, reports, summaries, autobiographies, and paragraphs (small texts for different purposes: home assignments, answers to various questions, etc.).
The corpus provides information about the academic year, semester, or module when each text was written and the field of knowledge which the text relates to. The field of knowledge may not coincide with the academic major of the student. For instance, if a linguist writes an essay on history, we indicate his/her major (linguistics) and the subject of the work (history).
The corpus includes texts written by students of the following academic majors: economics, sociology, political science, law, psychology, journalism, linguistics, history, philology, logistics, mathematics, philosophy. Generally, the corpus provides information about the gender and age of the author, and also about the academic year (1st year bachelor, 2nd year master, etc.). Some texts are provided with information about the region where the author lived until the age of 18 and whether the author is a bilingual speaker.
As in any other corpus intended for linguistic and sociological research, the texts of the Corpus of Russian Student Texts are supplemented with linguistic and metalinguistic information, or markup. Linguistic markup includes morphological markup and error markup. Currently, the texts of the corpus contain metatextual markup and morphological markup.
Below is more information about the layers of markup.
Linguist, economist, sociologist, law student, etc. The student's academic major is always indicated: this is an important part of sociolinguistic information, which is of value for methodological research.
M, F. Information about the gender of a student is given in the majority of cases; however, sometimes it cannot be retrieved. For instance, if a student has signed his/her paper Jakovenko M., the field «gender» is not filled in. After the metatextual data is collected, the name of the author is deleted.
21, 22, etc. The age of the author is indicated if necessary. While creating the corpus, we did not intend to collect accurate information about the age. Most Russian students enter the university right after leaving school; therefore, the approximate age of a student is easily retrieved from the data about the year of studying (see Metadata related to the text).
Ukrainian, Armenian, German, etc. This includes information about whether the author has any mother tongue besides Russian. Generally, this is a language which is spoken in the family or a language of the community where the author lived before entering the university.
Ukraine, Armenia, Germany, etc. This includes information about the region of the author's residency before he/she entered the university. This information matters for sociolinguistic research. The region is indicated if known.
2003-2004, 2011-2012, etc. Generally, metatextual markup has information about the academic year when the text was written. This information is needed for methodological research.
1st year spec, 2nd year bach, 1st year mas, etc. This is information about the year of studying during which the text was created. For instance, 1st year spec means the first of the five years of a specialist programme, 2nd year bach means the second of the four years of a bachelor's programme, etc.
1 semester, 2 module, etc. Information about the semester, or the module (for students with the module system), when the text was created.
Course paper, bachelor's or master's thesis, abstract, essay, report, summary, autobiography, paragraph, etc. The type of an academic text is always indicated. Paragraphs are small texts for different purposes: home assignments, answers to various questions, etc.
Linguistics, economics, sociology, law, etc. In the process of studying at the university, a student writes texts related to his/her academic major or to other academic fields. Generally, linguists write course papers and theses on linguistics and economists write them on economics; however, even for term papers, there may be exceptions. As for the rest of the texts, the field of knowledge to which they relate does not always coincide with the student's major. For instance, a linguist may write a paper on history, an economist may write one on sociology, etc. Many short texts were written in the Academic Writing or Speech Culture courses and may not be attributed to any particular field of knowledge. Such texts are marked with the label «various topics».
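The metatextual fields described above can be pictured as one record per text. The following is only an illustrative sketch in Python: the class name `TextMetadata`, its attribute names, and the helper `parse_year_of_study` are hypothetical and do not reflect the corpus's actual storage format or search interface. The sketch assumes that year-of-study labels follow the pattern of the examples given above (`1st year spec`, `2nd year bach`, `1st year mas`).

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Abbreviations taken from the examples in the text; the expanded names
# are an assumption made for this illustration.
PROGRAMME_NAMES = {"spec": "specialist", "bach": "bachelor", "mas": "master"}


@dataclass
class TextMetadata:
    """Hypothetical record holding the metatextual fields of one text."""
    # Fields the description says are always indicated.
    major: str
    text_type: str
    field_of_knowledge: str
    # Fields that may be missing, so they default to None.
    gender: Optional[str] = None                 # "M" or "F"
    age: Optional[int] = None
    other_native_language: Optional[str] = None
    region: Optional[str] = None
    academic_year: Optional[str] = None          # e.g. "2011-2012"
    year_of_study: Optional[str] = None          # e.g. "2nd year bach"
    term: Optional[str] = None                   # e.g. "1 semester", "2 module"


def parse_year_of_study(label: str) -> Tuple[int, str]:
    """Split a label such as '2nd year bach' into (2, 'bachelor')."""
    match = re.fullmatch(r"(\d+)(?:st|nd|rd|th) year (spec|bach|mas)", label.strip())
    if match is None:
        raise ValueError(f"Unrecognised year-of-study label: {label!r}")
    return int(match.group(1)), PROGRAMME_NAMES[match.group(2)]


# A linguist writing an essay on history: major and field of knowledge differ.
record = TextMetadata(
    major="linguistics",
    text_type="essay",
    field_of_knowledge="history",
    year_of_study="2nd year bach",
)
year, programme = parse_year_of_study(record.year_of_study)
```

Keeping the optional fields as `None` by default mirrors the fact that gender, age, region and native language are recorded only when they are known.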
Morphological markup contains information about morphological forms and meanings (parts of speech, gender, number, case, mood, etc.). Morphological markup is required for searching words, word forms and constructions.
For the morphological markup of the Corpus of Russian Student Texts we have used the same tag set that was integrated into the Russian National Corpus (RNC). The full list of tags is available on the RNC website. The morphological markup is carried out automatically with the help of the morphological analyzer MYSTEM.
The multilayered system of error markup is currently being tested, after which the texts will be annotated manually. The error markup system includes:
Linguistic error types presented in (a) are divided into subtypes as follows:
Among the causes of mistakes presented in (b) one can point out contaminations of constructions and typos as the most transparent and the most common causes. In the future, the process of annotating the corpus texts may reveal other types of causes of mistakes.
It is important to stress that the suggested error classification is multilayered:
The Corpus of Russian Student Texts is intended for linguistic researchers, teachers of literary Russian and of Russian academic writing, and students.
To researchers, this resource is useful because it helps, on the one hand, to systematize the peculiarities of academic texts of novice authors. On the other hand, the Corpus of Russian Student Texts is an invaluable resource for the study of errors made by speakers of Russian, as well as for investigation of grammar, lexis, stylistics, sociolinguistics, and psycholinguistics.
To teachers of literary Russian and of Russian academic writing, the resource helps to see the main problems that a student faces while trying to master a new register of his/her first language. The methodological value of the resource is that teachers now have an opportunity to develop various exercises and linguistic training software on its basis.
To students of different majors, the resource offers an opportunity to familiarize themselves with typical language errors and their classification, as well as an opportunity to practise correcting them. Such exercises are expected to help students make fewer such errors when writing texts of their own.
The corpus was created by the Linguistic Laboratory of Corpus Technologies of National Research University Higher School of Economics:
Chiefs: Ekaterina Rakhilina, Natalia Zevakhina, Yulia Kuvshinskaya, Svetlana Dzakupova
Developers: Timofey Arkhangelskiy, Elmira Mustakimova, Evgeniy Glazunov
Team: HSE School of Linguistics Staff (who contributed to the text markup, research and development of CoRST): Yana Akhapkina, Alexander Letuchiy, Anna Levinzon, Anna Plissetskaya
HSE School of Linguistics Students (who contributed to the text markup, research and development of CoRST): Vlada Alexandrova, Tatyana Bolgina, Natalia Chukicheva, Nadezhda Grigor'eva, Anna Kondratyeva, Anastasiya Kostyanitsyna, Olga Kul'tepina, Mariya Kuznetsova, Violetta Ivanova, Anastasiya Makarovich, Daria Maximova, Maria Myslina, Maria Ob'yedkova, Svetlana Puzhaeva, Asya Simonyan, Olga Ramzaytseva, Alina Shaymardanova, Anna Vishenkova, Maria Zarifyan