RLC: Russian Learner Corpus


The Russian Learner Corpus (RLC) comprises texts produced by two categories of non-standard speakers of Russian: learners of Russian as a foreign language and speakers of Heritage Russian. The corpus contains both oral and written production and supports searching by morphological properties and by a variety of deviations from Standard Russian, ranging from mistakes in orthography and grammar to non-standard use of lexical and syntactic constructions.

What is RLC?

The texts for the Corpus were kindly provided by researchers and teachers from Russia and the USA: the Polinsky Language Sciences Lab, Evgeny Dengub, Irina Dubinina, Anna Mikhaylova, Alla Smyslova, and the Center of Russian as a Foreign Language at the Higher School of Economics.

Data on Heritage Russian oral production include the results of experimental studies: frog stories (based on the methodology developed by Berman & Slobin 1994; Slobin 2004) and narratives based on a short cartoon (“Nu pogodi!”) (see Isurin & Ivanova-Sullivan (2008) and Polinsky (2008) for more details).

RLC includes RULEC, a longitudinal subcorpus of Academic Writing collected by Olessya Kisselev and Anna Alsufieva of Portland State University. The preliminary linguistic analysis and tagging were done by members of the Heritage Russian Research Group (Higher School of Economics), with technical support provided by Timofey Arkhangelsky and Elmira Mustakimova.


Each text in the Corpus is annotated with background information.

  1. Mandatory fields
    1. Oral / Written
    2. Author’s gender
    3. Author’s language background (Heritage / L2)
    4. Author’s dominant language
    5. Author’s proficiency in Russian
  2. Optional fields
    1. Date
    2. Genre
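
As an illustration only, the fields listed above could be modelled as a small record type. The sketch below is in Python and is not the corpus's actual storage format or search interface: the class name `TextMetadata`, the helper `select`, the field names, and all sample values are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical record mirroring the background fields listed above."""
    # Mandatory fields
    mode: str                # "oral" or "written"
    gender: str
    background: str          # "heritage" or "L2"
    dominant_language: str
    proficiency: str         # the label set is an assumption
    # Optional fields
    date: Optional[str] = None
    genre: Optional[str] = None


def select(texts, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        t for t in texts
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]


# Invented sample records, not real corpus data.
sample = [
    TextMetadata("written", "f", "heritage", "English", "advanced", genre="essay"),
    TextMetadata("oral", "m", "L2", "English", "intermediate"),
    TextMetadata("oral", "f", "heritage", "English", "intermediate", date="2012"),
]

heritage_oral = select(sample, mode="oral", background="heritage")
print(len(heritage_oral))
```

Giving the optional fields a default of `None` keeps the distinction between mandatory and optional fields visible in the type itself: a record cannot be built without the five mandatory values, while date and genre may be left out.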

A more elaborate system of text marking is used in RULEC:

  1. student’s name (pseudonym),
  2. gender,
  3. language background and language experience of the student (L2 or HL),
  4. student’s linguistic level (established through external tests),
  5. time stamp (week and academic year when the paper was written),
  6. time limit under which the paper was written (timed or non-timed),
  7. text type (one paragraph or a long research paper),
  8. text function (e.g. narration, argumentation), and
  9. whether a paper was written individually or in a group.

Using RLC

Because it comprises texts from two different groups of non-standard speakers of Russian, RLC is a valuable source for studies in second language acquisition, second language teaching, language interference, and theoretical linguistics.

The corpus data and the flexible search system provide a sound basis for comparative research on Heritage and L2 production and enable deeper insight into complex phenomena such as non-standard use of Russian aspect, case, and prepositional phrases, as well as lexical and semantic misuse in multi-word constructions.

Besides revealing much about non-standard Russian, RLC is a powerful tool for uncovering new facets of Standard Russian grammar: deviations in language use help expose subtle rules that have previously gone unnoticed.