Almaty Corpus of the Kazakh Language (ACKL)

The main version of the Almaty Corpus of the Kazakh Language is available on this website. The corpus currently contains more than 40 million word tokens. The texts were annotated with an automatic morphological analyzer, which parsed 86% of the word forms in the corpus. Homonymy has not been resolved: each word form is assigned every analysis that is possible without context.
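To make the two annotation properties above concrete (the share of parsed word forms, and unresolved homonymy), here is a minimal Python sketch. It is only an illustration: the `Token` class, the tag strings and the sample data are assumptions made for this example and do not reflect the corpus's actual storage format or tag set.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One word form with every analysis proposed for it (possibly none)."""
    form: str
    analyses: list = field(default_factory=list)


def coverage(tokens):
    """Share of tokens that received at least one analysis."""
    if not tokens:
        return 0.0
    parsed = sum(1 for t in tokens if t.analyses)
    return parsed / len(tokens)


def ambiguous(tokens):
    """Tokens that keep more than one analysis because homonymy is unresolved."""
    return [t for t in tokens if len(t.analyses) > 1]


# Hypothetical sample; the tag notation is invented for illustration only.
sample = [
    Token("ат", ["ат:N (horse)", "ат:N (name)", "ат:V.IMP (shoot)"]),
    Token("кітап", ["кітап:N (book)"]),
    Token("xyz", []),  # a form the analyzer could not parse
    Token("бар", ["бар:V.IMP (go)", "бар:PRED (there is)"]),
]

print(f"coverage: {coverage(sample):.0%}")
print("ambiguous:", [t.form for t in ambiguous(sample)])
```

In this toy sample three of the four forms are parsed, and two of them keep several competing analyses, which is the situation a user should expect when searching a corpus without disambiguation.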

ACKL is one possible version of the National Corpus of the Kazakh Language: a reference system built on an extensive collection of annotated texts in literary Kazakh, the state language of the Republic of Kazakhstan. The corpus is continually being expanded and updated, both quantitatively and qualitatively, and its search functionality is being significantly improved.

The main characteristics of ACKL are as follows:

In the long term, the following characteristics of ACKL are planned:

Short history and news of the project:

Work on the corpus project began in May 2012 with the support of G.M. Mutanov, rector of Al-Farabi KazNU. The corpus is being created by the Department of General Linguistics and Foreign Philology of the Faculty of Philology, Literary Studies and World Languages of Al-Farabi Kazakh National University, under the leadership of the head of the department, G.B. Madiyeva, with the participation of staff of the Faculty of Philology of the National Research University Higher School of Economics (Moscow).

In 2013 the size of the corpus was 650 thousand word tokens.

In 2015 work on ACKL was carried out under a scientific research grant from MES RK in the subpriority "Fundamental research in the field of social and economic sciences and humanities" (grant number 4769GF4). As a result, ACKL grew to 2 million word tokens.

Since September 21, 2015, LLC “RCO” has been taking part in improving the morphological annotation.

In early 2016 the full version of the Almaty Corpus of the Kazakh Language (ACKL) was posted on the website. The corpus comprised about 40 million word tokens.

Since October 2016 the Department of Computer Science of Al-Farabi KazNU has been taking part in developing the functions of ACKL, thanks to the work of M.E. Mansurova, Candidate of Physical and Mathematical Sciences and docent. On the Al-Farabi KazNU side, Zhantemir Ermekov is responsible for the information maintenance of the corpus.

At present, work on improving the ACKL database is carried out by the Department of General Linguistics and European Languages of the Faculty of Philology and World Languages of Al-Farabi Kazakh National University, under the leadership of Professor G.B. Madiyeva, Doctor of Philology, with the participation of staff of the Faculty of Philology of the National Research University Higher School of Economics (Moscow).

Participants of the project:

Staff of the Faculty of Philology of the National Research University Higher School of Economics (Moscow):
Mikhail Daniel
Timofey Arkhangelsky
Svetlana Toldova
Olga Lyashevskaya

The updated list of faculty members, doctoral candidates, undergraduates and students of the Department of General Linguistics and European Languages:
Zhanna Umatova
Manshuk Mambetova
Gulnaz Iskakova
Saule Bektemirova
Albina Dosanova
Ayauzhan Tausogarova
Gulnara Boribayeva
Anna Danchenko
Dinara Madiyeva
Gulnaz Nabiyeva
Aygerim Zhanaliyeva
Ulbosyn Parmanova
Aynur Baymurzin
Gulsara Bayzhanova
Zholdas Orysbayev
Nurgul Ismaylova
Feruza Moldasheva
Ulzhan Anarbekova
Fariza Hasilan
Asem Tokenova
Tansholpan Zhamalova
Zhazira Alisheva

We thank the staff of the Faculty of Philology of NRU HSE (Moscow), who contributed significantly to the creation of the Almaty Corpus of the Kazakh Language at Al-Farabi KazNU.

We express our appreciation to the Department of Information Technologies, headed by Zhanyl Mamykova, for its support of the work on ACKL, and to the library of Al-Farabi Kazakh National University and the publishing house "Kazakh University" for the electronic collection of scientific and literary texts provided to the project.