Albanian National Corpus (2011–2016 version)

Attention! The Corpus is moving to a new platform!

Dear users!

We are pleased to present a new version of the Albanian National Corpus, which is based on the tsakorpus platform and located at http://albanian.web-corpora.net/. From now on, the new version will be actively developed and expanded, while the old version published on this site will no longer be supported. We encourage you to use the new version of the Corpus and to report any errors you find in it.

This website contains the Albanian National Corpus, which currently includes about 20 million tokens. Fiction and journalistic texts are provided with user-friendly morphological markup consisting of tags assigned to individual tokens. Other kinds of markup are planned for the near future.

The Albanian National Corpus is designed for anyone with an interest in the Albanian language. Both professional linguists and those interested in Albanian and its history, whether through their occupation or out of curiosity, can find useful reference information in the Corpus. A well-annotated and representative Corpus makes it possible to quickly process large amounts of language material provided with translations and other linguistic information. The material compiled in the National Corpus can be used for lexical and grammatical studies, as well as for research into the changes that Albanian has undergone over previous centuries.

Please note that the Albanian National Corpus is currently under development. The addition of texts, the extension of the grammatical wordlist and further morphological markup are still under way. Work on ambiguity detection and resolution will start in the near future. The Corpus team will also tackle other important tasks, such as creating a subcorpus of Albanian oral discourse and adding further Albanian texts from various periods of the language’s history, as well as written and spoken texts in dialects of Albanian.

The Albanian National Corpus employs the search engine of the Eastern Armenian National Corpus (EANC). The Albanian Corpus is being developed by a team of Saint-Petersburg linguists: Maria Morozova, Marina Domosiletskaya, Alexander Rusakov, Ekaterina Bernatskaya, Anastasia Sidko and Anna Konovalenko. Maksim Makartsev (Moscow), Darya Alekseeva (Saint-Petersburg), Varvara Diveeva (Saint-Petersburg) and Qerim Ondozi (Prishtina) are among those who took part in text selection and processing. The morphological parsing tool UniParser was developed by Timofey Arkhangelskiy (Moscow). Ongoing consulting is provided by Mikhail Daniel (Moscow), a participant in various corpus development projects. The Corpus team is grateful to the publishing houses “Onufri” and “OM” for their help in the selection of texts.

The Albanian National Corpus is being created with the financial support of the “Corpus linguistics” program of the Presidium of the Russian Academy of Sciences.