Amharic Corpus

The page you are currently viewing is a web interface for the pilot version of the Amharic corpus. The corpus currently contains about 23 million tokens. The texts have been automatically annotated with a part-of-speech analyzer. The corpus is disambiguated, i.e. each token is annotated with a single appropriate analysis.

Most tokens in the current version of the corpus come from news texts. The remaining texts are blogs and nonfiction (Wikipedia articles and essays). Eventually we intend to increase the number and diversity of texts and to add fiction to the corpus.

The latest update

May 25th, 2016.

Created by

Maria Obedkova, under the guidance of Boris Orekhov, as part of a project of the HSE School of Linguistics.

Web interface

This corpus uses the search platform of the Eastern Armenian National Corpus (EANC). You can read about making search queries on the EANC help page.