Corpus of Modern Greek

The page you are viewing is a web interface to the Corpus of Modern Greek, a convenient online tool for researching or studying Modern Greek. The corpus is essentially a collection of texts enhanced with various kinds of annotation and a search engine. Using the panel on your right, you may specify a search query; the results will appear in the middle of the window after you press the “Search” button. The search engine allows for queries like “find all usage examples of a given word form or lexeme”, “find all sentences where the word Y follows the word X at a distance of 2 to 5 words”, “find all occurrences of the genitive case after prepositions”, etc. You can learn more about the search possibilities and the query language on the help page of the Eastern Armenian National Corpus, whose search platform we are using.
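To make the distance query concrete, here is a minimal Python sketch of what “Y follows X at a distance of 2 through 5” could mean over a tokenized sentence. It is only an illustration, not the corpus search engine: the function name `find_at_distance` and its parameters are hypothetical, and it assumes that adjacent words are at distance 1.

```python
def find_at_distance(tokens, x, y, min_dist=2, max_dist=5):
    """Return (i, j) index pairs where tokens[j] == y follows tokens[i] == x
    and min_dist <= j - i <= max_dist.

    Assumption: distance is counted in tokens, adjacent words being at distance 1.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != x:
            continue
        # Only inspect positions inside the allowed distance window.
        last = min(i + max_dist, len(tokens) - 1)
        for j in range(i + min_dist, last + 1):
            if tokens[j] == y:
                hits.append((i, j))
    return hits


sentence = "το βιβλίο που διάβασα χθες ήταν το καλύτερο".split()
print(find_at_distance(sentence, "το", "ήταν"))
```

In the example sentence, “ήταν” occurs five tokens after the first “το”, so that pair is reported; the second “το” has no “ήταν” after it and yields nothing.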

Currently, the corpus contains approximately 35.7 million tokens. Most texts come from contemporary Greek newspapers (Η Καθημερινή, Μακεδονία, Το Βήμα, Ελευθεροτυπία), but there are also fiction, poetic, official, scientific, and religious texts, both original and translated, written in the 19th or 20th century.

All texts have been morphologically annotated, which means that each word is provided with a lemma (dictionary form) and a set of morphosyntactic tags (such as case values, number values, etc.), all of which can be used in a search query. The morphological annotation was carried out with the help of a digital grammatical dictionary compiled by Maxim Kisilier and Timofey Arkhangelskiy, and a morphological analyzer called UniParser. Since the annotation and (partly) the compilation of the dictionary were automatic rather than manual, the grammatical tags may contain occasional mistakes; we are currently working on improving the quality of the annotation. There is no disambiguation in the corpus, i.e. each word is annotated with all possible analyses regardless of context.
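The practical effect of having no disambiguation can be sketched in Python. The classes, the `matches` function, and the tag names below are hypothetical and do not reflect the actual corpus format or tag set. The sketch only shows one plausible way to model a token that carries several analyses, and a search that succeeds when any single analysis satisfies the query.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    lemma: str       # dictionary form
    tags: frozenset  # morphosyntactic tags (hypothetical names)


@dataclass
class Token:
    form: str
    analyses: list = field(default_factory=list)  # all analyses, not disambiguated


def matches(token, lemma=None, tags=()):
    """True if at least one analysis has the lemma (if given) and all required tags."""
    required = set(tags)
    return any(
        (lemma is None or a.lemma == lemma) and required <= a.tags
        for a in token.analyses
    )


# Illustrative, non-exhaustive analyses: a neuter plural form that can be
# read as nominative or as accusative.
grammata = Token("γράμματα", [
    Analysis("γράμμα", frozenset({"N", "nom", "pl"})),
    Analysis("γράμμα", frozenset({"N", "acc", "pl"})),
])

print(matches(grammata, tags={"acc"}))         # accusative reading exists
print(matches(grammata, tags={"gen"}))         # no genitive reading
print(matches(grammata, tags={"nom", "acc"}))  # no single analysis has both
```

Under this model, a search for accusative forms returns the word even in contexts where it is actually nominative, which is what the absence of disambiguation implies for search results.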

The Corpus of Modern Greek is being developed with the support of the “Corpus linguistics” program of the Russian Academy of Sciences. We are grateful to the authors of the Eastern Armenian National Corpus (EANC) who allowed us to use their search engine and web interface.