HSE Thai Corpus

This website gives access to the HSE Thai Corpus, a corpus of modern texts written in the Thai language. The texts, which contain 50 million tokens in total, were collected from various Thai websites (mostly news websites). Each token was assigned its English translation and a part-of-speech tag; other grammatical tags were also assigned where suitable. Because every recognized word is given its English translation, the HSE Thai Corpus can be used both by native speakers of Thai and by English-speaking users. It is a useful tool for linguists and for anyone interested in the Thai language. The corpus is suitable for lexical, syntactic and other synchronic studies and, thanks to its size, can provide researchers with a large amount of data. The corpus employs the search engine of the Eastern Armenian National Corpus (EANC). The user-friendly and flexible search system allows users to gather material by grammatical and POS tags as well as by translations and, of course, actual wordforms. To make it easier for non-Thai speakers to comprehend and use the texts in the corpus, we decided to separate the words in each sentence with spaces.

The Thai Corpus is being developed by a team of students of the HSE School of Linguistics in Moscow under the guidance of Professor Boris Orekhov. The team consisted of Grigory Ignatyev, Alexandra Ershova, Anna Kuznetsova, Tatyana Shalganova, Daniil Kolomeytsev and Nikolai Mikulin. Nadezhda Motina provided consultation on the Thai language. Natalia Filippova, Elizaveta Kuzmenko, Tatyana Gavrilova, Elena Krotova, Elmira Mustakimova, Olga Sozinova, Aleksandra Martynova, Maria Sheyanova, Marina Kustova and Julia Badryzlova also contributed to the project.

The data for the corpus was collected by means of Scrapy. The Pythai module was used to tokenize the texts. The tagging was based on the material of two English-Thai dictionaries: Online Thai Dictionary and Thai Dictionary 2. In total we have downloaded and tagged texts containing 200 million tokens (the corpus itself contains only 50 million). The collected texts are available in a .zip file. All materials and scripts connected to this project are available on GitHub. The visualization of repository activity is available here.
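
The dictionary-based tagging step can be illustrated with a minimal sketch. It is not the project's actual script: the three-entry lexicon, the tag names and the functions tag_tokens and spaced_sentence are all hypothetical, and the input is assumed to be already tokenized (in the project this was done with the Pythai module, whose interface is not reproduced here). The sketch shows how each token might receive a translation and a POS tag from a dictionary lookup, and how a sentence can then be written out with spaces between words.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class TaggedToken(NamedTuple):
    wordform: str
    translation: Optional[str]  # None when the word is not in the lexicon
    pos: Optional[str]          # None when the word is not in the lexicon


# Hypothetical miniature lexicon: wordform -> (English translation, POS tag).
# The real project drew this information from two English-Thai dictionaries.
LEXICON: Dict[str, Tuple[str, str]] = {
    "แมว": ("cat", "NOUN"),
    "กิน": ("eat", "VERB"),
    "ปลา": ("fish", "NOUN"),
}


def tag_tokens(tokens: List[str],
               lexicon: Dict[str, Tuple[str, str]]) -> List[TaggedToken]:
    """Attach a translation and POS tag to every token found in the lexicon."""
    tagged = []
    for token in tokens:
        entry = lexicon.get(token)
        if entry is None:
            # Unrecognized word: keep the wordform, leave the tags empty.
            tagged.append(TaggedToken(token, None, None))
        else:
            translation, pos = entry
            tagged.append(TaggedToken(token, translation, pos))
    return tagged


def spaced_sentence(tagged: List[TaggedToken]) -> str:
    """Join wordforms with spaces, as the corpus does for readability."""
    return " ".join(t.wordform for t in tagged)


if __name__ == "__main__":
    # Tokens are assumed to come from a tokenizer; here they are given by hand.
    tokens = ["แมว", "กิน", "ปลา"]
    for item in tag_tokens(tokens, LEXICON):
        print(item.wordform, item.translation, item.pos)
    print(spaced_sentence(tag_tokens(tokens, LEXICON)))
```

Keeping unrecognized words in the output with empty tags, rather than dropping them, means the spaced sentence still contains every wordform of the original text.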