Tools and Resources for Machine Translation
- Indic Language NLP library
The goal of this project is to build Python-based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library is an attempt to provide a general solution for the toolsets most commonly required for Indian language text.
The library provides the following modules for Indian languages:
- Text Unicode Normalization
- Transliteration between Indic scripts
- Morphology Analysis
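The script similarity mentioned above is visible in Unicode itself: the major Indic scripts occupy parallel 128-code-point blocks, so a rough script conversion can be done by shifting code points from one block to another. The sketch below illustrates that idea only. It is not the library's API, the names `SCRIPT_BLOCK_START` and `convert_script` are made up for this example, and it ignores letters that exist in one script but not in another.

```python
# Illustrative sketch only: not the Indic NLP library's own API.
# Start of the Unicode block for several Indic scripts; each block spans 0x80 code points.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Offsets of danda and double danda (U+0964, U+0965), which are shared
# across scripts and should not be shifted.
SHARED_OFFSETS = {0x64, 0x65}


def convert_script(text, source, target):
    """Shift every character of the source script block into the target block.

    Characters outside the source block (Latin letters, digits, spaces, ...)
    are copied unchanged. Gaps in the target block are not handled.
    """
    src = SCRIPT_BLOCK_START[source]
    tgt = SCRIPT_BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt + offset))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Devanagari "कमल" rendered in Gujarati script.
    print(convert_script("\u0915\u092e\u0932", "devanagari", "gujarati"))
```

A production transliterator needs script-specific handling on top of this offset mapping, which is the kind of detail a dedicated library takes care of.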
- CFILT-Preorder: Source-reordering system for English-Indian language translation
There are many structural divergences between Indian languages and English, the principal one being word order: Subject-Object-Verb for Indian languages and Subject-Verb-Object for English. This toolkit reorders a given English sentence to address these structural divergences, so that the word order of the modified English sentence conforms to the canonical word order of Indian languages. This transformation is useful for Machine Translation.
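To make the idea of source reordering concrete, here is a toy sketch that moves verbs to the end of their verb phrase and turns prepositions into postpositions on a hand-built parse tree. It does not reproduce CFILT-Preorder's actual rules or input format. The tree encoding, the labels (`S`, `VP`, `PP`, `V`, `P`, `NP`) and the function names are assumptions made for illustration.

```python
# Toy illustration of head-final reordering; not the CFILT-Preorder rule set.
# A node is (label, children) where children is either a string (leaf text)
# or a list of child nodes.

# Hypothetical rule table: within a phrase with this label, move children
# carrying the given head label to the end of the phrase.
HEAD_FINAL_RULES = {"VP": "V", "PP": "P"}


def reorder(node):
    """Return a copy of the tree with verbs and prepositions moved last."""
    label, children = node
    if isinstance(children, str):
        return node  # leaf: nothing to reorder
    children = [reorder(child) for child in children]
    head_label = HEAD_FINAL_RULES.get(label)
    if head_label is not None:
        heads = [c for c in children if c[0] == head_label]
        rest = [c for c in children if c[0] != head_label]
        children = rest + heads
    return (label, children)


def leaves(node):
    """Collect the leaf strings of a tree from left to right."""
    label, children = node
    if isinstance(children, str):
        return [children]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    tree = ("S", [
        ("NP", "Ram"),
        ("VP", [("V", "lives"), ("PP", [("P", "in"), ("NP", "Mumbai")])]),
    ])
    print(" ".join(leaves(reorder(tree))))  # Ram Mumbai in lives
```

The reordered output "Ram Mumbai in lives" follows the verb-final, postpositional order typical of Indian languages, which gives a phrase-based system more monotone alignments to learn from.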
- METEOR-Indic: METEOR for Indian Languages
Extensions for METEOR to support Indian languages. Synonyms come from Indian language WordNet and stemming is done using a WordNet-assisted stemmer.
Languages Supported: Hindi, Marathi. Other languages in IndoWordNet are also supported, but you will need access to the data to use METEOR-Indic for these languages.
- Job scripts for Moses
A simple experiment management system for Moses. It can do training, tuning and testing all at once and compute various evaluation metrics like BLEU, METEOR and TER. It contains scripts for batch-training of multiple MT systems.
For any queries/information, please contact Prof. Pushpak Bhattacharyya (firstname.lastname@example.org) or Anoop Kunchukuttan (email@example.com)
- Translation Resources
110 translation models for phrase-based SMT between the languages mentioned above. These include phrase tables, lexicalized reordering models and language models, along with learnt parameters.
- Transliteration Resources
Moses-based statistical transliteration system for some of the language pairs mentioned above. The resources in the download section are essentially character level translation models which can be used with the Moses decoder to transliterate between scripts. The transliteration systems have been learnt in an unsupervised manner from parallel corpora.
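Because these are character-level models, a word has to be presented to the decoder as a sequence of character tokens, and the decoder output has to be joined back into a word. The sketch below shows one plausible way to do that for a single word. The exact preprocessing expected by the downloadable models is not stated here, so treat the splitting scheme and the helper names `to_char_tokens` and `from_char_tokens` as assumptions.

```python
# Minimal sketch, assuming the models expect one Unicode code point per token.
# The helper names are hypothetical and not part of Moses.


def to_char_tokens(word):
    """Turn a single word into space-separated characters for the decoder."""
    return " ".join(word.strip())


def from_char_tokens(tokens):
    """Join space-separated characters from the decoder back into a word."""
    return "".join(tokens.split())


if __name__ == "__main__":
    source = "\u092e\u0941\u0902\u092c\u0908"  # the Hindi word for Mumbai
    decoder_input = to_char_tokens(source)
    print(decoder_input)
    # The round trip recovers the original word.
    print(from_char_tokens(decoder_input) == source)
```

Multi-word input would need an explicit word-boundary marker before splitting, since plain spaces are used here as the token separator.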