IIT Bombay English-Hindi Corpus

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

File name: parallel.tgz

Source                                              Number of Segments
GNOME (Opus)                                        145,706
KDE4 (Opus)                                         97,227
Tanzil (Opus)                                       187,080
Tatoeba (Opus)                                      4,698
OpenSubs2013 (Opus)                                 4,222
HindEnCorp                                          273,885
Hindi-English Wordnet Linkage                       175,175
Mahashabdkosh: Administrative Domain Dictionary     66,474
Mahashabdkosh: Administrative Domain Examples       46,825
Mahashabdkosh: Administrative Domain Definitions    46,523
TED talks                                           42,583
Indic multi-parallel corpus                         10,349
Judicial domain corpus - I                          5,007
Judicial domain corpus - II                         3,727
Different Indian Government websites                123,360
Wiki Headlines                                      32,863
Book Translations (Gyaan-Nidhi Corpus)              227,123
Total                                               1,492,827
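
A parallel corpus of this kind is typically consumed as aligned sentence pairs. The following is a minimal sketch, assuming the extracted archive holds two line-aligned UTF-8 text files, one per language, with one segment per line. That layout is an assumption, not something stated on this page, and the function names and the file paths passed to them are hypothetical.

```python
from itertools import zip_longest


def read_parallel(src_lines, tgt_lines):
    """Yield (source, target) segment pairs from two line-aligned iterables.

    Raises ValueError if one side runs out of lines before the other,
    since that would mean the two sides are not aligned.
    """
    pairs = zip_longest(src_lines, tgt_lines)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            raise ValueError(f"not line-aligned: mismatch at line {line_no}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_parallel_files(src_path, tgt_path):
    """Stream pairs from two files (paths are supplied by the caller)."""
    with open(src_path, encoding="utf-8") as src_file, \
         open(tgt_path, encoding="utf-8") as tgt_file:
        yield from read_parallel(src_file, tgt_file)
```

Because both functions are generators, the segments are streamed rather than loaded at once, which keeps memory use flat for a corpus of about 1.5 million pairs.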

File name: dev_test.tgz

Set                                             Description               Number of Segments
Dev (use for tuning)                            Newswire (from WMT 2014)  520
Test (use for final evaluation) (untokenized)   Newswire (from WMT 2014)  2,507

File name: dev_test_tokenized.tgz (same as dev_test.tgz, except that the test set is also tokenized)

Set                              Description               Number of Segments
Dev (use for tuning)             Newswire (from WMT 2014)  520
Test (use for final evaluation)  Newswire (from WMT 2014)  2,507

File name: monolingual.hi.tgz

Source           Number of Sentences
BBC-new          18,098
BBC-old          135,171
HindMonoCorp     44,486,533
Health Domain    8,001
Tourism Domain   15,395
Wikipedia        259,305
Judicial Domain  152,776
Total            45,075,279
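
With roughly 45 million sentences, the monolingual data is best processed as a stream. The sketch below counts sentences without loading the file into memory. It assumes one sentence per line in a UTF-8 text file and treats blank lines as non-sentences. Both are assumptions about the extracted data, and the function names are hypothetical.

```python
def count_sentences(lines):
    """Count non-blank lines in any iterable of strings, one line at a time."""
    return sum(1 for line in lines if line.strip())


def count_sentences_in_file(path):
    """Count sentences in a text file (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return count_sentences(handle)
```

Comparing the result against the total in the table is a quick way to confirm that a download and extraction completed, provided the one-sentence-per-line assumption holds.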

For English monolingual data, you can use the monolingual corpora available on the WMT 2014 website.

Version 1.0

Click this link to download the corpus.

If you use this corpus for your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. 2017 (under review at LREC 2018).

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Anoop Kunchukuttan

anoop.kunchukuttan@gmail.com

Pratik Mehta

pratikmehta1494@gmail.com

Pushpak Bhattacharyya

pb@cse.iitb.ac.in