IIT Bombay English-Hindi Corpus

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

Version 2.0

Click this link to download the corpus
2.0 (March 2019): Previous versions provided a tokenized dataset. This version is not tokenized, so systems can process the corpus as they choose.
1.5 (December 2018): Added the corpus 'Different Indian Government websites 2' (around 70,000 sentence pairs).
1.0 (July 2016): Initial release.
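
Because version 2.0 ships untokenized text, each system has to bring its own tokenizer. The following is a minimal sketch of one possible choice, not the tokenization used in earlier releases: it splits on whitespace and detaches a small, hand-picked set of punctuation marks, including the Devanagari danda. The function name `simple_tokenize` and the punctuation set are assumptions made for illustration.

```python
import re
from typing import List

# Punctuation to split off from adjacent words (an illustrative set, not exhaustive).
# \u0964 is the Devanagari danda and \u0965 the double danda.
_PUNCT = re.compile(r'([.,!?;:"()\u0964\u0965])')


def simple_tokenize(sentence: str) -> List[str]:
    """Pad the listed punctuation marks with spaces, then split on whitespace."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return spaced.split()
```

The sketch deliberately avoids `\w`-based patterns: in Python, `\w` does not match Devanagari vowel signs, so such patterns would break Hindi words apart. A dedicated tokenizer for Indian languages is likely a better fit for real experiments.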
File name: parallel.tgz

Source Number of Segments
GNOME (Opus) 145,706
KDE4 (Opus) 97,227
Tanzil (Opus) 187,080
Tatoeba (Opus) 4,698
OpenSubs2013 (Opus) 4,222
HindEnCorp 273,885
Hindi-English Wordnet Linkage 175,175
Mahashabdkosh: Administrative Domain Dictionary 66,474
Mahashabdkosh: Administrative Domain Examples 46,825
Mahashabdkosh: Administrative Domain Definitions 46,523
TED talks 42,583
Indic multi-parallel corpus 10,349
Judicial domain corpus - I 5,007
Judicial domain corpus - II 3,727
Different Indian Government websites 123,360
Wiki Headlines 32,863
Book Translations (Gyaan-Nidhi Corpus) 227,123
Different Indian Government websites 2 69,013
Total 1,561,840
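
As a quick sanity check, the per-source counts in the table can be summed and compared with the reported total. The sketch below copies the numbers from the table; the names `PARALLEL_SEGMENTS`, `REPORTED_TOTAL`, and `summarize` are introduced here for illustration only.

```python
# Segment counts copied from the parallel.tgz table.
PARALLEL_SEGMENTS = {
    "GNOME (Opus)": 145_706,
    "KDE4 (Opus)": 97_227,
    "Tanzil (Opus)": 187_080,
    "Tatoeba (Opus)": 4_698,
    "OpenSubs2013 (Opus)": 4_222,
    "HindEnCorp": 273_885,
    "Hindi-English Wordnet Linkage": 175_175,
    "Mahashabdkosh: Administrative Domain Dictionary": 66_474,
    "Mahashabdkosh: Administrative Domain Examples": 46_825,
    "Mahashabdkosh: Administrative Domain Definitions": 46_523,
    "TED talks": 42_583,
    "Indic multi-parallel corpus": 10_349,
    "Judicial domain corpus - I": 5_007,
    "Judicial domain corpus - II": 3_727,
    "Different Indian Government websites": 123_360,
    "Wiki Headlines": 32_863,
    "Book Translations (Gyaan-Nidhi Corpus)": 227_123,
    "Different Indian Government websites 2": 69_013,
}
REPORTED_TOTAL = 1_561_840


def summarize(counts):
    """Return the total segment count and each source's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(PARALLEL_SEGMENTS)
largest = max(shares, key=shares.get)
print(f"{total:,} segments; largest source: {largest} ({shares[largest]:.1%})")
```

The shares can be useful when deciding whether to down-weight or filter individual sources, since the domains differ widely (software localization, religious text, dictionaries, government websites).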
File name: dev_test.tgz

Set Description Number of Segments
Dev (use for tuning) Newswire (from WMT 2014) 520
Test (use for final evaluation) Newswire (from WMT 2014) 2,507
File name: monolingual.hi.tgz

Source Number of Sentences
BBC-new 18,098
BBC-old 135,171
HindMonoCorp 44,486,496
Health Domain 8,001
Tourism Domain 15,395
Wikipedia 259,305
Judicial Domain 152,776
Total 45,075,242
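
With roughly 45 million sentences, the monolingual archive may be easier to inspect by streaming it than by unpacking it first. The sketch below uses Python's standard `tarfile` module to count lines in every regular file of a `.tgz` archive. The archive's internal file names are not listed on this page, so the code iterates over whatever members it finds. UTF-8 encoding and one sentence per line are assumptions, and `count_lines_in_archive` is a name introduced here.

```python
import io
import tarfile
from typing import Dict


def count_lines_in_archive(archive_path: str) -> Dict[str, int]:
    """Return {member name: number of lines} for each regular file in a .tgz archive."""
    counts = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            raw = archive.extractfile(member)
            # Assumption: members are UTF-8 text files with one sentence per line.
            with io.TextIOWrapper(raw, encoding="utf-8", errors="replace") as text:
                counts[member.name] = sum(1 for _ in text)
    return counts
```

For example, `count_lines_in_archive("monolingual.hi.tgz")` would return per-file line counts that can be compared with the table, provided the files follow the assumed layout.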
You can use the monolingual corpora on the WMT 2014 website for English.

This section lists the supplementary resources derived from the parallel corpus.

If you use this corpus or its derived resources for your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The corpora we compiled from other sources are available under their respective licenses.

Anoop Kunchukuttan

anoop.kunchukuttan@gmail.com

Pratik Mehta

pratikmehta1494@gmail.com

Pushpak Bhattacharyya

pb@cse.iitb.ac.in