IIT Bombay English-Hindi Corpus

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

Version 3.0

Click this link to download the corpus. You can find a catalog of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog
3.0 (August 2020): Added the corpus 'Different Indian Government websites 3': around 47,000 sentence pairs.
2.0 (March 2019): Previous versions provided a tokenized dataset. From this version on, the dataset is not tokenized, so users can process the corpus with the tools of their choice. We recommend using the Indic NLP Library for tokenization.
1.5 (December 2018): Added the corpus 'Different Indian Government websites 2': around 70,000 sentence pairs.
1.0 (July 2016): Initial release.
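Since the release notes recommend the Indic NLP Library for tokenization, a minimal sketch of tokenizing one Hindi sentence might look like the following. It assumes the library is installed (for example with pip install indic-nlp-library) and relies on its trivial_tokenize function. The helper name tokenize_hindi is illustrative and is not part of the corpus distribution.

```python
def tokenize_hindi(sentence):
    """Split one Hindi sentence into tokens with the Indic NLP Library.

    Assumption: the indic-nlp-library package is installed. The import is
    kept inside the function so that a script using this helper can still
    be loaded on a machine where the package is missing.
    """
    from indicnlp.tokenize import indic_tokenize

    # lang="hi" selects Hindi; the call returns a list of token strings.
    return indic_tokenize.trivial_tokenize(sentence, lang="hi")
```

Calling tokenize_hindi on each line of the Hindi side of the corpus returns a list of token strings per sentence. The English side would need a separate English tokenizer of your choice.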
File name: parallel.tgz

Note: The file contains sentences from different sources in the same order as listed in the table below. So, you can select any subset as required.

Source                                            Number of Segments
GNOME (Opus)                                      145,706
KDE4 (Opus)                                       97,227
Tanzil (Opus)                                     187,080
Tatoeba (Opus)                                    4,698
OpenSubs2013 (Opus)                               4,222
HindEnCorp                                        273,885
Hindi-English Wordnet Linkage                     175,175
Mahashabdkosh: Administrative Domain Dictionary   66,474
Mahashabdkosh: Administrative Domain Examples     46,825
Mahashabdkosh: Administrative Domain Definitions  46,523
TED talks                                         42,583
Indic multi-parallel corpus                       10,349
Judicial domain corpus - I                        5,007
Judicial domain corpus - II                       3,727
Different Indian Government websites              123,360
Wiki Headlines                                    32,863
Book Translations (Gyaan-Nidhi Corpus)            227,123
Different Indian Government websites 2            69,013
Different Indian Government websites 3            47,842
Total                                             1,609,682
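Because the sources are concatenated in the listed order, the line range of any one source follows from the running sum of the segment counts. The sketch below illustrates this. It assumes the archive unpacks to a pair of line-aligned plain-text files (one English, one Hindi) with one segment per line. The function names and the file paths passed to read_subset are hypothetical, and the counts should be re-checked against the version you download.

```python
from itertools import islice

# Segment counts copied from the table above, in the listed order.
SOURCE_SIZES = [
    ("GNOME (Opus)", 145_706),
    ("KDE4 (Opus)", 97_227),
    ("Tanzil (Opus)", 187_080),
    ("Tatoeba (Opus)", 4_698),
    ("OpenSubs2013 (Opus)", 4_222),
    ("HindEnCorp", 273_885),
    ("Hindi-English Wordnet Linkage", 175_175),
    ("Mahashabdkosh: Administrative Domain Dictionary", 66_474),
    ("Mahashabdkosh: Administrative Domain Examples", 46_825),
    ("Mahashabdkosh: Administrative Domain Definitions", 46_523),
    ("TED talks", 42_583),
    ("Indic multi-parallel corpus", 10_349),
    ("Judicial domain corpus - I", 5_007),
    ("Judicial domain corpus - II", 3_727),
    ("Different Indian Government websites", 123_360),
    ("Wiki Headlines", 32_863),
    ("Book Translations (Gyaan-Nidhi Corpus)", 227_123),
    ("Different Indian Government websites 2", 69_013),
    ("Different Indian Government websites 3", 47_842),
]


def source_ranges(sizes):
    """Map each source name to its 0-based, half-open (start, end) line range."""
    ranges = {}
    start = 0
    for name, count in sizes:
        ranges[name] = (start, start + count)
        start += count
    return ranges


def read_subset(en_path, hi_path, source, sizes=SOURCE_SIZES):
    """Yield (english, hindi) pairs for one source.

    Assumption: en_path and hi_path are line-aligned, one segment per line.
    """
    start, end = source_ranges(sizes)[source]
    with open(en_path, encoding="utf-8") as en, open(hi_path, encoding="utf-8") as hi:
        # islice skips the first `start` pairs and stops before pair `end`.
        for en_line, hi_line in islice(zip(en, hi), start, end):
            yield en_line.rstrip("\n"), hi_line.rstrip("\n")
```

For example, read_subset(en_path, hi_path, "TED talks") would yield only the TED talks pairs, given the paths of the English and Hindi files. Reading the files as streams avoids loading all 1.6 million pairs into memory.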
File name: dev_test.tgz

Set                              Description               Number of Segments
Dev (use for tuning)             Newswire (from WMT 2014)  520
Test (use for final evaluation)  Newswire (from WMT 2014)  2,507
File name: monolingual.hi.tgz

Source           Number of Sentences
BBC-new          18,098
BBC-old          135,171
HindMonoCorp     44,486,496
Health Domain    8,001
Tourism Domain   15,395
Wikipedia        259,305
Judicial Domain  152,776
Total            45,075,242

For English, you can use the monolingual corpora on the WMT 2014 website.

This section lists the supplementary resources derived from the parallel corpus.

If you use this corpus or its derived resources for your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The corpora we compiled from other sources are available under their respective licenses.

Anoop Kunchukuttan

anoop.kunchukuttan@gmail.com

Pratik Mehta

pratikmehta1494@gmail.com

Pushpak Bhattacharyya

pb@cse.iitb.ac.in