IIT Bombay English-Hindi Corpus

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

Version 2.0

Click this link to download the corpus. You can find a catalog of other English-Hindi and other Indian language parallel corpora here: Indic NLP Catalog

2.0	March 2019	Previous versions provided a tokenized dataset. This version is not tokenized, so users can process the corpus with the tools of their choice. We recommend using the Indic NLP Library for tokenization.
1.5	December 2018	Added the corpus 'Different Indian Government websites 2': around 70,000 sentence pairs.
1.0	July 2016	Initial release.

File name: parallel.tgz

Note: The file contains sentences from different sources in the same order as listed in the table below. So, you can select any subset as required.

Source	Number of Segments
GNOME (Opus)	145,706
KDE4 (Opus)	97,227
Tanzil (Opus)	187,080
Tatoeba (Opus)	4,698
OpenSubs2013 (Opus)	4,222
HindEnCorp	273,885
Hindi-English Wordnet Linkage	175,175
Mahashabdkosh: Administrative Domain Dictionary	66,474
Mahashabdkosh: Administrative Domain Examples	46,825
Mahashabdkosh: Administrative Domain Definitions	46,523
TED talks	42,583
Indic multi-parallel corpus	10,349
Judicial domain corpus - I	5,007
Judicial domain corpus - II	3,727
Different Indian Government websites	123,360
Wiki Headlines	32,863
Book Translations (Gyaan-Nidhi Corpus)	227,123
Different Indian Government websites 2	69,013
Total	1,561,840
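
Because the sources appear in the file in the same order as the table above, the line range of each source can be derived from the cumulative segment counts. The following is a minimal sketch of that idea, not part of the official distribution. It assumes that the extracted training files contain exactly one segment per line and that their line counts match the table; the function names and any file path passed to `read_source` are hypothetical.

```python
from itertools import islice

# Segment counts copied from the table above, in the same order.
SOURCE_COUNTS = [
    ("GNOME (Opus)", 145706),
    ("KDE4 (Opus)", 97227),
    ("Tanzil (Opus)", 187080),
    ("Tatoeba (Opus)", 4698),
    ("OpenSubs2013 (Opus)", 4222),
    ("HindEnCorp", 273885),
    ("Hindi-English Wordnet Linkage", 175175),
    ("Mahashabdkosh: Administrative Domain Dictionary", 66474),
    ("Mahashabdkosh: Administrative Domain Examples", 46825),
    ("Mahashabdkosh: Administrative Domain Definitions", 46523),
    ("TED talks", 42583),
    ("Indic multi-parallel corpus", 10349),
    ("Judicial domain corpus - I", 5007),
    ("Judicial domain corpus - II", 3727),
    ("Different Indian Government websites", 123360),
    ("Wiki Headlines", 32863),
    ("Book Translations (Gyaan-Nidhi Corpus)", 227123),
    ("Different Indian Government websites 2", 69013),
]


def source_ranges(counts):
    """Map each source name to a half-open (start, end) range of 0-based line indices."""
    ranges = {}
    start = 0
    for name, n in counts:
        ranges[name] = (start, start + n)
        start += n
    return ranges


def read_source(path, name, counts=SOURCE_COUNTS):
    """Yield the lines belonging to one source from a one-segment-per-line file.

    `path` is a hypothetical path to one side (English or Hindi) of the
    extracted training set; the actual file names inside parallel.tgz
    are not stated on this page.
    """
    start, end = source_ranges(counts)[name]
    with open(path, encoding="utf-8") as f:
        for line in islice(f, start, end):
            yield line.rstrip("\n")
```

Applying the same range to both sides of the parallel corpus should keep the sentence pairs aligned. It is worth checking that the total line count of the extracted files equals 1,561,840 before relying on these offsets.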

File name: dev_test.tgz

Set	Description	Number of Segments
Dev (use for tuning)	Newswire (from WMT 2014)	520
Test (use for final evaluation)	Newswire (from WMT 2014)	2,507

File name: monolingual.hi.tgz

Source	Number of Sentences
BBC-new	18,098
BBC-old	135,171
HindMonoCorp	44,486,496
Health Domain	8,001
Tourism Domain	15,395
Wikipedia	259,305
Judicial Domain	152,776
Total	45,075,242

For English, you can use the monolingual corpora available on the WMT 2014 website.

The Hindi side of the training, dev, and test sets, as well as the monolingual corpus, has been normalized using the Indic NLP Library to ensure a canonical Unicode representation.
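
New Hindi text that is combined with this corpus may need comparable normalization so that visually identical strings also match at the code point level. The sketch below illustrates the general idea using Unicode NFC normalization from Python's standard `unicodedata` module. This is a stand-in chosen for illustration, not the Indic NLP Library normalizer used for the corpus, which may apply additional script-specific rules; the helper names are hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Return the NFC (canonical composition) form of the text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """Check whether the text is already in NFC form."""
    return to_nfc(text) == text


# The precomposed letter QA (U+0958) and the sequence KA + NUKTA
# (U+0915 U+093C) render identically but differ as code point sequences.
precomposed = "\u0958"
decomposed = "\u0915\u093c"

# Under NFC both spellings end up as the same code point sequence.
same_after_normalization = to_nfc(precomposed) == to_nfc(decomposed)
```

Without a normalization step of this kind, two spellings of the same word would be treated as different tokens by downstream tools.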

This section lists the supplementary resources derived from the parallel corpus.

Xlit-IITB-Par: Hindi-English Transliteration Corpus
This corpus contains Hindi-English transliteration pairs that were automatically mined from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. It contains 68,922 pairs and was created from v1 of the corpus.

If you use this corpus or its derived resources for your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The corpora we compiled from other sources are available under their respective licenses.

Contact

Anoop Kunchukuttan

Pratik Mehta

Pushpak Bhattacharyya