IIT Bombay English-Hindi Corpus (Version 1: Old)

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

File name: parallel.tgz

Source	Number of Segments
GNOME (Opus)	145,706
KDE4 (Opus)	97,227
Tanzil (Opus)	187,080
Tatoeba (Opus)	4,698
OpenSubs2013 (Opus)	4,222
HindEnCorp	273,885
Hindi-English Wordnet Linkage	175,175
Mahashabdkosh: Administrative Domain Dictionary	66,474
Mahashabdkosh: Administrative Domain Examples	46,825
Mahashabdkosh: Administrative Domain Definitions	46,523
TED talks	42,583
Indic multi-parallel corpus	10,349
Judicial domain corpus - I	5,007
Judicial domain corpus - II	3,727
Different Indian Government websites	123,360
Wiki Headlines	32,863
Book Translations (Gyaan-Nidhi Corpus)	227,123
Total	1,492,827
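
Because this page gives only the archive name, a reasonable first step after downloading is to list what the archive contains before extracting anything. The sketch below uses Python's standard tarfile module; the function name list_archive_members is illustrative, and no member file names are assumed.

```python
import tarfile


def list_archive_members(archive_path):
    """Return (name, size in bytes) for every regular file in a .tgz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Skip directories and other non-file entries.
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

Calling list_archive_members("parallel.tgz") on a downloaded copy returns the file names and sizes, which can then be passed to tarfile's extraction methods as needed.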

File name: dev_test.tgz

Set	Description	Number of Segments
Dev (use for tuning)	Newswire (from WMT 2014)	520
Test (use for final evaluation) (untokenized)	Newswire (from WMT 2014)	2,507

File name: dev_test_tokenized.tgz (same as dev_test.tgz, except that the test set is also tokenized)

Set	Description	Number of Segments
Dev (use for tuning)	Newswire (from WMT 2014)	520
Test (use for final evaluation)	Newswire (from WMT 2014)	2,507

File name: monolingual.hi.tgz

Source	Number of Sentences
BBC-new	18,098
BBC-old	135,171
HindMonoCorp	44,486,533
Health Domain	8,001
Tourism Domain	15,395
Wikipedia	259,305
Judicial Domain	152,776
Total	45,075,279
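
With roughly 45 million sentences, the monolingual corpus is large enough that it is usually better to stream it than to load it into memory. A minimal sketch, assuming the extracted data is UTF-8 plain text with one sentence per line (this page does not state the file layout, and the function name is illustrative):

```python
def count_sentences(path):
    """Count non-empty lines in a UTF-8 text file without loading it into memory."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            if line.strip():
                count += 1
    return count
```

If the one-sentence-per-line assumption holds, comparing the counts for the extracted files against the table above is a quick sanity check on a download.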

For English, you can use the monolingual corpora available on the WMT 2014 website.

The Hindi side of the training, dev, and test sets, as well as the monolingual corpus, has been normalized to ensure canonical Unicode representation using the Indic NLP Library.
The training and dev sets of the parallel corpus have been tokenized.

For English, the Moses tokenizer was used.
For Hindi, the Indic NLP tokenizer was used.
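
To make the effect of these steps concrete, here is a rough standard-library stand-in. It is not the Indic NLP Library or the Moses tokenizer: unicodedata.normalize applies only generic Unicode NFC normalization, while the Indic NLP normalizer may apply additional script-specific rules, and the punctuation set below is an assumption chosen for illustration.

```python
import re
import unicodedata

# Illustrative punctuation set (an assumption): the danda, double danda,
# and a few common ASCII punctuation marks.
_PUNCT = re.compile(r'([।॥,.;:!?()"])')


def normalize_text(text):
    """Bring text to Unicode NFC form so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def simple_tokenize(text):
    """Separate punctuation from words, then split on whitespace."""
    return _PUNCT.sub(r" \1 ", text).split()
```

This sketch splits on every listed punctuation mark, so it would also break up decimals such as 3.5; the actual tokenizers handle such cases with more care. For text that must match the released corpus, the tools named above are the ones to use.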

Version 1.0

Click this link to download the corpus.

This section lists the supplementary resources derived from the parallel corpus.

Xlit-IITB-Par: Hindi-English Transliteration Corpus
This is a corpus containing transliteration pairs for Hindi-English. These pairs were automatically mined from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. The corpus contains 68,922 pairs.

If you use this corpus or resources derived from it in your research, please cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. 2017 (under review at LREC 2018)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Contact

Anoop Kunchukuttan

Pratik Mehta

Pushpak Bhattacharyya