IIT Bombay English-Hindi Corpus (Version 1: Old)
The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay, over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.
File name: parallel.tgz
Source | Number of Segments
GNOME (Opus) | 145,706
KDE4 (Opus) | 97,227
Tanzil (Opus) | 187,080
Tatoeba (Opus) | 4,698
OpenSubs2013 (Opus) | 4,222
HindEnCorp | 273,885
Hindi-English Wordnet Linkage | 175,175
Mahashabdkosh: Administrative Domain Dictionary | 66,474
Mahashabdkosh: Administrative Domain Examples | 46,825
Mahashabdkosh: Administrative Domain Definitions | 46,523
TED talks | 42,583
Indic multi-parallel corpus | 10,349
Judicial domain corpus - I | 5,007
Judicial domain corpus - II | 3,727
Different Indian Government websites | 123,360
Wiki Headlines | 32,863
Book Translations (Gyaan-Nidhi Corpus) | 227,123
Total | 1,492,827
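As a quick consistency check on the table above, the per-source counts can be summed and compared with the reported total. The sketch below copies the figures from this page; the names `PARALLEL_SEGMENTS`, `REPORTED_TOTAL`, and `summarize` are illustrative only and are not part of the corpus distribution.

```python
# Per-source segment counts, copied from the table on this page.
PARALLEL_SEGMENTS = {
    "GNOME (Opus)": 145_706,
    "KDE4 (Opus)": 97_227,
    "Tanzil (Opus)": 187_080,
    "Tatoeba (Opus)": 4_698,
    "OpenSubs2013 (Opus)": 4_222,
    "HindEnCorp": 273_885,
    "Hindi-English Wordnet Linkage": 175_175,
    "Mahashabdkosh: Administrative Domain Dictionary": 66_474,
    "Mahashabdkosh: Administrative Domain Examples": 46_825,
    "Mahashabdkosh: Administrative Domain Definitions": 46_523,
    "TED talks": 42_583,
    "Indic multi-parallel corpus": 10_349,
    "Judicial domain corpus - I": 5_007,
    "Judicial domain corpus - II": 3_727,
    "Different Indian Government websites": 123_360,
    "Wiki Headlines": 32_863,
    "Book Translations (Gyaan-Nidhi Corpus)": 227_123,
}
REPORTED_TOTAL = 1_492_827  # the "Total" row of the table


def summarize(counts):
    """Return the overall total and each source's share of it, largest first."""
    total = sum(counts.values())
    shares = sorted(
        ((name, n / total) for name, n in counts.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return total, shares


total, shares = summarize(PARALLEL_SEGMENTS)
assert total == REPORTED_TOTAL  # the listed sources add up to the stated total

# Print the three largest contributors with their share of the corpus.
for name, share in shares[:3]:
    print(f"{name}: {share:.1%}")
```

The same check applies to the monolingual table by swapping in its counts and total.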
File name: dev_test.tgz
Set | Description | Number of Segments
Dev (use for tuning) | Newswire (from WMT 2014) | 520
Test (use for final evaluation) (untokenized) | Newswire (from WMT 2014) | 2,507
File name: dev_test_tokenized.tgz
(same as dev_test.tgz, except that the test set is also tokenized)
Set | Description | Number of Segments
Dev (use for tuning) | Newswire (from WMT 2014) | 520
Test (use for final evaluation) | Newswire (from WMT 2014) | 2,507
File name: monolingual.hi.tgz
Source | Number of Sentences
BBC-new | 18,098
BBC-old | 135,171
HindMonoCorp | 44,486,533
Health Domain | 8,001
Tourism Domain | 15,395
Wikipedia | 259,305
Judicial Domain | 152,776
Total | 45,075,279
You can use the monolingual corpora on the WMT 2014 website for English.
- The Hindi side of the training, dev, and test sets, as well as the monolingual corpus, has been normalized using the Indic NLP Library to ensure a canonical Unicode representation.
- The training and dev sets of the parallel corpus have been tokenized.
  - For English, the Moses tokenizer was used.
  - For Hindi, the Indic NLP tokenizer was used.
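To illustrate why a canonical Unicode representation matters for Hindi text, the sketch below shows two different code point sequences for the same Devanagari letter comparing equal only after normalization. It uses NFC from Python's standard `unicodedata` module as a stand-in; this is an assumption made for illustration. The Indic NLP Library's normalizer applies its own script-specific rules, which are not reproduced here, and the helper name `canonicalize` is hypothetical.

```python
import unicodedata


def canonicalize(text: str) -> str:
    """Map canonically equivalent code point sequences to a single form (NFC).

    Stand-in only: the corpus itself was normalized with the Indic NLP Library.
    """
    return unicodedata.normalize("NFC", text)


# Two encodings of the same Devanagari letter (QA).
precomposed = "\u0958"       # a single code point
decomposed = "\u0915\u093C"  # the letter KA followed by the nukta sign

# As raw strings the two sequences are different.
print(precomposed == decomposed)

# After normalization both map to the same sequence.
print(canonicalize(precomposed) == canonicalize(decomposed))
```

Without a step like this, identical-looking words can end up as distinct vocabulary entries in a translation or language model.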
Version 1.0
Click this link to download the corpus
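Once the archives are downloaded, their contents can be inspected before extraction. This page does not describe the internal file layout, so the sketch below only lists what each archive contains. It assumes the archives sit in the current directory under the file names given on this page; `list_members` is an illustrative helper, not part of the distribution.

```python
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the names of the regular files stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Archive names as given on this page; any that are not present are skipped.
ARCHIVES = (
    "parallel.tgz",
    "dev_test.tgz",
    "dev_test_tokenized.tgz",
    "monolingual.hi.tgz",
)

for name in ARCHIVES:
    if Path(name).exists():
        print(name)
        for member in list_members(name):
            print("  ", member)
```

Listing first avoids unpacking the large monolingual archive just to find out which files it holds.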
This section lists the supplementary resources derived from the parallel corpus.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus
This is a corpus containing transliteration pairs for Hindi-English. These pairs were automatically mined from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. The corpus contains 68,922 pairs.
If you use this corpus or resources derived from it in your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya.
The IIT Bombay English-Hindi Parallel Corpus. 2017 (under review at LREC 2018)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.