IIT Bombay English-Hindi Corpus
The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay, over the years. This page describes the corpus. The corpus was used at the Workshop on Asian Language Translation Shared Task in 2016 and 2017 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.
Version 2.0
Click this link to download the corpus. You can find a catalog of other English-Hindi and other Indian-language parallel corpora here: Indic NLP Catalog
Version | Date | Changes |
2.0 | March 2019 | Previous versions provided a tokenized dataset. This version is not tokenized, so users can process the corpus with the tools of their choice. We recommend the Indic NLP Library for tokenization. |
1.5 | December 2018 | Added the corpus 'Different Indian Government websites 2': around 70,000 sentence pairs. |
1.0 | July 2016 | Initial release. |
File name: parallel.tgz
Note: The file contains sentences from different sources in the same order as listed in the table below. So, you can select any subset as required.
Source | Number of Segments |
GNOME (Opus) | 145,706 |
KDE4 (Opus) | 97,227 |
Tanzil (Opus) | 187,080 |
Tatoeba (Opus) | 4,698 |
OpenSubs2013 (Opus) | 4,222 |
HindEnCorp | 273,885 |
Hindi-English Wordnet Linkage | 175,175 |
Mahashabdkosh: Administrative Domain Dictionary | 66,474 |
Mahashabdkosh: Administrative Domain Examples | 46,825 |
Mahashabdkosh: Administrative Domain Definitions | 46,523 |
TED talks | 42,583 |
Indic multi-parallel corpus | 10,349 |
Judicial domain corpus - I | 5,007 |
Judicial domain corpus - II | 3,727 |
Different Indian Government websites | 123,360 |
Wiki Headlines | 32,863 |
Book Translations (Gyaan-Nidhi Corpus) | 227,123 |
Different Indian Government websites 2 | 69,013 |
Total | 1,561,840 |
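Because the sources appear in the file in table order, the segment counts above are enough to work out which lines belong to which source. The following is a minimal sketch of that idea, not an official tool. It assumes one segment per line and that the downloaded files match the counts in the table exactly; `source_ranges`, `select_lines`, and the file names in the usage comment are hypothetical names chosen for illustration.

```python
from itertools import accumulate, islice

# Segment counts copied from the table above, in the order listed.
SOURCES = [
    ("GNOME (Opus)", 145_706),
    ("KDE4 (Opus)", 97_227),
    ("Tanzil (Opus)", 187_080),
    ("Tatoeba (Opus)", 4_698),
    ("OpenSubs2013 (Opus)", 4_222),
    ("HindEnCorp", 273_885),
    ("Hindi-English Wordnet Linkage", 175_175),
    ("Mahashabdkosh: Administrative Domain Dictionary", 66_474),
    ("Mahashabdkosh: Administrative Domain Examples", 46_825),
    ("Mahashabdkosh: Administrative Domain Definitions", 46_523),
    ("TED talks", 42_583),
    ("Indic multi-parallel corpus", 10_349),
    ("Judicial domain corpus - I", 5_007),
    ("Judicial domain corpus - II", 3_727),
    ("Different Indian Government websites", 123_360),
    ("Wiki Headlines", 32_863),
    ("Book Translations (Gyaan-Nidhi Corpus)", 227_123),
    ("Different Indian Government websites 2", 69_013),
]


def source_ranges(sources=SOURCES):
    """Map each source name to its half-open (start, end) line range, 0-indexed."""
    ends = list(accumulate(count for _, count in sources))
    starts = [0] + ends[:-1]
    return {name: (start, end) for (name, _), start, end in zip(sources, starts, ends)}


def select_lines(lines, name, ranges=None):
    """Lazily yield the lines of an iterable that fall in the named source's range."""
    start, end = (ranges or source_ranges())[name]
    return islice(lines, start, end)


# Usage sketch (the file names are assumptions, not taken from this page):
# with open("train.en", encoding="utf-8") as en, open("train.hi", encoding="utf-8") as hi:
#     ted_pairs = list(zip(select_lines(en, "TED talks"), select_lines(hi, "TED talks")))
```

It is worth checking that the line count of the extracted files equals the table total (1,561,840) before relying on these offsets, since any mismatch would shift every range after the first discrepancy.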
File name: dev_test.tgz
Set | Description | Number of Segments |
Dev (use for tuning) | Newswire (from WMT 2014) | 520 |
Test (use for final evaluation) | Newswire (from WMT 2014) | 2,507 |
File name: monolingual.hi.tgz
Source | Number of Sentences |
BBC-new | 18,098 |
BBC-old | 135,171 |
HindMonoCorp | 44,486,496 |
Health Domain | 8,001 |
Tourism Domain | 15,395 |
Wikipedia | 259,305 |
Judicial Domain | 152,776 |
Total | 45,075,242 |
For English, you can use the monolingual corpora on the WMT 2014 website.
- The Hindi side of the training, dev, and test sets, as well as the monolingual corpus, has been normalized using the Indic NLP Library to ensure a canonical Unicode representation.
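To illustrate why a canonical Unicode representation matters, here is a minimal sketch using Python's standard `unicodedata` module. It is only a stand-in: the corpus itself was normalized with the Indic NLP Library, whose script-specific rules are not reproduced here, and `to_canonical` is a hypothetical helper name.

```python
import unicodedata


def to_canonical(text: str) -> str:
    """Return the NFC form of text (a stdlib stand-in, not the Indic NLP normalizer)."""
    return unicodedata.normalize("NFC", text)


# Two different code point sequences for the same Devanagari letter QA.
precomposed = "\u0958"       # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093c"  # DEVANAGARI LETTER KA followed by SIGN NUKTA

# The raw strings compare unequal even though they render identically,
# which would split one word into two vocabulary entries.
assert precomposed != decomposed

# After normalization both spellings map to the same sequence.
assert to_canonical(precomposed) == to_canonical(decomposed)
```

If you add your own Hindi data to the corpus, applying the same normalizer to it keeps the vocabulary consistent with the released files.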
This section lists the supplementary resources derived from the parallel corpus.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus
This is a corpus containing transliteration pairs for Hindi-English. These pairs were automatically mined from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. The corpus contains 68,922 pairs. This has been created from v1 of the corpus.
If you use this corpus or its derivative resources in your research, please cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya.
The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The corpora we compiled from other sources are available under their respective licenses.