IIT Bombay English-Hindi Corpus
About
We also provide this parallel corpus via the HuggingFace Datasets repository. Please visit this link for details:
IITB English-Hindi Parallel Corpus on HuggingFace
You can also find some useful instructions and sample code for the usage of this corpus here:
https://github.com/cfiltnlp/IITB-English-Hindi-PC
Downloading the corpus
Version 3.1
Click this link to download the corpus. A catalog of other English-Hindi and Indian-language parallel corpora is available here: Indic NLP Catalog
Version Info
Version | Date | Changes |
---|---|---|
3.1 | December 2021 | Added 49,400 sentence pairs to the parallel corpus. Version 3.0 had added the corpus 'Different Indian Government websites 3': around 47,000 sentence pairs. |
2.0 | March 2019 | Previous versions provided a tokenized dataset. This dataset is not tokenized, so users can process the corpus with the tokenizer of their choice. We recommend using the Indic NLP Library for tokenization. |
1.5 | December 2018 | Added the corpus 'Different Indian Government websites 2': around 70,000 sentence pairs. |
1.0 | July 2016 | Initial release. |
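Because the corpus has been distributed untokenized since version 2.0, each user has to pick a tokenizer. The sketch below is only a rough, standard-library stand-in to show what that step involves; it is not the Indic NLP Library tokenizer recommended above, and the punctuation set and the `tokenize` name are assumptions made for illustration.

```python
import re

# Characters emitted as standalone tokens. The danda (।) and double danda (॥)
# are Hindi sentence-final marks. This set is an illustrative assumption.
_PUNCT = "।॥,.!?;:()\"'"

# A token is either a run of non-space, non-punctuation characters
# or a single punctuation character.
_TOKEN_RE = re.compile(rf"[^\s{_PUNCT}]+|[{_PUNCT}]")


def tokenize(sentence: str) -> list[str]:
    """Split a sentence on whitespace and separate the listed punctuation."""
    return _TOKEN_RE.findall(sentence)


print(tokenize("यह एक वाक्य है।"))
print(tokenize("Hello, world!"))
```

The pattern avoids `\w` on purpose: Devanagari vowel signs are combining marks, which `\w` does not match, so a `\w+` tokenizer would split Hindi words in the middle.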
Parallel Corpus
Training Set
Source | Number of Segments |
---|---|
GNOME (Opus) | 145,706 |
KDE4 (Opus) | 97,227 |
Tanzil (Opus) | 187,080 |
Tatoeba (Opus) | 4,698 |
OpenSubs2013 (Opus) | 4,222 |
HindEnCorp | 273,885 |
Hindi-English Wordnet Linkage | 175,175 |
Mahashabdkosh: Administrative Domain Dictionary | 66,474 |
Mahashabdkosh: Administrative Domain Examples | 46,825 |
Mahashabdkosh: Administrative Domain Definitions | 46,523 |
TED talks | 42,583 |
Indic multi-parallel corpus | 10,349 |
Judicial domain corpus - I | 5,007 |
Judicial domain corpus - II | 3,727 |
Different Indian Government websites | 123,360 |
Wiki Headlines | 32,863 |
Book Translations (Gyaan-Nidhi Corpus) | 227,123 |
Different Indian Government websites 2 | 69,013 |
Different Indian Government websites 3 | 47,842 |
Different Indian Government websites 4 | 49,400 |
Total | 1,659,082 |
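As a quick consistency check, the per-source counts in the table above can be summed and ranked. The sketch below copies the figures from the table; the variable and function names are illustrative only.

```python
# Segment counts copied from the training set table.
segments = {
    "GNOME (Opus)": 145_706,
    "KDE4 (Opus)": 97_227,
    "Tanzil (Opus)": 187_080,
    "Tatoeba (Opus)": 4_698,
    "OpenSubs2013 (Opus)": 4_222,
    "HindEnCorp": 273_885,
    "Hindi-English Wordnet Linkage": 175_175,
    "Mahashabdkosh: Administrative Domain Dictionary": 66_474,
    "Mahashabdkosh: Administrative Domain Examples": 46_825,
    "Mahashabdkosh: Administrative Domain Definitions": 46_523,
    "TED talks": 42_583,
    "Indic multi-parallel corpus": 10_349,
    "Judicial domain corpus - I": 5_007,
    "Judicial domain corpus - II": 3_727,
    "Different Indian Government websites": 123_360,
    "Wiki Headlines": 32_863,
    "Book Translations (Gyaan-Nidhi Corpus)": 227_123,
    "Different Indian Government websites 2": 69_013,
    "Different Indian Government websites 3": 47_842,
    "Different Indian Government websites 4": 49_400,
}

total = sum(segments.values())


def largest_sources(n: int) -> list[tuple[str, int]]:
    """Return the n sources with the most segments, largest first."""
    return sorted(segments.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(f"total segments: {total:,}")
for name, count in largest_sources(3):
    print(f"{name}: {count:,} ({count / total:.1%})")
```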
Dev and Test Sets
Set | Description | Number of Segments |
---|---|---|
Dev (use for tuning) | Newswire (from WMT 2014) | 520 |
Test (use for final evaluation) | Newswire (from WMT 2014) | 2,507 |
Hindi Monolingual Corpus
Source | Number of Sentences |
---|---|
BBC-new | 18,098 |
BBC-old | 135,171 |
HindMonoCorp | 44,486,496 |
Health Domain | 8,001 |
Tourism Domain | 15,395 |
Wikipedia | 259,305 |
Judicial Domain | 152,776 |
Hindi Corpus from Release 3.1 | 49,400 |
Total | 45,124,642 |
English Monolingual Corpus
Pre-processing
- The Hindi side of the training, dev, and test sets, as well as the monolingual corpus, has been normalized to ensure canonical Unicode representation using the Indic NLP Library.
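To illustrate why canonical representation matters, the sketch below uses Python's built-in `unicodedata` NFC normalization as a generic stand-in. This is an assumption made for illustration: it is not the Indic NLP Library normalizer that was applied to the corpus, which may perform additional script-specific steps, and the `normalize_hindi` name is hypothetical.

```python
import unicodedata


def normalize_hindi(text: str) -> str:
    """Return text in Unicode NFC form (generic stand-in, not Indic NLP)."""
    return unicodedata.normalize("NFC", text)


# Two encodings of the same Devanagari letter:
# precomposed QA (U+0958) versus KA (U+0915) followed by NUKTA (U+093C).
precomposed = "\u0958"
decomposed = "\u0915\u093c"

print(precomposed == decomposed)
print(normalize_hindi(precomposed) == normalize_hindi(decomposed))
```

The two spellings render identically but compare unequal as raw strings; after normalization they compare equal. New text should go through the same normalizer as the corpus before being matched against it.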
Supplementary Resources
- Xlit-IITB-Par: Hindi-English Transliteration Corpus
This is a corpus of Hindi-English transliteration pairs, mined automatically from the IIT Bombay English-Hindi Parallel Corpus using the Moses Transliteration Module. It contains 68,922 pairs and was created from v1 of the corpus.
Citing the corpus
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.
BibTeX
@inproceedings{kunchukuttan-etal-2018-iit,
title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
author = "Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L18-1548",
}
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. The corpora we compiled from other sources are available under their respective licenses.