IIT Bombay English-Hindi Corpus

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay, over the years. This page describes the corpus, which will be used for the first time in the Workshop on Asian Language Translation 2016 (WAT 2016) Shared Task for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

File name: parallel.tgz

Source                                            Number of Segments
GNOME (Opus)                                                 145,706
KDE4 (Opus)                                                   97,227
Tanzil (Opus)                                                187,080
Tatoeba (Opus)                                                 4,698
OpenSubs2013 (Opus)                                            4,222
HindEnCorp                                                   273,885
Hindi-English Wordnet Linkage                                175,175
Mahashabdkosh: Administrative Domain Dictionary               66,474
Mahashabdkosh: Administrative Domain Examples                 46,825
Mahashabdkosh: Administrative Domain Definitions              46,523
TED talks                                                     42,583
Indic multi-parallel corpus                                   10,349
Judicial domain corpus - I                                     5,007
Judicial domain corpus - II                                    3,727
Different Indian Government websites                         123,360
Wiki Headlines                                                32,863
Book Translations (Gyaan-Nidhi Corpus)                       227,123
Total                                                      1,492,827
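
As a quick sanity check on the table above, the per-source segment counts can be summed and compared with the stated total. The sketch below is purely illustrative: the names `PARALLEL_SEGMENTS`, `STATED_TOTAL` and `share` are hypothetical and not part of the corpus release; the numbers are copied from the table.

```python
# Per-source segment counts for parallel.tgz, copied from the table above.
# All names here are illustrative; nothing in this block ships with the corpus.
PARALLEL_SEGMENTS = {
    "GNOME (Opus)": 145_706,
    "KDE4 (Opus)": 97_227,
    "Tanzil (Opus)": 187_080,
    "Tatoeba (Opus)": 4_698,
    "OpenSubs2013 (Opus)": 4_222,
    "HindEnCorp": 273_885,
    "Hindi-English Wordnet Linkage": 175_175,
    "Mahashabdkosh: Administrative Domain Dictionary": 66_474,
    "Mahashabdkosh: Administrative Domain Examples": 46_825,
    "Mahashabdkosh: Administrative Domain Definitions": 46_523,
    "TED talks": 42_583,
    "Indic multi-parallel corpus": 10_349,
    "Judicial domain corpus - I": 5_007,
    "Judicial domain corpus - II": 3_727,
    "Different Indian Government websites": 123_360,
    "Wiki Headlines": 32_863,
    "Book Translations (Gyaan-Nidhi Corpus)": 227_123,
}
STATED_TOTAL = 1_492_827  # the "Total" row of the table


def share(source: str) -> float:
    """Return the fraction of all parallel segments contributed by one source."""
    return PARALLEL_SEGMENTS[source] / sum(PARALLEL_SEGMENTS.values())


computed_total = sum(PARALLEL_SEGMENTS.values())
largest = max(PARALLEL_SEGMENTS, key=PARALLEL_SEGMENTS.get)

print("sum matches stated total:", computed_total == STATED_TOTAL)
print(f"largest source: {largest} ({share(largest):.1%} of all segments)")
```

The same check applies to any other per-source table on this page by swapping in its counts.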

File name: dev_test.tgz

Set                                            Description               Number of Segments
Dev (use for tuning)                           Newswire (from WMT 2014)                 520
Test (use for final evaluation) (untokenized)  Newswire (from WMT 2014)               2,507

File name: dev_test_tokenized.tgz (same as dev_test.tgz, except that the test set is also tokenized)

Set                              Description               Number of Segments
Dev (use for tuning)             Newswire (from WMT 2014)                 520
Test (use for final evaluation)  Newswire (from WMT 2014)               2,507
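
This page does not describe the file layout inside the archives, so the following is only a sketch. It assumes that each set is unpacked into two plain-text UTF-8 files, one per language, with one segment per line and the two files aligned by line number. The function name `read_parallel` and every file name in the demo are hypothetical.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple

_MISSING = object()  # sentinel marking that one file ended before the other


def read_parallel(en_path: Path, hi_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (english, hindi) segment pairs from two line-aligned UTF-8 files.

    Assumption (not confirmed by the corpus page): line i of en_path is the
    translation of line i of hi_path. Raises ValueError on a length mismatch.
    """
    with open(en_path, encoding="utf-8") as en_file, \
         open(hi_path, encoding="utf-8") as hi_file:
        for en_line, hi_line in zip_longest(en_file, hi_file, fillvalue=_MISSING):
            if en_line is _MISSING or hi_line is _MISSING:
                raise ValueError("the two files have different numbers of lines")
            yield en_line.rstrip("\n"), hi_line.rstrip("\n")


if __name__ == "__main__":
    # Tiny demo on throwaway files; real paths depend on the archive layout.
    with tempfile.TemporaryDirectory() as tmp:
        en = Path(tmp) / "demo.en"
        hi = Path(tmp) / "demo.hi"
        en.write_text("Hello\nThank you\n", encoding="utf-8")
        hi.write_text("नमस्ते\nधन्यवाद\n", encoding="utf-8")
        for pair in read_parallel(en, hi):
            print(pair)
```

If the layout assumption holds, counting the pairs yielded for the dev set should give 520, and 2,507 for the test set, matching the tables above.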

File name: monolingual.hi.tgz

Source              Number of Sentences
BBC-new                          18,098
BBC-old                         135,171
HindMonoCorp                 44,486,533
Health Domain                     8,001
Tourism Domain                   15,395
Wikipedia                       259,305
Judicial Domain                 152,776
Total                        45,075,279

You can use the monolingual corpora on the WMT 2014 website for English.

To access the corpus, WAT 2016 participants should sign the following agreement, scan it, and send it to the address mentioned in it. Once we complete verification, we will send you the links to download the parallel corpus along with the development and test data. Please check the WAT website for further details about shared task registration, evaluation, etc.

If you use this corpus for your research, kindly cite it as follows:

Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. IIT Bombay English-Hindi Corpus. http://www.cfilt.iitb.ac.in/iitb_parallel/

Anoop Kunchukuttan

anoop.kunchukuttan@gmail.com

Pratik Mehta

pratikmehta1494@gmail.com

Pushpak Bhattacharyya

pb@cse.iitb.ac.in

  1. HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation. Ondrej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Stranak, Vit Suchomel, Aleš Tamchyna and Daniel Zeman. In Proc. of LREC 2014. Reykjavik, Iceland. ISBN 978-2-9517408-8-4. ELRA. 2014.
  2. The Indic multi-parallel corpus. Lexi Birch, Chris Callison-Burch, Miles Osborne and Matt Post. http://homepages.inf.ed.ac.uk/miles/babel.html, 2011.