Resources

The following resources are freely available for research purposes only. If you use these resources, please cite the relevant papers.

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus.

We provide the unique word list from the IndoWordnet (IWN) knowledge base. The word list for each language in IWN is available in a separate file, with one word per line.
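As a minimal sketch of how such a file could be read, the Python function below loads one word list into a set. The function name `load_word_list` and the UTF-8 encoding are assumptions made for illustration; neither is stated by the resource itself.

```python
def load_word_list(path):
    """Read a one-word-per-line file into a set of unique words.

    Assumes the file is UTF-8 encoded (an assumption, not a documented fact).
    """
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()  # drop the newline and surrounding spaces
            if word:             # ignore blank lines
                words.add(word)
    return words
```

For example, `load_word_list("hindi.txt")` would return the unique entries of a hypothetical file named `hindi.txt`. A set is used so that membership checks are fast and any repeated lines collapse into one entry.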

The Universal Word - Hindi dictionary is being developed at CFILT, IIT Bombay for the purpose of machine translation. Users can search for Hindi and English words and phrases. The lexicon also provides the grammatical, morphological, and semantic attributes of the Hindi words. This version contains

Hindi polarity-labeled corpora for the movie domain.

Hindi polarity-labeled corpora for the tourism domain.

Marathi polarity-labeled corpora for the tourism domain.

Polarity-labeled, sense-annotated corpora.

Hindi-Marathi corpus generated from the English-Marathi corpora of PIB, PMI, and Tatoeba using the Google Translate API.

Gaze data collection and annotation for the task of Cognition-aware Cognate Detection.

Dataset repository released with the LREC 2020 publication that introduces the challenge dataset for cognate detection and false friend detection among Indian languages.

Dataset created for a multilingual Query Intent Detection system for Indian languages, namely Hindi, Marathi, and Bengali.

Dataset created for generating full-length answers from factoid answers.

This repository contains the dataset created for extracting N-ary Cross-sentence Relations using a Constrained Subsequence Kernel.

This repository contains the knowledge graphs created for two data sources: aircraft accident investigation reports and operation and maintenance manuals.

Tool for searching Saint Tukaram's abhangas.

This system allows reading and writing mail in Indian languages using Pine as the message-handling tool. Pine is the University of Washington's Program for Internet News and Email.

The goal of the Indic NLP Library is to build Python-based libraries for common text processing and Natural Language Processing tasks in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library is an attempt to provide a general solution for the toolsets most commonly required for Indian language text.

The Hindi WordNet is a system for bringing together different lexical and semantic relations between Hindi words. It organizes lexical information in terms of word meanings and can be described as a lexicon based on psycholinguistic principles. The design of the Hindi WordNet is inspired by the famous English WordNet.

The Sanskrit WordNet is based on the idea of the English WordNet. It is more than a conventional Sanskrit dictionary. It gives different relations between synsets, or synonym sets, which represent unique concepts.

The Marathi WordNet is based on the idea of the English WordNet. It is more than a conventional Marathi dictionary. It gives different relations between synsets, or synonym sets, which represent unique concepts.