Resource Center for Indian Language Technology Solutions


The core activities of the CFILT are divided into the following groups:
Hindi WordNet is an online lexical database for the Hindi language. Its basic building block is the synonym set, or synset; each synset conveys a unique concept in the language. Synsets are linked to one another through semantic relations such as hypernymy, hyponymy, meronymy, holonymy, entailment and troponymy. Distinctive features include graded antonymy and meronymy relations, and linkage across parts of speech. Compounding, a phenomenon characteristic of Indian languages, is also handled. The WordNet currently contains more than 8500 synsets and aims to cover all the common words of the language. It has an efficient underlying database design. The data entry interface is implemented in Java/JFC, and the web interface for querying the Hindi WordNet is implemented in the PHP 4 scripting language. Hindi WordNet is now at a stable stage of development, and other Indian languages can use it as a base for building their own WordNets.
Click here for the ppt file.
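
The synset-and-relation structure described above can be illustrated with a minimal Python sketch. It is not the actual Hindi WordNet schema: the class names (Synset, WordNet), the method names and the three sample synsets are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """One concept: a set of synonymous words plus a short gloss."""
    synset_id: int
    words: List[str]
    gloss: str
    # Relation name -> ids of the synsets it points to.
    relations: Dict[str, List[int]] = field(default_factory=dict)


class WordNet:
    """Hypothetical in-memory store; the real system sits on a database."""

    def __init__(self) -> None:
        self.synsets: Dict[int, Synset] = {}

    def add(self, synset: Synset) -> None:
        self.synsets[synset.synset_id] = synset

    def link(self, source: int, relation: str, target: int, inverse: str) -> None:
        """Record a relation and its inverse (e.g. hypernymy/hyponymy)."""
        self.synsets[source].relations.setdefault(relation, []).append(target)
        self.synsets[target].relations.setdefault(inverse, []).append(source)

    def lookup(self, word: str) -> List[Synset]:
        """Return every synset (sense) that contains the word."""
        return [s for s in self.synsets.values() if word in s.words]

    def related(self, synset_id: int, relation: str) -> List[List[str]]:
        """Return the word lists of the synsets reached by one relation."""
        ids = self.synsets[synset_id].relations.get(relation, [])
        return [self.synsets[i].words for i in ids]


# Illustrative entries only, not taken from the Hindi WordNet data.
wn = WordNet()
wn.add(Synset(1, ["पेड़", "वृक्ष"], "tree"))
wn.add(Synset(2, ["वनस्पति"], "plant, vegetation"))
wn.add(Synset(3, ["पत्ता", "पत्ती"], "leaf"))
wn.link(1, "hypernymy", 2, inverse="hyponymy")  # a tree is a kind of plant
wn.link(1, "meronymy", 3, inverse="holonymy")   # a leaf is a part of a tree
```

Storing each relation together with its inverse keeps queries symmetric: wn.related(1, "hypernymy") and wn.related(2, "hyponymy") each answer from a single dictionary lookup.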



Work on the Marathi WordNet has just started at the Centre for Indian Language Technology, Indian Institute of Technology Bombay. It is being created with the Hindi WordNet as its base, because Marathi and Hindi descend from the same parent language, Sanskrit, and share the same script, Devanagari. The basic approach is insertion, deletion and translation of words in the Hindi synsets.
Click here for the ppt file.
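
The insertion, deletion and translation steps can be pictured with a short Python sketch. The function name derive_synset, its parameters and the sample words are assumptions for illustration; they are not taken from either WordNet or from the project's tools.

```python
def derive_synset(hindi_words, translate, delete=(), insert=()):
    """Build a Marathi synset from a Hindi one.

    translate: Hindi word -> Marathi word; words missing from the map are
               kept unchanged, which covers vocabulary the languages share
    delete:    Hindi words with no Marathi counterpart
    insert:    Marathi synonyms that have no Hindi source word
    """
    marathi = []
    for word in hindi_words:
        if word in delete:
            continue
        candidate = translate.get(word, word)
        if candidate not in marathi:      # avoid duplicate members
            marathi.append(candidate)
    for word in insert:
        if word not in marathi:
            marathi.append(word)
    return marathi


# Illustrative words only (all meaning "tree").
hindi_tree = ["पेड़", "वृक्ष", "दरख़्त"]
marathi_tree = derive_synset(
    hindi_tree,
    translate={hindi_tree[0]: "झाड"},  # translation
    delete={hindi_tree[2]},            # deletion
    insert=["तरु"],                    # insertion
)
```

In this sketch the shared word is carried over unchanged, so only the differences between the two synsets need to be entered by hand.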


The activities of the language translation group can be further divided into dictionary creation, standardization of the lexicon, enconversion and deconversion.

English-Hindi Dictionary According To Universal Language Specifications

An English-Hindi dictionary is being built according to the UNL specifications for use in machine translation. Users can look up the Hindi meanings of English words. The dictionary also provides more than 200 grammatical and semantic attributes of the Hindi words. The first version contains about 80,000 words, chosen from the common vocabulary needed for machine translation from English to Hindi.


Standardization of lexicon

Because the dictionary is one of the most important resources in the enconversion and deconversion process, it is essential that all the language dictionaries under development conform to common standards. By standardization of the dictionary we mean that the concepts present in all the languages should be represented uniformly in all the language dictionaries, and that the semantic attributes used in the different dictionaries should also be uniform.

For standardization of the dictionaries, UNU, Tokyo has provided a Knowledge Base in the form of a hierarchy of concepts. We have created a set of semantic attributes to be used in all the language dictionaries under development, and these attributes have also been incorporated into the Knowledge Base. Our task, then, is to map each entry of the dictionary to one of the concepts in the Knowledge Base.

For the Noun part of the dictionary (i.e. for all the noun entries of the dictionary), a program has been written which allows the user to quickly select a concept from the Knowledge Base. Efforts are being made to automatically standardize the Verb, Adjective and Adverb parts of the dictionary.

Click here for the ppt file
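
The mapping step for noun entries can be sketched in Python as follows. This is an illustration under stated assumptions, not the program mentioned above: the fragment of the concept hierarchy, the UW-style labels such as crane(icl>bird) and the function names are all hypothetical. The sketch lists the concepts that share an entry's headword so that a user can pick one quickly, then tags the entry with the chosen concept and its ancestors.

```python
# Hypothetical fragment of a concept hierarchy: concept -> parent concept.
KB_PARENT = {
    "thing": None,
    "animal": "thing",
    "bird": "animal",
    "crane(icl>bird)": "bird",
    "machine": "thing",
    "crane(icl>machine)": "machine",
}


def path_to_root(concept, parent=KB_PARENT):
    """Return the concept followed by its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = parent[concept]
    return path


def candidate_concepts(headword, parent=KB_PARENT):
    """Return the concepts whose headword (the text before '(') matches."""
    return sorted(c for c in parent if c.split("(")[0] == headword)


def standardize(entry, concept, parent=KB_PARENT):
    """Return a copy of the dictionary entry tagged with the chosen concept."""
    if concept not in parent:
        raise KeyError(f"unknown concept: {concept}")
    ancestors = path_to_root(concept, parent)[1:]
    return {**entry, "concept": concept, "ancestors": ancestors}


entry = {"headword": "crane", "pos": "N"}
options = candidate_concepts(entry["headword"])
tagged = standardize(entry, options[0])  # the user's choice would go here
```

Because every entry ends up pointing at a node of one shared hierarchy, two dictionaries that tag an entry with the same concept describe it uniformly, whatever the language.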


Automatic Generation of UW-Dictionary through WordNet

The Universal Word (UW) Dictionary plays an important role in machine translation: it is used by both the enconverter and the deconverter software of the UNL. The main difficulty lies in generating it, because creating every entry of the UW Dictionary takes considerable manual effort. A word can have multiple senses, and each sense can carry a substantial number of semantic attributes. Automatic generation of the Universal Word Dictionary would therefore be a significant step, saving a large amount of time and manual effort.

In WordNet, an attempt is made to organize the lexical information in terms of word meanings, rather than word forms. It organizes nouns, verbs, and adjectives into synonym sets, each representing one underlying concept. Different relations are used to link the synonym sets. This organization of lexical information makes the WordNet an important knowledge base for the Automatic generation of "Universal Word Dictionary".

Click here for the ppt file
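
One way to picture the idea is the Python sketch below, which produces one candidate UW entry per WordNet sense by restricting the word with its hypernym. The miniature WordNet, the function names and the use of the icl restriction for every sense are assumptions for illustration; the actual generation method and the semantic attributes it assigns are not specified here.

```python
# Hypothetical miniature WordNet: synset id -> words, part of speech, hypernym.
SYNSETS = {
    "bank.1": {"words": ["bank"], "pos": "N", "hypernym": "institution.1"},
    "bank.2": {"words": ["bank"], "pos": "N", "hypernym": "slope.1"},
    "institution.1": {"words": ["institution", "establishment"], "pos": "N", "hypernym": None},
    "slope.1": {"words": ["slope", "incline"], "pos": "N", "hypernym": None},
}


def universal_word(word, synset_id, synsets=SYNSETS):
    """Disambiguate one sense of a word by restricting it with its hypernym."""
    hypernym_id = synsets[synset_id]["hypernym"]
    if hypernym_id is None:
        return word  # no hypernym available, so no restriction is added
    return f"{word}(icl>{synsets[hypernym_id]['words'][0]})"


def uw_entries(word, synsets=SYNSETS):
    """Return one (word, UW, part of speech) entry per sense of the word."""
    return [
        (word, universal_word(word, sid, synsets), synsets[sid]["pos"])
        for sid in sorted(synsets)
        if word in synsets[sid]["words"]
    ]


bank_entries = uw_entries("bank")
```

Each sense of the word yields its own entry, so the two senses of "bank" come out as separate Universal Words without any manual typing.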


Enconversion of Marathi to UNL

The EnConverter is a language-independent parser that provides a framework for carrying out morphological, syntactic and semantic analysis almost simultaneously. It helps to resolve any ambiguity in the input natural language (Marathi) sentences. The EnConverter needs a semantically rich Marathi-UW dictionary and a rule base of enconversion rules. Its output is a set of UNL expressions that express the knowledge extracted from the input Marathi sentences.

The work mainly consists of:
1. finding the correspondence between Marathi language phenomena and the UNL building blocks, namely relation labels and UNL attributes, and recording it in tabular form.
2. creating the Marathi-UW dictionary and the enconversion rule base to be used by the EnConverter software for machine translation.

The scope of the current work is restricted to simple sentences only.

Click here for the ppt file
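
To make the output of enconversion concrete, here is a minimal Python sketch of how UNL expressions might be held in memory and printed as binary relations. The relation labels agt and obj and the attributes @entry and @past follow common UNL usage, but the classes, the sample sentence and its hand-built analysis are assumptions for illustration; no enconversion rules are modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A Universal Word together with its UNL attributes."""
    uw: str
    attributes: Tuple[str, ...] = ()

    def __str__(self) -> str:
        return self.uw + "".join(f".{a}" for a in self.attributes)


@dataclass(frozen=True)
class Relation:
    """A labelled binary relation between two nodes."""
    label: str
    source: Node
    target: Node

    def __str__(self) -> str:
        return f"{self.label}({self.source}, {self.target})"


# Hand-built analysis of a simple sentence meaning "Ram ate a mango".
eat = Node("eat(icl>do)", ("@entry", "@past"))
ram = Node("Ram(icl>person)")
mango = Node("mango(icl>fruit)")
expression = [Relation("agt", eat, ram), Relation("obj", eat, mango)]
unl_text = "\n".join(str(r) for r in expression)
```

The table of correspondences in step 1 would decide which language phenomenon gives rise to each relation label and attribute; the sketch only shows the target representation.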


Enconversion and Deconversion of Hindi - UNL

Every language has a separate language server in which both the enconverter and the deconverter reside. The transformation of native language text into UNL is called enconversion (analysis), and the reverse is called deconversion (generation). Enconversion requires analysis rules and deconversion requires generation rules. There are 6000 analysis rules and 5000 generation rules for analyzing and generating simple, clausal and complex sentences of Hindi. We have also handled sentences from different corpora, such as Agriculture and ITU. Tagging has been incorporated to resolve ambiguity, and certain language-specific features are also dealt with.

Click here for the Hindi Analysis ppt file
Click here for the Hindi Generation ppt file



The activities of the search engine group can be divided into the following areas:
  • Font converters: Converters between different fonts and the ISCII standard, in both directions. (ppt file)
  • Abhangas of Saint Tukaram: More than 4600 abhangas of Saint Tukaram have been made browsable and searchable on the Web. (ppt file)
  • Damle Grammar: An online Marathi grammar book. (ppt file)
  • Marathi Search Engine: The web pages downloaded for the Marathi search engine have to be checked to confirm that they are in Marathi. A language identifier has been developed for this purpose. (ppt file)
    Search engines generally store web pages on the file systems provided by their operating systems. Web pages can be stored more efficiently in a file system designed specifically for web content, and a Web Content File System is being developed for this purpose. (ppt file)
  • Enabling Devanagari fonts on Linux: Linux offers limited support for fonts such as Devanagari Unicode fonts. (ppt file)
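
As an illustration of the language identification step for the Marathi search engine, the Python sketch below first checks that a page is mostly in the Devanagari Unicode block and then compares counts of a few common function words. It is only a sketch under stated assumptions: the hint word lists, the threshold and the function names are hypothetical, it assumes the page text has already been converted to Unicode, and the identifier actually developed may work quite differently.

```python
import re

# Assumed hint lists: a few very common function words in each language.
MARATHI_HINTS = {"आहे", "आहेत", "आणि", "नाही", "हे"}
HINDI_HINTS = {"है", "हैं", "और", "नहीं", "यह"}

# Devanagari letters, signs and digits, excluding the danda punctuation marks.
WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def devanagari_ratio(text):
    """Fraction of non-space characters that lie in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in chars) / len(chars)


def identify(text, min_ratio=0.5):
    """Return 'marathi', 'hindi', 'undecided' or 'other' for a page's text."""
    if devanagari_ratio(text) < min_ratio:
        return "other"
    words = WORD.findall(text)
    # The letter ळ is common in Marathi, so it is counted as an extra hint.
    marathi = sum(w in MARATHI_HINTS for w in words) + text.count("ळ")
    hindi = sum(w in HINDI_HINTS for w in words)
    if marathi == hindi:
        return "undecided"
    return "marathi" if marathi > hindi else "hindi"
```

A script check alone is not enough here, because Marathi and Hindi share Devanagari; that is why the sketch adds word-level hints after the character-level test.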



A text-to-speech synthesis system for the Marathi language is being developed at CFILT. It speaks Marathi text entered into the computer. The system is based on the Devanagari script and uses the concatenative speech synthesis model. All the phones of Marathi are stored, and syllables are formed by connecting these to vowels. Text is played back by concatenating these syllables. Digital signal processing is applied to the speech signals to produce good-quality speech.
Click here for the ppt file
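
The concatenation step can be sketched in Python as below. The sketch assumes that the text has already been split into syllables and that a waveform (a list of samples) is stored for each one; the inventory contents, the function names and the linear crossfade used to smooth each joint are assumptions for illustration, standing in for whatever signal processing the system actually applies.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, linearly crossfading `overlap` samples at each joint."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(syllables, inventory, overlap=2):
    """Look up the stored waveform of each syllable and concatenate them."""
    return crossfade_concat([inventory[s] for s in syllables], overlap)


# Hypothetical inventory: four constant samples per syllable, for brevity.
INVENTORY = {"क": [1.0] * 4, "म": [0.0] * 4, "ल": [0.5] * 4}
wave = synthesize(["क", "म"], INVENTORY, overlap=2)
```

With an overlap of zero the function reduces to plain concatenation; a small overlap trades a few samples of length for a smoother transition between units.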



