Resource Center for Indian Language Technology Solutions


Questionnaire Section I (to be filled by the Organisations)
1. Organization Details:
   Name: I.I.T Bombay
   Address: Powai, Mumbai – 400 076
   Telephone: 5722545
   Fax: 5720290
   e-mail:
   URL Address of website: http://www.iitb.ac.in

2. Contact Person: Prof. Pushpak Bhattacharya
   Telephone: 5767718
   Fax: 5720290
   e-mail: pb@cse.iitb.ac.in

   Chief/Head of R&D: Prof. Pushpak Bhattacharya
   Telephone: 5767718
   Fax: 5720290
   e-mail: pb@cse.iitb.ac.in

3. Indian Language Tools: (Please furnish following information separately for each tool along with the copy of published brochure)

   Name: iLeap
   Nature (h/w or s/w): Software
   Minimum Platform Requirement (h/w and Operating System): Windows platform.
   Languages supported: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Sanskrit, Tamil, Telugu.
   Fonts supported for each language: ---
   Bought-out Fonts/Developed in-house: Developed in-house.
   Functionality:
   Is it web-enabled? Yes
   Keyboard Lay-outs: Phonetic, Typewriter, Inscript
   Coding scheme (ISCII/ Unicode/ Proprietary): ISCII
   Convertor modules for above: Converters available for inter-conversion between ISCII and Unicode.
   Product evolution cycle: Detailed information not available
   Portability/ expandability (Inter-operability with other similar products available in the market): Cannot be used with other similar products available in the market.
   Date of launch: 1995
   No. of copies sold so far: Not revealed
   Developed in-house/ Contracted to other agencies: Developed in-house.
   Development Efforts in Man-Hours: Inadequate information.
   Any Technology Upgradation plans: ---

   Name: Akruti
   Nature (h/w or s/w): Software
   Minimum Platform Requirement (h/w and Operating System): Windows platform.
   Languages supported: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Sanskrit, Tamil, Telugu.
   Fonts supported for each language: ---
   Bought-out Fonts/Developed in-house: Developed in-house.
   Functionality:
   Is it web-enabled? No
   Keyboard Lay-outs: Phonetic, Typewriter, Inscript
   Coding scheme (ISCII/ Unicode/ Proprietary): Proprietary
   Convertor modules for above: Available for offline inter-conversion between ISCII and Akruti DBF
   Product evolution cycle: Detailed information not available
   Portability/ expandability (Inter-operability with other similar products available in the market): Cannot be used with other similar products available in the market.
   Date of launch: 15-8-1999
   No. of copies sold so far: Not revealed
   Developed in-house/ Contracted to other agencies: Developed in-house.
   Development Efforts in Man-Hours: Inadequate information.
   Any Technology Upgradation plans: ---

   Name: Windows 2000 (Indian Language Support)
   Nature (h/w or s/w): Software
   Minimum Platform Requirement (h/w and Operating System): IBM-PC Compatibles.
   Languages supported: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Sanskrit, Tamil, Telugu.
   Fonts supported for each language: ---
   Bought-out Fonts/Developed in-house: Developed in-house.
   Functionality:
   Is it web-enabled? Yes
   Keyboard Lay-outs: Phonetic, Typewriter, Inscript
   Coding scheme (ISCII/ Unicode/ Proprietary): Unicode
   Convertor modules for above: Converters available from iLeap for inter-conversion between ISCII and Unicode.
   Product evolution cycle: Detailed information not available
   Portability/ expandability (Inter-operability with other similar products available in the market): Cannot be used with other similar products available in the market.
   Date of launch: 2000
   No. of copies sold so far: Not revealed
   Developed in-house/ Contracted to other agencies: Contracted to NCST, Mumbai.
   Development Efforts in Man-Hours: Inadequate information.
   Any Technology Upgradation plans: ---

4. Products under development: Hindi Wordnet, Marathi portal complete with search engine, Machine translation software along with dictionary, Online textbooks for schools in Marathi, Text to speech converter for Marathi. For details see below.
   Name and functionality: As above.
   Stage of completion: The machine translation software is reasonably developed; the other activities started about 6 months ago.
   Plans for commercialization: Local industries will be contacted.
   The Market assessment (Assessment of the likely competition from Microsoft’s Indian language products and other vendors): Local industries are also working on Marathi portals.

5. Technical Capabilities Developed for processing of Indian languages:
   Tools development: Hindi Wordnet, Marathi portal complete with search engine, Machine translation software along with dictionary, Online textbooks for schools in Marathi, Text to speech converter for Marathi.
   Fonts Development: Nil.
   Web enabled applications: Marathi portal complete with search engine, Online textbooks for schools in Marathi.
   Multilingual and Multimedia content creation: Speech Technology for Marathi.
   Capturing the Ancient heritage into knowledge based system: Development of Marathi text corpus in electronic form (including a Marathi dictionary and 10 Marathi classics)
   Sorting and searching: ---
   Search engines: Efficient and linguistic search engines
   Optical Character recognition: ---
   Text to speech systems: Speech Technology for Marathi.
   Voice recognition: ---
   Machine translation systems: Machine translation between Marathi on one hand and Hindi and English on the other.

6. Tools/contents contributed for the public domain:
   Marathi portal complete with search engine.
   Machine translation software along with dictionary.
   Online textbooks for schools in Marathi.
   Hindi WordNet.

7. Limitations with regard to technology development/ growth of market (Indicative list of parameters for providing inputs):
  • Standards (for coding and keyboard Layouts, ISCII/Unicode): Difficulty is being faced in deciding on the appropriate standard for representing Indian language text. This is mainly because there does not seem to be consensus on the encoding adopted by various organizations, notably publishing houses, newspapers, etc. For example: Maharashtra Times and Loksatta use totally different encodings. It is urgently required that the ministry enforces a common standard accepted across the length and breadth of the country.
  • Return on investments: ---
  • Potential Buyers/User Organizations: No limitation is envisaged.
  • Support from respective State Governments: For the Marathi portal to be effective, government organizations should all create their sites and make them accessible. If not, this will pose a serious limitation.
  • Long development Cycles and gestation periods and piracy issues: The state of the art in Indian language text to speech is still primitive. Foundation level research will be required before proceeding with the actual implementation.
  • Intricacies with regard to technology application to multiple Indian Languages: The Hindi WordNet development is both linguistically and computationally a highly challenging task. In particular, it is extremely difficult to find people who combine linguistic finesse with computational ability. Relevancy calculation for indexed documents (in the context of the Marathi portal) is also a technologically challenging job. Finally, the translation software requires an enormous amount of lexical knowledge.
  • Competition from other companies and multi-nationals and so on: Few local industries are competitors as far as Marathi portal development is concerned.

Section II (to be filled by Technology Developers)
1. Profile of the Developer: See Appendix - I
   (Copy of Bio-data may be enclosed)

2. Technologies Developed: See Appendix - I
   (Details may be given separately for each of the developed technologies)
   Brief Description of Technology: (Enclose detailed write-up on the technology)
   Mode of funding:
   No. of Man-Hours:
   Creator/Owner of the Technology:
   Is the technology copyrighted:
   Testing / Beta Testing reports / feedback:
   Potential beneficiaries/ Users tie-ups:
   Detailed layout of Technology Transfer Documents:
   Proposed mode of technology transfer (please give details):

3. Possible integration with other technologies and prospective IT products/services?
   Hindi WordNet, once developed, will provide great assistance towards an Indian language enabled internet, machine translation among Indian languages, and knowledge extraction from and text summarisation for Indian languages. The Marathi search engine will help people find information that has been published on the internet in Marathi. The information can be extracted from websites of newspapers, agricultural websites, etc. The Marathi text to speech converter will help visually disabled people to derive the benefit of the Indian language internet.

4. Measure of efficacy of the technology developed:
   i) Electronic corpus of Marathi text: Size and heterogeneity of corpus.
   ii) Hindi Wordnet: Number of words, Number of pointers (antonym, hypernym, meronym, etc.)
   iii) Marathi portal complete with search engine: Retrieval precision and accuracy, Quality of user interface, Search engine efficiency.
   iv) Machine Translation software along with dictionary: Size of the dictionary, Syntactic and semantic attributes of entries in dictionary, Coverage of the source and target languages.
   v) Online textbooks in Marathi for schools: Coverage, Quality of presentation, Animation, Amount of interactivity.
   vi) Speech technology for Marathi: Quality of speech, Coverage of words and sentences

5. Portability/Expandability:
   The technologies developed do not assume any specific software or hardware platform.
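As an illustration of the retrieval precision measure listed under item 4 for the Marathi portal, the following is a minimal sketch of how precision and recall could be computed for a single query. It is not taken from the project; the function name `precision_recall` and the document ids are hypothetical, and it assumes relevance judgements are supplied by a human assessor.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the search engine (hypothetical ids)
    relevant:  document ids judged relevant by a human assessor
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of returned documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these two figures over a fixed set of test queries would give one comparable number per measure for each version of the search engine.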
APPENDIX - I
Name: Rashmi Kulkarni
Designation: Post Graduate student.
Organization: I.I.T Bombay
Address: Room #76, Hostel #11, IIT, Bombay.
Telephone: 5722545 ext: 8749
Fax: -
e-mail: rash@cse.iitb.ac.in
URL Address of website:
Areas of Research Work: Natural Language Processing.
Brief Description of Technology developed: We are working on Marathi language generation using an intermediate form known as Universal Networking Language (UNL). The UNL representation of a sentence is converted into an internal structure called a nodenet. The Syntax Planning phase produces the intermediate sentence by placing the words at the correct positions. The Morphology phase then produces the desired Marathi sentence. The programming language used is C. We have generated Marathi sentences from UNL expressions involving almost all relation labels mentioned in the UNL specification. Our system can generate simple, clausal, interrogative and imperative Marathi sentences. Currently, the system is being tested with the UNL expressions generated by the Hindi Enconverter.

Name: Dipak Kumar Narayan.
Designation: Research Assistant.
Organization: I.I.T Bombay
Address: Room #52, Hostel-4, IIT, Bombay.
Telephone: 5720096
Fax: -
e-mail: dipak@cse.iitb.ac.in
URL Address of website: http://www.cse.iitb.ac.in/~dipak
Areas of Research Work: Wordnet, Neural nets, Natural language understanding
Brief Description of Technology developed: I am working on the WordNet project. The idea of the project is to build a WordNet for the Hindi language. The work consists of creating a software infrastructure and support for the data entry of Hindi words and their semantic relations. It involves building the user interface for data entry and writing applications for the retrieval of data.

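The kind of storage and retrieval support described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions, not the project's actual infrastructure: the class names `Synset` and `MiniWordNet`, the relation labels, and the two sample Hindi entries are all illustrative and are not drawn from the Hindi WordNet data.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: int
    words: list                      # words sharing one sense
    gloss: str = ""
    # relation name -> set of target synset ids
    pointers: dict = field(default_factory=lambda: defaultdict(set))


class MiniWordNet:
    # Relations stored in both directions (assumed labels).
    INVERSE = {"hypernym": "hyponym", "hyponym": "hypernym",
               "meronym": "holonym", "holonym": "meronym",
               "antonym": "antonym"}

    def __init__(self):
        self.synsets = {}
        self.index = defaultdict(set)    # word -> synset ids

    def add_synset(self, synset_id, words, gloss=""):
        synset = Synset(synset_id, list(words), gloss)
        self.synsets[synset_id] = synset
        for word in words:
            self.index[word].add(synset_id)
        return synset

    def add_pointer(self, source_id, relation, target_id):
        self.synsets[source_id].pointers[relation].add(target_id)
        inverse = self.INVERSE.get(relation)
        if inverse:
            self.synsets[target_id].pointers[inverse].add(source_id)

    def related(self, word, relation):
        """Words reachable from any sense of `word` via `relation`."""
        found = set()
        for sid in self.index.get(word, ()):
            for tid in self.synsets[sid].pointers.get(relation, ()):
                found.update(self.synsets[tid].words)
        return found

    def counts(self):
        """(number of distinct words, number of stored pointers)."""
        n_pointers = sum(len(targets)
                         for s in self.synsets.values()
                         for targets in s.pointers.values())
        return len(self.index), n_pointers


if __name__ == "__main__":
    wn = MiniWordNet()
    wn.add_synset(1, ["पेड़", "वृक्ष"], gloss="tree")        # illustrative entry
    wn.add_synset(2, ["वनस्पति"], gloss="vegetation")       # illustrative entry
    wn.add_pointer(1, "hypernym", 2)
    print(wn.related("पेड़", "hypernym"))
    print(wn.counts())
```

The `counts` method corresponds to the efficacy measures named in Section II for the Hindi Wordnet (number of words, number of pointers).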
Name: Shachi Dave.
Designation: Research Staff.
Organization: I.I.T Bombay
Address: Room #264, Hostel-10, IIT, Bombay.
Telephone: 5722545 ext: 8749
Fax: -
e-mail: shachi@cse.iitb.ac.in
URL Address of website:
Areas of Research Work: Natural Language Processing, Knowledge Extraction.
Brief Description of Technology developed: Conversion of Hindi sentences into an intermediate form called the Universal Networking Language (UNL). The conversion is done using a Japanese software tool called EnConverter. The system takes an appropriate lexicon and analysis rules and generates UNL expressions from Hindi sentences input in romanized form. Our system is implemented to handle almost all the relation labels given in the UNL specification. It can deal with many types of sentences, including simple, clausal, interrogative and imperative. The system is scalable, needing only augmentation of the analysis rules and the dictionary entries. Currently, the system is being tested using real-life sentences from a Hindi magazine article.

Name: Sagar A. Tamhane
Designation: Project Engineer
Organization: I.I.T Bombay
Address: Room #313, Tansa House, I.I.T Bombay, Powai-76.
Telephone: 5722545 Ext: 8749
Fax: -
e-mail: sagar@cse.iitb.ac.in
URL Address of website: http://www.cse.iitb.ac.in/~sagar
Areas of Research Work: Indian language search engines, Information retrieval.
Brief Description of Technology developed: Creation of a website on Marathi language and Marathi language technologies along with a search engine. The activities are: collection of information about other research groups and software concerns working in this area; collection of papers, reports, free tools and products; study of word morphology; stemming algorithm development; and design and implementation of simple NLP techniques for improving search accuracy.

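To show how stemming can improve search accuracy, here is a minimal sketch of a suffix-stripping stemmer feeding an inverted index. It is not the algorithm developed in the project: the suffix list is a deliberately tiny, hand-picked assumption, the function names are hypothetical, and a real Marathi stemmer would need proper morphological analysis.

```python
from collections import defaultdict

# Assumed, illustrative list of common Marathi case and postposition
# endings. It is far from complete and will over-strip some words.
SUFFIXES = ["मध्ये", "साठी", "ला", "ने", "चा", "ची", "चे", "त"]


def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    result = None
    for token in query.split():
        ids = index.get(stem(token), set())
        result = set(ids) if result is None else result & ids
    return result or set()


if __name__ == "__main__":
    docs = {1: "घरात पाणी", 2: "घराला रंग"}   # hypothetical documents
    index = build_index(docs)
    print(search(index, "घराचा"))
```

Because all three inflected forms in the example reduce to the same stem, a query in one form retrieves documents that use the others, which plain string matching would miss.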
Name: Sanjay Kumar Jha.
Designation: Research Scientist.
Organization: I.I.T Bombay
Address: 414, Tansa House, IIT, Bombay.
Telephone: 5722545 ext: 8749
Fax: -
e-mail: lendlod@yahoo.com
URL Address of website: -
Areas of Research Work: WordNet (Hindi).
Brief Description of Technology developed: Preparation of the lexical database for Hindi WordNet from the perspective of different semantic relations, e.g. synonymy, antonymy, polysemy, homonymy, hypernymy, hyponymy, meronymy, holonymy, etc. Building a lexical inheritance system for Hindi lexical entities is one of the major concerns, along with several linguistic issues which have to be resolved in due course.

Name: Prabhakar Pande.
Designation: Project Assistant.
Organization: I.I.T Bombay
Address: A.I lab, CSE, I.I.T. Powai, Mumbai-76.
Telephone: 5722545 ext: 8749
Fax: -
e-mail: -
URL Address of website: -
Areas of Research Work: WordNet (Hindi).
Brief Description of Technology developed: Preparation of the lexical database for Hindi WordNet from the perspective of different semantic relations, e.g. synonymy, antonymy, polysemy, homonymy, hypernymy, hyponymy, meronymy, holonymy, etc. Building a lexical inheritance system for Hindi lexical entities is one of the major concerns, along with several linguistic issues which have to be resolved in due course.