Machine Translation set for quantum leap in India

In India's multi-linguistic landscape, the need to facilitate smooth communication between the Centre and the states is vital for good governance. Machine Translation offers a great solution to this problem. Srikanth RP & Dheeraj Kapoor do a reality check on Machine Translation technology in India and find out what's happening on the ground.

No one understands the meaning of `Unity in diversity' better than an Indian. Add countless dialects to eighteen different languages, and what you have is one messy pottage which very few would like to put their hands in. While this diversity has been India's distinguishing mark on the global scene, it has created quite a few hiccups in the day-to-day administration of the country.

In an effort to root out this problem, the founding fathers of our country nominated `Hindi' as the official language of the country and ordained that all government communication should be made through this medium. However, the situation on the ground is far from ideal. Quite a chunk of this communication is done in English. Couple this with the fact that most state governments function in their own regional languages and the situation becomes even more complex. This predicament has given rise to an urgent need to translate these documents into the language best understood by the target audience. But with translators increasingly hard to find, what could be the solution?

C-DAC's Darbari says the socio-political structure of a country has a direct bearing on the development of MT technology

Machine translation, say experts, could offer a viable option to those wishing to move on to an environment where thousands of verses in English could be converted into regional languages on the trot. In fact, Machine Translation (MT), thanks to its ability to change the way communication is done, has emerged as one of the most exciting technologies in recent times.

Says R M K Sinha, professor, Computer Science and Engineering, IIT Kanpur, "India has 18 major regional languages written in 10 different scripts. However, English, though spoken by a minuscule 3 percent of the population, is still the de-facto link language for administration, business and control. All grass root information of land, agriculture, health and education needs to be disseminated in the respective regional languages for effective communication and understanding. Hence, translation is as important as basic and necessary infrastructure like roads, water and transportation for a country like India."

Agrees Dr Hemant Darbari, group co-ordinator, Applied Artificial Intelligence group, C-DAC, "The social or political importance of MT arises from the socio-political importance of translation in communities where more than one language is generally spoken. The technology assumes even greater importance in non-English speaking countries such as India."

Adds Durgesh Rao, research scientist, Knowledge based Computer Systems Division, NCST, "India is a linguistically rich area: we have Hindi, English, and fourteen other official languages, each of which is spoken by millions of people. Since most information is generated in English, Machine Translation has emerged as a critical technology that can help communicate and share information more effectively."

NCST's Rao says the lack of online lexical resources has hampered the growth of MT technology in India

The MT revolution was kick-started by C-DAC when it began work on natural language processing (NLP) and developed a parser which could parse Hindi, Sanskrit, Gujarati, English and German. While developing this technology, C-DAC looked for practical applications and suggested it to various agencies. Realising the immense potential of MT, the Department of Official Language (DOL), Government of India, began actively funding such projects.

Today, the Ministry of Information Technology has realised the importance of Machine Translation and has identified the following domains for the development of domain specific translation systems: government administrative procedures and formats, parliamentary questions and answers, pharmaceutical information, and legal terminology and judgements. The ministry also initiated the `Technology for Development of Indian languages' project in 1990-91 to support and fund R&D efforts in information processing in Indian languages, covering machine translation among other areas.

However, with 18 different languages, translation is no child's play. As English and Hindi are a critical pair of languages and account for the bulk of the correspondence in government offices, this pair has been identified as the priority area for machine aided translation. Accordingly, two specific areas of research have been identified: MT systems for translation between Indian languages, and MT systems for translation from English to Hindi. Currently, three institutions in the country, namely C-DAC, NCST and the IITs, have taken the lead in developing applications using this cutting edge technology.

Under the knowledge-based computer systems project of the DOE, C-DAC developed VYAKARTA, which could parse English, Hindi, Gujarati and Sanskrit. It used the same parser to develop MANTRA (a machine assisted translation tool for translating official language sentences from English to Hindi). This was demonstrated to the Department of Official Language, which financed a project entitled `English to Hindi Computer assisted Translation System' for administrative purposes. The aim of the project was to design, develop and implement a computer assisted translation system for personnel administration. The system is now able to translate letters and circulars such as appointment letters and transfers, and can take input from standard word processing and DTP packages. Having completed English to Hindi translation in this domain, C-DAC is now looking to extend it to other domains and to apply the techniques developed to multi-lingual translation. This capability would also enable it to achieve machine translation between any language pair.

Another organisation involved in the area of MT is Mumbai-based NCST. NCST was one of the first institutes in India to work on Machine Translation. Explains Rao, "In the late 80s we developed an early prototype, ScreenTalk, to translate PTI news stories of specific categories, using a script-like approach. Since then, we have continued our work and have developed MaTra, a general-purpose framework for translation between English and Indian languages, starting with Hindi." The focus in MaTra is on the innovative use of man-machine synergy. Currently, the domain being explored is news, which can later be extended to any domain. The system breaks an English sentence into chunks, analyses the structure and displays it, allowing the user to verify and correct it. MaTra can be used in two ways. In the automatic mode, the system gives the best translation it can, which can later be post-edited by the user. In the manual mode, the user can guide the system towards the correct translation using an intuitive GUI. Adds Rao, "We have an advanced prototype of this system that works for simple sentences, and are extending it to cover more complex sentences in an incremental fashion."

IIT Mumbai's Bhattacharya says interaction with the industry helps them understand the technology better

Talk about cutting edge technology and you simply cannot keep the IITs out of the picture. Explains Dr Pushpak Bhattacharya, Department of Computer Science and Engineering, IIT Mumbai, "IITs have long felt the need for investing in Machine Translation. IIT Kanpur took the lead through projects such as Anusaaraka, Anglabharati, Anubharati etc. Currently, a very modern approach to this problem through the Universal Networking Language is being pursued in IIT Bombay. The faculty regularly interacts with industries on MT related problems. Also numerous student projects at bachelors, masters and PhD level are undertaken in IITs."

`ANGLABHARATI' is said to be a revolutionary system in the field of Machine Translation. It is a machine aided translation system for translation from English to Hindi, for the specific domain of public health campaigns. Explains Sinha, "We at IIT Kanpur have developed ANGLABHARATI (a rule based system for translation from English to all Indian languages) and ANUBHARATI (an abstracted example based approach). An alpha version of a system for English to Hindi based on ANGLABHARATI technology is ready and is being field-tested by ER & DCI Noida."

Building a machine translation system is far from simple. A good system cannot be produced by merely replacing source language words with target language words; word-for-word translation rarely produces a satisfying target language text. A good machine translation system must incorporate a sound knowledge not only of the vocabulary of both the source and target languages, but also of their grammar. For example, C-DAC's MANTRA works not word-to-word or rule-to-rule but lexical tree to lexical tree, so that transfer can be done chunk by chunk. The system uses the Tree Adjoining Grammar (TAG) formalism both for parsing English sentences and for generating Hindi sentences. Currently focussing on the domain of personnel administration, C-DAC claims that text related to appointments, transfers and office orders is translated successfully with 90-95 percent accuracy.
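To see why word-for-word replacement falls short, consider the following minimal sketch. It is a toy illustration, not MANTRA's Tree Adjoining Grammar machinery: the three-word lexicon (in transliterated Hindi), the function names and the assumption that every input is a bare subject-verb-object sentence are all made up for this example. It only shows that English orders a simple sentence as subject-verb-object while Hindi prefers subject-object-verb, so some structural knowledge is needed on top of a dictionary.

```python
# Hypothetical toy lexicon (transliterated Hindi); not taken from any real MT system.
LEXICON = {"ram": "Ram", "eats": "khata hai", "mango": "aam"}


def word_for_word(sentence: str) -> str:
    """Replace each English word with its Hindi entry, keeping English word order."""
    return " ".join(LEXICON[word] for word in sentence.lower().split())


def structural_transfer(sentence: str) -> str:
    """Treat the input as subject-verb-object and emit subject-object-verb.

    Assumption: the sentence has exactly three words, one per role.
    A real system would obtain these roles from a parser.
    """
    subject, verb, obj = sentence.lower().split()
    return " ".join(LEXICON[word] for word in (subject, obj, verb))


if __name__ == "__main__":
    print(word_for_word("Ram eats mango"))        # verb lands in the middle
    print(structural_transfer("Ram eats mango"))  # verb moves to the end
```

The first call keeps the English order and yields an unnatural Hindi sentence; the second moves the verb to the end, as Hindi expects. Real systems work on far richer structures, but the gap between the two outputs is the point the paragraph above makes.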

Adds Ajai Jain, associate professor, Computer Science and Engineering, IIT Kanpur, "The most common technique to use machine translation is by coding the grammatical rules of source and target languages in the software and get the translation done using these rules and dictionaries specifically created for this purpose. The other technique is to store the source and target language pairs and try and match the new sentences for similarities from the existing example base and obtain a translation based on the best match. There can be an amalgamation of the above techniques, wherein patterns are stored in place of raw examples. In addition, statistical methods can be deployed to increase efficiency of the translation."
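The example based technique Jain describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the ANUBHARATI implementation: the three stored sentence pairs (in transliterated Hindi), the function name `translate_by_example` and the similarity cutoff of 0.6 are all hypothetical. The sketch uses `difflib.get_close_matches` from the Python standard library to find the stored English sentence most similar to the new one, and returns that sentence's stored translation.

```python
import difflib
from typing import Optional

# Hypothetical example base: English sentences paired with transliterated Hindi.
EXAMPLE_BASE = {
    "what is your name": "aapka naam kya hai",
    "where is the office": "daftar kahan hai",
    "the office is closed": "daftar band hai",
}


def translate_by_example(sentence: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the stored translation of the most similar known sentence.

    Returns None when nothing in the example base is similar enough,
    which is where a real system would fall back to rules or a human.
    """
    matches = difflib.get_close_matches(
        sentence.lower(), list(EXAMPLE_BASE), n=1, cutoff=cutoff
    )
    return EXAMPLE_BASE[matches[0]] if matches else None


if __name__ == "__main__":
    print(translate_by_example("where is office"))  # close to a stored example
    print(translate_by_example("zzz"))              # nothing similar is stored
```

A character-level similarity score is a crude stand-in for the matching a real example based system would do. Storing abstracted patterns rather than raw sentences, as Jain suggests, would let one stored example cover many inputs.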

While the other current projects have focused their energies on machine translation from English to Hindi, with extension to other languages to follow, the Anusaaraka project, which started at IIT Kanpur and is now being continued at IIIT Hyderabad, was started with the explicit aim of translating from one Indian language to another. Anusaaraka is software that converts text from one Indian language to another. It produces output which a reader can understand but which is not strictly grammatical. For example, a Bengali to Hindi Anusaaraka can take a Bengali text and produce Hindi output which the user can understand but which will not be grammatically perfect. Likewise, a person visiting a site in a language he does not know can run Anusaaraka and read the text. Anusaarakas have been built from Telugu, Kannada, Bengali, Marathi and Punjabi to Hindi. The system will be available as open source software.

Sceptics who doubt the efficiency of MT systems would be surprised to know that several MT systems are in use around the world. Examples include the well known Systran (used by the AltaVista search engine) and METEO (used at the Canadian Meteorological Centre since 1977 to translate weather bulletins running to over 45,000 words).

C-DAC is making sure that MT is poised for even more exciting times with the proposed development of a Mantra Translation Server which can be accessed by anyone on the Internet using a browser. All a user has to do is send the English text and the server sends back the translated text in the language requested. C-DAC is also working on a domain specific translated chat application. Here, one can select the language and all the communication will be done in the selected language. This means that even if you select Hindi and the other person selects English, you will receive all messages in Hindi although the other person types in English.

JNU's Anvita Abbi believes India has made rapid progress in specific domains in Machine Translation

Despite such innovative projects enjoying the complete support of the government, the development of the technology has not been as rapid as expected. What could have hampered this growth? "Machine Translation is acknowledged as a major challenge the world over. When you take languages that are quite diverse, such as English and Hindi, the complexity is compounded. Since there is lack of appreciation of the nature of the task, popular perception of MT falls into two extreme categories. MT is either viewed as a simple problem that is already solved, or is dismissed as totally impossible. The truth is somewhere in between. As you know, services like AltaVista and Google are offering rough automatic translation among several languages. This is mainly among European languages, and between English and Far Eastern languages. Indian languages are yet to be covered! The languages that are now covered represent more than 30 years of hard work! Of course, we can learn from this experience and have Indian MT systems off the ground faster, but we obviously cannot underestimate the size of the task," says Rao.

Adds Jain, "Today nobody in the government has a roadmap for development of technologies like MT. They try to sponsor short term projects and generally have a wide gap between two projects. This leads to only patchwork and losing trained manpower in this area."

As seen with any technology, MT in India too has its share of blemishes. For example, the much touted Anusaaraka project is dismissed as a non-starter by R M K Sinha who launched the ANGLABHARATI project. Explains Sinha, "The Anusaaraka methodology works to some extent only for the specific pair of languages for which it was designed. It heavily exploits the fact that two Indian languages have the same word order but this is not necessarily true in all situations. Technically, Anusaaraka can be considered to be a specific case of ANGLABHARATI technology. Anubharati is generic in nature whereas Anusaaraka is language pair specific with limited growth capability and no guarantee for grammatical forms. Incidentally, the Anusaaraka project was sponsored by the Government of India with large funding and nothing tangible has come out of it even after a decade of work. In fact, it has been a great mistake on the part of the government to have funded this project to the extent that other more promising MT paradigms investigation and development have been starved of funds. It is unfortunate that the country has been pushed behind by almost five years due to lopsided support to Anusaaraka."

So what is the solution to such a problem? Explains Rao, "Coming to the issue of lexical resources, building MT systems without basic lexical resources such as online corpora, lexicons and thesauri is like trying to build cities without brick, mortar and cement. There is an acute lack of online lexical resources for Indian languages. Whatever little exists has been developed for specific groups and is not easily shareable. This is a massive task which cannot be done by any single group. Indian groups have now begun to address this challenge jointly by starting a collaborative open source initiative called LERIL (Lexical Resources for Indian Languages), which includes several groups such as IIIT Hyderabad, NCST, AU-KBC and Kendriya Hindi Sansthan." Sharing of resources could be the key to helping MT projects take off at a faster rate. Agrees Sinha, "What we lack today is a rich lexical database and availability of trained manpower to do R&D in this area. Given the unique multi-lingual culture of the country and our leadership in this area, we can become a global player in the field of MT if proper encouragement and funding is provided for R&D."

Adds Bhattacharya, "The main reason is the absence of lexical resources. Fortunately, the Ministry of Information Technology is taking concrete steps towards creating these resources. Using the funding from the Ministry, IIT Bombay is building the Hindi Wordnet (to be followed by the Marathi Wordnet). Once built, this will facilitate MT R&D in the country."

In conclusion, it would be worthwhile to add that despite all the issues involved, India has over the years made significant progress in the field of MT. Currently, the Ministry of Information Technology sponsors nearly 75 percent of these projects. Expressing her views on the state of the technology in India, Anvita Abbi, professor of linguistics, Jawaharlal Nehru University, New Delhi, says, "The main problems faced in the area of MT are syntactical. With various grammatical issues involved in the languages, since each language has disparate structures, it is difficult to capture these differences. However, MT has over the years made notable progress and has been quite successful in scientific and domain specific fields because of its objectivity."

