Technical Details of the Proposed Project
The internet is surely the most exciting and beneficial technology of modern times. In no other age has such an all-pervading medium of information exchange been found among the people of the world. No branch of study, no organization, no commercial endeavor can continue to remain indifferent to the internet and its associated benefits.
However, one major barrier to widespread internet access is language. English has been the de facto language of the internet: both information retrieval and data entry take place largely in English, and the largest body of text on the internet is in English. Proficiency in English is thus a prerequisite for using the internet.
In India, which along with China is home to more than a third of humanity, the need for the internet is beyond debate. This cheap mode of information transfer and retrieval is the need of the hour. It will not only bring the far corners of the country closer to one another but also provide a window to the outside world.
One peculiarity of this country, which distinguishes it from China and which is both an advantage and a disadvantage, is the enormous variety of languages in use. A villager from remote Arunachal Pradesh will find it impossible to communicate with a villager from Gujarat. Likewise, someone from Kashmir may be misunderstood, or not understood at all, in Tamil Nadu.
This is not to say that radio, television and other such media have not done an admirable job of disseminating information, but what they have achieved is only a very small percentage of what needs to be done. In a democratic society, where information is empowerment, this language barrier must be overcome.
Thus, although the internet has arrived with all its blessings, the biggest obstacle to reaping its benefits is the language barrier.
At the international level, too, there is a growing awareness that the barrier posed by English must be surmounted. The question, however, is how. Information will continue to be put on the internet in English by predominantly English-speaking countries such as the USA, UK, Australia and Canada, and even by India. But a large number of people would like to contribute their knowledge, expertise and information to the internet in their own native tongues. They would also like to access and understand information from the internet in their own language.
The above clearly calls for the scenario shown in the figure.
The internet stores the information in a language-independent form. An analyzer module converts information input in any natural language into this form (let us call it LIF, for language-independent form). The generator module produces the actual natural language sentences from the LIF.
An example of LIF is given below for the sentence "The Internet will benefit India":
Here, agent and object are case relationships, and is_a is a semantic-net construct that fixes the ontology of the words. Thus benefit is specified as a verb and India as a country, which essentially disambiguates the words. The attribute .future specifies the tense of the verb.
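The original LIF figure is not reproduced here, so the following is only a minimal sketch of how the relations just described might be held in a program. The class names `Concept` and `Relation`, the function `to_triples`, and the textual notation it prints are assumptions made for illustration, not the project's actual format. The text gives no class for Internet, so its `is_a` is left unset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    """A word of the sentence with its ontological class (hypothetical layout)."""
    word: str
    is_a: Optional[str] = None          # e.g. "verb", "country"; None if unspecified
    attributes: Tuple[str, ...] = ()    # e.g. (".future",) for tense


@dataclass(frozen=True)
class Relation:
    """A case relationship (agent, object) between a head and a dependent."""
    label: str
    head: Concept
    dependent: Concept


def to_triples(relations: List[Relation]) -> List[str]:
    """Render case relations first, then is_a facts, in an illustrative notation."""
    lines = []
    concepts = []
    for rel in relations:
        head = rel.head.word + "".join(rel.head.attributes)
        lines.append(f"{rel.label}({head}, {rel.dependent.word})")
        for concept in (rel.head, rel.dependent):
            if concept not in concepts:
                concepts.append(concept)
    for concept in concepts:
        if concept.is_a is not None:
            lines.append(f"is_a({concept.word}, {concept.is_a})")
    return lines


# "The Internet will benefit India"
benefit = Concept("benefit", is_a="verb", attributes=(".future",))
internet = Concept("Internet")          # class not given in the text
india = Concept("India", is_a="country")

sentence = [
    Relation("agent", benefit, internet),
    Relation("object", benefit, india),
]

for line in to_triples(sentence):
    print(line)
```

Because every word carries its class and every link carries its case label, two sentences with the same meaning would, in such a scheme, map to the same set of relations regardless of the language they were written in.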
Because of the LIF, users of different languages will be able to exchange information via the internet. This is illustrated in the figure.
From the diagram it is clear that the two users, one speaking Hindi and the other Bengali, need not know any language other than their own. Each user avails of the benefits of the internet in his or her own language.
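The exchange described above can be sketched as a pair of functions per language: one analyser into the LIF and one generator out of it. This is a toy under stated assumptions. The `LIF` record, the function names, the single sentence pattern handled, and the romanized Hindi lexicon are all hypothetical, and English stands in for the source language only to keep the example readable.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LIF:
    """Hypothetical, heavily simplified language-independent record."""
    predicate: str
    tense: str
    agent: str
    obj: str


def analyze_english(sentence: str) -> LIF:
    """Toy analyser: handles only '<agent> will <verb> <object>'."""
    words = [w for w in sentence.rstrip(".").split() if w.lower() != "the"]
    will_at = words.index("will")
    return LIF(
        predicate=words[will_at + 1],
        tense="future",
        agent=" ".join(words[:will_at]),
        obj=" ".join(words[will_at + 2:]),
    )


def generate_english(lif: LIF) -> str:
    """Toy generator: future tense only, subject-verb-object order."""
    return f"{lif.agent} will {lif.predicate} {lif.obj}"


# Hypothetical romanized Hindi lexicon, keyed by LIF concept.
HINDI_LEXICON = {
    "Internet": "Internet",
    "India": "Bharat",
    "benefit": "laabh pahunchaega",   # future form fixed by hand for this toy
}


def generate_hindi(lif: LIF) -> str:
    """Toy generator: subject-object-verb order with the postposition 'ko'."""
    lex = HINDI_LEXICON
    return f"{lex[lif.agent]} {lex[lif.obj]} ko {lex[lif.predicate]}"


GENERATORS: Dict[str, Callable[[LIF], str]] = {
    "english": generate_english,
    "hindi": generate_hindi,
}

# One stored LIF record serves readers of every registered language.
stored = analyze_english("The Internet will benefit India")
for language, generate in GENERATORS.items():
    print(language, "->", generate(stored))
```

Adding a further language, such as Bengali, would mean writing one more analyser and one more generator against the same `LIF` record, without touching the existing ones.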
A question that may arise at this stage is: why bring in the intermediate form? One could store the documents in Hindi, and a translator could then convert documents in Bengali (or any other language) to or from Hindi. The point, however, is that the LIF will have only one interpretation, whereas a document stored in a natural language such as Hindi would carry that language's ambiguities into every translation. The representation shall be unambiguous.
One of the important benefits accruing from this work is that it will lead to the development of a repository of Indian language processing tools on the internet. The most important among them is the Hindi Wordnet, a huge lexical knowledge base of Hindi words specifying all possible relationships between words. It may be noted that a 'wordnet' package for English already exists and is in very wide use.
Italy, Germany, France and Russia have undertaken the construction of lexical knowledge bases for their languages on a large scale. Any work on natural language based information processing is impossible without a 'wordnet'-like tool.
Thus a major objective of the project will be Indian Language Enabled Information Processing on the Internet.
The important steps in realizing this objective are:
The development and refinement of the dictionary, which is the bridge between Hindi words and the entities of the LIF, will continue throughout the project. Separate stages are identified for completing the most common words, words of intermediate usage, and rarely used words.
The building of the Hindi Wordnet will also continue till the end of the project. The generator, which will make extensive use of the dictionary and the wordnet, will be built first. While the generator is nearing completion, work on the analyser will start. It is envisaged that the same dictionary will be useful for both the analyser and the generator. The final phase of testing is the most important, since it is this that will ultimately enable Hindi on the internet.
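A minimal sketch of how one dictionary could serve both the analyser and the generator, and how entries could be tracked by usage stage, is given here. The class `BridgeDictionary`, its method names, the `Tier` labels, and the two sample entries with their tier assignments are assumptions for illustration only; the proposal does not specify the dictionary's format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Tier(Enum):
    """Usage stages named in the text (labels are hypothetical)."""
    COMMON = "most common"
    INTERMEDIATE = "intermediate usage"
    RARE = "rarely used"


@dataclass(frozen=True)
class Entry:
    hindi: str      # Hindi headword
    concept: str    # LIF entity the headword maps to
    is_a: str       # ontological class of the entity
    tier: Tier      # usage stage in which the entry is completed


class BridgeDictionary:
    """One set of entries, indexed in both directions."""

    def __init__(self, entries: List[Entry]) -> None:
        self._entries = list(entries)
        self._by_hindi: Dict[str, List[Entry]] = {}
        self._by_concept: Dict[str, List[Entry]] = {}
        for entry in self._entries:
            self._by_hindi.setdefault(entry.hindi, []).append(entry)
            self._by_concept.setdefault(entry.concept, []).append(entry)

    def concepts_for(self, hindi_word: str) -> List[str]:
        """Analyser direction: Hindi word to LIF entities."""
        return [e.concept for e in self._by_hindi.get(hindi_word, [])]

    def hindi_for(self, concept: str) -> List[str]:
        """Generator direction: LIF entity to Hindi words."""
        return [e.hindi for e in self._by_concept.get(concept, [])]

    def count_by_tier(self) -> Dict[Tier, int]:
        """Number of entries completed in each usage stage."""
        counts = {tier: 0 for tier in Tier}
        for entry in self._entries:
            counts[entry.tier] += 1
        return counts


# Sample entries; the tier assignments are illustrative guesses.
dictionary = BridgeDictionary([
    Entry("भारत", "India", "country", Tier.COMMON),
    Entry("लाभ पहुँचाना", "benefit", "verb", Tier.COMMON),
])

print(dictionary.concepts_for("भारत"))
print(dictionary.hindi_for("benefit"))
print(dictionary.count_by_tier())
```

Keeping a single list of entries and deriving both indexes from it means a correction made for the generator is immediately available to the analyser as well.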
Though the approach and method described above pertain to Hindi, they are general enough to apply to other Indian languages. Further, though efforts are underway at many places in India (CDAC, IIT Kanpur, University of Hyderabad, etc.) to make Indian languages available on the internet, the goals of the current proposal are very different and have far-reaching consequences. This proposal places great emphasis on the semantics of the information, the faithful transfer of information between languages, and the universality and efficiency of information representation. These are fundamental research issues in natural language processing as a whole.
It may be mentioned here that there has been a lot of work on transfer-based and interlingua-based machine translation. The interlingua-based approach has met with better success historically. This field has established techniques for lexicalization, parsing, generation, knowledge representation, etc.
The internet has now brought new challenges of information transfer and retrieval in natural languages. Recently the United Nations University, Tokyo, has designed, together with seventeen other research groups from various countries, a novel intermediate language called the Universal Networking Language (UNL).
The languages represented by these groups include French, German, Chinese, Russian, Spanish, Hindi, Indonesian and Japanese. Work is underway at many places in the world to build analysers and generators between UNL and natural languages.