1. Adam Pease
Adam Pease is the Principal Consultant and CEO of Articulate Software. Formerly, he was Director of Knowledge Systems at Teknowledge, where he led a group conducting research and developing applications in ontology and knowledge-based systems. His current work is on the Suggested Upper Merged Ontology, Arabic WordNet with Ontology, and the Sigma ontology environment. His previous projects include Rapid Knowledge Formation, DARPA Agent Markup Language, High Performance Knowledge Bases and the Core Plan Representation.
Topic: Suggested Upper Merged Ontology (SUMO)
Abstract: This talk presents an overview of ontology, including how formal ontology compares to less formal approaches and how the Suggested Upper Merged Ontology (SUMO) (www.ontologyportal.org) compares to other formal ontologies. Classes of ontology-based applications are introduced. A detailed description of first-order logic is provided, and the capabilities and tradeoffs of first-order logic inference are explored. Several exercises are included in the tutorial to maximize audience understanding of the concepts and provide the basics needed for ontology creation. SUMO is also described in detail, along with its mappings to the WordNet lexicon. If time permits, the Controlled English to Logic Translation (CELT) system will be described, as well as its SUMO-based logical output.
2. Ch. Boitet
Ch. Boitet is a professor of computer science at Université Joseph Fourier (Grenoble 1), where he has taught algorithmics, compiler construction, formal languages and automata, elementary logic, formal systems, and natural language processing. He is one of the authors of ARIANE, GETA's generator of MT systems. He has presented papers at many national and international conferences and published in various journals and books. He has also edited a book presenting Prof. Vauquois' scientific work, as well as several international conference proceedings. He has been in charge of several research contracts aimed at reaching the operational and industrial stages. He has also been involved in, or in charge of, GETA's participation in several cooperative research efforts. He is a member of ICCL and was one of the organizers of COLING-92. In 1998, he was Programme Chair of COLING-ACL'98. He is a regular reviewer for several journals and conferences, and has served on the programme committees of many conferences. His current interests include personal dialogue-based MT for monolingual authors (GETA's LIDIA project, international UNL project), speech translation (CSTAR project), machine aids for translators and interpreters, integration of techniques inspired by speech processing into MT, multilingual lexical databases (Papillon project), and specialized languages and environments for lingware engineering and linguistic research (Ariane-Y project).
Topic: Towards Higher Quality Internal and Outside Multilingualization of Web Sites
Abstract: High-quality multilingualization of Web sites is increasingly important, but it is unsolvable in most situations where internal quality certification is needed, and remains unsolved in the majority of other situations. We demonstrate this by analyzing a variety of techniques to make the underlying software easily "localizable" and to manage the translation of textual content in the classical "internal" mode, that is, by modifying the language-dependent "resources". A new idea is that volunteer end users should be able to contribute to the improvement, or even the production, of translated resources and content. To this end, we have developed a piece of PHP code that naive webmasters (neither computer scientists nor professional translators) can add to a web site to enable "internal" multilingualization by users with sufficient access rights: in "management mode", these users can edit the texts of titles, button labels, messages, etc. in text areas appearing "in context" in the web page. If Web site developers follow some recommendations, all textual interface elements should be localizable in this way.
Another angle of attack, applicable in all cases where navigating a site through a "gateway" is possible, consists in replacing the problem of diffusion by the problem of access in multiple languages. We introduce the concept of an iMAG (interactive Multilingual Access Gateway, dedicated to a web site or domain) to solve the problem of "higher quality" multilingual access. First, by using available MT systems or, failing that, morphological processors and bilingual dictionaries, any page of an "elected" website is made instantly accessible in many languages, with a generally low quality profile, as through usual translation gateways. Over time, the quality profile of textual GUI elements, web pages and even documents (if accessible in HTML) will improve thanks to outside contributors, who will post-edit or produce the translations from the reading context. This is only possible because the iMAG associated with the website stores the translations in its translation memory (TM) and the contributed dictionary items in its dictionary. The TM has quality levels, according to the users' profiles, and scores within levels. An API will be proposed so that the developers of the "elected" website can connect it to its iMAG, retrieve the "best level" translations, certify them if necessary, and put them in their localized resources. At that point, external localization meets internal localization.
Keywords: Multilingual access, Machine Translation (MT), online MT, Web localization, collaborative translation environment, extended translation memory, translation of dynamic Web sites
3. S. Sudarshan
S. Sudarshan received his Ph.D. from the University of Wisconsin, Madison, in 1992. He was a Member of the Technical Staff in the database research group at AT&T Bell Laboratories from 1992 to 1995, and he has been at the Indian Institute of Technology (IIT), Bombay since 1995. He spent a year on sabbatical at Microsoft Research, USA in 2004-05.
Sudarshan's research interests center on database systems, and his current research projects include keyword querying on databases, processing and optimization of complex queries, and database support for securing and testing applications. Sudarshan is a co-author of the widely used textbook, Database System Concepts, 5th Ed., by Silberschatz, Korth and Sudarshan.
Topic: Keyword Search on Graph-Structured Data
Abstract: A variety of types of data, such as relational, XML and HTML data, can be naturally represented using a graph model of data, with entities as nodes and relationships as edges. Graph representations are also a natural choice for representing information extracted from unstructured data, and for representing information integrated from heterogeneous sources of data.
Keyword search is a natural way to retrieve information from graph representations of data, especially in the common situation where the graphs do not have a well-defined schema. Unlike text or Web search, there is no natural notion of a document, and information about a single conceptual entity may be split across multiple nodes. Answers to queries are therefore usually modeled as trees that connect nodes matching the keywords, with each answer tree having an associated score. The goal of keyword search then is to find answers with the highest scores. A number of systems, including the BANKS system developed at IIT Bombay, are based on such a model for answering keyword queries. In this talk we first outline background material on keyword querying on graph data, including models for ranking of answer trees. We then focus on search algorithms for finding top-ranked answers. We outline key issues in finding top-ranked answers, and present algorithms that address the problem in the context of in-memory graphs. We then consider the problem of search on external memory graphs, and briefly outline our on-going work on external memory graph search based on a multi-granular graph representation.
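The answer-tree model described above can be illustrated with a minimal sketch. It is not the BANKS algorithm or its ranking model: it assumes an unweighted, undirected in-memory graph, scores a tree simply by the sum of path lengths from its root to one match per keyword, and returns only the single best root. All names (`bfs_from`, `best_answer_tree`, the toy graph) are hypothetical.

```python
from collections import deque


def bfs_from(sources, adj):
    """Multi-source BFS: hop distance and parent pointer toward the nearest source."""
    dist = {s: 0 for s in sources}
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent


def best_answer_tree(adj, node_text, keywords):
    """Return (cost, root, paths) for the cheapest root that reaches every keyword.

    Each path runs from the root to a node whose text contains one keyword.
    Returns None if some keyword matches no node or no common root exists.
    """
    searches = []
    for kw in keywords:
        matches = [n for n, text in node_text.items() if kw in text.lower()]
        if not matches:
            return None
        searches.append(bfs_from(matches, adj))

    best = None
    for root in sorted(node_text):
        if all(root in dist for dist, _ in searches):
            cost = sum(dist[root] for dist, _ in searches)
            if best is None or cost < best[0]:
                best = (cost, root)
    if best is None:
        return None

    cost, root = best
    paths = []
    for _, parent in searches:
        path, node = [], root
        while node is not None:  # follow parent pointers back to a keyword node
            path.append(node)
            node = parent[node]
        paths.append(path)
    return cost, root, paths


# Toy data: two authors (a1, a2) and two papers (p1, p2); edges mean "wrote".
adj = {"a1": ["p1", "p2"], "a2": ["p1"], "p1": ["a1", "a2"], "p2": ["a1"]}
node_text = {"a1": "Alice", "a2": "Bob", "p1": "Graph search", "p2": "Query optimization"}

print(best_answer_tree(adj, node_text, ["alice", "bob", "graph"]))
```

On this toy graph the paper node `p1` is chosen as the root, since it connects both authors and itself matches "graph". A realistic system would add edge weights, node prestige, and top-k enumeration of answers.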
4. Ganesh Ramkrishnan
Ganesh Ramkrishnan is on the faculty of the Computer Science and Engineering Department at IIT Bombay. His research interests are statistical learning theory (support vector machines, kernel machines, radial basis functions), statistical language modeling, reasoning and inferencing, and information extraction.
Topic: Scalable Techniques for Information Extraction
Abstract: Information extraction from terabytes of text is becoming an increasingly important requirement. In this presentation, we will review the state of the art in scaling up the extraction of information from large amounts of unstructured text. Two key topics will be covered: (a) specialized indexing techniques and (b) grammar-based versus algebra-based paradigms for information extraction. We will also specifically discuss the challenges involved in building and maintaining rule-based annotators for information extraction. To conclude, we will demonstrate two systems that are focused on scalable rule-based information extraction.
5. Dr. Sudeshna Sarkar
Dr. Sudeshna Sarkar is on the faculty of the Computer Science and Engineering Department at IIT Kharagpur. Her broad interests span different areas of artificial intelligence and machine learning. Her current interests are in personalized information retrieval and natural language technology. She is mainly interested in building large-scale practical systems that can be used to solve real-life problems.
Topic: Opinion Analysis
Abstract: The large volume of user-created content contains not only facts but also many opinions. Comments on news sites, personal blogs and social networking sites are only some of the places where internet users express their personal opinions and sentiments. These opinions are often valuable to consumers as well as businesses, governments, political analysts, etc. Opinion mining systems are geared to retrieve opinionated results. The ability to do so automatically implies that large collections can be searched for opinions, which can then be aggregated for the user. Opinion analysis aims to analyze opinionated texts to identify the topic, features and polarity of the opinions. This talk covers methodologies for mining opinions from user-generated content. We will briefly review the work under several headings: identification of polarized expressions or opinion phrases, document-level sentiment analysis and sentence-level sentiment analysis. We will also briefly discuss some applications of opinion analysis related to aggregating opinions and creating opinion summaries.
6. Dr. Mandar Mitra
Dr. Mandar Mitra is on the faculty of the Computer Vision and Pattern Recognition Unit at the Indian Statistical Institute, Kolkata. His research interests are in information retrieval, machine learning, and natural language processing.
Topic: Information Retrieval Evaluation
Abstract: This talk will focus on evaluation issues in Information Retrieval. The basic experimental framework and metrics, as embodied in the Cranfield paradigm, will be covered. The talk will also look at the assumptions behind this approach, some of the problems with this paradigm, and some proposed solutions. An overview of the main evaluation fora for IR -- TREC, CLEF and NTCIR -- will be given. FIRE - a proposed forum for IR evaluation for Indian languages - will also be briefly described. Time permitting, benchmark datasets and evaluation metrics for some IR tasks besides the basic document retrieval task will also be covered.
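As a small illustration of the kind of metrics used in Cranfield-style evaluation, the sketch below computes precision, recall, and (mean) average precision for ranked result lists against a set of relevance judgments. The function names and the toy document identifiers are hypothetical, and the abstract does not say which metrics the talk covers; this is only one common choice.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all judged-relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average AP over queries; `runs` is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy judgments: d9 is relevant but never retrieved.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```

Here two of the five retrieved documents are relevant, so precision at 5 is 2/5 and recall at 5 is 2/3; average precision is (1/2 + 2/4) / 3 = 1/3.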
7. Vasudeva Varma
Vasudeva Varma has been a faculty member at the International Institute of Information Technology, Hyderabad (IIIT-H) since 2002. Prior to joining IIIT-H, he was the president of MediaCognition India Pvt. Ltd and Chief Architect at MediaCognition Inc. (Cupertino, CA). Earlier he was the Director of Engineering and Research at InfoDream Corporation, Santa Clara, CA. He also worked for Citicorp and Muze Inc. in New York as a senior consultant.
His areas of interest include search and information extraction, knowledge management and software engineering. He heads the Search and Information Extraction Lab at the Language Technologies Research Center (LTRC) and the Software Engineering Research Lab.
Topic: Personalization and IR/IE
Abstract: Personalization within information retrieval, extraction and access (IR, IE and IA) has become a very interesting research area in recent years. Explicit and implicit relevance feedback is being used in various ways to improve the relevance of search results and to personalize these results. In this talk, I will explore the links between IR/IE/IA technologies and personalization. Various personalization approaches, such as statistical language modeling and machine learning, will be discussed. I will also share some of the ongoing work at the Search and Information Extraction Lab (SIEL) of the Language Technologies Research Centre at IIIT Hyderabad on using these approaches to build a personalized web search engine for mobile phones and a personalized summarization system.
8. Dr. A Kumaran
Dr. A Kumaran is the head of the Multilingual Systems research group at Microsoft Research India, in Bangalore. His areas of interest are database and information retrieval systems in multilingual domains. His current areas of research include machine transliteration, cross-lingual information retrieval systems, and data creation methodologies that support such research, such as web mining and collaborative data creation.
Topic: Mining Multilingual Named Entities from the Web
Abstract: Named entities play a significant role in Information Retrieval (IR), as they constitute a significant fraction of query terms in search engines. They play an even greater role in Cross-Language Information Retrieval (CLIR), since, in addition to their significant presence, named entities occurring in the query in the source language need to be replaced by their transliteration equivalents in the target language during query translation. Dictionaries, both hand-crafted and statistical, lack sufficient coverage of named entities, as new named entities are added every day to the vocabulary of languages. Machine transliteration, on the other hand, offers only limited help, as it very often produces misspelled or incorrect transliterations. A more promising approach is to mine named entity transliteration equivalents from comparable corpora and use them during query translation. In this talk, the techniques employed to identify named entities in a multilingual corpus and to pair them up appropriately will be presented. These techniques range from named entity identification to phonetic mapping between languages, and depend on frequency information to match entities with high confidence. We also present our current work on techniques that rely very little on linguistic information or frequency cues, and hence can scale well to many languages.
9. Chiranjib Bhattacharyya
Chiranjib Bhattacharyya is on the faculty of Computer Science and Automation at the Indian Institute of Science, Bangalore. His research interests are in statistical machine learning and convex optimization.
Topic: Large Scale Classification
Abstract: In this talk we will discuss two approaches for training maximum margin classifiers on large datasets. The first approach is based on Chance-Constrained Programming (CCP). We formulate maximum margin classification as a CCP. We then derive a Second Order Cone Programming formulation by approximating the chance constraints with the Chebyshev-Cantelli inequality. The resultant algorithm gives a classifier which scales well on large datasets. In the second approach we solve the standard SVM classification problem by random projections. Both algorithms show favourable empirical performance.
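The step from a chance constraint to a second-order cone constraint can be sketched generically as follows. The notation is assumed here and is not taken from the talk: x is a random feature vector with mean \mu and covariance \Sigma, (w, b) is the hyperplane, and \eta \in (0,1) is the required probability. The exact constraints used in the speakers' papers may differ.

```latex
% Assumed notation: Z = w^\top x + b, with mean m = w^\top \mu + b > 0
% and variance s^2 = w^\top \Sigma w.
\begin{align*}
\Pr(Z \le 0) &\le \frac{s^2}{s^2 + m^2}
  && \text{(Chebyshev-Cantelli, one-sided)} \\
\frac{s^2}{s^2 + m^2} \le 1 - \eta
  &\iff m \ge \kappa\, s,
  \qquad \kappa = \sqrt{\frac{\eta}{1 - \eta}} \\
\text{hence}\quad
w^\top \mu + b \ge \kappa \sqrt{w^\top \Sigma\, w}
  &\implies \Pr\!\left(w^\top x + b \ge 0\right) \ge \eta .
\end{align*}
```

The final inequality bounds an affine function of (w, b) below by a scaled norm of \Sigma^{1/2} w, which is the form of a second-order cone constraint. It depends only on the first two moments of the data, not on individual points.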
The talk is based on the following two papers
Joint work with Krishnan Suresh Kumar, J. Saketha Nath, Prof. R. Hariharan, and Prof. M. N. Murty
10. Dr. Sobha L
Sobha L. is a faculty member at the AU-KBC Research Centre, Anna University, Chennai, and heads the Computational Linguistics Group. Her research is in the fields of anaphora resolution and information extraction. She has worked extensively in the area of NLP for both English and South Indian languages.
Topic: What Are Anaphors and How Are They Resolved?
Abstract: Interpretation of anaphora is necessary for any natural language processing application. An anaphor is defined as any entity that requires a referent earlier in the text. This talk focuses on different types of anaphors and how they are resolved. Important work in this area, as well as work done for Indian languages, will be dealt with in detail.
11. Dr. Sivaji Bandyopadhyay
Dr. Sivaji Bandyopadhyay is on the faculty of the Computer Science & Engineering Department at Jadavpur University, Kolkata. His research interests are machine translation, automatic text summarization, question answering systems, information retrieval and extraction, and emotion analysis. Dr. Bandyopadhyay is associated with a number of significant language technology projects and has published in prestigious conferences and journals.
Topic: Automatic Summarization
Abstract: A summary can be loosely defined as a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s). Text can refer to speech, multimedia documents, hypertexts, etc. Automatic summarization systems can be broadly classified as single-document versus multi-document, monolingual versus multilingual versus cross-lingual, text versus multimedia, query-independent versus query-dependent, and extraction-based versus abstraction-based summarizers. Information extraction (IE) and text summarization (TS) are key technologies aiming at extracting relevant information from texts and presenting the information to the user in condensed form. These technologies, however, face new challenges with the adoption of the Web 2.0 paradigm (e.g. blogs, wikis) because of its inherently multi-source and multilingual nature.
The present talk will concentrate on both query-independent and query-dependent, sentence-extraction-based, single-document and multi-document summarization systems. We will present our work on a query-independent summarizer that extracts key sentences from a set of two or more related news documents by automatically correlating the textual contents of the document set and clustering similar texts having related (sub)topical features. A graph-theoretic framework has been proposed for the system. The idea is extended to produce a query-dependent summary from a set of related news documents. The algorithms have been tested with standard DUC data sets.
12. Sunita Sarawagi
Sunita Sarawagi researches in the fields of databases, data mining, machine learning and statistics. Her current research interests are information integration, graphical and structured models, and probabilistic databases. She is an associate professor at IIT Bombay. Prior to that she was a research staff member at IBM Almaden Research Center. She received her PhD in databases from the University of California at Berkeley and a bachelor's degree from IIT Kharagpur. She has several publications in databases and data mining, including a best paper award at the 1998 ACM SIGMOD conference, and several patents. She is on the editorial boards of ACM TODS, ACM TKDD, and the FnT for Machine Learning journal. She serves on the board of directors of ACM SIGKDD and VLDB. She is program chair for the ACM SIGKDD 2008 conference and has served as a program committee member for the SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences.
Topic: Max-margin training and inference on structured models for information extraction
Abstract: Feature-based structured models provide a flexible and elegant framework for various information extraction (IE) tasks. These include label sequences for traditional IE, segmentation models for entity-level extractions, and skip chain models for collective labeling. I will present efficient inference algorithms for finding the highest scoring (MAP) prediction for two interesting types of structured models in IE.
I will then present our recent results in max-margin training of such models. There are two popular formulations for maximum margin training of structured spaces: margin scaling and slack scaling. While margin scaling is extremely popular since it requires the same kind of MAP inference as prediction, slack scaling is believed to be more accurate and better-behaved. I will describe an efficient variational approximation to the slack scaling method that solves its inference bottleneck while retaining its accuracy advantage over margin scaling. Further, I argue that existing scaling approaches do not separate the true labeling comprehensively while generating violating constraints. I will propose a new max-margin trainer, PosLearn, that generates violators to ensure separation at each position of a decomposable loss function.
13. Soumen Chakrabarti
Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors.
He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project.
In 1999 he joined the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, where he has been an Associate Professor since 2003. In Spring 2004 he was a Visiting Associate Professor at Carnegie Mellon University.
He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences as well as Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as technical advisor to search companies and vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and guest editor or editorial board member for DMKD and TKDE journals. He is also author of a book on Web Mining.
His current research interests include integrating, searching, and mining text and graph data models, exploiting types and relations in search, and Web graph and popularity analysis.
Topic: Building Blocks for Semantic Search: Ranking and Indexing in Entity-Relation Graphs.
Abstract: Relational, XML and IR systems are converging to text-enabled search in entity-relation (ER) graphs (V,E) where nodes V are typed entities (email, paper, person, conference, company) and edges E are typed relations (wrote, cited, works-for). ER-graphs also arise when sentences are parsed, or named entities annotated, and then these annotations are connected to lexical networks and ontologies.
Queries involve uninterpreted strings and structure restrictions. We may wish to rank mentions of instances of a given type (distance) based on their textual proximity to query keywords (Rome Helsinki). Such named-entity searches are crude but effective filters for simple questions like "what is the distance between Rome and Helsinki". Or, in a desktop or enterprise search setting, we may wish to rank potential reviewers for a paper, based on proximity along diverse paths between the paper under consideration and people in our emails and address books.
We are building public-domain graph database systems that can deal with both forms of proximity search. I will focus on two aspects of the work. The first is the design of broad families of parameterized ranking functions suited to ER graphs, which can be trained effectively from relevance judgments. The second is the design of new kinds of workload-cognizant indices to support graph proximity queries, while striking useful trade-offs between index space and query execution cost. I will describe new techniques to estimate query-processing cost and cost-driven index compaction based on query logs.
Joint work with Alekh Agarwal, Manish Gupta, Amit Pathak, Kriti Puniyani, Sujatha Das
More info at http://www.cse.iitb.ac.in/~soumen/doc/netrank/ and http://www.cse.iitb.ac.in/~soumen/doc/www2006i/
Partly supported by IBM Research and Microsoft Research.
14. Dr. Rajat Mohanty
Dr. Rajat Mohanty is a Research Scientist in the R&D Unit at AOL India, Bangalore. He is also a Visiting Faculty member in the School of Language Sciences at the EFL University, Hyderabad. Prior to this, he was a Research Member of the NLP-AI Research group at IIT Bombay. His research interests include syntactic parsing, semantics extraction, interlingual MT, and information retrieval.
Topic: Challenges for Semantic Argument Realization involving Event Composition
Abstract: Semantics extraction from natural language text has become increasingly important in recent years. This talk focuses on the richness and complexity of the phenomena falling under the rubric "verb meaning and argument realization". The challenge is to explain why two verbs (e.g., hit and break) show divergent behaviour and which facets of the meaning of verbs are relevant for the mapping from lexical semantics to syntax. In this context it is important to recognize that verb meanings represent linguistic construals of happenings in the real world and, thus, may pick up on only certain facets of these happenings. The semantic determinants of argument realization are centered on individual arguments of a verb, but interactions between arguments that affect argument realization cannot be ignored. These interactions suggest that there are precedence (or prominence) relations among arguments, statable in terms of their semantic roles. The talk also throws light on the variety and complexity found in the association of semantic roles with their morphosyntactic realizations, showing that argument realization involves much more than the commonly assumed agent-subject, patient-object associations. The content of semantic roles can be unpacked in terms of bundles of binary features.
15. Dr. Pushpak Bhattacharyya
Dr. Pushpak Bhattacharyya is on the faculty of the Computer Science and Engineering Department at IIT Bombay. His research interests are natural language understanding, machine translation, information retrieval and extraction, machine learning and neural networks. He has published widely in prestigious conferences and journals, including the Pattern Recognition Journal, the Journal of Machine Translation, the WWW Conference, the AAAI conference, and the Knowledge Based Computer Systems conference.
Dr. Bhattacharyya is leading a number of significant language technology projects. He has a number of collaborations with research institutions and universities in India and abroad.
Topic: Representation and Extraction of Semantic Relations
Abstract: We discuss our approach to creating graphs of semantic-role-labeled arcs and partially disambiguated nodes from English sentences. The multi-staged process makes use of a number of tools and resources, and is based on rule-based components built over a long period of time from a set of home-grown lexical knowledge bases. Such semantic graphs are used in machine translation, keyword extraction and high-accuracy information retrieval, the last of which will be briefly touched upon. The scheme falls within the general framework of dependency parsing, which has interesting differences from constituent-based parsing. We briefly dwell on the merits and demerits of current probabilistic parsers in the context of their utility in semantics extraction.