Domain specific sense marked corpus 2009-current


For a detailed description of the preliminary version of this corpus please read:

Mitesh M. Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.

Please cite the paper if you use this corpus in your work.

Note that the corpus has been enhanced since the publication of the above paper: more words from both domains have been tagged, and sentences containing fewer than four words have been removed. For the purpose of comparison, results obtained using our algorithm on this enhanced dataset will be made available on this page soon.


Release v1.0

The first version of this corpus was released on 10-July-2010. The corpus is freely available for research purposes under GPL 2.0.

Download

English Tourism Sense Marked corpus (XML Format) (Untagged)
English Health Sense Marked corpus (XML Format) (Untagged)
Hindi Tourism Sense Marked corpus
Hindi Health Sense Marked corpus
Marathi Tourism Sense Marked corpus
Marathi Health Sense Marked corpus

Size of the corpus

The total number of sense marked words (including monosemous words) in each domain, broken down by POS category, is as follows:

POS Category    Tourism    Health
Noun              72932     52230
Verb              26086     24291
Adjective         32499     22699
Adverb             9820      8555
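
The per-category counts above can be summed to get the overall number of sense marked words in each domain. The following is a minimal sketch only: the figures are copied from the table, while the names COUNTS and domain_totals are introduced here for illustration and are not part of the corpus distribution.

```python
# Counts of sense marked words per POS category, copied from the table above.
# COUNTS and domain_totals are illustrative names, not part of the corpus release.
COUNTS = {
    "Noun":      {"Tourism": 72932, "Health": 52230},
    "Verb":      {"Tourism": 26086, "Health": 24291},
    "Adjective": {"Tourism": 32499, "Health": 22699},
    "Adverb":    {"Tourism": 9820,  "Health": 8555},
}


def domain_totals(counts):
    """Sum the per-POS counts to get one total per domain."""
    totals = {}
    for per_domain in counts.values():
        for domain, n in per_domain.items():
            totals[domain] = totals.get(domain, 0) + n
    return totals


if __name__ == "__main__":
    for domain, total in domain_totals(COUNTS).items():
        print(f"{domain}: {total}")
    # Tourism: 141337
    # Health: 107775
```

On these figures the Tourism domain has about 31% more sense marked words than the Health domain.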

References

  1. Mitesh M. Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.