Subspace methods in information Retrieval (IR)

Synopsis

This a chanllenging while interesting project! In this project, your task is to design a good information (text) retrieval program in matlab. Information Retrieval is a long-existing and very-active research area. For a general description of Inforamtion retrieval topic, you can check this Information Retrieval HUB .

Design & implementation a decent text retrieval program, and compete with other students' implementations.
Understand the numerical computing methods/ideas that prevails these algorithms, gain skills/experiences in serious research/implementation work.

In this sections, we just brief some basic algorithms as examples. Please refer to corresponding references for details. Also you are encourage to investigate & implement more advanced algorithms. Basically, The following implementations are expected:

SVD factorization method, to get left/right eigen-vectors of data matrix, provide basis for the subspace model. (as described mainly in reference 1, 2 & 3)
hierarchical agglomotive clustering + similarity-based classification method. (by ways of reference 4, 5.)
(OPTIONAL) implement probabilistic LSI method. (as of reference 6.)
train & test your codes on the provided dataset

Note that though typically you are expecting to implement the above-mentioned algorithms, you are free to choose more advanced(?) methods, or come up your own novel ideas to implement (still you need to make your implementation testable/comparable with the same document-term dataset as used by other students).

Data and codes

Only Dataset are provided.

The Data is a subset of the 20 newsgroups, with only the articles from "talk.politics.misc" and "talk.religion.misc" are used, giving a 2 class problem. Note that the categorization was made without inspecting the actual content of the articles, so some outliers can be expected. (For more information, please consult this paper) You will need to divide it into two parts: training data and testing data (i.e. 90% for training and 10% for testing). Your program could learn offline from the training data, and you can get a feeling of its performance by running on the testing data.
Basic performance evaluation method will be provided, to decide the effectiveness of your program.

Suggested steps and deliverables

Because it is a smaller project, so 1-2 people is expected to form into one project group, and each person is responsible for a portion.
The final output should consist of a) a write-up with clear&detailed information about the algorithms used, theories and practical implementations of these algorithms, simulations and results, future extension and a clear references. b) Your codes with user friendly GUI & Help files, with some examples to show how it works.
You are encouraged to participate in the google programming contest , to evaluate your software with real webpage data collected by Google Inc..

Marking scheme

The project will be marked based on the performance of your group's outcome, also we will take into account your own contribution.
A testing database will be built, including a collection of documents with variant contents.
Extra credits will be given to thoese with Probabilistic LSI method (or other subspace methods not mentioned above) implemented.
Extra credits will be given to novel ideas (new algorithms, tricks to improve the detection rate).

References

1. Basic one: Using Linear Algebra for Intelligent Information Retrieval. Michael W. Berry and Susan T. Dumais, and Gavin W. O'Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595
2. More general treatment: Matrices, Vector Spaces, and Information Retrieval M.W. Berry, Z. Drmac, and E.R. Jessup. SIAM Review 41:2, (1999), pp. 335-362
3. a simple explanation of vector-space model and IR
4. SVD-based hierarchical agglomotive clustering
5. Text Classification in a Hierarchical Mixture Model for Small Training Sets, Kristina Toutanova, Francine Chen, Kris Popat & Thomas Hofmann, Proceedings 10th International Conference on Information and Knowledge Management, 2001
6. Probabilistic Latent Semantic Indexing Thomas Hofmann, Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99)

Contacts: Each project will have an instructor assigned Last update: Mon, Feb 11, 2002 @ 20:12:00 MST