Subspace methods in information Retrieval (IR)
Synopsis
This a chanllenging while interesting project!
In this project, your task is to design a good information (text) retrieval program in matlab. Information Retrieval is a
long-existing and very-active research area. For a general description of Inforamtion retrieval topic, you can check this Information Retrieval HUB .
- Design & implementation a decent text retrieval program, and compete with other students' implementations.
- Understand the numerical computing methods/ideas that prevails these algorithms, gain skills/experiences in serious
research/implementation work.
In this sections, we just brief some basic algorithms as examples. Please refer to corresponding references for details. Also
you are encourage to investigate & implement more advanced algorithms.
Basically, The following implementations are expected:
- SVD factorization method, to get left/right eigen-vectors of data matrix, provide basis for the subspace model. (as
described mainly in reference 1, 2 & 3)
- hierarchical agglomotive clustering + similarity-based classification method. (by ways of reference 4, 5.)
- (OPTIONAL) implement probabilistic LSI method. (as of reference 6.)
- train & test your codes on the provided dataset
Note that though typically you are expecting to implement the above-mentioned algorithms, you are free to choose more
advanced(?) methods, or come up your own novel ideas to implement (still you need to make your implementation
testable/comparable with the same document-term dataset as used by other students).
Data and codes
Only Dataset are provided.
- The Data is a subset of the 20 newsgroups, with only the articles from "talk.politics.misc" and "talk.religion.misc" are used, giving a 2 class problem. Note that the categorization was made without inspecting the actual content of the articles, so some outliers can be expected. (For more information, please consult this paper) You will need to divide it into two parts: training data and testing data (i.e. 90% for training and 10% for testing). Your program could learn offline from the training data, and you can get a feeling of its performance by running on the testing data.
- Basic performance evaluation method will be provided, to decide the effectiveness of your program.
Suggested steps and deliverables
- Because it is a smaller project, so 1-2 people is expected to form into one project group, and each person is
responsible for a portion.
- The final output should consist of a) a write-up with clear&detailed information about the algorithms used, theories and
practical implementations of these algorithms, simulations and results, future extension and a clear references. b) Your
codes with user friendly GUI & Help files, with some examples to show how it works.
- You are encouraged to participate in the google programming
contest , to evaluate your software with real webpage data collected by Google Inc..
Marking scheme
- The project will be marked based on the performance of your group's outcome, also we will take into account your own
contribution.
- A testing database will be built, including a collection of documents with variant contents.
- Extra credits will be given to thoese with Probabilistic LSI method (or other subspace methods not mentioned above)
implemented.
- Extra credits will be given to novel ideas (new algorithms, tricks to improve the detection rate).
References
- 1. Basic one:
Using Linear Algebra for Intelligent Information Retrieval.
Michael W. Berry and Susan T. Dumais, and Gavin
W. O'Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595
- 2. More general treatment:
Matrices, Vector Spaces, and Information Retrieval M.W. Berry, Z.
Drmac, and E.R. Jessup. SIAM Review 41:2, (1999), pp. 335-362
- 3. a simple explanation of vector-space model and
IR
- 4. SVD-based hierarchical agglomotive
clustering
- 5. Text Classification in a Hierarchical
Mixture Model for Small Training Sets, Kristina Toutanova, Francine Chen, Kris Popat & Thomas Hofmann, Proceedings 10th
International Conference on Information and Knowledge Management, 2001
- 6. Probabilistic Latent Semantic Indexing
Thomas Hofmann, Proceedings of the 22nd International
Conference on Research and Development in Information Retrieval (SIGIR'99)
Contacts: Each project will have an instructor assigned
Last update:
Mon, Feb 11, 2002 @ 20:12:00 MST