Experiments in
Information Retrieval
and
Document Classification

Kevin C. O'Kane
Professor Emeritus
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50613

March 28, 2024

The ISR code is now packaged with the Mumps language code. The Mumps distribution (see below) contains the ISR code and a small subset of the OHSUMED database, the database used in these experiments. A link to the full database is given below.

  • ISR Book: Implementing Information Retrieval Algorithms Using Mumps
  • Mumps & ISR source code distribution
  • The OHSU Medline database used in these experiments was obtained from:

    https://trec.nist.gov/data/t9_filtering.html

    and reformatted.

    The following notes apply (see web site above for full discussion):

    • "... The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation, restricted to the following conditions:

      1. The data will not be used in any non-experimental clinical, library, or other setting.
      2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

      The OHSUMED document collection was obtained by William Hersh (hersh@OHSU.EDU) and colleagues for the experiments described in the papers below:

      Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.

      Hersh WR, Hickam DH, Use of a multi-application computer workstation in a clinical setting, Bulletin of the Medical Library Association, 1994, 82: 382-389. ..."

  • Example Output
  • Slides for Providence College Talk

This document introduces the collection of programs found in the Vector Space ISR Workbench.

The workbench presently consists of about fifty modular programs written in Mumps, bash script, or both. These programs implement the basic Vector Space Model for document classification and retrieval as originally developed by G. Salton [Salton 1968, 1983, 1988, 1992] and others. Also included is a collection of approximately 294,000 medical abstracts for testing and experiments.

The purpose of this package is to facilitate teaching, exploration, and experimentation with the vector space model and the development of new algorithms and techniques. The modular design of the code, together with the Mumps multidimensional database model, enables the user to experiment with, augment, and measure various indexing strategies.

Currently, the package contains programs that perform:

  1. word frequency analysis,
  2. stop list generation,
  3. word stemming,
  4. term weighting,
  5. synonym detection,
  6. phrase identification,
  7. term clustering,
  8. document clustering,
  9. document hyper-clustering, and
  10. several retrieval methods.
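The first two steps in the list above, word frequency analysis and stop list generation, can be sketched in a few lines. This is an illustrative Python sketch, not the workbench's Mumps code: the tokenizer, the function names, and the 80% document-fraction cutoff are all assumptions chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(docs):
    """Count the total occurrences of each word across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts

def document_frequencies(docs):
    """Count the number of documents in which each word appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts

def stop_list(docs, max_doc_fraction=0.8):
    """Hypothetical rule: words found in more than max_doc_fraction of the documents are stop words."""
    df = document_frequencies(docs)
    limit = max_doc_fraction * len(docs)
    return sorted(word for word, n in df.items() if n > limit)

docs = [
    "The heart pumps blood through the body",
    "The lungs exchange oxygen in the blood",
    "The kidney filters the blood",
]
freq = word_frequencies(docs)
stops = stop_list(docs)
print(freq.most_common(3))
print(stops)
```

In this tiny sample, a domain word such as "blood" appears in every document and so lands on the stop list alongside "the". That is the intended behavior of a frequency-based rule: a word that occurs everywhere does not help distinguish one document from another.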
The programs build:

  1. document-term matrix,
  2. term-document matrix,
  3. term-term matrix,
  4. document-document matrix,
  5. dictionary vectors giving:
    1. word frequency,
    2. document frequency,
    3. Zipf's Law coefficients,
    4. inverse document frequency weights [Salton 1968] and
    5. discrimination coefficients [Willet 1985].
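As a rough illustration of the structures listed above, the following Python sketch builds a sparse document-term matrix, transposes it into a term-document matrix, and derives inverse document frequency weights. It is not the workbench's Mumps implementation. The function names are hypothetical, and the weight uses one common form, log2(N / df), which is an assumption here; the exact formula used by the workbench is not stated in this document.

```python
import math
from collections import defaultdict

def build_doc_term(docs):
    """Return a sparse document-term matrix: doc_term[doc_id][term] = count."""
    doc_term = {}
    for doc_id, text in docs.items():
        row = defaultdict(int)
        for term in text.lower().split():
            row[term] += 1
        doc_term[doc_id] = dict(row)
    return doc_term

def transpose(doc_term):
    """Return the term-document matrix: term_doc[term][doc_id] = count."""
    term_doc = defaultdict(dict)
    for doc_id, row in doc_term.items():
        for term, count in row.items():
            term_doc[term][doc_id] = count
    return dict(term_doc)

def idf_weights(term_doc, n_docs):
    """Assumed IDF form: log2(N / df), where df is the number of documents containing the term."""
    return {term: math.log2(n_docs / len(postings))
            for term, postings in term_doc.items()}

docs = {
    1: "aspirin reduces fever",
    2: "aspirin reduces pain",
    3: "fever follows infection",
    4: "aspirin thins blood",
}
doc_term = build_doc_term(docs)
term_doc = transpose(doc_term)
idf = idf_weights(term_doc, len(docs))
for term in ("aspirin", "fever", "infection"):
    print(term, round(idf[term], 3))
```

Only non-zero entries are stored, so both matrices stay small even when the vocabulary is large. Terms that occur in fewer documents receive larger weights.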
There are programs to calculate:

  1. term phrases,
  2. term cohesion,
  3. proximity weighted term similarities,
  4. term clusters,
  5. document clusters and
  6. clusters of document clusters.
The package includes routines to retrieve documents based on:

  1. simple sequential searches,
  2. inverted file searches and
  3. weighted inverted file searches using document similarity metrics such as Cosine [Salton 1983].
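The difference between a sequential search and an inverted file search is that the latter only visits documents that share at least one term with the query. The Python sketch below illustrates this with cosine ranking. It is a minimal example under stated assumptions, not the workbench's code: raw term counts stand in for term weights, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build an inverted file, index[term][doc_id] = count, and the vector length of each document."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    squares = defaultdict(float)
    for postings in index.values():
        for doc_id, weight in postings.items():
            squares[doc_id] += weight * weight
    norms = {doc_id: math.sqrt(total) for doc_id, total in squares.items()}
    return index, norms

def search(query, index, norms):
    """Rank documents by cosine similarity, reading only the postings of the query terms."""
    q_counts = defaultdict(int)
    for term in query.lower().split():
        q_counts[term] += 1
    q_norm = math.sqrt(sum(c * c for c in q_counts.values()))
    dots = defaultdict(float)
    for term, q_weight in q_counts.items():
        for doc_id, d_weight in index.get(term, {}).items():
            dots[doc_id] += q_weight * d_weight
    ranked = [(doc_id, dot / (q_norm * norms[doc_id])) for doc_id, dot in dots.items()]
    # Highest similarity first; ties are broken by document id.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

docs = {
    1: "aspirin reduces fever",
    2: "fever follows infection",
    3: "aspirin thins blood",
}
index, norms = build_index(docs)
results = search("aspirin fever", index, norms)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Replacing the raw counts with weights such as inverse document frequency would turn this into a weighted inverted file search without changing the control flow.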
There are also indexing routines to organize the documents by:

  1. controlled vocabularies such as MeSH,
  2. KWIC/KWOC indices,
  3. n-grams [Manning 1999] and
  4. Soundex codes [US National Archives, 2007].
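Two of the indexing schemes in the list above are compact enough to sketch directly. The Python example below generates n-grams over any sequence and computes a four-character Soundex code following the commonly published rules (first letter kept, consonants mapped to digits, adjacent duplicates collapsed, h and w ignored between duplicates). It is an illustrative sketch with hypothetical function names, not the workbench's Mumps routines.

```python
SOUNDEX_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit

def soundex(name):
    """Return the 4-character Soundex code for a name, or "" if it contains no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        # h and w do not separate duplicate codes; vowels do.
        if c not in "hw":
            prev = digit
    return (code + "000")[:4]

def ngrams(items, n):
    """Return all runs of n consecutive items (works on word lists or character strings)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

for name in ("Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"):
    print(name, soundex(name))
print(ngrams("heart attack risk".split(), 2))
```

Names that sound alike, such as Robert and Rupert, map to the same code, which is what makes Soundex useful as an index key for approximate name matching.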
The experimental corpus provided (details given below) is the OHSU Medline collection used at the National Institute of Standards and Technology (NIST) Text Retrieval Conference 9 (TREC-9) [NIST 2000]. Other user-provided collections may also be used if their source text is formatted according to the input model.

Most of the code in these modules is written in Mumps, a language developed in medicine in the late 1960s [Barnett 1970, Bowie 1976, O'Kane 2008]. Mumps supports string handling and a multidimensional database model that is ideally suited to vector space model implementations. The Mumps modules are invoked by bash scripts, which control the flow of data and multitasking.
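To give a feel for why a multidimensional database model fits the vector space model, the Python sketch below imitates a sparse array indexed by several subscripts, such as a document id followed by a term. It is only an in-memory stand-in: the class and method names are hypothetical, and real Mumps global arrays are persistent and are accessed with the language's own syntax rather than through a class like this one.

```python
class GlobalArray:
    """In-memory stand-in for a sparse, multi-subscript array (names are hypothetical)."""

    def __init__(self):
        self._nodes = {}

    def set(self, *subscripts_and_value):
        """Store a value; the last argument is the value, the rest are subscripts."""
        *subscripts, value = subscripts_and_value
        self._nodes[tuple(subscripts)] = value

    def get(self, *subscripts, default=None):
        """Fetch the value at the given subscripts, or default if the node does not exist."""
        return self._nodes.get(tuple(subscripts), default)

    def children(self, *prefix):
        """Return the distinct next-level subscripts under prefix, in sorted order."""
        depth = len(prefix)
        found = {key[depth] for key in self._nodes
                 if len(key) > depth and key[:depth] == prefix}
        return sorted(found)

# A document-term matrix stored as doc(doc_id, term) = count.
doc = GlobalArray()
doc.set(1, "aspirin", 2)
doc.set(1, "fever", 1)
doc.set(7, "aspirin", 1)

print(doc.children())             # document ids that have at least one term
print(doc.children(1))            # terms present in document 1
print(doc.get(7, "fever", default=0))
```

Only the cells that were explicitly set exist, so a matrix with hundreds of thousands of documents and terms costs storage proportional to its non-zero entries, and walking the subscripts of one row visits just the terms of that document.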

The Mumps interpreter software used in these experiments is available for free download (GPL License) at:

HERE

 

References

[Salton 1968] Salton, G., Automatic Information Organization and Retrieval, McGraw Hill (New York, 1968).

[Salton 1971] Salton, G., ed., The SMART Retrieval System, Experiments in Automatic Document Processing, Prentice-Hall (Englewood Cliffs, NJ, 1971).

[Salton 1983] Salton, G., and McGill, M.J., Introduction to Modern Information Retrieval, McGraw Hill (New York, 1983).

[Salton 1988] Salton, G., Automatic Text Processing, Addison-Wesley (Reading, 1988).

[Salton 1992] Salton, G., The state of retrieval system evaluation, Information Processing & Management, Vol 28 No 4, pp. 441-449 (1992).

[NIST 2000] National Institute of Standards and Technology, Text Retrieval Conference 9 https://trec.nist.gov/pubs/trec9/t9_proceedings.html

[Willet 1985] Willett, P., An algorithm for calculation of exact term discrimination values, Information Processing and Management, Vol 21, No. 3, pp. 225-232 (1985).


Information Storage and Retrieval Videos


  1. Part 1
    https://www.youtube.com/watch?v=i-Lvxj6-cAQ

  2. Part 2
    https://www.youtube.com/watch?v=Xh-bwnKcT3w

  3. Part 3
    https://www.youtube.com/watch?v=y1e_FZ9A_-M

  4. Zipf's Law
    https://www.youtube.com/watch?v=omz_a5ydyb0

  5. Vector Space Model
    https://www.youtube.com/watch?v=5kkismynHlo

  6. Vector Space Model Matrices
    https://www.youtube.com/watch?v=eiSxFAWRkis

  7. KWIC/KWOC indices, stop lists and stemming
    https://www.youtube.com/watch?v=Vhj2ZRCgMDE

  8. Reducing the collection to word stems
    https://www.youtube.com/watch?v=BWmHJynIQfg

  9. Word pruning based on frequency
    https://www.youtube.com/watch?v=9mMmL70hPTU

  10. Document Term Matrix
    https://www.youtube.com/watch?v=zb4fvSY4-U0

  11. Global Array Overview
    https://www.youtube.com/watch?v=izhH68xXirc

  12. The Big Picture
    https://www.youtube.com/watch?v=yeI45tOQyYo

  13. Term Normalization and Weights
    https://www.youtube.com/watch?v=jEDJ3qgmoz0

  14. Parallel Processing
    https://www.youtube.com/watch?v=nkmAyP6bsyQ

  15. Term-Term Matrix Overview
    https://www.youtube.com/watch?v=YWEziRZC7g0

  16. Term-Term Matrix Calculation
    https://www.youtube.com/watch?v=catIUBRHCGM

  17. Term-Term Matrix for Full Collection
    https://www.youtube.com/watch?v=2FWMsNU-Zng

  18. Pruning the Document-Term Matrix
    https://www.youtube.com/watch?v=xfWw-V0eotY

  19. Inverse Document Frequencies
    https://www.youtube.com/watch?v=iHPwftg_PlI

  20. Weighting Terms in Documents
    https://www.youtube.com/watch?v=2zeqx5MSu3s

  21. Building a MeSH Tree
    https://www.youtube.com/watch?v=ObUAklaia1Y

  22. MeSH Tree Print Programs
    https://www.youtube.com/watch?v=x7ZVnYeQROc

  23. MeSH Index Program
    https://www.youtube.com/watch?v=6w4PpEU9EBM

  24. MeSH Titles Program
    https://www.youtube.com/watch?v=GA3xneZNHPY

  25. Find MeSH Terms and Sub-Terms
    https://www.youtube.com/watch?v=3ZOiN5ZF7MQ