December 24, 2025

ISR Experiments using Mumps and MDH

>> Do not run the ISR code with the Sqlite3 database, as it will
>> run very slowly. Use the native database only.

There are two versions of this software. One is written in Mumps and the
other is written in C++ using the MDH class library, which provides
Mumps-like facilities in C++.

>> Mumps Version

The Mumps version is in the directory 'Mumps-version' and the MDH version
is in the directory 'MDH-version'. This README file is used by both versions.

In the Mumps version, you may either run the Mumps code interpretively
or compile it and run the result. To select compilation (the default)
or interpretation, modify the 'compile' bash variable in the file
'control.script'. Set compile to 'yes' for compiled execution and 'no'
for interpretive execution. At present, the retrieval phase is written
for compiled execution only.
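
As an illustration only, a bash script can branch on such a variable as
shown below. This is a sketch, not the contents of 'control.script'; the
'mode' variable and the message are hypothetical.

```shell
# Hypothetical sketch: the real control.script may be organized differently.
compile=yes   # 'yes' = compiled execution (the default), 'no' = interpretive

if [ "$compile" = "yes" ]; then
    mode="compiled"
else
    mode="interpreted"
fi

echo "The Mumps code will run in $mode mode"
```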

You should not expect the same results from the Mumps versus the MDH version
as the algorithms differ. 

The Mumps version splits the database into segments and indexes/retrieves
from the segments in parallel. The number of segments is the number of
available CPUs.
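
To see how many segments to expect on your machine, you can query the CPU
count yourself. The sketch below also estimates documents per segment for a
sample of 5000 documents; the even split is an assumption made for
illustration, and the scripts may divide the work differently.

```shell
# Number of CPUs reported by the system (nproc on Linux, getconf elsewhere).
segments=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Assumption: documents are divided evenly; the last segment may be smaller.
docs=5000
per_segment=$(( (docs + segments - 1) / segments ))

echo "$segments segments, at most $per_segment documents each"
```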

Be sure to install the Mumps interpreter, compiler and MDH library contained
elsewhere in this distro.

Please make all files with the endings .script and .mps executable 
if you want to run the examples.
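
One way to do this is sketched below. The function name is hypothetical,
and it assumes you pass it the directory holding the scripts (the current
directory by default).

```shell
# Mark every .script and .mps file in the given directory as executable.
make_examples_executable() {
    find "${1:-.}" -maxdepth 1 -type f \
        \( -name '*.script' -o -name '*.mps' \) \
        -exec chmod +x {} +
}

# Run from inside 'Mumps-version' or 'MDH-version'.
make_examples_executable .
```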

The database provided is ohsu.medline. See the description at the end of
this file.

To use this database, it must first be converted to a different format
and the titles of the articles extracted. This will happen automatically
when you first run control.script and any time you run it thereafter
with a larger numeric sample size. Conversion can be time consuming.
Conversion reads the MEDLINE-formatted original database and converts
it to a format similar to FASTA format.

>> Running the Mumps Version

------------------------------------------------------------------------
NOTE: DO NOT RUN make or index.script. The main file is control.script.
It will invoke the other files as needed.
------------------------------------------------------------------------

    Index a portion of the database:

    The text database is in the parent directory of the version directory
    and has the file name 'ohsu.medline'.

    The first time this database is accessed, it will be converted to the
    files 'ohsu.converted' and 'titles.data'. This may take some time depending 
    on your system.  The converted database also resides in the parent directory.
    The conversion will not take place again unless you delete this file.

    To index the first 5000 documents in the database, type:

        ./control.script 5000

    The number shown is the number of documents from the database that will
    be indexed. The indexing will be done in X segments, where X is the
    number of CPUs on your machine.

    If you fail to specify a number of documents, 1000000 will be used.
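
    A default of this kind is commonly written in bash with parameter
    expansion. The sketch below is illustrative only; the function name
    'doc_count' is hypothetical and is not taken from control.script.

```shell
# Return the first argument, or 1000000 when no argument is given.
doc_count() {
    echo "${1:-1000000}"
}

doc_count 5000   # explicit document count
doc_count        # falls back to the default
```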

    Retrieval

    You may run a test retrieval with retrieve.script, which uses the query
    in the file 'tstquery'. Since your test database is very small, don't
    expect much in the way of results.

    The script file 'retrieve.script' will run the retrieval. Do not use the
    similarly named executable files as they will not work by themselves.

    The programs in 'retrieve.script' expect the name of a file containing a
    search query. If none is provided, they use the file 'tstquery'.

    A query file consists of one line of keywords separated by blanks. The
    combination of the keywords is the query vector.
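
    For example, a custom query file can be created and passed to the
    script as sketched below. The file name 'myquery' and the keywords are
    made up for illustration, and passing the file name as the first
    argument is an assumption based on the description above.

```shell
# Write a one-line query file of blank-separated keywords.
printf '%s\n' 'heart failure treatment' > myquery

# Run the retrieval only if the script is present in this directory.
if [ -x ./retrieve.script ]; then
    ./retrieve.script myquery
fi
```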


>> Running the MDH Version

    The MDH version uses the same ohsu database as the Mumps version and
    will convert it as needed, as described above.

    DO NOT RUN 'make' as this will be done by the scripts.

    To run the MDH version, type:

	./index.script <number>

    where <number> is the number of documents to index. The value 1000 is a good
    starting point.

    To test the retrieval, type:

	./retrieve.script

>> OHSU Medline Database

    The original OHSU Medline database was obtained from:

        https://trec.nist.gov/data/t9_filtering.html

    and then reformatted.

    From the website:

    "... The OHSUMED test collection is a set of 348,566 references from
    MEDLINE, the on-line medical information database, consisting of
    titles and/or abstracts from 270 medical journals over a five-year
    period (1987-1991). The available fields are title, abstract, MeSH
    indexing terms, author, source, and publication type. The National
    Library of Medicine has agreed to make the MEDLINE references in the
    test database available for experimentation, restricted to the
    following conditions:

    1. The data will not be used in any non-experimental clinical,
    library, or other setting.

    2.  Any human users of the data will explicitly be told that the data
    is incomplete and out-of-date.

    The OHSUMED document collection was obtained by William Hersh
    (hersh@OHSU.EDU) and colleagues for the experiments described in the
    papers below:

    Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive
    retrieval evaluation and new large test collection for research, 
    Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.

    Hersh WR, Hickam DH, Use of a multi-application computer workstation
    in a clinical setting, Bulletin of the Medical Library Association,
    1994, 82: 382-389. ..."

    Full details on the original format are at the web site referenced above.
