CS 3150 Spring 2013
Information Storage and Retrieval (3 hours)

Last updated Jan 13, 2013
Objective: To understand computer-based automatic indexing and retrieval of text- and web-based information.
Book: Experiments in Information Storage and Retrieval Using Mumps (to be distributed)

The Mumps Programming Language (Note: copies of the PDF for this will be sent to members of the class once the email class list has been established. If you want a printed copy, I have a limited number available for $2.50 each.)

Notifications: Assignments and other notifications will be sent by email. If you have blocked your email address or you register late, you will not be on the class list.

If you are not on the initial registration list provided by the Registrar, it is your responsibility to add yourself to the class list. See the instructions on my homepage. You may add additional email addresses to this list if the default is not your primary account.

Course Materials: Slides in PDF Format

Book

Collection of data bases, examples and text

Requirements: The requirements consist of a set of assignments to be performed individually and a project which may be done either individually or in small groups (up to 3). As this course depends heavily on lecture content, attendance is required. Excessive absence will result in a reduced grade for the project. Extra credit may be awarded for attendance at selected presentations and seminars, both on campus and off. These opportunities will be announced in class.

  1. Term Project (25%)
  2. Assignments (30%) Assignments are due on the date assigned. Late assignments may be penalized 10% per class day late.
  3. Tests (2 at 15% each)
  4. Final (15%).
  5. Extra Credit (up to 5%)
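As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function names, the 0-100 score scale, and the reading of the late penalty as 10 points per class day are assumptions made for illustration; they are not a statement of the actual grading procedure.

```python
# Weights come from the list above; everything else is an illustrative assumption.
WEIGHTS = {
    "project": 0.25,
    "assignments": 0.30,
    "test1": 0.15,
    "test2": 0.15,
    "final": 0.15,
}
MAX_EXTRA_CREDIT = 5.0  # percentage points, "up to 5%"


def late_adjusted(score, class_days_late, penalty_per_day=10.0):
    """Assumed reading: subtract 10 points (0-100 scale) per class day late."""
    return max(0.0, score - penalty_per_day * class_days_late)


def course_grade(scores, extra_credit=0.0):
    """Weighted sum of component scores (each 0-100) plus capped extra credit."""
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return base + min(extra_credit, MAX_EXTRA_CREDIT)
```

For example, `late_adjusted(90, 2)` treats an assignment scored 90 and turned in two class days late as worth 70 under this assumed reading.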

Classes are lecture format. Cell phones, pagers, and PDAs may not be used.

Laptop use is permitted if class related. YouTube, Facebook, email, chat (any flavor), and Angry Birds (or any other game) are not class related.

Assignments may be submitted by email as plain text, PDFs or image files (jpg or png). The subject line must contain the course (CS 3150), your name and the assignment number. If these are not present, the email will not be accepted.
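Before sending, you can check a subject line against the three required items. The sketch below is one possible check in Python; the function name and the matching rules (case-insensitive, any digits outside the course number count as the assignment number) are assumptions, not the filter actually applied to incoming mail.

```python
import re

COURSE_PATTERN = r"\bcs\s*3150\b"


def subject_is_acceptable(subject, student_name):
    """Return True if the subject names the course, the student, and an assignment number."""
    text = subject.lower()
    has_course = re.search(COURSE_PATTERN, text) is not None
    has_name = student_name.lower() in text
    # Remove the course number first so "3150" is not mistaken for the assignment number.
    remainder = re.sub(COURSE_PATTERN, "", text)
    has_number = re.search(r"\d+", remainder) is not None
    return has_course and has_name and has_number
```

Under these assumptions, a subject such as "CS 3150 Jane Doe Assignment 3" passes, while "CS 3150 Jane Doe" does not because no assignment number is present.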

If an assignment involves programming, you must turn in your code and an example of its execution. If you elect to submit code without output, your program will be assumed to be non-functional and there will be an automatic 40% deduction.

Test 1 TBA
Test 2 TBA
Final: Click Here
Makeup Tests: Makeup tests will be given only in cases of demonstrated need for causes such as serious illness, family emergency or University-sanctioned schedule conflict. In all cases, written documentation will be required. Where a makeup test is granted, it must be taken within 1 week of the originally scheduled exam or, in the case of illness, within 1 week of your return to classes.
Penalty for Hacking & Cheating: A grade of "F" for the course and possible University disciplinary action. If your work duplicates in whole or part the work of another, both works will receive a grade of F.

You may use material from the Internet if you document the source (URL). Any undocumented use of the Internet in an assignment will be considered a serious case of cheating and will result in a grade of F for the course.

Project: Projects will be presented in class with a brief online demonstration. A vote will be taken for the best project and the winner will be exempt from the final (a grade of A will be entered for the final exam grade). You may work in teams of 1 to 3 (hint: have one person do the indexing, one the retrieval code and the other the web interface). All group members will receive the same grade.
Grades: I will send an anonymized spreadsheet by email each time there is a significant grading event. The spreadsheet will list the grades I have recorded for you and the final score, assuming all remaining work is perfect.
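The "final score, assuming all remaining work is perfect" can be read as a best-case projection. A minimal Python sketch of that arithmetic follows; the component names, the 0-100 scale, and treating each component as either fully recorded or not yet recorded are simplifying assumptions, not the spreadsheet's actual formula.

```python
# Course weights from the Requirements list; component names are illustrative.
WEIGHTS = {
    "project": 0.25,
    "assignments": 0.30,
    "test1": 0.15,
    "test2": 0.15,
    "final": 0.15,
}


def best_possible_score(recorded, weights=WEIGHTS):
    """Recorded components count as scored; unrecorded ones are assumed to earn 100."""
    total = 0.0
    for name, weight in weights.items():
        total += weight * recorded.get(name, 100.0)
    return total
```

With only assignments (80) and the first test (70) recorded, the projection is 0.25*100 + 0.30*80 + 0.15*70 + 0.15*100 + 0.15*100 = 89.5.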

If you see an error, please contact me immediately to have it corrected. If I do not hear from you within a reasonable period of time, I will assume that all grades are correct and that no typos have occurred.

The spreadsheet is sent by email to the class list, so you must register for the class email list. You must also give me a secret code word by which your row in the spreadsheet will be known. Spreadsheet rows will be randomized and not in alphabetic order. If you do not give me a code word, your row will not appear in the spreadsheet.

Contact Click Here
Computer: You may want to use your own computer to do the assignments and project. If that is not the case, you may have an account on one of my Linux servers.

The assignments and project are extremely I/O intensive. If you use your own machine, a desktop with lots of memory is best.

Assignments will be done in Linux. I will distribute an Oracle VirtualBox virtual disk with a preconfigured Debian system that contains the code and databases. You will need to install the (free) Oracle VirtualBox and then install the distributed virtual disk in it.

Data Base: To be distributed.
Getting Started
  • Learn an adult editor (nano/pico are for children) : vi Tutorial
  • Learn Mumps.
  • HTML BareBones HTML Guide
    Other Resources: MeSH

    Documentation and Availability
    Documentation Descriptor Data Elements
    Descriptor Records
    Documentation Qualifier Data Elements
    Qualifier Records
    Supplementary Records

    Resources:

    Salton, G., Automatic Information Organization and Retrieval, McGraw-Hill (1968)
    Salton, G., Automatic Text Processing, Addison-Wesley (1989)
    Salton, G., ed., The SMART Retrieval System, Prentice-Hall (1971).
    Salton, G., and McGill, M., Introduction to Modern Information Retrieval, McGraw-Hill, (1983)
    Borko, H., Automated Language Processing, (1968)
    The Smart System from Cornell: ftp://cs.cornell.edu/pub/smart


    Data Sets for Machine Learning
    WordNet
    Apache
    Web Archive (you think you have disk capacity problems!)
    PostgreSQL Tutorial
    Rod Library Electronic Resources
    ACM Digital Library
    Lemur Toolkit for Language Modeling
    Cornell Smart System
    SIGIR List and Archives
    UVA Electronic Text Center
    NLM Gateway
    Digital Library Research Laboratory
    Automatic Text Browsing Using the Vector Space Model
    Lawrence Berkeley Lab Science Articles Archive
    The Internet Archive
    Search Engine Features
    Anatomy of a Search Engine (Google)
    Medline (National Library of Medicine)
    Information Retrieval by C. J. van Rijsbergen
    Modern Information Retrieval Chapter 10
    Cystic Fibrosis Reference Collection
    Marti Hearst Site
    Online Papers
    Ed Fox Links
    IIT IR Publications
    Web IR
    WWW 10
    WWW 9
    WWW 8
    WWW 7
    WWW 5
    IRIS Project
    Lots of Links
    Top Ten Issues
    Searching Genomic Databases
    Amazon.com's recommender algorithm
    Huffman Trees
    Knuth Optimal Binary Trees
    Hu-Tucker Trees
    AVL (Balanced) Trees
    B trees and AVL Trees
    B Trees
    IBM Clever Project
    flex documentation
    Homology Searching
    Project Gutenberg

    The following notice is required by the University:

    Students seeking disability accommodations are directed to see: UNI Policy 13.15 Accommodations of Disabilities