Information Retrieval, its origins in Newcastle

Elizabeth Barraclough

The year was 1964. We had a relatively new, powerful computer, the KDF9, with 100K bytes of store and 4 magnetic tape drives; on this machine all things were possible. George Smart was Post-graduate Sub Dean of Medicine and Ewan Page was Director of the Computing Laboratory. Both were keen to see the development of their disciplines and were aware of the activity at the National Library of Medicine (NLM) in Washington DC, where computers were being used to produce the printed Index Medicus and to search the accumulated data. The question they asked was: could we do the searching here?

Clearly, if such a facility was to be provided it had to be on a national basis and would require close cooperation with the NLM to obtain copies of the retrospective data and, on a regular basis, the new material. We also needed funding to allow the computer system and the searching skills to be developed. At that time the government had a Department of Scientific and Industrial Research (DSIR), but they had recommended setting up a department to promote research into 'handling the rising tide of scientific literature', so the Office for Scientific and Technical Information (OSTI) was set up under the Department of Education and Science in April 1965. One of the first grants awarded was to Newcastle University: £41,600 over 4 years for the MEDLARS experiment.

In parallel with our seeking funding for the computing element, the National Lending Library (NLL) at Boston Spa were negotiating with NLM for access to the data. NLL were the obvious people to do this, as they would be the source of the medical articles required as a result of performing any computer searches. They already had expertise in using the printed Index Medicus for manual searching, so knew the system and all its complexities very well. The agreement that NLL reached with NLM was that, in exchange for the provision of the data, NLL would undertake the indexing of some of the British medical literature, eg the BMJ. NLL were thus faced not only with learning searching techniques for Medlars but also with learning the very rigorous indexing system.

The task we had in 1965 was to get a system working for the Medlars data on the KDF9 in Newcastle. I was seconded by the University to this project and in March 1965 flew to the US and joined Tony Harley from NLL at the National Library of Medicine in Bethesda, just north of Washington. Tony had been seconded by British Library to tackle the indexing and searching, whilst I had the task of designing and writing a system for the KDF9. We worked together as a team: Tony didn't understand the computing side and I was wholly ignorant of the indexing and searching, but it was essential for the success of the project that we should each understand enough of the other's problems.

The computer system at NLM was written for an Illiac machine in machine code, as all systems were at that time. All I could use from their system were the flow charts for the basic operations. The NLM system contained far more function than was needed in the UK: they had to create the database from the indexing records and provide all the facilities for printing the monthly volumes of Index Medicus, as well as providing the search facilities. Although some indexing was to be done in the UK, the input to the computer was to take place at NLM, so all I had to do was provide the search engine.

The Medlars data was, and still is, very tightly controlled. The main reason for the fixed index terms is the requirement to produce the printed Index Medicus which, each month, lists all articles about each term. The thesaurus contains about 8000 terms, which are organised into a tree structure. Each article in a medical journal has relevant terms assigned to it by a team of indexers. The indexers are free to assign as many terms as they wish. Generally 3 are assigned as 'print' terms, under which the article will appear in Index Medicus, with about 10 less important terms which are used only for computer searches. The tree structure in the thesaurus allows terms to be assigned at varying levels of specificity. For example, an experiment using rats might specify terms at higher levels in the tree (vertebrates - mammals - rodents - rats) if the experiment could be applicable to vertebrates higher up the tree. In addition to the search terms there were other qualifiers for a citation, such as the language of the original article.
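The tree structure described above can be illustrated with a minimal sketch in modern Python. This is not the KDF9 code: the parent table holds only the four terms named in the example, and the function names are hypothetical, chosen for illustration.

```python
# Toy fragment of a thesaurus tree: each term maps to its broader (parent) term.
# Only the four terms from the example are included; the names are illustrative.
PARENT = {
    "rats": "rodents",
    "rodents": "mammals",
    "mammals": "vertebrates",
    "vertebrates": None,  # top of this toy tree
}

def path_from_root(term):
    """Return the chain of terms from the top of the tree down to `term`."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return list(reversed(chain))

def narrower_terms(term):
    """Return `term` together with every term below it in the tree."""
    found = {term}
    changed = True
    while changed:  # repeat until no new child is discovered
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

print(" - ".join(path_from_root("rats")))
print(sorted(narrower_terms("mammals")))
```

Walking up the tree gives the broader terms an indexer might also assign; walking down gives the narrower terms that a search on a broad term could be widened to include.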

The constraints of the system simplified the computer task as all the terms used in a search had to be in the thesaurus. Search formulations could not contain ambiguities of language though they could be logically incorrect. The task was thus to implement a search engine without having to accommodate any of the problems of natural language processing. The three months I spent at NLM were used to design and partially implement the search system to be run on the KDF9 computer. It was a salutary exercise to write programs 3000 miles from the computer on which they were to run before the days of on-line working. The system was completed about six months after my return from the States and was used for over two years to provide a national service. Subsequently the service was provided by the Document Processing Centre at Manchester.

The main problem to be tackled was the speed of searching. The data was held on magnetic tape and was accumulating at the rate of 200 thousand articles per year. The system was to provide both Selective Dissemination of Information (SDI), ie searches on the latest month's data, and retrospective searches of the whole file, ie the past 5 years' data, of one million records. The SDI searches were grouped together and processed in one pass of the latest monthly data. The retrospective searches were done once a week overnight, again using one pass of the magnetic tapes, but the time taken could still run to several hours as the number of tapes searched rose to about 20.
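The idea of grouping searches so that the file is read only once can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the original programs: each citation is assumed to be a pair of an identifier and a set of index terms, and each search is assumed to be a list of term groups, all of which must be matched (an AND of ORs). The names `run_batch`, `tape` and `searches` are hypothetical.

```python
def run_batch(records, searches):
    """One sequential pass: test every pending search against each record in turn."""
    hits = {name: [] for name in searches}
    for citation_id, terms in records:  # records are read once, in file order
        for name, groups in searches.items():
            # A search matches when every group shares at least one term
            # with the citation (set intersection is non-empty).
            if all(terms & group for group in groups):
                hits[name].append(citation_id)
    return hits

# Toy data standing in for one tape of citations.
tape = [
    (1, {"rats", "liver"}),
    (2, {"mammals", "kidney"}),
    (3, {"rats", "kidney"}),
]
searches = {
    "search_a": [{"rats"}, {"liver", "kidney"}],  # rats AND (liver OR kidney)
    "search_b": [{"kidney"}],
}
print(run_batch(tape, searches))
```

Because the outer loop is over the records, the cost of reading the file is paid once however many searches are pending, which is the point of batching when the data sits on sequential tape.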

The provision of the data from NLM was an interesting exercise. We needed to have a secure and guaranteed system for getting a magnetic tape each month from NLM to Newcastle. We were working closely with the National Lending Library at Boston Spa, a Government Department. NLL had access to 'The Diplomatic Bag' and our data initially used this route, bypassing customs and any other red tape; it was most efficient. The unfortunate part from my point of view was that the tape was 'industry compatible' but we had a KDF9 computer which used 1" tape! Each month I drove to ICI at Wilton on Teesside to use their computer early in the morning before the day shift started. ICI also had a KDF9 but they had an IBM tape deck which could be used for the conversion. On the whole this worked very well, though inevitably there were some faults on the tapes and records were lost, but in the inexact techniques of Information Retrieval these would rarely be crucial.

The Next Stage

Our experience with the Medlars database, the advent of the 360/67 and the MTS operating system put us in a unique position. We had a large body of data and the ability to access the computer from a terminal. It is worth reminding ourselves of the configuration of the computer. We had a megabyte of core storage and ?? disc storage. The terminals were typewriter devices working at 15 characters per second - the speed of the lines hardly mattered!

We took this opportunity and ran two consecutive research projects funded by the Office for Scientific and Technical Information (OSTI). The first was designed to show that research medical staff could use a retrieval system successfully if it was provided in a helpful on-line environment (TR34). The second used this system to investigate the needs of users from a 'Current Awareness' service (TR78). Both projects required close collaboration with Library staff which, in Newcastle at that time, was very good. Working together on these projects was beneficial to all parties.

These two projects enabled us to recruit people who have subsequently played a major role in the work of the Computing Laboratory and, indeed, the wider computing scene. Two of the staff, Alan Hunter and Nick Rossiter, subsequently joined the Computing Service and made a significant contribution to the systems and the advisory side of our activities. On the first project were Stephanie Barber, now in the Library, and Alex Gray, now Professor of Computer Science in Cardiff.

At the time that this work was starting, Information Retrieval was still in the days of punched cards and knitting needles. In such systems a card was created for each topic, with the information on the card giving the numbers of the documents containing that topic. Such systems were very restricted in the quantity of data they could handle. Also at this time librarians were beginning to worry about their ability to find all the references on a particular topic. Cyril Cleverdon at Cranfield was one of the first to formulate the concepts of 'Recall' - what percentage of the known relevant references were retrieved; and 'Precision' - what percentage of the retrieved references were relevant to the query.
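The two measures can be written down directly from these definitions. The following is a minimal sketch in Python, assuming the retrieved and the known relevant references are each given as a set of reference numbers; the sample numbers are invented for illustration.

```python
def recall(retrieved, relevant):
    """Percentage of the known relevant references that were retrieved."""
    if not relevant:
        return 0.0  # undefined when nothing is relevant; 0 is a convention here
    return 100.0 * len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Percentage of the retrieved references that were relevant."""
    if not retrieved:
        return 0.0  # undefined when nothing is retrieved; 0 is a convention here
    return 100.0 * len(retrieved & relevant) / len(retrieved)

relevant = {1, 2, 3, 4}   # references judged relevant to the query
retrieved = {2, 3, 5}     # references returned by the search

print(f"recall    = {recall(retrieved, relevant):.1f}%")
print(f"precision = {precision(retrieved, relevant):.1f}%")
```

In this example two of the four relevant references are found, and two of the three retrieved references are relevant, so recall is 50% and precision about 67%. Note that recall needs the full set of relevant references to be known, which is only practical on a small, fully judged dataset.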

All the early experiments were done on tiny datasets and required judgements to be made on every citation in the dataset. Such blanket coverage was not a possibility in our work, where even one month's citations could not reasonably be scanned by an individual. Recall and precision are highly subjective measures and it is also not clear what the aim of the system should be; certainly 100% precision and 100% recall would not be believed by any user of the system. However, 'recall' and 'precision', or measures derived from them, were the only criteria available for judging the systems and were used quite successfully to compare the performance of users with that of librarians acting as search specialists. On the whole the users retrieved fewer relevant references but with higher precision than the librarians. Once the user accepts that no system is going to find everything for him, and that if he wants to be sure that the particular topic has not been covered in the literature he must use other techniques in addition to a literature search, then the actual values of recall and precision are less important. Possibly one of the benefits of these measures is to make the uncertainty of all retrieval systems - even the Internet - apparent.

The first project was concerned with the mechanics of providing an on-line system and supplementing the thesaurus of Medical Subject Headings (MeSH) to provide additional routes to the terms to be used in the search. For example, many synonyms were added, together with new ways of traversing the tree structure in MeSH. The system was designed for the end user so was given the name MEDUSA.

The users of the system were in three centres: locally in Newcastle with directly connected terminals, and in Leeds and Manchester where they used dial-up over the telephone system. We believe this was the first time that end users, ie medical research staff, accessed a retrieval system remotely. (The Medline system being developed at the same time by the National Library of Medicine in the States was specifically for search librarians and did not have the additional aids built into the MeSH thesaurus.) It is interesting to note the time it then took before this type of access became commonplace, probably not until databases appeared on CDs. The reason lies almost entirely in the method of charging by the telephone system. All charges were made by duration of the call, so thinking time was very expensive; if the charging had been by unit of data transferred the position could have been very different.

One of our more illustrious users, but only for a few minutes, was the then Vice Chancellor, Henry Miller, who was persuaded to try the system from a terminal.

The work on this project showed the feasibility of on-line use by the research workers themselves and led directly to the second project where MEDUSA was used to provide a current awareness service.

The main differences from the earlier project were the provision of an up-to-date file of references and an increase in the number of centres and users taking part in the experiments. The file of citations had to be updated each month, adding the latest month's citations received from NLM and deleting the oldest batch. The number of centres was increased to six and the number of users to 76.
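The monthly update amounts to a rolling window over batches of citations. A minimal sketch, assuming the file is kept as a fixed number of monthly batches: the window length of three months and the citation numbers below are hypothetical, as the number of months actually held is not stated here.

```python
from collections import deque

WINDOW_MONTHS = 3  # hypothetical window length, for illustration only

def monthly_update(file_batches, month, citations):
    """Add the latest month's batch; a full deque drops its oldest batch itself."""
    file_batches.append((month, citations))
    return file_batches

# A deque with maxlen discards items from the opposite end when it is full.
current_file = deque(maxlen=WINDOW_MONTHS)
for month, citations in [
    ("Jan", [101, 102]),
    ("Feb", [103]),
    ("Mar", [104, 105]),
    ("Apr", [106]),
]:
    monthly_update(current_file, month, citations)

print([month for month, _ in current_file])
```

After the fourth update the January batch has been dropped, so the file always covers the most recent months only.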

From the computing viewpoint it was the latter change that provided the more interesting problem. If we were to handle several users simultaneously then it was important to be able to share as much store as possible. The MTS system code and some of the compilers were shared among all concurrent users but each user program was held in its entirety in virtual memory and brought into real store when required. The MEDUSA programs were very large so to have had more than one copy in use would have degraded the system significantly. A modification was made to MTS locally in Newcastle which allowed the use of re-entrant code by applications programmers. This actual modification was never incorporated into MTS though the same function was added by the MTS team at Michigan in the upgrade from version 2 to version 3 of MTS.

The system was available to users from June 1973 to December 1974 and showed that medical research workers were capable of formulating and running their own searches on-line though with supervision from the liaison librarians at each centre. In some ways the computer aspects of the project were almost incidental. The main thrust of the results was the analysis of the way users tackled the problems of search formulation and how their information needs varied during the experimental period.

In retrospect these IR projects went through three phases. Initially we implemented a copy of the NLM search system to provide a service in this country before the days of accessing machines across the Atlantic. The next project demonstrated the enormous potential of on-line systems. It was clear that it was not necessary to be computer literate in order to use a system designed for the on-line environment. In the third project the computer system, though very complex, was a tool for other research so the computer was put in its proper place working reliably in the background.

Contents Page - 40 years of Computing at Newcastle
Information Retrieval, its origins in Newcastle, 19 August 1997