Reprinted with permission from Trends Guide to Bioinformatics, 1998, pp. 3-5.
Abstract
Because the amount of biologically relevant data is increasing so rapidly, knowing how to access and search this information is essential. Three data retrieval systems are of particular relevance to molecular biologists: Entrez, the Sequence Retrieval System (SRS), and DBGET.
The amount of biological information accessible via the World Wide Web (WWW) is truly astonishing, and the volume of data is increasing at a fast pace. It is important for the bench scientist to have easy and efficient ways of wading through the data and finding what is important to his or her research. Although one can browse the data, a far more efficient access method is to perform a search. Depending on the type of data at hand, there are two basic ways of searching: using descriptive words to search text databases or using a nucleotide or protein sequence to search a sequence database. This article focuses on the former.
Here, I will discuss three tools - Entrez, the Sequence Retrieval System (SRS), and DBGET - that allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria. Examples of basic and advanced search strategies are also included. Although many databases that can be accessed with text-based searching will not be discussed here, the search strategies presented are broadly applicable and can be used to search many organism-specific resources, such as the Saccharomyces Genome Database (SGD) [1] and the Mouse Genome Database (MGD) [2].
These retrieval systems are indispensable to the scientist in search of information. In using any of these systems, queries can be as simple as entering the accession number of a newly published sequence or as complex as searching multiple database fields for specific terms (see box 1 for search concepts). The advantage of Entrez, SRS, and DBGET is that they not only return matches to a query, but also provide handy pointers to additional important information in related databases. The three systems differ in the databases that they search and the links they make to other information.
Entrez
Entrez, a molecular biology database and retrieval system, was developed by the National Center for Biotechnology Information (NCBI)[3,4]. It is an entry point for exploring distinct but integrated databases. The Entrez system provides access to nucleotide and protein sequence databases, a molecular modeling 3-D structures database (MMDB), a genomes and maps database, and the literature. The literature database, PubMed, provides excellent and easy access to MEDLINE and pre-MEDLINE (not fully indexed) articles. The taxonomy database contains more than 23,000 different species and allows retrieval of DNA and protein sequences for any taxonomic group.
Of the three text-based database systems, Entrez is the easiest to use, but it offers more limited information to search (see descriptions of SRS and DBGET below). The search is begun in one database, and records are presented that match the query. Related records (neighbors) in that database and associated records (links) in other Entrez databases are then retrieved. When appropriate, links are made to external databases, such as Online Mendelian Inheritance in Man (OMIM) [5] and the MGD. Neighbors and links are listed in the order of similarity to the query. The similarity is based on precomputed analyses of sequences, structures and the literature. In the case of sequences, for example, the precomputed analysis is a BLAST search.
One particularly useful feature in Entrez is the ability to retrieve large sets of data based on some criterion and to download them to a local computer, where the sequences can be worked on with locally available analytical tools. This so-called batch Entrez permits retrieval of DNA or protein sequences that are specified in a file. Moreover, all entries for a specific organism based on the taxonomy database can be retrieved, or a Boolean search can be entered (see box 1 for a description of Boolean searches) to retrieve those sequences that match the query.
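The two retrieval modes described above, a list of identifiers read from a file and a Boolean query, can be illustrated with a small Python sketch. It does not use or reproduce the Entrez interface: the records, the accession numbers (A001 and so on), and the function names are all invented for illustration, and the Boolean operators AND, OR, and NOT are simply modeled as set intersection, union, and difference.

```python
# Toy, made-up records: accession -> (organism, words from the description).
# None of these identifiers refer to real database entries.
RECORDS = {
    "A001": ("Escherichia coli", {"ornithine", "carbamoyltransferase"}),
    "A002": ("Escherichia coli", {"dna", "polymerase"}),
    "A003": ("Saccharomyces cerevisiae", {"ornithine", "carbamoyltransferase"}),
    "A004": ("Mus musculus", {"dna", "ligase"}),
}


def by_organism(name):
    """Return the set of accessions whose organism matches exactly."""
    return {acc for acc, (org, _words) in RECORDS.items() if org == name}


def by_word(word):
    """Return the set of accessions whose description contains the word."""
    return {acc for acc, (_org, words) in RECORDS.items() if word.lower() in words}


def read_accession_list(text):
    """Parse a batch list: one accession per line.

    Blank lines, lines starting with '#', and repeated accessions are skipped.
    """
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and line not in accessions:
            accessions.append(line)
    return accessions


def fetch_batch(accessions):
    """Return (found records, list of accessions that were not found)."""
    found = {acc: RECORDS[acc] for acc in accessions if acc in RECORDS}
    missing = [acc for acc in accessions if acc not in RECORDS]
    return found, missing


# Boolean searches expressed as set operations.
and_hits = by_word("ornithine") & by_organism("Escherichia coli")  # AND
or_hits = by_word("ligase") | by_word("polymerase")                # OR
not_hits = by_word("dna") - by_organism("Mus musculus")            # NOT

# Batch retrieval from the contents of a (hypothetical) accession file.
wanted = read_accession_list("A001\n\n# a comment line\nA004\nA999\n")
found, missing = fetch_batch(wanted)
```

Reporting the identifiers that were not found, rather than silently dropping them, is a design choice of this sketch; it makes it easier to spot typing errors in a long list.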
Sequence Retrieval System
The Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological databases, developed at the European Bioinformatics Institute (EBI) at Hinxton, United Kingdom. The types of databases included are sequence and sequence related, metabolic pathways, transcription factors, application results (e.g. BLAST), protein 3-D structure, genome, mapping, mutations, and locus-specific mutations. You can access and query their contents and navigate among them [6]. The Web page that lists the databases links to a description page for each one and shows the date on which each was last updated. You select one or more of the databases to search before entering your query. Over 30 versions of SRS are currently running on the WWW. Each includes a different subset of databases and associated analytical tools.
Although there are many potential databases to search, SRS databases are indexed well, which reduces the search time. The contents of the data fields in each database are broken down into components, and selected words are extracted and inserted into an index. Each field generally has its own index. The query form allows search terms for a specific field to be entered, or you can search all fields using the option "Alltext." SRS provides an alternative query form that allows more-complex Boolean queries to be composed.
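The idea of one index per field, with an option to search across all fields, can be sketched in a few lines of Python. This is an illustration of the general technique (an inverted index) and not SRS's actual indexing code: the entries, field names, and function names are invented, and the sketch indexes every word, whereas SRS is described above as extracting selected words.

```python
import re
from collections import defaultdict

# Made-up entries with two text fields each.
ENTRIES = {
    "E1": {"description": "Ornithine carbamoyltransferase",
           "organism": "Escherichia coli"},
    "E2": {"description": "DNA polymerase I",
           "organism": "Escherichia coli"},
    "E3": {"description": "Coli surface antigen-like protein",
           "organism": "Homo sapiens"},
}


def tokenize(text):
    """Break a field's text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_indexes(entries):
    """Build one inverted index per field: field -> word -> set of entry IDs."""
    indexes = defaultdict(lambda: defaultdict(set))
    for entry_id, fields in entries.items():
        for field, text in fields.items():
            for word in tokenize(text):
                indexes[field][word].add(entry_id)
    return indexes


def search(indexes, word, field=None):
    """Look a word up in one field's index, or in all of them if field is None.

    Passing field=None plays the role of an "all text" search.
    """
    word = word.lower()
    if field is not None:
        return set(indexes.get(field, {}).get(word, set()))
    hits = set()
    for field_index in indexes.values():
        hits |= field_index.get(word, set())
    return hits


indexes = build_indexes(ENTRIES)
organism_hits = search(indexes, "coli", field="organism")
all_text_hits = search(indexes, "coli")
```

Because each lookup is a dictionary access rather than a scan of every entry, search time stays short as the number of entries grows. The example also shows why field-restricted searches are more precise: the word "coli" in the organism field finds only the two E. coli entries, whereas the all-fields search also picks up an unrelated entry whose description happens to contain the word.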
DBGET
DBGET/LinkDB is an integrated database retrieval system, developed by the Institute for Chemical Research, Kyoto University, and the Human Genome Center of the University of Tokyo [7], that is available through GenomeNet. DBGET provides access to about 20 databases, which are queried one at a time. After querying one of these databases, DBGET presents links to associated information in addition to the list of results. The LinkDB database can also be searched directly with a specific entry and provides a list of links to all of the database entries with information about the entry. Another unique feature of DBGET is its connection with the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [8], which is a database of metabolic and regulatory pathways that has been developed by the same group.
DBGET has simpler, but more limited, search methods than either SRS or Entrez. For DBGET, you can search the database of your choice using one of two commands. The bfind command allows searching based on text terms. In response, a list of entries that match the query is presented together with links to associated information about each entry. The bget command searches by entry name or accession number.
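The difference between the two commands, a text search that returns a list of candidates versus a direct lookup by identifier, can be mimicked with a short Python sketch. It is not DBGET's implementation and does not call DBGET: the toy database, its entry names and accession numbers, and the functions bfind_like and bget_like are hypothetical. Requiring every term to match in bfind_like is also an assumption of this sketch, not a statement about how bfind combines terms.

```python
# A made-up database: entry name -> record with an accession and free text.
DATABASE = {
    "DEMO_ENTRY_1": {"accession": "D00001",
                     "text": "Ornithine carbamoyltransferase, Escherichia coli"},
    "DEMO_ENTRY_2": {"accession": "D00002",
                     "text": "DNA polymerase I, Escherichia coli"},
    "DEMO_ENTRY_3": {"accession": "D00003",
                     "text": "Ornithine decarboxylase, Mus musculus"},
}


def bget_like(db, identifier):
    """Direct lookup by entry name or accession number.

    Returns (entry name, record), or None if nothing matches exactly.
    """
    if identifier in db:
        return identifier, db[identifier]
    for name, record in db.items():
        if record["accession"] == identifier:
            return name, record
    return None


def bfind_like(db, *terms):
    """Text search: names of entries whose text contains every term.

    Matching is case-insensitive; an empty query returns no entries.
    """
    wanted = [term.lower() for term in terms]
    if not wanted:
        return []
    return sorted(
        name for name, record in db.items()
        if all(term in record["text"].lower() for term in wanted)
    )


broad = bfind_like(DATABASE, "ornithine")           # several candidates
narrow = bfind_like(DATABASE, "ornithine", "coli")  # fewer candidates
exact = bget_like(DATABASE, "D00002")               # one entry or None
```

A typical session would combine the two: a text search to find candidate entries, then a direct lookup once the entry name or accession number is known.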
Examples
The following example shows how each system can be used to retrieve the SWISS-PROT [9] entry P04391, an ornithine carbamoyltransferase protein in Escherichia coli. In Entrez, you could enter the accession number P04391 in the protein database query form and view the entry and associated links and neighbors. In SRS, you could first select the SWISS-PROT database, then enter P04391 in the query form and, once the entry is displayed, search for links to other related databases. However, the fastest way of gathering the related information for this entry is to search LinkDB. By simply entering swissprot:P04391, a list of all links to all the related databases is displayed.
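The LinkDB query above has the form database:entry. A minimal Python sketch of how such an identifier might be split and used as a key into a table of links follows. The parsing rule (split at the first colon, lowercase the database name) is an assumption, and the link table contains placeholder targets only; it does not list the real links for P04391.

```python
def parse_identifier(identifier):
    """Split 'database:entry' into a (database, entry) pair.

    The database name is lowercased; the entry is left as typed.
    Raises ValueError if either part is missing.
    """
    db, sep, entry = identifier.partition(":")
    db, entry = db.strip().lower(), entry.strip()
    if not sep or not db or not entry:
        raise ValueError(f"expected 'database:entry', got {identifier!r}")
    return db, entry


# Placeholder link table: (database, entry) -> linked identifiers.
# The targets are invented and stand in for whatever LinkDB would return.
LINKS = {
    ("swissprot", "P04391"): ["exampledb1:PLACEHOLDER_A",
                              "exampledb2:PLACEHOLDER_B"],
}


def links_for(identifier):
    """Return the linked identifiers for an entry, or an empty list."""
    return list(LINKS.get(parse_identifier(identifier), []))


key = parse_identifier("swissprot:P04391")
linked = links_for("SwissProt:P04391")
```

Normalizing the database name before the lookup means that differences in capitalization do not cause a miss, while the entry identifier itself is matched exactly.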
Box 2 provides an example of text-based searching and illustrates some of the problems that might be encountered while doing such a search. The choice of system will depend on what you are searching for and what additional information you hope to find.
You Can't Always Get What You Want
Text-based searches are dependent on the quality of the text, annotations, and index being searched. If entries are not annotated fully or consistently, it will be difficult to find all relevant entries for a topic. Text can be either free form or controlled vocabulary, and each can cause different problems for a search. For example, when searching free-form text, spelling errors in the text can exclude relevant entries from the results. Hyphenation is another source of inconsistency: a phrase might be hyphenated in one entry but not in another, so a search for one form will miss entries that use the other. Another potential problem is limiting a search to the fields called "keywords." It is important to know whether keywords are applied consistently to entries. Beware of keyword indexes in which a substantial number of keywords point to only one entry, and of keyword indexes built from author-supplied keywords, which can be arbitrary or idiosyncratic. In either case, it might be better to search "free text." If you search a field with a controlled vocabulary (e.g., MeSH), it is important to understand its organization and hierarchy.
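The hyphenation problem can be reproduced, and partly worked around, in a few lines of Python. The sketch below compares a literal phrase match with one that first normalizes case and punctuation. The function names and sample annotations are invented, and the normalization rule (treat any run of non-alphanumeric characters as a space) is one assumption among several possible; it does nothing for spelling errors.

```python
import re


def naive_match(phrase, text):
    """Literal, case-sensitive substring match."""
    return phrase in text


def normalize(text):
    """Lowercase the text and replace runs of punctuation with one space."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def phrase_match(phrase, text):
    """Match whole words of the phrase after normalizing both sides."""
    return f" {normalize(phrase)} " in f" {normalize(text)} "


# Two made-up annotations that describe the same kind of protein.
annotation_a = "Zinc-finger protein"
annotation_b = "zinc finger protein"

missed = naive_match("zinc finger", annotation_a)   # hyphen blocks the match
found_a = phrase_match("zinc finger", annotation_a)
found_b = phrase_match("Zinc-Finger", annotation_b)
```

Padding both strings with spaces before comparing keeps the match on whole words, so a fragment such as "inc finger" is not reported as a hit.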
Fran Lewitter is the associate director for biocomputing at the Whitehead Institute for Biomedical Research.

