A Billion Base Pairs
Up for Grabs

by Jo McEntyre


(Posted September 4, 1998 · Issue 37)


Can an explosion of information last for decades? Twenty years ago we complained of too many research papers to read. Now the culprit is biological information, particularly of the molecular sequence kind. And the rate of publication of scientific papers has not diminished. In fact, there is so much information that a new field of biology is now in full swing, sporting the fashionable name of bioinformatics.

Where on the Web can biologists find a simple route to powerful bioinformatics tools? Beginners and experts alike will find solace at the site of the National Center for Biotechnology Information (NCBI). Sponsored by the National Library of Medicine and the National Institutes of Health, it aims to bring data analysis to your desktop as painlessly as possible. All NCBI resources are free, and require no passwords or registration for use.

Integrating Information

Integration is the central tenet of NCBI's approach to organizing information. The Entrez retrieval system, developed for this purpose, allows searching and retrieval of nucleotide sequences, protein sequences, protein structures, whole-genome mapping information, taxonomic trees, and records from the PubMed (MEDLINE) literature database. What's more, all these databases are linked, making it easy to shift from a protein sequence to an article abstract to a related protein structure.

The results of a search are first displayed as a summary list; each item on the list cites other records that are linked to it within the Entrez system. This includes information from other Entrez databases, as well as any "neighbors," or similar records of the same type.

Submitting Sequences

How does the proud possessor of a new sequence obtain an accession number? Submitting your sequence to a public database, which allocates it a unique identifying number, is now a prerequisite for publication in most high-profile molecular biology journals. Since the European (EMBL), Japanese (DDBJ), and U.S. (GenBank) nucleotide databases exchange information on a daily basis, you need only submit your sequence to one of them, whichever is most convenient.

While BankIt, NCBI's Web-based submission form, requires no downloading of software and is very easy to use, it lacks the power to manipulate the sequence and submit more complex information. If you need more versatility, your best bet is Sequin, a stand-alone program available by FTP. Like BankIt, Sequin can handle simple sequence submissions, but it offers several added advantages. It can handle long sequences and sets of sequences (segmented entries as well as population, phylogenetic, and mutation studies), and allows sequence editing and extensive annotation. Sequin can also run in "network-aware" mode, making it possible, for example, to run your sequence through PowerBLAST - a BLAST client designed for large-scale analysis of genomic sequences - or to check for vector contamination.

Tools that Mine Databases

With over 2.2 million sequences in the database and about 2,400 accession numbers allocated per day worldwide, finding effective and efficient analysis tools is at the top of many biologists' agendas.

Sequence Analysis

The BLAST programs for sequence analysis were designed to detect similarities between protein or nucleotide sequences that could indicate common biological function, and to give some statistical measure of the potential significance of the similarity, all within a reasonable length of time. Reflecting the changes in database size and therefore in the desirable level of search sensitivity, a revised and improved version of BLAST was released in 1997. The algorithm of the original (ungapped) BLAST was changed, and two new flavors were introduced. All versions can be run with default parameters, which is the best option for beginners; for those conversant with such intricacies as PAM matrices and E values, a variety of advanced options can be invoked. The various versions are accessible on the BLAST home page.

The new Basic BLAST (gapped BLAST) can be used to search both nucleotide and protein sequence databases. It allows gaps to be introduced into the alignment, so that when more than one region of similarity is found between the query sequence and the "hit" sequence, the result is returned as one segment, which is very important for detecting different functional domains in modular proteins.
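
To see why allowing gaps matters, here is a minimal sketch of gapped local alignment scoring by dynamic programming (the Smith-Waterman method). It is not the BLAST algorithm, which relies on faster heuristics; the function name, the toy sequences, and the scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or open a gap in either sequence; never let the score drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical toy data: the hit carries the same two regions as the query,
# separated by two extra residues.
query = "AAAACCCC"
hit = "AAAATTCCCC"
gapped = local_alignment_score(query, hit)
ungapped = local_alignment_score(query, hit, gap=-100)  # gaps effectively forbidden
print(gapped, ungapped)
```

With gaps permitted, both similar regions are joined in one alignment and the score is higher than when gaps are ruled out, which is the effect described above.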

PSI-BLAST (Position-Specific Iterated BLAST), for proteins only, uses the significant results from the first search to form a position-specific matrix, which is then used to query the database again. The significant results from that search form a new matrix for the next iteration, and so on, until convergence is reached, that is, until no new significant hits turn up. This approach allows greater sensitivity in detecting subtle sequence similarities and identifying outlying members of protein families.
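
The iterate-until-convergence idea can be sketched in a few lines. This is a toy under strong assumptions, not the PSI-BLAST implementation: sequences are equal-length and already aligned, the scores are simple log-odds values with a uniform background and a pseudocount, and the peptide strings, the threshold, and all function names are invented for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, pseudocount=1.0):
    """Position-specific log-odds scores from equal-length, pre-aligned sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*seqs):
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum of the per-position scores of seq against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterated_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = {query}
    for _ in range(max_rounds):
        profile = build_profile(sorted(hits))
        new_hits = {query} | {s for s in database if score(profile, s) >= threshold}
        if new_hits == hits:  # convergence: no change in the set of hits
            break
        hits = new_hits
    return hits


# Hypothetical toy database: one close relative, one distant relative, one unrelated.
database = ["MKVLAG", "MKILSG", "WWWWWW"]
found = iterated_search("MKVLAA", database, threshold=2.5)
print(sorted(found))
```

In this toy run the distant relative scores below the threshold against the query alone, but is picked up once the close relative has been folded into the matrix.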

Figure 1
All versions of BLAST now provide a graphical summary of search results (figure 1), in which each colored bar represents a database sequence hit by the query. The color of the bar indicates the approximate level of similarity between the hit and the query sequence, and its position relative to the query sequence locates the similar regions.

Free Access to the Literature

Should your quest for information turn to text, not sequences, you will find that NCBI's PubMed database offers free searches of the biomedical literature. The straightforward Web interface masks a sophisticated search engine for 9 million abstracts gathered since 1966. PubMed consists of all of NLM's MEDLINE, plus some non-biomedical articles from general science journals such as Science and the Proceedings of the National Academy of Sciences. PubMed can be used via simple text searches or Boolean searches or by selecting specific fields (such as author name, publication date, or journal name), and is set up to allow you to refine your search step by step.
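
The step-by-step refinement described above can be illustrated with a small in-memory sketch. This is not PubMed's search engine or its query syntax; the records are made up for the example, and the field names and functions are hypothetical.

```python
# Invented records, for illustration only.
RECORDS = [
    {"id": 1, "author": "Smith J", "journal": "Science", "year": 1997,
     "title": "Gapped alignment of protein sequences"},
    {"id": 2, "author": "Jones A", "journal": "Proc Natl Acad Sci USA", "year": 1998,
     "title": "Protein structure comparison"},
    {"id": 3, "author": "Smith J", "journal": "Science", "year": 1998,
     "title": "Mapping the yeast genome"},
    {"id": 4, "author": "Lee K", "journal": "Science", "year": 1998,
     "title": "Protein families in archaea"},
]


def matches(record, term, field=None):
    """True if term occurs (ignoring case) in the given field, or in any field."""
    fields = [field] if field else [f for f in record if f != "id"]
    return any(term.lower() in str(record[f]).lower() for f in fields)


def search(records, term, field=None):
    """Return the records that match; searching a previous result acts as AND."""
    return [r for r in records if matches(r, term, field)]


# Refine step by step: free text, then journal name, then publication year.
step1 = search(RECORDS, "protein")
step2 = search(step1, "Science", field="journal")
step3 = search(step2, "1998", field="year")

# Boolean OR: a record qualifies if either author matches.
either_author = [r for r in RECORDS
                 if matches(r, "Smith", "author") or matches(r, "Lee", "author")]

print([r["id"] for r in step3], [r["id"] for r in either_author])
```

Each refinement searches only the previous result list, so the candidate set can only shrink as criteria are added.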

Several features of PubMed set it apart from other literature search facilities. In papers where a new sequence or structure is reported, a link is provided to that information. Where a participating journal has an online counterpart, the abstract is linked to the full text of the article, although a subscription may be required for access. Finally, a related-articles, or neighbors, button summons other articles similar to the one selected, based on the content of the abstract, the title, and the key terms assigned under the MeSH (Medical Subject Headings) system, if any. This allows search results to include articles that might have gone unnoticed with a conventional search, and may uncover new links to other databases.

3-D Structures

The Molecular Modeling Database (MMDB) incorporates macromolecular structure information from the Protein Data Bank (PDB; maintained at Brookhaven National Laboratory) in a form compatible with other NCBI data sources. VAST (the Vector Alignment Search Tool) directly compares 3-D structures in MMDB to find and align those with statistically significant similarity. Each individual protein record in MMDB has these structure neighbors listed in a table.

For visual examination of structural alignments, Cn3D ("see in 3-D"), created specifically to permit viewing of MMDB structures over the Web, joined the ranks of RasMol and Mage as a free 3-D molecular structure viewer in 1997. With the soon-to-be-released latest version, Cn3D 2.0, two or more 3-D structures can be aligned, viewed, and animated. Furthermore, their sequences can be seen in a separate window, aligned according to structural rather than sequence identity. Other sequences, even those for which structures are not available, can be imported into the sequence viewer and mapped onto the sequence of the known structure(s) using gapped BLAST. As a further visual link-up between the structure and sequence viewing windows, selecting a region of interest in either the sequence or the structure causes the corresponding region in the other window to be highlighted as well - providing an excellent means of exploring conserved regions in 3-D detail.

Genomes

The Genomes division of Entrez offers a holistic approach to organizing nucleotide and protein sequence information, especially data for completely sequenced genomes and data on genome-project organisms. It provides both text and graphical displays of integrated genetic, sequence, and physical maps that can be viewed chromosome by chromosome. Genomes are navigated by clicking and zooming in on chosen segments of a chromosome map, or by text searching for marker names or for nucleotide or protein unique identifiers. There are currently 687 entries in the Genomes division, representing six large taxonomic groups: Archaea, Eubacteria, Eukaryotae, viroids, viruses, and plasmids.

Taxonomy

In many ways the Taxonomy division is the linchpin of all Entrez information, since all nucleotide, protein, genomic, and structural records are required to include the name of the organism from which they are derived. Both morphological and molecular evidence is used to classify organisms, with consultation from a number of external curators and experts. About 35,000 organisms are currently represented in GenBank, out of approximately 1.5 million on the planet that have been described morphologically.

A Sampling of Biology at the NCBI

Online Mendelian Inheritance in Man (OMIM), based on the book Mendelian Inheritance in Man by Victor McKusick and colleagues, is the most comprehensive online text resource for human genetic disease information. Updated daily by staff based at Johns Hopkins University in Baltimore, OMIM currently weighs in at 9,439 gene entries, of which 572 have been added so far in 1998 alone. OMIM may be queried using any text-based term, including gene name, disease name, or contributing author. Links to proteins, genes, and abstracts of the literature cited within OMIM records are also provided.

The Cancer Genome Anatomy Project (CGAP) is a collaboration between NCBI and the National Cancer Institute to find new genes involved in cancer progression. It combines experimental biology with bioinformatics by making use of a microdissection technique that samples normal, precancerous, and cancer tissue. The three sample types are used to make libraries from which randomly selected clones are sequenced and grouped together according to similarity. These "fingerprints" for normal and cancerous tissues can be compared; the most interesting genes to a cancer researcher or a drug developer are probably those that have significantly different levels of expression between samples.
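
As a rough sketch of how such fingerprints might be compared, the code below counts the clones that fall into each cluster and reports clusters whose share of the library differs by a chosen factor between two samples. The cluster names, clone counts, thresholds, and function names are all invented for illustration; a real analysis would also need a proper statistical test, since small clone counts are noisy.

```python
from collections import Counter


def expression_profile(clone_clusters):
    """Fraction of sequenced clones in each cluster (the library's 'fingerprint')."""
    counts = Counter(clone_clusters)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def differential_clusters(normal, cancer, min_fold=3.0, floor=0.01):
    """Clusters whose clone fraction differs at least min_fold between libraries."""
    p_normal = expression_profile(normal)
    p_cancer = expression_profile(cancer)
    result = {}
    for cluster in set(p_normal) | set(p_cancer):
        # The floor avoids division by zero for clusters absent from one library.
        a = max(p_normal.get(cluster, 0.0), floor)
        b = max(p_cancer.get(cluster, 0.0), floor)
        fold = b / a
        if fold >= min_fold or fold <= 1.0 / min_fold:
            result[cluster] = fold
    return result


# Hypothetical libraries: one cluster label per randomly picked, sequenced clone.
normal_library = ["geneA"] * 50 + ["geneB"] * 40 + ["geneC"] * 10
cancer_library = ["geneA"] * 48 + ["geneB"] * 10 + ["geneC"] * 40 + ["geneD"] * 2

changed = differential_clusters(normal_library, cancer_library)
print(sorted(changed))
```

Here a fold value above 1 means the cluster is more abundant in the cancer library, and a value below 1 means it is less abundant.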

A Natural System of Gene Families from Complete Genomes provides a database of COGs, clusters of orthologous groups of proteins - or, more simply put, groups of proteins that represent ancient conserved domains found across different organisms. Seven completely sequenced genomes - one eukaryote (yeast), plus several eubacteria and archaebacteria - were used in the first edition of the COG scheme; more will be added in the future. The COGs can be searched with your favorite protein from any organism to find its nearest neighbors in the COG-featured genomes. Since all three main branches of the phylogenetic tree are represented, this resource can provide insight into evolutionary relationships between individual proteins and even whole metabolic and regulatory pathways.

The Gene Map, a new edition of which has just been released, maps about 30,000 sequence clusters of the human genome as defined by UniGene, an experimental system for automatically partitioning genomes. Each UniGene cluster consists of several identical expressed sequence tags (ESTs) and known genes, and is placed on the human gene map in accordance with an international standard set of gene markers. The Gene Map is linked to the Genes and Disease site, where nonexperts can browse information on about 60 selected human diseases, including short summaries of the molecular pathogenesis of each disease along with links to OMIM, protein sequences, relevant literature, and disease-specific sites.

These are only the staples of the NCBI diet, along with a few recent, more exotic morsels to whet the reader's appetite. The only way to appreciate the full breadth of this site is to pay it a visit.

Jo McEntyre is a visiting associate at the National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland.


Endlinks

Send email to NCBI for further information on any of the services discussed above.

Evaluated MEDLINE - BioMedNet's own MEDLINE implementation, with citations rated for relevance.

EMBL European Bioinformatics Institute (EBI) - maintains the European Molecular Biology Laboratory (EMBL) sequence database, SWISS-PROT, and other databases. Provides tools for database searching, data analysis, and data submission.

GenomeNet WWW Server - a Japanese network of databases and services for genome research that integrates information from other genome projects into sequence analysis tools. Includes the Kyoto Encyclopedia of Genes and Genomes (KEGG), a database for biochemical pathways. Supported by the Institute of Chemical Research of Kyoto University and the Human Genome Center at the Institute of Medical Sciences of the University of Tokyo.

Site Review - discusses FlyBase, a database of the Drosophila genome, maintained at Indiana University.

Web sites mentioned in this column:

National Center for Biotechnology Information

Entrez

Sequence Submission

Sequence Analysis

3-D Structures

Other NCBI Resources

Other Databases

Other Resources

