Can an explosion of information last for decades? Twenty years ago we complained of too many research papers to read. Now the culprit is biological information, particularly of the molecular sequence kind. And the rate of publication of scientific papers has not diminished. In fact, there is so much information that a new field of biology is now in full swing, sporting the fashionable name of bioinformatics.
Where on the Web can biologists find a simple route to powerful
bioinformatics tools? Beginners and experts alike will find solace at the
site of the National Center for
Biotechnology Information (NCBI). Sponsored by the National Library of Medicine and the National Institutes of Health, it aims to
bring data analysis to your desktop as painlessly as possible. All NCBI
resources are free, and require no passwords or registration for use.
Integrating Information
Integration is the central tenet of NCBI's approach to organizing information. The Entrez retrieval system, developed for this purpose, allows searching and retrieval of nucleotide sequences, protein sequences, protein structures, whole-genome mapping information, taxonomic trees, and records from the PubMed (MEDLINE) literature database. What's more, all these databases are linked, making it easy to shift from a protein sequence to an article abstract to a related protein structure.
The results of a search are first displayed as a summary list; each item on
the list cites other records that are linked to it within the Entrez system.
This includes information from other Entrez databases, as well as any
"neighbors," or similar records of the same type.
Submitting Sequences
How does the proud possessor of a new sequence obtain an accession number? Submitting your sequence to a public database, which allocates it a unique identifying number, is now a prerequisite for publication in most high-profile molecular biology journals. Since the European (EMBL), Japanese (DDBJ), and U.S. (GenBank) nucleotide databases exchange information on a daily basis, you need only submit your sequence to one of them, whichever is most convenient.
While BankIt, NCBI's Web-based submission form, requires no downloading of software and is very easy to use, it
lacks the power to manipulate the sequence and submit more complex
information. If you need more versatility, your best bet is Sequin, a
stand-alone program available by FTP. Like BankIt, Sequin can handle simple
sequence submissions, but it offers several added advantages. It can handle
long sequences and sets of sequences (segmented entries as well as
population, phylogenetic, and mutation studies), and allows sequence editing
and extensive annotation. Sequin can also run in "network-aware"
mode, making it possible to run your sequence through PowerBLAST - a BLAST
client designed for large-scale analysis of genomic sequences - or check for
vector contamination, for example.
Tools that Mine Databases
With over 2.2 million sequences in the database and about 2,400 accession numbers allocated per day worldwide, finding effective and efficient analysis tools is at the top of many biologists' agendas.
Sequence Analysis
The BLAST
programs for sequence analysis were designed to detect similarities between
proteins and nucleotides that could indicate common biological function, and
give some statistical measure of the potential significance of the
similarity, all within a reasonable length of time. Reflecting the changes
in database size and therefore in the desirable level of search sensitivity,
a revised and improved version of BLAST was released in 1997. The algorithm
of the original (ungapped) BLAST was changed, and two new flavors were
introduced. All versions can be run with default parameters, which is the
best option for beginners; for those conversant with such intricacies as PAM
matrices and e values, a variety of advanced options can be invoked. The
various versions are accessible on the BLAST home page.
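For readers curious about the e values mentioned above, the following is a minimal sketch of the Karlin-Altschul estimate, E = K * m * n * exp(-lambda * S), which gives the number of alignments with raw score S or better expected by chance when a query of length m is searched against a database of total length n. The function name and the default values of lambda and K are illustrative assumptions; real values depend on the scoring matrix and gap penalties in use, and BLAST reports them with each search.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values chosen for illustration only.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# For the same search space, a higher raw score gives a smaller E value.
e_low_score = expected_hits(raw_score=40, query_len=250, db_len=50_000_000)
e_high_score = expected_hits(raw_score=80, query_len=250, db_len=50_000_000)
print(f"S=40: E={e_low_score:.3g}   S=80: E={e_high_score:.3g}")
```

Because E scales with database length, the same alignment becomes less significant as the database grows, which is one reason search sensitivity had to be revisited.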
The new Basic BLAST (gapped BLAST) can be used to search both nucleotide and protein sequence databases. It allows gaps to be introduced into the alignment, so that when more than one region of similarity is found between the query sequence and the "hit" sequence, the result is returned as one segment, which is very important for detecting different functional domains in modular proteins.
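The effect of allowing gaps can be illustrated with a small dynamic-programming sketch in the style of Smith-Waterman local alignment. This is not the BLAST algorithm itself, which relies on faster heuristics; the scoring values and the toy sequences are assumptions chosen only to show two similar regions, separated by an insertion, being joined into a single higher-scoring alignment.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2, allow_gaps=True):
    """Best local alignment score between a and b by dynamic programming."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend along the diagonal (match or mismatch), never below zero.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag)
            if allow_gaps:
                # Alternatively, open or extend a gap in either sequence.
                score = max(score, prev[j] + gap, curr[j - 1] + gap)
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best

# Toy example: the "hit" carries a three-letter insertion between two regions
# that both match the query.
query = "ACGTACGT" + "TTGACCA"
hit = "ACGTACGT" + "GGG" + "TTGACCA"
ungapped = local_alignment_score(query, hit, allow_gaps=False)
gapped = local_alignment_score(query, hit, allow_gaps=True)
print("ungapped:", ungapped, "gapped:", gapped)
```

Without gaps, only the longer of the two matching regions contributes to the best score; with gaps, both regions are reported as one alignment.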
PSI-BLAST (Position-Specific Iterated BLAST), for proteins only, uses the significant results from the first search to form a position-specific matrix, which can then be used to query the database again. In turn, the significant results from the first iteration form a matrix to query the database in the second iteration, and so on, until convergence is reached. This approach allows greater sensitivity in detecting subtle sequence similarities and identifying outlying members of protein families.
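The iteration described above can be sketched in a few lines. This is a toy illustration under strong simplifying assumptions, not the PSI-BLAST implementation: the hits are equal-length and gap-free, background amino acid frequencies are uniform, a flat pseudocount is used, and the function names, sequences, and threshold are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, pseudocount=1.0):
    """Turn equal-length, gap-free hits into per-position log-odds scores."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_hits):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def pssm_score(pssm, candidate):
    """Sum the position-specific scores for a candidate of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, candidate))

def iterate_search(seed, database, threshold, max_rounds=10):
    """Rebuild the matrix from each round's hits until the hit set stops changing."""
    hits = {seed}
    for _ in range(max_rounds):
        profile = build_pssm(sorted(hits))
        new_hits = {s for s in database if pssm_score(profile, s) >= threshold}
        new_hits.add(seed)
        if new_hits == hits:  # convergence: no sequences gained or lost
            break
        hits = new_hits
    return hits

toy_database = ["MKILA", "MRVLG", "WWWWW", "MKVIA"]
family = iterate_search("MKVLA", toy_database, threshold=2.0)
print(sorted(family))
```

The real program additionally weights sequences, uses observed background frequencies, and handles gapped alignments, so this sketch shows only the shape of the loop.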
Figure 1
Free Access to the Literature
Should your quest for information turn to text, not sequences, you will find that NCBI's PubMed database offers free searches of the biomedical literature. The straightforward Web interface masks a sophisticated search engine for 9 million abstracts gathered since 1966. PubMed consists of all of NLM's MEDLINE, plus some non-biomedical articles from general science journals such as Science and the Proceedings of the National Academy of Sciences. PubMed can be used via simple text searches or Boolean searches or by selecting specific fields (such as author name, publication date, or journal name), and is set up to allow you to refine your search step by step.
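As a rough illustration of combining Boolean operators with field restrictions, the sketch below assembles a query string. The helper function is hypothetical, and the field tags shown (`[au]` for author, `[ta]` for journal title abbreviation, `[dp]` for publication date) are those of the present-day PubMed search syntax, which is an assumption relative to the interface described in this column.

```python
def pubmed_query(terms, author=None, journal=None, year=None):
    """Combine free-text terms and optional field restrictions with AND."""
    parts = [f"({term})" for term in terms]
    if author:
        parts.append(f"{author}[au]")
    if journal:
        parts.append(f'"{journal}"[ta]')
    if year:
        parts.append(f"{year}[dp]")
    return " AND ".join(parts)

# Start broad, then refine step by step by adding restrictions.
broad = pubmed_query(["BLAST OR PSI-BLAST"])
refined = pubmed_query(["BLAST OR PSI-BLAST"], author="Altschul SF", year=1997)
print(broad)
print(refined)
```

The resulting strings can be pasted into the PubMed search box; each added restriction narrows the previous result set.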
Several features of PubMed set it apart from other literature search
facilities. In papers where a new sequence or structure is reported, a link
is provided to that information. Where a participating journal has an online counterpart,
the abstract is linked to the full text of the article, although a
subscription may be required for access. Finally, a related-articles, or
neighbors, button summons other articles similar to the one selected, based
on the content of the abstract, the title, and the key terms assigned under
the MeSH
(Medical Subject Headings) system, if any. This allows search results to
include articles that might have gone unnoticed with a conventional search,
and may uncover new links to other databases.
3-D Structures
The Molecular Modeling Database (MMDB) incorporates macromolecular structure information from the Protein Data Bank (PDB; maintained at Brookhaven National Laboratory) in a form compatible with other NCBI data sources. VAST (the Vector Alignment Search Tool) directly compares 3-D structures in MMDB to find and align those with statistically significant similarity. Each individual protein record in MMDB has these structure neighbors listed in a table.
For visual examination of structural alignments, Cn3D ("see
in 3-D"), created specifically to permit viewing of MMDB structures
over the Web, joined the ranks of RasMol and Mage as
a free 3-D molecular structure viewer in 1997. With the soon-to-be-released
latest version, Cn3D 2.0, two or more 3-D structures can be aligned, viewed,
and animated. Furthermore, their sequences can be seen in a separate window,
aligned according to structural rather than sequence identity. Other
sequences, even those for which structures are not available, can be
imported into the sequence viewer and mapped onto the sequence of the known
structure(s) using gapped BLAST. As a further visual link-up between the
structure and sequence viewing windows, selecting a region of interest in
either the sequence or the structure causes the corresponding region in the
other window to be highlighted as well - providing an excellent means of
exploring conserved regions in 3-D detail.
Genomes
The Genomes division of Entrez offers a holistic approach to organizing nucleotide and protein sequence information, especially data for completely sequenced genomes and data on genome-project organisms. It provides both text and graphical displays of integrated genetic, sequence, and physical maps that can be viewed chromosome by chromosome. Genomes are navigated by clicking and zooming in on chosen segments of a chromosome map, or by text searching for marker names or for nucleotide or protein unique identifiers. There are currently 687 entries in the Genomes division, representing six large taxonomic groups: Archaea, Eubacteria, Eukaryotae, viroids, viruses, and plasmids.
Taxonomy
In many ways the Taxonomy
division is the linchpin of all Entrez information, since all nucleotide,
protein, genomic, and structural records are required to include the name of
the organism from which they are derived. Both morphological and molecular
evidence is used to classify organisms, with consultation from a number of
external curators and experts. About 35,000 organisms are currently
represented in GenBank, out of approximately 1.5 million on the planet that
have been described morphologically.
A Sampling of Biology at the NCBI
Online Mendelian Inheritance in Man (OMIM), based on the book Mendelian Inheritance in Man by Victor McKusick and colleagues, is the most comprehensive online text resource for human genetic disease information. Updated daily by staff based at Johns Hopkins University in Baltimore, OMIM currently weighs in at 9,439 gene entries, of which 572 have been added so far in 1998 alone. OMIM may be queried using any text-based term, including gene name, disease name, or contributing author. Links to proteins, genes, and abstracts of the literature cited within OMIM records are also provided.
The Cancer Genome Anatomy
Project (CGAP) is a collaboration between NCBI and the National Cancer Institute to find new
genes involved in cancer progression. It combines experimental biology with
bioinformatics by making use of a microdissection technique that samples
normal, precancerous, and cancer tissue. The three sample types are used to
make libraries from which randomly selected clones are sequenced and grouped
together according to similarity. These "fingerprints" for normal
and cancerous tissues can be compared; the most interesting genes to a
cancer researcher or a drug developer are probably those that have
significantly different levels of expression between samples.
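The kind of comparison described above can be sketched as a ratio of clone frequencies between two libraries. The gene labels and counts here are invented for illustration, the function name is hypothetical, and a simple ratio with a pseudocount is an assumption standing in for whatever statistical test a real analysis would apply.

```python
from collections import Counter

def expression_ratios(normal_clones, tumor_clones, pseudocount=0.5):
    """Ratio of tumor to normal clone frequency for each gene cluster."""
    normal, tumor = Counter(normal_clones), Counter(tumor_clones)
    n_total, t_total = len(normal_clones), len(tumor_clones)
    ratios = {}
    for gene in set(normal) | set(tumor):
        # The pseudocount avoids division by zero for genes absent from a library.
        n_freq = (normal[gene] + pseudocount) / n_total
        t_freq = (tumor[gene] + pseudocount) / t_total
        ratios[gene] = t_freq / n_freq
    return ratios

# Hypothetical libraries: each entry is the cluster a sequenced clone fell into.
normal_library = ["geneA"] * 8 + ["geneB"] * 2 + ["geneC"] * 10
tumor_library = ["geneA"] * 8 + ["geneB"] * 11 + ["geneC"] * 1
ratios = expression_ratios(normal_library, tumor_library)
for gene in sorted(ratios):
    print(gene, round(ratios[gene], 2))
```

Clusters with ratios far from 1 would be the candidates of interest; a ratio alone says nothing about significance when clone counts are small.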
A Natural System of Gene Families from Complete Genomes provides a database of COGs, clusters of orthologous groups of proteins - or, more simply put, groups of proteins that represent ancient conserved domains found across different organisms. Seven completely sequenced genomes - one eukaryote (yeast), plus several eubacteria and archaebacteria - were used in the first edition of the COG scheme; more will be added in the future. The COGs can be searched with your favorite protein from any organism to find its nearest neighbors in the COG-featured genomes. Since all three main branches of the phylogenetic tree are represented, this resource can provide insight into evolutionary relationships between individual proteins and even whole metabolic and regulatory pathways.
The Gene Map, a new
edition of which has just been released, maps about 30,000 sequence clusters
of the human genome as defined by UniGene, an
experimental system for automatically partitioning genomes. Each UniGene
cluster consists of several identical expressed sequence tags (ESTs) and
known genes, and is placed on the human gene map in accordance with an
international standard set of gene markers. The Gene Map is linked to the Genes and Disease
site, where nonexperts can browse information on about 60 selected human
diseases that includes short summaries of the molecular pathogenesis of each
disease along with links to OMIM, protein sequences, relevant
literature, and disease-specific sites.
These are only the staples of the NCBI diet, along with a few recent, more exotic morsels to whet the reader's appetite. The only way to appreciate the full range of this site is to pay it a visit.
Jo McEntyre is a visiting associate at the National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland.


Endlinks
Send email to NCBI for further information on any of the services discussed above.
Evaluated MEDLINE - BioMedNet's own MEDLINE implementation, with citations rated for relevance.
EMBL European Bioinformatics Institute (EBI) - maintains the European Molecular Biology Laboratory (EMBL) sequence database, SWISS-PROT, and other databases. Provides tools for database searching, data analysis, and data submission.
GenomeNet WWW Server - a Japanese network of databases and services for genome research that integrates information from other genome projects into sequence analysis tools. Includes the Kyoto Encyclopedia of Genes and Genomes (KEGG), a database for biochemical pathways. Supported by the Institute of Chemical Research of Kyoto University and the Human Genome Center at the Institute of Medical Sciences of the University of Tokyo.
Site Review - discusses FlyBase, a database of the Drosophila genome, maintained at Indiana University.
Web sites mentioned in this column: National Center for Biotechnology Information; Entrez; Sequence Submission; Sequence Analysis; 3-D Structures; Other NCBI Resources; Other Databases; Other Resources