by Giovanni Lavorgna, Edoardo Boncinelli,
Andreas Wagner, and Thomas Werner
Reprinted with permission from Trends in Genetics, Vol. 14, No. 9, September 1998
Transcription factors (TFs) not only control a wide range of physiological processes, but are also responsible for a host of pathological phenomena in eukaryotic cells. These molecules specifically recognize and bind to regulatory sequences of target genes, whose transcription is up- or downregulated as a consequence. Two phenomena that are typical of TFs in higher eukaryotes present serious obstacles to the analysis of their function by genetic means. These are pleiotropy (one TF might regulate many genes, sometimes of apparently unrelated function) and genetic redundancy (several related TFs might regulate overlapping groups of genes). The latter phenomenon is known to greatly hinder the interpretation of gene knockout experiments in vertebrates [1].
Given the problems associated with a genetic approach, the direct identification of TF target genes is an attractive alternative for dissecting TF function. Several in vitro methods have been used for this purpose [2, 3, 4, 5, 6, 7], with mixed success. An in silico analysis, aimed at identifying target genes by detecting potential TF-binding sites within general-purpose nucleotide sequence databases such as GenBank, is a promising extension of in vitro methods. However, owing to the low information content [8, 9] of these often short and degenerate sites, many potential sites will be found by chance almost anywhere in a genome. This is a major problem of almost any TF site-based approach. It can be partially circumvented by (1) searching databases more specialized than GenBank, such as the Eukaryotic Promoter Database [10], and (2) incorporating context information into the binding-site search.
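The scale of the chance-match problem can be estimated with a short calculation. The following Python sketch is illustrative only and is not taken from any of the cited tools: it assumes that the four bases are equally frequent and independent, and the function name `expected_random_hits` and the example consensus are hypothetical.

```python
# Number of bases allowed by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def expected_random_hits(consensus, sequence_length, both_strands=True):
    """Expected number of chance matches to an IUPAC consensus.

    Assumption: bases are uniformly distributed and independent,
    which real genomes only approximate.
    """
    # Probability that a single random window matches every position.
    p_match = 1.0
    for code in consensus:
        p_match *= IUPAC_DEGENERACY[code] / 4.0
    # Number of windows of the site's width in the sequence.
    windows = max(sequence_length - len(consensus) + 1, 0)
    strands = 2 if both_strands else 1
    return p_match * windows * strands


if __name__ == "__main__":
    # Hypothetical 7-bp site with one two-fold degenerate position.
    print(expected_random_hits("TGASTCA", 1_000_000, both_strands=False))
```

Under these assumptions, a 7-bp site with one two-fold degenerate position is expected to occur roughly 122 times by chance on one strand of a 1-Mb sequence. Counting both strands doubles that estimate, which slightly overcounts for palindromic sites.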
A variety of methods and programs incorporating context information in promoter analysis has recently become available. A selection of them is reviewed here and contrasted with context-insensitive tools. We put special emphasis on Web-accessible tools because they are the most readily available to most investigators. The surveyed methods are different from those addressing the related problem of promoter recognition, which have been reviewed recently [11] and will not be discussed further.
A number of well-established, context-insensitive sequence analysis programs (e.g., FASTA [12], FindPatterns [19], ProfileSearch [19], PatScan [24], MatInspector [25], MatrixSearch [22], and SignalScan [26]) can be employed to screen whole nucleotide-sequence databases (or subsections) for matches to short sequences such as TF-binding sites. All of these programs can identify potential binding sites for a TF of interest, but programs that use weight matrices have major advantages over those that use IUPAC consensus sequences or exact nucleotide sequences [13]. (Programs using weight matrices use the distribution of all four nucleotides at each position of the matrix to calculate a quantitative score, which results in enhanced specificity. IUPAC consensus searches use, instead, a majority rule, which results in a simple yes/no decision [13].) However, because these programs lack context-sensitivity, they will find spurious matches in many sequences that are not target genes. This high false-positive rate will obscure the real target genes also found in the search. The otherwise popular BLAST [14] program is even less well suited for locating the limited similarities represented by TF-binding sites. In fact, it requires a minimum of seven exactly matching bases, which is too stringent for the majority of TF-binding sites.
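The difference between a yes/no consensus match and a quantitative weight-matrix score can be illustrated with a minimal Python sketch. It is not the scoring scheme of MatInspector, MatrixSearch, or any other program named above: the log-odds formula, the pseudocount, the uniform background, and the toy binding sites are all assumptions chosen for illustration.

```python
import math

# Bases allowed by each IUPAC code (subset sufficient for the example).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "CG", "W": "AT",
         "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "N": "ACGT"}


def consensus_match(consensus, site):
    """Yes/no decision: every base must be allowed by the consensus."""
    return len(consensus) == len(site) and all(
        base in IUPAC[code] for code, base in zip(consensus, site))


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned, equal-length sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in "ACGT"})
    return matrix


def score(matrix, site):
    """Quantitative score: sum of the per-position log-odds weights."""
    return sum(column[base] for column, base in zip(matrix, site))


def scan(matrix, sequence, threshold):
    """Return (position, site, score) for windows scoring >= threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        s = score(matrix, window)
        if s >= threshold:
            hits.append((i, window, s))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned binding sites, for illustration only.
    toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
    pwm = build_log_odds(toy_sites)
    for candidate in ("TGACTCA", "TGACTAA", "CCCCCCC"):
        print(candidate, consensus_match("TGASTCA", candidate),
              round(score(pwm, candidate), 2))
```

In this toy case the consensus `TGASTCA` rejects `TGACTAA` outright, whereas the matrix gives it a score below the best site but far above an unrelated sequence. The matrix therefore ranks candidates, and the threshold passed to `scan` becomes an explicit specificity setting.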
Web-Accessible Tools for Context-Sensitive Sequence Analysis
The functional context of a TF-binding site includes the following: the local status of chromatin compaction; the position of the binding site relative to the transcription start site; and the presence of other binding sites nearby. The computational methods discussed here (see the references for URLs) try to include this context-sensitivity in different ways. None of these programs is capable of pinpointing real target genes specifically, but their output should be enriched in these genes owing to the enormous reduction in the number of spurious matches. The user is responsible for defining the type of context to be considered by these methods. This step is crucial for the quality of the results and, therefore, should receive special attention.
As shown in Table 1, which lists available resources for finding transcription factor target genes in nucleotide sequence databases, programs such as MatInspector and MatrixSearch come with a predefined library of carefully selected matrices, which are of immediate use to the researcher. MatInspector's library is based on the TRANSFAC database [23], whereas MatrixSearch is based on the Information Matrix Database [22]. It is important to stress the need for high-quality weight matrices, which can contribute to a good search outcome even more than the chosen search algorithm [13]. A good introduction to the criteria that a TF-binding site must meet in order to be included in a high-quality weight matrix can be found on the documentation pages of the TRANSFAC Web site [15].
The NCBI server provides CosMoS, a yeast-specific tool that allows the detection of user-defined patterns within putative promoter regions of the yeast genome (upstream of open reading frame start points), effectively focusing the search on likely promoter regions. The program FastM exploits the spatial connection and, optionally, the sequential order of the binding sites of two different transcription factors in order to develop simple models of transcriptional regulatory DNA sections, independent of a priori knowledge about the location of promoters. FastM employs MatInspector and its matrix library, thus greatly facilitating the user's selection of TF weight matrices or consensus sequences. Another tool, TargetFinder, also uses MatInspector and a predefined TF library to search for TF sites in databases. The program takes advantage of annotated features present in GenBank entries to restrict matches to relevant gene subregions, significantly reducing the background usually associated with these searches. TargetFinder allows the inclusion of sequence annotation (e.g., TATA box, transcription start site, annotated promoters) that cannot be included in FastM models. The Transcription Factor Combination Discoverer [16] finds and analyzes combinations of transcription factor binding sites in the yeast genome, particularly in upstream regions.
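The idea of a two-site model with a distance constraint and an optional order can be sketched in a few lines of Python. This is not FastM's algorithm or interface. For brevity it uses regular expressions in place of weight matrices, and the function names, patterns, and test sequence are hypothetical.

```python
import re


def find_sites(pattern, sequence):
    """Start positions of all (possibly overlapping) pattern matches."""
    # A lookahead is used so that overlapping matches are not skipped.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


def find_pairs(sequence, pattern_a, pattern_b, max_distance, ordered=False):
    """Pairs of site A and site B whose start positions lie close together.

    If `ordered` is True, site A must start at or before site B.
    """
    starts_a = find_sites(pattern_a, sequence)
    starts_b = find_sites(pattern_b, sequence)
    pairs = []
    for a in starts_a:
        for b in starts_b:
            if ordered and b < a:
                continue
            if abs(b - a) <= max_distance:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical sequence: site A at 3 and 76, site B at 20.
    seq = "AAATGACTCA" + "C" * 10 + "GGGCGG" + "A" * 50 + "TGACTCA"
    print(find_pairs(seq, "TGA[CG]TCA", "GGGCGG", max_distance=30))
    print(find_pairs(seq, "TGA[CG]TCA", "GGGCGG", max_distance=60))
    print(find_pairs(seq, "TGA[CG]TCA", "GGGCGG", max_distance=60,
                     ordered=True))
```

Requiring two sites within a short distance discards isolated matches to either pattern, which is where the reduction in background comes from. Enforcing the order narrows the result further.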
Other Approaches for Context-Sensitive Sequence Analysis
Often, binding sites for one or more TFs are closely spaced in a regulatory region, indicating cooperativity in transcriptional regulation. Recently, statistical and heuristic techniques have been developed for the detection of such clusters in large genomic DNA regions [16, 17, 18]. For example, the GenomeInspector tool can detect distant correlations between sequence elements (e.g., between ORFs and TF-binding sites) over megabases of nucleotide sequence [17]. Another approach, not yet released as public-domain software, employs statistical tests to screen a genome for very closely spaced TF-binding sites [18]. Applied to the yeast genome, it detects genes known to be regulated by particular TFs, as well as genes that are not known to be regulated by those TFs but that act in the same cellular process (e.g., the cell cycle) as the studied TFs [18].
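One simple way to ask whether sites are more closely spaced than expected by chance is a window count compared against a Poisson model. The Python sketch below illustrates that general idea only and is not the statistical test of reference [18]. The assumption that sites are scattered uniformly at random, the window size, the significance level, and all names are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(k))
    return max(0.0, 1.0 - cdf)


def find_clusters(positions, genome_length, window, alpha=0.001):
    """Report windows holding more sites than a Poisson model predicts.

    Each window starts at a site. Returns (start, site_count, p_value)
    for every window whose p-value is below alpha. No correction for
    multiple testing is applied in this sketch.
    """
    positions = sorted(positions)
    # Expected number of sites per window under uniform random placement.
    lam = len(positions) / genome_length * window
    clusters = []
    right = 0
    for left, start in enumerate(positions):
        # Advance the right edge to the first site outside the window.
        while right < len(positions) and positions[right] < start + window:
            right += 1
        count = right - left
        p_value = poisson_tail(count, lam)
        if p_value < alpha:
            clusters.append((start, count, p_value))
    return clusters


if __name__ == "__main__":
    # Hypothetical site positions in a 100-kb sequence.
    sites = [100, 5000, 5020, 5050, 5080, 20000, 40000, 60000, 80000]
    for start, count, p in find_clusters(sites, 100_000, window=100):
        print(start, count, p)
```

In the hypothetical data, only the windows beginning in the group of sites near position 5000 are reported, while isolated sites are not. A real screen would need to correct for the large number of windows tested and for local variation in base composition.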
Although the search for TF target genes in sequence databases is still a difficult task, the tools discussed above can help to improve the signal-to-noise ratio considerably. Importantly, these individual approaches complement one another, and their incorporation into one integrated analysis tool is highly desirable. Even the current improvements in specificity, achieved by incorporating very simple biological principles into database searches, demonstrate that context-sensitive approaches hold great promise.
Andreas Wagner is an assistant professor of biology at the University of New Mexico.
Giovanni Lavorgna is an Associate Research Scientist at the Ospedale San Raffaele, Milan, Italy.

