News | November 20, 1998

Web Tool Generates Evolutionary Relationships Of Protein Sequences

The University of California, San Diego (UCSD) and the San Diego Supercomputer Center (SDSC) have launched a new computational tool to help scientists detect shared features and evolutionary relationships in the growing stream of protein and DNA sequence data.

Created by William Grundy and Charles Elkan of UCSD's Irwin and Joan Jacobs School of Engineering and Timothy Bailey at SDSC, Meta-MEME compares families of evolutionarily related DNA or protein sequences using a Sun Microsystems Enterprise Server 10000 at SDSC. The Meta-MEME project is funded by the National Biomedical Computation Resource of SDSC, UCSD, and The Scripps Research Institute.

A biologist begins by submitting a family of similar DNA or protein sequences for analysis. After that, the entire Meta-MEME process is automatic. Up to four sets of results are e-mailed to the user: the statistical model, alignments showing where common features appear in the sequences, an alignment showing how the sequences are related to one another, and the results of searching a large sequence database using the model. The statistical models and analyses produced by Meta-MEME can help biologists infer evolutionary family trees, uncover previously unrecognized relationships between species, or develop experiments to determine a protein's function.

Evolutionary Fingerprints

Biologists may also discover previously unknown relationships by using Meta-MEME models to search publicly available databases of unannotated genetic data, which might turn up distant evolutionary relatives of the submitted sequences.

"If a gene that causes cancer in mice, for example, were shown to share a common ancestor with a human gene, this would strongly implicate the human gene as a cancer agent," Bailey said. "However, because the amino acid sequence of the common ancestor is almost never available, this common ancestor can only be inferred rather than proven."

Meta-MEME applies the power of probabilistic reasoning to the task of recognizing ancestral relationships among biological sequences. Although not the first software system to apply such methods to biological sequence modeling, Meta-MEME is the only such system to focus on evolutionary fingerprints, called motifs.

In this context, a motif is a short "word" in a protein or DNA sequence that appears in a similar form in all or most of the members of a given sequence family. The appearance of a motif in a distantly related sequence implies that, over thousands or even millions of years, this particular region of the ancestral sequence has remained relatively unchanged. Such consistency nearly always means that the motif is required for the protein to function properly. Often the only evidence from which to infer a common ancestral relationship lies in a handful of these small motifs.
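To make the idea of a motif as a recurring "word" concrete, here is a minimal Python sketch that scans a DNA sequence for the window that best matches a small position-specific probability table. The motif values, the uniform background frequency, and the function names are all hypothetical choices for illustration; this is not Meta-MEME's code or its actual scoring scheme.

```python
import math

# Hypothetical 4-position DNA motif: each column gives the probability
# of seeing each base at that position. Values are invented for illustration.
MOTIF = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]

# Assumed background: every base equally likely outside the motif.
BACKGROUND = 0.25


def window_score(window):
    """Log-odds (base 2) that the window came from the motif rather than background."""
    return sum(math.log2(column[base] / BACKGROUND)
               for column, base in zip(MOTIF, window))


def best_motif_hit(sequence):
    """Return (start position, score) of the best-scoring window in the sequence."""
    width = len(MOTIF)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scored = [(window_score(sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, position = max(scored)
    return position, score


if __name__ == "__main__":
    position, score = best_motif_hit("CCGGATGCTTAA")
    print(f"best match starts at index {position}, log-odds score {score:.2f}")
```

A positive score means the window looks more like the motif than like background sequence. A conserved region shared by distant relatives would score well even when the surrounding sequence has diverged.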

Inspired By Speech Recognition

To characterize and infer relationships among sets of DNA or protein sequences, Meta-MEME employs machine learning techniques from artificial intelligence (AI). In the past two years, one of AI's major successes has been speech recognition: systems that cost less than $100 now achieve relatively low error rates for dictation.

Every commercially available speech recognition system today uses a class of statistical models called hidden Markov models (HMMs) as the basis for its processing. HMMs can also be applied to biological sequences. For Meta-MEME, the "speech" is the series of nucleotides or amino acids that make up the biological sequence. Just as speech recognition software trained on utterances of the word "hello" can accurately recognize new instances of that utterance, a Meta-MEME model trained on a set of related hemoglobin sequences can recognize previously unannotated hemoglobins in a large protein database.

Given a Meta-MEME model and a candidate biological sequence, efficient HMM algorithms can answer questions such as, "What is the probability that this model generated this sequence?"
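The standard way to answer that question for any HMM is the forward algorithm, which sums over all hidden state paths in time proportional to the sequence length. The Python sketch below applies it to a toy two-state model. The state names, the probabilities, and the function name are assumptions made up for illustration; they are not taken from Meta-MEME, whose models are larger and built from motifs.

```python
# Toy HMM, invented for illustration: a "background" state that emits
# bases uniformly and a "motif" state that strongly prefers the base A.
STATES = ("background", "motif")

START_P = {"background": 0.9, "motif": 0.1}

TRANS_P = {
    "background": {"background": 0.8, "motif": 0.2},
    "motif": {"background": 0.3, "motif": 0.7},
}

EMIT_P = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
}


def forward_probability(sequence, states, start_p, trans_p, emit_p):
    """Return P(sequence | model), summed over all hidden state paths."""
    if not sequence:
        return 1.0
    # alpha[s]: probability of emitting the symbols seen so far
    # and being in state s after the most recent symbol.
    alpha = {s: start_p[s] * emit_p[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    for candidate in ("AAAA", "CGTC"):
        p = forward_probability(candidate, STATES, START_P, TRANS_P, EMIT_P)
        print(candidate, p)
```

For long sequences these products underflow, so practical implementations usually work with log probabilities or rescale at each step; the plain version is kept here for readability.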

"Although humans are very good at recognizing shared features of spoken words, images, or even biological sequences, people are notoriously bad at estimating probabilities," Grundy said. "For detecting sequences that share a common ancestor, probabilities are precisely what is needed."

Molecular biologists worldwide can tap into the computational power at SDSC by using the Meta-MEME software on the Internet at metameme.sdsc.edu/.

For more information: Ann Redelfs, SDSC, MC 0505, 9500 Gilman Drive, Bldg 109, La Jolla, CA 92093-0505, USA. Telephone: 619-534-5032. Email: redelfs@sdsc.edu.