MUMMALS

MUMMALS (MUltiple alignment with Multiple MAtch state models of Local Structure) is a program for constructing multiple alignments of protein sequences (Pei and Grishin, submitted). It implements complex hidden Markov models (HMMs) of pairwise alignment that use multiple match states to capture local structural information. MUMMALS follows a progressive alignment method with a probabilistic consistency-based scoring function similar to the one used in ProbCons (Do et al., 2005).

The procedure runs as follows. First, a tree is built quickly with a K-mer count method (Edgar, 2004). An initial alignment is then built progressively, guided by this tree, with a simple sum-of-pairs scoring function. A second tree is built with the UPGMA method from sequence identities calculated from the initial alignment. Next, the probabilistic consistency strategy is applied in the same way as in ProbCons (Do et al., 2005): for each sequence pair, the match probabilities of residue pairs are calculated using one of the HMMs listed below, and these probability matrices are then subjected to a consistency transformation, which involves multiplications of the matrices. Finally, MUMMALS progressively aligns the sequences, guided by the second tree, using the consistency-based scoring function.
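The consistency step can be illustrated with a small sketch. It assumes the ProbCons-style formulation, in which the updated matrix for sequences x and y is the average over all sequences z of the product of the x-z and z-y matrices, with the x-x and y-y matrices taken as identities. The function names and the dictionary layout are hypothetical, and dense Python lists stand in for whatever sparse representation the real program may use; this is not the MUMMALS code.

```python
from itertools import combinations


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*m)]


def consistency_transform(post, n_seqs):
    """Apply one round of a ProbCons-style consistency transformation.

    post[(x, y)] with x < y is a len(x) by len(y) matrix whose entry (i, j)
    is the match probability of residue i of x and residue j of y.
    (Hypothetical data layout, chosen for illustration.)
    """
    def get(x, y):
        # Matrices are stored once per pair; flip them when needed.
        return post[(x, y)] if x < y else transpose(post[(y, x)])

    updated = {}
    for x, y in combinations(range(n_seqs), 2):
        # z = x and z = y each contribute the direct matrix, because the
        # self-comparison matrices are identities.
        total = [[2.0 * v for v in row] for row in post[(x, y)]]
        for z in range(n_seqs):
            if z in (x, y):
                continue
            via_z = matmul(get(x, z), get(z, y))
            total = [[t + v for t, v in zip(t_row, v_row)]
                     for t_row, v_row in zip(total, via_z)]
        updated[(x, y)] = [[t / n_seqs for t in row] for row in total]
    return updated


# Three one-residue sequences: the 0-1 match is uncertain on its own,
# but both residues match the residue of sequence 2 with certainty.
example = {(0, 1): [[0.5]], (0, 2): [[1.0]], (1, 2): [[1.0]]}
result = consistency_transform(example, 3)
```

In this toy input, the support routed through the third sequence raises the 0-1 match probability from 0.5 to 2/3, which is the effect the consistency strategy is meant to produce.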

On this web server, users can select one of the five hidden Markov models listed below:

HMM_1_1_0: with only one match state to model any residue pair.

HMM_1_1_1: with 2 match states. One match state models residue pairs in core blocks, the other match state models residue pairs in unaligned (structurally divergent) regions.

HMM_1_3_1: with 4 match states. Residue pairs in core blocks are modeled by 3 match states corresponding to three secondary structure types (helix, strand and coil). Residue pairs in unaligned regions are modeled by 1 match state.

HMM_3_1_1: with 4 match states. Residue pairs in core blocks are modeled by 3 match states corresponding to three categories of relative sidechain solvent accessibility (<11.2, 11.2~51.7, and >51.7). Residue pairs in unaligned regions are modeled by 1 match state.

HMM_3_3_1: with 10 match states. Residue pairs in core blocks are modeled by 9 match states corresponding to combinations of three secondary structure types and three solvent accessibility categories. Residue pairs in unaligned regions are modeled by 1 match state.
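The model names appear to encode the match-state counts given above. Reading a name as HMM_a_b_c, the listed numbers are consistent with a solvent accessibility categories, b secondary structure types, and c match states for unaligned regions, giving a × b + c match states in total. This reading is inferred from the five descriptions and is not stated by the authors; the helper below is a hypothetical illustration that checks it.

```python
def count_match_states(model_name):
    """Return the match-state count implied by a name such as 'HMM_1_3_1'.

    Assumes the inferred convention HMM_a_b_c: a accessibility categories
    times b secondary structure types for core blocks, plus c states for
    unaligned regions.
    """
    prefix, a, b, c = model_name.split("_")
    if prefix != "HMM":
        raise ValueError(f"unexpected model name: {model_name}")
    return int(a) * int(b) + int(c)


# Match-state counts as listed in the model descriptions.
listed = {
    "HMM_1_1_0": 1,
    "HMM_1_1_1": 2,
    "HMM_1_3_1": 4,
    "HMM_3_1_1": 4,
    "HMM_3_3_1": 10,
}
inferred = {name: count_match_states(name) for name in listed}
```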

Here are diagrams of HMM_1_1_0, HMM_1_1_1 and HMM_1_3_1.

Running time follows the order: HMM_1_1_0 < HMM_1_1_1 < HMM_1_3_1 ≈ HMM_3_1_1 < HMM_3_3_1

Alignment quality generally follows the order: HMM_1_1_0 < HMM_1_1_1 < HMM_3_1_1 < HMM_1_3_1 ≈ HMM_3_3_1

HMM_1_3_1 is the default since it balances running time and alignment quality. The two best performing HMMs (HMM_1_3_1 and HMM_3_3_1) give on average slightly better results (several percent) than ProbCons (version 1.1), MAFFT (version 5.667) and MUSCLE (version 3.52) on several multiple alignment testing datasets (Pei and Grishin, submitted).

References

Do, C. B., Mahabhashyam, M. S., Brudno, M. and Batzoglou, S. (2005). ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15, 330-340.
Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.
Katoh, K., Kuma, K., Toh, H. and Miyata, T. (2005). MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511-518.
Pei, J. and Grishin, N. V. (submitted). MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information.