PROMALS documentation

PROMALS (PROfile Multiple Alignment with predicted Local Structure) is a progressive method for aligning multiple protein sequences, with enhanced profile information from database searches and secondary structure prediction.

Alignment procedure (Figure)

The alignment order of PROMALS is set by a tree built using a k-mer count method [1]. Like PCMA [2] and MUMMALS [3], PROMALS has two alignment stages, one for easy and one for difficult alignments.

In the first stage, highly similar sequences are progressively aligned in a fast way using a weighted sum-of-pairs measure of BLOSUM62 scores [4] (step 2 in Figure 1). Two neighboring groups on the tree are aligned in this fast way if their average sequence identity is higher than a certain threshold (default: 60%). The result of the first alignment stage is a set of sequences or pre-aligned groups that are relatively divergent from each other.

In the second alignment stage, one representative sequence (the longest one) is selected from each pre-aligned group. For each representative, PSI-BLAST [5] is used to search for homologs in the UniRef90 sequence database [6]. The PSI-BLAST alignment is processed to remove divergent hits. The PSI-BLAST checkpoint file after 3 iterations is used by PSIPRED [7] to predict secondary structures. For each pair of representatives, profiles are derived from the PSI-BLAST alignments and PSIPRED secondary structure predictions, and a matrix of posterior probabilities of matches between positions is obtained by the forward and backward algorithms of a profile-profile hidden Markov model. These matrices are used to calculate probabilistic consistency scores as described in [8]. The representatives are then aligned progressively according to the consistency-based scoring function, and the pre-aligned groups obtained in the first stage are merged into the multiple alignment of the representatives. Finally, gap placement is refined to make the gap patterns more realistic.
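To illustrate the kind of k-mer count measure that can drive the guide tree, the sketch below computes a k-mer similarity between two unaligned sequences in the style described for MUSCLE [1]: the number of shared k-mers divided by the number of k-mers in the shorter sequence. This is only an illustration. The value k=3, the use of the full amino acid alphabet, and the function names are assumptions, not the settings or code used by PROMALS.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in an ungapped sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k=3):
    """Shared k-mer count divided by the k-mer count of the shorter sequence.

    k=3 is an assumed value chosen for illustration only.
    """
    shorter = min(len(x), len(y))
    if shorter < k:
        raise ValueError("both sequences must contain at least k residues")
    counts_x = kmer_counts(x, k)
    counts_y = kmer_counts(y, k)
    # A k-mer present n times in x and m times in y contributes min(n, m).
    shared = sum(min(n, counts_y[kmer]) for kmer, n in counts_x.items())
    return shared / (shorter - k + 1)


def kmer_distance(x, y, k=3):
    """Convert the similarity into a distance usable for tree building."""
    return 1.0 - kmer_similarity(x, y, k)
```

A matrix of such distances over all sequence pairs could then be passed to a clustering method such as UPGMA to obtain the alignment order.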

The use of homologs from database searches and secondary structure prediction makes PROMALS more accurate than other stand-alone aligners for divergent sequences. However, it also makes the program slow when aligning a large number of sequences.

PROMALS output alignment format

The PROMALS web server provides links to alignments in the following two formats.

1. Colored alignment

The first line in each block shows conservation indices for positions with a conservation index above 5. Each representative sequence has a magenta name and is colored according to PSIPRED secondary structure predictions (red: alpha-helix, blue: beta-strand). A representative sequence and the sequences immediately below it with black names, if there are any, form a closely related group (determined by the option "Identity threshold"). Sequences within each group are aligned in a fast way. The groups are aligned using profile consistency with predicted secondary structures. In the example below, seq8, seq1, seq6, seq5 and seq9 are representative sequences; seq8, seq10 and seq7 form a closely related group, and seq6 is a group by itself. The consensus predicted secondary structures are shown in the last line in each block. If the fraction of helix or strand predictions among representative sequences in a position is larger than 0.5, the consensus letter is "h" or "e", respectively.

Colored alignment example:

Conservation:          9669    6 6    9   9   99 9  7 6  9 696  66 6 6   67 99
seq8              ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKK
seq10             --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKK
seq7              ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKK
seq1              -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKK
seq4              ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE---------
seq3              -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-
seq2              EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKK
seq0              --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKK
seq6              --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKK
seq5              ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKK
seq9              ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
Consensus_ss:          hhhhhhhhhhhhh    eeeeeeeee    eeeeee    eeeeeee   hhhhh


Conservation:                                                    
seq8              MEKLNNIFFTLM-----------------------------------
seq10             MEK--------------------------------------------
seq7              MEKLNNIFF--------------------------------------
seq1              IEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq4              ------MRLFGVQKDNFALEHSLL-----------------------
seq3              -----------------------------------------------
seq2              VEKLHGK----------------------------------------
seq0              IEKF-------------------------------------------
seq6              LEKLSSTLLRSI-----------------------------------
seq5              IEKLTTLLMR-------------------------------------
seq9              -----------------------------------------------
Consensus_ss:     hhhhhhhhhhh
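
The consensus rule described above (a consensus letter of "h" or "e" when more than half of the representative sequences are predicted as helix or strand at a position) can be sketched as follows. The input encoding is an assumption: each representative is given as a string of per-column states aligned to the multiple alignment, with "h" for helix, "e" for strand, and any other character for coil or gap. The sketch also assumes that the fraction is taken over all representatives, including those with a gap at the position; the documentation does not state how gaps are counted.

```python
def consensus_ss(predictions, cutoff=0.5):
    """Return one consensus secondary structure line for aligned predictions.

    predictions: equal-length strings, one per representative sequence
    (hypothetical encoding: 'h' helix, 'e' strand, anything else coil/gap).
    """
    if not predictions:
        return ""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all prediction strings must have the same length")
    total = len(predictions)  # assumption: gapped representatives still count
    consensus = []
    for column in zip(*predictions):
        if column.count("h") / total > cutoff:
            consensus.append("h")
        elif column.count("e") / total > cutoff:
            consensus.append("e")
        else:
            consensus.append(" ")  # no consensus, shown as blank
    return "".join(consensus)
```

Because the comparison is strict, a position where exactly half of the representatives are predicted as helix receives no consensus letter.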

2. CLUSTAL format alignment

Each sequence and its name are on the same line, and the sequences can be partitioned into a number of blocks separated by empty lines. The word "CLUSTAL", indicating the format, can begin the first line, but such a first line is optional.

CLUSTAL format alignment example:

seq8    ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKK
seq1    -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKK
seq6    --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKK
seq5    ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKK
seq9    ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
seq10   --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKK
seq7    ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKK
seq4    ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE---------
seq3    -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-
seq2    EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKK
seq0    --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKK

        

seq8    MEKLNNIFFTLM-----------------------------------
seq1    IEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq6    LEKLSSTLLRSI-----------------------------------
seq5    IEKLTTLLMR-------------------------------------
seq9    -----------------------------------------------
seq10   MEK--------------------------------------------
seq7    MEKLNNIFF--------------------------------------
seq4    ------MRLFGVQKDNFALEHSLL-----------------------
seq3    -----------------------------------------------
seq2    VEKLHGK----------------------------------------
seq0    IEKF-------------------------------------------
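
For readers who want to post-process this output, the following minimal sketch reads a CLUSTAL format alignment as described above: an optional first line beginning with "CLUSTAL", then blocks of "name sequence" lines separated by empty lines. The function name is hypothetical, and skipping lines that start with whitespace (the conservation line some CLUSTAL writers emit) is an assumption about other tools, not something this page specifies.

```python
def parse_clustal(text):
    """Return {name: full aligned sequence}, keeping first-seen name order."""
    sequences = {}
    seen_first_line = False
    for line in text.splitlines():
        if not line.strip():
            continue  # empty lines only separate blocks
        if not seen_first_line:
            seen_first_line = True
            if line.startswith("CLUSTAL"):
                continue  # optional format header
        if line[0].isspace():
            continue  # assumed conservation/annotation line without a name
        fields = line.split()
        if len(fields) < 2:
            continue  # a name with no sequence chunk carries no residues
        name, chunk = fields[0], fields[1]
        # Chunks of the same sequence from successive blocks are concatenated.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```

Applied to the example above, the two blocks for each name would be joined into a single gapped sequence, and all joined sequences would have the same length.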

References

1. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.

2. Pei, J., R. Sadreyev, and N.V. Grishin, PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 2003. 19(3): p. 427-8.

3. Pei, J. and N.V. Grishin, MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res, 2006.

4. Henikoff, S. and J.G. Henikoff, Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 1992. 89(22): p. 10915-10919.

5. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-3402.

6. Wu, C.H., et al., The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 2006. 34(Database issue): p. D187-91.

7. Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 1999. 292(2): p. 195-202.

8. Do, C.B., et al., ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 2005. 15(2): p. 330-40.