PROMALS documentation

PROMALS (PROfile Multiple Alignment with predicted Local Structure) is a progressive method for aligning multiple protein sequences, with enhanced profile information from database searches and secondary structure prediction.

Alignment procedure (Figure)

The alignment order of PROMALS is set by a tree built using a k-mer count method [1]. Like PCMA [2] and MUMMALS [3], PROMALS has two alignment stages, one for easy and one for difficult alignments.

In the first stage, highly similar sequences are progressively aligned in a fast way with a weighted sum-of-pairs measure of BLOSUM62 scores [4] (step 2 in Figure 1). Two neighboring groups on the tree are aligned in this fast way if their average sequence identity is higher than a certain threshold (default: 60%). The result of the first alignment stage is a set of sequences or pre-aligned groups that are relatively divergent from each other.

In the second alignment stage, one representative sequence (the longest one) is selected from each pre-aligned group. For each representative, PSI-BLAST [5] is used to search for homologs in the UNIREF90 sequence database [6]. The PSI-BLAST alignment is processed to remove divergent hits. The PSI-BLAST checkpoint file after 3 iterations is used by PSIPRED [7] to predict secondary structures. For each pair of representatives, profiles are derived from the PSI-BLAST alignments and the PSIPRED secondary structure predictions, and a matrix of posterior probabilities of matches between positions is obtained by the forward and backward algorithms of a profile-profile hidden Markov model. These matrices are used to calculate the probabilistic consistency scores as described in [8]. The representatives are then aligned progressively according to the consistency-based scoring function, and the pre-aligned groups obtained in the first stage are merged into the multiple alignment of the representatives. Finally, gap placement is refined to make the gap patterns more realistic.
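The probabilistic consistency step can be illustrated with a minimal sketch, assuming the transformation described for ProbCons [8]: the posterior match matrix of each pair of sequences x and y is re-estimated by averaging over every sequence z in the set, P'(x,y) = (1/N) * sum over z of P(x,z) P(z,y), with P(x,x) taken as the identity matrix. The function names and the dictionary layout are hypothetical and are not taken from the PROMALS source code. PROMALS derives its input matrices from a profile-profile hidden Markov model, which is not reproduced here.

```python
def identity(size):
    """Return a size-by-size identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def consistency_transform(posteriors, lengths):
    """Apply one round of the consistency transformation (assumed form, see [8]).

    posteriors[(x, y)] with x < y is a lengths[x]-by-lengths[y] matrix of
    posterior match probabilities. Returns a dict with the same keys holding
    the re-estimated matrices.
    """
    n = len(lengths)

    def get(x, y):
        # P(x, x) is the identity; P(y, x) is the transpose of P(x, y).
        if x == y:
            return identity(lengths[x])
        if x < y:
            return posteriors[(x, y)]
        return transpose(posteriors[(y, x)])

    relaxed = {}
    for x in range(n):
        for y in range(x + 1, n):
            total = [[0.0] * lengths[y] for _ in range(lengths[x])]
            for z in range(n):
                product = matmul(get(x, z), get(z, y))
                for i in range(lengths[x]):
                    for j in range(lengths[y]):
                        total[i][j] += product[i][j]
            relaxed[(x, y)] = [[value / n for value in row] for row in total]
    return relaxed
```

As a toy case, take three one-residue sequences with P(0,1) = 0.2, P(0,2) = 0.9 and P(1,2) = 0.9. The support from the third sequence raises the score of the pair (0, 1) from 0.2 to about 0.40, which is the intended effect of consistency scoring.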
The use of homologs from database searches and secondary structure prediction makes PROMALS more accurate than other stand-alone aligners for divergent sequences. However, it also makes the program slow when aligning a large number of sequences.

PROMALS output alignment format

The PROMALS web server provides links to alignments in the following two formats.

1. Colored alignment

The first line in each block shows conservation indices for positions with a conservation index above 5. Each representative sequence has a magenta name and is colored according to PSIPRED secondary structure predictions (red: alpha-helix, blue: beta-strand). A representative sequence and the sequences with black names immediately below it, if there are any, form a closely related group (determined by the option "Identity threshold"). Sequences within each group are aligned in a fast way. The groups are aligned using profile consistency with predicted secondary structures. In the example below, seq8, seq1, seq6, seq5 and seq9 are representative sequences; seq8, seq10 and seq7 form a closely related group, and seq6 is a group by itself. The consensus predicted secondary structures are shown in the last line in each block. If the fraction of helix or strand predictions among representative sequences in a position is larger than 0.5, the consensus letter is "h" or "e", respectively.
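The consensus rule for secondary structure can be sketched as follows. This is an illustration under stated assumptions, not the PROMALS implementation: the function name is hypothetical, predictions are assumed to be given as aligned strings using "H" for helix and "E" for strand (any other character, including a gap, counts as neither), and the fraction is assumed to be taken over all representative sequences.

```python
def consensus_ss(predictions, threshold=0.5):
    """Build a consensus secondary structure line from aligned predictions.

    predictions: one string per representative sequence, all the same length,
    with 'H' for helix and 'E' for strand (assumed encoding).
    Returns a string with 'h', 'e' or a space at each alignment position.
    """
    if not predictions:
        return ""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same aligned length")

    count = len(predictions)
    consensus = []
    for column in zip(*predictions):
        helix_fraction = column.count("H") / count
        strand_fraction = column.count("E") / count
        # The fraction must be strictly larger than the threshold.
        if helix_fraction > threshold:
            consensus.append("h")
        elif strand_fraction > threshold:
            consensus.append("e")
        else:
            consensus.append(" ")
    return "".join(consensus)
```

With two representatives, a position predicted as helix in only one of them has a helix fraction of exactly 0.5 and therefore receives no consensus letter.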
Colored alignment example:

Conservation: 9669 6 6 9 9 99 9 7 6 9 696 66 6 6 67 99
seq8  ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKK
seq10 --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKK
seq7  ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKK
seq1  -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKK
seq4  ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE---------
seq3  -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-
seq2  EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKK
seq0  --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKK
seq6  --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKK
seq5  ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKK
seq9  ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
Consensus_ss: hhhhhhhhhhhhh eeeeeeeee eeeeee eeeeeee hhhhh

Conservation:
seq8  MEKLNNIFFTLM-----------------------------------
seq10 MEK--------------------------------------------
seq7  MEKLNNIFF--------------------------------------
seq1  IEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq4  ------MRLFGVQKDNFALEHSLL-----------------------
seq3  -----------------------------------------------
seq2  VEKLHGK----------------------------------------
seq0  IEKF-------------------------------------------
seq6  LEKLSSTLLRSI-----------------------------------
seq5  IEKLTTLLMR-------------------------------------
seq9  -----------------------------------------------
Consensus_ss: hhhhhhhhhhh

2. CLUSTAL format alignment
Each sequence and its name are on the same line, and the sequences can be partitioned into a number of blocks separated by empty lines. The first line may begin with the word "CLUSTAL" to indicate the format, but this header line is optional.
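A reader for this layout can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that names contain no whitespace and that each data line holds a name followed by one sequence segment. Lines that start with whitespace are skipped, because standard CLUSTAL files may carry a conservation symbol line under each block.

```python
def parse_clustal(text):
    """Parse a CLUSTAL-style alignment into a dict of name -> aligned sequence.

    Blocks are separated by empty lines; segments that share a name are
    concatenated in order of appearance. An initial line beginning with
    "CLUSTAL" is treated as an optional header.
    """
    sequences = {}
    seen_content = False
    for line in text.splitlines():
        if not line.strip():
            continue  # an empty line separates blocks
        if not seen_content:
            seen_content = True
            if line.startswith("CLUSTAL"):
                continue  # optional format header
        if line[0].isspace():
            continue  # conservation symbols, not a sequence line
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("expected a name and a sequence: " + repr(line))
        name, segment = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + segment
    return sequences
```

Because Python dictionaries preserve insertion order, the returned names appear in the order of the first block. After parsing, every sequence should have the same length, which is a cheap sanity check on the input.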
CLUSTAL format alignment example:

seq8  ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKK
seq1  -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKK
seq6  --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKK
seq5  ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKK
seq9  ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
seq10 --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKK
seq7  ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKK
seq4  ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE---------
seq3  -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-
seq2  EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKK
seq0  --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKK

seq8  MEKLNNIFFTLM-----------------------------------
seq1  IEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq6  LEKLSSTLLRSI-----------------------------------
seq5  IEKLTTLLMR-------------------------------------
seq9  -----------------------------------------------
seq10 MEK--------------------------------------------
seq7  MEKLNNIFF--------------------------------------
seq4  ------MRLFGVQKDNFALEHSLL-----------------------
seq3  -----------------------------------------------
seq2  VEKLHGK----------------------------------------
seq0  IEKF-------------------------------------------

References
1. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.
2. Pei, J., R. Sadreyev, and N.V. Grishin, PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 2003. 19(3): p. 427-8.
3. Pei, J. and N.V. Grishin, MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res, 2006.
4. Henikoff, S. and J.G. Henikoff, Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 1992. 89(22): p. 10915-10919.
5. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-3402.
6. Wu, C.H., et al., The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 2006. 34(Database issue): p. D187-91.
7. Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 1999. 292(2): p. 195-202.
8. Do, C.B., et al., ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 2005. 15(2): p. 330-40.