PROMALS3D (PROfile Multiple Alignment with predicted Local Structures and 3D constraints) is a tool for aligning multiple protein sequences and/or structures, with enhanced information from database searches, secondary structure prediction, 3D structures or user-defined constraints.
Overview of alignment process
PROMALS3D is a progressive method that clusters similar sequences for easy alignments and applies more elaborate techniques to align the relatively divergent clusters (see flowchart). In the first alignment stage, PROMALS3D aligns similar sequences using a scoring function of weighted sum-of-pairs of BLOSUM62 [1] scores. The first stage is fast and results in a number of pre-aligned groups that are relatively distant from each other. In the second alignment stage, one representative sequence is selected for each group. Each representative is subjected to PSI-BLAST [2] searches, which retrieve additional homologs from the UNIREF90 [3] database, and to PSIPRED [4] secondary structure prediction. Then a hidden Markov model of profile-profile alignments with predicted secondary structures [5] is applied to pairs of representatives to obtain posterior probabilities of residue matches. These probabilities serve as sequence-based constraints that are combined with constraints derived from homologs with 3D structures or user-defined alignment constraints to derive a probabilistic consistency scoring function. The representative sequences are progressively aligned using this consistency scoring function, and the pre-aligned groups obtained in the first stage are merged into the alignment of representatives to form the final multiple alignment of all sequences.
PROMALS3D input
Multiple protein sequences and/or structures can be input to PROMALS3D. The program generates a multiple sequence alignment for input sequences and sequence records in input structures. If you only want to align input structures, leave the sequence input area blank.
Input sequences should be in FASTA format. A sequence record in FASTA format consists of a single-line description (sequence name), followed by one or more lines of sequence data. The first character of the description line should be a greater-than (">") symbol. For the PROMALS3D server, sequences with ten or fewer amino acids are ignored, and the total number of input sequences and structures must be at least two.
FASTA format Example:
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
Any non-alphabetical character in the input sequences is ignored by PROMALS3D. Letters that are not abbreviations of the twenty standard amino acids ([BJOUXZbjouxz]) are treated as alanines in the alignment process, but are left unchanged in the final alignment.
A note for sequence names: Certain characters in sequence names are changed to "_", including space, tab, and *?'`"&|\/{}()[]$; (.(dot) and - are kept). For long sequence names, only the first 25 characters are kept.
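The input rules above can be sketched in Python as a pre-submission check. This is only an illustration of the documented behavior, not the server's own code; the function names (clean_name, parse_fasta, alignment_view) are hypothetical.

```python
import re

# Characters replaced by "_" in sequence names, per the note above.
_NAME_BAD = set(" \t*?'`\"&|\\/{}()[]$;")
# Non-standard letters that are aligned as alanine but printed unchanged.
_NONSTANDARD = set("BJOUXZbjouxz")


def clean_name(name, max_len=25):
    """Replace reserved characters with '_' and keep the first 25 characters."""
    cleaned = "".join("_" if ch in _NAME_BAD else ch for ch in name)
    return cleaned[:max_len]


def parse_fasta(text, min_len=11):
    """Return (name, sequence) pairs from FASTA text.

    Non-alphabetical characters are dropped and sequences with ten or
    fewer residues are skipped, as described for the server.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = clean_name(line[1:].strip()), []
        elif name is not None:
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    if name is not None:
        records.append((name, "".join(chunks)))
    return [(n, s) for n, s in records if len(s) >= min_len]


def alignment_view(seq):
    """Sequence as it would be scored: non-standard letters become alanine."""
    return "".join("A" if ch in _NONSTANDARD else ch for ch in seq)
```

Running parse_fasta on the example above would keep all eleven records, and the stray space inside seq1 would simply be dropped.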
Input structures: Uploaded structure files should be in PDB format. PDB ids can be specified instead of uploading structure files. Chain ids can also be specified for the input structure files or PDB ids. If a chain id is not specified, only the first chain is used.
User-defined constraint alignment: Users can also input constraint alignments in a FASTA-like format. Multiple FASTA-format constraint alignments in one file should be separated by lines beginning with the character '@'. The names and sequences of constraints should match those in the input sequences. The following example contains two constraint alignments.
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEK
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVE
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLK----YRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
@
>seq0
FQTWEEFSRAAEKLYL--ADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPM--KVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEK
>seq2
EEYQTWEEFARAAEKL--YLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVE
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYR---HCDG---NLCIKVTDNSVCLQYKTDQAQDVK
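A constraint file in this layout can be split and sanity-checked with a short Python sketch. The helper names are hypothetical, and the matching rule is an assumption: a constraint row is accepted here when its ungapped residues occur as a contiguous stretch of the input sequence with the same name, which may be more lenient or stricter than what the server actually enforces.

```python
def parse_constraints(text):
    """Split a constraint file into alignments; each is a dict name -> gapped row."""
    alignments, current, name = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # A line beginning with '@' closes the current alignment.
            if current:
                alignments.append(current)
            current, name = {}, None
        elif line.startswith(">"):
            name = line[1:].strip()
            current[name] = ""
        elif name is not None:
            current[name] += line
    if current:
        alignments.append(current)
    return alignments


def check_constraints(alignments, input_seqs):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for i, aln in enumerate(alignments, start=1):
        if len({len(row) for row in aln.values()}) > 1:
            problems.append(f"alignment {i}: rows differ in length")
        for name, row in aln.items():
            ungapped = row.replace("-", "")
            if name not in input_seqs:
                problems.append(f"alignment {i}: unknown sequence name {name}")
            elif ungapped not in input_seqs[name]:
                # Assumed rule: ungapped row must occur within the input sequence.
                problems.append(f"alignment {i}: {name} does not match input")
    return problems
```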
Input Email: PROMALS jobs can take a long time to complete for large sequence sets due to PSI-BLAST and PSIPRED runs, and consistency scoring function calculations. For example, the average CPU time for a data set with 50 sequences is about half an hour. It is thus highly recommended that an email address is provided so that the link to your result is sent to you when the alignment is finished.
Input job name: Assigning your job a short name can help you identify it. This name will appear in the subject line of the email sent to you.
PROMALS3D output
PROMALS3D web server provides result links to multiple alignments in the following three formats.
1. Colored alignment with secondary structure, conservation, and consensus sequence information
The first line in each block shows conservation indices for positions with a conservation index above 4. The last two lines show consensus amino acid sequence (Consensus_aa) and consensus predicted secondary structures (Consensus_ss). Representative sequences have magenta names and they are colored according to predicted secondary structures (red: alpha-helix, blue: beta-strand). If the sequences are in aligned order, the sequences with black names directly under a representative sequence are in the same pre-aligned group and are aligned in a fast way. The first and last residue numbers of each sequence in each alignment block are shown before and after the sequences respectively. Consensus predicted secondary structure symbols: alpha-helix: h; beta-strand: e. Consensus amino acid symbols are: conserved amino acids are in bold and uppercase letters; aliphatic (I, V, L): l; aromatic (Y, H, W, F): @; hydrophobic (W, F, Y, M, L, I, V, A, C, T, H): h; alcohol (S, T): o; polar residues (D, E, H, K, N, Q, R, S, T): p; tiny (A, G, C, S): t; small (A, G, C, S, V, N, D, T, P): s; bulky residues (E, F, I, K, L, M, Q, R, W, Y): b; positively charged (K, R, H): +; negatively charged (D, E): -; charged (D, E, K, R, H): c.
Colored alignment example:
Conservation: 9669 6 6 9 9 99 9 7 6 9 696 66 6 6 67 99
seq8            1 ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFT 66
seq10           1 --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK------- 61
seq7            1 ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF- 65
seq1            1 -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMR 69
seq4            1 ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MR 46
seq3            1 -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK----------- 58
seq2            1 EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK--- 67
seq0            1 --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF------ 62
seq6            1 --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLR 68
seq5            1 ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR 66
seq9            1 ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK---------- 57
Consensus_aa:     ....sWEEFs..t..L@.ssP..sRhshKY.HscG.LslKlTDs..Cl.@.s-.h.DhKK..........
Consensus_ss: hhhhhhhhhhhhh eeeeeeeee eeeeee eeeeeee hhhhhhhhhhhhhhh

Conservation:
seq8           67 LM----------------------------------- 68
seq10             -------------------------------------
seq7              -------------------------------------
seq1           70 LMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM 106
seq4           47 LFGVQKDNFALEHSLL--------------------- 62
seq3              -------------------------------------
seq2              -------------------------------------
seq0              -------------------------------------
seq6           69 SI----------------------------------- 70
seq5              -------------------------------------
seq9              -------------------------------------
Consensus_aa:     .....................................
Consensus_ss: h
2. CLUSTAL format alignment
Each sequence and its name are on the same line and the sequences can be partitioned into a number of blocks separated by empty lines. The word "CLUSTAL" indicating the format can begin in the first line, but such a first line is optional.
CLUSTAL format alignment example:
CLUSTAL format multiple sequence alignment by PROMALS3D

seq8    ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFT
seq10   --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK-------
seq7    ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF-
seq1    -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMR
seq4    ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MR
seq3    -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-----------
seq2    EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK---
seq0    --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF------
seq6    --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLR
seq5    ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
seq9    ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK----------

seq8    LM-----------------------------------
seq10   -------------------------------------
seq7    -------------------------------------
seq1    LMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq4    LFGVQKDNFALEHSLL---------------------
seq3    -------------------------------------
seq2    -------------------------------------
seq0    -------------------------------------
seq6    SI-----------------------------------
seq5    -------------------------------------
seq9    -------------------------------------
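For downstream processing, a CLUSTAL-format result as described above (name and sequence on the same line, blocks separated by empty lines, optional first line beginning with "CLUSTAL") can be stitched back into one gapped row per sequence. The following Python sketch uses a hypothetical function name and assumes that names contain no whitespace, which holds after the name cleaning described earlier.

```python
def parse_clustal(text):
    """Concatenate per-block rows into one gapped sequence per name."""
    rows = {}
    for line in text.splitlines():
        # Skip block separators and the optional format line.
        if not line.strip() or line.startswith("CLUSTAL"):
            continue
        # Skip lines that start with whitespace (no sequence name),
        # such as the match lines some other CLUSTAL writers emit.
        if line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        rows[name] = rows.get(name, "") + segment
    return rows
```

Because Python dictionaries preserve insertion order, the returned rows keep the sequence order of the first block.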
3. FASTA format
>seq8
----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM-----------------------------------
>seq10
--FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK--------------------------------------------
>seq7
----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF--------------------------------------
>seq1
-KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq4
------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MRLFGVQKDNFALEHSLL---------------------
>seq3
-MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK------------------------------------------------
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK----------------------------------------
>seq0
--FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF-------------------------------------------
>seq6
--FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI-----------------------------------
>seq5
----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR-------------------------------------
>seq9
---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK-----------------------------------------------
Alignment parameters are listed below
Identity threshold
The parameter "Identity threshold" is the sequence identity threshold that specifies the boundary between the fast-stage less accurate alignment process and the slow-stage more accurate alignment process.
To properly balance alignment speed and accuracy, we have applied a two-stage alignment strategy similar to the one used in our program PCMA. In the first stage, highly similar sequences are progressively aligned in a fast way without consistency scoring. The scoring function in this stage is a weighted sum-of-pairs measure of BLOSUM62 scores (Note: in the new version, the first-stage alignment is done by MAFFT or PROMALS; see this option below). If two groups neighboring on a tree have an average sequence identity higher than a certain threshold (default is 0.6), they are aligned in this fast way. The result of the first stage is a set of pre-aligned groups that are relatively divergent from each other. One representative sequence is selected from each pre-aligned group. In the second alignment stage, these representative sequences are subject to the more time-consuming probabilistic consistency measure, and are aligned progressively according to the consistency scoring function. Finally, the pre-aligned groups obtained in the first stage are merged according to the alignment of the representatives to obtain the alignment of all sequences.
If "Identity threshold" is equal to or larger than 1, all sequences are subject to consistency measure and the alignment process is the most time-consuming. If it is set to 0, all sequences are aligned in a fast way; in this case the alignment quality for divergent sequences is expected to be low since consistency-based scoring function is not used. The default value is set to 0.6 since for sequences with identity above 60% the fast stage can still produce good quality alignments, but alignment proceeds about 6 times faster than when all sequences are aligned using consistency measure.
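The role of the threshold can be illustrated with a small Python sketch. It is not the PROMALS3D implementation: it assumes the rows of both groups are already expressed in a common alignment frame (equal length, "-" for gaps), and it defines identity as the fraction of identical residues over columns where both rows have a residue. Both the identity definition and the function names are assumptions made for illustration.

```python
from itertools import product


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def average_identity(group_a, group_b):
    """Mean identity over all cross-group sequence pairs."""
    scores = [pairwise_identity(a, b) for a, b in product(group_a, group_b)]
    return sum(scores) / len(scores)


def use_fast_stage(group_a, group_b, threshold=0.6):
    """Decide whether two neighboring groups are merged in the fast stage."""
    if threshold >= 1:
        return False  # documented extreme: everything uses consistency scoring
    if threshold <= 0:
        return True   # documented extreme: everything is aligned the fast way
    return average_identity(group_a, group_b) > threshold
```

With the default of 0.6, two groups whose sequences are on average 75% identical would be merged in the fast stage, while a pair at 40% identity would be left for the consistency-based second stage.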
Weights for constraints
Four types of constraints can be used for constructing multiple sequence alignments by PROMALS3D.
- sequence-based constraints: these constraints are derived from profile-profile alignment with predicted secondary structures.
- constraints derived from homologs with 3D structures: for representative sequences, homologs with known 3D structures (abbreviated as homolog3d) are identified in the SCOP40 database. Alignment constraints can be derived from sequence-based representative-to-homolog3d alignments and from structure-based homolog3d-to-homolog3d alignments, which are made by the structure comparison programs DaliLite, FAST or TM-align.
- constraints derived from input 3D structures: structure alignments of the input structures are made by the structure comparison programs DaliLite, FAST or TM-align.
- user-defined constraints: users can input alignment(s) that are used as constraints.
Parameters for profile-profile alignment:
- Weights for amino acid scores and predicted secondary structure scores: These two weights determine the relative contribution of amino acid similarity and predicted secondary structure similarity to the profile-profile hidden Markov model used in PROMALS. These weights can be set to any positive values. Larger weight means larger contribution of this score to the total alignment score. The default and recommended values are 0.8 for amino acid scores and 0.2 for secondary structure scores. These values were optimized on SCOP domain pairs with less than 20% sequence identity.
Parameters for deriving sequence profiles from PSI-BLAST searches
PSI-BLAST is run for each representative sequence against the UNIREF90 database and the PSI-BLAST alignment is processed so that divergent homologs to the query are removed and a limited number of the remaining homologs are kept to save time for building an amino acid profile for the query. The following parameters are provided:
- PSI-BLAST iteration number: maximum number of iterations done by PSI-BLAST.
- PSI-BLAST e-value inclusion threshold: only hits with an e-value less than this threshold will be included in the next iteration.
- Identity cutoff below which distant homologs are removed: any PSI-BLAST hit with a sequence identity to the query less than this cutoff will be removed. Divergent homologs could negatively affect sequence profile quality. (default: 0.2, corresponding to 20% sequence identity)
- Maximum number of homologs kept for each blast run: after removing divergent hits, select up to this number of homologs for constructing an amino acid sequence profile for the query. (default: 300)
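The last two parameters amount to a filter followed by a cap, which can be sketched in Python as follows. The function name and the (hit id, identity) input layout are hypothetical, and keeping the first hits in PSI-BLAST order is an assumption, since the text does not say how the retained homologs are chosen.

```python
def select_homologs(hits, identity_cutoff=0.2, max_homologs=300):
    """Filter PSI-BLAST hits for profile construction.

    hits: iterable of (hit_id, identity_to_query) pairs in PSI-BLAST order,
    with identity expressed as a fraction between 0 and 1.
    """
    # Remove divergent hits whose identity to the query is below the cutoff.
    kept = [(hid, ident) for hid, ident in hits if ident >= identity_cutoff]
    # Keep at most max_homologs of the remaining hits (assumed: the first ones).
    return kept[:max_homologs]
```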
Parameters for finding and using homologs with 3D structures
Homologs with 3D structures can be identified by running PSI-BLAST of representative sequences against a sequence database consisting of SCOP40 domains with known structures. The structure-based sequence alignments can be used to derive 3D pairwise constraints between representatives. These constraints can be mixed with constraints from comparing amino acid profiles and predicted secondary structures to make multiple sequence alignments. The following parameters are provided:
- PSI-BLAST e-value cutoff against structural database: when running PSI-BLAST using the checkpoint file of the representative sequences against the structural database of SCOP40 [6] domains, ignore those structure hits with e-value larger than this cutoff.
- Identity cutoff below which 3D structure templates are not used: when running PSI-BLAST using the checkpoint file of the representative sequences against the structural database of SCOP40 domains, ignore those structure hits with sequence identity less than this cutoff.
- Align homologs with 3D structures by programs: the alignments between the homologs with structures can be made by any of 3 structure comparison programs (DaliLite [7], FAST [8] and TM-align [9]). More than one program can be used.
- Weight of structure alignment constraint: this weight determines the relative contribution of structure constraints as compared to the combined scores of amino acids and predicted secondary structures.
Parameters for aligning input structures
- Align input structures by programs: the structure-based pairwise alignments between input structures can be made by any of 3 structure comparison programs (DaliLite [7], FAST [8] and TM-align [9]). More than one program can be used.
Parameters for alignment output (for output result page)
- sequence order: two options are provided. "input" means the sequences in the alignment have the same order as input sequences. "aligned" means the sequence order in the alignment is determined by the alignment process such that sequences within the same pre-aligned group (closely related sequences) will be neighbors.
- Alignment block size: number of residues in each alignment block of the output CLUSTAL format and colored alignments. (default: 70; if this value is set below 10, which is too narrow, the default value is used). If "show each sequence in one line" is selected, each sequence will occupy one line (same effect as setting the block size to a number equal to or larger than the alignment length).
- Show starting and ending residue numbers: in a colored alignment, show starting and ending residue numbers for subsequences in each alignment block. (default: checked). Unchecking this box will not display these numbers.
- Show conservation indices equal to or above: in a colored alignment, show conservation index values (integer values between 0 and 9, with 9 corresponding to highest conservation) for positions with a conservation index equal to or above this value. (default: 5)
- Consensus level: for the consensus amino acid sequence, show consensus symbol if the weighted frequency of a certain class of residues in a position is above this number. Consensus symbols: conserved amino acids are in bold and uppercase letters; aliphatic (I, V, L): l; aromatic (Y, H, W, F): @; hydrophobic (W, F, Y, M, L, I, V, A, C, T, H): h; alcohol (S, T): o; polar residues (D, E, H, K, N, Q, R, S, T): p; tiny (A, G, C, S): t; small (A, G, C, S, V, N, D, T, P): s; bulky residues (E, F, I, K, L, M, Q, R, W, Y): b; positively charged (K, R, H): +; negatively charged (D, E): -; charged (D, E, K, R, H): c. (default: 0.8)
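How a consensus symbol might be chosen for one alignment column can be sketched in Python. The residue classes are copied from the list above; everything else is an assumption made for illustration: frequencies are unweighted (the server uses weighted frequencies), gaps count toward the column total, a single dominant residue is reported in uppercase, the smallest qualifying class wins when several qualify, and "." marks columns without consensus.

```python
from collections import Counter

# (symbol, member residues), as listed for the "Consensus level" option.
CLASSES = [
    ("l", "IVL"), ("@", "YHWF"), ("h", "WFYMLIVACTH"), ("o", "ST"),
    ("p", "DEHKNQRST"), ("t", "AGCS"), ("s", "AGCSVNDTP"),
    ("b", "EFIKLMQRWY"), ("+", "KRH"), ("-", "DE"), ("c", "DEKRH"),
]


def consensus_symbol(column, level=0.8):
    """Return a consensus symbol for one column given as a string of residues."""
    total = len(column)
    if total == 0:
        return "."
    counts = Counter(ch.upper() for ch in column if ch != "-")
    # A single conserved residue is reported as an uppercase letter.
    for residue, n in counts.most_common(1):
        if n / total > level:
            return residue
    # Otherwise try residue classes, most specific (smallest) first.
    for symbol, members in sorted(CLASSES, key=lambda c: len(c[1])):
        if sum(counts[r] for r in members) / total > level:
            return symbol
    return "."
```

For instance, a column of I, V, L, L, I has no single dominant residue but falls entirely within the aliphatic class, so this sketch reports "l".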
Parameters for aligning sequences within groups in the first alignment stage
- Method for aligning sequences within groups in the first alignment stage: two options are provided.
- "MAFFT": use MAFFT program [10] to align similar sequences in a fast way. (default)
- "PROMALS": use PROMALS to align similar sequences in a slow but possibly more accurate way. This may give more accurate results, especially when the number of sequences is large resulting in large sequence groups in the first alignment stage. (see Identity threshold option for information about two-stage alignment strategy)
References
1. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89:10915-10919.
2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
3. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34:D187-191.
4. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292:195-202.
5. Pei J, Grishin NV: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 2007, 23:802-808.
6. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32:D189-192.
7. Holm L, Sander C: Mapping the protein universe. Science 1996, 273:595-603.
8. Zhu J, Weng Z: FAST: a novel protein structure alignment algorithm. Proteins 2005, 58:618-627.
9. Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33:2302-2309.
10. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30:3059-3066.