PROMALS3D (PROfile Multiple Alignment with predicted Local Structures and 3D constraints) is a tool for aligning multiple protein sequences and/or structures, with enhanced information from database searches, secondary structure prediction, 3D structures or user-defined constraints.

Overview of alignment process

PROMALS3D is a progressive method that clusters similar sequences for easy alignments and applies more elaborate techniques to align the relatively divergent clusters (see flowchart). In the first alignment stage, PROMALS3D aligns similar sequences using a scoring function based on the weighted sum-of-pairs of BLOSUM62 [1] scores. The first stage is fast and results in a number of pre-aligned groups that are relatively distant from each other. In the second alignment stage, one representative sequence is selected from each group, and the representatives are subjected to PSI-BLAST [2] searches, which retrieve additional homologs from the UNIREF90 [3] database, and to PSIPRED [4] secondary structure prediction. Then a hidden Markov model of profile-profile alignments with predicted secondary structures [5] is applied to pairs of representatives to obtain posterior probabilities of residue matches. These probabilities serve as sequence-based constraints that are combined with constraints derived from homologs with 3D structures or from user-defined alignments to derive a probabilistic consistency scoring function. The representative sequences are progressively aligned using this consistency scoring function, and the pre-aligned groups obtained in the first stage are merged into the alignment of representatives to form the final multiple alignment of all sequences.
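
To make the idea of a consistency scoring function more concrete, the sketch below shows a generic probabilistic consistency transformation of the kind used by consistency-based aligners: the match probabilities for a pair of sequences are re-estimated by passing through every sequence in the set. This is an illustration under stated assumptions, not PROMALS3D's actual implementation. The function names, the equal weighting of all sequences, and the toy numbers are hypothetical, and the way PROMALS3D mixes sequence-based, structure-based and user-defined constraints is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(post, lengths):
    """Re-estimate each pairwise match matrix through every sequence z.

    post[(x, y)] is a len(x) by len(y) matrix of match probabilities and
    must be given for every unordered pair of sequences.
    lengths maps each sequence name to its length.
    """
    names = sorted(lengths)

    def matrix(x, y):
        if x == y:  # a sequence matches itself position by position
            n = lengths[x]
            return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        if (x, y) in post:
            return post[(x, y)]
        return [list(col) for col in zip(*post[(y, x)])]  # transpose

    updated = {}
    for x, y in post:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            product = matmul(matrix(x, z), matrix(z, y))
            for i, row in enumerate(product):
                for j, value in enumerate(row):
                    total[i][j] += value
        # Assumption: every sequence z contributes with equal weight.
        updated[(x, y)] = [[v / len(names) for v in row] for row in total]
    return updated


# Toy data: three one-residue sequences with made-up probabilities.
lengths = {"a": 1, "b": 1, "c": 1}
post = {("a", "b"): [[0.2]], ("a", "c"): [[0.9]], ("b", "c"): [[0.9]]}
updated = consistency_transform(post, lengths)
print(round(updated[("a", "b")][0][0], 3))  # 0.403
```

In the toy data, the weak direct match between a and b (0.2) is raised to about 0.4 because both residues match the same residue of c with high probability.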

PROMALS3D input

Multiple protein sequences and/or structures can be input to PROMALS3D. The program generates a multiple sequence alignment for input sequences and sequence records in input structures. If you only want to align input structures, leave the sequence input area blank.

Input sequences should be in FASTA format. A sequence record in FASTA format consists of a single-line description (sequence name), followed by one or more lines of sequence data. The first character of the description line should be a greater-than (">") symbol. On the PROMALS3D server, sequences with ten or fewer amino acids are ignored, and the total number of input sequences and structures should be larger than one.

FASTA format Example:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
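
If you prepare input programmatically, a minimal sketch like the following can be used to read FASTA records and drop sequences that the server would ignore. The function name parse_fasta and the min_length parameter are hypothetical. The sketch only mirrors the rules stated above and is not the server's own parser.

```python
def parse_fasta(text, min_length=11):
    """Parse FASTA text into a list of (name, sequence) pairs.

    Records shorter than min_length are dropped, mirroring the rule that
    sequences with ten or fewer amino acids are ignored.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return [(n, s) for n, s in records if len(s) >= min_length]


example = """>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>short
ACDEFGHIKL
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
"""
records = parse_fasta(example)
print([name for name, _ in records])  # ['seq0', 'seq3']
```

The made-up record "short" has exactly ten residues and is therefore dropped.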

Any non-alphabetical character in the input sequences is ignored by PROMALS3D. Letters that are not one-letter codes of the twenty standard amino acids ([BJOUXZbjouxz]) are treated as alanines during the alignment process, but are left unchanged in the final alignment.
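
The two rules above can be sketched as follows. The function names are hypothetical, and preserving letter case in the scored view is an assumption made for illustration only.

```python
import re

NONSTANDARD = "BJOUXZ"


def clean_sequence(raw):
    """Drop every non-alphabetical character (spaces, digits, '*', ...)."""
    return re.sub(r"[^A-Za-z]", "", raw)


def alignment_view(seq):
    """Return the sequence as it would be scored: non-standard letters
    are replaced by alanine (assumption: case is preserved)."""
    table = str.maketrans(NONSTANDARD + NONSTANDARD.lower(),
                          "A" * len(NONSTANDARD) + "a" * len(NONSTANDARD))
    return seq.translate(table)


raw = "MKX LB*z 12QU"
kept = clean_sequence(raw)     # letters that appear in the final alignment
scored = alignment_view(kept)  # letters used while computing the alignment
print(kept, scored)            # MKXLBzQU MKALAaQA
```

The cleaned sequence is what appears in the output, while the alanine-substituted view is used only for scoring.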

A note on sequence names: Certain characters in sequence names are changed to "_", including space, tab, and *?'`"&|\/{}()[]$; (the dot "." and the hyphen "-" are kept). For long sequence names, only the first 25 characters are kept.
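
To predict how a name will appear in the output, the renaming rule can be approximated as below. The function name sanitize_name is hypothetical, and the sketch assumes the whole description line is treated as the name.

```python
# Characters that are replaced by "_" according to the rule above.
REPLACED = set(" \t*?'`\"&|\\/{}()[]$;")


def sanitize_name(name, max_length=25):
    """Replace the listed characters with '_' and keep the first 25 characters."""
    cleaned = "".join("_" if ch in REPLACED else ch for ch in name)
    return cleaned[:max_length]


print(sanitize_name("my seq(1).v-2"))  # my_seq_1_.v-2
```

Because each character is replaced by exactly one underscore, truncating before or after the replacement gives the same result.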

Input structures: Uploaded structure files should be in PDB format. PDB ids can be specified instead of uploading structure files. Chain ids can also be specified for the input structure files or PDB ids. If a chain id is not specified, only the first chain is used.

User-defined constraint alignment: Users can also input constraint alignments in a FASTA-like format. Multiple FASTA-format constraint alignments in one file should be separated by lines beginning with the character '@'. The names and sequences of constraints should match those in the input sequences. The following example contains two constraint alignments.

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEK
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVE
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLK----YRHCDGNLCIKVTDNSVCLQYKTDQAQDVK

@
>seq0
FQTWEEFSRAAEKLYL--ADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPM--KVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEK
>seq2
EEYQTWEEFARAAEKL--YLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVE
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYR---HCDG---NLCIKVTDNSVCLQYKTDQAQDVK
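
Before submitting a constraint file, it can be checked with a small script such as the sketch below. The function names are hypothetical. Since some rows in the example above cover only part of the corresponding input sequence, the check assumes that the ungapped row must occur within the input sequence; the server's actual matching rule may be stricter or looser.

```python
def parse_constraints(text):
    """Split a constraint file into blocks at lines starting with '@'.

    Each block is a dict mapping sequence name to its gapped row.
    """
    blocks, current, name = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            if current:
                blocks.append(current)
            current, name = {}, None
        elif line.startswith(">"):
            name = line[1:].strip()
            current[name] = ""
        elif name is not None:
            current[name] += line
    if current:
        blocks.append(current)
    return blocks


def check_block(block, inputs):
    """Return a list of problems found in one constraint alignment."""
    problems = []
    for name, row in block.items():
        if name not in inputs:
            problems.append(f"{name}: not among input sequences")
        elif row.replace("-", "") not in inputs[name]:
            # Assumption: a row may cover only part of the input sequence.
            problems.append(f"{name}: residues differ from input sequence")
    if len({len(row) for row in block.values()}) > 1:
        problems.append("rows have unequal lengths")
    return problems


# Made-up toy data, not taken from the example above.
inputs = {"a": "ACDEF", "b": "ACEGF"}
text = ">a\nACDE-F\n>b\nAC-EGF\n@\n>a\nACDEF-\n>b\n-ACEGF\n"
blocks = parse_constraints(text)
print(len(blocks), [check_block(block, inputs) for block in blocks])  # 2 [[], []]
```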

Input Email: PROMALS jobs can take a long time to complete for large sequence sets due to PSI-BLAST and PSIPRED runs and the calculation of the consistency scoring function. For example, the average CPU time for a data set with 50 sequences is about half an hour. It is thus highly recommended that you provide an email address so that a link to your result can be sent to you when the alignment is finished.

Input job name: Assigning your sequences a short name can help identify your alignment job. This name will appear in the subject line of the email sent to you.

PROMALS3D output

The PROMALS3D web server provides links to the resulting multiple alignment in the following three formats.

1. Colored alignment with secondary structure, conservation, and consensus sequence information

The first line in each block shows conservation indices for positions with a conservation index above 4. The last two lines show the consensus amino acid sequence (Consensus_aa) and the consensus predicted secondary structure (Consensus_ss). Representative sequences have magenta names and are colored according to predicted secondary structure (red: alpha-helix, blue: beta-strand). If the sequences are in aligned order, the sequences with black names directly under a representative sequence belong to the same pre-aligned group and were aligned in the fast first stage. The first and last residue numbers of each sequence in each alignment block are shown before and after the sequence, respectively.

Consensus predicted secondary structure symbols: alpha-helix: h; beta-strand: e.

Consensus amino acid symbols: conserved amino acids are shown as bold uppercase letters; aliphatic (I, V, L): l; aromatic (Y, H, W, F): @; hydrophobic (W, F, Y, M, L, I, V, A, C, T, H): h; alcohol (S, T): o; polar residues (D, E, H, K, N, Q, R, S, T): p; tiny (A, G, C, S): t; small (A, G, C, S, V, N, D, T, P): s; bulky residues (E, F, I, K, L, M, Q, R, W, Y): b; positively charged (K, R, H): +; negatively charged (D, E): -; charged (D, E, K, R, H): c.

Colored alignment example:

Conservation:            9669    6 6    9   9   99 9  7 6  9 696  66 6 6   67 99          
seq8             1  ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFT   66
seq10            1  --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK-------   61
seq7             1  ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF-   65
seq1             1  -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMR   69
seq4             1  ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MR   46
seq3             1  -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-----------   58
seq2             1  EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK---   67
seq0             1  --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF------   62
seq6             1  --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLR   68
seq5             1  ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR   66
seq9             1  ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK----------   57
Consensus_aa:       ....sWEEFs..t..L@.ssP..sRhshKY.HscG.LslKlTDs..Cl.@.s-.h.DhKK..........
Consensus_ss:            hhhhhhhhhhhhh    eeeeeeeee    eeeeee    eeeeeee   hhhhhhhhhhhhhhh


Conservation:                                            
seq8            67  LM-----------------------------------   68
seq10               -------------------------------------     
seq7                -------------------------------------     
seq1            70  LMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM  106
seq4            47  LFGVQKDNFALEHSLL---------------------   62
seq3                -------------------------------------     
seq2                -------------------------------------     
seq0                -------------------------------------     
seq6            69  SI-----------------------------------   70
seq5                -------------------------------------     
seq9                -------------------------------------     
Consensus_aa:       .....................................
Consensus_ss:       h                                    
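
When post-processing the Consensus_aa line, the symbol legend can be encoded as a lookup table. The sketch below lists, for one alignment column, every residue class that contains all residues in that column. The function name is hypothetical, and the rule PROMALS3D actually uses to pick a single consensus symbol (thresholds, priority among classes) is not described in this document and is not reproduced here.

```python
# Residue classes and their consensus symbols, copied from the legend.
CLASSES = {
    "l": set("IVL"),          # aliphatic
    "@": set("YHWF"),         # aromatic
    "h": set("WFYMLIVACTH"),  # hydrophobic
    "o": set("ST"),           # alcohol
    "p": set("DEHKNQRST"),    # polar
    "t": set("AGCS"),         # tiny
    "s": set("AGCSVNDTP"),    # small
    "b": set("EFIKLMQRWY"),   # bulky
    "+": set("KRH"),          # positively charged
    "-": set("DE"),           # negatively charged
    "c": set("DEKRH"),        # charged
}


def classes_covering(column):
    """Return the sorted symbols of all classes containing every residue
    in the column; gap characters are skipped."""
    residues = {ch.upper() for ch in column if ch.isalpha()}
    if not residues:
        return []
    return sorted(sym for sym, members in CLASSES.items() if residues <= members)


print(classes_covering("ILV"))  # ['h', 'l']
print(classes_covering("D-E"))  # ['-', 'c', 'p']
```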

2. CLUSTAL format alignment

Each sequence and its name are on the same line and the sequences can be partitioned into a number of blocks separated by empty lines. The word "CLUSTAL" indicating the format can begin in the first line, but such a first line is optional.

CLUSTAL format alignment example:

CLUSTAL format multiple sequence alignment by PROMALS3D


seq8    ----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFT
seq10   --FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK-------
seq7    ----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF-
seq1    -KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMR
seq4    ------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MR
seq3    -MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK-----------
seq2    EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK---
seq0    --FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF------
seq6    --FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLR
seq5    ----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
seq9    ---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK----------


seq8    LM-----------------------------------
seq10   -------------------------------------
seq7    -------------------------------------
seq1    LMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq4    LFGVQKDNFALEHSLL---------------------
seq3    -------------------------------------
seq2    -------------------------------------
seq0    -------------------------------------
seq6    SI-----------------------------------
seq5    -------------------------------------
seq9    -------------------------------------
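
A CLUSTAL-format result can be read back into full-length aligned rows by concatenating the segments of each sequence across blocks. The following is a minimal sketch with hypothetical function names. It assumes names contain no whitespace and skips lines that start with whitespace, since such lines carry annotation rather than sequence data in CLUSTAL files.

```python
def parse_clustal(text):
    """Return a dict mapping sequence name to its full aligned row."""
    rows = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("CLUSTAL"):
            continue  # blank separator or optional header line
        if line[0].isspace():
            continue  # annotation line without a sequence name
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]
        rows[name] = rows.get(name, "") + segment
    return rows


def to_fasta(rows):
    """Write aligned rows as FASTA text, one row per record."""
    return "".join(f">{name}\n{row}\n" for name, row in rows.items())


# Shortened toy alignment in the same layout as the example above.
text = """CLUSTAL format multiple sequence alignment by PROMALS3D


seq8    --SWDE
seq10   FDSWDE


seq8    LM-
seq10   ---
"""
rows = parse_clustal(text)
print(to_fasta(rows), end="")
```

Python dictionaries keep insertion order, so the sequences are written in the order in which they first appear in the alignment.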

3. FASTA format

>seq8
----SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM-----------------------------------
>seq10
--FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK--------------------------------------------
>seq7
----SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF--------------------------------------
>seq1
-KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq4
------EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSV-VSYE-----------------MRLFGVQKDNFALEHSLL---------------------
>seq3
-MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK------------------------------------------------
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK----------------------------------------
>seq0
--FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF-------------------------------------------
>seq6
--FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI-----------------------------------
>seq5
----SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR-------------------------------------
>seq9
---KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK-----------------------------------------------
                                

Alignment parameters are listed below

Identity threshold

The parameter "Identity threshold" is the sequence identity threshold that marks the boundary between the fast but less accurate first-stage alignment process and the slow but more accurate second-stage alignment process.

To properly balance alignment speed and accuracy, we have applied a two-stage alignment strategy similar to the one used in our program PCMA. In the first stage, highly similar sequences are progressively aligned in a fast way without consistency scoring. The scoring function in this stage is a weighted sum-of-pairs measure of BLOSUM62 scores (Note: in the new version, the first-stage alignment is done by MAFFT or PROMALS; see the corresponding option below). If two groups neighboring on a tree have an average sequence identity higher than a certain threshold (default is 0.6), they are aligned in this fast way. The result of the first stage is a set of pre-aligned groups that are relatively divergent from each other. One representative sequence is selected from each pre-aligned group. In the second alignment stage, these representative sequences are subject to the more time-consuming probabilistic consistency measure, and are aligned progressively according to the consistency scoring function. Finally, the pre-aligned groups obtained in the first stage are merged according to the alignment of the representatives to obtain the alignment of all sequences.

If "Identity threshold" is equal to or larger than 1, all sequences are subject to the consistency measure and the alignment process is the most time-consuming. If it is set to 0, all sequences are aligned in the fast way; in this case the alignment quality for divergent sequences is expected to be low since the consistency-based scoring function is not used. The default value is 0.6 because for sequences with identity above 60% the fast stage can still produce good-quality alignments, while the alignment proceeds about 6 times faster than when all sequences are aligned using the consistency measure.
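
The role of the threshold can be illustrated with a small sketch that computes the average identity between two pre-aligned groups and decides which stage would handle them. All function names are hypothetical, and the definition of identity used here (identical pairs divided by positions where neither sequence has a gap) is an assumption, since the exact formula is not given in this document.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over positions aligned in both rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def average_identity(group_a, group_b):
    """Mean identity over all pairs with one row from each group."""
    values = [pairwise_identity(a, b) for a in group_a for b in group_b]
    return sum(values) / len(values)


def choose_stage(group_a, group_b, threshold=0.6):
    """Return 'fast' if the groups would be merged in the first stage."""
    if threshold <= 0:
        return "fast"  # a threshold of 0 sends every pair to the fast stage
    identity = average_identity(group_a, group_b)
    # Identity never exceeds 1, so a threshold of 1 or more always
    # selects the consistency stage.
    return "fast" if identity > threshold else "consistency"


print(choose_stage(["ACDE"], ["ACDF"]))                 # fast (identity 0.75)
print(choose_stage(["ACDE"], ["ACDF"], threshold=1.0))  # consistency
```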

Weights for constraints

Four types of constraints can be used for constructing multiple sequence alignments by PROMALS3D.

Parameters for profile-profile alignment:

Parameters for deriving sequence profiles from PSI-BLAST searches

PSI-BLAST is run for each representative sequence against the UNIREF90 database and the PSI-BLAST alignment is processed so that divergent homologs to the query are removed and a limited number of the remaining homologs are kept to save time for building an amino acid profile for the query. The following parameters are provided:

Parameters for finding and using homologs with 3D structures

Homologs with 3D structures can be identified by running PSI-BLAST of representative sequences against a sequence database consisting of SCOP40 domains with known structures. The structure-based sequence alignments can be used to derive 3D pairwise constraints between representatives. These constraints can be mixed with constraints from comparing amino acid profiles and predicted secondary structures to make multiple sequence alignments. The following parameters are provided:

Parameters for aligning input structures

Parameters for alignment output (for output result page)

Parameters for aligning sequences within groups in the first alignment stage

References

1. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89:10915-10919.

2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

3. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34:D187-191.

4. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292:195-202.

5. Pei J, Grishin NV: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 2007, 23:802-808.

6. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32:D189-192.

7. Holm L, Sander C: Mapping the protein universe. Science 1996, 273:595-603.

8. Zhu J, Weng Z: FAST: a novel protein structure alignment algorithm. Proteins 2005, 58:618-627.

9. Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33:2302-2309.

10. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30:3059-3066.