ProCAIn documentation

ProCAIn is a method for the comparison of two multiple alignments of protein sequences. Sadreyev and Grishin (2009) Nucleic Acids Research.

This program runs a search with a submitted alignment as a query against a numerical database of profiles.

DATABASE

Database of numerical profiles representing a set of multiple sequence alignments. The search detects similarities between the query and the alignments in this set. The following databases are available:

PFAM (http://pfam.wustl.edu/) : full sequence alignments for PFAM families (from distributed file Pfam-A.full)
COG (http://www.ncbi.nlm.nih.gov/COG/) : alignments of all members in each COG (produced by MUSCLE : http://www.drive5.com/muscle/)
KOG (http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi) : alignments of all members in each KOG (produced by MUSCLE : http://www.drive5.com/muscle/)
SCOP40 : PSI-BLAST alignments produced with sequences of SCOP representatives in ASTRAL (<40% identity) as queries (http://astral.berkeley.edu/)

QUERY

Alignment of multiple protein sequences, or a single sequence, submitted by the user.

Possible formats for alignments: FASTA, ClustalW, Stockholm, or "simple" format:

name start(optional) sequence_segment end(optional)
name start(optional) sequence_segment end(optional)
...

Possible formats for a single sequence: FASTA; or simply single or multiple lines, with or without a name:

name(optional) sequence_segment
name(optional) sequence_segment
...

Example:

K-KQWdRKRFVG-.VELKQKTLGIVGLGRIGAEVAARAKGQRM.
NVIAYDPFFTEE.KAEQMGVQ-YGTLEDVLRAGDFITVHTPLLK
ETKHLINKDAFDLMKDGVQIVNCARGGIIDEDAL

Single-letter notation is used for the 20 amino acid types, with two symbols for gaps (- or .). Special letters of the amino acid alphabet (BJUZXbjuzx) are treated as alanine in the search. Non-standard symbols are treated as gaps.

OPTIONS

1.
Input processing options

Run PSI-BLAST on query
To broaden the set of sequences representing the query family, the input MSA (or single sequence) can be used to initiate a PSI-BLAST search on the UNIREF database. The sequences detected by PSI-BLAST are filtered by user-specified E-value, coverage of the query length, and sequence identity to the query. The resulting PSI-BLAST alignment is submitted to COMPASS as a query.
Note: By default, this PSI-BLAST search is performed only if the query is a single sequence. If a multiple sequence alignment is submitted, COMPASS will use it 'as is', unless the user specifically requests the initial PSI-BLAST search.

Run PSI-BLAST if input is a sequence
Prior to the COMPASS search, runs PSI-BLAST to generate an alignment of homologs if the submitted query is a single sequence. The resulting alignment is then used as a query for COMPASS.
Default = Yes

Run PSI-BLAST if input is an alignment
Prior to the COMPASS search, runs PSI-BLAST to generate a potentially more representative alignment of homologs if the submitted query is a multiple sequence alignment. Starting with the submitted alignment as a query, PSI-BLAST generates a new alignment of detected sequence homologs. This new alignment is then used as a query for the COMPASS search.
Default = No

Maximal number of PSI-BLAST iterations
Maximal number of PSI-BLAST iterations performed on the query.

PSI-BLAST E-value
Maximal E-value for sequence homologs to be included in the PSI-BLAST alignment used as a query for the COMPASS search.

PSI-BLAST hit coverage
Minimal percentage of the query length covered by a PSI-BLAST hit required for the hit to be included in the PSI-BLAST alignment that is used for the COMPASS search.

Sequence identity of PSI-BLAST hit
Minimal percent sequence identity between a PSI-BLAST hit and the query required for the hit to be included in the alignment that is used for the COMPASS search.
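The three hit filters described above (E-value, coverage, identity) can be pictured as a single pass over the PSI-BLAST hits. The following is only a sketch: the `Hit` record, the `filter_hits` function, the example threshold values, and the choice of inclusive comparisons are all assumptions made for illustration, not the server's actual code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST hit (not the server's data model)."""
    name: str
    evalue: float
    coverage_pct: float   # percent of the query length covered by the hit
    identity_pct: float   # percent sequence identity to the query


def filter_hits(hits, max_evalue, min_coverage_pct, min_identity_pct):
    """Keep hits that pass all three user-specified thresholds.

    Inclusive comparisons are an assumption; the documentation only says
    'maximal' E-value and 'minimal' coverage and identity.
    """
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.coverage_pct >= min_coverage_pct
        and h.identity_pct >= min_identity_pct
    ]


# Example with made-up hits and made-up thresholds.
hits = [
    Hit("seqA", 1e-20, 95.0, 60.0),
    Hit("seqB", 0.5, 90.0, 55.0),     # fails the E-value threshold
    Hit("seqC", 1e-10, 30.0, 70.0),   # fails the coverage threshold
    Hit("seqD", 1e-8, 80.0, 10.0),    # fails the identity threshold
]
kept = filter_hits(hits, max_evalue=0.001, min_coverage_pct=50.0, min_identity_pct=20.0)
print([h.name for h in kept])
```

Only hits that satisfy all three conditions would contribute sequences to the alignment passed on as the COMPASS query.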
PSI-BLAST composition-based statistics
In PSI-BLAST, amino acid substitution matrices may be adjusted to compensate for biases in amino acid composition of the compared sequences. "Composition-based statistics" is the procedure of scaling all substitution scores by an analytically determined constant, while leaving the gap scores fixed.

PSI-BLAST low complexity filtering
Masks off segments of the PSI-BLAST query that have low compositional complexity (e.g., common acidic-, basic- or proline-rich regions) detected by the SEG method.

Gap fraction threshold (0.0 to 1.0)
Upper threshold of effective gap content in alignment columns used in the construction of the profile-profile alignment. If a column contains too many gaps, it is disregarded in the process of profile comparison. Such columns are shown in the output as lower-case letters (residues) and dots (gaps).
Default = 0.5

2. Search options

Gap opening penalty (integer)
Score penalty for opening a new gap during the construction of the profile-profile alignment.
Default = 10

Gap extension penalty (integer)
Score penalty for the extension of an existing gap during the construction of the profile-profile alignment.
Default = 1

Effective length of the database (integer)
Total effective length of profiles in the database (including only columns with gap content lower than the gap fraction threshold). This parameter is used for E-value calculation.
Default = effective length of chosen database.

Matrix
Residue substitution matrix. Used to derive background residue frequencies and to generate profiles from alignments (incorporated in pseudocount calculation). Specifying a matrix other than the default might require adjusting gap penalties.
Default = BLOSUM62

3.
Formatting options

Expect (real)
Maximal E-value threshold for a hit to be displayed.
Default = 0.1

Significance threshold (real)
Maximal E-value threshold for a hit to be considered reliable.
Default = 0.001

Display up to (integer)
Maximal number of hits (sorted by E-value) to be displayed.
Default = 100

Top sequences to show
Number of top sequences from the compared alignments to show in the profile-profile alignment. If the original multiple sequence alignment contains fewer sequences, all available sequences are shown. The names of the bottom displayed sequence in the query and the top displayed sequence in the subject are clickable links to the corresponding full alignments used in the search.
The user can also specify zero top sequences displayed for both profiles. In this case, if consensus sequences are requested, only the consensus is displayed. If consensus sequences are not requested, only alignment headers are displayed.

Show consensus sequences
If this option is chosen, the consensus sequences are shown for both aligned profiles, in addition to the specified number of top sequences in each. At alignment positions with low gap content (effective gap frequency < 0.5), consensus sequences contain the residue type with maximal target frequency at this position. At positions with higher gap content, consensus sequences contain gaps.
Consensus of the query profile (CONSENSUS_1) is shown as the bottom sequence for the query. Consensus of the subject profile (CONSENSUS_2) is shown as the top sequence for the subject. Identical matches between the two consensus sequences are marked by letters.
Note: Position numbering for consensus sequences is by alignment positions, NOT by sequence residues.

Hide or show positions with gaps in all displayed sequences
In the subalignments used to display the profile-profile alignment, some positions may contain only gaps. If two such positions are aligned, they can be removed from the output to make it more readable.
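As a rough illustration of hiding positions that contain only gaps in all displayed sequences, the sketch below drops every display column in which both the query rows and the subject rows hold nothing but gap symbols. The function name `drop_all_gap_columns` and the representation of displayed rows as equal-length strings are assumptions for this example, not part of the server.

```python
GAP_SYMBOLS = set("-.=~")  # the four gap symbols used in the output format


def drop_all_gap_columns(query_rows, subject_rows):
    """Remove display columns that are gaps in every query and subject row.

    Both arguments are lists of equal-length strings (hypothetical
    representation of the displayed subalignments).
    """
    rows = list(query_rows) + list(subject_rows)
    if not rows:
        return [], []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all displayed rows must have the same length")

    # Keep a column if at least one displayed row has a residue there.
    keep = [i for i in range(width) if any(r[i] not in GAP_SYMBOLS for r in rows)]

    def strip(row):
        return "".join(row[i] for i in keep)

    return [strip(r) for r in query_rows], [strip(r) for r in subject_rows]


query, subject = drop_all_gap_columns(["AC-D", "A--D"], ["G-=E"])
print(query, subject)
```

In this made-up input only the third column is removed, because it is the only one where every displayed row has a gap symbol.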
Width of alignment segment
Allows showing a long profile-profile alignment as multiple shorter segments, each containing a specified number of aligned positions.

OUTPUT FORMAT

Subject
Name of the alignment with detected similarity to the query. This is a clickable link to the description of the corresponding protein family.

length
Total number of positions in the subject alignment.

filtered_length
Number of positions with effective gap fraction lower than the specified threshold (default = 0.5). These positions are used in the construction of the profile-profile alignment.

Neff
Effective number of sequences in the subject alignment (average number of different residue types per position).

Format of profile-profile alignments

Clickable sequence names: links to full alignments.
CAPITAL letters: residues at positions aligned by COMPASS, i.e. at input alignment positions with gap content < threshold of gap fraction (see above);
lower-case letters: residues at positions not used by COMPASS, i.e. at input alignment positions with gap content >= threshold of gap fraction (see above);
'-' : gaps retained from original alignments at positions aligned by COMPASS, i.e. at positions with gap content < threshold;
'.' : gaps retained from original alignments at positions not used by COMPASS, i.e. at positions with gap content >= threshold;
'=' : gaps introduced by COMPASS in the profile-profile alignment;
'~' : gaps introduced by COMPASS against positions that are not used in the construction of the profile-profile alignment (positions with gap content >= threshold).

In constructing a profile-profile alignment, COMPASS does not consider all positions of the input alignments. If an input alignment column contains too many gaps (i.e. its effective gap fraction exceeds a user-specified threshold), the position is considered non-informative and is disregarded in the process of aligning two profiles. These positions are put back in the final output.
To distinguish such positions from those actually aligned by COMPASS, they are displayed differently in the output profile-profile alignments. At the positions aligned by COMPASS, residues are shown as capital letters, while gaps from the input alignments are shown as dashes (-). At the positions not used by COMPASS, residues are shown as lower-case letters, while gaps from the input alignments are shown as dots (.). Since the latter positions are not aligned against any regions in the other profile, the output alignment has tildes (~) inserted in the other profile against such positions. The gaps that are actually introduced by COMPASS in the process of constructing the profile-profile alignment are shown as equal signs (=). Thus, COMPASS output contains four types of gap symbols, each having its own meaning.

Regardless of the symbols for residues and gaps in the input alignments, COMPASS forces them to upper/lower case and dash/dot depending on whether the position is actually used in the profile-profile alignment. This is decided using the threshold value of effective gap fraction (see above), which can be specified by the user (default = 0.5).

If requested by the user, consensus sequences are shown for both aligned profiles. Consensus of the query profile (CONSENSUS_1) is shown as the bottom sequence for the query. Consensus of the subject profile (CONSENSUS_2) is shown as the top sequence for the subject. Identical matches between the two consensus sequences are marked by letters.
Note: Position numbering for consensus sequences is by alignment positions, NOT by sequence residues.
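The upper/lower case and dash/dot convention described above can be illustrated with a small sketch. It assumes a plain (unweighted) gap fraction per column as a stand-in for the effective gap fraction, whose exact weighting is not described here, and it only covers symbols retained from a single input alignment, not the '=' and '~' gaps introduced when two profiles are aligned. The function name `format_rows` is hypothetical.

```python
def format_rows(rows, threshold=0.5):
    """Re-case residues and re-symbol gaps according to column gap content.

    rows: equal-length strings of one input alignment, gaps given as '-' or '.'.
    Assumption: the plain fraction of gap symbols in a column is used here in
    place of the effective gap fraction.
    """
    n = len(rows)
    width = len(rows[0])
    out = [[] for _ in rows]
    for i in range(width):
        column = [r[i] for r in rows]
        gap_fraction = sum(c in "-." for c in column) / n
        used = gap_fraction < threshold  # column takes part in profile comparison
        for k, c in enumerate(column):
            if c in "-.":
                out[k].append("-" if used else ".")
            else:
                out[k].append(c.upper() if used else c.lower())
    return ["".join(chars) for chars in out]


formatted = format_rows(["ac-G", "A.-g", "a..T"], threshold=0.5)
print("\n".join(formatted))
```

In this toy input the first and last columns fall below the threshold and are shown in capitals, while the two middle columns are gap-rich and are shown with lower-case letters and dots.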