ProCAIn documentation

ProCAIn is a method for the comparison of two multiple alignments of protein sequences.
Sadreyev and Grishin (2009) Nucleic Acids Research.
This program runs a search with a submitted alignment as a query against a numerical database of profiles.

DATABASE

Database of numerical profiles representing a set of multiple sequence alignments.
The search detects similarities between query and alignments from this set.

The following databases are available:
PFAM (http://pfam.wustl.edu/) : full sequence alignments for PFAM families (from distributed file Pfam-A.full)
COG (http://www.ncbi.nlm.nih.gov/COG/) : alignments of all members in each COG
(produced by MUSCLE : http://www.drive5.com/muscle/)
KOG (http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi) : alignments of all members in each KOG
(produced by MUSCLE : http://www.drive5.com/muscle/)
SCOP40 : PSI-BLAST alignments produced with sequences of SCOP representatives
in ASTRAL (<40% identity) as queries (http://astral.berkeley.edu/)

QUERY

Alignment of multiple protein sequences, or a single sequence, submitted by the user.

Possible formats for alignments:
FASTA, ClustalW, Stockholm,
or "simple" format :
name start(optional) sequence_segment end(optional)
name start(optional) sequence_segment end(optional)
...
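
A line of the "simple" format can be read with a few lines of Python. This is only a sketch: the function name parse_simple_line is hypothetical, and it assumes that fields are separated by whitespace and that start and end, when present, are plain integers.

```python
def parse_simple_line(line):
    """Split one 'simple' format line into (name, start, segment, end).

    Assumption: whitespace-separated fields; start/end are optional integers.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a name and a sequence segment")
    name, rest = tokens[0], tokens[1:]
    start = end = None
    # An all-digit token directly after the name is read as the start position.
    if rest and rest[0].isdigit():
        start = int(rest.pop(0))
    # An all-digit token at the end of the line is read as the end position.
    if rest and rest[-1].isdigit():
        end = int(rest.pop())
    if len(rest) != 1:
        raise ValueError("expected exactly one sequence segment")
    return name, start, rest[0], end
```

For example, parse_simple_line("seq1 5 ACD-EF 10") gives the name, both positions and the segment, while parse_simple_line("seq2 ACD") leaves start and end as None.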

Possible formats for single sequence:
FASTA;
or simply single or multiple lines, with or without a name:
name(optional) sequence_segment
name(optional) sequence_segment
...

Example:
K-KQWdRKRFVG-.VELKQKTLGIVGLGRIGAEVAARAKGQRM.
NVIAYDPFFTEE.KAEQMGVQ-YGTLEDVLRAGDFITVHTPLLK
ETKHLINKDAFDLMKDGVQIVNCARGGIIDEDAL

Single-letter notation for 20 amino acid types and two symbols for gaps (- or .)
is used. Special letters of amino acid alphabet (BJUZXbjuzx) are treated as
alanine in the search. Non-standard symbols are treated as gaps.
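
The symbol handling described above can be illustrated as follows. This is a minimal sketch, not the server's code: the function names are hypothetical, and preserving the letter case of the input is an assumption.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acid types
SPECIAL = set("BJUZX")                   # treated as alanine, per the text above
GAPS = set("-.")                         # the two accepted gap symbols

def normalize_symbol(ch):
    """Map one input character to the symbol used in the search."""
    up = ch.upper()
    if up in STANDARD or ch in GAPS:
        return ch
    if up in SPECIAL:
        # Assumption: keep the case of the original letter.
        return "A" if ch.isupper() else "a"
    return "-"  # any other symbol is treated as a gap

def normalize_sequence(seq):
    return "".join(normalize_symbol(c) for c in seq)
```

With these definitions, normalize_sequence("K-KQWdRB.x*") returns "K-KQWdRA.a-".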

OPTIONS

1. Input processing options

Run PSI-BLAST on query
To broaden the set of sequences representing the query family, the input
MSA (or single sequence) can be used to initiate a PSI-BLAST search against the UNIREF
database. The sequences detected by PSI-BLAST are filtered by user-specified
E-value, coverage of the query length, and sequence identity to the query.
The resulting PSI-BLAST alignment is submitted to COMPASS as a query.

Note:
By default, this PSI-BLAST search is performed only if the query is
a single sequence. If a multiple sequence alignment is submitted, COMPASS will
use it 'as is', unless the user specifically requests the initial PSI-BLAST
search.

Run PSI-BLAST if input is a sequence
Prior to the COMPASS search, runs PSI-BLAST to generate an alignment of homologs
if the submitted query is a single sequence. The resulting alignment is then used
as a query for COMPASS.
Default = Yes

Run PSI-BLAST if input is an alignment
Prior to the COMPASS search, runs PSI-BLAST to generate a potentially more representative
alignment of homologs if the submitted query is a multiple sequence alignment.
Starting with the submitted alignment as a query, PSI-BLAST generates a new alignment
of detected sequence homologs. This new alignment is then used as a query for the
COMPASS search.
Default = No

Maximal number of PSI-BLAST iterations
Maximal number of PSI-BLAST iterations performed on query.

PSI-BLAST E-value
Maximal E-value for sequence homologs to be included in the PSI-BLAST
alignment used as a query for the COMPASS search.

PSI-BLAST hit coverage
Minimal percentage of the query length covered by a PSI-BLAST hit
required for the hit to be included in the PSI-BLAST alignment
that is used for the COMPASS search.

Sequence identity of PSI-BLAST hit
Minimal percent sequence identity between a PSI-BLAST hit and the query
required for the hit to be included in the alignment that is used for the
COMPASS search.
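
The three hit filters (E-value, coverage, identity) combine as a simple conjunction. The sketch below is illustrative only: the Hit record and its field names are hypothetical, the threshold values in the usage note are arbitrary, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    # Hypothetical record of one PSI-BLAST hit; not the server's data model.
    name: str
    evalue: float
    coverage_pct: float   # percentage of the query length covered by the hit
    identity_pct: float   # percent sequence identity to the query

def filter_hits(hits, max_evalue, min_coverage_pct, min_identity_pct):
    """Keep hits that pass all three user-specified thresholds."""
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.coverage_pct >= min_coverage_pct
        and h.identity_pct >= min_identity_pct
    ]
```

For instance, filter_hits(hits, max_evalue=0.001, min_coverage_pct=50, min_identity_pct=20) drops any hit that fails at least one of the three criteria.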

PSI-BLAST composition-based statistics
In PSI-BLAST, amino acid substitution matrices may be adjusted to compensate
for biases in amino acid composition of the compared sequences. "Composition-
based statistics" is the procedure of scaling all substitution scores by an
analytically determined constant, while leaving the gap scores fixed.

PSI-BLAST low complexity filtering
Masks off segments of the PSI-BLAST query that have low compositional
complexity (e.g., common acidic-, basic- or proline-rich regions)
detected by the SEG method.

Gap fraction threshold (0.0 to 1.0)
Upper threshold of effective gap content in alignment columns used in the construction
of profile-profile alignment. If a column contains too many gaps, it is disregarded
in the process of profile comparison. Such columns are shown in the output as lower-case
letters (residues) and dots (gaps).
Default = 0.5
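
As an illustration of the threshold, the sketch below selects the columns that would be used in profile comparison. It assumes a plain (unweighted) fraction of gap symbols per column, whereas the text refers to an "effective" gap content that may involve sequence weighting; the function names are hypothetical.

```python
GAP_CHARS = {"-", "."}

def gap_fractions(alignment):
    """Unweighted fraction of gap symbols in each column (an assumption)."""
    nrows = len(alignment)
    return [
        sum(ch in GAP_CHARS for ch in column) / nrows
        for column in zip(*alignment)
    ]

def used_columns(alignment, threshold=0.5):
    """Indices of columns whose gap fraction is below the threshold."""
    return [i for i, g in enumerate(gap_fractions(alignment)) if g < threshold]
```

For the four-row alignment ["AC-D", "A--D", "ACE-", "A--D"], the gap fractions are 0.0, 0.5, 0.75 and 0.25, so with the default threshold only columns 0 and 3 are kept.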

2. Search options

Gap opening penalty (integer)
Score penalty for opening a new gap during the construction of profile-profile alignment.
Default = 10

Gap extension penalty (integer)
Score penalty for the extension of an existing gap during the construction of profile-profile alignment.
Default = 1
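
The two penalties together define an affine gap cost. The sketch below assumes the convention in which a gap of k positions costs opening + k * extension; this documentation does not state which convention is used, and another common one charges opening + (k - 1) * extension. The function name is hypothetical.

```python
def gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine cost of a gap spanning `length` positions.

    Assumption: the cost is gap_open + gap_extend * length.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length
```

Under this assumption and the default values, a single-position gap costs 11 and a four-position gap costs 14.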

Effective length of the database (integer)
Total effective length of profiles in the database (including only columns with gap content
lower than the gap fraction threshold).
This parameter is used for E-value calculation.
Default = effective length of chosen database.

Matrix
Residue substitution matrix. Used to derive background residue frequencies and to generate profiles
from alignments (incorporated in pseudocount calculation).
Specifying a matrix other than default might require adjusting gap penalties.
Default = BLOSUM62

3. Formatting options

Expect (real)
Maximal E-value for a hit to be displayed.
Default = 0.1

Significance threshold (real)
Maximal E-value for a hit to be considered reliable.
Default = 0.001

Display up to (integer)
Maximal number of hits (sorted by E-value) to be displayed
Default = 100
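
The three options above interact as sketched below. Hits are represented as hypothetical (name, E-value) pairs, and treating both thresholds as inclusive is an assumption.

```python
def select_hits(hits, expect=0.1, significance=0.001, max_display=100):
    """Return (name, evalue, is_reliable) for the hits to be displayed.

    `hits` is a list of (name, evalue) pairs (a hypothetical representation).
    """
    # Keep hits at or below the Expect threshold, best E-value first.
    shown = sorted((h for h in hits if h[1] <= expect), key=lambda h: h[1])
    # Truncate to the requested number of hits and flag the reliable ones.
    return [(name, e, e <= significance) for name, e in shown[:max_display]]
```

With the defaults, a hit with E-value 0.05 is displayed but not flagged as reliable, and a hit with E-value 2.0 is not displayed at all.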

Top sequences to show
Number of top sequences from compared alignments to show in the profile-profile alignment.
If the original multiple sequence alignment contains fewer sequences, all available sequences are shown.
The names of the bottom displayed sequence in the query and the top displayed sequence in the subject
are clickable links to the corresponding full alignments used in the search.

The user can also set the number of top sequences to zero for both profiles.
In this case, if consensus sequences are requested, only the consensus is displayed.
If consensus sequences are not requested, only alignment headers are displayed.

Show consensus sequences
If this option is chosen, the consensus sequences are shown for both aligned profiles,
in addition to the specified number of top sequences in each.
At alignment positions with low gap content (effective gap frequency < 0.5),
the consensus sequence contains the residue type with the maximal target frequency at that position.
At positions with higher gap content, the consensus sequence contains a gap.
Consensus of the query profile (CONSENSUS_1) is shown as the bottom sequence for the query.
Consensus of the subject profile (CONSENSUS_2) is shown as the top sequence for the subject.
Identical matches between the two consensus sequences are marked by letters.

Note: Position numbering for consensus sequences is by alignment positions, NOT by sequence residues.
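
A simplified consensus can be sketched as follows. Two assumptions are made loudly here: the most frequent observed residue stands in for the residue with maximal target frequency (which the real profile would derive with pseudocounts), and an unweighted gap fraction stands in for the effective gap frequency. The function names are hypothetical.

```python
from collections import Counter

def consensus(alignment, gap_threshold=0.5):
    """Simplified consensus: most frequent residue, or '-' at gappy columns."""
    out = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c not in "-."]
        gap_fraction = 1 - len(residues) / len(column)
        if residues and gap_fraction < gap_threshold:
            # Stand-in for the residue with maximal target frequency.
            out.append(Counter(residues).most_common(1)[0][0])
        else:
            out.append("-")
    return "".join(out)

def match_line(cons1, cons2):
    """Letters where the two consensus sequences agree, spaces elsewhere."""
    return "".join(
        a if a == b and a != "-" else " " for a, b in zip(cons1, cons2)
    )
```

For the alignment ["ACD-", "AC--", "GCE-", "A-EK"], the last column has a gap fraction of 0.75, so the sketch returns "ACE-".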

Hide or show positions with gaps in all displayed sequences
In subalignments used to display the profile-profile alignment, some positions may contain only gaps.
If two such positions are aligned, they can be removed from the output, to make it more readable.

Width of alignment segment
Allows showing a long profile-profile alignment as multiple shorter segments, each containing
a specified number of aligned positions.
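
Splitting the output by width is plain slicing of every displayed row at the same offsets. A sketch with a hypothetical function name:

```python
def split_segments(rows, width):
    """Cut aligned rows into consecutive blocks of at most `width` positions."""
    if width <= 0:
        raise ValueError("width must be positive")
    length = len(rows[0])
    return [
        [row[start:start + width] for row in rows]
        for start in range(0, length, width)
    ]
```

With a width of 3, two rows of seven positions yield three blocks of 3, 3 and 1 positions.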

OUTPUT FORMAT

Subject
Name of alignment with detected similarity to query. This is a clickable link to the description
of the corresponding protein family.

length
Total number of positions in the subject alignment.

filtered_length
Number of positions with effective gap fraction lower than specified threshold (default=0.5).
These positions are used in the construction of profile-profile alignment.

Neff
Effective number of sequences in the subject alignment
(average number of different residue types per position).
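
Read literally, this definition can be computed as below. The sketch averages over all columns and ignores gaps and letter case; whether the server restricts the average to filtered positions or applies sequence weighting is not stated here, so treat this as an approximation with a hypothetical function name.

```python
def neff(alignment):
    """Average number of distinct residue types per alignment column."""
    counts = [
        len({c.upper() for c in column if c not in "-."})
        for column in zip(*alignment)
    ]
    return sum(counts) / len(counts)
```

For the alignment ["ACD", "AC-", "GCE"], the columns contain 2, 1 and 2 distinct residue types, giving 5/3.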

Format of profile-profile alignments
Clickable sequence names: links to full alignments.

CAPITAL letters: residues at positions aligned by COMPASS, i.e. at input alignment positions
with gap content < threshold of gap fraction (see above);
lower-case letters: residues at positions not used by COMPASS, i.e. at input alignment positions
with gap content >= threshold of gap fraction (see above);
'-' : gaps retained from original alignments at positions aligned by COMPASS, i.e. at positions
with gap content < threshold;
'.' : gaps retained from original alignments at positions not used by COMPASS, i.e. at positions
with gap content >= threshold;
'=' : gaps introduced by COMPASS in profile-profile alignment;
'~' : gaps introduced by COMPASS against positions that are not used in the construction of
profile-profile alignment (positions with gap content >= threshold);

In constructing a profile-profile alignment, COMPASS does not consider all positions of the input alignments.
If an input alignment column contains too many gaps (i.e. its effective gap fraction exceeds a user-specified
threshold), the position is considered non-informative and is disregarded in the process of aligning
two profiles. These positions are put back in the final output.
To distinguish such positions from those actually aligned by COMPASS, they are displayed
differently in the output profile-profile alignments.
At the positions aligned by COMPASS, residues are shown as capital letters, while gaps
from the input alignments are shown as dashes (-).
At the positions not used by COMPASS, residues are shown as lower-case letters, while gaps
from the input alignments are shown as dots (.).
Since the latter positions are not aligned against any regions in the other profile,
the output alignment has tildes (~) inserted in the other profile against such positions.
The gaps that are actually introduced by COMPASS in the process of constructing profile-profile alignment
are shown as equal signs (=).
Thus, COMPASS output contains four types of gap symbols, each having its own meaning.
Regardless of the symbols for residues and gaps in the input alignments,
COMPASS forces them to upper/lower case and dash/dot depending on whether this position is actually used
in profile-profile alignment. This is decided using threshold value of effective gap fraction (see above),
which can be specified by the user (default = 0.5).
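
The case and gap-symbol conventions for a single input alignment can be sketched as follows. The code covers only the symbols retained from the input (upper/lower case, '-' and '.'), not the '=' and '~' symbols introduced during profile-profile alignment. It assumes an unweighted gap fraction in place of the effective gap fraction, and the function name is hypothetical.

```python
def render_rows(alignment, threshold=0.5):
    """Force case and gap symbols according to whether a column is used."""
    nrows = len(alignment)
    # A column is "used" when its gap fraction is below the threshold.
    used = [
        sum(ch in "-." for ch in column) / nrows < threshold
        for column in zip(*alignment)
    ]
    rendered = []
    for row in alignment:
        chars = []
        for is_used, ch in zip(used, row):
            if ch in "-.":
                chars.append("-" if is_used else ".")
            else:
                chars.append(ch.upper() if is_used else ch.lower())
        rendered.append("".join(chars))
    return rendered
```

For the input rows ["ac.D", "A--d", "Ge-."], the third column consists only of gaps and is therefore shown with dots, while the remaining columns are shown in capitals with dashes: ["AC.D", "A-.D", "GE.-"].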

If requested by the user, consensus sequences are shown for both aligned profiles.
Consensus of the query profile (CONSENSUS_1) is shown as the bottom sequence for the query.
Consensus of the subject profile (CONSENSUS_2) is shown as the top sequence for the subject.
Identical matches between the two consensus sequences are marked by letters.

Note: Position numbering for consensus sequences is by alignment positions, NOT by sequence residues.