=======================================================================

      AL2CO - A Program to Calculate Positional Conservation in a
		Protein Sequence Alignment (September, 2000)

=======================================================================

Please send bug reports, comments etc. to:
	jpei@mednet.swmed.edu

* =====================================================================
*
*                          PUBLIC DOMAIN NOTICE
*                       Department of Biochemistry
*               University of Texas Southwestern Medical Center at Dallas
*
*  This software is freely available to the public for use. We have not 
*  placed any restriction on its use or reproduction.
*
*  Although all reasonable efforts have been taken to ensure the accuracy
*  and reliability of the software and data, the University of Texas 
*  Southwestern Medical Center does not and cannot warrant the performance
*  or results that may be obtained by using this software or data. The 
*  University of Texas Southwestern Medical Center disclaims all warranties, 
*  express or implied, including warranties of performance, merchantability 
*  or fitness for any particular purpose.
*
*  Please cite the authors in any work or product based on this material.
*
* =====================================================================

* Introduction
  This directory contains the conservation index calculation program, al2co.c.
The user provides a multiple sequence alignment (in ClustalW or simple
alignment format) and specifies the calculation method; the program then
reports the conservation index for each position in the alignment. Please
refer to Pei & Grishin (1) for the details of the algorithm.

* Compilation
  cc al2co.c -o al2co -lm
  or
  gcc al2co.c -o al2co -lm

* Conservation calculation methods:
  Two steps are performed to estimate the conservation of a position in a 
multiple sequence alignment. In the first step, amino acid frequencies
at the position are estimated. In the second step, the conservation index is
calculated from these frequencies. An optional third step allows the user
to average the conservation indices over a window.

  The following frequency estimation strategies are available:
1.1. Unweighted amino acid frequencies.

1.2. Weighted amino acid frequencies.
     We use a modified Henikoff-Henikoff weighting scheme (2), as 
applied in PSI-BLAST (3). A position is not used for weighting if it is 
invariant or contains gaps in at least 50% of the sequences.

1.3. Estimated independent counts.
     We use a modified version of the strategy of Sunyaev et al. (4) to
estimate independent counts of amino acids at a position (1).
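
  The first two strategies can be sketched as follows. This is only an
illustration in Python, not the code in al2co.c: it uses the plain
position-based weights of Henikoff & Henikoff (2) rather than the modified
PSI-BLAST variant, it assumes gaps are written as "-" and are left out of
the frequencies, and all function names are hypothetical. The
independent-count strategy (1.3) is not covered.

```python
from collections import Counter, defaultdict

GAP = "-"  # assumed gap character


def unweighted_frequencies(column):
    """Strategy 1.1: raw frequencies among the non-gap residues of a column."""
    residues = [r for r in column if r != GAP]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def henikoff_weights(alignment, gap_cutoff=0.5):
    """Plain position-based sequence weights (an assumption, see lead-in).

    Columns that are invariant, or whose gap fraction is at least
    gap_cutoff, do not contribute to the weights.
    """
    nseq = len(alignment)
    weights = [0.0] * nseq
    for column in zip(*alignment):
        counts = Counter(r for r in column if r != GAP)
        gap_fraction = column.count(GAP) / nseq
        if gap_fraction >= gap_cutoff or len(counts) <= 1:
            continue
        k = len(counts)  # number of distinct residue types in the column
        for s, r in enumerate(column):
            if r != GAP:
                weights[s] += 1.0 / (k * counts[r])
    total = sum(weights)
    if total == 0.0:
        return [1.0 / nseq] * nseq  # no informative column: uniform weights
    return [w / total for w in weights]


def weighted_frequencies(column, weights):
    """Strategy 1.2: each sequence contributes its weight instead of 1."""
    freq = defaultdict(float)
    for r, w in zip(column, weights):
        if r != GAP:
            freq[r] += w
    total = sum(freq.values())
    return {aa: f / total for aa, f in freq.items()} if total > 0 else {}
```

  For the toy alignment ["AC", "AC", "AD"], the first column is invariant
and is skipped, so the weights come from the second column alone; the two
identical sequences share the weight that the third sequence gets by itself.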

  The conservation index is then calculated from the frequencies by one of 
the following strategies:
2.1. Entropy-based measure.
  	C(i)=sum_{a=1}^{20}f_a(i)*ln[f_a(i)], where f_a(i) is the frequency
of amino acid a at position i.

2.2. Variance-based measure.
	C(i)=sqrt[sum_{a=1}^{20}(f_a(i)-f_a)^2], where f_a is the overall
frequency of amino acid a.

2.3. Sum-of-pairs measure.
	C(i)=sum_{a=1}^{20}sum_{b=1}^{20}f_a(i)*f_b(i)*S_{ab}, where
S_{ab} is the element of a scoring matrix for amino acids a and b.
If a reasonable amino acid substitution matrix S is applied,
this method takes into account the similarities between different 
amino acids. If the user wants the conservation indices to be the same for 
all invariant positions, the scoring matrix can be normalized (see the
-m option below). 
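
  The three measures above translate almost directly into code. The sketch
below is an illustration in Python with hypothetical function names, not an
excerpt from al2co.c. Frequencies are passed as dictionaries mapping amino
acid letters to f_a(i), and the scoring matrix as a nested dictionary; both
representations are assumptions made for the example.

```python
import math


def entropy_index(freqs):
    """2.1: C(i) = sum_a f_a(i) * ln f_a(i).

    The value is 0 for an invariant position and negative otherwise,
    so larger values mean higher conservation.
    """
    return sum(f * math.log(f) for f in freqs.values() if f > 0.0)


def variance_index(freqs, overall):
    """2.2: C(i) = sqrt(sum_a (f_a(i) - f_a)^2), f_a = overall frequency."""
    letters = set(freqs) | set(overall)
    return math.sqrt(sum((freqs.get(a, 0.0) - overall.get(a, 0.0)) ** 2
                         for a in letters))


def sum_of_pairs_index(freqs, matrix):
    """2.3: C(i) = sum_a sum_b f_a(i) * f_b(i) * S_ab."""
    return sum(fa * fb * matrix[a][b]
               for a, fa in freqs.items()
               for b, fb in freqs.items())


# Toy two-letter identity matrix (the identity matrix is the -s default).
IDENTITY = {"A": {"A": 1, "W": 0}, "W": {"A": 0, "W": 1}}

invariant = {"A": 1.0}
mixed = {"A": 0.5, "W": 0.5}
print(entropy_index(invariant), entropy_index(mixed))
print(sum_of_pairs_index(invariant, IDENTITY),
      sum_of_pairs_index(mixed, IDENTITY))
```

  With the identity matrix, the sum-of-pairs measure reduces to the sum of
squared frequencies, so an invariant position scores 1 and an even two-way
split scores 0.5.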

* The effect of gaps
  The presence of gaps at a position means that, provided the alignment is
correct, the position is dispensable in some of the proteins. Positions with
gaps therefore tend to be less conserved (1). Gaps are not treated the same
way as amino acids in the conservation calculation. 
  A gap fraction threshold is specified by the user (default value 0.5).
Conservation indices are calculated only for positions with a gap fraction
less than that value. The mean value (mean) and standard deviation (sigma)
of these indices are then calculated. For all positions with a gap fraction
no less than the threshold, the conservation index is set to mean-1.0*sigma.
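
  The gap rule can be illustrated with a short Python sketch. The function
name is hypothetical, and the use of the population standard deviation is
an assumption; the text above does not say whether al2co.c divides by N or
by N-1.

```python
import statistics


def apply_gap_threshold(indices, gap_fractions, threshold=0.5):
    """Replace indices of gap-rich positions by mean - 1.0 * sigma.

    indices[i] is only read where gap_fractions[i] < threshold; the
    entries at gap-rich positions are placeholders and are overwritten.
    """
    kept = [c for c, g in zip(indices, gap_fractions) if g < threshold]
    if not kept:
        return list(indices)  # nothing to take a mean over
    mean = statistics.fmean(kept)
    sigma = statistics.pstdev(kept)  # assumption: population std. dev.
    fill = mean - 1.0 * sigma
    return [c if g < threshold else fill
            for c, g in zip(indices, gap_fractions)]


# Positions 3 and 4 have gap fractions of 0.6 and exactly 0.5, so both
# fall at or above the default threshold and receive the fill value.
print(apply_gap_threshold([1.0, 3.0, 0.0, 0.0], [0.0, 0.1, 0.6, 0.5]))
```

  Note that a position whose gap fraction equals the threshold is treated as
gap-rich, because only positions strictly below the threshold are kept.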

* Arguments of the AL2CO program

  -i    Input alignment file [File in]
        Format: ClustalW or simple alignment format
        The title (first line) should begin with "CLUSTAL W", or
        the title line should be deleted.

  -o    Output file with conservation index for each position in the
        alignment [File out] Optional
	Default = STDOUT

  -t	Output file with conservation index mapped to the alignment
	[File out] Optional
	Conservation indices are linearly rescaled to the range 0
	to 9.99: C'=9.99*(C-MIN)/(MAX-MIN), where C and C' are
	the indices before and after rescaling respectively, and MAX and 
	MIN are the highest and lowest indices before rescaling 
	respectively. The integer part of each rescaled index is 
	written out along with the sequence alignment.
	Default = no output

  -b	Block size of the output alignment file with conservation
	[Integer] Optional
	Default = 60

  -s    Input file with the scoring matrix [File in] Optional
        Format: NCBI
	Note: The scoring matrix is used only for the sum-of-pairs
	measure (option -c 2).
        Default = identity matrix

  -m    Scoring matrix transformation [Integer] Optional
        Options:
        0=no transformation,
        1=normalization S'(a,b)=S(a,b)/sqrt[S(a,a)*S(b,b)],
        2=adjustment S"(a,b)=2*S(a,b)-(S(a,a)+S(b,b))/2
        Default = 0

  -f    Weighting scheme for amino acid frequency estimation [Integer] Optional
        Options:
        0=unweighted,
        1=weighted by the modified method of Henikoff & Henikoff (2)(3),
        2=independent-count based (1)(4)
        Default = 2

  -c    Conservation calculation method [Integer] Optional
        Options:
        0=entropy-based  C(i)=sum_{a=1}^{20}f_a(i)*ln[f_a(i)], where f_a(i) 
	  is the frequency of amino acid a at position i,
        1=variance-based C(i)=sqrt[sum_{a=1}^{20}(f_a(i)-f_a)^2], where f_a 
	  is the overall frequency of amino acid a,
        2=sum-of-pairs measure  C(i)=sum_{a=1}^{20}sum_{b=1}^{20}f_a(i)*f_b(i)*S_{ab}, 
	  where	S_{ab} is the element of a scoring matrix for amino acids a and b
        Default = 0

  -w    Window size used for averaging [Integer] Optional
        Default = 1
        Recommended value for motif analysis: 3

  -n    Normalization option [T/F] Optional
	Subtract the mean from each conservation index and divide by the 
	standard deviation.
        Default = T

  -a    All methods option [T/F] Optional
        If set to true, the results of all 9 methods will be output.
        1. unweighted entropy measure; 2. Henikoff entropy measure;
        3. independent count entropy measure;
        4. unweighted variance measure; 5. Henikoff variance measure;
        6. independent count variance measure;
        7. unweighted identity-matrix sum-of-pairs measure;
        8. Henikoff identity-matrix sum-of-pairs measure;
        9. independent count identity-matrix sum-of-pairs measure;
        Default = F

  -g    Gap fraction to suppress conservation calculation [Real] Optional
        The value should be more than 0 and no more than 1. Conservation 
	indices are calculated only for positions with gap fraction less 
	than the specified value. Otherwise, conservation indices are
	set to M-S, where M is the mean conservation value and S is
	the standard deviation.
        Default = 0.5

  -p    Input pdb file [File in] Optional
        The sequence in the pdb file should match exactly the first sequence 
        of the alignment.

  -d    Output pdb file [File Out] Optional
        The B-factors are replaced by the conservation indices.
        Default = STDOUT
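
  Several of the options above are simple arithmetic transformations. The
Python sketch below illustrates the formulas given for -m, -w, -n and -t.
It is not taken from al2co.c: the function names are hypothetical, the
window is assumed to be centered and truncated at the ends of the
alignment, the standard deviation is assumed to be the population one, and
the order in which the program applies these steps is not specified here.

```python
import math
import statistics


def transform_matrix(matrix, mode):
    """-m: 0 = unchanged, 1 = normalization, 2 = adjustment."""
    if mode == 0:
        return matrix
    out = {}
    for a in matrix:
        out[a] = {}
        for b in matrix[a]:
            s_ab, s_aa, s_bb = matrix[a][b], matrix[a][a], matrix[b][b]
            if mode == 1:
                out[a][b] = s_ab / math.sqrt(s_aa * s_bb)
            else:
                out[a][b] = 2 * s_ab - (s_aa + s_bb) / 2
    return out


def window_average(indices, window=1):
    """-w: mean over a centered window (edge handling is an assumption)."""
    half = window // 2
    out = []
    for i in range(len(indices)):
        chunk = indices[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def normalize(indices):
    """-n T: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(indices)
    sigma = statistics.pstdev(indices)  # assumption: population std. dev.
    if sigma == 0.0:
        return [0.0] * len(indices)
    return [(c - mean) / sigma for c in indices]


def rescale_digits(indices):
    """-t: integer part of C' = 9.99 * (C - MIN) / (MAX - MIN)."""
    lo, hi = min(indices), max(indices)
    if hi == lo:
        return [0] * len(indices)
    return [int(9.99 * (c - lo) / (hi - lo)) for c in indices]


# Toy two-letter matrix with unequal diagonal elements.
TOY = {"A": {"A": 4, "W": -3}, "W": {"A": -3, "W": 11}}
print(transform_matrix(TOY, 1)["A"]["A"], transform_matrix(TOY, 1)["W"]["W"])
print(rescale_digits([-1.0, 0.0, 1.0]))
```

  After normalization with -m 1 every diagonal element equals 1, which is
why all invariant positions then receive the same sum-of-pairs index. The
factor 9.99 in the -t rescaling keeps the highest index at digit 9, so each
position fits in a single character.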

* Examples: (The files are in the directory examples/)

  al2co -i 3RAB.aln -p 3RAB.pdb -d 3RAB.csv.pdb -o 3RAB.csv

  al2co -i ybak.aln -w 3 -o ybak.csv

  al2co -i ybak.aln -c 2 -s BLOSUM62
 
  al2co -i Sec7.aln -a T

  al2co -i Sec7.aln -n F -f 1

  al2co -i Sec7.aln -t Sec7.csv.aln -b 70

  input alignment format: ClustalW - Sec7.aln
		          Simple alignment format - ybak.aln, 3RAB.aln
  input matrix format: NCBI - BLOSUM62
  input pdb file: 3RAB.pdb
  output pdb file: 3RAB.csv.pdb
  output conservation file: 3RAB.csv, ybak.csv
  output alignment file with conservation: Sec7.csv.aln
	
  molscript file: 3RAB.in
	In this file, the command line to color according to B-factor (in our
	case replaced by conservation index) is:
	"colour ss from blue via green to red by b-factor from -1.0 to 2"
	The command to generate a PostScript file with the structure colored
	by conservation is "bobscript<3RAB.in>3RAB.ps".
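
  For readers who want to see what replacing the B-factors amounts to, here
is a minimal Python sketch. It is not the code used by al2co. It assumes
standard fixed-column PDB ATOM records (chain and residue number in columns
22-27, B-factor in columns 61-66), lines without trailing newlines, and one
conservation index per residue in file order, taken from the alignment
columns where the first sequence has no gap. The function names are
hypothetical.

```python
GAP = "-"  # assumed gap character


def scores_for_first_sequence(first_aligned, column_scores):
    """Keep the scores of the columns where the first sequence has a residue."""
    return [s for r, s in zip(first_aligned, column_scores) if r != GAP]


def set_b_factors(pdb_lines, residue_scores):
    """Write one score per residue into the B-factor field of ATOM records."""
    out = []
    last_key = None
    index = -1
    for line in pdb_lines:
        if line.startswith("ATOM"):
            key = line[21:27]  # chain ID, residue number, insertion code
            if key != last_key:
                index += 1  # first atom of a new residue
                last_key = key
            if index < len(residue_scores):
                padded = line.ljust(66)
                line = (padded[:60]
                        + "%6.2f" % residue_scores[index]
                        + padded[66:])
        out.append(line)
    return out
```

  All atoms of a residue receive the same value, so a colouring command
that works by B-factor, such as the one quoted above, colours whole
residues by conservation.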

References:

(1) Pei, J., and Grishin, N.V. (submitted). AL2CO: Calculation of 
Positional Conservation in a Protein Sequence Alignment.

(2) Henikoff, S., and Henikoff, J.G. (1994). Position-based sequence weights, 
J Mol Biol 243, 574-578.

(3) Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, 
W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs, Nucleic Acids Res 25, 3389-3402.

(4) Sunyaev, S.R., Eisenhaber, F., Rodchenkov, I.V., Eisenhaber, B., Tumanyan, 
V.G., and Kuznetsov, E.N. (1999). PSIC: profile extraction from sequence 
alignments with position-specific counts of independent observations, 
Protein Eng 12, 387-394.