README
We developed the HangOut procedure to build clean profiles for
accurately defined single domains (e.g. SCOP domains). To our surprise,
our recent analysis (manuscript submitted) showed that even single
domains can have corrupted profiles.
We implemented and tested the HangOut program under Python 2.3
(it should also work with higher versions) to fix those profile
corruptions. HangOut reads a FASTA file or a PDB file and builds a
profile (much like the output of the PSI-BLAST -m 6 option) from the
sequence extracted from the file. Besides the HangOut procedure, the
program offers two other procedures for building profiles: the original
PSI-BLAST and RemoveHit.
HangOut was developed to generate clean profiles for a given domain
definition, especially for discontinuously defined domains such as those
in SCOP. For example, the SCOP domain d1xzpa1 is defined as chain A
residues 118-211 and 372-450 because of an inserted domain. The user
should therefore give a domain definition as a command line argument;
otherwise, HangOut automatically runs plain PSI-BLAST for domains
without a domain definition.
The domain definition can be given in one of two ways, depending on
the type of input protein sequence:
1) The input sequence is extracted from a PDB file.
2) The input sequence is given as a simple FASTA file with an amino acid sequence.
HangOut can read a PDB file together with a range definition similar to
those in the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/).
Shell% hangout [optional parameters] <pdb file> <range definition>
The <range definition> can be, for example, A:118-211,A:372-450.
Here, the first contiguous segment runs from residue 118 to residue 211 of chain A (PDB residue numbering),
and the second contiguous segment runs from residue 372 to residue 450 of chain A.
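To make the range syntax concrete, here is a minimal sketch of how a definition such as A:118-211,A:372-450 could be split into segments. It is written in current Python for illustration only. The function name parse_pdb_ranges is hypothetical and is not taken from the HangOut source, and the sketch assumes single-character chain identifiers and non-negative residue numbers without insertion codes.

```python
import re

# Hypothetical helper for illustration; not part of the HangOut program.
# Assumes one-character chain IDs and plain non-negative residue numbers.
_SEGMENT = re.compile(r"^([A-Za-z0-9]):(\d+)-(\d+)$")


def parse_pdb_ranges(definition):
    """Split 'A:118-211,A:372-450' into (chain, start, end) tuples."""
    segments = []
    for part in definition.split(","):
        match = _SEGMENT.match(part.strip())
        if match is None:
            raise ValueError("cannot parse segment: %r" % part)
        chain = match.group(1)
        start = int(match.group(2))
        end = int(match.group(3))
        if start > end:
            raise ValueError("start is after end in segment: %r" % part)
        segments.append((chain, start, end))
    return segments


print(parse_pdb_ranges("A:118-211,A:372-450"))
# [('A', 118, 211), ('A', 372, 450)]
```

Each tuple corresponds to one contiguous segment of the domain, in the order given on the command line.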
Note that to use the full power of the HangOut methodology, it is better
to use unmodified PDB files downloaded from the PDB website, not the SCOP
domain PDB files downloaded from ASTRAL (the SCOP domain deposition website).
If ASTRAL domain PDB files are used, the inserted domain sequences are not
detected automatically, because ASTRAL has already removed them (which is
usually very convenient). In that case you can define the inserted domain
yourself (see the -n option below).
To give a domain definition for a FASTA file, use the following command line.
Shell% hangout [optional parameters] <fasta file> [<range def>]
<range def> is a list of <start position>-<end position> ranges
concatenated by ",", where start and end positions are integers giving
the positions of amino acids in the sequence.
For example, suppose we have a FASTA file, test.fa, that contains an amino acid sequence of 10 residues.
################ test.fa #################
>test domain
MDLTSAEAVR
########################################
If this test domain had an inserted domain between the two residues
"TS" (that is, the insertion has been removed and is not shown in test.fa),
then the range should be given explicitly as 1-4,5-10.
The hangout command for this case would be:
% hangout test.fa 1-4,5-10
To use the whole sequence of the test domain, set the range with the
keyword "all" or give the range "1-10".
Note that if the range is defined this way, the HangOut procedure cannot
be used. Instead, RemoveHit or PSI-BLAST should be used.
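The FASTA range syntax from the test.fa example can be illustrated with a short sketch in current Python. The function names parse_fasta_ranges and extract_segments are hypothetical and are not taken from the HangOut source. The sketch assumes 1-based, inclusive positions, as in the example, and treats the keyword "all" as the whole sequence.

```python
# Hypothetical helpers for illustration; not part of the HangOut program.

def parse_fasta_ranges(definition, seq_length):
    """Return 1-based inclusive (start, end) pairs for a FASTA range definition."""
    if definition == "all":
        return [(1, seq_length)]
    ranges = []
    for part in definition.split(","):
        start_text, end_text = part.split("-")
        start, end = int(start_text), int(end_text)
        if not 1 <= start <= end <= seq_length:
            raise ValueError("range out of bounds: %r" % part)
        ranges.append((start, end))
    return ranges


def extract_segments(sequence, ranges):
    """Cut the sequence into the pieces named by 1-based inclusive ranges."""
    return [sequence[start - 1:end] for start, end in ranges]


sequence = "MDLTSAEAVR"  # the test.fa sequence
ranges = parse_fasta_ranges("1-4,5-10", len(sequence))
print(extract_segments(sequence, ranges))  # ['MDLT', 'SAEAVR']
```

The boundary between the two segments (after residue 4, between "T" and "S") marks the position of the removed insertion. With "all" or "1-10" there is only one segment and therefore no insertion position.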
Selecting profile building methodology
-m <method_name>
<method_name> should be one of the following choices:
hangout (default)
removehit
psiblast
Instead of the default HangOut procedure, you may choose RemoveHit or
PSI-BLAST as the profile building method. RemoveHit is expected to produce
less clean profiles than HangOut (though slightly better than PSI-BLAST), but
it does not require a domain definition.
Setting a new query domain name
-t <new_name>
The new_name is used for the output filename and as the query domain
name in the profiles, instead of the name found in the original query
FASTA file or PDB file.
Setting neighboring domain ranges
-n <neighbor_def>
Usage:
-n auto (default)
-n "1c0p.pdb A:1128-1208"
-n "1c0p.fa 1128-1208"
The HangOut method uses inserted domains to build clean, corruption-free
profiles. However, users can freely add N-terminal or C-terminal
domains, or one or more other domains around the input domain definition,
for cleaner results.
Note that HangOut only accepts a -n input file in the same format as the
main input file; i.e. if a FASTA file is used as input, then -n must also
be given a FASTA file, and likewise for PDB files.
Currently, if the range definition or insertion positions are not
given, or the given range is continuous, HangOut runs as normal
PSI-BLAST, since HangOut cannot check the profile without a clear
domain boundary. We plan to lift the strict requirement for a domain
definition and make HangOut a solution for the general multidomain problem.
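The fallback rule described above can be summarized in a few lines of current Python. This is only a sketch of the documented behavior, not the program's actual code. The function name effective_method is hypothetical, and the sketch makes two assumptions: a definition with fewer than two segments counts as "not given or continuous", and neighbor domains supplied with -n are ignored.

```python
# Hypothetical helper for illustration; not part of the HangOut program.
METHODS = ("hangout", "removehit", "psiblast")


def effective_method(requested, segments):
    """Return the method that would actually run.

    requested -- value of the -m option
    segments  -- list of parsed range segments (empty if none were given)
    """
    if requested not in METHODS:
        raise ValueError("unknown method: %r" % requested)
    # Assumption: without at least two segments there is no insertion
    # position, so HangOut has no domain boundary to check against.
    if requested == "hangout" and len(segments) < 2:
        return "psiblast"
    return requested


print(effective_method("hangout", [(1, 4), (5, 10)]))  # hangout
print(effective_method("hangout", [(1, 10)]))          # psiblast
print(effective_method("removehit", []))               # removehit
```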
Note for users who want to run RemoveHit:
RemoveHit runs in a similar manner to HangOut, but it does not require
a domain definition (so the -n option does not affect the results).
However, we do not strongly recommend RemoveHit for building profiles,
since it performed worse than HangOut.