M2SG: A MUTATION DATABASE VALIDATED AT GENOME AND PROTEOME LEVEL

	M2SG: A MUTATION DATABASE VALIDATED AT PROTEOMIC AND GENOMIC LEVEL

M2SG - Mapping Mutations to Protein and Gene sequence

General Description:

M2SG web server could be queried by several types of inputs, including formats as protein sequence, OMIM entry number, UniProt accession, protein name, gene name, and disease phenotype. It would return the web page for the most relevant protein and the mapped mutations instantly.

DATA INPUT
DATA OUTPUT

DATA INPUT

Search by a Protein Sequence

You could input your protein sequence in FASTA format or plain-text format to query M2SG.

1.A protein sequence in FASTA format:

A protein sequence in FASTA-format contains a definition line followed by the actual sequence. The first line (definition line) starts with a ">" (greater-than) symbol and is usually a description of the sequence. From the second line is the actual sequence itself in standard one-letter code. Anything other than a valid code would be ignored (including spaces, tabulators, etc...). An exmaple is shown below,

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

2.A protein sequence in plain-text format:

In this format, the sequence should only contain the standard one-letter codes (without the definition line). Anything other than a valid code would be ignored (including spaces, tabulators, etc...). An example is given below,

LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

M2SG uses BLAST to detect homologous disease-related proteins with mutations.

The top hit in the BLAST result will be shon only if it satisfiese either of the two criteria below: Other hits are omitted.

1. The "HIT" WITH HIGH CONFIDENCE: the hit with sequence identity greater than 90% and hit coverage (The coverage of BLAST alignment for the hit) greater than 90%;

2. The "HIT" WITH WARNING AND CAUTION: the hit with E-value lower than 10-10 and hit coverage greater 50%.

Other hits will be omitted.

Search by an OMIM ID

One OMIM ID represents one OMIM entry indicated by a unique six-digit number, such as 202110. You can select the OMIM ID of your interest in the drop-down box to query our database. The server will return all disease-related mutations associated with this OMIM ID.

Search by an UniProt AC

Each sequence is indicated by an accession number (AC) in the Uniprot database. Accession numbers are stable from release to release. You can select the UniProt AC of your protein, such as P05093, in the drop-down box to query our database. The server will return all disease related mutations associated with this UniProt AC as a result.

Search by a Protein Name

The protein name here means the full name recommended by the UniProt consortium to represent a human protein. You can input the protein name, such as "vasopressin V2 receptor", to query our database. In case that you cannot remember the whole name, the protein name will be autocompleted in the drop-down box if it matches one in our database. The server will return all disease-related mutations associated with this protein as a result.

Search by a Gene Name

The gene name here presents the recommended name that is used to officially represent a human gene. You can input the gene namei, such as "AVPR2", to query our database. In case that you cannot remember the whole name, the gene name will be autocompleted in the drop-down box if it matches one in our database. The server will return all disease-related mutations associated with this gene as a result.

Search by a Disease Name

You can input the disease phenotype name of your interest, such as "diabetes insipidus", to query our database. In case that you cannot remember the whole name, the disease phenotype name will be autocompleted in the drop-down box if it matches one in our database. The server will return all disease related mutations associated with the disease as a result. If there is more than one protein in our database associated with the disease, you can select your proteins from a list to find all disease related mutations.

DATA OUTPUT (Explanation of Results)

Alignment of your Input and the Hit(Only shown when your query is a protein sequence)

Identifier: the UniProt accession number for the hit
Alignment graph: graphic view of the alignment between the query and hit. Gaps in the Query and Hit are neglected in this graph. Vertical red bars in the hit represents mutations in the corresponding position of the protein.
Length: the total sequence length of the Hit and Query.
Hit_from: BLAST alignment starting position for the Hit.
Hit_to: BLAST alignment ending position for the Hit.
Query_from: BLAST alignment starting position for the Query.
Query_to: BLAST alignment ending position for the Query.
Identity: identity between the Query and the Hit over the aligned region.
E-value: E-value from BLAST, a score indicating the likelihood of finding a protein in the same level of similarity as this hit in a random database of the same size as the Swiss-Prot database.

Graphic View of all Genetic Variants

A graphic view labeled by residue positions represents all the matches (of the native residues) of the mutations. The hit in M2SG is represented as a grey strip with its UniProt accession number. The red bars in the graph represent the mutations in coresponding positions. Details about the mutation will be shown when the mouse hovers over the red bar.

the Mutation Collection

The mutations are sorted by UniProt residue number.

M2SG ID: this is an unique ID (a number) created by M2SG database for each mutation.
Source Database: the source database where the mutation comes. (OMIM: mutations from OMIM without modification; Modified OMIM: mutations from OMIM with modification; SwissVar: mutations from the SwissProt Variant database)
Correct Mutation: the correct mutation that can be mapped to our reference sequence and the genomic loci.
Original Mutation: the mutation deposited in the original database, which may be incorrect.
OMIM mut_ID: the OMIM identifier (a 6-digit number) plus the allelic variant sub-identifier (a 4-digit number).
SwissVar ID: the variant identifier in the Swiss-Prot Variant database.
dbSNP ID: the variant identifier in the dbSNP database.
ENSG ID: the gene identifier in the Ensembl database.
Strand: the strand on which the protein is encoded. We use '+' for the forward strand, '-' for the reverse strand, and 'na' when this information is not available.
Chromosome: the chromosome number where the gene locates.
Chromosome Position and Allele: the corresponding SNP(s) in the genome for each mutation. The contents are usually in format like '12345C>T', where the number represents the chromosome location and 'C>T' stands for a change from C to T. When an amino acid substitution requires multiple SNPs, the record will be '12345-12347CCC>AGT' when these SNPs are in continuous positions or '12345C>G,12347C>T' when these SNPs are in discontinuous positions. If one mutation in a protein sequence can be attributed to several possible SNPs or several possible combinations of SNPs, these possible changes are separated by 'or' and the record will be like '12345C>G or 12345C>T'.
Associated disease: the disease name in the OMIM or SwissVar database.

Reference Protein Sequence (RS)

This protein sequence is used for mapping mutations. All positional information in that entry is based on this protein sequence.

M2SG: A MUTATION DATABASE VALIDATED AT PROTEOMIC AND GENOMIC LEVEL

M2SG - Mapping Mutations to Protein and Gene sequence

General Description:

DATA INPUT

DATA OUTPUT (Explanation of Results)