Three-state cysteine classification across ~700,000 ECOD representative domains.
Companion deposition site for "Classification of cysteine fates in structure predictions using a protein language model" (Yuan, Durham, Cong, Schaeffer). Preprint link pending.
Cysteine fate breakdown stratified by classification source. PDB-geom uses geometric ground truth (Sγ–Sγ disulfides plus PDB metal LINK records); PDB-ESM and AFDB-ESM are ESM2-3state predictions on PDB-source and AFDB-source F70 representatives, respectively.
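As a rough illustration of what a geometric label of this kind could look like, the sketch below assigns each cysteine to disulfide, metal, or free from Sγ coordinates and a set of metal-linked residues. It is not the deposition's pipeline. The 2.5 Å cutoff, the function name `label_cysteines`, the input shapes, and the rule that a disulfide call takes precedence over a metal call are all assumptions made for the example.

```python
import math

# Assumed cutoff in angstroms; the caption does not state the threshold used.
SG_SG_CUTOFF = 2.5


def label_cysteines(sg_coords, metal_linked, cutoff=SG_SG_CUTOFF):
    """Assign a three-state label to each cysteine (hypothetical helper).

    sg_coords: dict mapping residue id -> (x, y, z) of the SG atom.
    metal_linked: set of residue ids named in metal LINK records,
        assumed to have been parsed elsewhere.
    Returns a dict mapping residue id -> 'disulfide', 'metal', or 'free'.
    """
    labels = {res: "free" for res in sg_coords}
    ids = sorted(sg_coords)
    # Pairwise SG-SG distances; any pair within the cutoff is a disulfide.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(sg_coords[a], sg_coords[b]) <= cutoff:
                labels[a] = labels[b] = "disulfide"
    # Metal label applies only to cysteines not already paired (assumed rule).
    for res in metal_linked:
        if labels.get(res) == "free":
            labels[res] = "metal"
    return labels
```

Called with four SG atoms, two of them about 2.05 Å apart and one listed in `metal_linked`, the function returns one disulfide pair, one metal-bound cysteine, and one free thiol.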
Eukaryotic domains contribute disproportionately to total cysteine count. Bacterial and archaeal domains are cysteine-poor by comparison; the gap between domain fraction and cysteine fraction is the headline of this panel.
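The gap between domain fraction and cysteine fraction can be expressed as a ratio per kingdom, where a value above 1 means a kingdom holds more of the cysteines than of the domains. The sketch below shows the arithmetic only. The function names are hypothetical, and the counts in the usage lines are placeholders chosen for easy arithmetic, not values from this deposition.

```python
def fractions(counts):
    """Convert raw counts per kingdom into fractions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def cysteine_enrichment(domain_counts, cysteine_counts):
    """Ratio of cysteine fraction to domain fraction for each kingdom."""
    dom = fractions(domain_counts)
    cys = fractions(cysteine_counts)
    return {k: cys[k] / dom[k] for k in domain_counts}


# Placeholder counts for illustration only, not results.
example_domains = {"Eukaryota": 50, "Bacteria": 40, "Archaea": 10}
example_cysteines = {"Eukaryota": 700, "Bacteria": 250, "Archaea": 50}
example_ratios = cysteine_enrichment(example_domains, example_cysteines)
```

With these placeholder inputs the eukaryotic ratio is above 1 and the bacterial and archaeal ratios are below 1, which is the pattern the panel describes qualitatively.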
Per-kingdom three-state classification rates. Eukaryotic cysteines are enriched for disulfides; archaeal cysteines retain a higher metal-binding rate consistent with iron–sulfur cluster prevalence in archaea.
Eukaryotic-only subcellular gradient: extracellular and secretory-pathway compartments are disulfide-rich; cytoplasmic, nuclear, and mitochondrial compartments are metal-binding-rich. Source: UniProt Subcellular Location annotations cross-referenced to ECOD F70 representative domains.
Distribution of max-class probability across all classified cysteines. The long right tail in disulfide and metal-binding classes shows that the positive predictions are made with high confidence; free-thiol calls dominate the lower-probability bins where the model is appropriately uncertain.
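For readers who want to rebuild a distribution like this from per-cysteine probabilities, here is a minimal sketch of the max-class computation and binning. The class order in `CLASSES`, the function names, and the choice of ten equal-width bins are assumptions for the example. The deposition's actual file layout and binning are not specified here.

```python
# Assumed ordering of the three probabilities in each row.
CLASSES = ("free", "disulfide", "metal")


def max_class(probs):
    """Return (label, probability) for the highest-probability class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return CLASSES[idx], probs[idx]


def histogram_by_class(rows, n_bins=10):
    """Count max-class probabilities per predicted class in equal-width bins.

    rows: iterable of 3-tuples of probabilities ordered as CLASSES.
    Returns a dict mapping class label -> list of n_bins counts over [0, 1].
    """
    counts = {c: [0] * n_bins for c in CLASSES}
    for probs in rows:
        label, p = max_class(probs)
        # A probability of exactly 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[label][b] += 1
    return counts
```

Plotting each class's list of counts as a separate series gives a per-class view of how confident the calls are.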
Source-type breakdown across F70 representative domains. PDB-source domains have experimental coverage; AFDB and other predicted-source domains rely on ESM2-3state predictions only.