TriCyp

Three-state cysteine classification across ECOD F70 representative domains — disulfide-bonded, metal-binding, or free thiol — combining ESM2 predictions with PDB structural evidence.

© 2026 Schaeffer & Cong Labs, UT Southwestern Medical Center

data · paper-v1 · refreshed 2026-05-06

About / Methods

How TriCyp classifies cysteine fates

TriCyp is the deposition for an ESM2-based three-state cysteine classifier — disulfide-bonded, metal-binding, or free thiol — applied across roughly 700,000 ECOD F70 representative domains. Section anchors below match the manuscript Methods subheadings.

Jump to

  • Pipeline
  • Training data
  • Model architecture
  • Threshold selection
  • Benchmarking
  • Software & data
  • License
  • Cite

Pipeline

For every cysteine in a domain, the pipeline runs a fine-tuned ESM2 classifier that emits a probability distribution over three states: free thiol, disulfide, and metal-binding. The probabilities are thresholded into a three-state call, and the per-cysteine outputs are aggregated into the per-domain and per-H-group summaries that back the dashboard, the family browser, and the H-group browser. PDB-source domains additionally carry structural ground truth from geometric Sγ–Sγ scanning, PDB SSBOND records, and PDB metal LINK records, which the benchmark page uses to score the classifier.

Fig 1 — pipeline overview
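
The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the TriCyp pipeline code: the function name, the state labels, and the domain identifiers are hypothetical, and the real summaries may carry more than simple counts.

```python
from collections import Counter, defaultdict

# State labels are illustrative; the deposited TSV may name them differently.
STATES = ("free", "disulfide", "metal")


def summarize_by_domain(calls):
    """Count three-state calls per domain.

    `calls` is an iterable of (domain_id, state) pairs, one per cysteine.
    Returns {domain_id: {"free": n, "disulfide": n, "metal": n}}.
    """
    counts = defaultdict(Counter)
    for domain_id, state in calls:
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        counts[domain_id][state] += 1
    # Emit every state for every domain so absent states show up as 0.
    return {d: {s: c[s] for s in STATES} for d, c in counts.items()}


# Toy input with made-up domain identifiers.
calls = [
    ("domA", "disulfide"),
    ("domA", "disulfide"),
    ("domA", "free"),
    ("domB", "metal"),
]
summary = summarize_by_domain(calls)
```

Passing H-group identifiers in place of domain identifiers would give the analogous per-H-group counts.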

Training data

The classifier is trained on PDB-source cysteines with structural ground truth: disulfide labels from PDB SSBOND records cross-checked by Sγ–Sγ geometric scanning at 2.5 Å, metal-binding labels from PDB LINK / SITE records where a cysteine sulfur coordinates a metal ion or metal-bearing cofactor, and free-thiol labels for cysteines not captured by either evidence stream. Held-out validation and test sets are drawn at the F-group level so the splits do not leak structural family between training and evaluation.
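
Two ideas in this paragraph lend themselves to a short sketch: pairing cysteines by Sγ–Sγ distance, and assigning whole F-groups to a single split. The code below rests on several assumptions: that "at 2.5 Å" denotes a maximum Sγ–Sγ distance, that hashing the F-group identifier is an acceptable stand-in for however the manuscript actually draws its splits, and that the split fractions are placeholders. All function names and identifiers are hypothetical.

```python
import hashlib
import math
from itertools import combinations


def scan_disulfides(sg_coords, cutoff=2.5):
    """Return pairs of cysteine ids whose SG atoms lie within `cutoff` angstroms.

    `sg_coords` maps a cysteine id to its SG (x, y, z) coordinates.
    Treating 2.5 as a maximum distance is an assumption.
    """
    pairs = []
    for (a, xyz_a), (b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            pairs.append((a, b))
    return pairs


def assign_split(f_group_id, val_frac=0.1, test_frac=0.1):
    """Map an F-group id to 'train', 'val', or 'test' by hashing the id.

    Every cysteine inherits the split of its F-group, so no F-group can
    appear in more than one split. The fractions are placeholders.
    """
    digest = hashlib.sha256(f_group_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Toy coordinates: C10 and C25 sit 2.04 angstroms apart, C40 is far away.
sg = {"C10": (0.0, 0.0, 0.0), "C25": (2.04, 0.0, 0.0), "C40": (10.0, 0.0, 0.0)}
bonded = scan_disulfides(sg)
```

Because the split is a pure function of the F-group identifier, re-running the assignment never moves a family between training and evaluation.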

Model architecture

The published 3-state classifier is an ensemble of five ESM2 checkpoints (best_modelA.pth … best_modelE.pth), each fine-tuned from the same ESM2 base with a per-cysteine softmax head over the three classes. Inference averages per-class probabilities across the ensemble. The full source — cys3state — is available alongside the model weights; see Downloads.
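
The averaging step can be illustrated with a small standard-library sketch. It does not load the deposited checkpoints or call cys3state; the logits are made-up numbers standing in for the five ensemble members, and the class order is an assumption.

```python
import math

CLASSES = ("free", "disulfide", "metal")  # class order is an assumption


def softmax(logits):
    """Numerically stable softmax over one cysteine's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(per_model_probs):
    """Average per-class probabilities across ensemble members."""
    n = len(per_model_probs)
    if n == 0:
        raise ValueError("need at least one model output")
    return [sum(p[i] for p in per_model_probs) / n for i in range(len(CLASSES))]


# Made-up logits for a single cysteine from five hypothetical members.
member_logits = [
    [0.2, 3.1, -1.0],
    [0.0, 2.8, -0.5],
    [0.4, 3.5, -1.2],
    [0.1, 2.9, -0.8],
    [0.3, 3.0, -1.1],
]
member_probs = [softmax(row) for row in member_logits]
avg_probs = ensemble_average(member_probs)
```

Averaging probabilities rather than logits keeps the result a valid distribution: the averaged values still sum to 1.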

Threshold selection

Operating thresholds were tuned on the held-out validation set to hit a fixed per-task precision target. The published 3-state call on TriCyp uses P(Met) ≥ 0.972 for metal-binding and P(Dis) ≥ 0.742 for disulfide; otherwise a cysteine is called free thiol. Raw per-class probabilities for every cysteine remain in the per-cysteine TSV so users can re-threshold for their own workflows.
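
The decision rule above, and the re-thresholding it allows, can be sketched as follows. The two default cutoffs are the published values; the function name and the order in which the classes are checked are assumptions. With the published cutoffs the order cannot matter, since 0.972 + 0.742 exceeds 1 and the per-class probabilities sum to 1, but it can matter once users lower the thresholds.

```python
MET_THRESHOLD = 0.972  # published P(Met) cutoff
DIS_THRESHOLD = 0.742  # published P(Dis) cutoff


def call_state(p_met, p_dis, met_thr=MET_THRESHOLD, dis_thr=DIS_THRESHOLD):
    """Turn per-class probabilities into a three-state call.

    Metal-binding is checked first; that precedence is an assumption and
    only matters when custom thresholds let both conditions hold.
    """
    if p_met >= met_thr:
        return "metal"
    if p_dis >= dis_thr:
        return "disulfide"
    return "free"


# Published thresholds.
default_call = call_state(p_met=0.90, p_dis=0.05)
# A user-chosen, more permissive metal threshold (hypothetical value).
relaxed_call = call_state(p_met=0.90, p_dis=0.05, met_thr=0.85)
```

Here the same cysteine is called free thiol under the published cutoffs and metal-binding under the relaxed one, which is the kind of trade-off the raw probabilities in the TSV let users explore.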

Benchmarking

The full ROC, PR, and threshold-tuning panels and the metal-type stratified strip live on the Benchmark page (paper Fig 2 and Fig S1). On the held-out v2 (zinc-rebalanced) benchmark, all three metal-binding tools (ESM2-3state, LMetalSite, and GPSite) score in the same band on the metals they were trained for (Zn / Ca / Mg / Mn, AUROC 0.994–0.996). ESM2-3state's residual advantage shows up specifically on iron coordination, where iron-stratum AUROC reaches 0.993 for ESM2-3state versus 0.917 for LMetalSite and 0.877 for GPSite. This is a training-coverage difference, not an architectural one: the specialist tools were not designed to predict iron-coordinating cysteines.
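
For readers unfamiliar with stratified AUROC, the sketch below shows one way a per-metal score could be computed. It is an illustration under assumptions, not the benchmark code: it assumes each stratum's positives are scored against a shared pool of non-metal negatives, and the records are toy numbers unrelated to the TriCyp results.

```python
def auroc(labels, scores):
    """AUROC as the chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_auroc(records):
    """Per-stratum AUROC from a list of (stratum, label, score) tuples.

    Positives are grouped by stratum (for example metal type); all
    negatives are shared across strata. That pairing is an assumption.
    """
    negatives = [s for _, y, s in records if y == 0]
    by_stratum = {}
    for stratum, y, s in records:
        if y == 1:
            by_stratum.setdefault(stratum, []).append(s)
    return {
        stratum: auroc([1] * len(pos) + [0] * len(negatives), pos + negatives)
        for stratum, pos in by_stratum.items()
    }


# Toy records: (metal stratum or None, is_metal_binding, predicted P(Met)).
records = [
    ("Zn", 1, 0.99), ("Zn", 1, 0.95),
    ("Fe", 1, 0.90), ("Fe", 1, 0.20),
    (None, 0, 0.30), (None, 0, 0.05),
]
per_metal = stratified_auroc(records)
```

The pairwise loop is quadratic and only suitable for small illustrations; a rank-based implementation would be the natural choice at benchmark scale.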

Software & data availability

  • Predictor source: cys3state repository — see the link on the Downloads page.
  • Model weights: five ESM2 checkpoints, deposited on Zenodo alongside the per-cysteine TSV.
  • Per-cysteine TSV: canonical full dump (one row per classified cysteine across all F70 representative domains), regenerated nightly with SHA-256 sidecars. Direct download on the Downloads page.
  • Figure data: one CSV per main and supplementary figure, mirroring the manuscript's paper/figure_data/ exports.
  • REST API: read-only JSON endpoints for domain, family, H-group, and search lookups; documented on the Downloads & API page.
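
A downloaded TSV could be checked against its SHA-256 sidecar along these lines. The sidecar layout is an assumption (a hex digest as the first whitespace-separated token, as produced by the sha256sum utility), and the function names are hypothetical; consult the Downloads page for the actual file names and format.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_against_sidecar(data_path, sidecar_path):
    """Compare a file's SHA-256 with the first token of its sidecar file.

    Assumes the sidecar starts with the hex digest, optionally followed
    by a file name.
    """
    with open(sidecar_path, "r", encoding="utf-8") as fh:
        expected = fh.read().split()[0].lower()
    return sha256_of_file(data_path) == expected
```

Reading in chunks keeps memory use flat, which matters for a dump with one row per classified cysteine.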

License

TriCyp data is released under CC-BY-4.0: free reuse with attribution to the manuscript. The predictor source code (cys3state) carries its own license; check the repository for terms before redistribution.

How to cite

Please cite "Classification of cysteine fates in structure predictions using a protein language model" (Yuan, Durham, Cong, Schaeffer, 2026). Preprint DOI pending.

BibTeX

@article{yuan_tricyp_2026,
  title = {Classification of cysteine fates in structure predictions using a protein language model},
  author = {Yuan and Durham and Cong and Schaeffer},
  year = {2026},
  journal = {bioRxiv},
  note = {TriCyp companion site: https://tricyp.swmed.edu},
}

RIS

TY  - JOUR
TI  - Classification of cysteine fates in structure predictions using a protein language model
AU  - Yuan
AU  - Durham
AU  - Cong
AU  - Schaeffer, Dustin
PY  - 2026
JO  - bioRxiv
ER  -