Download Files

If you experience difficulty downloading files from this page, please visit the alternative site.

Final predictions

This folder contains the final set of predicted PPIs generated in our study, along with additional metadata. There are two files in this folder. "final_predictions_80.tsv" includes all predictions obtained at an expected precision of 80%, while "final_predictions_90.tsv" includes predictions obtained at an expected precision of 90%. The latter is a subset of the former.
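
The subset relationship between the two files can be checked with a short script. This is a minimal sketch, not part of the released pipeline: it assumes each TSV has a header row and that the first two columns hold the two protein identifiers, which should be verified against the actual files.

```python
import csv


def load_pairs(path):
    """Return the set of unordered protein pairs found in a TSV file.

    Assumptions (not stated on this page): the first line is a header,
    and the first two columns are the two protein identifiers.
    """
    pairs = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) >= 2:
                # frozenset makes (A, B) and (B, A) compare equal
                pairs.add(frozenset(row[:2]))
    return pairs


def is_subset(path_90, path_80):
    """True if every pair in the 90% file also appears in the 80% file."""
    return load_pairs(path_90) <= load_pairs(path_80)
```

For example, `is_subset("final_predictions_90.tsv", "final_predictions_80.tsv")` would be expected to return True if the column assumption holds.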

best_models.tar.gz
Size: 63 GB

The best 3D structural models for 29,246 out of 29,257 predicted PPIs are provided here (selected based on the presence of consistently predicted inter-protein contacts across multiple models). A 3D model is considered confident only if it contains inter-protein contacts (distance < 6 Å) with AlphaFold2 interaction probabilities above 0.5. We were unable to obtain such 3D models for 11 predicted PPIs.
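
The confidence criterion above can be expressed as a simple check on two matrices. The sketch below assumes the inter-protein distance matrix has already been computed from the model (this page does not say which atoms are used for the distance) and that it has the same L1 × L2 shape as the interaction probability matrix. The function name is hypothetical.

```python
import numpy as np


def has_confident_contact(distances, probabilities, max_dist=6.0, min_prob=0.5):
    """Return True if any residue pair is both close and confidently predicted.

    distances:     L1 x L2 inter-protein distances in angstroms (precomputed;
                   the atom choice is an assumption left to the caller).
    probabilities: L1 x L2 predicted interaction probabilities.
    """
    distances = np.asarray(distances, dtype=float)
    probabilities = np.asarray(probabilities, dtype=float)
    if distances.shape != probabilities.shape:
        raise ValueError("distance and probability matrices must have the same shape")
    # A contact counts only if distance < 6 A and probability > 0.5 hold together
    return bool(np.any((distances < max_dist) & (probabilities > min_prob)))
```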

Our PPI screening and structural modeling were performed using segments rather than full-length protein sequences. Each segment contains one or more domains, and the relative orientation of different segments within a protein is largely flexible. The definition of each segment can be found in segment_def.

Each protein pair may be modeled using multiple segment pairs, and the results are organized into folders named after the protein pair. Model files are named segment1__segment2__model, where model is one of:

  • AF: a model built using AlphaFold2 model 3
  • AF1-5: five models built by ColabFold using the AlphaFold2 network
  • AFMM1-5: five models built by ColabFold using the AlphaFold-Multimer network
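
The naming format above can be split back into its parts with a few lines of code. This sketch assumes that "AF1-5" and "AFMM1-5" stand for the labels AF1 to AF5 and AFMM1 to AFMM5, and that segment names themselves never contain a double underscore. The segment names in the usage note are hypothetical.

```python
import os
import re

# Accepted model labels: AF, AF1..AF5, AFMM1..AFMM5 (interpretation of the list above)
MODEL_LABEL = re.compile(r"^(AFMM[1-5]|AF[1-5]|AF)$")
KNOWN_EXTENSIONS = (".pdb", ".npz", ".contacts")


def parse_model_name(filename):
    """Split 'segment1__segment2__model[.ext]' into (segment1, segment2, model)."""
    name = os.path.basename(filename)
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    parts = name.split("__")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    segment1, segment2, model = parts
    if not MODEL_LABEL.match(model):
        raise ValueError(f"unknown model label: {model!r}")
    return segment1, segment2, model
```

With hypothetical segment names, `parse_model_name("segA__segB__AFMM3.pdb")` returns `("segA", "segB", "AFMM3")`.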

We provide three files for each segment pair:

  • *.pdb: the predicted 3D structure
  • *.npz: a matrix of predicted inter-residue interaction probabilities of shape L1 × L2, where L1 and L2 are the lengths of the first and second segments, respectively
  • *.contacts: interacting residue pairs between the two proteins. These contacts are residue pairs with a distance below 8 Å and a predicted interaction probability above 0.6. Each file contains three columns: the first residue number (numbered relative to the full-length protein), the second residue number (numbered relative to the full-length protein), and the interaction probability predicted by AlphaFold2.
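
A *.contacts file can be read with plain Python. This is a minimal sketch that assumes the three columns are separated by whitespace. Lines that do not parse as two integers and a number are skipped, so a header line, if present, is ignored.

```python
def read_contacts(path):
    """Return a list of (residue1, residue2, probability) tuples.

    Assumption: columns are whitespace-separated, in the order described
    above. Residue numbers refer to the full-length proteins.
    """
    contacts = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue
            try:
                res1, res2 = int(fields[0]), int(fields[1])
                prob = float(fields[2])
            except ValueError:
                continue  # header or other non-numeric line
            contacts.append((res1, res2, prob))
    return contacts


def top_contacts(contacts, n=10):
    """Return the n contacts with the highest predicted probability."""
    return sorted(contacts, key=lambda c: c[2], reverse=True)[:n]
```

The name of the array stored inside each *.npz file is not stated here. With NumPy, `numpy.load(path).files` lists the array names an archive contains.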

other_models.tar.gz
Size: 166 GB

These are the remaining models for the predicted PPIs, excluding the best model provided in best_models.tar.gz (see above). As above, we consider a 3D model to be confident only if it contains inter-protein contacts (distance < 6 Å) with AlphaFold2 interaction probabilities above 0.5. Each segment pair has between 0 and 10 confident models. For these models, we provide only the 3D structures in PDB format and the inter-residue predicted interaction probability matrices in NPZ format (see above).

Input sequence alignments

These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named "mask," to indicate the alignment quality at each position. In this "mask," an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters.
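
Restricting an alignment to its high-quality positions could look like the sketch below. It assumes the "mask" record has exactly one character per column of the human (query) sequence, so that each sequence lines up with the mask once its lowercase insertions are removed. This layout should be confirmed on a real file.

```python
def read_a3m(path):
    """Read an A3M-like file into a list of (header, sequence) tuples."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:].strip(), []
            elif line:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def keep_high_quality(records):
    """Drop the mask record and keep only columns marked '*' in it."""
    if not records or not records[0][0].startswith("mask"):
        raise ValueError("first record is expected to be the 'mask' sequence")
    mask = records[0][1]
    keep = [i for i, flag in enumerate(mask) if flag == "*"]
    filtered = []
    for name, seq in records[1:]:
        # Lowercase letters are insertions relative to the query; remove them
        aligned = "".join(c for c in seq if not c.islower())
        if len(aligned) != len(mask):
            raise ValueError(f"{name}: length does not match the mask")
        filtered.append((name, "".join(aligned[i] for i in keep)))
    return filtered
```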

Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of each sequence in the header, following the format: [genus]:[family]:[order]:[class]:[phylum].
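
The taxonomy string can be split into named ranks as sketched below. The exact separator between the accession and the taxonomy string is not specified on this page. This example assumes the accession is the first whitespace-delimited token and the taxonomy is the last one, which may need adjusting. The accession in the usage note is a made-up placeholder.

```python
from typing import NamedTuple


class SequenceHeader(NamedTuple):
    accession: str
    genus: str
    family: str
    order: str
    tax_class: str
    phylum: str


def parse_header(header):
    """Parse '>accession genus:family:order:class:phylum' (assumed layout)."""
    tokens = header.lstrip(">").split()
    if len(tokens) < 2:
        raise ValueError(f"no taxonomy field found in header: {header!r}")
    ranks = tokens[-1].split(":")
    if len(ranks) != 5:
        raise ValueError(f"expected 5 colon-separated ranks, got {len(ranks)}")
    return SequenceHeader(tokens[0], *ranks)
```

For instance, `parse_header(">GCA_000000000.1 Pan:Hominidae:Primates:Mammalia:Chordata").family` gives `"Hominidae"`.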

Please note that because we assemble these sequences by aligning draft genomes or genomic reads to human proteins, insertions present in other species relative to the human sequence are often missed. Similarly, gaps in the MSAs may not represent deletions relative to the human sequence; they could result from alignment failures or incompleteness in the genomic dataset.

Building on the omicMSAs for full-length human proteins (see description above), these MSAs correspond to segments of human proteins used in our work. We split larger proteins into multiple segments and excluded the "low-quality" positions from these segments. The definition of each segment can be found in segment_def. These MSAs are in an A3M-like format. Insertions relative to the human (query) sequence are represented as lowercase letters.

Method development

SE3nv.sif
Size: 14 GB

Singularity image for running RoseTTAFold2-PPI, which supports our code deposited at: https://github.com/CongLabCode/RoseTTAFold2-PPI.

RF2-PPI.pt
Size: 310 MB

Trained weights for RoseTTAFold2-PPI, which can be used with our code deposited at: https://github.com/CongLabCode/RoseTTAFold2-PPI.

benchmarks.tar.gz
Size: 280 KB

There are two files in this folder. The file "positives_and_negatives.tsv" contains the positive and negative controls used to benchmark different methods. The file "pairs_partitioned_by_interface_sizes" contains additional positive controls derived from PDB complexes, partitioned into different categories based on interface size, which correlates with binding affinity.

PPI_training.tar.gz
Size: 226 GB

This training dataset for RF2-PPI is derived from interacting chains in biological assemblies of PDB entries. It includes 3D structures and paired MSAs for each chain pair. A README file in the folder describes the contents of the dataset.

DDI_training.tar.gz
Size: 330 GB

This training dataset for RF2-PPI is derived from interacting domain pairs in AlphaFold models from the AlphaFold Protein Structure Database. It includes 3D structures for each domain and the paired MSAs for each domain pair. A README file in the folder describes the contents of the dataset.

Key intermediate results

screened_pairs.gz
Size: 336 MB

Human protein pairs (among the 19,528 proteins included in our PPI screen) from various sources are provided in four files: (1) "PPI_database_pairs" contains candidate PPIs gathered from UniProt, BioGRID, and STRING physical interactions; (2) "STRING_genetic_pairs" includes genetically associated pairs based on STRING genetic interactions; (3) "same_locality_pairs" lists protein pairs that share a subcellular localization as annotated by UniProt keywords; and (4) "unknown_locality_pairs" includes pairs involving proteins without a known subcellular localization.

DCA_scores.zip
Size: 778 MB

Coevolution between 189.4 million protein pairs was evaluated using direct coupling analysis (DCA) on omicMSAs (excluding low-quality positions), followed by Average Product Correction (APC). Each protein pair has one score, representing the maximum DCA score among all residue pairs between the two proteins. We applied DCA to all 190.7 million pairs (all-against-all among the 19,528 proteins included in our pipeline), but were unable to compute DCA scores for 1.3 million pairs due to lack of variation in the sequence alignments of at least one protein.
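
To illustrate how one score per protein pair is obtained, the sketch below applies the commonly used form of APC, F_ij minus (row mean × column mean / overall mean), to a residue-by-residue coupling matrix and then takes the maximum over the inter-protein block. It is a simplified illustration, not the code used in the study: it assumes a square coupling matrix over the concatenated alignment of the two proteins and averages over all entries, including the diagonal.

```python
import numpy as np


def apc(raw):
    """Average Product Correction: raw_ij - mean_i * mean_j / mean_all."""
    raw = np.asarray(raw, dtype=float)
    row_mean = raw.mean(axis=1, keepdims=True)
    col_mean = raw.mean(axis=0, keepdims=True)
    # Note: an overall mean of zero would make this division undefined
    return raw - row_mean * col_mean / raw.mean()


def pair_score(coupling, len_a):
    """Maximum corrected coupling between residues of protein A and protein B.

    coupling: (len_a + len_b) x (len_a + len_b) raw DCA coupling scores.
    len_a:    number of alignment columns that belong to protein A.
    """
    corrected = apc(coupling)
    inter_block = corrected[:len_a, len_a:]  # rows from A, columns from B
    return float(inter_block.max())
```

A useful sanity check is that a matrix formed as an outer product of two vectors carries no pair-specific signal, so APC reduces it to zeros.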

RF2-PPI_scores.zip
Size: 444 MB

Interaction probabilities between 47 million protein pairs were predicted by RF2-PPI based on omicMSAs (excluding low-quality positions). We applied RF2-PPI only to protein pairs that exhibited high DCA scores or had prior experimental evidence supporting their interactions. Each protein pair has one probability value, representing the maximum interaction probability among all residue pairs between the two proteins. We also indicate the source of each pair, defined as follows:

  • DNS: de novo screen of pairs with shared subcellular localization
  • DNU: de novo screen of pairs involving proteins with unknown subcellular localization
  • PPI: pairs from PPI databases—BioGRID, STRING (physical), and UniProt
  • STR: genetically interacting pairs from STRING
  • NEG: negative controls used for accuracy estimation

AF_scores.zip
Size: 32 MB

Interaction probabilities between 3.4 million protein pairs were predicted by AlphaFold2 based on omicMSAs (excluding low-quality positions). We applied AlphaFold2 only to protein pairs that showed high RF2-PPI interaction probabilities or had prior experimental evidence supporting their interactions. Each protein pair has a single probability value, representing the maximum interaction probability among all residue pairs between the two proteins. We also indicate the source of each pair, defined as follows:

  • DNS: de novo screen of pairs with shared subcellular localization
  • DNU: de novo screen of pairs involving proteins with unknown subcellular localization
  • PPI: pairs from PPI databases—BioGRID, STRING (physical), and UniProt
  • STR: genetically interacting pairs from STRING
  • NEG: negative controls used for accuracy estimation
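
The source codes make it easy to summarize a score file by origin. The sketch below assumes the score files are tab-separated text with the source code in the last column. The actual column layout is not described on this page, so this should be checked against the files before use.

```python
import csv
from collections import Counter

# Source codes as defined in the lists above
SOURCE_LABELS = {
    "DNS": "de novo screen, shared subcellular localization",
    "DNU": "de novo screen, unknown subcellular localization",
    "PPI": "PPI databases (BioGRID, STRING physical, UniProt)",
    "STR": "genetically interacting pairs from STRING",
    "NEG": "negative controls",
}


def count_by_source(path):
    """Count rows per source code (assumed to be the last tab-separated column)."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Rows without a recognized code (e.g. a header) are ignored
            if row and row[-1] in SOURCE_LABELS:
                counts[row[-1]] += 1
    return counts
```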