If you experience difficulty downloading files from this page, please visit the alternative site.
This folder contains the final set of predicted PPIs generated in our study, along with additional metadata. There are two files in this folder. "final_predictions_80.tsv" includes all predictions obtained at an expected precision of 80%, while "final_predictions_90.tsv" includes predictions obtained at an expected precision of 90%. The latter is a subset of the former.
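The subset relationship between the two files can be checked with a few lines of Python. This is only a sketch: it assumes each file has a header row and that the first two tab-separated columns identify the two proteins, which may not match the real column layout. The toy identifiers (`P1`, `P2`, ...) and the column names are made up for illustration.

```python
import csv
import io

def read_pairs(handle, col_a=0, col_b=1):
    """Read unordered protein pairs from a TSV handle.

    Assumption: one header row, protein identifiers in columns col_a and col_b.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the (assumed) header row
    return {
        frozenset((row[col_a], row[col_b]))
        for row in reader
        if len(row) > max(col_a, col_b)
    }

# Toy stand-ins for final_predictions_80.tsv and final_predictions_90.tsv.
toy_80 = "protein_a\tprotein_b\nP1\tP2\nP2\tP3\nP3\tP4\n"
toy_90 = "protein_a\tprotein_b\nP2\tP1\nP3\tP4\n"

pairs_80 = read_pairs(io.StringIO(toy_80))
pairs_90 = read_pairs(io.StringIO(toy_90))

is_subset = pairs_90 <= pairs_80   # every 90% pair also appears at 80%
only_80 = pairs_80 - pairs_90      # pairs that pass only the 80% threshold
```

For the real files, the `io.StringIO` objects would be replaced with `open("final_predictions_80.tsv", newline="")` and its 90% counterpart. Pairs are stored as `frozenset` objects so that the order of the two proteins does not matter.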
The best 3D structural models for 29,246 of the 29,257 predicted PPIs are provided here. For each PPI, the best model was selected based on the presence of consistently predicted inter-protein contacts across multiple models. A 3D model is considered confident only if it contains inter-protein contacts (distance < 6 Å) with AlphaFold2 interaction probabilities above 0.5. We were unable to obtain such 3D models for the remaining 11 predicted PPIs.
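The confidence criterion can be illustrated with a minimal sketch. It assumes you already have an inter-protein distance matrix (in Å) and an interaction probability matrix of the same shape, with rows indexing residues of one protein and columns those of the other. How distances are measured (for example, which atoms are used) and whether more than one contact is required are not stated here, so the sketch treats a single qualifying residue pair as sufficient.

```python
DIST_CUTOFF = 6.0  # Å, threshold quoted in the text
PROB_CUTOFF = 0.5  # interaction probability threshold quoted in the text

def confident_contacts(dist, prob, dist_cutoff=DIST_CUTOFF, prob_cutoff=PROB_CUTOFF):
    """Return (i, j) residue index pairs with distance < cutoff and probability > cutoff."""
    hits = []
    for i, (dist_row, prob_row) in enumerate(zip(dist, prob)):
        for j, (d, p) in enumerate(zip(dist_row, prob_row)):
            if d < dist_cutoff and p > prob_cutoff:
                hits.append((i, j))
    return hits

def is_confident(dist, prob):
    """Assumption: one qualifying inter-protein contact is enough."""
    return len(confident_contacts(dist, prob)) > 0

# Toy 2 x 3 example (made-up numbers).
toy_dist = [[4.2, 7.5, 12.0],
            [5.9, 5.0, 3.1]]
toy_prob = [[0.81, 0.90, 0.10],
            [0.40, 0.55, 0.50]]

contacts = confident_contacts(toy_dist, toy_prob)
```

In the toy example only residue pairs (0, 0) and (1, 1) satisfy both thresholds. Pair (1, 2) is close in space but its probability of exactly 0.5 is not above the cutoff.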
Our PPI screening and structural modeling were performed using segments rather than full-length protein sequences. Each segment contains one or more domains, and the relative orientation of different segments within a protein is largely flexible. The definition of each segment can be found here.
Each protein pair may be modeled using multiple segment pairs, and the results are organized into folders named after the protein pair. The model files follow the naming format: segment1__segment2__model, where the model can be:
We provide three files for each segment pair:
These are the remaining confident models for each segment pair, excluding the best one provided under "Best 3D Structure Models" (see above). As before, we consider a 3D model to be confident only if it contains inter-protein contacts (distance < 6 Å) with AlphaFold2 interaction probabilities above 0.5. There are 0 to 10 confident models for each segment pair. For these models, we provide only the 3D structures in PDB format and the inter-residue predicted interaction probability matrices in NPZ format (see above).
These MSAs are in an A3M-like format. Compared to the standard A3M format, we inserted an additional sequence at the beginning, named "mask," to indicate the alignment quality at each position. In this "mask," an asterisk (*) indicates high-quality positions, and a dash (-) indicates low-quality positions (these are poorly conserved and thus cannot be reliably assembled from genomic data). We recommend using only the high-quality positions (marked with *), as we did in our work. Insertions relative to the human (query) sequence are represented by lowercase letters.
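As an illustration of how the "mask" record might be used, the sketch below reads an A3M-like text, drops lowercase insertions, and keeps only the columns marked with an asterisk. It assumes that the mask is the first record, that its header starts with `mask`, and that it has exactly one character per aligned (non-insertion) column. The toy alignment and sequence names are made up.

```python
def read_a3m(text):
    """Parse FASTA/A3M-style text into a list of (header, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def keep_high_quality(records):
    """Drop insertions (lowercase) and keep only columns where the mask is '*'."""
    mask_name, mask = records[0]
    if not mask_name.startswith("mask"):
        raise ValueError("expected the first record to be the mask")
    keep = [i for i, c in enumerate(mask) if c == "*"]
    filtered = []
    for name, seq in records[1:]:
        aligned = "".join(c for c in seq if not c.islower())
        if len(aligned) != len(mask):
            raise ValueError(f"{name}: aligned length differs from mask length")
        filtered.append((name, "".join(aligned[i] for i in keep)))
    return filtered

toy_a3m = """>mask
**-*
>query
MKTA
>genome_1
MrR-A
"""

filtered_msa = keep_high_quality(read_a3m(toy_a3m))
```

Here the third column is marked low-quality and is removed from every sequence, and the lowercase `r` in `genome_1` is treated as an insertion and dropped before the mask is applied.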
Each sequence corresponds to one draft genome or genomic dataset, and the NCBI accession number of the genome or dataset is used to name the sequence in the header. We also include the taxonomic information of each sequence in the header, following the format: [genus]:[family]:[order]:[class]:[phylum].
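The taxonomy string can be split into named ranks as sketched below. The rank order follows the format given above. The exact layout of the rest of the header is an assumption: the sketch guesses that the accession comes first and is separated from the taxonomy string by a single space, which should be checked against the actual files. The example header uses a placeholder accession.

```python
RANKS = ("genus", "family", "order", "class", "phylum")

def parse_taxonomy(field):
    """Split 'genus:family:order:class:phylum' into a dict keyed by rank."""
    parts = field.split(":")
    if len(parts) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} colon-separated ranks, got {len(parts)}")
    return dict(zip(RANKS, parts))

def split_header(header):
    """Assumed layout: '<accession> <taxonomy>' (hypothetical, verify against the files)."""
    accession, _, taxonomy = header.lstrip(">").partition(" ")
    return accession, parse_taxonomy(taxonomy)

accession, taxonomy = split_header(">genome_1 Pan:Hominidae:Primates:Mammalia:Chordata")
```

Parsed this way, sequences can be grouped or filtered by any rank, for example by keeping only those whose `taxonomy["class"]` equals `"Mammalia"`.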
Please note that because we assemble these sequences by aligning draft genomes or genomic reads to human proteins, insertions present in other species relative to the human sequence are often missed. Similarly, gaps in the MSAs may not represent deletions relative to the human sequence; they could result from alignment failures or incompleteness in the genomic dataset.
Building on the omicMSAs for full-length human proteins (see description above), these MSAs correspond to segments of human proteins used in our work. We split larger proteins into multiple segments and excluded the "low-quality" positions from these segments. The definition of each segment can be found in segment_def. These MSAs are in an A3M-like format. Insertions relative to the human (query) sequence are represented as lowercase letters.
Singularity image for running RoseTTAFold2-PPI, which supports our code deposited at: https://github.com/CongLabCode/RoseTTAFold2-PPI.
Trained weights for RoseTTAFold2-PPI, which can be used with our code deposited at: https://github.com/CongLabCode/RoseTTAFold2-PPI.
There are two files in this folder. The file "positives_and_negatives.tsv" contains the positive and negative controls used to benchmark different methods. The file "pairs_partitioned_by_interface_sizes" contains additional positive controls derived from PDB complexes, partitioned into different categories based on interface size, which correlates with binding affinity.
This training dataset for RoseTTAFold2-PPI (RF2-PPI) is derived from interacting chains in biological assemblies of PDB entries. It includes 3D structures and paired MSAs for each chain pair. A README file in the folder describes the contents of the dataset.
This training dataset for RF2-PPI is derived from interacting domain pairs in AlphaFold models from the AlphaFold Protein Structure Database. It includes 3D structures for each domain and the paired MSAs for each domain pair. A README file in the folder describes the contents of the dataset.
Human protein pairs (among the 19,528 proteins included in our PPI screen) from various sources are provided in four files: (1) "PPI_database_pairs" contains candidate PPIs gathered from UniProt, BioGRID, and STRING physical interactions; (2) "STRING_genetic_pairs" includes genetically associated pairs based on STRING genetic interactions; (3) "same_locality_pairs" lists protein pairs that share subcellular locality as annotated by UniProt keywords; and (4) "unknown_locality_pairs" includes pairs involving proteins without known subcellular locality.
Coevolution between 189.4 million protein pairs was evaluated using direct coupling analysis (DCA) on omicMSAs (excluding low-quality positions), followed by Average Product Correction (APC). Each protein pair has one score, representing the maximum DCA score among all residue pairs between the two proteins. We applied DCA to all 190.7 million pairs (all-against-all among the 19,528 proteins included in our pipeline), but were unable to compute DCA scores for 1.3 million pairs due to lack of variation in the sequence alignments of at least one protein.
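The scoring step described above can be sketched in plain Python. The code applies the standard Average Product Correction, which subtracts from each raw score the product of its row mean and column mean divided by the overall mean, and then takes the maximum corrected score as the pair-level score. It does not run DCA itself, and it does not reproduce our pipeline: the text does not say whether APC is applied to the full matrix of the concatenated alignment or only to the inter-protein block, so the sketch simply corrects whatever matrix it is given. The toy matrices are made up.

```python
def apc(scores):
    """Average Product Correction: s_ij - (row_mean_i * col_mean_j) / overall_mean."""
    n_rows, n_cols = len(scores), len(scores[0])
    row_mean = [sum(row) / n_cols for row in scores]
    col_mean = [sum(scores[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_mean) / n_rows
    if overall == 0:
        return [list(row) for row in scores]  # nothing to correct
    return [
        [scores[i][j] - row_mean[i] * col_mean[j] / overall for j in range(n_cols)]
        for i in range(n_rows)
    ]

def pair_score(scores):
    """Pair-level score: maximum corrected score over all residue pairs."""
    return max(max(row) for row in apc(scores))

# A matrix that is a pure product of row and column effects is corrected to zero.
background_only = apc([[1.0, 2.0], [3.0, 6.0]])

# Raising one entry leaves a positive residual after correction.
toy_score = pair_score([[1.0, 2.0], [3.0, 7.0]])

# Number of unordered pairs among the 19,528 proteins.
n_pairs = 19528 * 19527 // 2
```

The pair count evaluates to 190,661,628, which agrees with the 190.7 million all-against-all pairs quoted above. Subtracting the 1.3 million pairs without DCA scores gives the 189.4 million scored pairs.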
Interaction probabilities between 47 million protein pairs were predicted by RF2-PPI based on omicMSAs (excluding low-quality positions). We applied RF2-PPI only to protein pairs that exhibited high DCA scores or had prior experimental evidence supporting their interactions. Each protein pair has one probability value, representing the maximum interaction probability among all residue pairs between the two proteins. We also indicate the source of each pair, defined as follows:
Interaction probabilities between 3.4 million protein pairs were predicted by AlphaFold2 based on omicMSAs (excluding low-quality positions). We applied AlphaFold2 only to protein pairs that showed high RF2-PPI interaction probabilities or had prior experimental evidence supporting their interactions. Each protein pair has a single probability value, representing the maximum interaction probability among all residue pairs between the two proteins. We also indicate the source of each pair, defined as follows: