Graph Neural Networks for Protein Inverse Folding
Improving ProteinMPNN through data augmentation and pre-training strategies
IPD Summer Research Fellowship | May - September 2022
Baker Lab, Institute for Protein Design, University of Washington
Under Nobel Laureate Dr. David Baker
Project Overview
Investigated and improved graph neural network architectures for protein inverse folding, demonstrating enhanced performance through AlphaFold augmentation and BERT-style pre-training.
Key Contributions
Model Implementation & Benchmarking
- Reproduced results from ProteinMPNN (Science, 2022) through complete implementation from scratch
- Compared performance with state-of-the-art models: MIF-ST and ESM
- Validated model architecture and training procedures against published benchmarks
- Created tutorials to explore and try out ProteinMPNN and other plugins
Performance Improvements
Data Augmentation Strategy:
- Augmented training set with AlphaFold-predicted structures
- Trained with noisy backbone coordinates to improve robustness
- Demonstrated improved generalization to unseen protein families
Pre-training Approach:
- Implemented BERT-style pre-training for ProteinMPNN
- Objective: reconstruct masked sequences given input structure
- Showed competitive performance with pre-trained models on downstream tasks
Architectural Insights
- Validated sparsity assumption: Performance saturates with increasing edge connections in protein graphs
- Demonstrated efficiency of k-nearest neighbor graph construction over fully connected graphs
- Confirmed importance of local structural context over distant interactions
Technical Details
Architecture: Message Passing Neural Network (MPNN) encoder with self-attention and feedforward layers for neighborhood aggregation, autoregressive decoder for sequence generation
Key Techniques:
- Graph construction from backbone coordinates
- Edge features from pairwise distances and orientations
- Autoregressive decoding with temperature sampling
Frameworks: PyTorch, PyTorch Geometric
Impact
This research:
- Validated the effectiveness of ProteinMPNN architecture
- Introduced practical improvements through data augmentation
- Provided insights into graph sparsity and pre-training for protein design