Deep Learning for Enzyme Activity Prediction
Sequence-to-function learning pipeline using protein language models
MITACS Globalink Research Internship | May–August 2023
Laboratory for Metabolic Systems Engineering, University of Toronto
Supervisor: Dr. Krishna Mahadevan
Project Overview
Developed an end-to-end sequence-to-function learning pipeline for predicting enzyme activity from protein sequences, achieving state-of-the-art performance with an R-score of 0.71 on point-mutation datasets.
Key Achievements
Pipeline Development
- Designed a flexible, modular pipeline supporting multiple Protein Language Models (PLMs)
- Implemented end-to-end workflow from data preprocessing to prediction with minimal configuration
- Enabled easy experimentation with different model architectures and hyperparameters (a minimal config-driven sketch follows this list)
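As a rough illustration of what such a config-driven design could look like: the `EMBEDDERS` registry, `PipelineConfig` fields, and toy one-hot embedder below are assumptions for the sketch, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import torch
import torch.nn as nn

# Hypothetical registry mapping PLM names to embedding functions;
# the real pipeline's registry and signatures are not shown in this write-up.
EMBEDDERS: Dict[str, Callable[[List[str]], torch.Tensor]] = {}

def register_embedder(name: str) -> Callable:
    """Register an embedding function under a config-addressable name."""
    def wrap(fn: Callable) -> Callable:
        EMBEDDERS[name] = fn
        return fn
    return wrap

@register_embedder("onehot")
def onehot_mean(sequences: List[str]) -> torch.Tensor:
    """Toy stand-in embedder: mean one-hot composition over 20 amino acids."""
    aa = "ACDEFGHIKLMNPQRSTVWY"
    out = torch.zeros(len(sequences), len(aa))
    for i, seq in enumerate(sequences):
        for ch in seq:
            if ch in aa:
                out[i, aa.index(ch)] += 1.0
        out[i] /= max(len(seq), 1)
    return out

@dataclass
class PipelineConfig:
    plm: str = "onehot"  # which registered PLM to embed with
    lr: float = 1e-3
    epochs: int = 100

def run_pipeline(sequences: List[str], activities: torch.Tensor,
                 cfg: PipelineConfig) -> nn.Module:
    """Embed sequences with the configured PLM, then fit a linear head."""
    X = EMBEDDERS[cfg.plm](sequences)          # (N, D) pooled embeddings
    head = nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(head.parameters(), lr=cfg.lr)
    for _ in range(cfg.epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(X).squeeze(-1), activities)
        loss.backward()
        opt.step()
    return head

# Swapping PLMs or hyperparameters only touches the config object.
model = run_pipeline(["MKTAYIAK", "MKTVYIAK"],
                     torch.tensor([0.8, 0.3]), PipelineConfig(epochs=50))
```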
Model Performance
- Employed ESM-2 to generate high-quality sequence embeddings for three point-mutation datasets (an embedding sketch follows this list)
- Benchmarked multiple architectures: LSTM-VAEs, convolutional pooling, and self-attention mechanisms
- Achieved an R-score of 0.71 on the enzyme activity prediction tasks
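One way to produce such embeddings is to run ESM-2 through Hugging Face Transformers and mean-pool per sequence; the checkpoint size and pooling strategy below are assumptions, since the write-up does not pin down either.

```python
import torch
from transformers import AutoTokenizer, EsmModel

# Small ESM-2 checkpoint chosen for illustration; the project write-up
# does not specify which ESM-2 size was used.
CKPT = "facebook/esm2_t6_8M_UR50D"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = EsmModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def embed_sequences(sequences):
    """Return one mean-pooled embedding per sequence, shape (N, hidden_dim)."""
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state     # (N, L, D) per-residue states
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb = embed_sequences(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"])
print(emb.shape)  # torch.Size([1, 320]) for the t6_8M checkpoint
```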
Key Insights
- Identified the lack of structural information as the primary bottleneck limiting predictive performance
- This insight motivated subsequent thesis work on integrating structure with sequence via graph neural networks (GNNs)
- Demonstrated that attention-based pooling is competitive with the other heads at similar parameter counts (a pooling sketch follows this list)
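One common flavor of attention-based pooling is a learned scoring layer over per-residue embeddings; the exact head benchmarked in the project is not specified, so the module below is an illustrative sketch.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Attention pooling over per-residue embeddings via a learned score layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar attention logit per residue

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (N, L, D) residue embeddings; mask: (N, L), 1 for real tokens
        logits = self.score(x).squeeze(-1)                     # (N, L)
        logits = logits.masked_fill(mask == 0, float("-inf"))  # ignore padding
        weights = torch.softmax(logits, dim=-1).unsqueeze(-1)  # (N, L, 1)
        return (weights * x).sum(dim=1)                        # (N, D)

pool = AttentionPool(dim=320)
x, mask = torch.randn(2, 12, 320), torch.ones(2, 12)
print(pool(x, mask).shape)  # torch.Size([2, 320])
```

With only a single `dim -> 1` linear layer on top of the embeddings, a head like this stays in roughly the same parameter budget as a small convolutional pooling stack, which is the kind of like-for-like comparison the benchmark refers to.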
Technical Stack
Models: ESM-2, LSTM-VAE, Convolutional Networks, Self-Attention
Frameworks: PyTorch, Hugging Face Transformers
Datasets: Three enzyme point-mutation datasets with measured activity values
Impact
This work laid the foundation for:
- Understanding the importance of structural information in protein function prediction
- Developing modular pipelines for protein ML research
- Subsequent thesis work on sequence-structure fusion