Siddharth Betala

Bengaluru, India

betalas5@gmail.com

About Me

I am a Machine Learning Research Engineer at Entalpic, where I work with Victor Schmidt (CTO) and Alexandre Duval (CSO) on building a unified platform for materials discovery and pushing the boundaries of AI4Science to address climate change. My work involves developing generative models for inorganic crystal and material generation, designing evaluation methods and benchmarks for scientific generation tasks, and building LLM-based agentic systems to assist in synthesis planning and multi-modal scientific workflows.

Prior to joining Entalpic, I completed a dual degree in Data Science and Biological Engineering at the Indian Institute of Technology (IIT) Madras in July 2024. My thesis, “Enhancing Protein Fitness with Deep Learning,” was co-advised by Dr. Krishna Mahadevan (University of Toronto) and Dr. Nirav Bhatt (IIT Madras) and was nominated for the Best Thesis in Data Science Award at IIT Madras. The work was presented as a poster at MLCB 2024.

In the summer of 2023, I was a research intern in Dr. Mahadevan’s lab through the MITACS Globalink Research Fellowship, where I focused on fine-tuning large protein language models for enzyme activity prediction. Earlier, in 2022, I interned at Dr. David Baker’s lab at the University of Washington through the Institute for Protein Design Summer Fellowship, working on GNNs and transformer architectures for protein inverse folding.

Throughout my undergraduate years, I was actively involved in the institute’s technical community. I was part of the Inter-IIT Tech contingent for two years, led multiple projects with the Robotics Club as a strategist and team lead, and contributed to the award-winning iGEM team under the guidance of Dr. Karthik Raman.

Research Interests

I am interested in using AI to accelerate scientific discovery, particularly in biological design, materials science, and drug discovery. I’m excited about generative modelling, scientific agent systems, and building tools that improve iteration speed and steerability for experimental scientists.

I also care about data-centric AI, particularly in low-resource or underrepresented settings. My interests in multilingual and multimodal NLP, benchmarking, alignment techniques, and evaluation methods stem from a core belief: that impactful AI must be reliable, accessible, and inclusive. A significant part of my current and past work involves building or repairing datasets, identifying biases, and designing metrics that better reflect real-world generalization.

Open Science and Community Involvement

I actively contribute to open-source and open-science efforts across several communities:

🔹 LeMaterial (Entalpic ⚛️ x HuggingFace 🤗):

Led the development of LeMat-GenBench — the first evaluation benchmark for generative models for inorganic crystals.
Co-led LeMat-Synth, a multimodal agentic library for extracting synthesis recipes and reaction performance data from literature and plots, resulting in a high-quality curated dataset.

🔹 ML Collective:

Worked with Stefan Larson (Vanderbilt) on identifying label noise and train–test overlaps in the RVL-CDIP dataset, and on generating realistic synthetic data for PII replacement in the IIT-CDIP corpus.
Collaborated with Dr. Chirag Agarwal (UVA) and Dr. Guadalupe Gonzalez (Genentech) on studying out-of-distribution (OOD) performance as a proxy for explanation quality in GNNs.

🔹 Hugging Face Science:

Contributing to Deep Critical, an agentic architecture for autonomous scientific discovery workflows.

🔹 Cohere Labs:

Member of the Open Science Community.

News

Oct 19, 2025	🎉 LeMat-GenBench has been awarded a spotlight at the NeurIPS AI4Mat Workshop!
Sep 21, 2025	🎉 Both LeMat-GenBench and LeMat-Synth have been accepted to the AI4Mat Workshop at NeurIPS 2025! Excited to contribute to AI-driven materials discovery research.
Jan 06, 2025	Thrilled to join Entalpic as a Machine Learning Research Engineer! Excited to work on cutting-edge AI for accelerated materials discovery. 🚀
Dec 10, 2024	Our work on “Out-of-Distribution performance as a proxy metric for graph neural network explainers in the absence of ground-truth explanations” was presented at the WiML Workshop @ NeurIPS 2024! Grateful to collaborate with Guadalupe Gonzalez and Chirag Agarwal. Thanks to Guadalupe for presenting our work!
Sep 25, 2024	Delighted to announce that our shared task submission “Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning” has been accepted for poster presentation at WMT 2024!