Machine Learning Scientist - Diffuse

Astera Institute • Full-time • Remote (Emeryville, CA, US) • $160,000 - $300,000 / year • 1d ago

About The Diffuse Project:

The Diffuse Project is dedicated to advancing our understanding of protein motion through the use of diffuse scattering, a signal in X-ray crystallography that is currently under-utilized or ignored but that will unlock our ability to measure protein dynamics. We are bringing together a diverse team of researchers, software developers, and beamline scientists to accomplish our mission. We are committed to Open Science principles of making all of our work, software, and data open and FAIR all along the way. The Diffuse Project is generously funded by and is part of the Astera Institute. You can read more about The Diffuse Project here, and Astera’s mission, vision, and programming here.

Position Summary:

The Diffuse Project is seeking a Machine Learning Scientist to join a multidisciplinary team developing machine learning methods to extract hidden functional states from experimental structural biology data (e.g., electron density, structure factors, diffraction patterns). We are particularly interested in applicants with deep expertise in generative modeling, reinforcement learning, or representation learning who are excited to apply these approaches to real-world structural data. We aim to develop a new generation of tools that treat experimental data as central inputs to model training, validation, and discovery. This role is part of a highly interdisciplinary team of machine learning experts, structural biologists, and biophysicists.

Key Responsibilities:

Design, train, and deploy open-source ML models that learn directly from experimental X-ray crystallography data (structure factors, electron density, diffraction patterns) for conformational ensemble modeling
Develop and benchmark metrics for conformational ensemble modeling and comparison against experimental data
Maintain data pre-processing and organization for ML models
Collaborate with domain scientists to integrate outputs with experimental pipelines and refine hypotheses in an iterative design–test–learn loop

Experience and Skills:

Strong background in linear algebra, statistics, probability theory, optimization methods, and deep learning architectures
Hands-on experience with generative models (e.g., diffusion, flow models), representation learning (e.g., contrastive learning, GNNs), or reinforcement learning (e.g., policy gradient, actor-critic)
Proven ability to build and debug large-scale ML algorithms
Familiarity with structural biology, protein dynamics, or physics-based modeling
Ability to work effectively in a multidisciplinary team environment
Familiarity with PyTorch

Education and Certifications:

MS or PhD in Data Science, Computer Science, Bioinformatics, Biophysics, Computational Chemistry, or a related field, with at least two years of experience working on ML models

Location:

This role is remote, with access to our office located in Emeryville, CA. Some travel may be required from time to time for in-person collaboration and work.

Compensation:

The posted salary range is based on location in the Bay Area. The successful candidate will receive a competitive compensation package, commensurate with their experience and location.

Compensation Range: $160K - $300K