My research develops vertically integrated inference systems for complex scientific data. By vertical integration, I mean designing inference architectures that maintain explicit control over the full generative chain: from physical source models and instrument response, through likelihood construction and noise modeling, to hierarchical population inference and scientific interpretation. When the full data-generating process is made explicit, operations can be performed at the level where they are well-defined, uncertainty propagates coherently, and historical approximations can be replaced with modern scalable algorithms.
Most scientific analysis pipelines evolve organically and become locally optimized: each component works in isolation, but the global structure is often fragmented. Derived quantities are treated as primary observables, and approximations compound across steps. My work proceeds in the opposite direction: first map out the full statistical data-generating process — the generative model — and then build inference as its principled inverse.
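To make the inversion concrete: in one schematic form (the notation here is illustrative, not tied to any specific pipeline), population-level hyperparameters $\Lambda$ generate per-source parameters $\theta_i$, which in turn generate the observed datasets $d_i$ through the instrument model, so the joint posterior factorizes as

$$
p(\Lambda, \{\theta_i\} \mid \{d_i\}) \;\propto\; p(\Lambda)\, \prod_i p(\theta_i \mid \Lambda)\, p(d_i \mid \theta_i).
$$

Inference inverts this chain as a whole: uncertainty in each dataset propagates through the source parameters up to the population level, rather than being collapsed into point estimates at intermediate steps.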
This perspective was shaped by six years as a senior data scientist at Microsoft, where vertical integration was not philosophy but operational practice. Each engagement began by understanding the full data-generating process — from telemetry and logging pipelines to production deployment. At petabyte scale, small statistical inconsistencies amplify, and approximate reasoning that "works" in a prototype can silently fail in production. When I returned to academia, I brought this architectural discipline back to astrophysics.
The clearest demonstration of this approach is MetaPulsar, a likelihood-level data combination framework for pulsar timing arrays. Rather than reconciling timing residuals — derived quantities — across collaborations, MetaPulsar operates at the likelihood level, preserving dataset-specific structure while maintaining global statistical consistency. This replaced a years-long manual process with a well-defined generative procedure. The framework was used by Yu & Allen (2026) in the first five-PTA stochastic background search based on public datasets.
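Schematically, likelihood-level combination means that each dataset keeps its own likelihood and noise parameters while sharing the gravitational-wave parameters (the symbols below are my shorthand, not MetaPulsar's actual interface):

$$
\mathcal{L}_{\mathrm{tot}}(\theta_{\mathrm{GW}}, \eta_1, \ldots, \eta_K) \;=\; \prod_{k=1}^{K} \mathcal{L}_k(\theta_{\mathrm{GW}}, \eta_k),
$$

where $\theta_{\mathrm{GW}}$ are the shared signal parameters and $\eta_k$ the noise parameters of dataset $k$. Each factor retains its collaboration's full noise model, the shared parameters are constrained jointly, and no residual-level reconciliation between datasets is needed.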
Pulsar timing arrays use networks of millisecond pulsars as a galaxy-scale gravitational-wave detector, sensitive to nanohertz-frequency signals from supermassive black hole binaries and other cosmological sources. The data analysis challenges are substantial: complex correlated noise, high-dimensional parameter spaces, weak-signal regimes, and datasets produced by multiple telescopes and collaborations worldwide.
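For context, the workhorse of PTA analysis is a multivariate Gaussian likelihood over the vector of timing residuals $\delta t$,

$$
p(\delta t \mid \theta) \;=\; \frac{\exp\!\left(-\tfrac{1}{2}\,\delta t^{\mathsf{T}} C(\theta)^{-1}\, \delta t\right)}{\sqrt{\det\!\left(2\pi\, C(\theta)\right)}},
$$

where the covariance $C(\theta)$ encodes white noise, per-pulsar red noise, and the inter-pulsar correlations (the Hellings–Downs pattern) induced by a gravitational-wave background. The size and dense correlation structure of $C$ across many pulsars and decades of observations are what make the computation demanding.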
My work in this area spans the foundational methods that underpin modern PTA science.
I am a member of the NANOGrav collaboration and the International Pulsar Timing Array, where I contribute to data analysis methodology and have played a role in key results, including the 2023 evidence for a gravitational-wave background.
A recurring theme in my work is the design of inference algorithms that scale to the complexity of real scientific problems.
Good inference requires more than good theory: the full chain from model design through implementation must be coherent. I am interested in numerical methods, efficient software, and computational strategies that allow sophisticated models to be applied to real datasets at scale.
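As a minimal sketch of what this coherence means at the numerical level (plain NumPy/SciPy, with illustrative names; real PTA codes exploit far more structure), here is a Gaussian log-likelihood of the form above evaluated through a Cholesky factorization, which avoids ever forming $C^{-1}$ or $\det C$ explicitly:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_loglike(residuals: np.ndarray, cov: np.ndarray) -> float:
    """Log-likelihood of zero-mean Gaussian residuals with covariance `cov`.

    Factors C = L L^T once, then uses triangular solves instead of an
    explicit inverse, and reads log det(C) off the factor's diagonal,
    avoiding the overflow a direct determinant would risk.
    """
    # Raises LinAlgError if C is not positive definite, which is
    # itself a useful model diagnostic.
    factor, lower = cho_factor(cov)

    # Quadratic form r^T C^{-1} r via triangular solves.
    quad = residuals @ cho_solve((factor, lower), residuals)

    # log det C from the Cholesky diagonal.
    logdet = 2.0 * np.sum(np.log(np.diag(factor)))

    n = residuals.size
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```

In production pipelines the covariance typically has low-rank-plus-diagonal structure that is exploited with Woodbury-type identities; the dense version here is only the conceptual baseline.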
Between 2016 and 2021, I worked at Microsoft on machine-learning problems including language models, text analysis, classification, and reinforcement learning. That period deepened my understanding of scalable modeling, production-quality software, and the operational discipline needed to make complex methods work reliably.
I am committed to bringing the best ideas from statistics, machine learning, and scientific computing into domain science in a way that is technically serious and scientifically meaningful. This includes both specific methodological contributions — such as embedding modern ML within principled generative frameworks — and broader efforts in mentoring, collaboration across fields, and developing shared methodological infrastructure.
Machine learning offers transformative opportunities for astronomy, but predictive performance alone is not sufficient in the exact sciences. Models must connect back to the underlying physics and provide controlled uncertainty quantification. My aim is to ensure that AI and ML methods are deployed as structured statistical tools within coherent generative models, rather than as black-box substitutes for reasoning.
For a full list of publications, see Publications, Google Scholar, or NASA/ADS.