Dataset Transformations & Training Dynamics

Overview

How does the structure of training data shape what models learn? We study learning dynamics from a data-centric perspective: how datasets evolve under transformations, and how their properties influence model behavior throughout training.

Using tools from optimal transport and dynamical systems, we analyze how data augmentation, filtering, and other transformations affect downstream performance, and how gradient flows can model dataset evolution.
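As a concrete illustration of the optimal-transport viewpoint, the Wasserstein distance can quantify how far a transformation such as filtering moves a data distribution. The sketch below is a minimal one-dimensional example using `scipy.stats.wasserstein_distance`; the synthetic mixture data and the filtering threshold are illustrative assumptions, not details from this page.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic 1-D "dataset": a mixture of two Gaussian clusters.
data = np.concatenate([rng.normal(-2.0, 0.5, 5000),
                       rng.normal(2.0, 0.5, 5000)])

# A simple filtering transformation: drop all samples below a threshold.
filtered = data[data > 0.0]

# 1-Wasserstein (earth mover's) distance between the empirical
# distributions before and after filtering.
shift = wasserstein_distance(data, filtered)
print(f"W1(original, filtered) = {shift:.3f}")
```

Because filtering removes the entire left cluster, roughly half the probability mass must be transported about four units to the right, so the distance comes out near 2.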

Key Questions

  • How do common data transformations (augmentation, filtering, mixing) affect learning outcomes?
  • What role does data structure play in phenomena like grokking, phase transitions, and emergent capabilities?
  • Can we predict how changes to training data will affect model behavior?

Methods & Tools

  • Wasserstein Gradient Flows: Modeling dataset evolution as flows in probability space
  • Equilibrium Models: Deep equilibrium architectures for processing distributional inputs
  • Training Dynamics Analysis: Understanding learning trajectories through a data-centric lens
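One way to make the gradient-flow picture concrete is a particle discretization: for a potential energy E(μ) = ∫ V dμ, the Wasserstein-2 gradient flow transports each sample along −∇V. The sketch below is a minimal illustrative example; the quadratic potential, step size, and particle count are assumptions for demonstration, not the methods used in the work above.

```python
import numpy as np

def wasserstein_flow_step(particles, grad_V, dt):
    """One explicit Euler step of the W2 gradient flow of E(mu) = ∫ V dmu.

    For a potential energy, the flow moves each particle along -∇V.
    """
    return particles - dt * grad_V(particles)

rng = np.random.default_rng(1)
particles = rng.normal(3.0, 1.0, size=1000)  # empirical "dataset"

grad_V = lambda x: x  # V(x) = x**2 / 2, so the flow contracts toward 0

for _ in range(200):
    particles = wasserstein_flow_step(particles, grad_V, dt=0.05)

print(f"mean after flow: {particles.mean():.4f}")
```

With this potential the particle cloud contracts geometrically toward the origin, mirroring how a dataset-evolution flow drives an empirical distribution toward a target.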
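The equilibrium-model bullet refers to architectures whose output is defined implicitly as a fixed point rather than by a fixed stack of layers. A minimal sketch of the forward pass, using naive fixed-point iteration on a single tanh layer (the layer form, dimensions, and tolerance are illustrative assumptions, not the architectures studied here):

```python
import numpy as np

def deq_forward(x, W, b, tol=1e-6, max_iter=500):
    """Solve z = tanh(W @ z + x + b) by fixed-point iteration.

    Converges when the map is a contraction (e.g. small spectral norm of W).
    """
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(2)
d = 8
W = 0.1 * rng.standard_normal((d, d))  # scaled down to keep the map contractive
b = rng.standard_normal(d)
x = rng.standard_normal(d)

z_star = deq_forward(x, W, b)
print("fixed-point residual:",
      np.linalg.norm(z_star - np.tanh(W @ z_star + x + b)))
```

In practice, deep equilibrium models replace this naive iteration with accelerated root-finders and differentiate through the fixed point via the implicit function theorem.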

Selected Publications