1. What AI capabilities must we measure?
I’m interested in identifying AI capabilities that matter, grounding them in literature from the human sciences, and breaking them down into constructs that are theoretically meaningful and machine-measurable.
2. How do we measure them rigorously?
I’m interested in working with annotators, designing interfaces, understanding disagreement, and building human-in-the-loop processes to create benchmarks and evaluation datasets. I apply psychometrics (item response theory (IRT), validity, reliability, and measurement theory) to evaluate whether our systems capture what they are intended to measure.
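As a minimal sketch of the kind of psychometric tooling this involves, the two-parameter logistic (2PL) IRT model gives the probability that a respondent (human or model) with ability θ answers an item correctly, given the item's discrimination a and difficulty b. The function name and parameter values below are illustrative, not from any specific benchmark.

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT item response function: probability of a correct
    response given ability theta, item discrimination a, and
    item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty (theta == b), the predicted
# probability of a correct response is exactly 0.5, regardless of a.
p = irt_2pl(theta=0.0, a=1.5, b=0.0)  # -> 0.5
```

In practice, item parameters would be estimated from a response matrix (e.g., via marginal maximum likelihood) rather than set by hand; this sketch only shows the response function that such fits optimize.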
3. Do systems built to demonstrate these capabilities have real-world impact?
Once we define what to measure and build rigorous evaluation methods, I draw on social-science research methods to test whether these signals hold in real-world settings, linking evaluations to deployment through causal and experimental designs.
Looking for
Researcher or research TPM roles on benchmarks, evaluations, and human data teams.
Skills & tools
Programming: Python, R, SQL, Jupyter, Git, LaTeX, Cursor
Data / ML: NumPy, Pandas, Scikit-Learn, PyTorch, Transformers, NLTK
AI: Benchmark & Evaluation Design; Post-Training (QLoRA, RLHF/DPO); RAG; Prompt Engineering; Human & LLM Annotation
Research Methods: Item Response Theory; Dimensionality Reduction (Factor Analysis/PCA); Reliability & Validity; Regression Analysis; Experimental & Quasi-Experimental Designs; Structural Equation Modelling
news
| May 05, 2026 | Returning to Stanford AI4ALL 2026 and the Stanford AIMI Summer Research Internship 2026 as an NLP mentor for high school students |
|---|---|
| Apr 30, 2026 | Two papers accepted at 21st BEA @ ACL 2026: A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark (with Xinman Liu & Teah Shi), and Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository (with collaborators) |
| Apr 29, 2026 | ClaimCLAIRE: A Trust-Aware Multi-Component Fact-Checking Agent for Open-World Claims (with Xinman Liu) accepted for oral presentation at 6th TrustNLP @ ACL 2026 |
| Apr 15, 2026 | ConvoLearn dataset (40K turns, post-training data for dialogic alignment of LLM tutors) released on Hugging Face |
| Apr 12, 2026 | Attending ASU+GSV 2026 in San Diego on Stanford GSE scholarship |