
Scientist AI Twins

MIT class project on personalized AI agents modeled after real scientists, grounded in their top-cited papers. Two twins in conversation: one proposes, one challenges. Toward automated digital science twins as a primitive for hypothesis generation.

Researcher
AI · agents · research

Can we cross-validate scientific hypotheses by simulating discussions between digital twins of real scientists? That was the question I spent a semester on for AM³L at MIT — and the answer is partially, with caveats.

The problem

Human researchers face limitations. They are pulled across reading, mentoring, meetings, presentations — leaving little room to actually discuss research with colleagues and chase new directions. They also can't remember every detail of every paper, including their own.

The proposed solution

If we can capture a scientist's idiosyncratic expertise in an AI agent — equipped with their published knowledge — then pairs or groups of those agents could interact, discuss, and propose hypotheses continuously and efficiently. Simulating expert interactions could accelerate finding research gaps in mechanics and materials.

My original proposal was more ambitious: multi-agent crowdsourcing, where a first set of agents propose hypotheses and a second set discusses and votes on them — a small in-silico peer review. I ran out of time to build the voting layer and got more interested in the personalization itself, so the final implementation scoped down to pairs of twins in conversation.

What got built

For each scientist, I assembled a small corpus and indexed it into a vector database via LlamaIndex:

  • The 20 most-cited papers of all time, from Google Scholar
  • The 5 most-cited recent papers (2019–2024), to offset the citation-accrual lag that biases raw counts toward older work
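To make the indexing step concrete, here is a minimal, self-contained sketch of what the vector database is doing under the hood. It uses a toy bag-of-words "embedding" and cosine similarity in place of the learned embeddings LlamaIndex would actually use, and the corpus chunks are hypothetical placeholders, not real paper text:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real pipeline used learned
    # embeddings via LlamaIndex's vector index instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical chunks from one scientist's top-cited papers.
corpus = [
    "hierarchical structure of spider silk and its strength",
    "machine learning for protein material design",
    "fracture mechanics of bioinspired composites",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query; the top hits
    # get injected into the agent's prompt as grounding context.
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("how strong is spider silk"))
```

The same retrieve-then-prompt pattern runs before every conversational turn, which is what keeps each twin anchored to its own scientist's published work.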

Two RAG-grounded agents then enter conversation:

  • Agent 1 offers detailed ideas and explanations
  • Agent 2 asks challenging, adversarial questions
  • A third agent summarizes the key findings at the end
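The three-role loop above can be sketched in a few lines. This is a schematic, not the actual implementation: `llm` is a stub standing in for the RAG-grounded model call, and the role names are illustrative:

```python
# Sketch of the three-role conversation loop. `llm` is a stub; in the
# real system each call was a retrieval-grounded query against that
# scientist's paper index, with bioinspiredZephyr as the base model.
def llm(role: str, prompt: str) -> str:
    return f"[{role}] response to: {prompt[:40]}"

def run_conversation(topic: str, turns: int = 6) -> str:
    transcript = []
    message = topic
    for i in range(turns):
        if i % 2 == 0:
            # Agent 1: offers detailed ideas and explanations.
            message = llm("proposer", message)
        else:
            # Agent 2: pushes back with challenging questions.
            message = llm("challenger", message)
        transcript.append(message)
    # Agent 3: distills the exchange into key findings.
    return llm("summarizer", "\n".join(transcript))

print(run_conversation("biodegradable ceramic nanostructures"))
```

Note the fixed `turns` budget: as discussed below, 6–8 turns often cut conversations off just as they got interesting.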

The base model was bioinspiredZephyr, with sanity comparisons against GPT-4 and Llama 3.

Experiments

I built twins of six scientists — Markus Buehler, Julia Greer, Pattie Maes, Refik Anadol, Neri Oxman, and Hilary Hahn — and ran roughly ten to twelve pairings across them. Three highlights, chosen to show the range of what the conversations could sustain:

  • Markus Buehler × Julia Greer — same domain (bio-inspired, high-strength structural materials). Agreed quickly on biodegradable ceramic nanostructures and the scalability problems with additive manufacturing. Conclusion: not in-depth enough.
  • Markus Buehler × Pattie Maes — cross-domain (materials × HCI). Surface-level for most of the run; the genuinely interesting question — "Could we generate and test thousands of interface designs in a simulated environment?" — landed in the last turn, just as the conversation timed out.
  • Markus Buehler × Refik Anadol — expert × artist, moonshot framing. Returned reasonable directions (nature-inspired materials, multi-scale biological simulation, ML-driven material discovery) but looped on the same four ideas.

What I learned

  • RAG sharpens but narrows. With paper grounding, the agents become more technically grounded but also more rigid — sometimes too rigid to leap across domains.
  • GPT-4 was the best of the bunch. More creative and more technically precise — possibly because of the larger volume of paywalled scientific literature in its training corpus.
  • Adversarial framing backfired. The "challenger" agent often refused to cooperate where a real colleague would have built on the idea. Real discussions are mostly yes-and, not yes-but.
  • Conversation length matters. 6–8 turns was usually too few. The interesting questions tended to surface late — after the agents had exhausted the obvious.
  • Top-cited isn't the only signal. Worth experimenting with the long tail of moderately cited papers; sometimes the unsung work carries the actually novel idea.

Where it points

This was a step toward automated digital science twins — once the personalization mechanism is faithful enough, the twins become a primitive you can plug into automated pipelines for hypothesis generation, design exploration, and discovery. The orchestration isn't the hard part. Faithful personalization is. We're not there yet.

A few months later, with a different team, I extended these ideas into Society of Scientists — a multi-agent system for full grant-proposal generation that won first place at the AGI House × Sundai hackathon. Same primitive, broader application.

Further reading