I Built an AI Agent to Draft My Own NIH Grant
The architecture, four prompting patterns, and the open-source code.
The Specific Aims page is the rate-limiting step in NIH SBIR grant writing. It’s the first page a study-section reviewer reads, the page that determines whether they keep reading, and structurally the hardest page in the whole submission to write. For a medtech founder operating without a grant-writing department, it can take weeks of drafting and revision before it’s ready to go into a submission.
Most AI tools that claim to help with this are unsafe to use. They produce confident-sounding text built on hallucinated citations — papers that don’t exist, statistics that were invented, attributions to authors who never wrote them. For most uses of language models this is annoying. For an NIH submission it can end your funding trajectory before it starts.
I built an agent that addresses this directly. It produces publication-quality first drafts of NIH SBIR Phase I Specific Aims pages, grounded in live PubMed retrieval and refined through adversarial self-critique, in two to five minutes per run. It’s open source under Apache 2.0 at github.com/hippocrai/sbir-aims-agent.
This post walks through how it works, why each design choice matters, and what’s left for the human to do. It’s also the first technical release from HippocrAI — the open-source layer of AI tooling for medtech, built under physician oversight.
What this is, and what it isn’t
This is a writing-assistance tool. It produces a draft of an aims page that the human PI then verifies, revises, and authors as a submission. Used well, it cuts time-to-publishable-draft from days to hours.
This is not a medical device. It does not analyze patient data, does not make diagnostic or treatment recommendations, does not constitute clinical decision support. It produces administrative writing — grant text — and explicitly nothing more. The README states this in three places, and the design itself enforces it: there is no clinical input the agent will accept, only project-briefing markdown.
The separation matters because much of what fails in medical AI fails by overreach. The discipline of staying clearly within the administrative layer is part of what makes the tool releasable, and part of what HippocrAI is built around.
The agent, in one paragraph
You feed it a markdown briefing — device, clinical problem, competitive landscape, preliminary data, target NIH institute, funding ceiling, duration. The agent grounds the document in real literature by querying PubMed via NCBI’s E-utilities API and the NIH funding landscape via RePORTER. It drafts an aims page using a hardcoded canonical NIH structure. And it submits the draft to a separate Claude instance prompted to act as a brutal study-section chair — the critic returns specific weaknesses, the drafter revises, and the cycle continues until the critique stops finding substantive issues. The output is a single markdown file ready for human verification.
The architecture
The whole thing is around 360 lines of Python in a single file. No agent framework. No vector database. No orchestration layer. Just the Anthropic Messages API with tool use, two HTTP clients for PubMed and RePORTER, and a critique step that spawns a separate Claude call.
Two Claude calls run in distinct roles. The drafter is the main agent, running Claude Sonnet 4.5 in a multi-turn ReAct loop with three tools available. ReAct is the standard pattern: the model reasons about what to do next, calls a tool, observes the result, reasons again, and eventually concludes. The critic is a one-shot Claude call invoked from inside the drafter’s loop, with a different system prompt and no shared conversation context. It sees only the draft text.
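Stripped of the prompts, the drafter’s loop is only a few dozen lines. A minimal sketch, assuming the official anthropic Python SDK; the tool schemas and dispatch table are stubbed here, and the repo’s exact names and model id may differ:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_drafter(system_prompt: str, briefing: str, tools: list, dispatch: dict) -> str:
    """ReAct loop: reason, call a tool, observe, repeat, conclude."""
    messages = [{"role": "user", "content": briefing}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # illustrative model id
            max_tokens=8000,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No more tool calls: the final text blocks are the draft.
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn, then answer each tool call it made.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": dispatch[block.name](**block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```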
The three tools are search_pubmed, search_nih_reporter, and critique_draft. The first two hit live external APIs. The third spawns the critic.
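Here is one tool end to end, as a sketch: the schema the loop above hands to the API, and a search_pubmed implementation over NCBI E-utilities’ JSON endpoints (esearch to find PMIDs, esummary for metadata). The returned fields are my choice, not necessarily the repo’s:

```python
import requests

PUBMED_TOOL = {
    "name": "search_pubmed",
    "description": "Search PubMed; returns PMIDs with titles, journals, and dates.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "PubMed search query"},
            "max_results": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, max_results: int = 5) -> str:
    # Step 1: esearch returns matching PMIDs for the query.
    ids = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": query, "retmode": "json", "retmax": max_results},
        timeout=30,
    ).json()["esearchresult"]["idlist"]
    if not ids:
        return "No results."
    # Step 2: esummary returns metadata for those PMIDs.
    result = requests.get(
        f"{EUTILS}/esummary.fcgi",
        params={"db": "pubmed", "id": ",".join(ids), "retmode": "json"},
        timeout=30,
    ).json()["result"]
    return "\n".join(
        f"PMID {p}: {result[p]['title']} ({result[p]['source']}, {result[p]['pubdate']})"
        for p in ids
    )
```

search_nih_reporter follows the same shape against RePORTER’s v2 search endpoint, and critique_draft is the critic call shown under pattern 1 below.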
This is deliberately simple. Almost any production agent framework would be more complex than this, and would not produce better output for this specific task. The architecture is sized to fit the problem.
A short detour through hallucination
It’s worth pausing to explain why live retrieval matters at all, because hallucination is the failure mode this whole architecture is built around.
A language model is mechanically a next-token predictor. Given the conversation so far, it computes a probability distribution over what word should come next, and samples. It does this thousands of times to produce a paragraph. Throughout the process, the model has no representation of “this is true” or “I don’t know this.” It has only “this token is likely to follow these tokens, given my training.”
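To make that concrete, here is the entire decision procedure in toy form. The real model computes its scores with a neural network, but the sampling step at the end is the same, and nothing in it consults a fact:

```python
import numpy as np

# Toy next-token step. Note there is no truth check anywhere.
vocab = ["2018", "2019", "2021", "et", "al."]
logits = np.array([1.2, 2.8, 1.9, 0.3, 0.1])   # made-up scores for illustration

probs = np.exp(logits) / np.exp(logits).sum()  # softmax: scores -> probabilities
next_token = np.random.choice(vocab, p=probs)  # sample in proportion to probability
print(next_token)  # plausible-looking output, whether or not it is true
```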
When you ask such a model for a citation, it produces citation-shaped text. Sometimes that text refers to a real paper that genuinely supports the claim. Sometimes it refers to a real paper that doesn’t actually support the claim. Sometimes the citation is structurally plausible but the paper doesn’t exist. And sometimes it attributes a real title to authors who never wrote it. Studies put the error rate for biomedical citations from a chat interface at 20–60%, depending on the model and the prompt. The output looks identical in all four cases. There is no “I’m guessing” tag, no internal flag. The model has no way to surface uncertainty even when it’s profoundly uncertain.
For most uses of language models this is a manageable nuisance. For NIH submissions it’s unacceptable. So the architecture is built to make this category of failure either impossible or explicit.
The four prompting patterns
These are the design choices that distinguish the agent from a chat-interface-plus-prompt approach. They generalize beyond grant writing — anyone building an agent for structured technical writing will find these useful.
1. Adversarial self-critique via separated contexts
A model asked to critique its own work in the same conversation tends to defend its choices. It’s seen the reasoning that produced the draft; it has internal pressure toward consistency; it leans sycophantic. The critique it produces is qualitatively worse than what a fresh reader would produce.
So the agent spawns the critic as a separate Claude call with a different system prompt and no shared context. The critic sees the draft cold. It can’t defend choices it never made. It has no reason to be polite. Its system prompt opens with: “You are a senior NIH SBIR study section chair with twenty years of experience reviewing medical device applications. Your critique is specific, brutal, and actionable. For every weakness you flag, you propose a concrete fix.”
Empirically, the critique produced this way is qualitatively better than self-critique in the same context. The drafter then revises against the critique, often firing additional searches to fill gaps the critic identified. Two to three cycles usually gets the draft to publication quality.
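A sketch of the critique_draft tool, reusing the client from the loop sketch above. The system prompt is quoted from this post; everything else is illustrative:

```python
CRITIC_SYSTEM = (
    "You are a senior NIH SBIR study section chair with twenty years of "
    "experience reviewing medical device applications. Your critique is "
    "specific, brutal, and actionable. For every weakness you flag, you "
    "propose a concrete fix."
)

def critique_draft(draft: str) -> str:
    # One-shot call: the messages list contains only the draft, none of the
    # drafter's conversation, so the critic reads it cold and cannot defend
    # choices it never made.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model id
        max_tokens=4000,
        system=CRITIC_SYSTEM,
        messages=[{
            "role": "user",
            "content": f"Critique this Specific Aims draft:\n\n{draft}",
        }],
    )
    return "".join(b.text for b in response.content if b.type == "text")
```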
2. Hardcoded document structure
NIH Specific Aims has a canonical form. Without explicit structure in the system prompt, language models write good essays instead of good aims pages. They drift toward what they’ve seen most: literature reviews, scientific manuscripts, grant pitches. The output is plausible but wrong-shaped.
So the system prompt enumerates the section headers in order with rough word-count budgets and required content for each: Problem → Critical Barrier → Solution → Preliminary Data → Hypothesis → 2–3 Aims with measurable endpoints → Expected Impact → Citations. The result is output that reliably looks like an aims page rather than an essay about the project.
3. Feasibility constraints as hard rules
Out of the box, language models default to ambitious scope. Without explicit constraints, the agent confidently scopes a randomized clinical trial as a Phase I aim. That’s not malicious — it’s that the model has read a lot of grants and ambition is a pattern it learned.
So the system prompt has explicit hard rules: “Phase I aims MUST be feasible in 6–12 months on ≤$500K. No clinical enrollment, no GLP large-animal studies, no custom-engineered software beyond a validated prototype.” This single constraint prevents most of the “infeasible aim” failures that would otherwise need the critic to catch.
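Patterns 2 and 3 both live as plain strings in the drafter’s system prompt. A sketch of the assembly; the word budgets and exact wording are illustrative paraphrases, not the repository’s text:

```python
# Illustrative system-prompt assembly for the drafter.
BASE_PERSONA = "You are an expert NIH SBIR Phase I grant writer for medical devices."

AIMS_STRUCTURE = """
Write the Specific Aims page in exactly this order:
1. Problem (~100 words): clinical burden, grounded in retrieved literature.
2. Critical Barrier (~60 words): why current approaches fall short.
3. Solution (~80 words): the device and how it clears the barrier.
4. Preliminary Data (~80 words): what has already been shown.
5. Hypothesis: one testable sentence.
6. Aims (2-3): each with a measurable endpoint and success criterion.
7. Expected Impact (~60 words).
8. Citations: only papers returned by search_pubmed.
"""

HARD_RULES = """
HARD RULES:
- Phase I aims MUST be feasible in 6-12 months on <=$500K.
- No clinical enrollment, no GLP large-animal studies, no
  custom-engineered software beyond a validated prototype.
"""

DRAFTER_SYSTEM = BASE_PERSONA + AIMS_STRUCTURE + HARD_RULES
```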
4. [FILL: source] markers for unverified claims
Hallucinated citations are the canonical scientific-writing failure mode. When the agent has a numerical claim it can’t ground in a real PubMed retrieval, the system prompt instructs it to insert a [FILL: source] marker rather than invent a citation. This converts a hidden hallucination into a visible one.
The downstream effect is that human verification becomes a checklist — find every marker, source it — instead of a hunt for which of the citations are real. It’s a quiet pattern but it’s what makes the output safe to use.
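That checklist can even be generated mechanically. A small sketch that pulls every marker with enough surrounding context to know what needs sourcing (the marker format matches the post; the helper itself is mine):

```python
import re

def fill_checklist(draft: str, window: int = 60) -> list[str]:
    """Return one checklist item per [FILL: source] marker in the draft."""
    items = []
    for m in re.finditer(re.escape("[FILL: source]"), draft):
        start = max(0, m.start() - window)
        context = draft[start:m.end()].replace("\n", " ")
        items.append(f"- [ ] source needed: ...{context}")
    return items

# Usage: print("\n".join(fill_checklist(open("aims_draft.md").read())))
```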
The verification protocol
The agent produces a draft. The PI produces the submission. That separation is enforced by an explicit protocol the README documents:
Open every cited PubMed ID. Confirm the paper exists, that the cited claim is supported, and that the journal, authors, and year match. The agent grounds via live retrieval but cannot verify that the cited text actually supports the specific claim attributed to it. (A helper sketch for the existence check follows this list.)
Resolve every [FILL: source] marker with a primary source. Do not paste the markers into a submission.
Confirm aims are achievable in 6–12 months on ≤$500K with realistic timelines for materials, fabrication, testing, and analysis.
Have a biostatistician review any sample sizes, power calculations, and statistical claims.
Confirm the framing of the device is consistent with the regulatory pathway and any prior FDA correspondence.
Confirm the framing matches the priorities and language of the target NIH institute and program announcement.
Author the submission. Per NIH’s “substantially developed by AI” guidance, the PI must revise and add original content. The agent produces a draft. The submission is the PI’s work.
The agent is fast. The verification is the bottleneck. That’s the correct ordering.
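The existence-and-metadata half of the first step lends itself to a helper. A sketch against NCBI’s esummary endpoint; it assumes citations carry explicit PMIDs, and it deliberately stops short of the part that stays human, checking that the paper supports the claim:

```python
import re
import requests

def check_pmids(draft: str) -> None:
    """Look up every PMID cited in the draft and print what NCBI has on file."""
    pmids = sorted(set(re.findall(r"PMID[:\s]+(\d+)", draft)))
    for pmid in pmids:
        result = requests.get(
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
            params={"db": "pubmed", "id": pmid, "retmode": "json"},
            timeout=30,
        ).json()["result"]
        if pmid not in result or "error" in result.get(pmid, {}):
            print(f"PMID {pmid}: NOT FOUND -- flag for the PI")
        else:
            s = result[pmid]
            print(f"PMID {pmid}: {s['title']} ({s['source']}, {s['pubdate']})")
```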
How I built this
People sometimes ask how I built this without a software background. The honest answer is that I didn’t write most of the code by hand. I worked with Claude as a thought partner — specified what I wanted, evaluated what came back, iterated on the prompts and the architecture until the output matched the standard I knew was right. That workflow only converges if the person in the loop has the domain expertise to recognize good output. The medical training is what made the iteration end up somewhere useful.
Five years ago, a physician with my background couldn’t have built this in three weeks of evenings without a co-founder. Today the bottleneck has moved. It isn’t writing code. It’s having the judgment to know what good output looks like, and the discipline to keep the work safe under medical, regulatory, and IP scrutiny while doing it openly. The architecture isn’t where the value sits — the patterns are well-known and any engineer could rebuild them. The value sits in the IP firewall, the FDA-compliant language, the verification protocol, and the credibility of a physician maintaining it.
That’s the meta-thesis under HippocrAI. The era where a physician needed an engineering co-founder to build their own tools is over. What replaces it is physicians using AI to build openly, under the discipline the profession demands. The agents are free. The discipline is the moat.
About HippocrAI
HippocrAI is the open-source layer of AI tooling for medical device development. It’s organized around four pillars:
Open by default. The agents are free, Apache-licensed, and built so anyone — physician-founder, advisor, or curious engineer — can read the code, fork it, or rebuild it for an adjacent domain.
Physician-led. Every agent is designed and maintained under physician oversight. The discipline that medicine demands — patient-safety language, regulatory awareness, IP firewall, verification expectations — is built into the architecture rather than tacked on.
Medtech-specific. General-purpose AI tools fail at medtech work because they don’t know what good looks like in this domain. HippocrAI’s tools are tuned for the unglamorous middle of medtech development: grant writing, regulatory drafting, literature review, IP support, founder operations.
Built under the oath. First, do no harm is the operating principle, not a brand tagline. The Pre-Release Review Protocol gates every public release. Tools that could conceivably leak provisional-patent content, frame the agent as a medical device, or enable misuse for clinical decision-making are not released.
The thesis is that the next wave of useful AI tools in medicine will not come from AI labs that don’t understand medicine, or from medical institutions that don’t move fast. They will come from physicians who have learned to use AI as a force multiplier and who hold themselves to the discipline the profession demands. HippocrAI is what that work looks like in public.
Try it, fork it, or read more
The SBIR Aims Generator is at github.com/hippocrai/sbir-aims-agent. Clone it, try a run on the synthetic example brief, fork it for adjacent grant-writing problems. The README walks through the architecture and the verification protocol.
If you build agents for adjacent domains using these patterns — FDA Q-Sub drafting, EU MDR Clinical Evaluation Reports, IRB applications, scientific manuscript drafts — I’d be interested to hear what breaks and what generalizes. Open an issue, send a PR, or reply to a HippocrAI essay.
Subscribe to HippocrAI for the long-form essays and new agent releases. The next post in this series walks through adversarial self-critique as a general pattern for technical writing — a generalized version of the critique loop in this agent, with patterns drawn from medical, legal, and regulatory writing more broadly.
— Joseph L. Hayhurst, MD