HippocrAI

Discipline at the Input

Joseph L. Hayhurst, MD — Tue, 02 Jun 2026 19:54:06 GMT

A common reaction when I describe HippocrAI is some version of “I couldn’t do that — I’m not technical.” The reaction is sincere and almost always wrong. The bottleneck for physicians building their own tools used to be writing code. With frontier models capable of producing most of the boilerplate when given a clear specification, the bottleneck has moved. It now sits at the input.

The single insight that explains why some physicians can use AI tools to build useful, safe, releasable software while others can’t is this: the quality of the tool reflects the quality of the prompting that built it. Discipline at the input creates discipline at the output. There is no shortcut. A model asked to “help me write some Python that uses the API” produces something generic and unsafe. A model asked, with full context and clear constraints, to build an agent that does a specific thing under specific guardrails produces something usable.

What follows are seven specific prompting patterns I used to build the SBIR Aims Agent. They are not novel. They are not technical. Most physicians already practice versions of them in clinical work without knowing they’re transferable. Naming them here makes the transfer easier.

The agent itself, for context: an open-source AI tool that drafts NIH SBIR Phase I Specific Aims pages, grounded in live PubMed retrieval, refined through adversarial self-critique, in two to five minutes per run. Around 360 lines of Python in a single file. Released under Apache 2.0 at github.com/hippocrai/sbir-aims-agent. The architecture is well-known engineering. What’s interesting isn’t the architecture; it’s how a physician without an engineering background was able to specify it well enough that a frontier model could produce it.

1. Specify the deliverable, not the process

The most common prompting failure is asking a model to perform steps rather than to produce an outcome. “Help me write a Python script” is a process prompt. “Build an agent that drafts NIH SBIR Phase I aims pages, uses live PubMed retrieval to ground citations, and includes a separate critique step” is a deliverable prompt. The first will produce something generic, because there is no destination defined. The second forces the model to make architectural choices that fit a target.

This pattern transfers from clinical work directly. When a chief resident describes a consult, they don’t list steps — they specify a deliverable: the patient needs a NJ tube placed, sedated, with confirmation by KUB before feeds restart. The deliverable is bounded, measurable, and lets the team work backward from it. AI prompting works the same way. State the destination clearly and the model can choose the path. Leave it ambiguous and the model defaults to the most generic interpretation of your words.

When I wrote the spec for the agent, I did not write “use the Anthropic API to make a tool.” I wrote, in effect, “produce the smallest possible Python program that takes a project briefing as input and outputs a publication-quality NIH SBIR Phase I Specific Aims page, grounded in live PubMed and NIH RePORTER retrieval, refined through a separate adversarial critique step, with [FILL: source] markers for any unverified claims.” Every later choice — the ReAct loop, the three-tool design, the separated critic context — followed from that specification.

2. Iterate on identity before iterating on output

Early in this project I wrote about myself as a “surgeon-founder.” It survived two drafts before I caught it: three years of general surgery residency without board certification doesn’t qualify me to use that title, even if the term is technically accurate to past training. I changed it to “physician-founder,” which is honest, distinctive, and survives any future scrutiny.

That change was upstream of the technical work. But every artifact downstream — the README, the LICENSE copyright line, the Substack drafts, the eventual Twitter bio — inherits the precision of that early decision. If I’d let the overclaim stand, every downstream artifact would now contain a small lie. Worse, every revision I asked the model to make would inherit the same lie because the system prompt would carry it forward.

Calibrate identity, naming, and self-description before you start generating output. The cost of fixing it later compounds.

3. Demand calibrated truth, not validation

Throughout the build I asked questions like “how unique is this thing we’re creating?” and “is this widely adopted, or have we found something useful that nobody’s done?” These were explicit invitations for the model to puncture optimism if it was warranted. The honest answer — that the architecture is well-known engineering and what’s actually rare is the discipline plus domain expertise — was more useful than a flattering one would have been. It became the strategic positioning thesis for the entire project.

Most prompting goes the other direction. Users ask leading questions and the model, trained to be helpful, gives reassuring answers. The output drifts toward inflated claims. Public artifacts written from inflated claims fail under scrutiny. Investors notice. Reviewers notice. Hostile journalists notice.

The pattern is to ask questions phrased to invite disconfirmation rather than seek confirmation. Instead of “this looks good, right?” ask “what’s wrong with this?” Instead of “is this unique?” ask “who else is doing this and why is this version distinctive?” The model will give you better material, and the resulting work will survive the rooms it eventually has to enter.

4. Enforce discipline upstream, not as cleanup

Before any agent code was written, I had already established the Pre-Release Review Protocol — seven hard guardrails covering provisional patent containment, FDA-compliant framing, no real client briefings, generic-only public prompts, and pre-release checklists. The protocol existed because of medtech-specific risks I had thought through before any AI work began.

When I asked the model to build the agent, the protocol was part of the context. The HIPEC briefing that originally lived inside the code was extracted into a private file. The example brief was synthetic and clearly labeled as such. The README disclaimers were written first, not retrofitted. The .gitignore patterns excluded private briefings before any private briefing existed locally to be excluded.

The opposite pattern — build first, clean up later — is how leaks happen. By the time you notice the leak, the artifact is already public, the commit is already pushed, the screenshot is already taken. Pulling things back is much harder than never pushing them in the first place.

5. Ask meta-questions about reasoning, not just answers

The instruction “explain hallucinations to me and why they happen” produces a different conversation than “fix the hallucinations in this output.” The first asks the model to expose its reasoning so I can transfer the understanding to other contexts. The second asks for a localized fix.

Throughout the build, I consistently asked meta-questions. Why does this prompting pattern work better than that one? What’s the failure mode this pattern is defending against? What would I tell another physician who wanted to build a similar tool? Each meta-question converted output from instructions-I-can-execute-once to understanding-I-can-deploy-elsewhere. By the end of the build, the same patterns I’d applied to the SBIR agent could be transferred to FDA Q-Sub drafting, EU MDR Clinical Evaluation Reports, IRB applications, manuscript drafts. The understanding was portable in a way the code wasn’t.

For domain experts, especially, this is the pattern that compounds. Each project teaches not just a tool but a generalizable lesson if the prompt explicitly elicits it.

6. Verify by running the actual system

I did not accept the agent as built when the model said it was done. I cloned the published repository from GitHub onto a fresh machine, installed dependencies, set the API key, and ran the agent against the synthetic example brief. Then I ran it against the real HIPEC briefing. The first run hit a rate limit before completion. The second exposed a folder-naming bug from when files had been uploaded back from local. A third run, after fixes, completed cleanly.

Each round of in-the-field testing surfaced problems that would never have appeared inside the development conversation. The agent looked correct on inspection. It only revealed its actual flaws when run end-to-end as a stranger would experience it.

This is the medical version of “treat the patient, not the chart.” The output of any tool is just a hypothesis until it survives execution against reality. Domain experts who skip the execution step ship work that fails in production, often in embarrassing ways. The pattern is to define a complete end-to-end test that exercises the actual deployment path — not a contrived example, not a half-mocked test, but the real thing — and run it before declaring done.

7. Name the friction when it appears

Through the build, I surfaced every confusion as it appeared. Where is the terminal. What does git init mean. The README didn’t show up on GitHub. The .DS_Store keeps coming back. Every one of these was friction I could have hidden by working around it silently or pretending I understood. Naming each one explicitly produced two outcomes. The immediate problem got solved. And the build’s documentation got better — because each unnamed assumption that I caught added a sentence somewhere that future users wouldn’t trip over.

The opposite pattern is to nod along, work around problems silently, and ship a tool that’s only usable by people who think exactly like the builder. Most open-source software fails this way. The README assumes Mac. The setup script assumes a specific shell. The error messages are written for someone who already understands the system. Domain-expert-built tools — when the domain expert is willing to surface their own confusions during development — tend to be more usable, because they’re documented for the audience that matters.

Why medical training is preparation for this

Look back at those seven patterns. Specify the deliverable, not the process. Calibrate identity before you ship it. Demand truth, not flattery. Enforce discipline upstream. Reason about why, not just what. Verify by execution. Name the friction.

These are not technical disciplines. They are forms of intellectual hygiene that medical training selects for. Specifying a deliverable is what a chief complaint forces. Demanding truth over comfort is what differential diagnosis requires. Upstream discipline is what consent forms and time-outs embody. Reasoning about why is what evidence-based medicine teaches. Verification is what a physical exam plus imaging exists for. Naming friction is what morbidity and mortality conferences are for.

Physicians already have the dispositions that produce good AI-built tools. The gap isn’t temperament. The gap is that most physicians don’t yet know that the same dispositions transfer. They look at AI tooling and see a domain that requires CS skills they don’t have. What it actually requires is the prompting discipline that medical training has already given them. Discipline at the input creates discipline at the output. There is no shortcut. And there are very few shortcuts worth having.

What this means for HippocrAI

The thesis under this entire project is that the next wave of useful AI tools in medicine will not come from AI labs that don’t understand medicine, or from medical institutions that don’t move fast. They will come from physicians who have learned to use AI as a force multiplier and who hold themselves to the discipline the profession demands.

If the bottleneck has moved from “can the physician code” to “does the physician have the discipline to specify well,” then the multiplier on physicians who already practice that discipline is enormous. The agents are free. The discipline is the moat. The era where this kind of work needed an engineering co-founder is over.

If you’re a physician sitting on a tool you wish existed, the question to ask yourself isn’t can I build this. The question is can I specify it clearly, can I calibrate the framing, can I enforce upstream discipline, can I demand calibrated truth from the model, can I run the result against reality, and can I name the friction when it shows up. If the answer to any of those is yes, the rest is just iteration.

The next post in this series walks through adversarial self-critique as a generalized prompting pattern for technical writing — drawing examples from medical documentation, regulatory drafting, and scientific manuscripts. Subscribe to HippocrAI to receive it.

— Joseph L. Hayhurst, MD

I Built an AI Agent to Draft My Own NIH Grant

Joseph L. Hayhurst, MD — Thu, 14 May 2026 15:20:12 GMT

The Specific Aims page is the rate-limiting step in NIH SBIR grant writing. It’s the first page a study-section reviewer reads, the page that determines whether they keep reading, and structurally the hardest page in the whole submission to write. For a medtech founder operating without a grant-writing department, the page can take weeks of drafts, rewrites, and revisions before it’s ready to put inside a submission.

Most AI tools that claim to help with this are unsafe to use. They produce confident-sounding text built on hallucinated citations — papers that don’t exist, statistics that were invented, attributions to authors who never wrote them. For most uses of language models this is annoying. For an NIH submission it can end your funding trajectory before it starts.

I built an open-source agent that addresses this directly. It produces publication-quality first drafts of NIH SBIR Phase I Specific Aims pages, grounded in live PubMed retrieval, refined through adversarial self-critique, in two to five minutes per run. It’s open-source under Apache 2.0 at github.com/hippocrai/sbir-aims-agent.

This post walks through how it works, why each design choice matters, and what’s left for the human to do. It’s also the first technical release from HippocrAI — the open-source layer of AI tooling for medtech, built under physician oversight.

What this is, and what it isn’t

This is a writing-assistance tool. It produces a draft of an aims page that the human PI then verifies, revises, and authors as a submission. Used well, it cuts time-to-publishable-draft from days to hours.

This is not a medical device. It does not analyze patient data, does not make diagnostic or treatment recommenda

tions, does not constitute clinical decision support. It produces administrative writing — grant text — and explicitly nothing more. The README states this in three places, and the design itself enforces it: there is no clinical input the agent will accept, only project-briefing markdown.

The separation matters because much of what fails in medical AI fails by overreach. The discipline of staying clearly within the administrative layer is part of what makes the tool releasable, and part of what HippocrAI is built around.

The agent, in one paragraph

You feed it a markdown briefing — device, clinical problem, competitive landscape, preliminary data, target NIH institute, funding ceiling, duration. The agent grounds the document in real literature by querying PubMed via NCBI’s E-utilities API and the NIH funding landscape via RePORTER. It drafts an aims page using a hardcoded canonical NIH structure. And it submits the draft to a separate Claude instance prompted to act as a brutal study-section chair — the critic returns specific weaknesses, the drafter revises, and the cycle continues until the critique stops finding substantive issues. The output is a single markdown file ready for human verification.

The architecture

The whole thing is around 360 lines of Python in a single file. No agent framework. No vector database. No orchestration layer. Just the Anthropic Messages API with tool use, two HTTP clients for PubMed and RePORTER, and a critique step that spawns a separate Claude call.

Two distinct Claude calls run in different roles. The drafter is the main agent, running Claude Sonnet 4.5 in a multi-turn ReAct loop with three tools available. ReAct is the standard pattern: the model reasons about what to do next, calls a tool, observes the result, reasons again, eventually concludes. The critic is a one-shot Claude call invoked from inside the drafter’s loop, with a different system prompt and no shared conversation context. It sees only the draft text.

The three tools are search_pubmed, search_nih_reporter, and critique_draft. The first two hit live external APIs. The third spawns the critic.

This is deliberately simple. Almost any production agent framework would be more complex than this, and would not produce better output for this specific task. The architecture is sized to fit the problem.

A short detour through hallucination

It’s worth pausing to explain why live retrieval matters at all, because hallucination is the failure mode this whole architecture is built around.

A language model is mechanically a next-token predictor. Given the conversation so far, it computes a probability distribution over what word should come next, and samples. It does this thousands of times to produce a paragraph. Throughout the process, the model has no representation of “this is true” or “I don’t know this.” It has only “this token is likely to follow these tokens, given my training.”

When you ask such a model for a citation, it produces citation-shaped text. Sometimes that text refers to a real paper that genuinely supports the claim. Sometimes it refers to a real paper that doesn’t actually support the claim. Sometimes the citation is structurally plausible but the paper doesn’t exist. Studies put the error rate for biomedical citations from a chat interface at 20–60%, depending on the model and the prompt. The output looks identical in all four cases. There is no “I’m guessing” tag, no internal flag. The model has no way to surface uncertainty even when it’s profoundly uncertain.

For most uses of language models this is a manageable nuisance. For NIH submissions it’s unacceptable. So the architecture is built to make this category of failure either impossible or explicit.

The four prompting patterns

These are the design choices that distinguish the agent from a chat-interface-plus-prompt approach. They generalize beyond grant writing — anyone building an agent for structured technical writing will find these useful.

1. Adversarial self-critique via separated contexts

A model asked to critique its own work in the same conversation tends to defend its choices. It’s seen the reasoning that produced the draft; it has internal pressure toward consistency; it leans sycophantic. The critique it produces is qualitatively worse than what a fresh reader would produce.

So the agent spawns the critic as a separate Claude call with a different system prompt and no shared context. The critic sees the draft cold. It can’t defend choices it never made. It has no reason to be polite. Its system prompt opens with: “You are a senior NIH SBIR study section chair with twenty years of experience reviewing medical device applications. Your critique is specific, brutal, and actionable. For every weakness you flag, you propose a concrete fix.”

Empirically, the critique produced this way is qualitatively better than self-critique in the same context. The drafter then revises against the critique, often firing additional searches to fill gaps the critic identified. Two to three cycles usually gets the draft to publication quality.

2. Hardcoded document structure

NIH Specific Aims has a canonical form. Without explicit structure in the system prompt, language models write good essays instead of good aims pages. They drift toward what they’ve seen most: literature reviews, scientific manuscripts, grant pitches. The output is plausible but wrong-shaped.

So the system prompt enumerates the section headers in order with rough word-count budgets and required content for each: Problem → Critical Barrier → Solution → Preliminary Data → Hypothesis → 2–3 Aims with measurable endpoints → Expected Impact → Citations. The result is output that reliably looks like an aims page rather than an essay about the project.

3. Feasibility constraints as hard rules

Out of the box, language models default to ambitious scope. Without explicit constraints, the agent confidently scopes a randomized clinical trial as a Phase I aim. That’s not malicious — it’s that the model has read a lot of grants and ambition is a pattern it learned.

So the system prompt has explicit hard rules: “Phase I aims MUST be feasible in 6–12 months on ≤$500K. No clinical enrollment, no GLP large-animal studies, no custom-engineered software beyond a validated prototype.” This single constraint prevents most of the “infeasible aim” failures that would otherwise need the critic to catch.

4. `[FILL: source]` markers for unverified claims

Hallucinated citations are the canonical scientific-writing failure mode. When the agent has a numerical claim it can’t ground in a real PubMed retrieval, the system prompt instructs it to insert a [FILL: source] marker rather than invent a citation. This converts a hidden hallucination into a visible one.

The downstream effect is that human verification becomes a checklist — find every marker, source it — instead of a hunt for which of the citations are real. It’s a quiet pattern but it’s what makes the output safe to use.

The verification protocol

The agent produces a draft. The PI produces the submission. That separation is enforced by an explicit protocol the README documents:

Open every cited PubMed ID. Confirm the paper exists, that the cited claim is supported, and that the journal, authors, and year match. The agent grounds via live retrieval but cannot verify that the cited text actually supports the specific claim attributed to it.

Resolve every [FILL: source] marker with a primary source. Do not paste the markers into a submission.

Confirm aims are achievable in 6–12 months on ≤$500K with realistic timelines for materials, fabrication, testing, and analysis.

Have a biostatistician review any sample sizes, power calculations, and statistical claims.

Confirm the framing of the device is consistent with the regulatory pathway and any prior FDA correspondence.

Confirm the framing matches the priorities and language of the target NIH institute and program announcement.

Author the submission. Per NIH’s “substantially developed by AI” guidance, the PI must revise and add original content. The agent produces a draft. The submission is the PI’s work.

The agent is fast. The verification is the bottleneck. That’s the correct ordering.

How I built this

People sometimes ask how I built this without a software background. The honest answer is that I didn’t write most of the code by hand. I worked with Claude as a thought partner — specified what I wanted, evaluated what came back, iterated on the prompts and the architecture until the output matched the standard I knew was right. That workflow only converges if the person in the loop has the domain expertise to recognize good output. The medical training is what made the iteration end up somewhere useful.

Five years ago, a physician with my background couldn’t have built this in three weeks of evenings without a co-founder. Today the bottleneck has moved. It isn’t writing code. It’s having the judgment to know what good output looks like, and the discipline to keep the work safe under medical, regulatory, and IP scrutiny while doing it openly. The architecture isn’t where the value sits — the patterns are well-known and any engineer could rebuild them. The value sits in the IP firewall, the FDA-compliant language, the verification protocol, and the credibility of a physician maintaining it.

That’s the meta-thesis under HippocrAI. The era where a physician needed an engineering co-founder to build their own tools is over. What replaces it is physicians using AI to build openly, under the discipline the profession demands. The agents are free. The discipline is the moat.

About HippocrAI

HippocrAI is the open-source layer of AI tooling for medical device development. It’s organized around four pillars:

Open by default. The agents are free, Apache-licensed, and built so anyone — physician-founder, advisor, or curious engineer — can read the code, fork it, or rebuild it for an adjacent domain.

Physician-led. Every agent is designed and maintained under physician oversight. The discipline that medicine demands — patient-safety language, regulatory awareness, IP firewall, verification expectations — is built into the architecture rather than tacked on.

Medtech-specific. General-purpose AI tools fail at medtech work because they don’t know what good looks like in this domain. HippocrAI’s tools are tuned for the unglamorous middle of medtech development: grant writing, regulatory drafting, literature review, IP support, founder operations.

Built under the oath. First, do no harm is the operating principle, not a brand tagline. The Pre-Release Review Protocol gates every public release. Tools that could conceivably leak provisional-patent content, frame the agent as a medical device, or enable misuse for clinical decision-making are not released.

The thesis is that the next wave of useful AI tools in medicine will not come from AI labs that don’t understand medicine, or from medical institutions that don’t move fast. They will come from physicians who have learned to use AI as a force multiplier and who hold themselves to the discipline the profession demands. HippocrAI is what that work looks like in public.

Try it, fork it, or read more

The SBIR Aims Generator is at github.com/hippocrai/sbir-aims-agent. Clone it, try a run on the synthetic example brief, fork it for adjacent grant-writing problems. The README walks through the architecture and the verification protocol.

If you build agents for adjacent domains using these patterns — FDA Q-Sub drafting, EU MDR Clinical Evaluation Reports, IRB applications, scientific manuscript drafts — I’d be interested to hear what breaks and what generalizes. Open an issue, send a PR, or reply to a HippocrAI essay.

Subscribe to HippocrAI for the long-form essays and new agent releases. The next post in this series walks through adversarial self-critique as a general pattern for technical writing — a generalized version of the critique loop in this agent, with patterns drawn from medical, legal, and regulatory writing more broadly.

— Joseph L. Hayhurst, MD

Welcome to HippocrAI: The physicians oath, in code.

Joseph L. Hayhurst, MD — Wed, 06 May 2026 20:18:54 GMT

What HippocrAI Is

Last month, I built an AI agent to draft my own NIH SBIR Specific Aims page.

Three tools — a PubMed search, an NIH RePORTER search, and a self-critique loop — are wired to Claude via a system prompt that hard-codes the canonical NIH aims structure. I gave it a project briefing and let it run. A few minutes later, the draft needed only verification before it was ready for use. The code is now public on GitHub. The companion essay explaining how I built it goes up next.

This is the publication where work like that gets shared, with the code and the reasoning intact.

It’s also where I write about the rest of medtech founding from a cold start — accelerators, engineering partnerships, regulatory pathways, IP, the unglamorous infrastructure that consumes the actual hours of building a medical device. The agents I publish are mostly tools for that work.

The premise is simple: the writing-heavy, format-rigid work that consumes a disproportionate fraction of medtech founders’ time is exactly the work that AI agents are good at — if they’re built carefully, scoped narrowly, and grounded in real source material. Those last three conditions are where most public AI tools fail. The discipline of meeting them is what HippocrAI is built around.

That’s the short version. A longer essay on what that means, who I am, and why this is open source rather than a startup follows.

Who I am

I’m Joseph Hayhurst — MD with a concentration in economics, with prior training in general surgery, currently focused on medical device development through Hayhurst Medical Technologies, LLC. Engaged with multiple investors in the regional medtech ecosystem. Provisional patent filed on the device with plans for an eventual non-provisional.

What that adds up to in practice: I’m a one-physician operation building a regulated medical device, with a calendar that has to absorb everything from CAD review to grant drafting to investor updates. The agents I publish here exist because I needed them — first for myself, then because it became clear the same tools would be useful to anyone else in this position.

Most physician-founders are in the same situation. Most early-stage medtech companies are. The infrastructure for getting from “I have a device” to “I have funded development” is mostly writing infrastructure: SBIR aims pages, FDA pre-submission documents, literature reviews, IP filings, regulatory pathway analyses, investor updates. None of it is glamorous. All of it is rate-limiting. AI agents — properly built — can do most of it in hours instead of weeks.

That’s what HippocrAI is for.

The thesis

The most useful AI tools in medicine are not going to come from AI labs that don’t understand medicine, or from medical institutions that don’t move fast. They are going to come from physicians who have learned to use AI as a force multiplier and who hold themselves to the discipline the profession demands.

Five years ago, a physician without engineering training couldn’t build their own software tools at any meaningful level. Today, with frontier language models capable of producing most of the boilerplate when given a clear specification, the bottleneck has moved. It is no longer “can the physician write code.” It is “does the physician have the discipline to specify well, calibrate framing, enforce upstream guardrails, and verify output against reality.” That kind of discipline is what medical training produces. Most physicians who’d be skeptical of their ability to build tools have, embedded in their training, exactly the prompting discipline that produces good output from AI.

HippocrAI is what that work looks like in public. The architectures aren’t novel research — the patterns are well-known and any competent engineer could rebuild them in an afternoon. What’s rare is the combination of medical domain expertise, regulatory discipline, and willingness to build openly. That combination is mostly held by physicians who don’t yet realize they could build their own tools, and by engineers who lack the domain context to know what to build.

The agents are free. The discipline is the moat.

What HippocrAI stands for

Four pillars that define what gets built and what gets released.

Open by default. Every agent ships with code, prompts, and the reasoning behind both. Apache 2.0 license. No paywalled “premium” tier — that’s a different business model and not this one. Open-source isn’t a marketing tactic; it’s a structural commitment that forces the work to be honest under public review.

Physician-led. The work is grounded in clinical context, not pattern-matched from afar. Every agent passes physician review before public release. As contributors join, the goal is that contributor agents will get reviewed by physicians on an editorial board. This is the credentialing infrastructure that AI-only or non-physician medtech tooling can’t easily replicate.

Medtech-specific. Not general medical AI. Not a general developer AI. The vertical is medical device development — devices that go through FDA pathways, NIH funding, and the regulatory and operational stack that comes with them. Specificity is what makes the agents actually useful.

Built under the oath. First, do no harm. Every public agent is unambiguously not a medical device, not clinical decision support, not a substitute for a clinician. Every release goes through a written pre-release review protocol with hard guardrails — provisional-patent firewall, FDA-compliant framing, no real client content, generic prompts only, explicit human-in-the-loop verification. The discipline is what protects the integrity of the work.

What HippocrAI is not

This part matters. Public-facing AI in medicine has a real problem with overclaiming, and I don’t intend on adding to it.

The agents published here:

Do not analyze patient data
Do not produce diagnoses or differential diagnoses
Do not recommend treatments
Do not calculate doses, clinical parameters, or risk scores
Do not triage patients or assess clinical risk
Do not produce any output that could reasonably be characterized as clinical decision support

The acceptable surface for HippocrAI tooling is administrative, regulatory, scientific-writing, and operational work. SBIR aims pages. FDA pre-submission documents. Literature reviews. Predicate-device searches. IP and patent drafting support. Investor updates. The kinds of writing tasks that take medtech founders weeks and that AI agents — properly built — can do in hours.

Nothing on this site is medical advice. Nothing on this site is legal advice. Nothing on this site is a substitute for an attorney, a regulatory consultant, a biostatistician, or a physician.

What to expect

Three categories of post, with each new agent shipped alongside its companion essay.

Agent build essays. Walk-throughs of an AI agent I’ve built for medtech work — the architecture, the prompting patterns that turned out to matter, the design tradeoffs, what surprised me. Code on GitHub, linked from each post. The first one — I Built an AI Agent to Draft My Own NIH Grant — drops next.

Pattern essays. When I notice something general about agent design, prompting, or AI in regulated industries, I write it up. The next one in this series, Discipline at the Input, walks through seven prompting patterns that any domain expert can use to build their own tools — with a focus on why medical training is unusually good preparation for this kind of work.

Medtech founder field notes. What it looks like to build a medical device as a solo physician-founder from a cold start — accelerators, engineering partnerships, regulatory pathways, IP. Specific, honest, useful for the next person doing this.

Posts arrive roughly every one to two weeks during launch, settling into a more sustainable cadence after the first three. Subscribe and they come straight to your inbox.

Why I’m publishing this in public

A few reasons.

The work is more rigorous when it’s reviewable. Closed AI tooling in regulated industries is exactly where you don’t want bad incentives. The cleanest way to keep my own incentives honest is to make every prompt and every architectural decision visible. The pre-release review protocol I run before any public release is a stricter discipline than I’d impose on myself privately.

The next physician-founder shouldn’t have to re-derive these tools from scratch. There are several thousand physician-founders building medtech companies right now, and the writing infrastructure they need is mostly the same. Releasing the agents openly is the highest-leverage thing I can do for that group, and it’s the work I’d want them to do for me.

Open-source is also the only honest answer to the question of who AI tools in medicine should belong to. The closed proprietary version of this project would be more lucrative in a narrow sense and would compromise nearly everything I think AI in medicine should be. So this is the version.

What success looks like

Not subscriber count. Specifically, it looks like physician-founders shipping better SBIR submissions, faster regulatory drafts, and cleaner IP work because of tooling that exists here. It looks like the regulatory consultant who tells me they used the literature-review agent for a predicate-device search and saved a week. It looks like the engineer at a medtech startup who finds the GitHub org and forks an agent into something specific to their device. It looks like another physician deciding that they, too, can build the tools they need.

If a few hundred medtech founders save a few hundred hours each because of work published here, that’s enough.

How to follow along

If you’re a physician-founder, a medtech engineer or operator, an AI engineer interested in regulated-industry work, or someone curious about what medtech founding actually looks like from the inside — subscribe. Posts come straight to your inbox.

The first technical essay drops next. The agent’s GitHub repository is already public.

If you want to reach me, reply to any email — those land in my inbox directly. The agents are free. The discipline is the moat. Welcome.

— Joseph

Joseph L. Hayhurst, MD Physician-founder · HippocrAI hippocr.ai · github.com/hippocrai ~ Joseph.l.hayhurst@gmail.com