Discipline at the Input
Seven prompting patterns I used to build my SBIR agent — and that any domain expert can adopt.
A common reaction when I describe HippocrAI is some version of “I couldn’t do that — I’m not technical.” The reaction is sincere and almost always wrong. The bottleneck for physicians building their own tools used to be writing code. With frontier models capable of producing most of the boilerplate when given a clear specification, the bottleneck has moved. It now sits at the input.
The single insight that explains why some physicians can use AI tools to build useful, safe, releasable software while others can’t is this: the quality of the tool reflects the quality of the prompting that built it. Discipline at the input creates discipline at the output. There is no shortcut. A model asked to “help me write some Python that uses the API” produces something generic and unsafe. A model asked, with full context and clear constraints, to build an agent that does a specific thing under specific guardrails produces something usable.
What follows are seven specific prompting patterns I used to build the SBIR Aims Agent. They are not novel. They are not technical. Most physicians already practice versions of them in clinical work without knowing they’re transferable. Naming them here makes the transfer easier.
The agent itself, for context: an open-source AI tool that drafts NIH SBIR Phase I Specific Aims pages, grounded in live PubMed retrieval, refined through adversarial self-critique, in two to five minutes per run. Around 360 lines of Python in a single file. Released under Apache 2.0 at github.com/hippocrai/sbir-aims-agent. The architecture is well-known engineering. What’s interesting isn’t the architecture; it’s how a physician without an engineering background was able to specify it well enough that a frontier model could produce it.
1. Specify the deliverable, not the process
The most common prompting failure is asking a model to perform steps rather than to produce an outcome. “Help me write a Python script” is a process prompt. “Build an agent that drafts NIH SBIR Phase I aims pages, uses live PubMed retrieval to ground citations, and includes a separate critique step” is a deliverable prompt. The first will produce something generic, because there is no destination defined. The second forces the model to make architectural choices that fit a target.
This pattern transfers from clinical work directly. When a chief resident describes a consult, they don’t list steps — they specify a deliverable: the patient needs a NJ tube placed, sedated, with confirmation by KUB before feeds restart. The deliverable is bounded, measurable, and lets the team work backward from it. AI prompting works the same way. State the destination clearly and the model can choose the path. Leave it ambiguous and the model defaults to the most generic interpretation of your words.
When I wrote the spec for the agent, I did not write “use the Anthropic API to make a tool.” I wrote, in effect, “produce the smallest possible Python program that takes a project briefing as input and outputs a publication-quality NIH SBIR Phase I Specific Aims page, grounded in live PubMed and NIH RePORTER retrieval, refined through a separate adversarial critique step, with [FILL: source] markers for any unverified claims.” Every later choice — the ReAct loop, the three-tool design, the separated critic context — followed from that specification.
2. Iterate on identity before iterating on output
Early in this project I wrote about myself as a “surgeon-founder.” It survived two drafts before I caught it: three years of general surgery residency without board certification doesn’t qualify me to use that title, even if the term is technically accurate to past training. I changed it to “physician-founder,” which is honest, distinctive, and survives any future scrutiny.
That change was upstream of the technical work. But every artifact downstream — the README, the LICENSE copyright line, the Substack drafts, the eventual Twitter bio — inherits the precision of that early decision. If I’d let the overclaim stand, every downstream artifact would now contain a small lie. Worse, every revision I asked the model to make would inherit the same lie because the system prompt would carry it forward.
Calibrate identity, naming, and self-description before you start generating output. The cost of fixing it later compounds.
3. Demand calibrated truth, not validation
Throughout the build I asked questions like “how unique is this thing we’re creating?” and “is this widely adopted, or have we found something useful that nobody’s done?” These were explicit invitations for the model to puncture optimism if it was warranted. The honest answer — that the architecture is well-known engineering and what’s actually rare is the discipline plus domain expertise — was more useful than a flattering one would have been. It became the strategic positioning thesis for the entire project.
Most prompting goes the other direction. Users ask leading questions and the model, trained to be helpful, gives reassuring answers. The output drifts toward inflated claims. Public artifacts written from inflated claims fail under scrutiny. Investors notice. Reviewers notice. Hostile journalists notice.
The pattern is to ask questions phrased to invite disconfirmation rather than seek confirmation. Instead of “this looks good, right?” ask “what’s wrong with this?” Instead of “is this unique?” ask “who else is doing this and why is this version distinctive?” The model will give you better material, and the resulting work will survive the rooms it eventually has to enter.
4. Enforce discipline upstream, not as cleanup
Before any agent code was written, I had already established the Pre-Release Review Protocol — seven hard guardrails covering provisional patent containment, FDA-compliant framing, no real client briefings, generic-only public prompts, and pre-release checklists. The protocol existed because of medtech-specific risks I had thought through before any AI work began.
When I asked the model to build the agent, the protocol was part of the context. The HIPEC briefing that originally lived inside the code was extracted into a private file. The example brief was synthetic and clearly labeled as such. The README disclaimers were written first, not retrofitted. The .gitignore patterns excluded private briefings before any private briefing existed locally to be excluded.
The opposite pattern — build first, clean up later — is how leaks happen. By the time you notice the leak, the artifact is already public, the commit is already pushed, the screenshot is already taken. Pulling things back is much harder than never pushing them in the first place.
5. Ask meta-questions about reasoning, not just answers
The instruction “explain hallucinations to me and why they happen” produces a different conversation than “fix the hallucinations in this output.” The first asks the model to expose its reasoning so I can transfer the understanding to other contexts. The second asks for a localized fix.
Throughout the build, I consistently asked meta-questions. Why does this prompting pattern work better than that one? What’s the failure mode this pattern is defending against? What would I tell another physician who wanted to build a similar tool? Each meta-question converted output from instructions-I-can-execute-once to understanding-I-can-deploy-elsewhere. By the end of the build, the same patterns I’d applied to the SBIR agent could be transferred to FDA Q-Sub drafting, EU MDR Clinical Evaluation Reports, IRB applications, manuscript drafts. The understanding was portable in a way the code wasn’t.
For domain experts, especially, this is the pattern that compounds. Each project teaches not just a tool but a generalizable lesson if the prompt explicitly elicits it.
6. Verify by running the actual system
I did not accept the agent as built when the model said it was done. I cloned the published repository from GitHub onto a fresh machine, installed dependencies, set the API key, and ran the agent against the synthetic example brief. Then I ran it against the real HIPEC briefing. The first run hit a rate limit before completion. The second exposed a folder-naming bug from when files had been uploaded back from local. A third run, after fixes, completed cleanly.
Each round of in-the-field testing surfaced problems that would never have appeared inside the development conversation. The agent looked correct on inspection. It only revealed its actual flaws when run end-to-end as a stranger would experience it.
This is the medical version of “treat the patient, not the chart.” The output of any tool is just a hypothesis until it survives execution against reality. Domain experts who skip the execution step ship work that fails in production, often in embarrassing ways. The pattern is to define a complete end-to-end test that exercises the actual deployment path — not a contrived example, not a half-mocked test, but the real thing — and run it before declaring done.
7. Name the friction when it appears
Through the build, I surfaced every confusion as it appeared. Where is the terminal. What does git init mean. The README didn’t show up on GitHub. The .DS_Store keeps coming back. Every one of these was friction I could have hidden by working around it silently or pretending I understood. Naming each one explicitly produced two outcomes. The immediate problem got solved. And the build’s documentation got better — because each unnamed assumption that I caught added a sentence somewhere that future users wouldn’t trip over.
The opposite pattern is to nod along, work around problems silently, and ship a tool that’s only usable by people who think exactly like the builder. Most open-source software fails this way. The README assumes Mac. The setup script assumes a specific shell. The error messages are written for someone who already understands the system. Domain-expert-built tools — when the domain expert is willing to surface their own confusions during development — tend to be more usable, because they’re documented for the audience that matters.
Why medical training is preparation for this
Look back at those seven patterns. Specify the deliverable, not the process. Calibrate identity before you ship it. Demand truth, not flattery. Enforce discipline upstream. Reason about why, not just what. Verify by execution. Name the friction.
These are not technical disciplines. They are forms of intellectual hygiene that medical training selects for. Specifying a deliverable is what a chief complaint forces. Demanding truth over comfort is what differential diagnosis requires. Upstream discipline is what consent forms and time-outs embody. Reasoning about why is what evidence-based medicine teaches. Verification is what a physical exam plus imaging exists for. Naming friction is what morbidity and mortality conferences are for.
Physicians already have the dispositions that produce good AI-built tools. The gap isn’t temperament. The gap is that most physicians don’t yet know that the same dispositions transfer. They look at AI tooling and see a domain that requires CS skills they don’t have. What it actually requires is the prompting discipline that medical training has already given them. Discipline at the input creates discipline at the output. There is no shortcut. And there are very few shortcuts worth having.
What this means for HippocrAI
The thesis under this entire project is that the next wave of useful AI tools in medicine will not come from AI labs that don’t understand medicine, or from medical institutions that don’t move fast. They will come from physicians who have learned to use AI as a force multiplier and who hold themselves to the discipline the profession demands.
If the bottleneck has moved from “can the physician code” to “does the physician have the discipline to specify well,” then the multiplier on physicians who already practice that discipline is enormous. The agents are free. The discipline is the moat. The era where this kind of work needed an engineering co-founder is over.
If you’re a physician sitting on a tool you wish existed, the question to ask yourself isn’t can I build this. The question is can I specify it clearly, can I calibrate the framing, can I enforce upstream discipline, can I demand calibrated truth from the model, can I run the result against reality, and can I name the friction when it shows up. If the answer to any of those is yes, the rest is just iteration.
The next post in this series walks through adversarial self-critique as a generalized prompting pattern for technical writing — drawing examples from medical documentation, regulatory drafting, and scientific manuscripts. Subscribe to HippocrAI to receive it.
— Joseph L. Hayhurst, MD



"Discipline at the input creates discipline at the output" is the cleanest version of this argument I have read.
Pattern 3 is the one most people skip. They ask the model to confirm instead of disconfirm. The output sounds good, survives no scrutiny, and they wonder why it failed in the room that mattered.
Pattern 4 is the one with the highest stakes in your domain specifically. The leak you don't catch before release in a medical context is not a README problem.
Writing about the input discipline problem from a different angle at michaelklingebiel.substack.com. "Begin With the Question, Not the Answer" covers exactly why confirmation seeking is the default and how to break it.