Building AI That Doctors Actually Trust: Lessons from 3 Years in Clinical AI Development

May 20, 2025

Updated: Apr 7

Model accuracy means nothing if the physician ignores the output. Here's what we learned about building AI that clinicians actually rely on — and the mistakes that taught us.

There's a persistent myth in healthtech that goes something like this: build a model with high enough accuracy, publish an impressive AUC score, put together a slick demo, and doctors will adopt it. The logic seems reasonable. If the AI is demonstrably better than the alternative, why wouldn't clinicians use it?

After three years of developing, deploying, and iterating on clinical AI products at Quremarvel, we can say with certainty that this is not how it works. Not even close.

We've built models that scored 98% on benchmark datasets and got completely ignored in practice. Clinicians glanced at the output, shrugged, and went back to their existing process. We've also built simpler models with lower headline accuracy that clinicians now refuse to work without — that they ask for on day one when rotating to a new ward. The difference was never the model. It was never the accuracy score. It was everything around it.

Clinical AI adoption is fundamentally a trust problem masquerading as a technology problem. And trust, in a clinical environment, is earned through mechanisms that have nothing to do with ROC curves.

Here are the hardest, most expensive, and most important lessons we've learned.

Lesson 1: Explainability isn't a feature — it's the entire product.

Clinicians are scientists by training. They don't accept conclusions without evidence, and they shouldn't. When a colleague recommends a diagnosis, the first question is always "why do you think that?" The same standard applies to AI — except most AI products fail this test completely.

When we first deployed QurePredict, our risk prediction engine, we presented a clean risk score with a confidence percentage. "This patient has an 87% risk of deterioration within 24 hours." The model was well-calibrated. The predictions were accurate. Adoption was dismal.

Clinicians would glance at the score, note it mentally, and continue with their existing assessment process. When we interviewed them, the feedback was consistent: "I don't know why it's saying that. I can't chart 'the AI told me so' as a clinical rationale. If I can't explain it to the patient or to the attending, I can't act on it."

The breakthrough came when we rebuilt the output to include contributing factors. Instead of just "87% risk," QurePredict now shows: "Elevated creatinine trend over 72 hours (+0.4 mg/dL), combined with new onset tachycardia (HR 112, up from baseline 78), recent administration of nephrotoxic medication (vancomycin, day 3), and declining urine output over the past 12 hours."

Adoption tripled within a month. The score didn't change. The accuracy didn't change. The explanation did.

The lesson extends deeper than just showing contributing factors. We learned that the format of the explanation matters as much as the content. Clinicians think in physiological narratives — "the kidneys are struggling because of this drug, which is causing this cascade." They don't think in feature importance rankings. So we restructured our explanations to tell a clinical story, not a statistical one.

We also learned that different specialties need different explanation styles. An ICU intensivist wants granular, real-time variable tracking. A primary care physician wants a high-level summary with actionable next steps. An emergency physician wants the three most critical findings and nothing else. One explanation format does not fit all clinical contexts.
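To make the idea of specialty-specific explanations concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not QurePredict's actual code: the `Factor` type, the `weight` values, and the `MAX_FACTORS` table are all hypothetical. It covers only the two cases that reduce to "how many findings to show" (everything for the ICU, the top three for the emergency department); the primary care summary with next steps is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Factor:
    text: str      # clinician-readable finding
    weight: float  # hypothetical contribution score; higher means more influential


# Hypothetical limits on how many findings each setting sees (None = show all).
MAX_FACTORS = {"icu": None, "emergency": 3}


def render_explanation(risk: float, factors: List[Factor], specialty: str) -> str:
    """Build a plain-text explanation, most influential finding first."""
    ranked = sorted(factors, key=lambda f: f.weight, reverse=True)
    limit: Optional[int] = MAX_FACTORS.get(specialty)  # unknown specialty: show all
    shown = ranked if limit is None else ranked[:limit]
    lines = [f"{risk:.0%} risk of deterioration within 24 hours, driven by:"]
    lines += [f"- {f.text}" for f in shown]
    return "\n".join(lines)


# Findings taken from the example above; the weights are made up for illustration.
factors = [
    Factor("Elevated creatinine trend over 72 hours (+0.4 mg/dL)", 0.40),
    Factor("New onset tachycardia (HR 112, up from baseline 78)", 0.25),
    Factor("Recent nephrotoxic medication (vancomycin, day 3)", 0.20),
    Factor("Declining urine output over the past 12 hours", 0.15),
]

emergency_view = render_explanation(0.87, factors, "emergency")
icu_view = render_explanation(0.87, factors, "icu")
print(emergency_view)
```

The point of the sketch is that the score and the findings stay the same across settings; only the selection and presentation change.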

Doctors don't distrust AI. They distrust anything they can't interrogate, can't verify, and can't explain to their colleagues. Build the explanation into the core output — not as an optional tooltip, not as a secondary screen, but as the primary interface.

Lesson 2: Workflow placement matters more than model performance.

This lesson cost us six months and significant engineering resources to learn, and it's the one we now consider the single most important factor in clinical AI adoption.

Our medical imaging AI, QureVision, had excellent performance in standalone testing. Sensitivity and specificity numbers that would make any ML team proud. We deployed it in a partner hospital's radiology department with enthusiasm and confidence.

Usage was minimal. After two weeks, only 3 out of 11 radiologists were using it regularly. After a month, even those three had dropped to occasional use.

The problem had nothing to do with the model. The problem was that QureVision lived in a separate tab: a standalone web application that radiologists had to actively navigate to, outside their PACS viewer. Radiologists are already managing 50 to 80 studies per shift, each requiring focused attention, dictation, and sign-off. Their entire cognitive workflow is built around PACS. Asking them to context-switch to a separate application, no matter how good the AI inside it was, meant asking them to add friction to an already overloaded process.

We went back to the engineering drawing board and rebuilt QureVision as a PACS-embedded module. AI-flagged findings now appear as overlay annotations directly on the scan itself — subtle highlights that draw the radiologist's eye to potential areas of concern without interrupting their reading flow. Priority indicators automatically push critical cases to the top of the worklist. A small confidence badge in the corner shows the AI's assessment without demanding attention.

Same model. Same accuracy. Completely different adoption curve. Within two weeks of the embedded deployment, all 11 radiologists were engaging with it. Within a month, they were reporting that they felt uncomfortable reading studies without it — the same tool they had ignored a month earlier.

The lesson is unambiguous: if your AI requires the clinician to change their workflow, add a step, open a new application, or context-switch in any way, you've already lost. It doesn't matter how good the model is. Clinical workflows are deeply ingrained, time-pressured, and optimised over years of practice. Your AI must meet clinicians exactly where they already are — inside their existing tools, at the exact moment they need the information, in a format that requires zero additional effort to consume.

This principle now drives every product decision at Quremarvel. Before we write a single line of model code, we map the clinical workflow in granular detail — observing clinicians, timing each step, identifying the exact moment where an AI insight would be most valuable and least disruptive. The model comes second. The workflow comes first.

Lesson 3: False positives are far more expensive than you think.

In a research paper, a false positive is a number in a confusion matrix. It's a statistical trade-off that you optimise based on your objective function. In a hospital, a false positive is a cascade of real-world consequences that compound rapidly.

A false positive on a chest X-ray AI means an unnecessary CT scan, radiation exposure, patient anxiety, radiologist time spent on a follow-up read, and a slot taken from someone who actually needs the scan. A false positive sepsis alert means a nurse interrupting a physician, a review of vitals that were actually fine, a potential blood culture draw that wasn't needed, and — most importantly — one more alert in a stream of alerts that the clinician learns to ignore.

That last consequence is the most dangerous. Alert fatigue is one of the best-documented problems in clinical informatics. When a system produces too many false alarms, clinicians develop a reflexive habit of dismissing alerts without reading them. The true positives get lost in the noise. The AI becomes the system that cries wolf.

Early versions of QureAssist had a sensitivity-first approach. We tuned the model to catch as many true positives as possible, accepting a higher false positive rate on the theory that it's better to over-alert than to miss something. Clinically, this logic seemed sound. In practice, it was catastrophic.

Within weeks of deployment, we observed clinicians developing exactly the alert fatigue pattern we should have anticipated. Notifications were being dismissed in under two seconds — faster than anyone could have actually read them. The tool that was supposed to catch missed diagnoses was itself being missed.
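A signal like "dismissed in under two seconds" is straightforward to monitor. The sketch below is a hypothetical illustration, not our production telemetry: the `AlertEvent` fields and the function name are assumptions, and the two-second cutoff is simply the figure quoted above, exposed as a parameter.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertEvent:
    shown_at: float      # seconds since an arbitrary reference time
    dismissed_at: float  # seconds since the same reference time


def reflexive_dismissal_rate(events: List[AlertEvent],
                             min_read_seconds: float = 2.0) -> float:
    """Fraction of alerts dismissed faster than anyone could plausibly read them."""
    if not events:
        return 0.0
    fast = sum(1 for e in events
               if e.dismissed_at - e.shown_at < min_read_seconds)
    return fast / len(events)


# Made-up events: three of the four are dismissed in under two seconds.
events = [
    AlertEvent(0.0, 0.8),
    AlertEvent(10.0, 11.5),
    AlertEvent(20.0, 29.0),
    AlertEvent(40.0, 41.9),
]
rate = reflexive_dismissal_rate(events)
print(f"{rate:.0%} of alerts dismissed reflexively")
```

Tracked per ward and per week, a rising value of this rate would be an early warning that an alert stream is being tuned out.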

We had to fundamentally rethink our approach to calibration. The key insight was that the optimal sensitivity-specificity trade-off is not a universal constant — it varies dramatically by clinical context, patient population, and the downstream cost of each error type.

An ICU setting can tolerate a higher false positive rate because the cost of missing a true deterioration — a patient coding, a preventable death — is catastrophic and immediate. The clinical team is already in high-vigilance mode, and the incremental burden of investigating one more alert is manageable.

A primary care screening tool, on the other hand, needs much higher specificity. A false positive in a screening context triggers referrals, imaging, specialist consultations, patient anxiety, and follow-up appointments — all of which compound across thousands of patients. The downstream burden is enormous.

An emergency department falls somewhere in between, but with its own constraint: the AI needs to be both right and fast, because there's no time for a second look.

We now calibrate every model deployment specifically for its clinical environment, using locally validated thresholds rather than benchmark-derived defaults. It's more work. It requires close collaboration with the clinical team at each site. But it's the difference between a tool that gets used and a tool that gets muted.
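One simple way to express "the right threshold depends on the cost of each error" is to pick the alert threshold that minimises total cost on locally labelled data. The sketch below assumes that framing; it is not a description of our calibration pipeline. The function name, the tiny dataset, and the cost ratios are all hypothetical, and real costs would have to come from the clinical team at each site.

```python
from typing import List


def pick_threshold(scores: List[float], labels: List[int],
                   cost_fn: float, cost_fp: float) -> float:
    """Return the alert threshold with the lowest total error cost.

    An alert fires when score >= threshold. Ties go to the lowest threshold.
    """
    # Every observed score is a candidate, plus "never alert".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = candidates[0], float("inf")
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


# Made-up validation data: model scores and whether deterioration occurred.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Illustrative cost ratios only: a miss dominates in the ICU,
# a false alarm dominates in screening.
icu_threshold = pick_threshold(scores, labels, cost_fn=20, cost_fp=1)
screening_threshold = pick_threshold(scores, labels, cost_fn=3, cost_fp=5)
print(icu_threshold, screening_threshold)
```

With the same model and the same scores, the ICU costs select a lower threshold (more alerts, fewer misses) and the screening costs select a higher one (fewer alerts, higher specificity).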

Lesson 4: Validation must happen in their hospital, with their patients, on their data.

This lesson emerged gradually from a pattern we kept seeing: models that performed beautifully on public datasets and well-curated research cohorts would underperform — sometimes significantly — when deployed in a new hospital environment.

The reasons were always local, always specific, and almost never visible in the training data. Different imaging equipment producing subtly different scan characteristics. Different documentation practices meaning the same clinical finding was recorded in different ways. Different patient demographics with different disease prevalence and comorbidity profiles. Different lab reference ranges. Different formularies. Different nursing documentation habits. Different EHR configurations that structured the same data differently.

Each of these differences introduces a distribution shift — a gap between what the model learned and what it's now seeing. Some shifts are small enough to be inconsequential. Others are large enough to meaningfully degrade performance. And you can't tell which is which without testing on local data.

This realisation fundamentally changed our deployment process. Now, every Quremarvel deployment begins with a mandatory local validation phase. Before any product goes live, we run it on the hospital's own historical data — typically 6-12 months of de-identified records. We measure performance against their specific patient population, compare results to their existing clinical benchmarks, and identify any areas where the model underperforms relative to its general performance.

If gaps are found, and they often are, we adapt the model to local data before going live. This might mean adjusting thresholds, reweighting certain features, or in some cases retraining specific model components on the local distribution.
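The core of a local validation check can be sketched in a few lines: compute the metrics on the hospital's own labelled history and flag any that fall short of the model's general performance. This is a simplified illustration, not our validation suite. The reference figures, the tolerance, and the function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def sensitivity_specificity(predictions: List[int],
                            labels: List[int]) -> Tuple[float, float]:
    """Compute sensitivity and specificity from binary predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def validation_gaps(local: Dict[str, float], reference: Dict[str, float],
                    tolerance: float = 0.05) -> Dict[str, float]:
    """Return metrics where local performance trails the reference by more than tolerance."""
    return {
        name: reference[name] - local[name]
        for name in reference
        if reference[name] - local[name] > tolerance
    }


# Made-up local history: model alerts versus confirmed outcomes.
predictions = [1, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

sens, spec = sensitivity_specificity(predictions, labels)
local = {"sensitivity": sens, "specificity": spec}
reference = {"sensitivity": 0.90, "specificity": 0.78}  # hypothetical general performance

gaps = validation_gaps(local, reference)
print(gaps)
```

In this toy example only sensitivity is flagged, which would point toward a threshold adjustment or retraining on local cases before go-live.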

This adds time to the deployment cycle, typically two to four weeks of additional validation work. Some sales teams hate it because it slows the time to revenue. But it's one of the most important factors in building clinician trust.

When a physician sees that the AI was validated on 10,000 patients from their own hospital — patients they might have treated themselves — the conversation transforms. Skepticism becomes curiosity. "Show me the cases it caught" becomes the first question, not "why should I trust this?"

We've learned to treat local validation not as a deployment obstacle but as a trust-building feature. It's now a core part of the Quremarvel value proposition, and we believe any vendor who skips this step is prioritising speed over safety.

Lesson 5: Give clinicians an off switch — and watch what happens.

This is perhaps the most counterintuitive lesson we've learned, and it's one that every AI company building for clinical environments needs to internalise.

Every Quremarvel product gives the clinician full, unrestricted control to override, dismiss, modify, or disagree with any AI recommendation. There is no scenario in which the AI's output is forced or locked. The clinician is always the final decision-maker. Always.

When we first built this override capability, some members of our team worried it would undermine the AI's utility. If doctors can just ignore it, what's the point? Won't override rates be high? Won't it reduce the AI's measured impact?

The opposite happened. When clinicians feel in control — when they know they can push back, disagree, and override without consequence — they engage more deeply with the AI's output. They read the explanations more carefully. They consider the recommendations more thoughtfully. They develop a calibrated sense of when the AI is likely right and when it might be off. In short, they build a working relationship with the tool, the same way they build working relationships with colleagues whose judgment they learn to trust over time.

In our deployments, the override rate typically starts around 15-20% in the first month. Clinicians are testing the system — pushing back to see how it responds, checking edge cases, comparing its recommendations against their own clinical judgment. By month three, override rates consistently drop to under 5%. Not because we changed the model or pressured anyone. But because the clinician built trust through experience.

We log every override anonymously and treat that data as one of our most valuable feedback signals for model improvement. When a clinician overrides in a case where the AI was actually correct, that's a signal to improve the product: perhaps the explanation wasn't convincing enough, or the presentation wasn't clear. When a clinician overrides correctly, catching something the AI missed, that's gold-standard labelled data that goes back into the training pipeline.
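The two uses of the override log described above, tracking the override rate over time and sorting overrides by who turned out to be right, can be sketched as follows. The `Decision` record, the bucket names, and the sample log are hypothetical; this shows the shape of the analysis, not our logging schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Decision:
    month: int                  # months since go-live, starting at 1
    overridden: bool            # clinician overrode the AI recommendation
    ai_correct: Optional[bool]  # known once the outcome is confirmed, else None


def override_rate_by_month(decisions: List[Decision]) -> Dict[int, float]:
    """Share of AI recommendations overridden in each month."""
    totals: Counter = Counter()
    overrides: Counter = Counter()
    for d in decisions:
        totals[d.month] += 1
        overrides[d.month] += int(d.overridden)
    return {m: overrides[m] / totals[m] for m in sorted(totals)}


def triage_overrides(decisions: List[Decision]) -> Dict[str, List[Decision]]:
    """Sort overrides by what they suggest improving."""
    buckets: Dict[str, List[Decision]] = {
        "review_explanation": [],  # AI was right: was the explanation unclear?
        "candidate_label": [],     # clinician was right: candidate training label
        "pending": [],             # outcome not yet known
    }
    for d in decisions:
        if not d.overridden:
            continue
        if d.ai_correct is None:
            buckets["pending"].append(d)
        elif d.ai_correct:
            buckets["review_explanation"].append(d)
        else:
            buckets["candidate_label"].append(d)
    return buckets


# Made-up log: 1 of 5 decisions overridden in month 1, 1 of 25 in month 3.
log = (
    [Decision(1, True, True)] + [Decision(1, False, None)] * 4
    + [Decision(3, True, False)] + [Decision(3, False, None)] * 24
)

rates = override_rate_by_month(log)
buckets = triage_overrides(log)
print(rates)
```

Keeping the two buckets separate matters: one feeds back into how results are explained and presented, the other into the model's training data.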

The override mechanism turned out to be both a trust-building feature and a model improvement engine. It's now a non-negotiable design principle for every product we ship.

The bottom line.

Clinical AI adoption is not an accuracy problem. It's a trust problem. And trust, in a clinical environment, is built through five specific mechanisms: explainability that tells a clinical story, seamless workflow integration that adds zero friction, context-appropriate calibration that respects the local clinical reality, validation on local data that proves relevance, and clinician autonomy that keeps the human in control.

We're still learning. Every hospital deployment teaches us something new. Every clinician who pushes back on our outputs makes the next version better. Three years in, we're more convinced than ever that the hardest part of clinical AI isn't building the model — it's earning the trust of the people who use it.

These five lessons have shaped every product decision at Quremarvel, and we believe they're essential reading for anyone building AI that's meant to work alongside doctors, not replace them.
