A physician audience in Dallas heard the talk this paper is drawn from in late May. The questions afterward all came back to the same thing. Our hospitals are deploying these tools. Who is checking what they actually do? The honest answer in most institutions today is: no one with the right framework.
That gap is closing fast, but not because anyone has solved it internally. The closing is happening because state attorneys general, plaintiffs' bars, and state legislatures are now defining the framework on the institution's behalf, and they are not asking permission.
This paper is for the hospital leaders whose desks now hold the question that medical staff already started asking. What is our exposure on the tools we have deployed? What do we owe our physicians by way of governance? When does this become something we cannot resolve with internal IT and a quarterly committee?
The adoption curve is already past you
Ambient documentation is the canary in this mine. Adoption is essentially universal at major academic medical centers and moving fast into private practice and community systems. A single vendor has signed more than thirty thousand clinicians across roughly eighty-five healthcare organizations on one product line. Most hospital boards have either approved an ambient deployment or have one pending in 2026 procurement cycles.
The pattern repeats across other categories. AI-assisted clinical decision support for sepsis screening, fall risk, and readmission prediction is mandated at some institutions and being evaluated at most others. AI second-read in radiology is moving from research-grade to standard at academic centers. Algorithmic prior authorization is processing hundreds of thousands of claims a month at the payer level, with denial decisions arriving in clinical inboxes before the institution has reviewed what model produced them.
The medical staff at most institutions is already operating inside this environment. They are not waiting for governance to catch up. They are signing notes generated by models whose training data they cannot inspect, attesting to documentation produced by systems they cannot audit, and accepting denial decisions from algorithms whose objective functions they have not been shown.
"The governance question is not whether to deploy. The question is whether the institution has any framework for assessing what has already been deployed."
The audit committee question coming next year is whether the absence of that framework constitutes a board-level governance failure.
The mental model: it is all the same machine
The technical foundation that makes the rest of the argument coherent is older than most of the people reading this paper. Frank Rosenblatt built the Mark I Perceptron at the Cornell Aeronautical Laboratory in 1958. It was hardware, not software. It used a 400-photocell retina to detect light patterns, passed signals through 512 association units, and weighted those signals with more than 4,000 adjustable potentiometers to produce an output. The machine was trained by adjusting the potentiometer settings up or down based on whether the output was right or wrong.
Every AI system your hospital is encountering today is the same machine. Modern large language models have hundreds of billions of parameters instead of 4,096. The operations are identical. Inputs come in. They are multiplied by weights. The weighted signals are summed. A threshold determines the output. There is no reasoning happening anywhere in the pipeline. There is no understanding. There is statistical pattern matching against whatever the training data taught the weights to recognize.
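For readers who want to see the mechanism rather than take it on faith, here is a minimal sketch of the loop just described: multiply, sum, threshold, then nudge the weights when the answer is wrong. It is a software toy, not a model of the Mark I hardware. The function names, the learning rate, and the logical-AND task are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, inputs: Sequence[float]) -> int:
    """Multiply inputs by weights, sum them, and apply a threshold at zero."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


def train(
    samples: List[Tuple[Sequence[float], int]],
    epochs: int = 20,
    learning_rate: float = 1.0,
) -> Tuple[List[float], float]:
    """Nudge each weight up or down whenever the output is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0, or +1
            for i, x in enumerate(inputs):
                weights[i] += learning_rate * error * x
            bias += learning_rate * error
    return weights, bias


# Hypothetical toy task: learn the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)
outputs = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
```

Nothing in `predict` inspects meaning. It returns whichever side of the threshold the weighted sum lands on, which is the point the rest of this paper builds on.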
This matters because the failure modes do not change with parameter count. They are architectural. Hallucination is what the architecture does when it encounters input outside its training distribution. Target misalignment is what happens when the loss function the model was trained against does not match the objective the institution actually wants. Both failure modes were present in 1958. They are present today. They cannot be patched out of any individual deployment because they are not bugs. They are the architecture working as designed.
The leadership implication is consequential. Most institutions are evaluating AI deployments as if each system were a discrete vendor decision: contract review, business associate agreement, IT security review, go-live. The architecture argument says the underlying engineering reality is the same across all deployments. A governance framework that does not start from the architecture cannot generalize across the institution's portfolio of AI systems. Without that generalization, every deployment requires its own ad-hoc oversight, and at current adoption velocity, no internal team can keep up.
Failure mode one: hallucination
In October 2024, the Associated Press and ABC News published an investigation sourced from researchers at Cornell, the University of Michigan, the University of Washington, and Carnegie Mellon. The subject was Whisper, OpenAI's open-source speech-to-text model, and its use as the engine inside Nabla, an ambient documentation product. Nabla had over thirty thousand clinicians on the platform across approximately eighty-five organizations, including Mankato Clinic and Children's Hospital Los Angeles.
The Cornell-led study found hallucinations in roughly one percent of audio snippets. A University of Michigan researcher found at least one hallucination in eighty percent of the transcripts he inspected. The hallucinations were not transcription errors. They were fabrications: sentences never said, invented medications, racial commentary, violent content. Nabla's product deleted the original audio after generating the note, meaning there was no audit trail to reconstruct what had actually been said in the room.
OpenAI's own documentation explicitly warned against using Whisper in high-stakes domains. The product was built and deployed against that warning. The institutions using the product were not informed of the warning, nor positioned to evaluate it independently.
The architectural reason is the one Rosenblatt built into the machine in 1958. A perceptron does not know whether it is right. It produces an output. The output is whichever pattern the weighted-sum optimizer learned to associate with this input distribution during training. Confidence is the architecture's only output mode. There is no "I do not know" channel built into the system. When a real-world input falls outside the training distribution (silence, unfamiliar accents, overlapping speech, background noise, terminology the model has not seen), the model does not stop. It outputs the closest match. Fluently. Confidently. In coherent sentences.
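The missing "I do not know" channel can be sketched in a few lines. What follows is a toy nearest-match classifier with a softmax on top. It is not Whisper and not any vendor's product. The three-word vocabulary, the two-dimensional "acoustic" prototypes, and the input signals are invented purely to show the behavior.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical vocabulary with made-up 2-D prototypes standing in for learned patterns.
PROTOTYPES: Dict[str, Tuple[float, float]] = {
    "aspirin": (1.0, 0.0),
    "insulin": (0.0, 1.0),
    "warfarin": (-1.0, 0.0),
}


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that always sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transcribe(signal: Tuple[float, float]) -> Tuple[str, float]:
    """Return the best-matching word and the probability assigned to it."""
    words = list(PROTOTYPES)
    # Score each word by negative squared distance from the signal to its prototype.
    scores = [
        -((signal[0] - p[0]) ** 2 + (signal[1] - p[1]) ** 2)
        for p in PROTOTYPES.values()
    ]
    probs = softmax(scores)
    best = max(range(len(words)), key=lambda i: probs[i])
    return words[best], probs[best]


in_distribution = transcribe((0.9, 0.1))   # close to the "aspirin" prototype
far_outside = transcribe((40.0, -35.0))    # resembles nothing in the vocabulary
```

In this sketch the input that resembles nothing still comes back as a vocabulary word, and with a higher probability than the clean input, because the softmax only compares candidates against each other. Abstaining is not one of the candidates unless someone builds it in.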
"This is not a bug. It is the architecture functioning as designed in a context where we expected something the architecture cannot provide."
Two implications for institutional leadership. First, ambient scribes require substantive note review, not signature review. If a clinician signs without reading the body of the note against memory of the visit, the institution is accepting whatever the model produced. The model has no way to flag uncertainty because the model has no way to represent uncertainty. Whether this distinction is operationalized in workflow is a governance decision, not a technology decision. Second, this problem does not get solved by waiting for better models. Bigger models hallucinate differently (sometimes less often, sometimes more plausibly) but the architectural reality is unchanged. Vendors who promise this will be fixed are making claims their own engineers will not make on the record.
Failure mode two: target misalignment
The second failure mode is structurally distinct from hallucination but produces equally consequential errors.
In November 2023, STAT News published an investigation into UnitedHealth's nH Predict, a model originally developed by NaviHealth (acquired by UnitedHealth in 2020) and used to predict how long Medicare Advantage patients would require post-acute care. The model's predictions were used to authorize or deny coverage for skilled nursing, inpatient rehabilitation, and home health services. When clinicians disagreed with the model's prediction and recommended continued care, the model's recommendation prevailed in the denial decision. When patients appealed those denials, the reversal rate was approximately ninety percent.
In March 2023, ProPublica reported on Cigna's PXDX, an automated claim review system. In a two-month window in 2022, PXDX processed over three hundred thousand claim denials. The average physician review time per denial was 1.2 seconds. The physicians were not reading the claims. They were batch-signing what the algorithm flagged. Both cases are now in active class action litigation, including Barrows v. UnitedHealth Group, filed in Minnesota in November 2023, and parallel suits against Cigna.
Both models worked exactly as designed. Neither malfunctioned. The targets they were trained against were the targets they hit. A neural network is trained against a loss function: a mathematical expression of what the model is being rewarded for getting right. The model is extraordinarily good at hitting whatever target you set. If you train a model to predict "appropriate length of stay" using training data that reflects historical payer denial patterns rather than clinical recovery patterns, the model will reproduce those denial patterns. It is not failing. It is succeeding at the wrong thing.
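A small sketch makes the point concrete. The same fitting code is run twice on the same patients, once against days the payer authorized and once against days the patient clinically needed. The records are synthetic and the "model" is deliberately the simplest one possible (a per-diagnosis mean, which is the constant that minimizes squared error). Nothing here is drawn from nH Predict or any real system.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Synthetic, made-up records: (diagnosis, days_payer_authorized, days_clinically_needed)
RECORDS: List[Tuple[str, int, int]] = [
    ("hip_fracture", 14, 21),
    ("hip_fracture", 15, 24),
    ("hip_fracture", 13, 20),
    ("stroke", 18, 30),
    ("stroke", 20, 34),
]


def fit(labels: List[Tuple[str, float]]) -> Dict[str, float]:
    """Per-diagnosis mean: the constant prediction that minimizes squared-error loss."""
    by_dx: Dict[str, List[float]] = defaultdict(list)
    for dx, days in labels:
        by_dx[dx].append(days)
    return {dx: mean(values) for dx, values in by_dx.items()}


# Same patients, same code, different training target.
payer_model = fit([(dx, authorized) for dx, authorized, _ in RECORDS])
clinical_model = fit([(dx, needed) for dx, _, needed in RECORDS])

# Days of care separating the two "correct" answers for each diagnosis.
gap = {dx: clinical_model[dx] - payer_model[dx] for dx in payer_model}
```

Both fitted models are accurate against their own labels. The gap between them comes entirely from the choice of target, which is why the procurement question has to reach the loss function and the labels behind it.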
This is the failure mode no amount of technical remediation can solve, because there is no technical defect. The question is what the model was built to do and on whose behalf. From a governance perspective, the implication is sharper: the procurement question for any clinical AI system has to include an explicit answer to what the loss function was, and whose interest it represents. Most institutional procurement processes do not currently ask either question. The vendor contracts do not require disclosure. The clinical staff cannot evaluate what they cannot see.
The ninety percent reversal rate on appeal in the nH Predict case is the model's report card. The system is telling the institution the model is wrong, in detail, every time a denial is overturned. The institution that aggregates and reviews its denial patterns has a governance asset. The one that does not has a liability building quietly on its balance sheet.
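Aggregating that report card does not require special tooling. The sketch below computes the share of appealed denials that were reversed, grouped by service line. The `Denial` record layout, the sample log, and the fifty percent review threshold are hypothetical; a real institution would map these fields to whatever its denial-management system actually exports.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Denial:
    service: str      # e.g. "skilled_nursing"
    appealed: bool
    overturned: bool  # only meaningful when appealed is True


def overturn_rates(denials: List[Denial]) -> Dict[str, float]:
    """Share of appealed denials that were reversed, per service line."""
    appealed = Counter(d.service for d in denials if d.appealed)
    reversed_count = Counter(
        d.service for d in denials if d.appealed and d.overturned
    )
    return {service: reversed_count[service] / n for service, n in appealed.items()}


# Made-up sample log for illustration only.
LOG = [
    Denial("skilled_nursing", True, True),
    Denial("skilled_nursing", True, True),
    Denial("skilled_nursing", True, False),
    Denial("skilled_nursing", False, False),
    Denial("home_health", True, True),
    Denial("home_health", True, False),
]
rates = overturn_rates(LOG)
# Hypothetical threshold: send any service line at or above 50% to committee review.
flagged = [service for service, rate in rates.items() if rate >= 0.5]
```

Denials that were never appealed are excluded from the denominator here, so a low appeal rate can hide a problem. A fuller version would track the appeal rate alongside the overturn rate.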
The ambient scribe: four specific failure modes
Ambient documentation deserves its own treatment because it concentrates both failure modes in a single product, and because it is the AI system most likely to be inside the institution's walls right now.
| Failure Mode | Consequence |
|---|---|
| Multi-party consent capture | State wiretapping exposure for out-of-state patients; the BAA does not cover it |
| Hallucinated consent | Falsified record signed by physician (Sharp HealthCare litigation) |
| Decimal-place automation bias | Learned-intermediary doctrine; physician owns the medication error |
| Note bloat and chart contradictions | Payer audit and malpractice exposure on subpoenaed records |
Multi-party consent capture. The clinical encounter regularly includes more than two voices: the patient, a family member, a medical interpreter, a medical assistant, a spouse on speakerphone. All of those voices are captured and uploaded to the vendor's cloud infrastructure. Texas is a one-party consent state for general recording, which covers a Texas patient with a Texas clinician in a Texas exam room when the patient has been informed. But patients travel. California, Florida, Pennsylvania, Maryland, Washington, and others require all-party consent. The institution's HIPAA-compliant business associate agreement with the vendor does not address this exposure. BAAs operate under HIPAA. Wiretapping claims do not.
Hallucinated consent. In the Sharp HealthCare litigation, the ambient-generated note stated that the patient had been informed about the scribe and had verbally consented, when no such conversation occurred. The model had inserted a boilerplate consent script into the chart. That is not a transcription error. That is a falsified record, signed by a physician.
Decimal-place automation bias. The model is accurate ninety-five to ninety-eight percent of the time on a typical day. That accuracy is high enough to lull a clinician into signature review instead of substantive review. Defense panels have documented real cases where "we will start you on 0.5 milligrams" became "5 milligrams" in the note, and the physician signed it. Under the learned-intermediary doctrine, the physician owns the error. The vendor's contractual indemnification language does not survive contact with a malpractice plaintiff.
Note bloat and chart contradictions. The model captures everything in the room, including side conversations and stable problems mentioned in passing. Chit-chat about a past condition lands in the active history of present illness. The chart becomes internally contradictory. Coders flag the contradictions. Payer auditors flag them. Plaintiffs' attorneys love them. A subpoenaed chart full of automated contradictions is devastating to a denial defense and to a malpractice defense.
The common thread across all four: the failure is upstream of the physician. The liability is downstream of the signature. The institution that does not have a governance answer for this dynamic is accepting a class of exposure its existing risk management framework was not designed to address.
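One upstream control for the decimal-place failure mode can be sketched simply: compare the doses that appear in the generated note with the doses in a retained source text, and flag any that are unsupported. This assumes a source text exists to compare against, which is not a given when a product deletes the audio. The regular expression, the unit list, and the function names are illustrative assumptions and would miss many real-world phrasings (a dose spoken as "point five", for example).

```python
import re
from typing import List, Set, Tuple

# Illustrative pattern: a number followed by a small, hypothetical set of unit spellings.
DOSE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(mg|milligrams?|mcg|micrograms?)\b", re.IGNORECASE
)
UNIT_ALIASES = {
    "milligram": "mg",
    "milligrams": "mg",
    "microgram": "mcg",
    "micrograms": "mcg",
}


def extract_doses(text: str) -> Set[Tuple[float, str]]:
    """Collect (amount, normalized unit) pairs found in the text."""
    doses = set()
    for amount, unit in DOSE_PATTERN.findall(text):
        unit = unit.lower()
        doses.add((float(amount), UNIT_ALIASES.get(unit, unit)))
    return doses


def unsupported_doses(source_text: str, note_text: str) -> List[Tuple[float, str]]:
    """Doses that appear in the note but nowhere in the source text."""
    return sorted(extract_doses(note_text) - extract_doses(source_text))


spoken = "We will start you on 0.5 milligrams once daily."
note = "Plan: start 5 mg once daily."
flags = unsupported_doses(spoken, note)
```

A check like this does not replace substantive review. It only moves one class of error in front of the signature instead of behind it.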
The legal landscape has already shifted
As of May 2026, three active class actions illustrate the direction plaintiffs' counsel has chosen.
Washington v. Sutter Health, filed in the Northern District of California, concerns the Abridge ambient documentation platform. Patients allege that sensitive medical conversations were captured and transmitted to the vendor without clear prior patient consent. The pleaded theory is state wiretapping, not HIPAA violation. That is a deliberate choice. HIPAA has no private right of action. State wiretapping statutes do, with statutory damages and class certification potential.
The Sharp HealthCare class action is built on the same wiretapping theory and is also where the hallucinated-consent fact pattern was first exposed in litigation. The Heartland Dental class action applies the same legal theory to two-party and all-party consent statutes outside the acute care setting.
The pattern is more important than any single case. Plaintiffs' attorneys have figured out that HIPAA is the wrong door. State wiretapping and consumer privacy law are the right door. The institution's HIPAA-compliant BAA with the vendor does not address these claims.
The regulatory side is moving in parallel. The FDA has, to date, treated most ambient scribes as administrative clinical decision support rather than as regulated medical devices. There is a regulatory gap. States are filling it. California's AB 489 explicitly prohibits AI tools from masking or misleading patients about automated system involvement. Comparable legislation is moving in other states. Multi-state legislative trends are pushing toward mandatory documented human-in-the-loop validation as a statutory requirement, not a best practice.
The practical implication for institutional leadership is sharp. HIPAA compliance is the floor, not the ceiling. The institution's compliance team needs to be able to answer three questions for every deployed clinical AI system. What is the institution's exposure under state wiretapping statutes for out-of-state patients? Is the institution capturing multi-party audio without all-party consent? Where is the documented patient consent workflow that is distinct from a verbal mention in the exam room? If the answer to any of these is "we are relying on the BAA," the institution has a problem.
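The second and third questions can be turned into a per-encounter check. The sketch below flags any recorded participant tied to an all-party-consent state who has no documented consent on file. It is an illustration of the workflow, not legal advice: the state set contains only the states named in this paper (which says there are others), using a participant's home state as the trigger is a simplifying assumption, and the record fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Only the all-party-consent states named in this paper. Deliberately incomplete.
ALL_PARTY_STATES_NAMED_IN_PAPER = {"CA", "FL", "PA", "MD", "WA"}


@dataclass(frozen=True)
class Participant:
    role: str                 # "patient", "interpreter", "family member", ...
    home_state: str           # two-letter code; a simplifying proxy for which law applies
    consent_documented: bool  # a recorded consent step, not a verbal mention


def consent_gaps(participants: List[Participant]) -> List[Participant]:
    """Participants tied to a listed all-party state with no documented consent."""
    return [
        p
        for p in participants
        if p.home_state in ALL_PARTY_STATES_NAMED_IN_PAPER
        and not p.consent_documented
    ]


# Hypothetical encounter in a Texas exam room.
VISIT = [
    Participant("patient", "TX", True),
    Participant("family member on speakerphone", "CA", False),
    Participant("interpreter", "TX", False),
]
gaps = consent_gaps(VISIT)
```

The check presumes the institution knows who was in the room and where they are from. If it does not capture that, the gap is in the intake workflow rather than in the code.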
The standard of care will move without you
Malpractice law contains a doctrine called custom, or standard of care. What constitutes acceptable practice depends on what the community of reasonably competent physicians is doing. When the community shifts, the standard shifts.
The community is shifting now. Ambient scribes are moving from early-adopter to majority. AI second-read in radiology is moving from research to clinical norm at academic centers. Algorithmic decision support is mandated at some institutions for sepsis screening, fall risk, and readmission prediction. None of those movements is waiting for institutional governance to consent.
Three implications fall out of this for hospital leadership. The early adopter without governance becomes the visible test case. When a hallucination harms a patient, the physician who signed the note is the named defendant. The vendor has indemnification language. The physician usually does not. The institution that deployed the tool without governance is structurally next to the physician on the deposition list.
The cautious refuser becomes the deviant. Once enough peers adopt a tool that demonstrably reduces missed diagnoses, the refuser is no longer being conservative. The refuser is below the standard. This is the dynamic that played out with electronic health records, with safety checklists, with computerized provider order entry. The standard moves. The institution moves with it or gets named in the suit that establishes the new standard.
"There is no neutral position. The standard moves whether or not the institution participates."
The institution's responsibility is to move with informed governance, not to refuse and wait.
What outside help actually does
The institutional question that follows naturally is: can we solve this with internal resources? For some institutions the honest answer is yes, eventually. For most the honest answer is no, not at the velocity the deployment curve is requiring.
The reasons are structural. Internal IT is built to evaluate vendor software for security, integration, and uptime. None of those competencies overlap with assessing whether a model's training distribution matches the deployment population, scoring failure modes against severity and detection, or reviewing whether a loss function reflects clinical priorities or payer priorities. Internal compliance is built to monitor HIPAA and Joint Commission, both of which are insufficient for the legal landscape described above. Internal quality and patient safety teams have the right vocabulary (RCA, FMEA, sentinel event review) but have not yet been asked to apply it to AI systems. The committees that should own this oversight are still being formed at most institutions.
Outside help, when it works, does three things internal teams typically cannot do at speed. It brings a framework that is independent of any vendor relationship, which matters when the vendor relationship is itself the subject of the assessment. It applies a methodology refined across multiple institutions, which is faster than developing one from scratch. It produces deliverables (a risk-priority-number-scored failure-mode inventory, a mitigation roadmap tied to existing governance, a board-ready briefing) in a vocabulary the board's audit committee can act on.
This is the work that Cortivus SAFE AI (Safety Assessment Framework for AI) was designed for. The framework brings two decades of patient safety practice (root cause analysis and failure mode and effects analysis) to the AI systems entering clinical workflows. The output is recognizable to any quality department: severity, occurrence, detection, RPN, mitigation, escalation. The vocabulary is the one the institution's quality committee already speaks. The novelty is the application to a new class of system that was not in scope when the existing safety committees were formed.
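For boards less familiar with FMEA scoring, the arithmetic is simple. In conventional FMEA practice the risk priority number is the product of the three ratings:

$$\text{RPN} = \text{severity} \times \text{occurrence} \times \text{detection}$$

The sketch below applies that formula to the four ambient scribe failure modes named earlier. The 1-to-10 scales are the common FMEA convention and are assumed here, and every score is invented for illustration. This is not the SAFE AI worksheet, whose internal format this paper does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always caught) to 10 (almost never caught)

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1-10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be 1-10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Made-up scores, for illustration only.
INVENTORY: List[FailureMode] = [
    FailureMode("Hallucinated consent", 8, 3, 8),
    FailureMode("Decimal-place dose error", 10, 2, 7),
    FailureMode("Note bloat and chart contradictions", 5, 6, 4),
    FailureMode("Multi-party consent capture", 7, 6, 6),
]
ranked = sorted(INVENTORY, key=lambda fm: fm.rpn, reverse=True)
```

Because detection is scored high when a failure is hard to catch, a failure that signature review would miss ranks above an equally severe one that a coder or pharmacist reliably intercepts.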
The decision about whether to bring in outside help is not a binary. The honest question is which engagements warrant outside discipline and which can be absorbed internally. The institution with a single ambient documentation deployment may need a two-week Quick Read on that system and nothing more. The institution with five active clinical AI systems and three more in procurement needs a broader AI Inventory of Record, scoped against state regulatory exposure, and probably an ongoing monitoring retainer. The right level scales with the deployment portfolio.
Where to start
The most natural starting point is the most visible system already in production: the ambient documentation deployment. A two-week SAFE AI Quick Read produces a worksheet with RPN-scored failure modes specific to the deployment, a mitigation roadmap tied to existing quality and risk governance, and an executive briefing for the institution's AI committee or board. It is also the fastest way to see what an outside framework looks like applied to the institution's environment, before committing to a broader engagement.
For institutions with multiple clinical AI systems already in production, the AI Inventory of Record provides a defensible baseline. Every system in clinical use is cataloged with risk tier, vendor accountability status, governance gap analysis, and regulatory alignment. The deliverable is a living document, refreshed annually, that anchors the institution's AI governance posture for audit committees, regulators, and the institution's own legal counsel.
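As a rough picture of what one row in such an inventory might hold, here is a minimal sketch. The field names mirror the categories listed above, but the record layout, the tier labels, the sample entries, and the 365-day refresh window are assumptions for illustration, not the actual deliverable format.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class InventoryEntry:
    system: str
    risk_tier: str              # hypothetical labels: "high", "medium", "low"
    vendor_accountability: str  # free-text status, e.g. what the vendor has disclosed
    last_reviewed: date
    governance_gaps: List[str] = field(default_factory=list)
    regulatory_alignment: List[str] = field(default_factory=list)


def overdue_for_refresh(
    entries: List[InventoryEntry], today: date, max_age_days: int = 365
) -> List[str]:
    """Systems whose last review is older than the refresh window."""
    cutoff = today - timedelta(days=max_age_days)
    return [e.system for e in entries if e.last_reviewed < cutoff]


# Made-up entries for illustration only.
INVENTORY = [
    InventoryEntry(
        system="ambient scribe",
        risk_tier="high",
        vendor_accountability="training data not disclosed",
        last_reviewed=date(2025, 2, 10),
        governance_gaps=["no documented patient consent workflow"],
        regulatory_alignment=["state wiretapping review pending"],
    ),
    InventoryEntry(
        system="sepsis screening alert",
        risk_tier="high",
        vendor_accountability="validation report on file",
        last_reviewed=date(2026, 1, 15),
    ),
]
stale = overdue_for_refresh(INVENTORY, today=date(2026, 5, 31))
```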
Neither engagement requires the institution to predetermine the answer. Both begin with the same thirty-minute scoping call. The conversation produces a recommendation about which engagement fits the institution's environment and timeline.
The cost of waiting is not theoretical. It is the next subpoenaed chart, classified as a sentinel event after the fact, reviewed in vocabulary that should have been applied before.
Start a SAFE AI Quick Read
Two-week assessment of your ambient documentation deployment. Worksheet with RPN-scored failure modes, mitigation roadmap, executive briefing. The most common entry point to the Cortivus practice.
About Cortivus
Cortivus is an independent clinical AI safety, audit, and governance practice for hospitals deploying AI into care. The firm's proprietary methodology, Cortivus SAFE AI (Safety Assessment Framework for AI), brings two decades of patient safety practice (root cause analysis and failure mode and effects analysis) to the AI systems entering clinical workflows. Cortivus is an NVIDIA Inception and Google for Startups member, founded by Troy Sybert, MD, MPH, who is board certified in preventive medicine and public health and in clinical informatics.
This paper was adapted from a briefing delivered at the 9th Annual inPHYnity Conference of the Collin-Fannin County Medical Society, Dallas, TX, in May 2026.
Sources
- Burke G, Schellmann H. "Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said." Associated Press / ABC News, October 26, 2024.
- Ross C, Herman B. "UnitedHealth pushed employees to follow an algorithm to cut off Medicare patients' rehab care." STAT News, November 14, 2023.
- Rucker P, Miller M, Armstrong D. "How Cigna saves millions by having its doctors reject claims without reading them." ProPublica, March 25, 2023.
- Barrows v. UnitedHealth Group. Class action complaint, D. Minnesota, November 2023.
- Washington v. Sutter Health. Class action complaint, N.D. California, 2025.
- Sharp HealthCare class action. 2024-2025.
- Heartland Dental class action. 2024-2025.
- California Assembly Bill 489 (AB 489).
- Holland & Knight; ABA Health Law Section; Becker's Hospital Review; Reuters Legal; Fisher Phillips; McAfee & Taft. 2024-2026 reporting.
- Rosenblatt F. "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, 1958.