1. The Challenge of Unstructured Clinical Data
The digitization of healthcare has created a paradox of abundance and inaccessibility. While Electronic Health Records (EHRs) have successfully centralized patient data, a staggering proportion of high-value clinical information remains locked within unstructured free-text narratives. Clinical notes--ranging from discharge summaries and echocardiogram reports to psychiatric evaluations and progress notes--contain the nuanced details necessary for precise phenotyping, quality improvement, and longitudinal research. However, the extraction of structured variables from these narratives presents a formidable computational challenge, complicated by the idiosyncratic nature of medical language, the prevalence of abbreviations, and the critical importance of context.
The stakes of extraction accuracy are exceptionally high. In the domain of cardiology, the Left Ventricular Ejection Fraction (LVEF) is a singular metric that determines eligibility for life-saving therapies, yet it is frequently buried in the "Findings" section of an echo report, often surrounded by historical values or hypothetical targets.1 Similarly, in mental health, scores from instruments like the Mini-Mental State Examination (MMSE) or the Patient Health Questionnaire-9 (PHQ-9) are vital for tracking cognitive decline and depression severity but are often documented in ad-hoc templates or narrative prose that defies simple parsing.3
Historically, the field of Clinical Information Extraction (IE) has been dominated by rule-based systems, specifically Regular Expressions (RegEx), which offer deterministic control but suffer from brittleness. The advent of Deep Learning introduced Transformer-based encoder models like BERT (Bidirectional Encoder Representations from Transformers), which revolutionized Named Entity Recognition (NER) by capturing semantic context. Most recently, the emergence of Large Language Models (LLMs) has introduced a third paradigm: generative extraction, characterized by reasoning capabilities and zero-shot flexibility but plagued by novel risks such as hallucination.5
This report provides an exhaustive analysis of these three methodologies--Rule-based, BERT, and LLM--specifically focusing on the extraction of numerical and scored clinical variables. We analyze the distinct error patterns inherent to each architecture, provide a rigorous taxonomy of failure modes, and present a strategic framework for implementation that balances accuracy, cost, and computational latency.
2. Architectural Paradigms in Clinical Extraction
To understand the failure modes of clinical extraction systems, one must first dissect the mechanisms by which they process text. The three dominant paradigms operate on fundamentally different principles, each conferring specific advantages and limitations regarding clinical data.
2.1 Rule-Based Systems: The Deterministic Foundation
Rule-based systems, primarily leveraging Regular Expressions (RegEx) and lexicon-matching algorithms like NegEx, represent the traditional bedrock of clinical NLP. These systems operate on symbolic logic, scanning text for rigid character patterns defined by human experts.
The mechanism is purely deterministic. For identifying an Ejection Fraction, a system might employ a pattern such as LVEF\s*[:=]\s*(\d+)\s*%. This approach offers absolute interpretability; every extraction can be traced to a specific line of code, and the computational overhead is negligible, allowing millions of notes to be processed in minutes on standard hardware.2 Libraries like spaCy's EntityRuler and Apache cTAKES have operationalized these rules into robust pipelines.8
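Such a rule takes only a few lines of Python. The sketch below is illustrative (the label synonyms and separator list are assumptions, not taken from any cited system), and it also demonstrates why recall suffers on narrative phrasing:

```python
import re
from typing import Optional

# Illustrative pattern: "LVEF"/"EF" label, an optional separator, a 1-3 digit value, "%"
LVEF_PATTERN = re.compile(
    r"\b(?:LVEF|EF)\s*(?:[:=]|is|of)?\s*(\d{1,3})\s*%",
    re.IGNORECASE,
)

def extract_lvef(note: str) -> Optional[int]:
    """Return the first LVEF percentage found, or None if no pattern matches."""
    match = LVEF_PATTERN.search(note)
    return int(match.group(1)) if match else None
```

extract_lvef("LVEF: 55%") succeeds, but the narrative phrasing "functioning at roughly 45 percent" returns None: the rigid pattern simply never fires, which is the brittleness discussed in the next paragraph.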
However, the strength of rule-based systems is also their fatal flaw: brittleness. Clinical language is notoriously heterogeneous. A slight variation in phrasing--such as "Left ventricular function is calculated at approximately 45%"--will fail to trigger a rigid pattern. Furthermore, RegEx is inherently context-blind. It struggles to distinguish between "LVEF: 55%" (the current value) and "History of LVEF 55%" (a past value) without the layering of complex, often fragile, logic loops.9 While heuristic algorithms like NegEx have been developed to handle negation by defining "pre-negation" and "post-negation" windows, they frequently fail when sentences become syntactically complex or involve double negatives.10
2.2 BERT and Discriminative Encoders: The Contextual Standard
The introduction of BERT in 2018 marked a paradigm shift from keyword matching to contextual representation learning. In the clinical domain, models like BioBERT and ClinicalBERT were pre-trained on massive biomedical corpora (e.g., PubMed, MIMIC-III), allowing them to learn the specialized vocabulary and syntax of medicine.12
Unlike rule-based systems, BERT models are discriminative; they do not generate text but rather classify existing tokens or sequences. For extraction tasks, they are typically fine-tuned for Named Entity Recognition (NER) or Relation Extraction (RE). The self-attention mechanism allows the model to process an entire sentence bidirectionally, enabling it to resolve ambiguity based on surrounding words. For instance, a BERT model can learn that a number following "Target EF" should not be extracted as the patient's current hemodynamic status, a distinction that usually confounds RegEx.14
The performance of these models is often State-of-the-Art (SOTA) for specific tasks where labeled data is abundant. Recent benchmarks on the i2b2 and n2c2 datasets show that fine-tuned BERT models often outperform zero-shot LLMs in strict F1 scores for entity extraction.16 However, their reliance on supervised learning creates a significant bottleneck: developing a ClinicalBERT model for a new variable requires thousands of manually annotated examples, a resource-intensive prerequisite that limits scalability to niche applications.12
2.3 Large Language Models (LLMs): The Generative Frontier
The emergence of Large Language Models (LLMs) like GPT-4, Llama 3, and Mistral has introduced "Generative Extraction." These models are decoder-only (or encoder-decoder) architectures trained on internet-scale data to predict the next token in a sequence. In clinical applications, they are prompted to read a note and generate a structured output, often in JSON format.5
The primary advantage of LLMs is their reasoning capability and zero-shot performance. An LLM can be instructed to "extract the PHQ-9 score, but only if it was administered today," a complex logical constraint that would require intricate programming in a rule-based system. They excel at normalizing non-standard text, such as converting "moderate depression" to a score range or parsing messy formatting that breaks RegEx templates.19
However, this generative flexibility introduces a critical safety risk: hallucination. Unlike BERT, which can only extract spans of text that exist in the document, an LLM can fabricate values that look plausible but are entirely absent from the source. Studies have documented instances where LLMs generated specific blood pressure readings or cognitive scores simply because the probability of those numbers appearing in that context was high, despite the data being missing from the specific note in question.20 Furthermore, the computational cost of inference for LLMs is orders of magnitude higher than BERT, posing significant economic challenges for health systems processing high volumes of data.13
3. Deep Dive: Extraction Patterns and Errors by Variable Type
The efficacy of extraction methods is not uniform; it varies significantly depending on the nature of the clinical variable. We analyze three distinct categories: hemodynamic parameters (LVEF), cognitive scores (MMSE/MoCA), and psychiatric indices (PHQ-9/GAD-7).
3.1 Hemodynamics: The Case of Ejection Fraction (LVEF)
Left Ventricular Ejection Fraction (LVEF) is a ubiquitous metric in cardiology, typically expressed as a percentage or a range. It serves as a prime example of the "Numeric Extraction" class of problems.
The RegEx Failure Mode: While LVEF seems amenable to pattern matching (e.g., EF = 55%), clinical documentation is rife with linguistic variation that defeats rigid rules. A clinician might write, "The ventricle appears to be functioning at roughly 40 to 45 percent," or use qualitative descriptors like "severely reduced." RegEx systems often miss these narrative descriptions or, worse, extract "40" from "40 mmHg" (a pressure reading) if the proximity rules are loose.9 A study utilizing the CUIMANDREef system found that while rule-based precision can be high (up to 98%), recall drops significantly when notes deviate from standard formatting.2
The Temporal Ambiguity: A pervasive error across all methodologies is temporal confusion. An echocardiogram report often summarizes prior studies for context: "Comparison is made to study from 2020 where EF was 35%." A naive extractor will grab "35%" as the current value. BERT models, when fine-tuned with assertion classification, handle this adeptly by linking the value to the "History" section. LLMs can be prompted to distinguish current from historical values, but they occasionally suffer from "recency bias," prioritizing the last number mentioned regardless of its semantic framing.2
The Qualitative Hallucination: LLMs introduce a unique risk when dealing with qualitative descriptions. If a note states "LVEF is normal," an LLM prompted to return a number might hallucinate a specific value like "55%" or "60%" to satisfy the prompt's data type requirement. While "55%" is indeed normal, extracting it as a measured value is a fabrication that introduces noise into clinical registries.20
3.2 Cognitive Assessment: MMSE and MoCA
Extracting scores from the Mini-Mental State Examination (MMSE) and Montreal Cognitive Assessment (MoCA) presents a "Semantic Disambiguation" challenge. Both tests are scored out of 30, and they often appear in the same clinical note, leading to high rates of attribution error.
The "MoCA as MMSE" Error: Research comparing Llama-2 and GPT-4 found a dominant error mode in which models extracted a MoCA score (e.g., 24/30) and labeled it as an MMSE score. This "Test Attribution Error" stems from the identical scoring range. In one study of 765 notes, Llama-2 committed this error in 19 cases, while GPT-4 made the same mistake in 17 cases.22 This highlights a critical limitation of LLMs: while they understand language, they can be easily confused by numerically identical entities in close proximity.
Component vs. Total Score: Cognitive assessments are often broken down by domain (e.g., "Recall: 3/3, Orientation: 10/10... Total: 26/30"). A rule-based system matching a generic \d+/\d+ pattern can grab a sub-score like "3/3" as the result, and even patterns anchored to /30 misfire when sub-score denominators are implied rather than written. LLMs generally excel here due to their ability to recognize the semantic marker "Total," but they are prone to calculation hallucinations if asked to sum sub-scores that are not explicitly totaled in the text.22
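A hedged sketch of the "Total"-anchored heuristic (the patterns are illustrative, and the fallback deliberately retains the sub-score risk described above):

```python
import re
from typing import Optional

def extract_total_score(note: str) -> Optional[int]:
    """Prefer an explicitly labeled total; fall back to the first x/30 match."""
    labeled = re.search(r"total\s*[:=]?\s*(\d{1,2})\s*/\s*30", note, re.IGNORECASE)
    if labeled:
        return int(labeled.group(1))
    # Fallback risks grabbing a domain sub-score instead of the total
    any_score = re.search(r"(\d{1,2})\s*/\s*30", note)
    return int(any_score.group(1)) if any_score else None
```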
Date Hallucination: A specific hallucination pattern observed in cognitive extraction is "Date Shifting." When prompted to extract the date of the test along with the score, LLMs frequently pull the date of the note encounter rather than the specific date of the exam mentioned in the text (e.g., "Test performed on 10/12" inside a note from 11/01). Llama-2 failed in this specific manner in 23 cases in the aforementioned study, significantly degrading its utility for longitudinal tracking.22
3.3 Psychiatric Indices: PHQ-9 and GAD-7
The extraction of depression (PHQ-9) and anxiety (GAD-7) scores illustrates the "Template and Null Value" challenge. These scores are often embedded in semi-structured templates or checklists within the note.
The "Refused" vs. "Zero" Conflation: A critical error pattern in mental health extraction is the mishandling of null values. If a note states "PHQ-9: Patient refused," a simple RegEx or poorly prompted LLM might default to "0" when no digits are found, or fail to return a value at all. In clinical research, "0" indicates the total absence of depressive symptoms, whereas "refused" indicates missing data; conflating the two biases population health statistics. Studies using negspaCy have shown that explicit handling of terms like "refused" or "unable to complete" is required to prevent this error.11
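The distinction can be enforced with an explicit null-phrase list. This is a minimal sketch (the phrase list, the "missing" sentinel, and the regex are all illustrative choices, not from a cited system):

```python
import re
from typing import Optional, Union

# Phrases that must map to "missing data", never to a numeric score of 0
NULL_PHRASES = ("refused", "declined", "unable to complete", "not administered")

def extract_phq9(note: str) -> Optional[Union[int, str]]:
    """Return an int score, the sentinel 'missing', or None if PHQ-9 is absent."""
    segment = re.search(r"PHQ-?9\s*(?:score)?\s*[:=]?\s*([^\n.;]*)", note, re.IGNORECASE)
    if not segment:
        return None
    value = segment.group(1).strip()
    if any(phrase in value.lower() for phrase in NULL_PHRASES):
        return "missing"  # refused/incomplete: missing data, NOT a score of 0
    digits = re.search(r"\d{1,2}", value)
    return int(digits.group()) if digits else None
```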
Symptom Counts vs. Severity Scores: Notes often summarize findings as "Positive for 5 symptoms on PHQ-9." A naive extraction of "5" would interpret this as "Mild Depression" (Score 5-9). However, 5 positive symptoms often correlate with a score greater than 15 (Moderately Severe to Severe). Rule-based systems struggle profoundly with this distinction. LLMs show promise in interpreting "5 symptoms" correctly if the prompt explicitly instructs them to distinguish between item counts and total scores, but they require careful calibration.3
Template Bleed-Through: Clinical notes often contain empty templates (e.g., "GAD-7 Score: [ ]"). LLMs, trained on pattern completion, have been observed to "fill in" these blanks with plausible averages (e.g., "GAD-7: 0") or extract the brackets themselves. This "Auto-complete Hallucination" is a direct artifact of the causal language modeling objective (predicting the next likely token).20
4. A Taxonomy of Clinical Extraction Errors
To systematically mitigate risks, we categorize the observed failures into a taxonomy of error patterns. Each category requires a distinct "Catch" strategy.
4.1 Hallucination (Fabrication)
Definition: The generation of a value, date, or clinical fact that is completely absent from the source text. This is exclusive to generative models.
Mechanism: The model relies on its pre-trained probability distribution rather than the specific context window. It "guesses" what a note should say based on millions of training examples.
How to Catch It:
Evidence Extraction: Do not just ask for the value. Prompt the LLM to return a JSON object: {"Value": 24, "Evidence_Quote": "Score: 24/30"}.
Verification Function: Implement a post-processing script that searches for the Evidence_Quote string in the original text. If the string is not found verbatim, the extraction is flagged as a potential hallucination.21
Self-Consistency: Run the extraction prompt three times with a non-zero temperature (e.g., 0.7). If the model returns three different numbers, flag the record for manual review.21
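The verification function can be a plain string search over the source note. A sketch, assuming the LLM's JSON output has already been parsed into a dict (the field names Value and Evidence_Quote follow the prompt shown above but are otherwise an assumption):

```python
import re

def verify_evidence(extraction: dict, source_text: str) -> bool:
    """Return True only if the quoted evidence actually occurs in the note.

    A False result flags the extraction as a potential hallucination.
    """
    quote = extraction.get("Evidence_Quote", "")
    # Whitespace-tolerant verbatim check: models often normalize spacing
    pattern = r"\s+".join(re.escape(token) for token in quote.split())
    return bool(quote) and re.search(pattern, source_text) is not None
```

In a pipeline, any record where verify_evidence returns False would be withheld from the registry and routed to manual review.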
4.2 Negation and Status Flipping
Definition: Identifying a clinical entity but failing to recognize that it is negated or ruled out.
Mechanism: In RegEx/NegEx, this occurs when the negation term ("no", "denies") falls outside the pre-defined window around the entity. In BERT, it occurs due to insufficient attention weight on the negation token in long sequences.
How to Catch It:
Dependency Parsing: Use libraries like spaCy to generate a dependency tree. Check if the entity (e.g., "depression") is the object of a negation verb (e.g., "denies"). This handles long-range dependencies better than window-based NegEx.11
Assertion Classification: Do not treat extraction as a binary task. Train a BERT classifier to label entities as Present, Absent, Possible, Conditional, or Hypothetical.10
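For contrast with the dependency-parse approach, here is a deliberately simplified NegEx-style window check (the cue list and window size are illustrative); it inherits exactly the window limitation described in Section 2.1:

```python
import re

NEGATION_CUES = ("no", "denies", "denied", "without", "negative for")

def is_negated(sentence: str, entity: str, window: int = 5) -> bool:
    """True if a negation cue occurs within `window` tokens before the entity."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    target = re.findall(r"[a-z0-9]+", entity.lower())
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            preceding = " " + " ".join(tokens[max(0, i - window):i]) + " "
            return any(f" {cue} " in preceding for cue in NEGATION_CUES)
    return False
```

A cue more than `window` tokens away is silently missed, which is precisely why dependency parsing or assertion classification is preferable for syntactically complex sentences.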
4.3 Temporal and Attribution Dislocation
Definition: Extracting a valid value but assigning it to the wrong patient (e.g., family history), the wrong time (historical), or the wrong attribute (wrong test).
Mechanism: The model identifies the "Signal" (the number) but fails to link it to the correct "Anchor" (the subject or date).
How to Catch It:
Section Segmentation: Pre-process the note using a segmentation model (e.g., a lightweight BERT) to isolate the "Current Assessment" section. Run extraction only on this segment to eliminate historical data from the "History" section.9
Relation Extraction (RE): Use a dedicated RE model to explicitly link the "Value" to the "Test Name." If the link probability is low, discard the extraction.
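A lightweight version of section segmentation needs no model at all when headers are line-delimited. This sketch assumes a fixed header list, which in practice is institution-specific:

```python
import re
from typing import Dict

# Illustrative header list; real note templates vary by institution
SECTION_HEADERS = ["History", "Findings", "Current Assessment", "Plan"]

def segment_note(note: str) -> Dict[str, str]:
    """Split a note into {header: body} using header lines as delimiters."""
    pattern = r"^({}):\s*$".format("|".join(SECTION_HEADERS))
    sections, current = {}, None
    for line in note.splitlines():
        match = re.match(pattern, line.strip(), re.IGNORECASE)
        if match:
            current = match.group(1).title()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```

Downstream extraction would then run only on sections["Current Assessment"], so values quoted under "History" never reach the extractor.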
4.4 Formatting and Normalization Failures
Definition: Correctly identifying the value but outputting it in an unusable format (e.g., "twenty-four" instead of "24", or the range "50-55%" returned as a raw string instead of a single numeric value).
Mechanism: Lack of strict schema enforcement in the output generation.
How to Catch It:
Constrained Decoding: Use tools like guidance or outlines that force the LLM's output to adhere strictly to a RegEx pattern (e.g., \d+) or a Pydantic object. This prevents the model from generating "about 50" or other free-text variations.16
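Where constrained decoding is unavailable, a post-hoc normalizer can repair the most common variants. A minimal sketch (the word-number table is a tiny illustrative subset, and reporting the midpoint of a range is one possible convention, not a standard):

```python
import re
from typing import Optional

WORD_NUMBERS = {"twenty-four": 24, "thirty": 30}  # illustrative subset only

def normalize_score(raw: str) -> Optional[int]:
    """Normalize '50-55%', 'twenty-four', or '55 %' into a single integer."""
    raw = raw.strip().lower().rstrip("%").strip()
    if raw in WORD_NUMBERS:
        return WORD_NUMBERS[raw]
    range_match = re.fullmatch(r"(\d+)\s*(?:-|to)\s*(\d+)", raw)
    if range_match:
        low, high = int(range_match.group(1)), int(range_match.group(2))
        return (low + high) // 2  # one convention: report the range midpoint
    digits = re.fullmatch(r"\d+", raw)
    return int(digits.group()) if digits else None
```

Inputs the normalizer cannot repair (e.g., "about 50") return None and should be escalated rather than silently discarded.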
5. Comparative Analysis: Performance, Cost, and Speed
The choice of methodology involves a trade-off between accuracy, computational resources, and financial cost.
5.1 Benchmarking Accuracy
Recent comparative studies have established clear performance hierarchies. For standard entity extraction (NER), fine-tuned BERT models often remain the gold standard. In a study on infection inference, a fine-tuned Bio+Clinical BERT model achieved an F1 score of 0.97, significantly outperforming both traditional RegEx (F1 0.71) and zero-shot GPT-4 (F1 0.71). However, GPT-4's performance improved to F1 0.86 with few-shot prompting, and fine-tuned GPT-3.5 matched the BERT model.6 This suggests that while LLMs are powerful, smaller, specialized models (BERT) are often more accurate for well-defined tasks where training data exists.
In zero-shot settings, LLMs dominate. For extracting complex, non-standard entities where no training data is available, GPT-4 significantly outperforms dictionary-based methods, which simply cannot cope with vocabulary mismatches.13
5.2 The Economics of Extraction
The cost disparity between methods is arguably the most critical factor for implementation at scale.
BERT Inference Cost: Running a distilled BERT model locally is estimated at $0.000187 per note. This low cost is due to the model's small size (millions of parameters vs. billions) and efficient processing on standard GPUs.
LLM Inference Cost: Using a commercial API like GPT-4o is estimated at $0.0159 per note.
The Implication: For a healthcare system processing 1 million clinical notes, BERT would cost roughly $187, whereas GPT-4o would cost roughly $15,900: an approximately 85-fold (8,400%) increase for using the LLM.13
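The arithmetic behind these figures, using the per-note estimates cited above:

```python
NOTES = 1_000_000
BERT_COST_PER_NOTE = 0.000187  # local distilled BERT, estimate from the text
LLM_COST_PER_NOTE = 0.0159     # commercial GPT-4o API, estimate from the text

bert_total = NOTES * BERT_COST_PER_NOTE  # ~$187
llm_total = NOTES * LLM_COST_PER_NOTE    # ~$15,900
ratio = llm_total / bert_total           # ~85x
```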
5.3 Computational Latency
Speed is another differentiator. Rule-based systems operate in microseconds. BERT models typically process a note in 50-100ms, enabling real-time applications (e.g., alerting a doctor during note entry). LLMs, due to their autoregressive nature (generating token by token), are significantly slower, with latencies ranging from 500ms to several seconds per note.13 This latency makes LLMs less suitable for synchronous, user-facing workflows but acceptable for asynchronous batch processing.
5.4 Summary Comparison Table
| Feature / Metric | Rule-Based (RegEx/NegEx) | BERT / Encoder (Bio/ClinicalBERT) | LLM (GPT-4, Llama 3) |
|---|---|---|---|
| Core Mechanism | Deterministic Pattern Matching | Discriminative Token Classification | Generative Probabilistic Prediction |
| Best Use Case | Highly standardized formats (e.g., "BP: 120/80") | High-volume, specific entity extraction (NER) | Complex reasoning, zero-shot, heterogeneous text |
| Accuracy (Standard) | High (if patterns exhaustive) | SOTA (F1 ~0.90-0.97) 6 | Competitive (F1 ~0.85-0.95) |
| Accuracy (Inferential Tasks) | Very Low (Fails on nuance) | Moderate (Needs specific training) | High (Can infer context) |
| Hallucination Risk | None (Errors are omissions) | Low (Classification errors) | High (Fabrication of values) 21 |
| Inference Cost/Note | Negligible (~$0.00) | Low (~$0.0002) 13 | High (~$0.01 - $0.02) 13 |
| Inference Speed | Real-time (<1ms) | Fast (~50ms) | Slow (~500ms - 2s) |
| Training Data | None (Code-based) | High (1000s of labels) | None (Zero-shot) or Low (Few-shot) |
| Explainability | Transparent (Line of code) | Limited (Attention maps) | Black Box (Requires "CoT" prompting) |
| Maintenance | High (New rules for every variant) | Moderate (Retraining) | Low (Prompt adjustment) |
| Privacy/Security | On-premise | On-premise | Cloud (Commercial) or On-prem (Open Source) |
6. Strategic Selection and Implementation Framework
Given the trade-offs outlined above, relying on a single method is rarely the optimal strategy. We propose a tiered selection framework based on the dimensions of Task Complexity, Data Volume, and Resource Availability.
6.1 Scenario A: High Volume, Low Complexity
Use Case: Extracting standard vital signs or ICD-10 codes from millions of daily notes.
Recommendation: Rule-Based or Distilled BERT.
Reasoning: The cost of LLMs is unjustifiable here. The patterns are regular enough for RegEx or a lightweight BERT model (like PubMedBERT) to achieve >95% F1 score. The throughput requirements demand the sub-100ms latency of these architectures.7
6.2 Scenario B: Complex Reasoning & Ambiguity
Use Case: "Identify if the patient's worsening EF is attributed to non-compliance with medication."
Recommendation: LLM (GPT-4 or Llama 3).
Reasoning: This task requires inferring causality and synthesizing information across sentences, a capability that BERT models struggle with without complex, custom architectures. RegEx is impossible here. The higher cost per note is justified by the high value of the insight and the impossibility of achieving it with cheaper methods.19
6.3 Scenario C: The "Cold Start" (No Labeled Data)
Use Case: Extracting a newly identified biomarker or variable where no annotated dataset exists.
Recommendation: LLM-to-BERT Distillation.
Reasoning: Use a high-quality LLM (like GPT-4) to label a few thousand notes (creating a "Silver Standard" dataset). Then, train a smaller, cheaper BERT model on this synthetic data. This approach, known as knowledge distillation, yields the reasoning benefits of LLMs with the cost profile of BERT. Studies have shown this method can approach the performance of fully supervised models with a fraction of the manual effort.13
6.4 The "LLM-Augmented" Hybrid Pipeline
For robust production systems, the most effective architecture is often a hybrid one.
Preprocessing (Rule-Based): Use RegEx to segment the note. For example, extract only the "Echocardiogram" or "Psychiatric Assessment" section. This significantly reduces the context window size, lowering token costs and reducing the noise available for hallucination.28
Tiered Extraction:
Tier 1: Run high-precision RegEx. If a clear value is found (e.g., "LVEF: 55%"), extract and stop.
Tier 2: If no value is found, pass the segment to a Fine-Tuned BERT model.
Tier 3: If the BERT model's confidence score is low, or if the entity is not found, escalate the segment to an LLM for reasoning.33
Validation (LLM-as-Judge): Deploy a smaller, local LLM (e.g., Llama-3-8B) to validate random samples of the BERT/RegEx extractions. This acts as a continuous quality assurance mechanism to detect data drift or new documentation patterns.
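The tiered logic above can be sketched as follows, under the assumption that the BERT and LLM extractors are injected as callables (the stubs, signatures, and confidence threshold here are placeholders for real models, not a cited implementation):

```python
import re
from typing import Callable, Optional, Tuple

# Tier 1: high-precision, deterministic pattern
LVEF_RE = re.compile(r"LVEF\s*[:=]\s*(\d{1,3})\s*%", re.IGNORECASE)

def tiered_extract(
    segment: str,
    bert_extract: Callable[[str], Tuple[Optional[int], float]],
    llm_extract: Callable[[str], Optional[int]],
    confidence_threshold: float = 0.9,
) -> Optional[int]:
    """Tier 1: regex; Tier 2: fine-tuned BERT; Tier 3: LLM fallback."""
    match = LVEF_RE.search(segment)            # Tier 1: cheap, deterministic
    if match:
        return int(match.group(1))
    value, confidence = bert_extract(segment)  # Tier 2: fast discriminative model
    if value is not None and confidence >= confidence_threshold:
        return value
    return llm_extract(segment)                # Tier 3: expensive generative fallback
```

Because each tier only runs when the cheaper tier fails, the expensive LLM call is reserved for the small residue of genuinely ambiguous segments.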
7. Conclusion
The landscape of clinical information extraction is undergoing a profound transformation. The transition from rule-based systems to BERT and now to LLMs represents a shift from deterministic control to probabilistic reasoning. Rule-based systems offer absolute control but limited capability. LLMs offer unprecedented capability--the ability to "read" and "understand" clinical narrative--but come with the inherent risks of probabilistic generation and significant cost barriers.
For the extraction of variables like LVEF, MMSE, and PHQ-9, the optimal strategy for the near future is not to abandon discriminative models (BERT) for generative ones (LLMs), but to leverage LLMs to build better discriminative systems. By using LLMs to generate synthetic training data, handle complex edge cases, and validate outputs, healthcare systems can deploy robust, cost-effective pipelines that unlock the vast potential of unstructured data.
However, the integration of these tools requires vigilance. Errors such as hallucination and negation failure are not merely technical nuisances; in a clinical setting, they are patient safety risks. Therefore, any deployment must be wrapped in a "safety sandwich": Rule-based pre-processing to narrow the scope, and verification post-processing to catch hallucinations. Only through this layered, hybrid approach can we safely harness the power of AI to decipher the complex narrative of patient care.
Works cited
Extraction of left ventricular ejection fraction information from various types of clinical reports - PubMed, accessed on January 30, 2026, https://pubmed.ncbi.nlm.nih.gov/28163196/
Automated extraction of ejection fraction for quality measurement using regular expressions in Unstructured Information Manageme, accessed on January 30, 2026, https://academic.oup.com/jamia/article-pdf/19/5/859/5892455/19-5-859.pdf
Ascertaining Depression Severity by Extracting Patient Health Questionnaire-9 (PHQ-9) Scores from Clinical Notes | Request PDF - ResearchGate, accessed on January 30, 2026, https://www.researchgate.net/publication/331439922_Ascertaining_Depression_Severity_by_Extracting_Patient_Health_Questionnaire-9_PHQ-9_Scores_from_Clinical_Notes
Ascertaining Depression Severity by Extracting Patient Health Questionnaire-9 (PHQ-9) Scores from Clinical Notes - PMC - NIH, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC6371338/
A Survey on Open Information Extraction from Rule-based Model to Large Language Model, accessed on January 30, 2026, https://arxiv.org/html/2208.08690v6
Transformers and large language models are efficient feature extractors for electronic health record studies - PubMed Central, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11928488/
Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2403.12297v1
Extract Medical Entities with Regular Expressions in Healthcare NLP at Scale - John Snow Labs, accessed on January 30, 2026, https://www.johnsnowlabs.com/extract-medical-named-entities-with-regex-in-healthcare-nlp-at-scale/
Extracting Coronary Lesion Information from Angiogram Reports for Patient Screening Applications Leah Paige Gaffney - DSpace@MIT, accessed on January 30, 2026, https://dspace.mit.edu/bitstream/handle/1721.1/155970/gaffney-leahg-mba-mgt-2024-thesis.pdf?sequence=1&isAllowed=y
Negation Detection using Regular Expression, Syntactic and Classification Methods - i2b2, accessed on January 30, 2026, https://www.i2b2.org/software/projects/hitex/negation.pdf
Clinical Text Negation handling using negspaCy and scispaCy | by Mansi Kukreja - Medium, accessed on January 30, 2026, https://medium.com/@MansiKukreja/clinical-text-negation-handling-using-negspacy-and-scispacy-233ce69ab2ac
Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2507.20859v1
Distilling Large Language Models for Efficient Clinical Information Extraction - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2501.00031v1
ModernBERT vs LLMs for Detecting Adverse Drug Reactions - Paul Simmering, accessed on January 30, 2026, https://simmering.dev/blog/modernbert-vs-llm/
Is BERT an LLM Understanding Natural Language Processing - Cognativ, accessed on January 30, 2026, https://www.cognativ.com/blogs/post/is-bert-an-llm-understanding-natural-language-processing/308
Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports - RSNA Journals, accessed on January 30, 2026, https://pubs.rsna.org/doi/10.1148/radiol.240895
Clinical concept and relation extraction using prompt-based machine reading comprehension - PMC - NIH, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC10436141/
Enhancing Clinical Named Entity Recognition via Fine-Tuned BERT and Dictionary-Infused Retrieval-Augmented Generation - MDPI, accessed on January 30, 2026, https://www.mdpi.com/2079-9292/14/18/3676
Large language models help decipher clinical notes | Harvard-MIT Health Sciences and Technology, accessed on January 30, 2026, https://hst.mit.edu/news-events/large-language-models-help-decipher-clinical-notes
Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework - PMC - PubMed Central, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12712565/
Medical Hallucination in Foundation Models and Their Impact on Healthcare - medRxiv, accessed on January 30, 2026, https://www.medrxiv.org/content/10.1101/2025.02.28.25323115v1.full-text
Evaluating Large Language Models in extracting cognitive exam ..., accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11634005/
Extracting Critical Information from Unstructured Clinicians' Notes Data to Identify Dementia Severity Using a Rule-Based Approach: Feasibility Study - NIH, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11462099/
GAD-7 and PHQ-9 - Medi-Stats, accessed on January 30, 2026, https://www.medi-stats.com/gad-7-phq-9
LLM hallucinations and failures: lessons from 5 examples - Evidently AI, accessed on January 30, 2026, https://www.evidentlyai.com/blog/llm-hallucination-examples
Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2506.00448v1
Large Language Model Benchmarks in Medical Tasks - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2410.21348v3
Advancing clinical information extraction with LLM-Augmenter - Truveta, accessed on January 30, 2026, https://www.truveta.com/blog/research/advancing-clinical-information-extraction-with-llm-augmenter
Large Language Models · spaCy Usage Documentation, accessed on January 30, 2026, https://spacy.io/usage/large-language-models
Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification | Alex Jacobs, accessed on January 30, 2026, https://alex-jacobs.com/posts/beatingbert/
Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing - PMC - NIH, accessed on January 30, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC11230489/
A Novel Compact LLM Framework for Local, High-Privacy EHR Data Applications - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2412.02868v1
Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data - arXiv, accessed on January 30, 2026, https://arxiv.org/html/2412.14276v1