James Mullenbach

James Mullenbach is a Research Engineer at ASAPP working on the next generation of agent dialogue augmentation systems, along with new directions for machine learning in healthcare. He received his Master’s and undergraduate degrees from the Georgia Institute of Technology.

Introducing CLIP: A Dataset to Improve Continuity of Patient Care with Unsupervised NLP

Continuity of care is crucial to ensuring positive health outcomes for patients, especially in the transition from acute hospital care to outpatient primary care. However, information sharing across these settings is often imperfect.

Hospital discharge notes alone easily run to thousands of words and are structured with billing and compliance in mind rather than the reader, which makes it especially difficult to pore over them for important pending actions. Compounding this issue, primary care physicians (PCPs) are already short on time, receiving dozens of emails, phone calls, imaging, and lab reports per day (Baron 2010). Important actions for improving patient care get lost in this sea of hospital notes and time constraints. This can cause errors and complications for both patients and primary care physicians.

Thus, in order to improve the continuity of patient care, we are releasing one of the largest annotated datasets for clinical NLP. Our dataset, which we call CLIP, for CLInical Follow-uP, makes the task of action item extraction tractable, by enabling us to train machine learning models to select the sentences in a document that contain action items.

By leveraging modern methods in unsupervised NLP, we can automatically highlight action items from hospital discharge notes for primary care physicians, saving them time and reducing the risk that they miss critical information.

We view the automatic extraction of required follow-up action items from hospital discharge notes as a way to enable more efficient note review and better performance for caregivers. In alignment with the ASAPP mission to augment human activity by advancing AI, this dataset and task provide an exciting test ground for unsupervised learning in NLP. By automatically surfacing relevant historical data to improve communication, this work represents another way ASAPP is using AI to augment human work. In our paper, accepted at ACL 2021, we demonstrate this with a new algorithm.

The CLIP Dataset

Our dataset is built upon MIMIC-III (Johnson et al., 2016), a large, de-identified, open-access dataset from the Beth Israel Deaconess Medical Center in Boston, which is the foundation of much fruitful work in clinical machine learning and NLP. From this dataset, with the help of a team of physicians, we labeled each sentence in 718 full discharge summaries, specifying whether the sentence contained a follow-up action item. We also annotated seven types that further classify action items by the action needed: for example, scheduling an appointment, following a new medication prescription, or reviewing pending laboratory results. This dataset, comprising over 100,000 annotated sentences, is to our knowledge one of the largest open-access annotated clinical NLP datasets, and we hope it can spur further research in this area.

How well does machine learning accomplish this task? In our paper we approach the task as sentence classification, individually labeling each sentence in a document with its follow-up types, or “No follow-up”. We evaluated several common machine learning baselines on the task, adding some tweaks to better suit it, such as including more than one sentence as input. We find that the best models, based on the popular transformer-based model BERT, provide a 30% improvement in F1 score relative to the linear model baseline. The best models achieve an F1 score around 0.87, close to the human performance benchmark of 0.93.
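To make the setup above concrete, here is a minimal sketch of two pieces of it: giving a classifier more than one sentence as input, and scoring multi-label sentence predictions with F1. The function names, the `[SEP]` joiner, the one-sentence window, the label strings, and the choice of micro-averaged F1 are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Set


def build_context_windows(sentences: List[str], window: int = 1,
                          sep: str = " [SEP] ") -> List[str]:
    """Pair each sentence with up to `window` neighbours on each side.

    Hypothetical helper: the joiner and window size are illustrative only.
    """
    inputs = []
    for i, sentence in enumerate(sentences):
        left = sentences[max(0, i - window):i]
        right = sentences[i + 1:i + 1 + window]
        inputs.append(sep.join(left + [sentence] + right))
    return inputs


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets.

    An empty set stands for "No follow-up". Micro-averaging is an
    assumption here, not necessarily the metric used in the paper.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = [
        "Patient was admitted with chest pain.",
        "Please follow up with cardiology in two weeks.",
        "Blood cultures were pending at discharge.",
    ]
    for model_input in build_context_windows(note):
        print(model_input)

    # Hypothetical label names, for illustration only.
    gold = [set(), {"Appointment"}, {"Lab"}]
    pred = [set(), {"Appointment"}, set()]
    print(round(micro_f1(gold, pred), 3))
```

Each string returned by `build_context_windows` would be passed to whichever sentence classifier is being evaluated, and `micro_f1` compares its predicted label sets with the annotations.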

Model pre-training for healthcare applications

We found that an important factor in developing effective BERT-based models was pre-training them on appropriate data. Pre-training exposes models to large amounts of unlabeled data, and serves as a way for large neural network models to learn how to represent the general features of language, like proper word ordering and which words often appear in similar contexts. Models that were pre-trained only on generic data from books or the web may not have enough knowledge of how language is used specifically in healthcare settings. We found that BERT models pre-trained on MIMIC-III discharge notes outperformed the general-purpose BERT models.

For clinical data, we may want to take this focused pre-training idea a step further. Pre-training is often the most costly step of model development due to the large amount of data used. But can we reduce the amount of data needed by selecting data that is highly relevant to our end task? In healthcare settings, with private data and fewer computational resources, this would make automating action item extraction more accessible. In our paper, we describe a method we call task-targeted pre-training (TTP) that builds datasets for pre-training by selecting sentences that look the most like those in our annotated data that do contain action items. We find that it’s possible, and maybe even advantageous, to select data for pre-training in this way, saving time and computational resources while maintaining model performance.
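The selection idea can be sketched in a few lines. The version below is a deliberately simple stand-in, not the TTP method from the paper: it scores each unlabeled sentence by bag-of-words cosine similarity to the pooled vocabulary of annotated action-item sentences and keeps the top `k`. The function names, the similarity measure, and the example sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_for_pretraining(positives: List[str], unlabeled: List[str],
                           k: int) -> List[str]:
    """Keep the k unlabeled sentences most similar to the action-item examples.

    Hypothetical stand-in for task-targeted selection: similarity is measured
    against the summed bag-of-words of all positive sentences.
    """
    centroid = Counter()
    for sentence in positives:
        centroid.update(bag_of_words(sentence))
    ranked = sorted(unlabeled,
                    key=lambda s: cosine(bag_of_words(s), centroid),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    positives = [
        "please follow up with cardiology in two weeks",
        "follow up on pending blood cultures",
    ]
    unlabeled = [
        "patient was admitted with chest pain",
        "follow up with your primary care doctor",
        "vital signs were stable overnight",
    ]
    print(select_for_pretraining(positives, unlabeled, k=1))
```

The selected sentences would then form a smaller pre-training corpus. Any scoring function could replace the cosine similarity here, including a model trained on the annotated data.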

Improving physician performance and reducing cognitive load

Ultimately, our end goal is to make physicians’ jobs easier by reducing the administrative burden of reading long hospital notes, and to bring their time and focus back where it belongs: on the patient. Our methods can condense notes down to what a PCP really needs to know, reducing note size by at least 80% while keeping important action items readily available. This reduction in “information overload” can reduce physicians’ likelihood of missing important information (Singh et al., 2013), improving their accuracy and the well-being of their patients. Through a simple user interface, these models could enable a telemedicine professional to more quickly and effectively aid a patient who recently visited the hospital.
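As a rough illustration of what condensing a note means in practice, the sketch below keeps only the sentences that received at least one predicted follow-up type and reports how much shorter the result is. Measuring note size in whitespace-separated words is an assumption, as are the function names and label strings; the source does not say how the reduction was measured.

```python
from typing import List, Set


def condense(sentences: List[str], labels: List[Set[str]]) -> List[str]:
    """Keep sentences with at least one predicted follow-up type."""
    return [s for s, types in zip(sentences, labels) if types]


def size_reduction(original: List[str], kept: List[str]) -> float:
    """Fraction of words removed, counting whitespace-separated words.

    Word count is an assumed size measure chosen for this example.
    """
    original_words = sum(len(s.split()) for s in original)
    kept_words = sum(len(s.split()) for s in kept)
    if original_words == 0:
        return 0.0
    return 1 - kept_words / original_words


if __name__ == "__main__":
    note = [
        "Patient admitted with pneumonia.",
        "Treated with IV antibiotics.",
        "Repeat chest X-ray in six weeks.",
        "Discharged home in stable condition.",
    ]
    # Hypothetical model output: one set of follow-up types per sentence.
    predicted = [set(), set(), {"Imaging"}, set()]
    summary = condense(note, predicted)
    print(summary)
    print(f"{size_reduction(note, summary):.0%} shorter")
```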

Citations

Richard J. Baron. 2010. What’s keeping us so busy in primary care? A snapshot from one practice. The New England Journal of Medicine, 362(17):1632–6.
Hardeep Singh, Christiane Spitzmueller, Nancy J. Petersen, Mona K. Sawhney, and Dean F. Sittig. 2013. Information overload and missed test results in electronic health record-based settings. JAMA Internal Medicine, 173(8):702–4.
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad M. Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.
