Clinical text classification/information extraction to understand real-world treatment effects at a large, academic medical center
Partner: Vivek Rudrapatna, University of California, San Francisco, Academic
Overview
Project Description
Every time you visit the doctor and watch her document your complete medical history, your data are being captured by huge electronic health record (EHR) systems on the backend. Although these data are typically captured to communicate medical information and support the business of healthcare, recent years have seen growing interest in mining them for new insights in precision medicine. However, a major roadblock to repurposing these data is that they are mostly unstructured (free text) rather than well organized and ready to analyze.
A prior Discovery Project cohort (Taline Mardirossian, Saransh Gupta, and Rohan Narain, all Cal Class of ’20) successfully demonstrated that standard text classification techniques could achieve human-level performance on the task of converting colonoscopy reports to Mayo Scores, a standard scoring system used to assess disease activity in inflammatory bowel disease (IBD). Their work is currently being presented at national conferences and is being prepared for publication.
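One way to make "human-level performance" concrete is to compare model-versus-human agreement with human-versus-human agreement on the same reports. The sketch below uses Cohen's kappa, a chance-corrected agreement measure, for that comparison. It is an illustration only and not necessarily the metric the prior cohort used; the function name and the toy labels (scores 0 to 3) are hypothetical and not real data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of documents on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels for eight reports (not real data).
human_1 = [0, 1, 2, 3, 2, 1, 0, 3]
human_2 = [0, 1, 2, 3, 1, 1, 0, 3]
model = [0, 1, 2, 3, 2, 1, 0, 2]

human_vs_human = cohen_kappa(human_1, human_2)
model_vs_human = cohen_kappa(model, human_1)
print(f"human vs human kappa: {human_vs_human:.3f}")
print(f"model vs human kappa: {model_vs_human:.3f}")
```

Under this framing, a model could be called human-level when its agreement with a human annotator is about as high as the agreement between two human annotators.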
We propose to extend this information extraction pipeline to other clinical text domains: CT/MRI scan reports, clinical notes, pathology reports, and beyond. The output of your models will be the key variables needed to understand the real-world effectiveness of IBD treatments, but the models themselves will likely be disease agnostic. We see this work as a first step toward unlocking the full information content hidden within EHR systems and accelerating clinical research across the full landscape of diseases.
Expected Deliverable
I hope to work closely with you to develop, validate, and deploy a series of text classification models on a variety of textual documents found within electronic health record systems. We will compare your models to human-level performance and attempt to reach it, as the prior Discovery cohort did. We’ll explore a variety of machine learning techniques, such as active learning, and compare deep learning approaches to conventional methods. By the end of the cycle, we hope to have a finished product that we can present at conferences, and to publish both your results and your reproducible code.
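As a rough illustration of the active learning idea mentioned above, the sketch below shows uncertainty sampling: given a model's predicted class probabilities for a pool of unlabeled documents, it picks the documents the model is least sure about so that a human can label those first. This is one common strategy, not a commitment to what the project will use; the function names and the probability values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predicted_probs, batch_size):
    """Return indices of the batch_size most uncertain documents."""
    ranked = sorted(
        range(len(predicted_probs)),
        key=lambda i: entropy(predicted_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Hypothetical predicted probabilities for four unlabeled documents
# over four classes (illustrative values, not real model output).
pool_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.60, 0.30, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.10],
]

to_label = select_for_labeling(pool_probs, batch_size=2)
print("send these documents to an annotator first:", to_label)
```

In a full loop, the newly labeled documents would be added to the training set, the model retrained, and the selection repeated until performance stops improving or the labeling budget runs out.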
What would a successful semester look like to you?
One or more models with near expert-level performance on a variety of text classification tasks, accompanied by the code needed to reproduce the analysis in any health system.
Additional Skills from ideal candidates
Experience with (or a strong desire to learn) text processing methods and a strong background in machine learning will be really helpful. An interest in healthcare and advancing medical research will only sweeten the experience.
Data
Models
Conclusion