
Validating AI for Impact: Tackling Bias for Underrepresented Communities

Dandelion was founded in 2019 after two of our co-founders, Ziad Obermeyer and Sendhil Mullainathan, published a pivotal paper on how algorithmic racial bias can exacerbate existing healthcare inequalities. Their results suggest that algorithmic bias is fixable and often comes down to choosing the right training data.

Despite the rapid proliferation of clinical AI tools since 2019, however, the industry still lags in evaluating and correcting for bias. Many algorithms already in clinical use have not been vetted for their impact on diverse populations. These same populations are often poorly represented in AI training data, leaving them at greater risk of harm from algorithmic bias.

Promisingly, there are now solutions available to ensure AI tools perform equitably across populations. Dandelion and The SCAN Foundation have worked together over the past year to create one such solution – made available to developers for free.

Dandelion Health and The SCAN Foundation believe AI can usher in a new age of precision medicine and clinical care. In order for this to become a reality, however, AI must be evaluated for its impact on those who stand to gain the most: populations that have historically been underserved in clinical care. 

Clinical AI development challenges

Developing clinical AI tools is a complex process, and developers are hamstrung by the data and resources available to them. After defining the clinical use case, developers must source training data, which is typically difficult and costly to acquire. They must then annotate every case to establish ground truth, a labor-intensive task typically performed by human reviewers. Once trained, AI tools must be validated on larger, entirely different datasets. Before Dandelion’s free offering, conducting a truly robust validation was inaccessible or cost-prohibitive for many developers.

What does this all mean for clinical AI development? Cutting-edge technologies can be developed, but at a steep price. Without industry-wide demand or specific regulatory requirements to demonstrate that an AI tool is unbiased, many developers simply do not have the capacity or incentive to do so. Without a guarantee of algorithm performance, it is increasingly difficult for healthcare providers and other users to choose the “best” AI tools for their needs and specific patient populations.

AI’s Impact on Underserved Populations

Dandelion and The SCAN Foundation are among a growing number of organizations that have stood up a solution to address this challenge. We provide a rigorous, independent validation of AI tools by running them on almost a decade of data collected from our consortium of geographically diverse, non-academic health systems that are representative of the U.S. population. 

Our validation encompasses not only overall performance and well-known sources of bias (e.g., race/ethnicity or sex), but also bias across social determinants of health (SDoH) measures. These measures include median income decile, social vulnerability index (SVI) decile, and rural vs. urban residence. This information can, for example, help developers determine whether they need to add more representative patient data to their training datasets.
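
To make this concrete, here is a minimal sketch of the kind of per-subgroup check such a validation involves. The column names, the AUROC metric, and the example strata are illustrative assumptions, not Dandelion’s actual pipeline:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(results: pd.DataFrame, group_col: str) -> pd.Series:
    """Compute AUROC separately for each level of one subgroup column.

    Assumes `results` holds binary ground-truth labels in 'label' and
    model scores in 'score'; both column names are illustrative.
    """
    return pd.Series({
        level: roc_auc_score(g["label"], g["score"])
        for level, g in results.groupby(group_col)
    })

# Compare performance across SDoH strata such as income decile,
# SVI decile, and rural vs. urban residence:
# for col in ["income_decile", "svi_decile", "rurality"]:
#     print(subgroup_auroc(results, col))
```

Large gaps between subgroup scores are the signal that a model may need more representative training data for the underperforming stratum.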

Algorithm validation is particularly critical for underserved and vulnerable populations. Patients in these communities often have limited access to preventative care and may delay seeking treatment, resulting in more advanced disease stages when finally diagnosed. They also tend to be sicker – heart disease, for example, is 40% more prevalent in rural populations than in urban ones. For these patients, each medical visit and diagnostic test is crucial and often more challenging to obtain.

AI tools provide an opportunity to maximize the benefit of each clinical encounter by acting as a second set of eyes. At the same time, however, an incorrect AI result carries outsized risk. A false positive could lead to unnecessary and costly follow-up procedures, while a false negative could leave a condition undiagnosed and untreated.

AI Validation In Practice

Through the algorithm validations we have performed over the past year, we have been heartened to see that biased algorithms are the exception, not the rule. Most of the algorithms we’ve validated show drastic bias along relatively few dimensions, if any at all. When we do detect statistically significant bias, the degree of bias often isn’t clinically significant. With that in mind, it is important to study and learn from the exceptions where we do see bias, especially given the continued growth and adoption of AI-driven healthcare.
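
One common way to separate statistical from clinical significance is to bootstrap the performance gap between two subgroups and compare its confidence interval against a pre-specified, clinically meaningful margin. A minimal sketch under an assumed data layout (aligned NumPy arrays of labels, scores, and subgroup membership):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_gap(y, score, group, n_boot=2000, seed=0):
    """95% CI for the AUROC gap between two subgroups (e.g. rural/urban)."""
    rng = np.random.default_rng(seed)
    a, b = np.unique(group)  # exactly two subgroup labels assumed
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample patients
        yb, sb, gb = y[idx], score[idx], group[idx]
        # skip draws where a subgroup lacks both outcome classes
        if any(len(np.unique(yb[gb == g])) < 2 for g in (a, b)):
            continue
        gaps.append(roc_auc_score(yb[gb == a], sb[gb == a])
                    - roc_auc_score(yb[gb == b], sb[gb == b]))
    return np.percentile(gaps, [2.5, 97.5])
```

An interval that excludes zero flags statistically significant bias; whether that bias is clinically significant depends on a margin chosen in advance (say, an AUROC gap above 0.05 – an illustrative threshold, not a standard one).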

We have found that even seemingly innocuous technical limitations have the potential to significantly impact patient care. One ECG algorithm we validated, for example, performed well overall but poorly on ECGs taken at one specific hospital. After exploring a number of hypotheses about demographics and patient mix, the answer turned out to be much simpler. The algorithm was trained on data from more modern ECG machines, which sample at 500Hz or 1000Hz. This one hospital – a smaller community hospital – happened to have a number of older 250Hz machines, which made the algorithm’s performance for that population materially worse.
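
One plausible mitigation, sketched below, is to check each recording’s sampling rate at inference time and resample it to the rate the model expects. The 500Hz target mirrors the example above; the function and signal names are illustrative:

```python
import numpy as np
from scipy.signal import resample_poly

MODEL_RATE_HZ = 500  # sampling rate the model was trained on

def prepare_ecg(signal: np.ndarray, source_rate_hz: int) -> np.ndarray:
    """Resample one ECG lead to the model's expected sampling rate.

    A 250Hz trace from an older machine is upsampled 2x so the model
    sees inputs shaped like its training data. Resampling cannot
    restore frequency content above source_rate_hz / 2, however, so
    validating on low-rate data remains essential.
    """
    if source_rate_hz == MODEL_RATE_HZ:
        return signal
    return resample_poly(signal, up=MODEL_RATE_HZ, down=source_rate_hz)

# e.g. prepare_ecg(lead_ii_250hz, source_rate_hz=250)
```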

Had this algorithm been deployed unvalidated in a lower-resourced care facility that relied on an older machine, it would have been useless at best and, at worst, would have negatively impacted care trajectories. We found, for example, that the algorithm performed 12% worse for rural patients than for urban patients. Rural facilities tend to have fewer specialists available, and consequently may rely on generalists who do not frequently encounter rare conditions. These providers may turn to AI to supplement their clinical decision-making, and they need to be confident that these tools perform well for the populations they serve.

To take another example, AI algorithms can analyze routine blood test results – such as creatinine levels and estimated glomerular filtration rate (eGFR), collected for more than 60% of patients – to detect early changes in kidney function. These subtle changes, which might be missed by a human observer, can indicate the onset of chronic kidney disease (CKD) well before symptoms appear. Early detection is crucial, particularly for older patients, as it allows for timely lifestyle changes and treatments to prevent further progression, potentially avoiding more severe outcomes like kidney failure or the need for dialysis.
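
For context, eGFR is itself derived from serum creatinine. Below is a sketch of the 2021 race-free CKD-EPI creatinine equation, the kind of standard input such an algorithm might consume; the function name and example values are ours:

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age: float, female: bool) -> float:
    """2021 CKD-EPI creatinine equation (race-free), in mL/min/1.73 m^2."""
    kappa = 0.7 if female else 0.9    # sex-specific creatinine constant
    alpha = -0.241 if female else -0.302
    ratio = scr_mg_dl / kappa
    egfr = (142
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age)
    return egfr * 1.012 if female else egfr

# e.g. egfr_ckd_epi_2021(1.1, age=68, female=True) -> ~54.7; values
# persistently below 60 suggest CKD and warrant follow-up.
```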

Let’s consider the potential impact of this example algorithm on more vulnerable patients or those in lower income deciles. The algorithm could flag at-risk patients who would otherwise have been missed, leading to earlier preventative care. If the algorithm were biased against these populations, however, early signs could remain undetected in the case of a false negative. A false positive, on the other hand, may result in unnecessary patient stress and expensive follow-up tests they cannot afford.

In both of these examples, AI has the potential to dramatically impact patient care trajectories and quality of life. Dandelion and The SCAN Foundation’s goal in providing free validation services is to enable AI tools to provide a counterweight to existing inequities – rather than perpetuate or even worsen them. 

Ensuring AI Works for All

Healthcare providers should not have to cross their fingers and hope the AI tools they deploy will help their patient populations – there are tested, rigorous options for validating algorithms for performance and bias. The cost and effort required of developers are at an all-time low, and will continue to fall. Dandelion’s validation offering, for example, is available to developers at no cost thanks to support from The SCAN Foundation. When bias is detected, there are proven ways to correct it and improve performance, such as retraining on new data or modifying features or weights.
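
As one illustration of the modifying-weights approach, most training frameworks accept per-sample weights, so underrepresented subgroups can be upweighted during retraining. This is a generic sketch, not a description of any specific algorithm’s fix:

```python
import numpy as np

def inverse_frequency_weights(groups: np.ndarray) -> np.ndarray:
    """Per-sample weights that balance subgroup influence on training.

    Each sample is weighted by N / (n_groups * count(its_group)), so
    every subgroup contributes equally to the training loss regardless
    of how many patients it contributes to the dataset.
    """
    values, counts = np.unique(groups, return_counts=True)
    weight = {v: len(groups) / (len(values) * c)
              for v, c in zip(values, counts)}
    return np.array([weight[g] for g in groups])

# Many estimators accept these directly, e.g.:
# model.fit(X, y, sample_weight=inverse_frequency_weights(rurality))
```

Reweighting is a blunt instrument, though; collecting genuinely representative training data, as discussed above, remains the more durable fix.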

With unparalleled validation options available, there is no excuse for not comprehensively testing performance and bias on diverse patient populations. We believe industry-wide algorithm validation will enable clinical AI, like any other product, to compete on quality. AI has the potential to powerfully impact the lives of our most vulnerable populations – it is incumbent on the healthcare industry as a whole to ensure its impact is positive.

Shivaani Prakash, MSc, PhD

Head of Data, Dandelion Health

AV Ploumpis

Director of Growth Strategy, Dandelion Health