Will Data Science in Healthcare Overcome Its Many Roadblocks?

The Fragmented State of Healthcare Data
Data Governance, Privacy and Security in Healthcare Data Science
Possible Solutions for Healthcare Data Governance, Privacy and Security
Conclusion

The potential for AI development to transform global healthcare culture from one of 'reactive disaster management' to one of 'prevention and early intervention' through predictive systems has become clear over the last ten years. The potential benefits remain compelling, with the healthcare AI market recently forecast to reach US$51.3 billion by 2027.

Possible applications and developments include:

Automation of medical imaging analysis.

Medical image visualization and analysis in MATLAB

Identification of molecular biomarkers for disease, without biopsies, from simple fluid samples.

A liquid biopsy system enabled by machine learning

Predictive health analytics (PHA) powering AI-based home monitoring systems for terminal and high-risk patients (a group representing just 5% of patients but 60% of healthcare costs in the US).
Drug discovery, including prediction, design and drug-target interaction (DTI) phases.

Training data assessment for drug discovery

Machine-driven analysis and logistical strategies for co-morbidities in chronic disease management (CDM).
Clinical trials: identification of suitable candidates; identification of biologics (large biological molecules) that may contribute to drug evolution for trials; and ROI analysis.

AI based subject progress analysis in a clinical trial

New machine learning insights and applications related to epidemiology, including statistical prediction models and containment.
The use of AI input for virtual, augmented and mixed reality (VAMR) in healthcare research, monitoring and palliative care, as well as for adding an 'informed' VR layer to telesurgery procedures.
AI in genomics research.
AI-informed chatbots and engagement systems in the telehealth sector.

Systems and applications like these require access to patient data, perhaps the most politically and socially volatile topic in debates over the future of data science and machine learning. Since this represents a major impediment to progress, and since it tends to dominate the discourse, let's examine the obstacles and consider possible remedies.

The Fragmented State of Healthcare Data

The World Health Organization has acknowledged the problems of scale and cost in transitioning healthcare data from paper to digital. Besides highly politicized issues around privacy (which we shall examine), the budgetary demands and long-term commitment needed to fulfil digitization strategies face a number of challenges:

The limited length of political administrations under pressure to achieve shorter-term victories.
Sporadic political manipulation of healthcare data, which can render the data less accurate or representative, and adversely affect budget allocation for healthcare IT.
The limited fiscal resources of developing countries.
The way that US patient data is sub-divided between healthcare insurers.
Endemic undercounting of cases, including undercounting of COVID-19 infections, which can also skew the data for potential machine learning applications.
A baffling array, in many countries, of diverse and incompatible digital healthcare solutions across states, counties, and other segmentations of the population.
Resistance from the medical community, stemming from fears about patient outcomes under the burden of substandard EHR systems; fears about the increased security risks of digitizing healthcare infrastructure; and a general reluctance, personal and professional, to cede control to machine systems.
Poor provisioning of EHR solutions in competitive, grant-hungry and bid-driven markets.

The United Kingdom's ambitious plans for full patient health data digitization by 2020 resulted in only 10% of NHS trusts achieving paper-free records by 2019. This disappointing result came in the wake of the abandoned £12.7-billion National Electronic Health Record Program in 2012, possibly the most expensive government IT failure in the world at that time.

Around 2009 in the US, a $30-billion government stimulus package for healthcare digitization, which might have cohered the country's fragmented health IT sector, was hijacked by the free market: a competitive commercial environment for electronic health records emerged, with rivals eschewing open standards in the hope of market capture through proprietary formats. The result was a multiplicity of non-interoperable EHR systems across US hospitals.

Although poorer countries face economic barriers to the costs of implementing digitization, the long-term cost reductions are desirable. In Eastern Europe, where struggling economies make healthcare provision challenging, grants from the European Union are aiding digitization, most notably in the cases of Estonia and Poland.

The Outlook for Healthcare Digitization

The political aspect of healthcare data-gathering and disclosure is difficult to solve, particularly where it reveals inequalities. If the route to saving lives and money with machine learning is via naked disclosure of real-world healthcare data/statistics (which may upset governments' more favorable interpretations), the key to progress may be more diplomatic than scientific in nature. Misleading government management of healthcare data systems and statistics predates the pandemic and, sadly, seems likely to survive it.

Meanwhile, competitive market forces fragment EHR deployment into multiple non-interoperable networks, particularly in the United States, where federal public health IT proposals must also endure the arguable stigma of 'socialized medicine'. In a culture where state/private partnerships have co-opted state provisioning, the retarding effects of the free market on healthcare data digitization seem likely to continue after COVID-19, particularly in the US.

If a single large country cannot standardize an EHR system, a global standard seems even more unlikely. The best solution may be cross-systems translation matrices that use robotic process automation (RPA) to extract and homogenize variously-sourced data for the purposes of training AI healthcare and medical research models. Federated solutions may also help, as we'll see later.
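As a rough illustration of what such a translation layer does, the sketch below maps records from two hypothetical EHR exports into one common schema. The vendor names, field names and date formats are all assumptions made for this example; a real pipeline would depend on the source systems involved.

```python
from datetime import datetime

# Hypothetical field mappings: source field name -> common field name.
FIELD_MAPS = {
    "vendor_a": {"pt_id": "patient_id", "dob": "birth_date", "dx": "diagnosis_code"},
    "vendor_b": {"PatientID": "patient_id", "BirthDate": "birth_date", "ICD10": "diagnosis_code"},
}

# Hypothetical date formats used by each source system.
DATE_FORMATS = {"vendor_a": "%m/%d/%Y", "vendor_b": "%Y-%m-%d"}


def homogenize(record, source):
    """Translate one source record into the common schema."""
    mapping = FIELD_MAPS[source]
    out = {common: record[src] for src, common in mapping.items() if src in record}
    # Normalize dates to ISO 8601 so downstream training code sees one format.
    if "birth_date" in out:
        parsed = datetime.strptime(out["birth_date"], DATE_FORMATS[source])
        out["birth_date"] = parsed.date().isoformat()
    # Normalize diagnosis codes: trim whitespace and upper-case.
    if "diagnosis_code" in out:
        out["diagnosis_code"] = out["diagnosis_code"].strip().upper()
    out["source_system"] = source
    return out


records = [
    ({"pt_id": "A-17", "dob": "03/09/1961", "dx": "e11.9 "}, "vendor_a"),
    ({"PatientID": "B-42", "BirthDate": "1961-03-09", "ICD10": "E11.9"}, "vendor_b"),
]
unified = [homogenize(rec, src) for rec, src in records]
```

Both records end up with the same field names, date format and code format, which is the precondition for pooling them as training data.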

Data Governance, Privacy and Security in Healthcare Data Science

Public ambivalence and consumer fears regarding the growing digitization and sharing of health data center mainly on the possibility that health insurance companies will somehow obtain privileged patient information in order to raise costs for coverage against particular conditions or pre-existing genetic markers in individual patients — or even for broader geographical swathes of patients, where research identifies specific health issues in a neighborhood, town or state.

Healthcare's Move from Actuarial to Social Science Data

While insurance rate increases can be explained and justified with actuarial tables, the black-box nature of machine learning presents a novel legal obstacle.

Healthcare companies are rising to this challenge: a new study from the US uses the Python package SHAP to break down the logic of healthcare premium price rises that are calculated with machine learning methods.
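SHAP is built on Shapley values, which split a model's output among its input features. The sketch below is not the SHAP package and not the method of the study mentioned above: it computes exact Shapley values by brute force for a small, hypothetical premium function, with an assumed 'baseline' applicant standing in for absent features.

```python
from itertools import combinations
from math import factorial


def premium(features):
    """Hypothetical premium model, a stand-in for a trained black box."""
    price = 200.0
    price += 3.0 * features["age"]
    price += 150.0 * features["smoker"]
    # Interaction term: smoking costs more at higher BMI.
    price += 2.0 * features["smoker"] * features["bmi"]
    return price


BASELINE = {"age": 40, "smoker": 0, "bmi": 25}   # assumed reference applicant
APPLICANT = {"age": 55, "smoker": 1, "bmi": 30}  # the case being explained


def value(coalition):
    """Model output when only features in `coalition` take the applicant's values."""
    mixed = {k: (APPLICANT[k] if k in coalition else BASELINE[k]) for k in BASELINE}
    return premium(mixed)


def shapley_values():
    names = list(BASELINE)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight for a coalition of this size.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                gain = value(set(subset) | {name}) - value(set(subset))
                total += weight * gain
        phi[name] = total
    return phi


contributions = shapley_values()
```

The contributions sum to the difference between the applicant's premium and the baseline premium, which is what makes this kind of breakdown usable as an explanation of a price rise. Brute force only scales to a handful of features; the SHAP package relies on approximations and model-specific shortcuts instead.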

Another machine learning project can demonstrably identify potential high-cost claimants (HiCCs) with 91% accuracy by scraping publicly available data in an explicable and accountable fashion.

That healthcare insurance providers are keen to switch from actuarial to user-specific data, and that they are willing to use data mining and other oblique techniques to justify increased premiums, has become clear in recent years.

For instance, it was revealed in 2014 that patient information in the UK was repeatedly sold to 178 private health insurance companies, leading to a change in the law but also long-term financial implications for those affected.

Two years later it was discovered that the Google/DeepMind AI project had obtained far wider access to the data of millions of UK patients than had been anticipated in an earlier deal with the UK government.

Companies in the US openly vaunt their ability to merge information from data brokers with actuarial data in order to identify individual health data and case histories, while insurance companies are publicly moving from actuarial to socially mined data via providers such as LexisNexis.

The migration from actuarial data is occurring on the pretext of improved early intervention, health risk prediction, and fraud prevention. However, research has found that these legitimate business uses of web-scraping and other mining techniques can lead to the leveraging of personal health information (PHI) for the purposes of raising consumer premiums.

Gaining Healthcare Data by Stealth and Opportunism

Where anonymization is used in health databases, re-identification can be achieved by cross-referencing the database with another database (such as a voter register) created for a different purpose. This is similar to the way that advertising companies track web-users across domains to facilitate targeted marketing.

Two such databases may each be legal and legitimate in scope, regulation and intent; but cross-referencing them can enable re-identification of anonymized data — a use likely not permitted under the acquisition terms of either dataset.
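The cross-referencing described above can be sketched in a few lines. Everything here is invented for illustration: the records, the names, and the choice of ZIP code, birth date and sex as the shared quasi-identifiers.

```python
# Hypothetical "anonymized" health records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "02138", "birth_date": "1961-03-09", "sex": "F", "diagnosis": "E11.9"},
    {"zip": "02139", "birth_date": "1975-11-02", "sex": "M", "diagnosis": "I10"},
    {"zip": "02139", "birth_date": "1980-01-15", "sex": "F", "diagnosis": "J45"},
]

# Hypothetical public register created for an unrelated purpose.
voter_register = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1961-03-09", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1975-11-02", "sex": "M"},
    {"name": "Carol Example", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
    {"name": "Dana Example", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(row):
    """Build the join key from the fields both datasets share."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def reidentify(health, register):
    """Return {name: diagnosis} for health records that match exactly one person."""
    names_by_key = {}
    for person in register:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = {}
    for record in health:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches[candidates[0]] = record["diagnosis"]
    return matches


reidentified = reidentify(health_records, voter_register)
```

In this toy data two of the three records are re-identified. The third stays ambiguous only because two people in the register share the same quasi-identifiers, which is the intuition behind defenses that coarsen or suppress such fields.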

In the United States and elsewhere, legislation against such abuses has proved easier to implement than to enforce. In the US, HIPAA (1996) and the HITECH Act (2009) were conceived, respectively, to address governance and security concerns regarding the use of PHI in machine systems and networks.

In Europe this issue has been covered by the GDPR since 2018; elsewhere, other frameworks have addressed it over the last few decades with varying levels of governmental commitment.

Healthcare Security Breaches Disarm Legislation

The reach of such laws has been undermined not only by the aforementioned impediments to digitization, but also by the way that the digitization process massively increases the attack surface for healthcare infrastructure:

Public healthcare institutions in the US are enduring year-on-year waves of ransomware attacks as the rate of digitization and network-facing data systems grows, with the U.S. Cybersecurity and Infrastructure Security Agency anticipating a greater rise in incidents.
A DNA testing service had 92 million accounts breached in 2018, leading to fears that the stolen data could end up in the hands of insurance companies, if only because publication puts it into the public domain and into the path of insurers' web-scraping and data mining activities.
The U.S. Department of Health and Human Services Office for Civil Rights currently lists 686 data breaches of health-related companies. The list only reports the previous 24 months, and only covers the United States.
In 2017, a report on cyber-incursions on healthcare providers detailed the sale of PHI on the deep web, and the common exploitation of unsecured medical IoT devices and networked printers in health-related environments.

Stolen health data is frequently posted into the public domain, either as punishment for non-payment of ransoms, or as an anarchic act. Mismanagement and poor security practices can also lead to inadvertent publication of PHI. In all cases, there's no way to repair the damage done.

Possible Solutions for Healthcare Data Governance, Privacy and Security

Governmental Mandates and Legislation

In the US, there are currently no federally mandated security procedures for software development, in healthcare or in other sectors. While stronger laws around the world (such as the GDPR) put companies at greater default liability for poor security practices, the best of them do little more than insist on secure encryption protocols. However, some governments actively seek to undermine even this minimal safeguard.

In fact, social engineering and poorly-designed network infrastructure, among other factors, frequently play a more important role in healthcare data incursions.

Internationally and on a state-by-state basis in the US, varying statutory provisions exist to limit price optimization by insurance companies, among the most effective being California's Proposition 103 (1988). Such legislation enshrines the actuarial approach against any better and later technologies (including machine learning), and enforces this for social and economic reasons.

Clearly the quality and scope of legislation must be updated to address new issues arising from machine learning: in Sweden, the Analytic Imaging Diagnostics Arena (AIDA) has recently begun to develop a new legislative framework to address healthcare data sharing in the AI era, while a similar proposal out of Stanford attempts to clarify the extent to which patient data can be commercialized.

New initiatives like this would seem to be a necessary step for all countries that currently rely on older laws to govern newer technologies.

Siloed Institutions for PHI Processing

Air-gapped networks are often used by sensitive industries such as nuclear power, air traffic control, and the military, though an air gap is not an insurmountable obstacle for determined hackers and does not remove the possibility of social engineering attacks.

However, with appropriate funding and popular support, this approach can work. The private/public partnership behind the Danish National Genome Center (DNGC) sets a high standard for state governance of sensitive medical data: only one copy of a patient's genetic data exists in the center, and that copy cannot be duplicated or erased and will never be allowed to cross borders.

The DNGC is not just a 'data center' for PHI, but operates many research initiatives under the protection of the same level of state-led governance that ordinarily encompasses tax information, driving licenses, and other civic functions precluded from the private sector.

That said, transferring this concept from a small country like Denmark to a larger one like the USA would entail the creation of a well-funded organization on a scale equivalent to the CDC — with a predictable accompanying gauntlet of political and public objection around costs and the de-emphasis of private sector participation.

Machine Learning Approaches

In theory, machine learning can solve its own problems in this respect. In July 2020, new research into federated learning proposed a system where multiple institutions contribute data 'blindly' into machine learning models. Under this approach, all participants benefit from the entirety of the results without gaining access to the other contributing databases.

Federated learning

In this case, the black-box nature of neural network data models serves as an aid to security, since the resulting algorithms will be too abstracted to deconstruct.
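To make the idea concrete, here is a minimal sketch of federated averaging, one common way to implement federated learning; it is not necessarily the scheme used in the research cited above. Two hypothetical hospitals each train a tiny linear model on their own data, and only the model parameters travel to the coordinating server. The datasets, learning rate and round counts are assumptions chosen for illustration.

```python
def local_update(w, b, data, lr=0.01, epochs=20):
    """Run a few gradient-descent steps on one institution's private data."""
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2 * err * x
            grad_b += 2 * err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


def federated_average(updates):
    """Average parameters, weighting each site by its number of samples."""
    total = sum(n for _, _, n in updates)
    w = sum(w_i * n for w_i, _, n in updates) / total
    b = sum(b_i * n for _, b_i, n in updates) / total
    return w, b


# Hypothetical private datasets drawn from y = 2x + 1; they are never pooled.
hospital_a = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0)]
hospital_b = [(x, 2 * x + 1) for x in (3.0, 4.0)]
sites = [hospital_a, hospital_b]

w, b = 0.0, 0.0
for _ in range(200):  # communication rounds
    updates = []
    for data in sites:
        w_i, b_i = local_update(w, b, data)
        updates.append((w_i, b_i, len(data)))  # only parameters leave the site
    w, b = federated_average(updates)
```

After training, the shared model approaches the relationship present in both datasets (w close to 2, b close to 1), although neither hospital has seen the other's records.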

Other AI-based approaches include:

PrivPy, a secure multi-party computation (MPC) framework for machine learning tasks.
SecureML, a system of protocols to preserve privacy in machine learning tasks.
An edge-based framework to prevent centralization of healthcare data on the cloud.
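Secure multi-party computation can be illustrated with additive secret sharing, one common building block of MPC protocols. The sketch below is not the API of PrivPy or SecureML; the modulus, the three hospitals and their patient counts are assumptions made for this example.

```python
import secrets

PRIME = 2**61 - 1  # modulus for the shares; chosen here purely for illustration


def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares into the value they encode."""
    return sum(shares) % PRIME


# Hypothetical private inputs: patient counts held by three hospitals.
private_counts = [120, 75, 305]
n = len(private_counts)

# Each hospital splits its count and sends one share to every party.
distributed = [share(v, n) for v in private_counts]

# Party j adds up the j-th share received from every hospital (a local step).
partial_sums = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]

# Combining the partial sums reveals only the total, not the individual inputs.
total = reconstruct(partial_sums)
```

Each individual share is a uniformly random number, so a party holding fewer than all of the shares of a value learns nothing about it; only the aggregate is ever reconstructed.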

Conclusion

Left entirely to market forces, the evolution of AI-driven machine learning in the healthcare sector risks reducing global healthcare availability through increased premiums, and turning the known social benefits of affordable coverage into a genetic lottery, among many other potentially negative effects.

However, the technologies that enable this arguable abuse of knowledge also have the potential to prolong and improve life. Society must navigate and mitigate this conflict through legislation that balances a reliance on market forces against the hard-won tenets of civilization. Though many of the specific challenges around this are technical, the central problem is ethical and sociological in nature.
