Modern Evidence-Based Medicine is more dependent than ever on multi-dimensional data sets originating from various sources, such as medical imaging, genome expression analysis, patient health records and contextual factors. This data drives analysis, facilitates correct diagnoses and ultimately helps to develop and select appropriate treatments. The volume and impact of this data have increased tremendously through technological developments such as real-time streaming of device-driven vitals, high-throughput genomics and high-resolution medical imaging. Additionally, the availability and popularity of web applications that integrate wearable healthcare devices have enabled the collection and monitoring of fine-grained personal healthcare data.


The fusion of these heterogeneous data sources has already led to many breakthroughs in health research and shows high potential for the development of methods that will push current reactive practices towards prescriptive, personalized and preventive healthcare. This potential is widely recognized and has led to the development of many platforms for the collection and statistical analysis of healthcare data (e.g. Apple Health, Microsoft HealthVault, Oracle Health Management, and Philips HealthSuite). However, the heterogeneity of the data, privacy concerns, and the complexity and multiplicity of healthcare processes (e.g. diagnosis, therapy control, and risk prediction) create significant challenges for data fusion, algorithm selection and tuning. These challenges leave a huge gap between actual and potential data usage in healthcare, which prevents a paradigm shift from delayed, generalized medicine to evidence-based, predictive, personalized medicine.
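To make the data-fusion challenge concrete, here is a minimal sketch of joining heterogeneous patient records on a shared identifier. All field names, sources and values are hypothetical; a real system would also need schema mapping, unit harmonization and conflict resolution.

```python
# Minimal sketch of fusing heterogeneous patient data keyed by a shared
# patient identifier. All field names and values are hypothetical.
from collections import defaultdict

ehr_records = [
    {"patient_id": "p1", "age": 54, "diagnosis": "T2D"},
    {"patient_id": "p2", "age": 61, "diagnosis": "CHF"},
]
wearable_readings = [
    {"patient_id": "p1", "avg_heart_rate": 72},
    # p2 has no wearable data: fusion must tolerate missing sources
]
lab_results = [
    {"patient_id": "p1", "hba1c": 7.1},
    {"patient_id": "p2", "bnp": 910},
]

def fuse(*sources):
    """Outer-join lists of records on patient_id into one dict per patient."""
    fused = defaultdict(dict)
    for source in sources:
        for record in source:
            fused[record["patient_id"]].update(record)
    return dict(fused)

patients = fuse(ehr_records, wearable_readings, lab_results)
print(patients["p1"])  # all three sources merged
print(patients["p2"])  # EHR + labs only; wearable fields simply absent
```

Even this toy join shows why heterogeneity matters: downstream models must cope with patients for whom entire sources are missing.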


As such, a platform for the collaborative, privacy-preserving sharing, analysis and evaluation of healthcare data would drastically facilitate the creation of advanced models on heterogeneous fused data. It would also ensure the reproducibility of results and provide a solid basis for the development of algorithm ranking and selection methods based on collaborative meta-learning. Ongoing research focused on building such meta-learning platforms has matured to the point that several institutions are already well on their way to bringing their ideas to market.


Cerebrum and OpenML

Cerebrum is a cloud-based machine learning tool for healthcare apps that helps analyze the mountains of patient-generated data, with the aim of predicting health events earlier so that clinicians can intervene sooner.


According to its creator, Medable, Cerebrum provides machine learning across the ecosystem of clinical study data, including standard clinical instruments and patient-reported outcomes data, metadata from mobile and connected devices, and genomic and epigenomic data.


Cerebrum’s machine learning system automates the identification of high-value predictors and rapidly generates novel insights. It can help healthcare providers gain a much better understanding of their data through text classification and mining, emotion/behavior analysis, tagging, and other health-specific features.


Cerebrum is currently being piloted by some of the healthcare institutions with which Medable works. It will be interesting to see whether this platform helps translate the data generated by large digital cohorts, whether through ResearchKit or independent studies like Health eHeart, into learning models that can be applied outside of those studies.


OpenML is a truly open-source, free platform started by an international group of researchers led from the Netherlands.


The OpenML platform addresses the collaboration needs of meta-learning in healthcare AI by caching rich meta-datasets, workflows, evaluations, and algorithms in a central hub repository.


Currently, OpenML is not fully distributed, but it can be installed on local instances that communicate with the main OpenML database using mirroring techniques. The downside of this approach is that code (machine learning workflows), datasets, and experiments (models and evaluations) are physically kept on local instances, so users cannot easily communicate and share. Ultimately, OpenML could evolve into a fully distributed machine learning platform accessible from different data mining and machine learning environments such as RapidMiner, R, WEKA, KNIME, and similar tools. Such a distributed platform would make it easy to share data and knowledge.


Currently, regulations and privacy concerns often prevent hospitals from learning from each other's approaches (e.g. machine learning workflows), reproducing work done by others (data version control, preprocessing and statistical analysis), and building models collaboratively.


On the other hand, metadata such as the type of hospital, the percentage of readmitted patients or an indicator of emergency treatment, as well as the learned models and their evaluations, can be shared and have great potential for the development of a cutting-edge meta-learning system for ranking, selecting and tuning machine learning algorithms.
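The idea of ranking algorithms from shared metadata alone can be sketched in a few lines. The meta-feature names, hospitals and scores below are all hypothetical: a new hospital describes itself with the same meta-features used by the shared records, and algorithms are ranked by the evaluations of the most similar contributor, with no patient-level data ever exchanged.

```python
# Illustrative sketch: recommend algorithms for a new hospital from shared
# evaluations, without exchanging any patient-level data. All meta-feature
# names and AUC scores are hypothetical.
import math

# Shared records: hospital meta-features plus per-algorithm evaluation scores.
shared = [
    {"meta": {"beds": 900, "readmission_rate": 0.18},
     "scores": {"logreg": 0.71, "rf": 0.79, "gbm": 0.81}},
    {"meta": {"beds": 120, "readmission_rate": 0.09},
     "scores": {"logreg": 0.77, "rf": 0.72, "gbm": 0.70}},
]

def distance(a, b):
    """Euclidean distance over the shared meta-feature keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def recommend(new_meta, records):
    """Rank algorithms by the scores of the most similar known hospital."""
    nearest = min(records, key=lambda r: distance(new_meta, r["meta"]))
    return sorted(nearest["scores"], key=nearest["scores"].get, reverse=True)

ranking = recommend({"beds": 850, "readmission_rate": 0.17}, shared)
print(ranking)  # algorithms ordered by expected performance
```

A production meta-learner would at minimum normalize meta-feature scales (here the bed count dominates the distance) and aggregate over several neighbors rather than one.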


The success of meta-learning systems is greatly influenced not only by the size of the problem (data) and algorithm spaces, but also by the quality of the data and algorithm descriptions (meta-features). Such platforms therefore need to employ domain knowledge provided by clinical experts and formal sources (e.g. ontologies) in order to extend the meta-feature space for meta-learning in healthcare applications. For example, in meta-analyses of VOC spectral microarray data from diagnostic devices for cancer detection, the type of MEMS chip is very important in predicting algorithm performance. Further, for mixed data sources it would be useful to know which type of data contributed to the performance (electronic health records, laboratory tests, data from wearables, etc.).
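A meta-feature vector of the kind described above might be assembled as in the sketch below, combining simple statistical descriptors of a dataset with expert-provided domain descriptors. The feature names, labels and domain values are hypothetical examples, not a fixed schema.

```python
# Sketch of building a meta-feature vector for one dataset, combining simple
# statistical descriptors with expert-provided domain descriptors. Feature
# names and the example labels are hypothetical.
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def meta_features(rows, labels, domain_info):
    """Statistical meta-features plus domain knowledge for one dataset."""
    features = {
        "n_instances": len(rows),
        "n_attributes": len(rows[0]),
        "class_entropy": round(class_entropy(labels), 3),
    }
    features.update(domain_info)  # e.g. from a clinical ontology or expert
    return features

rows = [[5.2, 0.8], [4.9, 1.1], [6.3, 0.7], [5.8, 0.9]]
labels = ["readmitted", "not", "not", "not"]
mf = meta_features(rows, labels,
                   {"source": "ehr+wearables", "mems_chip": "typeA"})
print(mf)
```

The domain descriptors (here `source` and `mems_chip`) are exactly the kind of expert knowledge the text argues must extend the purely statistical meta-feature space.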


In contrast to data descriptions, algorithm descriptions are much less analyzed and applied in the meta-learning process. Results have often shown that descriptions at the level of algorithm sub-parts (e.g. initialization type and internal evaluation measures in clustering algorithms) can improve the quality of meta-learning predictions and, additionally, identify which algorithm sub-parts really influence overall performance.
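One way such component-based descriptions could be represented is sketched below: each algorithm is described as a dictionary of sub-parts, which is flattened into one-hot meta-features that a meta-learner can use as predictors. The component names and values are hypothetical.

```python
# Sketch of component-based algorithm descriptions: each algorithm is a dict
# of sub-parts, flattened into one-hot meta-features that a meta-learner can
# use as predictors. Component names and values are hypothetical examples.
def flatten_components(description):
    """Turn {'init': 'kmeans++', ...} into one-hot meta-feature columns."""
    return {f"{part}={value}": 1 for part, value in description.items()}

algorithms = {
    "clustering_a": {"init": "random", "internal_measure": "silhouette"},
    "clustering_b": {"init": "kmeans++", "internal_measure": "silhouette"},
}

algo_meta = {name: flatten_components(desc)
             for name, desc in algorithms.items()}
print(algo_meta["clustering_b"])
```

Because each sub-part becomes its own column, a meta-learner fitted on these features can attribute performance differences to individual components, e.g. the initialization scheme, rather than to whole algorithms.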


Hence, machine learning platforms need to include component-based algorithm definitions as meta-features and allow their use as predictors in meta-learning systems. The development of such a collaborative meta-learning system would address challenging tasks in healthcare predictive analytics, such as early diagnostics and risk detection, hospital readmission prediction and automated therapy control, with many potential stakeholders: patients, doctors, hospitals and insurance companies, among others.


About the Author

Umakant Jayaram is a BI consultant for Ohum Healthcare Solutions. He has over two decades of experience in analytics and business intelligence, largely in financial markets and healthcare, developing and deploying statistical analytical models. He is also visiting faculty in strategy at NMIMS University in Bombay, and serves on the model review committee for a global bank.



References

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked Science in Machine Learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.