Modern evidence-based medicine is more dependent than ever on multi-dimensional datasets
originating from various sources, such as medical imaging, genome expression analysis, patient
health records and contextual factors. This data drives analysis, facilitates correct diagnoses and ultimately helps to develop and select the appropriate treatments. The volume and impact of
this data has increased tremendously through technological developments such as real-time
streaming of device-driven vitals, high-throughput genomics and high-resolution medical
imaging data. Additionally, the availability and popularity of different web applications that
integrate wearable healthcare devices has allowed the collection and monitoring of fine-grained
personal healthcare data.
The fusion of these heterogeneous data sources has already led to many
breakthroughs in health research and shows high potential for the development of methods that will push current reactive practices towards prescriptive, personalized and preventive
healthcare. This potential is recognized and has led to the development of many platforms for
the collection and statistical analysis of healthcare data (e.g. Apple Health, Microsoft
HealthVault, Oracle Health Management, and Philips HealthSuite). However, the heterogeneity of the
data, privacy concerns, and the complexity and multiplicity of healthcare processes (e.g.
diagnoses, therapy control, and risk prediction) create significant challenges for data fusion,
algorithm selection and tuning. These challenges leave a huge gap between the actual and the
potential data usage in healthcare, which prevents a paradigm shift from delayed generalized
medicine to evidence-based, predictive, personalized medicine.
As such, a platform for collaborative, privacy-preserving sharing, analysis and evaluation of
healthcare data would drastically facilitate the creation of advanced models on heterogeneous
fused data. It would also ensure the reproducibility of results and provide a solid basis for the
development of algorithm ranking and selection methods based on collaborative meta-learning.
Ongoing research focused on building such meta-learning platforms has matured to the
point that several institutions are already well on their way to bringing their ideas to market.
Cerebrum is a cloud-based machine learning tool for healthcare applications that analyzes the
mountains of patient-generated data to improve the prediction of health events so that clinicians
can intervene sooner.
According to its creator organization, Medable, Cerebrum provides machine learning across the
ecosystem of clinical study data, including standard clinical instruments and patient-reported
outcomes data, metadata from mobile and connected devices, and genomic and epigenomic data.
Cerebrum’s machine learning system automates the identification of high-value predictors and
rapidly generates novel insights. It can help healthcare providers gain a much better
understanding of their data through text classification and mining, emotion/behavior analysis,
tagging, and other health-specific features.
Cerebrum is currently being piloted by some of the healthcare institutions with whom Medable
works. It will be interesting to see if this platform helps translate the data generated by some of
the large digital cohorts, whether through ResearchKit or independent studies like Health eHeart, into learning models that can be applied outside of those studies.
OpenML is a free, open-source platform started by a group of researchers from the Netherlands and Serbia.
The OpenML platform supports the collaboration needed for meta-learning in healthcare AI by storing rich meta-datasets, workflows, evaluations, and algorithms in a central hub repository.
Currently, OpenML is not fully distributed, but it can be installed on local instances that
communicate with the main OpenML database using mirroring techniques. The downside of
this approach is that code (machine learning workflows), datasets and experiments (models and
evaluations) are kept physically on local instances, which limits sharing and collaboration
between users.
Ultimately, OpenML could evolve into a fully distributed machine learning platform accessible
from different data mining and machine learning tools such as RapidMiner, R, WEKA and
KNIME. Such a distributed platform would make it easy to share data, code and results.
Currently, regulations and privacy concerns often prevent hospitals from learning from each
other’s approaches (e.g. machine learning workflows), reproducing work done by others (data
version control, preprocessing and statistical analysis), and building models collaboratively.
On the other hand, metadata such as type of hospital, percentage of readmitted patients or
indicator of emergency treatment, as well as the learned models and their evaluations can be
shared and have great potential for the development of a cutting-edge meta-learning system for
ranking, selection and tuning of machine learning algorithms.
The success of meta-learning systems is not only greatly influenced by the size of problem
(data) and algorithm spaces, but also by the quality of the data and algorithm descriptions
(meta-features). Such platforms, therefore, need to employ domain knowledge provided by clinical experts and formal sources (e.g. ontologies) in order to extend the meta-feature space for meta-learning in healthcare applications. For example, in meta-analyses of VOC spectral microarray data from diagnostic devices for cancer detection, the type of MEMS chip is very important in predicting algorithm performance. Further, for mixed data sources it would be useful to know which type of data contributed to the performance (electronic health records, laboratory tests, data from wearables, etc.).
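To make this concrete, the sketch below shows how simple statistical meta-features could be computed for a dataset and extended with expert-provided domain descriptors. The function name and the `domain_info` descriptors are illustrative assumptions, not part of any existing platform's API.

```python
import math
from collections import Counter

def extract_meta_features(X, y, domain_info=None):
    """Compute simple dataset meta-features for meta-learning.

    X: list of feature rows; y: list of class labels.
    domain_info: optional dict of expert-provided descriptors
    (e.g. {"device_chip_type": "MEMS-A"}), merged into the result.
    """
    n_instances = len(X)
    n_features = len(X[0]) if X else 0
    # Class entropy: a classic statistical meta-feature.
    counts = Counter(y)
    entropy = -sum((c / n_instances) * math.log2(c / n_instances)
                   for c in counts.values())
    meta = {
        "n_instances": n_instances,
        "n_features": n_features,
        "n_classes": len(counts),
        "class_entropy": round(entropy, 4),
    }
    # Domain knowledge from clinical experts or ontologies extends
    # the meta-feature space, as argued above.
    if domain_info:
        meta.update(domain_info)
    return meta

X = [[0.1, 1.2], [0.3, 0.7], [0.5, 0.9], [0.2, 1.1]]
y = ["healthy", "sick", "sick", "healthy"]
print(extract_meta_features(X, y, {"source": "wearable"}))
```

In a real platform these descriptors would of course be far richer, but the principle is the same: statistical and domain meta-features end up in one record describing the dataset.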
In contrast to data descriptions, algorithm descriptions are much less analyzed and applied in
the meta-learning process. Results have often shown that descriptions at the level of algorithm
sub-parts (e.g. initialization type and internal evaluation measures in clustering algorithms)
can improve the quality of meta-learning predictions and, additionally, identify which
algorithm sub-parts actually influence overall performance.
Hence, ML platforms need to include component-based algorithm definitions as meta-features
and allow their usage as predictors in meta-learning systems. The development of such a
collaborative meta-learning system would address different challenging tasks in healthcare
predictive analytics like early diagnostics and risk detection, hospital readmission prediction,
automated therapy control or similar, with many potential stakeholders: patients, doctors, hospitals, insurance companies, among others.
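As a sketch of this idea, the toy meta-learner below treats both dataset meta-features and a component-level algorithm description (the `init` sub-part of a clustering algorithm) as predictors, and estimates performance with a nearest-neighbour lookup over past experiments. All names, records, and scores are hypothetical, not real results.

```python
# A minimal meta-learning sketch: dataset meta-features and
# component-based algorithm descriptions together predict performance.

def encode(record):
    # One-hot encode the categorical "init" algorithm component;
    # keep numeric dataset meta-features as-is.
    init_types = ["random", "kmeans++"]
    return [record["n_instances"], record["class_entropy"]] + \
           [1.0 if record["init"] == t else 0.0 for t in init_types]

def predict_performance(history, query):
    """1-nearest-neighbour meta-learner: return the observed score of
    the most similar past (dataset, algorithm-configuration) pair."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    q = encode(query)
    best = min(history, key=lambda r: dist(encode(r), q))
    return best["score"]

# Hypothetical meta-dataset: past experiments described by dataset
# meta-features, an algorithm sub-part ("init") and observed scores.
history = [
    {"n_instances": 500,  "class_entropy": 0.9, "init": "random",   "score": 0.71},
    {"n_instances": 500,  "class_entropy": 0.9, "init": "kmeans++", "score": 0.83},
    {"n_instances": 2000, "class_entropy": 0.4, "init": "random",   "score": 0.65},
]

query = {"n_instances": 520, "class_entropy": 0.85, "init": "kmeans++"}
print(predict_performance(history, query))
```

Because the algorithm component is encoded as an explicit predictor, the same framework can also be inspected to see which sub-parts drive predicted performance, which is exactly the point made above.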
About the Author
Umakant Jayaram is a BI consultant for Ohum Healthcare Solutions. Umakant has over two decades of experience in Analytics and Business Intelligence, largely in Financial Markets and Healthcare, developing and deploying statistical analytical models. He is also a Visiting Faculty in Strategy at NMIMS University in Bombay, and is on the Model Review Committee for a Global Bank.
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luís Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.