1 Introduction

1.1 Motivation and Objectives

Subpopulation discovery is an essential objective of data analysis in medical research [1], [2]. Knowledge of characteristic subpopulations can improve prevention (public health) and treatment (clinical medicine) of adverse medical conditions. Subpopulations are detected by (i) identifying long-term determinants and protective factors of a medical condition of interest [3]–[5], (ii) revealing subcohorts with increased disease prevalence or with different treatment response [6]–[8], and (iii) generating robust statistical models that can explain relationships between one or more independent variables and the outcome [9]–[11]. For example, epidemiologists attempt to discover associations between specific features (e.g., demographics and descriptors of lifestyle) and a target variable (e.g., obesity) in cohort studies by collecting and analyzing extensive participant data obtained from questionnaires, medical examinations, laboratory analyses, and imaging [12]–[15]. In studies with a longitudinal design, these measurements are collected repeatedly over time and contain hidden temporal information, the investigation of which can potentially lead to new insights.

To find associations between variables, medical researchers usually first derive hypotheses from clinical practice, experimental studies, or extensive literature reviews and then test them formally for statistical significance [16]. However, with the ever-increasing volume and heterogeneity of medical data [17], traditional hypothesis-driven workflows are becoming increasingly impractical, and some critical inherent associations between variables may go undetected [18]. Machine learning can improve medical research by discovering understandable descriptions of subpopulations of patients or study participants that are similar with respect to the target variable; these descriptions can then be used to derive new hypotheses [19], [20].

The proliferation of medical machine learning applications is motivated, among other things, by the desire to make automated use of the plethora of information collected about study subjects, and sometimes also by the ubiquity of deep learning success stories in the media [21]. However, the ease of creating complex data-driven models is no guarantee that insights can be effortlessly derived [22]. Most state-of-the-art machine learning algorithms, such as deep neural networks [23] and gradient boosting machines [24], generate so-called black-box models with multiple layers of complexity that involve many multivariate, nonlinear interactions between variables that are difficult to represent intuitively [25], [26].

It is critical that the application expert, who is not a practitioner but a scientist working in a clinical or epidemiological setting, is equipped with tools to understand, explore, and visualize the models [27], [28] so that they can drill down to specific individual patterns and gain actionable insights that ultimately contribute to prevention, diagnosis, and treatment in clinical practice. Because medical data come from a wide variety of sources, key characteristics of the collected datasets vary, which requires adapting the methods to each application scenario [29].

This work proposes methods that assist medical researchers in analyzing high-dimensional, timestamped medical data. Hence, the core research question of the thesis is:

RQ: How to derive accurate yet understandable patterns for subpopulation discovery in high-dimensional timestamped medical data?

Before, during, and after the generation of machine learning models, several challenges must be overcome for the medical expert to derive actionable knowledge. We translate these challenges into the following three goals:

  1. Comprehensibility and distinctiveness of subpopulations: The extracted models, including clusters, rules, and other patterns, must be made understandable; preferably, the model generation process must also be comprehensible. Furthermore, we must minimize redundancy, which negatively affects the perceived quality of the model. Our task is to extract, process, and display the most relevant patterns for expert-driven model exploration.
  2. Exploitation of time: Hidden temporal information must be exploited. Medical scholars search for long-term determinants of severe diseases. Finding patterns from subject “evolution” can contribute to this goal.
  3. Post-hoc interpretation of complex black-box models: If the discovered patterns are not intrinsically interpretable, methods are required that extract the most relevant subpopulations that can be presented to the application expert.

1.2 Structure and Contributions of This Thesis

This thesis presents solutions that support medical researchers in expert-driven subpopulation discovery in high-dimensional, timestamped medical data. Design decisions and developments were partly inspired by suggestions from the respective domain experts and cooperation partners, including three tinnitus experts, an epidemiologist with statistical expertise, and a diabetes expert.

The thesis is organized into three parts and ten chapters tackling the research question and challenges mentioned above. Part I covers methods for subpopulation discovery in high-dimensional data. Part II focuses specifically on temporal aspects of medical datasets and provides approaches that extract informative representations from timestamped data. Part III addresses the post-hoc analysis of machine learning models and includes solutions to derive model-, observation-, and subpopulation-level insights from otherwise “opaque” black-box models.

  • Chapter 2 (Medical Background and Datasets) presents the medical background relevant to this thesis, a brief comparison of medical study types, and an overview of the medical studies used to validate the proposed methods.
  • Chapter 3 (Interactive Discovery and Inspection of Subpopulations) presents a workflow for interactive data-driven analysis of population-based cohort data using hepatic steatosis as an example. It includes steps (i) to detect subpopulations that have different distributions with respect to the target variable, (ii) to classify each subpopulation taking class imbalance into account, and (iii) to detect variables associated with the outcome.
  • Chapter 4 (Identifying Distinct Subpopulations) refines the analysis of the previous chapter by examining redundancy in large rule sets describing subpopulations. We present a workflow that extracts a smaller number of “representative” rules, i.e., rules that avoid instance overlap as much as possible, thus covering different subpopulations.
  • Chapter 5 (Visual Identification of Informative Features) introduces a parameter-free clustering approach for deriving phenotypes, phenotype exploration, and visual juxtaposition of phenotypes in a high-dimensional feature space.
  • Chapter 6 (Constructing Evolution Features to Capture Change over Time) presents a solution for cohort analysis in longitudinal cohort study data to construct “evolution features” from latent temporal information describing the cohort participants’ change over time.
  • Chapter 7 (Feature Extraction from Short Temporal Sequences for Clustering) complements the previous solution by presenting an approach to create representations from short temporal sequences via clustering in experimental data.
  • Chapter 8 (Post-Hoc Interpretation of Classification Models) builds upon the insights of the previous chapters on the role of features in subpopulation understanding. We propose a method that makes already learned, complex classification models understandable to domain experts. We combine the classification of high-dimensional medical data with model explanation using post-hoc interpretation methods. To this end, we use Shapley value explanations (SHAP), LASSO coefficients, and partial dependence plots. Our approach delivers statistics and visualizations representing global feature importance, instance-individual feature importance, and subpopulation-specific feature importance, all of which help illuminate complex black-box machine learning models.
  • Chapter 9 (Subpopulation-Specific Learning and Post-Hoc Model Interpretation) addresses the issue of visualizing differences between two subpopulations in temporal data. For this purpose, we derive a post-hoc interpretation measure to assess the difference in the predictors’ association with the target variable between two subpopulations.
  • Chapter 10 (Summary and Future Work) concludes the thesis by summarizing the contributions and providing a detailed outlook for the presented work.
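
The idea behind the redundancy reduction outlined for Chapter 4 can be illustrated with a minimal sketch. It is not the workflow developed in the thesis; it only shows one simple way to keep rules whose covered instances overlap little. The function names, the greedy strategy, the Jaccard overlap measure, the threshold max_overlap, and the toy rules are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of covered instance ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_representative_rules(coverage, max_overlap=0.5):
    """Greedily keep rules that overlap little with the rules already kept.

    coverage: dict mapping a rule label to the set of instance ids it covers.
    max_overlap: hypothetical Jaccard threshold (an assumption, not from the thesis).
    """
    # Visit rules from largest to smallest coverage; ties are broken by label.
    ordered = sorted(coverage, key=lambda r: (-len(coverage[r]), r))
    kept = []
    for rule in ordered:
        # Keep the rule only if it is sufficiently distinct from every kept rule.
        if all(jaccard(coverage[rule], coverage[k]) <= max_overlap for k in kept):
            kept.append(rule)
    return kept


# Toy example: three hypothetical rules over a handful of study participants.
toy_coverage = {
    "age>60 & bmi>30": {0, 1, 2, 3, 4},
    "age>60 & bmi>30 & male": {0, 1, 2, 3},
    "alcohol=high": {6, 7, 8},
}
representatives = select_representative_rules(toy_coverage)
print(representatives)
```

In this toy example, the second rule covers almost the same participants as the first and is therefore dropped, while the third rule describes a disjoint subpopulation and is kept. A practical workflow would additionally weigh rule quality, for example with respect to the target variable, which this sketch ignores.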

For the validation of the proposed methods, we used datasets from the following epidemiological and clinical studies:

  • SHIP: the longitudinal population-based “Study of Health in Pomerania” [12],
  • CHA: an observational therapy study involving data on self-report questionnaire responses from tinnitus patients [30],
  • DIAB: a clinical experiment yielding timestamped plantar pressure and temperature recordings from diabetes patients and non-diabetic volunteers [31], and
  • ANEUR: a retrospective clinical study involving image data on intracranial aneurysms [32].

These datasets are used for method validation as follows:

Dataset Ch. 3 Ch. 4 Ch. 5 Ch. 6 Ch. 7 Ch. 8 Ch. 9
SHIP x x x x
CHA x x x
DIAB x
ANEUR x