Deterministic and Probabilistic Predictive Modelling of Environmental and Clinical Risk Factors in T2D and Asthma
Authors: E. Longato, A. Zandonà, M. Vettoretti, B. Di Camillo. PULSE project partner, University of Padova
The team at the University of Padova (UNIPD) has been working on predictive models of type 2 diabetes (T2D) and asthma onset, with the goal of choosing and developing the most scientifically sound risk-evaluation tools within the PULSE systems.
T2D is a serious health condition, characterised by elevated blood glucose levels and insulin resistance. Early detection of T2D onset, good therapy, and healthier lifestyle choices are important to limit, or even avoid, T2D complications. Similarly, a prompt asthma diagnosis can prevent under-treatment, slow down the progression of the disease and keep its symptoms (e.g., airflow obstruction, bronchospasm, and shortness of breath) under control. Notably, though, unlike those for childhood asthma, the risk factors for adult-onset asthma are not yet well understood.
Predicting Diabetes and Asthma
The concept of a “predictive model” is rather intuitive: we would like to have at our disposal a tool that, given sufficient information on an individual at a certain point in time, could predict whether something will happen to them in the future.
The first requirement for a good predictive model is, of course, accuracy, i.e., the ability to make “good” predictions. In the case of the PULSE project, a “good” prediction is one that assigns high risk scores to subjects with a high probability of developing T2D or asthma, without generating false alarms. In this way, we can be confident that all the suggestions we give are reliable and that they specifically target those who would benefit from them.
The second requirement is somewhat subtler, as it involves the concept of “generalisation,” i.e., the ability of a model developed under a specific set of experimental conditions to retain its usefulness in a real-life scenario. This is particularly important given the global nature of PULSE and the many cities in which the system will be deployed: there is no a priori guarantee that performance will remain consistent when the general demographic characteristics of a population change (e.g., when moving from Barcelona to Singapore, or when considering different age groups).
Unfortunately, the metrics published alongside models in the literature are not sufficient to make a fair comparison between them, because they are strongly population-dependent. Hence, the UNIPD team validated and recalibrated all the suitable models (6 for T2D and 2 for asthma, plus their variations) on two common datasets, extracted from the Health and Retirement Study (HRS) and the Multi-Ethnic Study of Atherosclerosis (MESA). This approach has three main advantages:
It guarantees fairness by levelling the playing field (all the models “compete” by predicting on the same external dataset);
It gives a good proxy of what would happen if we were to apply a model in different cities;
It allows us to retrace the steps taken by the original investigators and, in a sense, re-develop the model from scratch to check what it would have looked like, had it been tuned on MESA or HRS in the first place.
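As a rough illustration of what validating and recalibrating a published model on an external dataset can involve, the Python sketch below scores a made-up cohort, measures discrimination with the area under the ROC curve, and then shifts the risks on the logit scale so that the average predicted risk matches the observed onset rate. The article does not say which recalibration method the team used; intercept-only recalibration is chosen here only because it is simple. The cohort values and all function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def auc(risks, outcomes):
    """Chance that a random case outranks a random non-case (ties count 0.5)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in controls
    )
    return wins / (len(cases) * len(controls))


def recalibrate_intercept(risks, outcomes, tol=1e-10):
    """Shift all risks on the logit scale so their mean equals the observed rate.

    The shift is found by bisection, since the mean risk grows with the shift.
    """
    target = sum(outcomes) / len(outcomes)
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_risk = sum(expit(logit(r) + mid) for r in risks) / len(risks)
        if mean_risk < target:
            lo = mid
        else:
            hi = mid
    shift = (lo + hi) / 2.0
    return [expit(logit(r) + shift) for r in risks]


# Hypothetical external cohort: risks from a published model and the
# outcomes actually observed (1 = disease onset). Not real HRS or MESA data.
published_risks = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80]
observed_onset = [0, 0, 0, 1, 0, 0, 1, 1]

recalibrated = recalibrate_intercept(published_risks, observed_onset)

auc_before = auc(published_risks, observed_onset)
auc_after = auc(recalibrated, observed_onset)
mean_before = sum(published_risks) / len(published_risks)
mean_after = sum(recalibrated) / len(recalibrated)

print(f"AUC before/after recalibration: {auc_before:.3f} / {auc_after:.3f}")
print(f"Mean predicted risk before/after: {mean_before:.3f} / {mean_after:.3f}")
print(f"Observed onset rate: {sum(observed_onset) / len(observed_onset):.3f}")
```

Because the shift preserves the ranking of subjects, the AUC is identical before and after; only the agreement between predicted and observed risk changes. This is one reason a model can keep its discrimination ability in a new population and still need recalibration there.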
To get a comprehensive picture of what is actually going on, the UNIPD team considered model performance from three complementary points of view: “Does the model assign higher risk scores to subjects who eventually develop T2D or asthma than to those who do not?”; “Does the model assign higher risk scores to subjects who develop T2D or asthma earlier?”; and “Does a risk of 80% output by the model actually mean a 4 in 5 chance of developing T2D or asthma?”
By painstakingly analysing these results, the UNIPD team was able to discern some common trends in state-of-the-art predictive modelling:
One of the simplest and most used models for T2D prediction, FINDRISC, is well calibrated, but exhibits suboptimal discrimination performance.
The main contributor to T2D model accuracy is the knowledge of blood work results and, specifically, of fasting blood glucose levels.
In general, T2D models are well behaved in terms of discrimination ability (i.e., a change of population does not greatly affect performance), but often lack calibration.
In general, literature asthma models perform poorly on all counts (this may be partly because they have not been validated for prediction by their original authors).
The UNIPD team also investigated T2D and asthma onset from a descriptive point of view. They developed a mathematical model that characterizes the status of a subject (health, lifestyle, environmental conditions, etc.) before and at T2D or asthma onset. Monitoring the changes from a healthy to a disease status is crucial to define risk factors as well as to characterize the effects of T2D/asthma on subjects.
Specifically, the team applied a technique known as Bayesian networks (BN) to detect probabilistic relationships among variables and identify synergistic effects. Indeed, BN can identify the combination of factors maximizing the probability of T2D or asthma outcome and can also show how these factors regulate each other.
The Road Ahead
Imagine PULSE as a big sensor collecting a plethora of heterogeneous data: air pollution, traffic, hours of physical activity, smoking habits, weight, height, and so on. Consider the models developed by UNIPD as artificial intelligence entities that automatically learn from all these data.
Predictive models will decide which variables should be considered as risk factors for T2D or asthma onset and will provide a tool to rank subjects based on their risk of developing these diseases. Bayesian networks will identify how these variables affect each other and how they might trigger a disease, for example showing how a habit or a particular lifestyle choice affects the probability of developing T2D or asthma.
The combination of predictive models and BN will empower PULSE systems, giving Public Health Observatories the possibility to provide citizens with specific feedback and suggestions on lifestyle and to organize interventions based on public health data. Well-being models will also be provided, in order to manage public health problems and promote community health in cities.
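As a rough illustration of the kind of question a Bayesian network can answer, the Python sketch below encodes a three-variable toy network (physical inactivity influences obesity, and both influence T2D) and computes how a lifestyle factor changes the probability of disease. Everything in it is an assumption made for illustration: the structure, the variable names, and every probability are invented, whereas the UNIPD networks are learned from data.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are made up).
P_INACTIVE = 0.4                       # P(inactive = 1)
P_OBESE_GIVEN = {0: 0.2, 1: 0.5}       # P(obese = 1 | inactive)
P_T2D_GIVEN = {                        # P(t2d = 1 | inactive, obese)
    (0, 0): 0.05,
    (0, 1): 0.15,
    (1, 0): 0.10,
    (1, 1): 0.30,
}

NAMES = ("inactive", "obese", "t2d")


def bernoulli(p, value):
    """Probability of a binary value when P(value = 1) is p."""
    return p if value == 1 else 1.0 - p


def joint(inactive, obese, t2d):
    """Joint probability, factorised along the edges of the network."""
    return (
        bernoulli(P_INACTIVE, inactive)
        * bernoulli(P_OBESE_GIVEN[inactive], obese)
        * bernoulli(P_T2D_GIVEN[(inactive, obese)], t2d)
    )


def probability(query, evidence=None):
    """P(query | evidence), summing the joint over all consistent states."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        state = dict(zip(NAMES, values))
        if any(state[name] != value for name, value in evidence.items()):
            continue
        p = joint(**state)
        denominator += p
        if all(state[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


# How does a lifestyle factor change the probability of T2D?
risk_if_inactive = probability({"t2d": 1}, {"inactive": 1})
risk_if_active = probability({"t2d": 1}, {"inactive": 0})

# Which combination of factors maximises the probability of T2D?
worst_combo = max(P_T2D_GIVEN, key=P_T2D_GIVEN.get)

print(f"P(T2D | inactive) = {risk_if_inactive:.2f}")
print(f"P(T2D | active)   = {risk_if_active:.2f}")
print(f"Highest-risk (inactive, obese) combination: {worst_combo}")
```

Exhaustive enumeration is only practical for a handful of variables; it is used here to keep the example dependency-free, not because it reflects how larger networks are queried.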