Prediction of mortality risk in health screening participants using machine learning-based models: the J-SHC study


This study was conducted as part of an ongoing study designing a comprehensive medical care system for chronic kidney disease (CKD) based on individual risk assessment through the specific health examination (J-SHC study). In Japan, a specific health examination is conducted each year for all residents aged 40–74 years covered by the National Health Insurance. In this study, a baseline survey was conducted among 685,889 people (42.7% male, aged 40–74 years) who underwent specific health examinations from 2008 to 2014 in eight regions of Japan (Yamagata, Fukushima, Niigata, Ibaraki, Toyonaka, Fukuoka, Miyazaki, and Okinawa). The details of this study are described elsewhere11. Of the 685,889 baseline participants, 169,910 were excluded because baseline data on lifestyle information or blood tests were not available, and a further 399,230 were excluded because their survival follow-up was shorter than 5 years. Therefore, 116,749 participants (42.4% male) with known 5-year survival status were included in this study.

This study was conducted in accordance with the guidelines of the Declaration of Helsinki. This study was approved by the Yamagata University Ethics Committee (Approval No. 2008–103). All data were anonymized prior to analysis; therefore, the ethics committee of Yamagata University waived the need for informed consent from study participants.

Data sets

To validate a predictive model, the most desirable approach is a prospective evaluation on unseen data. Because the dates of the health examinations are available in this study, we divided the data into training and test data sets by health checkup date. The training data set consisted of 85,361 participants who underwent examinations in 2008; the test data set consisted of 31,388 participants who underwent examinations from 2009 to 2014. The two data sets were temporally separated, with no overlapping participants. This design evaluates the model in a manner similar to a prospective study and has the advantage of demonstrating generalizability over time. As preprocessing, the most extreme 0.01% of values were clipped as outliers, and the data were normalized.
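The preprocessing step can be sketched as follows. The exact clipping quantiles and the use of z-score normalization are assumptions, since the text specifies only that "0.01% outliers" were clipped; statistics are estimated on the training data only, consistent with the temporal split.

```python
import pandas as pd

def clip_and_normalize(train: pd.DataFrame, test: pd.DataFrame):
    """Clip extreme values using quantile bounds estimated on the
    training data only, then z-score normalize with training statistics."""
    lower = train.quantile(0.0001)   # bottom 0.01% (assumed threshold)
    upper = train.quantile(0.9999)   # top 0.01% (assumed threshold)
    train_c = train.clip(lower=lower, upper=upper, axis=1)
    test_c = test.clip(lower=lower, upper=upper, axis=1)
    mean, std = train_c.mean(), train_c.std()
    return (train_c - mean) / std, (test_c - mean) / std
```

Fitting the bounds and normalization statistics on the 2008 training data alone prevents information from the 2009–2014 test participants leaking into the preprocessing.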

Information on 38 variables was obtained during the main survey of health examinations. When variables were highly correlated (correlation coefficient greater than 0.75), only one of them was included in the analysis. High correlations were found among body weight, abdominal circumference, and body mass index; between hemoglobin A1c (HbA1c) and fasting blood glucose; and between aspartate aminotransferase (AST) and alanine aminotransferase (ALT) levels. We therefore used body weight, HbA1c level, and AST level as explanatory variables. Finally, we used the following 34 variables to build the prediction models: age, sex, height, weight, systolic blood pressure, diastolic blood pressure, urine glucose, urine protein, urine occult blood, uric acid, triglycerides, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), AST, γ-glutamyl transpeptidase (γGTP), estimated glomerular filtration rate (eGFR), HbA1c, smoking, alcohol consumption, medications (for hypertension, diabetes, and dyslipidemia), history of stroke, heart disease, and kidney failure, weight gain (more than 10 kg since age 20), exercise (more than 30 minutes per session, more than 2 days per week), walking (more than 1 hour per day), walking speed, eating speed, dinner within 2 hours before bed, skipping breakfast, late-night snacks, and sleep status.
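The correlation-based variable screening can be sketched as below. The greedy keep-first strategy (retaining the first variable of each correlated group) is an assumption about how one variable per group was selected; the paper states only that a single variable from each highly correlated set was kept.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.75) -> list:
    """Return the column names to keep after removing, from each highly
    correlated pair (|r| > threshold), all but the first variable."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    drop = set()
    for i, a in enumerate(cols):
        if a in drop:
            continue
        for b in cols[i + 1:]:
            if b not in drop and corr.loc[a, b] > threshold:
                drop.add(b)
    return [c for c in cols if c not in drop]
```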

The values of each item in the training data set were compared between the surviving and deceased groups using the chi-square test, Student's t-test, and the Mann–Whitney U test; significant differences (P < 0.05) are marked with an asterisk (Supplementary Tables S1 and S2).
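These group comparisons can be illustrated with SciPy; the values below are synthetic stand-ins for the study data, not results from it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic continuous variable (e.g. systolic blood pressure) per group:
alive = rng.normal(120, 15, size=200)
dead = rng.normal(130, 15, size=50)

t_stat, p_t = stats.ttest_ind(alive, dead)     # Student's t-test
u_stat, p_u = stats.mannwhitneyu(alive, dead)  # Mann-Whitney U test

# For a categorical variable (e.g. sex), a chi-square test is applied
# to the group-by-category contingency table instead:
table = np.array([[90, 110], [30, 20]])
chi2, p_c, dof, expected = stats.chi2_contingency(table)
```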

Prediction models

We used two machine learning-based methods (gradient-boosted decision tree (GBDT) and neural network) and a conventional method (logistic regression) to build the prediction models. All models were created using Python 3.7. We used the XGBoost library for the GBDT, TensorFlow for the neural network, and Scikit-learn for logistic regression.
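The three model families can be sketched as follows. To keep the example self-contained, scikit-learn stand-ins (GradientBoostingClassifier, MLPClassifier) replace the XGBoost and TensorFlow implementations actually used in the study; all hyperparameters shown are illustrative, not the study's tuned values.

```python
from sklearn.ensemble import GradientBoostingClassifier   # GBDT stand-in for XGBoost
from sklearn.neural_network import MLPClassifier          # stand-in for the TensorFlow NN
from sklearn.linear_model import LogisticRegression

# Illustrative hyperparameters only; the study tuned these by randomized search.
models = {
    "gbdt": GradientBoostingClassifier(n_estimators=100, max_depth=3),
    "nn": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
    "logreg": LogisticRegression(max_iter=1000),
}
```

All three expose the same fit/predict_proba interface, which is what allows a single evaluation pipeline to compare them.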

Missing value imputation

The data obtained in this study contained missing values. XGBoost can, by design, be trained and make predictions even with missing values; however, the neural network and logistic regression models cannot. Therefore, we imputed the missing values using the k-nearest neighbor method (k = 5); the test data were imputed using an imputer fitted only on the training data.
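A minimal sketch of this leakage-free imputation with scikit-learn's KNNImputer, on synthetic data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# k-nearest-neighbor imputation (k = 5), fitted on the training data
# only so that no information from the test set leaks into imputation.
imputer = KNNImputer(n_neighbors=5)

X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0],
                    [4.0, 8.0], [5.0, 10.0], [6.0, 12.0]])
X_test = np.array([[2.5, np.nan]])

X_train_imp = imputer.fit_transform(X_train)  # fit on training data only
X_test_imp = imputer.transform(X_test)        # reuse the training fit
```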

Hyperparameter settings

The hyperparameters required for each model were determined on the training data using the RandomizedSearchCV class of the Scikit-learn library, with 5000 iterations of randomized search, each evaluated by five-fold cross-validation.
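This search can be sketched as below; the search space and base estimator are illustrative assumptions, and `n_iter` is reduced from the study's 5000 to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space; the study's actual grids are not given here.
param_dist = {"n_estimators": randint(50, 200), "max_depth": randint(2, 6)}

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions=param_dist,
    n_iter=5,          # the study used 5000 iterations
    cv=5,              # five-fold cross-validation
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```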

Performance evaluation

The performance of each prediction model was evaluated on the test data set by plotting the receiver operating characteristic (ROC) curve and computing the area under the curve (AUC). In addition, accuracy, precision, recall, the F1 score (the harmonic mean of precision and recall), and the confusion matrix were calculated for each model. To assess the importance of the explanatory variables, we used SHAP (SHapley Additive exPlanations) and obtained SHAP values, which express the influence of each explanatory variable on the model output4,12. The workflow diagram of this study is shown in Fig. 5.
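The metrics named above can be computed with scikit-learn as sketched below, using synthetic predictions in place of the study's model outputs (SHAP is omitted to keep the example self-contained).

```python
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 0, 0, 1, 1, 1, 0, 1]                     # 1 = died within 5 years
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.3, 0.9]    # predicted mortality risk
y_pred = [int(p >= 0.5) for p in y_prob]              # threshold at 0.5

auc = roc_auc_score(y_true, y_prob)    # computed from scores, not labels
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
```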

Figure 5. Workflow diagram of developing and evaluating the performance of predictive models.
