Hugo Broucke & Christelle Algans - Data scientists Kaduceo

Pulmonary diseases are among the conditions that generate the most readmissions following a first hospitalization, and consequently cause additional expenses for social security. While some of these readmissions are scheduled and necessary, others are unexpected and potentially preventable. Identifying patients at high risk of readmission for a pulmonary disease would therefore make it possible to improve their follow-up after admission, minimize their risk of readmission and generate significant savings. At Kaduceo, we seek to identify the profile of these patients by working with French PMSI data from 2010 to 2019 from one of our partner hospitals.

Several studies focusing on predicting 30-day readmissions for patients with chronic obstructive pulmonary disease have recently been carried out. These studies tackle the problem by studying clinical characteristics (age, sex, body mass index, etc.), comorbidities (Charlson Comorbidity Index [1]), administrative data (duration of stay, mode of arrival, etc.), medical procedures (surgical procedures) or even the patient’s socio-economic factors (employment, number of children, marital status, etc.) [2-4]. Some studies use Natural Language Processing to extract patterns from clinical notes associated with a greater risk of readmission [5, 6]. Despite the diversity of Machine Learning and Deep Learning techniques used, the predictive scores remain relatively low (c-statistic [0.6-0.9]) regardless of the disease studied [7-30]. Although a few tools are starting to emerge (such as the Hospital Internal Medicine Readmission Prediction Score proposed by A. Zapatero [31]), there is currently no reliable model for predicting the readmission of a patient.

At Kaduceo, we have been working since 2019 on predicting readmissions (Artificial Intelligence Model for predicting emergency room activity). In this study, we use medico-social data from the Medicalized Information System Program (PMSI). To take a step back from existing work, we also take into account features that are not customarily used for predicting readmissions, such as features related to the amount of pollutants in the air.


We excluded children as well as pregnant women from our cohort. We believe that children would bias our study, since this population tends to be readmitted more often and for multiple causes (caused by parents’ heightened concern). Likewise, pregnant women tend to be readmitted through the emergency department just before childbirth.

In order to identify our target variable, we first carried out a study on 30-day readmissions. This is the most widely used time window in the scientific literature, and this indicator is commonly used in the United States to gauge the quality of care provided by a hospital. In a second step, we used Major Diagnostic Category (CMD) 4, “Disorders of the respiratory system”, taken from the Homogeneous Patient Group (GHM) of a patient’s stay, as the target disease class. Studying one type of illness rather than all kinds of diseases allows us to identify the pathologies to be predicted. Depending on the results obtained, we will refine our research by predicting a specific type of disease.
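The target described above can be illustrated with a minimal sketch. It assumes that a stay is flagged when the same patient is admitted again within 30 days of discharge with CMD 04; the record layout (keys patient_id, admission, discharge, cmd) and the handling of same-day admissions are hypothetical choices made for the example, not details taken from the study.

```python
from datetime import date

RESPIRATORY_CMD = "04"  # Major Diagnostic Category 4 (assumed string encoding)
WINDOW_DAYS = 30        # readmission window used in the article


def label_pre_readmissions(stays):
    """Return one boolean per stay: True if the same patient is admitted
    again within WINDOW_DAYS of discharge with a respiratory CMD.

    `stays` is a list of dicts with hypothetical keys:
    patient_id, admission (date), discharge (date), cmd (str).
    """
    # Group stay indices by patient.
    by_patient = {}
    for idx, stay in enumerate(stays):
        by_patient.setdefault(stay["patient_id"], []).append(idx)

    labels = [False] * len(stays)
    for indices in by_patient.values():
        # Process each patient's stays in chronological order.
        indices.sort(key=lambda i: stays[i]["admission"])
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                gap = (stays[j]["admission"] - stays[i]["discharge"]).days
                if gap > WINDOW_DAYS:
                    break  # later stays are even further away
                if gap >= 0 and stays[j]["cmd"] == RESPIRATORY_CMD:
                    labels[i] = True
                    break
    return labels


stays = [
    {"patient_id": 1, "admission": date(2015, 1, 1), "discharge": date(2015, 1, 5), "cmd": "05"},
    {"patient_id": 1, "admission": date(2015, 1, 20), "discharge": date(2015, 1, 25), "cmd": "04"},
    {"patient_id": 2, "admission": date(2015, 3, 1), "discharge": date(2015, 3, 3), "cmd": "04"},
    {"patient_id": 2, "admission": date(2015, 6, 1), "discharge": date(2015, 6, 4), "cmd": "04"},
]
labels = label_pre_readmissions(stays)
print(labels)
```

In this toy example only the first stay of patient 1 is flagged, since it is followed 15 days after discharge by a respiratory stay; the second stay of patient 2 occurs 90 days after the previous discharge and therefore does not count.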

Our study focuses on a cohort of more than 10,000 adults between 2010 and 2019.

Of these records, 1,054 are identified as 30-day pre-readmission records for patients with a lung condition.

To make these predictions, we worked with medico-administrative variables from the PMSI (age, sex, GHM complexity, whether the patient came from the emergency department, hospital services visited by the patient during the last 6 months, number of visits by the patient over the last 6 months…). We also added variables linked to seasonality, to the geographical context in which the patient lives (population density of the patient’s city, social deprivation index [32], distance between the patient’s home and the hospital) and to comorbidities, as well as data on the amount of pollutants in the air.
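To make one of these variables concrete, here is a minimal sketch of how the number of visits over the last 6 months could be derived from a patient’s admission dates. Approximating "6 months" as 183 days and excluding the current admission from the count are assumptions made for the example; the function name is hypothetical.

```python
from datetime import date, timedelta

LOOKBACK = timedelta(days=183)  # "6 months" approximated as 183 days (assumption)


def visits_in_lookback(admission_dates, index_date):
    """Count admissions that fall in the LOOKBACK window strictly before
    index_date (the admission date of the stay being described)."""
    start = index_date - LOOKBACK
    return sum(1 for d in admission_dates if start <= d < index_date)


history = [
    date(2017, 11, 1),  # older than the window, ignored
    date(2018, 1, 10),
    date(2018, 5, 2),
    date(2018, 6, 20),
    date(2018, 7, 1),   # the current stay itself, ignored
]
n_visits = visits_in_lookback(history, date(2018, 7, 1))
print(n_visits)
```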

We used clustering techniques to identify groups of patients for which the readmission rate is greater than the overall rate in the initial dataset. This allowed us to identify groups of patients with a higher likelihood of being readmitted.
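The selection step can be sketched as follows, assuming cluster labels have already been produced by some clustering algorithm (the article does not say which one). The function name and the toy data are hypothetical; the sketch only shows how clusters whose readmission rate exceeds the overall rate could be picked out.

```python
from collections import defaultdict


def high_risk_clusters(cluster_ids, readmitted):
    """Return (clusters, overall_rate), where `clusters` maps each cluster id
    whose readmission rate exceeds the overall rate to that rate.

    cluster_ids: one cluster label per record (from any clustering method).
    readmitted:  one 0/1 flag per record.
    """
    if len(cluster_ids) != len(readmitted) or not readmitted:
        raise ValueError("inputs must be non-empty and of equal length")

    overall_rate = sum(readmitted) / len(readmitted)

    # counts[c] = [number of readmitted records, number of records]
    counts = defaultdict(lambda: [0, 0])
    for cluster, flag in zip(cluster_ids, readmitted):
        counts[cluster][0] += int(flag)
        counts[cluster][1] += 1

    rates = {c: pos / total for c, (pos, total) in counts.items()}
    return {c: r for c, r in rates.items() if r > overall_rate}, overall_rate


cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
readmitted = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
clusters, overall_rate = high_risk_clusters(cluster_ids, readmitted)
print(overall_rate, clusters)
```

Membership in one of the returned clusters can then be encoded as an additional feature for the classifiers.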

Finally, to overcome the problem of class imbalance (only 3% of patients are readmitted), we tried several sampling techniques (SMOTE, oversampling and undersampling). We chose undersampling, as it produced the best results.
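As an illustration of undersampling, the sketch below randomly drops records from the majority class until both classes have the same size. Random selection and the 1:1 target ratio are assumptions made for the example; the article does not specify which undersampling variant or ratio was used.

```python
import random


def undersample(rows, labels, seed=0):
    """Randomly undersample the majority class to a 1:1 ratio.

    rows:   list of feature records.
    labels: list of 0/1 class labels, aligned with rows.
    Returns (rows, labels) for the balanced subset, in shuffled order.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (
        (positives, negatives) if len(positives) <= len(negatives) else (negatives, positives)
    )

    # Keep every minority record and an equally sized random majority subset.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with roughly the imbalance mentioned in the article (3% positives).
labels = [1] * 3 + [0] * 97
rows = [{"record": i} for i in range(len(labels))]
balanced_rows, balanced_labels = undersample(rows, labels)
print(len(balanced_rows), sum(balanced_labels))
```

When combined with cross-validation, resampling is normally applied to the training folds only, so that each validation fold keeps the original class distribution.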

For complete understanding of our process, the different stages of the project are detailed in the figure below.

Results and future work

In order to make the predictions, we compared the results obtained with different Machine Learning algorithms used for classification: Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Stochastic Gradient Descent Classifier (SGDC).

After optimizing the hyper-parameters of each algorithm and using 10-fold cross-validation (k = 10) to ensure the robustness of the results, the average c-statistic across the different algorithms is 0.89 [0.893 – 0.898]. Figure 2 shows the ROC curves obtained for each algorithm.
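The comparison protocol can be sketched with scikit-learn, assuming it is available. The data below is synthetic, the hyper-parameters are library defaults rather than tuned values, and the feature scaling step is an added assumption, so the scores it prints bear no relation to the results reported here. The c-statistic corresponds to the "roc_auc" scorer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, balanced stand-in for the (undersampled) study data.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)

# The four classifier families named in the article, with default settings.
models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "SGDC": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
}

# 10-fold stratified cross-validation, scored with the c-statistic (ROC AUC).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in auc.items():
    print(f"{name}: mean c-statistic = {score:.3f}")
```

Using the same StratifiedKFold object for every model means all four algorithms are evaluated on identical folds, which keeps the comparison fair.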

We obtain an average sensitivity and specificity of 0.82, which means that the different models manage to correctly classify both patients belonging to the pre-readmitted group and patients belonging to the non-pre-readmitted group. Among the variables that best explain the model, we find:

  • Number of hospital visits 6 months before patient’s pre-readmission

  • Whether or not the patient visited a pulmonology service in the previous 6 months

  • “cancer” and “chronic pulmonary pathology” comorbidities

  • Features created from the clustering step
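For readers less familiar with the two metrics reported above, sensitivity is the share of pre-readmitted patients that a model flags correctly, and specificity is the share of non-pre-readmitted patients that it correctly leaves unflagged. A minimal sketch on made-up predictions (the values below are illustrative, not study results):

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) from aligned 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # readmitted, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # readmitted, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not readmitted, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not readmitted, flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```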

These results are promising and confirm our intention to create a tool that could help health professionals identify patients at high risk of readmission. Following this first study, we are continuing our research: the analysis of the care pathway and of the medical care provided are the first avenues to explore to improve predictive scores. We will also expand our research using PMSI data from other partner hospitals to make our work generalizable.



  1. Charlson, M.E., et al., A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis, 1987. 40(5): p. 373-83.
  2. Goto, T., et al., Machine Learning-Based Prediction Models for 30-Day Readmission after Hospitalization for Chronic Obstructive Pulmonary Disease. COPD: Journal of Chronic Obstructive Pulmonary Disease, 2019. 16(5-6): p. 338-343.
  3. Lee, S., et al., Reducing COPD Readmissions: A Causal Bayesian Network Model. IEEE Robotics and Automation Letters, 2018. 3(4): p. 4046-4053.
  4. Min, X., B. Yu, and F. Wang, Predictive Modeling of the Hospital Readmission Risk from Patients’ Claims Data Using Machine Learning: A Case Study on COPD. Scientific Reports, 2019. 9(1).
  5. Agarwal, A., et al., A Natural Language Processing Framework for Assessing Hospital Readmissions for Patients with COPD. IEEE Journal of Biomedical and Health Informatics, 2018. 22(2): p. 588-596.
  6. Jain, P., A. Agarwal, and R. Behara. An approach to supervised classification of highly imbalanced and high dimensionality COPD readmission data on HPCC. 2019. Institute of Electrical and Electronics Engineers Inc.
  7. Almardini, M. and Z.W. Raś, A supervised model for predicting the risk of mortality and hospital readmissions for newly admitted patients, M. Kryszkiewicz, et al., Editors. 2017, Springer Verlag. p. 29-36.
  8. Baig, M.M., et al. Machine Learning-based Risk of Hospital Readmissions: Predicting Acute Readmissions within 30 Days of Discharge. 2019. Institute of Electrical and Electronics Engineers Inc.
  9. Barbieri, S., et al., Benchmarking Deep Learning Architectures for Predicting Readmission to the ICU and Describing Patients-at-Risk. Scientific Reports, 2020. 10(1).
  10. Ben-Assuli, O. and R. Padman, Analysing repeated hospital readmissions using data mining techniques. Health Systems, 2018. 7(3): p. 166-180.
  11. Brindise, L.R. and R.J. Steele. Machine learning-based pre-discharge prediction of hospital readmission. 2018. Institute of Electrical and Electronics Engineers Inc.
  12. Eckert, C., et al., Development and Prospective Validation of a Machine Learning-Based Risk of Readmission Model in a Large Military Hospital. Applied Clinical Informatics, 2019. 10(2): p. 316-325.
  13. Eggerth, A., et al., Utilising Information of the Case Fee Catalogue to Enhance 30-Day Readmission Prediction in the German DRG System, in Studies in Health Technology and Informatics, K. Fister, et al., Editors. 2018, IOS Press. p. 40-44.
  14. Garcia-Arce, A., F. Rico, and J.L. Zayas-Castro, Comparison of Machine Learning Algorithms for the Prediction of Preventable Hospital Readmissions. Journal for Healthcare Quality, 2018. 40(3): p. 129-138.
  15. Golmohammadi, D. and N. Radnia, Prediction modeling and pattern recognition for patient readmission. International Journal of Production Economics, 2016. 171: p. 151-161.
  16. Grzyb, M., et al. Multi-task cox proportional hazard model for predicting risk of unplanned hospital readmission. 2017. Institute of Electrical and Electronics Engineers Inc.
  17. Hilton, C.B., et al., Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence. NPJ Digit Med, 2020. 3: p. 51.
  18. Jamei, M., et al., Predicting all-cause risk of 30-day hospital readmission using artificial neural networks. PLoS ONE, 2017. 12(7).
  19. Jones, C.D., et al., Predicting Hospital Readmissions from Home Healthcare in Medicare Beneficiaries. Journal of the American Geriatrics Society, 2019. 67(12): p. 2505-2510.
  20. Kulkarni, P., L.D. Smith, and K.F. Woeltje, Assessing risk of hospital readmissions for improving medical practice. Health Care Management Science, 2016. 19(3): p. 291-299.
  21. Lin, Y.W., et al., Analysis and prediction of unplanned intensive care unit readmission using recurrent neural networks with long short-term memory. PLoS ONE, 2019. 14(7).
  22. Liu, W., et al., Predicting 30-day hospital readmissions using artificial neural networks with medical code embedding. PLoS One, 2020. 15(4): p. e0221606.
  23. Pakbin, A., et al. Prediction of ICU Readmissions Using Data at Patient Discharge. 2018. Institute of Electrical and Electronics Engineers Inc.
  24. Radovanović, S., et al. Framework for integration of domain knowledge into logistic regression. 2018. Association for Computing Machinery.
  25. Rajkomar, A., et al., Scalable and accurate deep learning with electronic health records. NPJ Digit Med, 2018. 1: p. 18.
  26. Sushmita, S., et al. Predicting 30-day risk and cost of “all-cause” hospital readmissions. 2016. AI Access Foundation.
  27. Venugopalan, J., et al. Combination of static and temporal data analysis to predict mortality and readmission in the intensive care. 2017. Institute of Electrical and Electronics Engineers Inc.
  28. Wang, H., et al., Predicting Hospital Readmission via Cost-Sensitive Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. 15(6): p. 1968-1978.
  29. Wang, L., et al., The application of unsupervised deep learning in predictive models using electronic health records. BMC Medical Research Methodology, 2020. 20(1).
  30. Yang, C., et al. Predicting 30-day all-cause readmissions from hospital inpatient discharge data. 2016. Institute of Electrical and Electronics Engineers Inc.
  31. Zapatero, A., et al., Predictive model of readmission to internal medicine wards. Eur J Intern Med, 2012. 23(5): p. 451-6.
  32. INSERM. Indice de défavorisation sociale (Fdep) par IRIS. 1 April 2019; Available from