Improving Age Prediction with an Ensemble-based Model: Introducing eClock

eClock is an ensemble-based method that accurately predicts age from DNA methylation data when the age distribution of the samples is biased.

Consider, for example, training a model of gestational age from placental methylation. Samples can only be taken after the delivery of both the baby and the placenta, so most samples are older than 30 weeks and correspond to full-term or moderately preterm births. Samples from earlier gestational ages are rare, which means that the distribution of samples is heavily skewed towards later gestational ages. This makes it difficult for the model to predict earlier gestational ages accurately. Yet even small differences in gestational age can have a significant impact on neonatal mortality, morbidity, and long-term outcomes, so the accuracy of the model across the entire gestational range is essential.

We developed eClock, an R package, to address this problem. It improves on the traditional machine-learning strategy for handling class imbalance in categorical data [24]: bagging and SMOTE (Synthetic Minority Over-sampling Technique), combined with ensemble models, are used to correct for the biased age distribution. This is the first application of these techniques to a clock model, and it establishes a new framework for clock model construction. eClock also provides additional functions, such as training a traditional clock model, displaying the features, and converting methylation values for probes, genes, or DMRs (differentially methylated regions). We tested eClock on three different datasets, and the results showed that it can improve a clock model's performance on rare samples.
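
To make the idea of combining bagging, SMOTE-style oversampling, and an ensemble more concrete, the following is a minimal Python sketch. It is not eClock's implementation or API (eClock is an R package). Every function name, the age-bin width, the single methylation feature, and the simple linear model are assumptions chosen for illustration. The oversampling step is also simplified: it interpolates between two random samples from the same age bin rather than between nearest neighbours, as SMOTE proper does.

```python
import random
from statistics import mean

def smote_like(members, n_new, rng):
    """Simplified SMOTE-style step (assumption): interpolate between two
    random samples of the same age bin; real SMOTE uses nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        (x1, a1), (x2, a2) = rng.choice(members), rng.choice(members)
        t = rng.random()
        synthetic.append((x1 + t * (x2 - x1), a1 + t * (a2 - a1)))
    return synthetic

def balanced_bag(pairs, bin_width, rng):
    """Build one training bag in which every age bin has the same size."""
    bins = {}
    for x, age in pairs:
        bins.setdefault(int(age // bin_width), []).append((x, age))
    target = max(len(members) for members in bins.values())
    bag = []
    for members in bins.values():
        # Bagging: bootstrap the real samples of this bin.
        bag += [rng.choice(members) for _ in members]
        # Oversampling: top up rare bins with synthetic samples.
        bag += smote_like(members, target - len(members), rng)
    return bag

def fit_line(pairs):
    """Ordinary least squares for age = slope * x + intercept."""
    mx = mean(x for x, _ in pairs)
    my = mean(a for _, a in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sum((x - mx) * (a - my) for x, a in pairs) / sxx
    return slope, my - slope * mx

def train_ensemble(pairs, n_models=25, bin_width=5.0, seed=0):
    """Fit one simple model per balanced bag."""
    rng = random.Random(seed)
    return [fit_line(balanced_bag(pairs, bin_width, rng)) for _ in range(n_models)]

def predict(models, x):
    """Ensemble prediction: average the predictions of all member models."""
    return mean(slope * x + intercept for slope, intercept in models)
```

Here `pairs` is a list of hypothetical `(methylation_value, age)` tuples. Each member model sees a bag in which rare age ranges carry as much weight as common ones, and averaging across bags reduces the variance introduced by resampling and by the synthetic samples.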