Coping with imbalanced data problem in digital mapping of soil classes

Authors:	Amin Sharififar, Fereydoon Sarmadian

Institution:	1. The James Hutton Institute, Aberdeen, UK; 2. Department of Soil Science, School of Agricultural Engineering and Technology, University of Tehran, Karaj, Iran. Contribution: Writing - review & editing, Resources

Abstract:	An unsolved problem in the digital mapping of categorical soil variables and soil types is the imbalanced number of observations among classes, which reduces accuracy and can cause the minority class (the class with significantly fewer observations than the other classes) to be lost from the final map. So far, synthetic over- and under-sampling techniques have been explored in soil science; however, more efficient approaches are needed that avoid the drawbacks of these techniques and guarantee retention of the minority classes in the produced map. The approaches suggested in the present study for digital mapping of soil classes are ensemble gradient boosting, cost-sensitive learning, and one-class classification (OCC) of the minority class combined with multi-class classification. Specifically, extreme gradient boosting (XGB) as an ensemble gradient learner, a cost-sensitive decision tree (CSDT) within the C5.0 algorithm, and a one-class support vector machine combined with multi-class classification (OCCM) were investigated to map eight soil great groups with a naturally imbalanced frequency of observations in northwest Iran. A total of 453 profile data points were used for mapping the soil great groups of the study area. The data were split manually for each class separately, which resulted in an overall 70% of the data for calibration and 30% for validation. Each model was calibrated with a bootstrapping approach (10 runs) to produce multiple maps, and the 10 bootstrap runs were evaluated against the hold-out validation dataset. The average values of the accuracy measures, including Kappa (K), overall accuracy (OA), producer's accuracy (PA) and user's accuracy (UA), were examined. In addition, the results of this study were compared with a previous study in the same area, in which resampling techniques were used to deal with imbalanced data for digital soil class mapping.
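The per-class 70/30 split and the 10-run bootstrap calibration described above can be sketched as follows. This is only an illustration under stated assumptions: the study split the data manually, whereas the sketch shuffles randomly within each class; the function names `per_class_split` and `bootstrap_samples` and the toy labels are hypothetical and not taken from the study; model fitting is not shown.

```python
import random
from collections import defaultdict


def per_class_split(labels, calib_fraction=0.7, seed=0):
    """Split sample indices class by class into calibration and validation sets.

    Assumption: a random shuffle within each class stands in for the
    manual per-class split described in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    calib, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_calib = round(calib_fraction * len(idxs))
        if len(idxs) > 1:
            # Keep at least one observation of the class on each side,
            # so a minority class is never missing from either set.
            n_calib = min(max(n_calib, 1), len(idxs) - 1)
        calib.extend(idxs[:n_calib])
        valid.extend(idxs[n_calib:])
    return sorted(calib), sorted(valid)


def bootstrap_samples(indices, n_runs=10, seed=0):
    """Draw n_runs resamples (with replacement) of the calibration indices."""
    rng = random.Random(seed)
    return [[rng.choice(indices) for _ in indices] for _ in range(n_runs)]


# Toy, deliberately imbalanced labels (invented for illustration only).
labels = ["A"] * 20 + ["B"] * 10 + ["C"] * 3
calib, valid = per_class_split(labels)
runs = bootstrap_samples(calib, n_runs=10)
```

Each list in `runs` would be used to calibrate one model, and every calibrated model would then be scored on the same `valid` indices, which are never resampled.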
The findings show that all three suggested methods can deal well with the imbalanced classification problem, with OCCM showing the highest K (= 0.76) and OA (= 82%) in the validation stage. This model can also guarantee the retention of the minority classes in the final map. Comparison with the previous study demonstrates that the three newly suggested methods can markedly increase both overall and individual class accuracy for mapping.
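For readers unfamiliar with the accuracy measures reported above, the sketch below shows one common way to derive K, OA, PA and UA from a confusion matrix. It assumes rows hold the reference (observed) classes and columns hold the mapped (predicted) classes; the function name `accuracy_measures` and the example matrix are hypothetical and do not reproduce the study's data or results.

```python
def accuracy_measures(matrix):
    """Compute OA, Kappa, PA and UA from a square confusion matrix.

    Assumption: matrix[i][j] counts validation samples whose reference
    class is i and whose mapped (predicted) class is j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(k)]
    row_tot = [sum(matrix[i]) for i in range(k)]                        # reference totals
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]   # mapped totals

    oa = sum(diag) / n                                            # observed agreement
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2    # chance agreement
    kappa = (oa - pe) / (1 - pe)

    # Producer's accuracy: share of each reference class that was mapped correctly.
    pa = [d / r if r else float("nan") for d, r in zip(diag, row_tot)]
    # User's accuracy: share of each mapped class that is correct.
    ua = [d / c if c else float("nan") for d, c in zip(diag, col_tot)]
    return {"OA": oa, "K": kappa, "PA": pa, "UA": ua}


# Invented three-class matrix; the third class plays the minority role.
example = [
    [40, 5, 0],
    [4, 30, 1],
    [1, 2, 7],
]
measures = accuracy_measures(example)
```

A minority class that disappears from the map shows up here as an all-zero column, which is why per-class PA and UA are worth reporting alongside the overall K and OA.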

Keywords:	class imbalance; cost-sensitive decision tree; digital soil mapping; extreme gradient boosting; imbalanced multi-class classification; machine learning; one-class classification; support vector machine
