Kim (2012) investigated a diagnostic model based on SVM to predict breast cancer recurrence (named BCRSVM) and compared it with two other methods, i.e., a neural network and a regression model. Wang (2014) proposed combinations of SMOTE, PSO, and three popular classifiers, namely C5, logistic regression, and 1-NN, for predicting the 5-year survivability of breast cancer patients. SMOTE is an over-sampling method that creates new synthetic instances in the minority class to balance the dataset. Feature selection is also performed using the PSO algorithm. Their results indicate that the hybrid of SMOTE, PSO and C5 is the best framework among all possible combinations.
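The interpolation idea behind SMOTE can be illustrated with a minimal sketch. It is not the implementation used by Wang (2014); the function name `smote_sketch`, the parameter names, and the default of k = 5 neighbours are assumptions made for illustration only.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style over-sampling (hypothetical helper).

    Each synthetic instance lies on the line segment between a randomly
    chosen minority instance and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot use more neighbours than exist

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                     # seed instances
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])
```

Because every synthetic instance is a convex combination of two existing minority instances, the new points stay inside the region already occupied by the minority class.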
Batista and Monard (2003) proposed three imputation methods, namely hot-deck, mean and k-nearest neighbor, and compared them on four datasets. These approaches were evaluated using two learners, namely the C4.5 decision tree and CN2 (Clark and Niblett, 1989). Farhangfar et al. (2008) examined the effect of five imputation methods on six classifiers, i.e., C4.5, k-nearest neighbor, RIPPER (Cohen, 1995), Naïve Bayes, and SVM with RBF and polynomial kernels, on 15 datasets with missing ratios of 5% and 10–50% in 10% increments.
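The kind of experiment described in these studies (removing values at a chosen missing ratio, then imputing them) can be sketched as follows. This is only an illustration under simplifying assumptions: values are removed completely at random, all attributes are numeric, and the helper names `mask_at_random`, `mean_impute` and `knn_impute` are hypothetical rather than taken from the cited papers.

```python
import numpy as np

def mask_at_random(X, ratio, seed=0):
    """Return a copy of X with roughly `ratio` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    X_miss = np.asarray(X, dtype=float).copy()
    mask = rng.random(X_miss.shape) < ratio
    X_miss[mask] = np.nan
    return X_miss, mask

def mean_impute(X_miss):
    """Replace each missing cell with the mean of its column's observed values."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)

def knn_impute(X_miss, k=3):
    """Replace each missing cell with the mean of that attribute over the
    k nearest rows in which the attribute is observed."""
    X_out = X_miss.copy()
    observed = ~np.isnan(X_miss)
    col_means = np.nanmean(X_miss, axis=0)  # fallback when no neighbour exists
    for i in np.where(~observed.all(axis=1))[0]:
        # Distance uses only the attributes observed in both rows.
        shared = observed & observed[i]
        diff = np.where(shared, X_miss - X_miss[i], 0.0)
        n_shared = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
        dist[n_shared == 0] = np.inf  # no attribute in common
        dist[i] = np.inf              # a row is not its own neighbour
        for j in np.where(~observed[i])[0]:
            candidates = np.where(observed[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                X_out[i, j] = col_means[j]
            else:
                nearest = candidates[np.argsort(dist[candidates])[:k]]
                X_out[i, j] = X_miss[nearest, j].mean()
    return X_out

def rmse_on_masked(X_true, X_imputed, mask):
    """Root mean squared error over the cells that were removed."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imputed[mask]) ** 2)))
```

Calling `mask_at_random` with ratios such as 0.05, 0.10, ..., 0.50 and comparing `rmse_on_masked` across imputers mirrors the structure of the comparisons above; which imputer wins depends on the data.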
Jerez (2010) examined three statistical imputation methods, i.e., mean, hot-deck, and a hybrid of the two, and three machine learning (ML) methods, i.e., k-nearest neighbor, self-organizing maps (SOM) (Kohonen, 1995) and multi-layer perceptron (MLP) (Bishop et al., 2013), on breast cancer data. They also considered breast cancer recurrence prediction with a neural network as their final objective. The results of their work indicate that the ML methods perform better than the statistical algorithms. Dauwels et al. (2012) utilized tensor factorization (in particular, CP and normalized CP factorization) for imputation of missing data in medical questionnaires. They compared the approach with mean, k-nearest neighbor, and iterative local least squares (Cai et al., 2006) at missing ratios of 10%, 20% and 30%. The experimental results suggest that tensor imputation outperforms the other methods.
Aydilek and Arslan (2013) proposed a combined approach of optimized fuzzy c-means with support vector regression (Vapnik et al., 1996) and a genetic algorithm for imputation of missing values. They used the genetic algorithm to optimize the fuzzy c-means parameters, including the number of clusters and the weighting factor. The method was compared with three imputation methods, namely fuzzy c-means, SVR genetic (SvrGa) and zero imputation, at missing ratios of 1% and 5–25% in increments of 5%.
Although acceptable results have been obtained in studies on the prediction of breast cancer recurrence, these studies do not seek to improve recurrence prediction from the perspective of missing value imputation, and they are limited by their use of old statistical methods. Regarding previous work on missing data estimation, it should be noted that most methods fill in the missing data irrespective of the dependencies between attributes and the type of the incomplete attribute.
When facing missing values, classifiers often either remove the instance containing the missing value or impute it using one of various imputation methods. Based on our review of the literature, we categorize imputation models into four groups, which are summarized in Fig. 1. Although many other imputation methods fit in these categories, we mention only some instances. This paper also chooses representative methods from each group, both to evaluate the accuracy of the proposed method and to cover a set of new and well-known methods.
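To illustrate why removing incomplete instances can be costly, the following sketch estimates how many instances survive deletion. It assumes that each cell is missing independently with the same probability, which is a simplification and not a property of the data used in this paper; the function names are hypothetical.

```python
import numpy as np

def expected_complete_fraction(ratio, n_attributes):
    """Expected share of instances with no missing value when each cell is
    missing independently with probability `ratio`."""
    return (1.0 - ratio) ** n_attributes

def simulated_complete_fraction(ratio, n_rows, n_attributes, seed=0):
    """Empirical share of fully observed instances under the same assumption."""
    rng = np.random.default_rng(seed)
    missing = rng.random((n_rows, n_attributes)) < ratio
    return float((~missing.any(axis=1)).mean())
```

Under this assumption, a missing ratio of 10% over 10 attributes leaves only about 35% of the instances complete (0.9 raised to the power 10), which is one motivation for imputing rather than deleting.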
3. Materials and methods
The following is a brief description of each imputation method and each predictive model that is used in this paper. The imputation methods are representative methods from the three