Breast Cancer Risk Prediction with Stochastic Gradient Boosting
Mehmet Kivrak^{1*}
^{1}Department of Biostatistics and Medical Informatics, Faculty of Medicine, Recep Tayyip Erdogan University, Rize, Turkey.
Abstract
Breast cancer, an important public health problem worldwide, is one of the deadliest cancers in women. This study aims to classify open-access breast cancer data and identify important risk factors with the Stochastic Gradient Boosting method. An open-access breast cancer dataset was used to construct the classification model, and Stochastic Gradient Boosting was used to classify the disease. Balanced accuracy, accuracy, sensitivity, specificity, and positive/negative predictive values were evaluated for model performance. The accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score obtained with the Stochastic Gradient Boosting model were each 100 %. In addition, according to the variable importance obtained, the most important risk factors for breast cancer were concave points_mean, area_worst, perimeter_worst, and concave points_worst, respectively. According to the study results, the Stochastic Gradient Boosting machine-learning model classified patients with and without breast cancer with high accuracy, and the importance of the variables related to cancer status was determined. Factors with high variable importance can be considered potential risk factors associated with cancer status and can play an essential role in disease diagnosis.
Keywords: Breast cancer, Machine learning, Ensemble learning, Stochastic gradient boosting
Breast cancer, a significant public health concern globally, is one of the deadliest cancers in women.^{[1]} In such cancers especially, early diagnosis is very important and may affect the course of the disease. There are many traditional methods for early detection of breast cancer. First, a physical examination is performed after taking the patient's medical history. Afterward, ductoscopy (examination of the milk ducts by inserting very thin fiber-optic systems through the opening of the duct at the nipple) together with imaging methods such as mammography or breast ultrasound may be requested. Additional examinations such as ductography (or galactography, imaging with contrast material introduced through the nipple) and magnetic resonance imaging (MRI) may also be requested.^{[2]}
Data mining methods, which are also defined as the process of knowledge discovery from large amounts of data, enable one to make predictions and interpretations about the future by revealing hidden relationship structures.^{[3]} While machine-learning methods that can work based on association rules, classification, and regression perform data-based learning in the training phase, they aim to make predictions about new data in the testing and validation phases.^{[4]}
This study aims to classify patients with and without breast cancer using the Stochastic Gradient Boosting (SGB) method. In addition, it aims to determine the risk factors related to breast cancer and to rank the importance of the cancer-related variables.
The public dataset "UCI (Machine Learning Repository) Data Set" was obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 to classify the presence or absence of breast cancer via the SGB method in the study.^{[5]} Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. Attribute information comprises the id number, the diagnosis (M = malignant, B = benign), and 30 real-valued features computed for each cell nucleus.
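The same Wisconsin Diagnostic Breast Cancer data is bundled with scikit-learn, so the dataset described above can be loaded without a manual download. A minimal Python sketch (not part of the original analysis; note that in scikit-learn's copy the target is coded 0 = malignant, 1 = benign):

```python
# Load the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.
# scikit-learn ships a copy of the same UCI data used in the study.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target            # 569 samples, 30 real-valued features

print(X.shape)                           # (569, 30)
print(data.feature_names[:3])            # first features, e.g. 'mean radius', ...
```

The 569 samples split into 357 benign and 212 malignant cases, matching the counts reported later in the study.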
The boosting algorithm, designed as a meta-classifier, is an ensemble learning method that can also make predictions.^{[6]} Stochastic Gradient Boosting (SGB) is a data-processing approach introduced by Friedman.^{[7]} SGB is a crucial technique used for forecasting and classification tasks, and its predictive performance can be tuned through the application of preprocessing procedures. XGB, SGB, and Lasso methods are all widely and successfully used techniques in breast cancer/mammography research. They are also widely used approaches for selecting critical predictive variables in health and medical informatics applications. XGB or SGB can effectively identify important variables when the variables have nonlinear and/or high-dimensional interactions.^{[8]} SGB was implemented in R with the Generalized Boosted Regression Models (gbm) package.^{[9]} The hyperparameters of the SGB classifier are n.trees, shrinkage, and n.minobsinnode.
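The gbm hyperparameters named above have direct analogues in other SGB implementations. As an illustration only (not the study's original R code, and with arbitrary rather than tuned hyperparameter values), scikit-learn's `GradientBoostingClassifier` becomes *stochastic* gradient boosting when `subsample` is set below 1:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# gbm -> scikit-learn hyperparameter analogues (illustrative values):
#   n.trees        -> n_estimators
#   shrinkage      -> learning_rate
#   n.minobsinnode -> min_samples_leaf
# subsample < 1.0 makes the boosting stochastic: each tree is fit on a
# random fraction of the training rows.
sgb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 min_samples_leaf=10, subsample=0.5,
                                 random_state=42)
sgb.fit(X_tr, y_tr)
print(f"test accuracy: {sgb.score(X_te, y_te):.3f}")
```

The fitted model also exposes `feature_importances_`, which is the same kind of variable-importance ranking the study reports for gbm.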
The Henze-Zirkler test was used to assess the assumption of multivariate normality. The median (minimum–maximum) was used to summarize quantitative data, and numbers (percentages) were used to summarize qualitative variables. The Mann-Whitney U test was used to determine whether a significant difference between the target groups exists. The relationships between the variables were evaluated with the Spearman correlation coefficient. The model's fit was checked with the likelihood ratio test. A P-value <0.05 was regarded as significant. The IBM SPSS Statistics 26.0 program was employed in the analysis.
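The study's analyses were run in IBM SPSS; for readers without SPSS, the same group comparison and rank correlation can be sketched in Python with SciPy on the public copy of the data (illustrative only, not the original workflow):

```python
from scipy.stats import mannwhitneyu, spearmanr
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target   # scikit-learn coding: 0 = malignant, 1 = benign

# Mann-Whitney U test: does a feature's distribution differ between groups?
area_mean = X[:, list(data.feature_names).index("mean area")]
u_stat, p_value = mannwhitneyu(area_mean[y == 0], area_mean[y == 1])
print(f"mean area: U={u_stat:.0f}, p={p_value:.2e}")

# Spearman rank correlation between two features
rho, p_corr = spearmanr(X[:, 0], X[:, 3])   # mean radius vs. mean area
print(f"mean radius vs. mean area: rho={rho:.3f}, p={p_corr:.2e}")
```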
In the dataset used in the study, there are 357 (63 %) benign and 212 (37 %) malignant cases. Distributions of the variables are given (Figure 1).

Figure 1. Distributions of the Variables 
Descriptive statistics for the variables examined in this study are presented in Table 1. There is a significant difference between the diagnosis groups for the remaining variables (P<0.05), except for the radius_mean, fractal_dimension_mean, and smoothness_se variables.
Table 1. Descriptive Statistics for Variables

| Variables | Benign (n=357), Median (Min–Max) | Malignant (n=212), Median (Min–Max) | p |
|---|---|---|---|
| radius_mean | 21.1 (6.9–47.4) | 19.4 (13.0–44.9) | 0.108 |
| texture_mean | 19.2 (13.1–47.0) | 22.2 (14.2–44.9) | <0.001 |
| perimeter_mean | 78.1 (43.7–114.6) | 114.2 (71.9–188.5) | <0.001 |
| area_mean | 458.4 (143.5–992.1) | 932.0 (361.6–2501) | <0.001 |
| smoothness_mean | 0.09 (0.05–0.16) | 0.1 (0.07–0.14) | <0.001 |
| compactness_mean | 0.08 (0.02–0.22) | 0.13 (0.05–0.35) | <0.001 |
| concavity_mean | 0.04 (0.0–0.41) | 0.15 (0.02–0.43) | <0.001 |
| concave points_mean | 0.02 (0.0–0.09) | 0.09 (0.02–0.2) | <0.001 |
| symmetry_mean | 0.17 (0.11–0.27) | 0.19 (0.13–0.3) | <0.001 |
| fractal_dimension_mean | 0.06 (0.05–0.1) | 0.06 (0.05–0.1) | 0.612 |
| radius_se | 0.26 (0.11–0.88) | 0.54 (0.19–1.35) | <0.001 |
| texture_se | 1.14 (0.36–1.76) | 1.14 (0.36–1.67) | <0.001 |
| perimeter_se | 1.94 (0.76–4.56) | 3.84 (1.33–4.67) | <0.001 |
| area_se | 23.24 (6.8–47.1) | 60.01 (13.9–944.83) | <0.001 |
| smoothness_se | 0.01 (0.0–0.02) | 0.01 (0.0–0.03) | 0.749 |
| compactness_se | 0.02 (0.0–0.11) | 0.03 (0.01–0.14) | <0.001 |
| concavity_se | 0.02 (0.0–0.4) | 0.04 (0.01–0.14) | <0.001 |
| concave points_se | 0.01 (0.0–0.05) | 0.01 (0.01–0.04) | <0.001 |
| symmetry_se | 0.02 (0.01–0.06) | 0.02 (0.01–0.08) | 0.018 |
| fractal_dimension_se | 0.0 (0.0–0.03) | 0.0 (0.0–0.01) | 0.007 |
The correlation matrix for this study is shown in Figure 2. The matrix contains coefficients that range from −1 to 1. A value of 1 means that two variables, such as radius_mean and area_mean, are positively correlated with each other. A value of zero means the variables, such as radius_mean and fractal_dimension_se, are uncorrelated. A value of −1 means that two variables are perfectly negatively correlated; for radius_mean and fractal_dimension_mean the correlation is not −1 but about −0.3, yet the negative sign indicates a negative correlation. According to the matrix, there is no correlation, or only a very weak negative correlation, between radius_mean and concavity_mean. The relationship is nevertheless statistically significant, which may be due to the large sample size; although statistically significant, it may not be clinically significant (r=−0.0917, p=0.029). Similarly, there was no statistically significant correlation between the radius_mean and texture_mean variables (r=0.0113, p=0.789). However, there is a positive, weak, and statistically significant correlation between the texture_mean and texture_worst variables (r=0.2318, p<0.001). Other pairs of variables can be interpreted similarly.
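A Spearman correlation matrix like the one discussed above can be computed directly with pandas; a brief sketch on the scikit-learn copy of the data (illustrative, and its coefficients need not match the study's figure exactly):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
features = data.frame.drop(columns="target")

# Full 30x30 Spearman rank-correlation matrix; values lie in [-1, 1],
# with the diagonal identically 1 (each feature with itself).
corr = features.corr(method="spearman")
print(round(corr.loc["mean radius", "mean area"], 3))   # strongly positive
```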
