A Comparison Study of Goodness of Fit Tests of Logistic Regression in R: Simulation and Application to Breast Cancer Data

Goodness of fit (GOF) tests for logistic regression assess how well the model fits the data. The null hypothesis of all GOF tests is that the model fits. R, as a free software package, offers many GOF tests spread over different packages. A Monte Carlo simulation was conducted to study two situations: first, the ability of each test, under its default settings, to accept the null hypothesis when the model is truly fitted; second, the power of these tests when the assumption that the linear combination of the explanatory variables is sufficient is violated (by omitting a linear covariate term, a quadratic term, or an interaction term). Moreover, we checked whether the same test implemented in different R packages gives the same results. As sample size is expected to affect simulation results, the pattern of change of the GOF test results was estimated under different sample sizes as well as different model settings. All tests accepted the null hypothesis (in more than 95% of simulation trials) when the model was truly fitted, except the modified Hosmer-Lemeshow test in the "LogisticDx" package under all model settings, and Osius and Rojek's (OsRo) test when the true model had an interaction term between binary and categorical covariates. In addition, the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares (CHCH) test gave unexpectedly different results under different packages. Concerning the power study, all tests had very low power when the departure was a missing covariate. Generally, Stukel's tests (package "LogisticDx") and the CHCH test (package "rms") reached a power greater than 80% in detecting a missing quadratic term at a lower sample size, while the OsRo test (package "LogisticDx") was better at detecting a missing interaction term. Besides the simulation study, we evaluated the performance of the GOF tests using a breast cancer dataset.


Introduction
Many scientific fields (other than statistics) use logistic regression (LR) modeling to extract information or to support or reject a specific theory or hypothesis. An important component of any modeling process is the assessment of the goodness of fit (GOF) of the model, which reflects whether the model describes the observed data accurately [1]. Non-statistician scientists tend to use the defaults of a statistical program to assess GOF. R is one of the programs whose use has increased dramatically in recent years, especially since it is free and statistical tests are continually coded into functions that non-statistician users can apply easily. Many GOF tests have already been coded in R packages, and sometimes the same test is coded in several packages. However, choosing which test from which package will best assess the fit of a specific dataset is not straightforward. The same test implemented in different R functions is supposed to give the same result, but no scientific work (to our knowledge) has checked this. Moreover, some tests do not reach a satisfactory power unless the sample exceeds a specific size, while other tests fail to accept a truly specified model. In addition, not all tests can detect all misspecifications; furthermore, a test may detect a specific misspecification under some settings and fail to detect the same misspecification under different settings.
This paper presents the GOF tests already coded in R and their packages. The function of each test was kept under its default settings as far as possible, because non-statistician users will most probably apply the function with its defaults. This study also compared the power of the same tests (from different packages) and examined whether they gave the same simulation results under different settings, as expected, or whether there was variation. In addition, the pattern of each test under ten different sample sizes (100, 200, …, 1000) was examined and compared in a simulation study (1000 simulation trials) from two aspects. The first is the ability (percentage) of a test to accept the null hypothesis when the true model is fitted; for simplicity, this concept is abbreviated as "The Null". The second aspect is the ability (percentage) of a test to reject the null hypothesis when a false model is fitted, i.e., "The Power". Both aspects were studied under three misspecifications: A - omission of a covariate term; B - omission of a quadratic term; C - omission of an interaction term.
After that, the best test for each misspecification was chosen, and finally an application to breast cancer data is presented. In this work, the best test means the test with a power greater than 80% and a null greater than 95% at the lowest sample size.
This paper is organized as follows: Section 2 presents an overview of the binary LR model, Section 3 reviews its GOF tests, Section 4 describes the settings chosen for the simulation study, Section 5 clarifies its results, and finally the applied statistical analysis and the conclusion appear in Sections 6 and 7, respectively.

Binary LR model
In any regression problem, the key quantity is the mean value of the outcome variable given the value of the independent variable. This quantity is called the conditional mean and is expressed as E(Y|x), where Y denotes the outcome variable and x denotes a specific value of the independent variable [2]. To simplify notation, we use the quantity π(x) = E(Y|x) to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the LR model used is π(x) = e^{g(x)} / (1 + e^{g(x)}), where g(x) = β0 + β1x. A transformation of π(x) that is central to our study of LR is the logit transformation, defined in terms of π(x) as g(x) = ln[π(x) / (1 − π(x))]. The importance of this transformation is that g(x) has many of the desirable properties of a linear regression model: the logit g(x) is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x [2][3][4].
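To make the two transformations concrete, here is a minimal Python sketch (the coefficient values are hypothetical, chosen only for illustration) showing that the logit recovers the linear predictor from π(x):

```python
import math

def pi_x(g):
    """Logistic (inverse-logit) transform: maps the linear predictor g(x) to (0, 1)."""
    return math.exp(g) / (1.0 + math.exp(g))

def logit(p):
    """Logit transform: maps a probability back to the linear-predictor scale."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a single covariate x: g(x) = b0 + b1 * x
b0, b1 = -2.0, 0.5
x = 3.0
g = b0 + b1 * x                     # linear predictor; may range over (-inf, inf)
p = pi_x(g)                         # conditional mean pi(x) = E(Y | x), always in (0, 1)
assert abs(logit(p) - g) < 1e-9    # the logit recovers the linear predictor
```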

GOF Tests
Knowing whether the probabilities produced by the model accurately reflect the true outcome experience in the data is referred to as model goodness of fit. The null hypothesis of all GOF tests is that the model is correctly specified; rejecting the null hypothesis (i.e., a p-value of the test less than 0.05) indicates that the model is misspecified.
In the LR context, the essential components of fit are listed in the following three assumptions [2]: A1: the logit link function is appropriate: g(x) = Xβ. A2: the linear combination Xβ of the explanatory variables is sufficient; there is no omission of predictors, transformation of predictors, or interaction of predictors, so Xβ is sufficient.
A3: the underlying distribution for the outcome variable is Bernoulli, so the variance is Bernoulli. The consequences of violating the above assumptions (A1 to A3) are serious: the estimated coefficients and corresponding standard errors will be biased, so significance tests and confidence intervals may be misleading. Besides, violation could also lead to poor estimation of other quantities; for example, an inaccurate odds ratio estimate will influence the interpretation of a treatment effect. If the model is misspecified, the accuracy of prediction and classification for new subjects is reduced, and the interpretation of the relationship between the response variable and the independent variables could also be inaccurate.
In this paper, six GOF tests coded in R (see Table 1) were chosen to detect the violation of the second assumption (insufficiency of Xβ).
It is worth noting that Stukel's test in the "LogisticDx" package refers to six different tests:
- Score test for addition of vector z1 (S1)
- Score test for addition of vector z2 (S2)
- Score test for addition of vectors z1 and z2 (S3)
- Log-likelihood test for addition of vector z1 (S4)
- Log-likelihood test for addition of vector z2 (S5)
- Log-likelihood test for addition of vectors z1 and z2 (S6)
Many other studies have simulated GOF tests to check this violation. The Hosmer-Lemeshow tests [5,6] were tested in nearly all simulation studies against most departures, as they are considered the baseline GOF tests and are the default of most software packages [1,[7][8][9][10].
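As a sketch of how S1-S6 arise, the snippet below constructs the two added vectors from fitted logits, assuming the commonly presented formulation z1 = 0.5·ĝ²·I(ĝ ≥ 0) and z2 = −0.5·ĝ²·I(ĝ < 0); the tests then refit the model with z1 and/or z2 added and apply a score or likelihood ratio test to their coefficients:

```python
def stukel_z(g_hat):
    """Build Stukel's added covariates from the fitted logits g_hat.

    z1 is non-zero where the fitted logit is non-negative, z2 where it is
    negative.  Refitting with z1, z2, or both, and testing the added
    coefficients by score or likelihood ratio tests, yields S1-S6.
    """
    z1 = [0.5 * g * g if g >= 0 else 0.0 for g in g_hat]
    z2 = [-0.5 * g * g if g < 0 else 0.0 for g in g_hat]
    return z1, z2

# Fitted logits from a preliminary model fit (illustrative values)
z1, z2 = stukel_z([1.0, -1.0, 0.0, 2.0])
```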
The power of the Osius and Rojek test [11] was examined for detecting a missing covariate as well as a wrong functional form of the covariate under different dispersion levels of the data [12].
The unweighted sum of squares test [13], Stukel's score test [14], and a likelihood ratio version of Stukel's test suggested by [15] were tested against the omission of a quadratic term in a continuous variable, the omission of the main effect of a dichotomous variable and its interaction with a continuous variable [9], and the omission of a log term [15].
It is worth mentioning that all the previous simulation studies chose two or at most three sample sizes (most commonly 100, 500 and/or 1000), and none of them - to our knowledge - studied a detailed pattern of the tests' power or the optimum point at which the power of a test exceeds 80%.
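Since the Hosmer-Lemeshow statistic is the baseline against which the other tests are compared, a minimal Python sketch of the C statistic (grouping by deciles of predicted risk, following the standard textbook description; the R functions used in this study compute it internally) may help fix ideas:

```python
def hosmer_lemeshow_C(y, p, g=10):
    """Hosmer-Lemeshow C statistic.

    Observations are sorted by predicted probability and split into g
    near-equal-sized groups ('deciles of risk' when g = 10); the statistic
    compares observed and expected event counts per group and is referred
    to a chi-squared distribution with g - 2 degrees of freedom.
    """
    pairs = sorted(zip(p, y))                       # sort by predicted risk
    n = len(pairs)
    C = 0.0
    for k in range(g):
        grp = pairs[k * n // g:(k + 1) * n // g]    # k-th risk group
        if not grp:
            continue
        n_k = len(grp)
        obs = sum(yk for _, yk in grp)              # observed events
        exp = sum(pk for pk, _ in grp)              # expected events
        pbar = exp / n_k                            # mean predicted risk
        if 0.0 < pbar < 1.0:
            C += (obs - exp) ** 2 / (n_k * pbar * (1.0 - pbar))
    return C
```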

Simulation Settings
The coefficient settings of the Monte Carlo simulation study were chosen according to Canary, et al. [8], based on the simulation settings shown in Table 2. Among the true models, setting 1 had three linear covariates, while settings 2 and 3 contained a covariate and its quadratic term, the latter with an additional chi-square-distributed term; settings 4 and 6 had an interaction term between a continuous and a categorical covariate, again with the latter having an additional Bernoulli-distributed covariate. Finally, setting 5 had an interaction term between two continuous covariates.
The true model under each setting was examined 1000 times by the GOF tests, and the percentage of trials with a p-value greater than 0.05 was calculated for each test; in this study we refer to this percentage as "The Null" of the test.
For testing the power, the outcome variable for each setting was generated under the true model terms, and then a false model, obtained by omitting the last term of the true model, was fitted to the generated data. Thus, the ability of the GOF tests to detect a missing covariate (misspecification A) was tested under the false model of setting 1, the false models of settings 2 and 3 were used for misspecification B (missing quadratic term), and the false models of settings 4, 5 and 6 were used for misspecification C (missing interaction term).
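The generate-then-refit loop can be sketched as follows (the coefficients, covariate distribution, and quadratic departure here are illustrative stand-ins, not the Table 2 values):

```python
import math
import random

def simulate_trial(n, rng):
    """Generate one dataset under a hypothetical true model whose logit is
    b0 + b1*x + b2*x^2; fitting a model that omits the x^2 term to this
    data would correspond to misspecification B."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = []
    for xi in x:
        g = -0.5 + 1.0 * xi + 0.8 * xi * xi        # true linear predictor
        p = 1.0 / (1.0 + math.exp(-g))             # true event probability
        y.append(1 if rng.random() < p else 0)     # Bernoulli outcome
    return x, y

def rejection_rate(pvalues, alpha=0.05):
    """Share of trials with p < alpha: this is 'The Power' when the fitted
    model is false, and 1 minus 'The Null' when the fitted model is true."""
    return sum(1 for p in pvalues if p < alpha) / len(pvalues)

rng = random.Random(42)
x, y = simulate_trial(200, rng)
```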

Simulation Results
Results are classified into three parts: null results, power results of the same tests across packages, and power results of different GOF tests.

Null Results
All tests under all settings showed a null percentage greater than 92%, except that the S1 and S3 tests returned non-applicable (NA) results under setting 4 for sample sizes from 100 up to 500, and S1 and S2 returned NA results under setting 1 for the lowest sample size.

Power Results of Different Tests
In general, different tests gave different power patterns under different misspecifications, and the patterns varied considerably under different settings of the same misspecification. For each setting, the best test refers to the test which gave a power greater than 0.80 at the lowest sample size.

Power Results of Misspecification A
Under setting 1, all the tests except mHL, at all sample sizes studied, had a low power (less than 0.12) to detect the omission of a covariate with a χ2 distribution (Table 3).
This result is compatible with [8], where the power of the HL-C statistic did not exceed 0.05. In addition, [12] showed that the OsRo and HL-C power at a sample size of 100 did not exceed 0.45, while the power of the former exceeded 0.80 under a lower degree of sparseness at sample size 500.
Moreover, Kuss [12] stated that the test with the best power in his study when a missing covariate exists was the Farrington test, where satisfactory power was achieved with sparser data, although this test has the structural deficiency of never rejecting the null hypothesis with extremely sparse data (when each covariate pattern contains a single observation).
It is observable that most simulation studies of goodness of fit tests did not include the misspecification of omitting a covariate term, as this issue is considered mainly under the heading of model selection.
On the other hand, the power of the mHL test increased with the sample size until it reached its maximum (71%) at the highest sample size (Table 3). It is expected that this test may exceed 0.80 at larger sample sizes.
Furthermore, S2 and S3 returned NA results (the p-value failed to be calculated) under setting 1 at sample size 100.

Power Results of Misspecification B
The misspecification of quadratic term omission was tested under two settings (2 & 3) as shown in Table 4 and Table 5 respectively.
Under the second setting, the best test was S3, which gave satisfactory power at sample size 300, whereas the second best tests were OsRo, S2, S5 and CHCH-Rms at sample size 400. On the other hand, the power of the Hosmer C and H statistics (which are the default of many packages) did not exceed 0.80 except at sample sizes 600 and 1000.
However, the power of all tests failed to reach a satisfactory level in detecting the omission of the quadratic term when the model contained a second covariate, as in setting 3 (where the power did not exceed 40%), except mHL at the largest sample size. These results agree with [8], where HL gave a power of 80% only when there was a single covariate and its square in the true model (as in setting 2) at a sample size of 500, whereas when covariates other than the squared one existed, the test failed to detect the misspecification at all the sample sizes under study.
In addition, in the study of Hosmer and Hjort [23], the HL-C gave the same result as this work. However, the partial sum of residuals test gave a better result for this misspecification.

Power Results of Misspecification C
Misspecification C was expressed under three settings (4, 5 and 6). For the missing interaction term between categorical and continuous covariates (setting 4), OsRo was the best at detecting this departure when the sample size reached 600, and the second best were CHCH-Rms and S5, whereas the worst tests were HL-C, HL-H and S4, since even at the largest sample size their power did not reach a satisfactory level. Furthermore, under this setting, S1 and S3 failed to be calculated at all sample sizes (Table 6).
Concerning setting 5, where the omitted interaction term was between two continuous covariates, the best test was OsRo at sample size 200; even at sample size 100 its power was very close to 80%. All the tests gave a power over 90% at sample size 300, except CHCH-Rms, whose power exceeded 80% only at sample size 800 (Table 7). It is worth mentioning that S2 and S3 returned NA results (failed to be calculated) under setting 5 at sample size 200. For the last setting (Table 8), unfortunately all the tests (except OsRo) had very low power (less than 30%) to detect the missing interaction term of binary and continuous covariates when an extra covariate existed in the model. A weird, unexpected power result came from OsRo, whose power decreased as the sample size increased.

An Application to Breast Cancer Data
It was discovered that the age of patients and the location of the cancer (LOC), whether in the right breast, the left breast, or both, contribute significantly to the survival of patients [24]. In this study the dataset of Oguntunde, et al. [24] was reanalyzed and two models were represented. The first model is the same as that of Oguntunde, et al. [24] but without categorizing the age, where the specified linear predictor of model A was g_A(x) = β0 + β1·Age + β2·LOC. This linear predictor is used to compute the predicted risk of death (PRD) in model A, defined as PRD = e^{g_A(x)} / (1 + e^{g_A(x)}). The results of model A are shown in Table 9. It is clear that the magnitude of the intercept is higher than the other betas and it was significant, which may indicate a missing term in the model. In this study model B was proposed, where g_B(x) = β0 + β1·Age + β2·Age² + β3·LOC. The PRD in model B is defined using the standard logistic function in the same manner as in model A. The results of model B are shown in Table 10, where the magnitude of the intercept decreased nearly four times and became non-significant, which may indicate a stronger effect of the new term. The fit of both models was assessed using different R² measures, as in Fig 5, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), as in Table 11, as well as the goodness of fit tests under this study, as in Table 12. It is clear that R² was always higher in model B than in model A, whereas AIC and BIC were lower in the second model [25]. In general, the p-values of most GOF tests increased in model B, which indicates that this model fits better. However, not all tests detected the missed quadratic term (i.e., not all tests gave a significant p-value in model A). Furthermore, of all the tests that rejected the null hypothesis in model A, the lowest p-value was from the S3 test (Fig 6), which is compatible with the simulation study.
Moreover, CHCH-MK and CHCH-Rms gave the same results as in the simulation study and both detected the omission of the quadratic term, whereas CHCH-DT gave a different result (under the default settings of the function) and failed to reject the null hypothesis in model A.
Fig 6. P-values of the tests which gave significant results in model A.
Now the model has two coefficients for the age: a negative one for the linear term and a positive coefficient for the quadratic term. This means that, when the other variables are fixed, the PRD decreases as the age increases until a specific age equal to −β1/(2β2) years, where the PRD starts to increase.
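The turning point follows directly from the quadratic linear predictor: setting the derivative of β1·Age + β2·Age² to zero gives Age* = −β1/(2β2). A tiny Python sketch (with hypothetical coefficients, not those of Table 10) illustrates this:

```python
def turning_point(b_lin, b_quad):
    """Age at which a quadratic linear predictor b_lin*Age + b_quad*Age^2
    (and hence the PRD it feeds) changes direction: the vertex -b_lin/(2*b_quad)."""
    return -b_lin / (2.0 * b_quad)

# Hypothetical coefficients: negative linear, positive quadratic, so the
# PRD decreases with age up to the turning point, then increases.
age_star = turning_point(-0.2, 0.002)   # -> 50.0
```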

Conclusion
Functions for goodness of fit tests existing in R were examined from different aspects. The first aspect was to check the ability of each function to accept the null hypothesis when the model is true (the null). All tests showed a null over 92% except mHL (from package "LogisticDx"), which had a strange pattern in that it failed to accept the null hypothesis as the sample size got larger. Furthermore, the null of the OsRo test did not reach 90% at any sample size under two different settings in which an interaction term between binary and categorical covariates existed in the model.
The second aspect was comparing the power of the same tests under different packages in R. The Hosmer-Lemeshow tests (C and H statistics) displayed the same simulation results at all settings, whereas the functions of the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares test (CHCH) in different packages revealed different power results, except that when a quadratic term was present (under setting 2) "MKmisc" and "rms" were the same. Only the CHCH test of package "rms" exhibited reasonable results for all settings.
Finally, the power of the different tests under the various settings and ten sample sizes was compared to select the best test for each setting, where the best test is defined in this study as the test which gave a power equal to or exceeding 80% at the smallest sample size while its null was greater than 90%.
Unfortunately, all goodness of fit tests failed to detect the departure of a missing linear covariate term. For the departure of an omitted quadratic term, when the model contained one covariate and its square, S3 was the best for sample sizes not less than 300, and S2, S3, S5, OsRo and CHCH-Rms were the best for sample sizes equal to or greater than 400; but in another setting, when another covariate was added to the true model, all tests failed to reach a satisfactory power.
On the other hand, the CHCH test of package "rms", S2 and S5 were the best at detecting a missing interaction term between binary and categorical covariates when the sample size was not less than 700. However, when another covariate was added to the true model, all tests failed to detect this departure.
The OsRo, S4, S5 and S6 tests were the best at detecting a missing interaction term between two continuous covariates for sample sizes greater than or equal to 200.
The analysis was applied to a breast cancer dataset where the first model had a categorical and a continuous covariate, and a second model was proposed in which a quadratic term of the continuous covariate was added to the other covariates. S2, S3, S4, S5, S6 and CHCH-Rms rejected the null hypothesis for the first model, and S3 had the lowest p-value. In the second model, the p-values increased and nearly all of them became non-significant. The CHCH tests of packages "rms" and "MKmisc" gave the same results.