A Note on Different Types of Probabilities of Misclassification

Whenever a discriminant function is constructed, the attention of a researcher is often focused on classification. The underlying interest is how well the discriminant function performs in classifying future observations correctly. In assessing the performance of any classification rule, the probabilities of misclassification of a discriminant function serve as the basis for the procedure. Different forms of probabilities of misclassification and their associated properties were considered in this study. The misclassification probabilities were defined in terms of probability density functions (pdf) and classification regions. The apparent probability of misclassification is expressed as the proportion of observations in the initial sample which are misclassified by the sample discriminant function. Different methods of estimating probabilities of misclassification were related to each other through their individual shortcomings. The degrees of uncertainty associated with probabilities of misclassification and their implications were also specified.


Introduction
The probability of misclassification, denoted by $P_{jk}$, is the probability of classifying an observation into population $\pi_j$ when it actually comes from population $\pi_k$. It occurs when the criterion selected is not suitable for classification [1,2]. An observation $X$ may be classified as belonging to population $\pi_1$ when it actually comes from population $\pi_2$, or vice versa. These errors are of serious concern in the choice of the procedure and, as such, one is required as much as possible to reduce the errors or, more appropriately, their probabilities of occurrence [3,4]. Let $f_1(x)$ and $f_2(x)$ be the probability density functions associated with $X$ for populations $\pi_1$ and $\pi_2$. The different probabilities of misclassification considered in this study are significant in the sense that the construction of a discriminant function prompts a researcher to determine how the function performs when classifying future samples [5].

Description of Probabilities of Misclassification

Optimum Probability of Misclassification
The optimum probability of misclassification assumes that the parameters of the distributions in the two populations are known, and it cannot be improved upon. According to John [6], the total optimum probability of misclassification is defined as

$$P = P_1 \int_{R_2} f_1(x)\,dx + P_2 \int_{R_1} f_2(x)\,dx,$$

where $R = R_1 \cup R_2$ is the entire region of classification, $f_i$ is the distribution of the observations to be classified, and $P_i \ (i = 1, 2)$ refers to the a priori probability that an observation comes from population $\pi_i$. The optimum probability of misclassification when an observation from $\pi_1$ is misclassified is given by

$$P(2 \mid 1) = \int_{R_2} f_1(x)\,dx = \Phi\!\left(-\tfrac{\Delta}{2}\right),$$

where $f_1(x)$ is the probability density function associated with the random vector $X$ for population $\pi_1$, $R_2$ is the set of values of $X$ for which observations are classified into $\pi_2$, $\Phi$ is the standard normal cumulative distribution function, and $\Delta$ is the Mahalanobis distance between populations $\pi_1$ and $\pi_2$ defined by

$$\Delta^2 = (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2).$$

Similarly, the optimum probability of misclassification when an observation from $\pi_2$ is misclassified is given by

$$P(1 \mid 2) = \int_{R_1} f_2(x)\,dx = \Phi\!\left(-\tfrac{\Delta}{2}\right),$$

where $f_2(x)$ is the probability density function associated with the random vector $X$ for population $\pi_2$ and $R_1$ is the set of values of $X$ for which observations are classified into $\pi_1$. Sedransk and Okamoto [7] gave a similar result on the probability of misclassification when the two populations $\pi_1$ and $\pi_2$ share a common covariance matrix. Suppose $X$ in populations $\pi_1$ and $\pi_2$ has the density function

$$f_i(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(x - \mu_i)' \Sigma^{-1} (x - \mu_i)\right\}, \quad i = 1, 2.$$

The parameters $\mu_i$ and $\Sigma$ satisfy the conditions $\mu_1 \neq \mu_2$, and $\Sigma$ is a positive definite symmetric matrix of order $p$. The optimum probabilities based on the classification regions

$$R_1: (\mu_1 - \mu_2)' \Sigma^{-1}\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right] \geq 0, \qquad R_2: (\mu_1 - \mu_2)' \Sigma^{-1}\left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right] < 0,$$

are given by $P(2 \mid 1) = P(1 \mid 2) = \Phi(-\Delta/2)$.
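For concreteness, the optimum error rate $\Phi(-\Delta/2)$ can be evaluated numerically. The following is a minimal sketch, assuming two multivariate normal populations with known means and a common covariance matrix; the function name and the parameter values are illustrative, not from the study:

```python
import numpy as np
from scipy.stats import norm

def optimum_error_rate(mu1, mu2, sigma):
    """Optimum probability of misclassification Phi(-Delta/2) for two
    multivariate normal populations with known means and common covariance."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    delta_sq = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    return norm.cdf(-np.sqrt(delta_sq) / 2)

# Hypothetical parameters for illustration
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(optimum_error_rate(mu1, mu2, sigma))  # about 0.15 for these values
```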

Conditional Probability of Misclassification
The conditional probability of misclassification is usually calculated when a sample discriminant function is involved in the classification rule. Given a discriminant function, it can be described as the conditional probability that a randomly chosen member of one of the populations is misclassified. It is conditional not only on the individual coming from one of the populations $\pi_1$ or $\pi_2$, but also on the estimates of the means of the distributions in the two populations.
John [6] obtained the conditional probability of misclassification when an observation from population $\pi_1$ is misclassified as

$$P(2 \mid 1;\, \bar{x}_1, \bar{x}_2) = \Phi\!\left(\sigma_1^{-1}\left[\mu_1 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]\right),$$

where $\Phi$ is the standard normal cumulative distribution function and $\sigma_1^{-1}$ is the inverse of the standard deviation from population $\pi_1$ (taking $\bar{x}_1 < \bar{x}_2$). The conditional probability of misclassification when an observation from population $\pi_2$ is misclassified is given as

$$P(1 \mid 2;\, \bar{x}_1, \bar{x}_2) = \Phi\!\left(\sigma_2^{-1}\left[\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) - \mu_2\right]\right),$$

where $\sigma_2^{-1}$ is the inverse of the standard deviation from population $\pi_2$.
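As a minimal univariate sketch, assuming the midpoint rule above with known standard deviations, the conditional error rates can be computed directly; the function name and all numeric values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def conditional_error_rates(xbar1, xbar2, mu1, mu2, sigma1, sigma2):
    """Conditional misclassification probabilities of the univariate rule
    'assign x to pi_1 if x < (xbar1 + xbar2)/2', taking xbar1 < xbar2 and
    conditioning on the estimated means xbar1, xbar2."""
    m = (xbar1 + xbar2) / 2             # estimated cut-off point
    p21 = norm.cdf((mu1 - m) / sigma1)  # P(2|1): pi_1 observation falls above m
    p12 = norm.cdf((m - mu2) / sigma2)  # P(1|2): pi_2 observation falls below m
    return p21, p12

# Hypothetical values: true means 0 and 2, unit standard deviations,
# sample means slightly off the true means
print(conditional_error_rates(xbar1=0.2, xbar2=1.9,
                              mu1=0.0, mu2=2.0, sigma1=1.0, sigma2=1.0))
```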

Estimated Probability of Misclassification
The estimated probability of misclassification, often referred to as the "plug-in estimate", was suggested by Fisher [8]. It is premised on the fact that the maximum likelihood estimates of the parameters are plugged into the discriminant function prior to classification. The total estimated probability of misclassification is given by

$$\hat{P} = P_1 \int_{\hat{R}_2} \hat{f}_1(x)\,dx + P_2 \int_{\hat{R}_1} \hat{f}_2(x)\,dx,$$

where $\hat{R}_1$ and $\hat{R}_2$ are the respective sub-regions of classification corresponding to populations $\pi_1$ and $\pi_2$, $\hat{f}_1(x)$ and $\hat{f}_2(x)$ are the respective estimated density functions of $X$ in populations $\pi_1$ and $\pi_2$, and $P_1$ and $P_2$ are the a priori probabilities that an observation comes from $\pi_1$ and $\pi_2$, respectively.
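Under normal populations with a common covariance matrix, the plug-in estimate reduces to $\Phi(-D/2)$, where $D$ is the sample Mahalanobis distance built from the sample means and pooled covariance. A minimal sketch, assuming that setting (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def plug_in_error_rate(x1, x2):
    """Plug-in estimate Phi(-D/2): sample means and the pooled covariance
    are substituted for the unknown parameters.  x1, x2 are (n_i, p) arrays
    of training observations from the two populations."""
    n1, n2 = len(x1), len(x2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(x1, rowvar=False) +
                (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    d_sq = diff @ np.linalg.solve(s_pooled, diff)  # sample Mahalanobis distance squared
    return norm.cdf(-np.sqrt(d_sq) / 2)
```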

Apparent Probability of Misclassification
The apparent probability of misclassification was suggested by Smith [9] and is defined as the proportion of observations in the initial sample which are misclassified by the sample discriminant function. If $n_1$ is the number of observations misclassified by the discriminant function in population $\pi_1$, and $n$ is the total sample size in population $\pi_1$, then the apparent probability of misclassification is $n_1/n$.

Expected Probability of Misclassification
The expected probability of misclassification has been discussed in the literature as the expected value of the conditional probability of misclassification; it is otherwise known as the unconditional probability of misclassification [6]. The total expected probability of misclassification is defined as

$$P = E\!\left[P_1 \int_{R_2} f_1(x)\,dx + P_2 \int_{R_1} f_2(x)\,dx\right],$$

where $R_1$ and $R_2$ are the respective sample-based sub-regions of classification corresponding to populations $\pi_1$ and $\pi_2$, $f_1(x)$ and $f_2(x)$ are the respective density functions of $X$ in populations $\pi_1$ and $\pi_2$, $P_1$ and $P_2$ are the a priori probabilities that an observation comes from $\pi_1$ and $\pi_2$, respectively, and the expectation is taken over the sampling distribution of the discriminant function.
The expressions for the expected probability of misclassification and its approximations were given by John [6] using Anderson's classification statistic $W$, in terms of $q(u, v, \rho)$, the standard bivariate normal density function with correlation coefficient $\rho$, where $\mu_1$ and $\mu_2$ are the means, $n_1$ and $n_2$ the sample sizes, and $\sigma_1^2$ and $\sigma_2^2$ the variances from populations $\pi_1$ and $\pi_2$, respectively.
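Because the expected probability is the average of the conditional probability over repeated training samples, it can also be approximated by Monte Carlo simulation. A sketch under stated assumptions (univariate normal populations, known common variance, equal priors, the midpoint rule, and hypothetical parameter values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu1, mu2, sigma, n1, n2 = 0.0, 2.0, 1.0, 20, 20

def conditional_error(xbar1, xbar2):
    """Conditional total error of the midpoint rule, given the sample means
    (equal priors; assumes xbar1 < xbar2, which holds here almost surely)."""
    m = (xbar1 + xbar2) / 2
    return 0.5 * (1 - norm.cdf((m - mu1) / sigma)) + 0.5 * norm.cdf((m - mu2) / sigma)

# Expected (unconditional) error: average the conditional error over
# repeated training samples
errs = [conditional_error(rng.normal(mu1, sigma, n1).mean(),
                          rng.normal(mu2, sigma, n2).mean())
        for _ in range(10_000)]
print(np.mean(errs))  # slightly above the optimum Phi(-1) ~ 0.159
```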

Parameter Substitution Method
With this method, the probability of misclassification is estimated directly by substituting sample estimates of the population parameters into the theoretical expression for the probability of misclassification. The method yields a natural estimate, the maximum likelihood estimator of the error rate. It is, however, said to be highly biased for small sample sizes [10].

Re-substitution Method
This procedure yields the apparent error rate, since the proportion of the sample incorrectly classified is used as the estimate of the probability of misclassification [11]. Let $P_{jk}$ be the probability of erroneously assigning an observation to group $j$ when the observation comes from group $k$; then the estimates $\hat{P}_{jk}$ are the sample proportions of misclassified observations. The estimates are consistent, but can be severely biased for small sample sizes. This method underestimates the probability of misclassification since the data used for fitting and validating the model are the same [12].
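A minimal sketch of re-substitution with a linear discriminant function; the simulated data and the use of scikit-learn's LDA are illustrative choices, not part of the source:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical training data: two 2-D normal populations
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 1], 1.0, size=(50, 2))])
y = np.repeat([1, 2], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
# Apparent error rate: proportion of the *training* sample misclassified
apparent_error = np.mean(lda.predict(X) != y)
print(apparent_error)
```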

Holdout Method
This method splits the total sample into two equal parts so as to overcome the shortcoming of the re-substitution approach. One subsample is employed to construct the classification rule and the second is used for validation. However, the method requires large samples; otherwise its estimate of the probability of misclassification suffers [13].
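A sketch of the holdout split, again with an illustrative LDA rule and simulated data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 1], 1.0, size=(100, 2))])
y = np.repeat([1, 2], 100)

# Split into two equal halves: one builds the rule, the other validates it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
holdout_error = np.mean(lda.predict(X_test) != y_test)
```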

Cross-validation Method
This method uses all of the available data without serious bias in the estimated error rates. It holds out one observation at a time, estimates the discriminant function based on the remaining $n - 1$ observations, and classifies the held-out observation. This process is repeated until all observations have been classified. Let $n_{jk}$ be the number of sampled observations from group $k$ misclassified into group $j$; then the estimated classification error rates are $\hat{P}_{jk} = n_{jk}/n_k$, where $n_k$ is the sample size of group $k$. The method produces unbiased estimates of the probability of misclassification for a rule based on $n - 1$ observations [4].
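A sketch of the leave-one-out procedure (the LDA rule and simulated data are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 1], 1.0, size=(50, 2))])
y = np.repeat([1, 2], 50)

# Hold out one observation at a time, fit on the remaining n-1,
# and classify the held-out observation
y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
loo_error = np.mean(y_pred != y)
# Per-group error rates P_hat(j|k): proportion of group k assigned to group j
p21 = np.mean(y_pred[y == 1] == 2)
p12 = np.mean(y_pred[y == 2] == 1)
```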

Jackknife Method
In order to overcome the defects of the re-substitution and holdout methods, the application of the jackknife was proposed by Lachenbruch [14]. In this procedure, the linear discriminant function is fitted to all but one observation; the fitted function is then applied to classify the omitted observation, and the process is repeated $n$ times, leaving out each observation in turn [15]. This method was later examined in the context of the discrimination problem by Crask and Perreault [16]. Their work focused on the simultaneous use of cross-validation and jackknife analysis: while the cross-validation method obtains good estimates of the classification error rates, jackknife analysis considers coefficient stability.
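A sketch of the coefficient-stability side of the jackknife: the discriminant function is refitted $n$ times, each time leaving one observation out, and the variability of the coefficients is summarized with the usual jackknife standard error. The data and the LDA rule are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(40, 2)),
               rng.normal([2, 1], 1.0, size=(40, 2))])
y = np.repeat([1, 2], 40)
n = len(y)

# Leave-one-out refits of the discriminant coefficients
coefs = np.empty((n, X.shape[1]))
for i in range(n):
    mask = np.arange(n) != i
    coefs[i] = LinearDiscriminantAnalysis().fit(X[mask], y[mask]).coef_[0]

coef_mean = coefs.mean(axis=0)
# Jackknife standard error of each coefficient: sqrt((n-1)/n * sum of squared deviations)
coef_se = np.sqrt((n - 1) / n * ((coefs - coef_mean) ** 2).sum(axis=0))
```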

Bootstrap Method
The bootstrap method is an extension of the jackknife and might also be thought of as a finite-sample Monte Carlo procedure. According to Samprit and Sangit [10], the method operates as follows:

- From the sample of size $n_g$ drawn from group $g$ ($g = 1, \dots, G$), draw an independent sample of size $n_g$ with replacement, each unit being drawn with probability $1/n_g$. The samples drawn from each of the $G$ groups together constitute the bootstrap sample.
- On the basis of the bootstrap sample, the linear discriminant function is constructed and its performance is evaluated by classifying all the observations not included in the bootstrap sample. The proportion of observations correctly classified is observed.
- The aforementioned steps are repeated a large number of times, and each trial generates an estimate of the misclassification probability. The average of all the sample outcomes is taken as the bootstrap estimate, and the standard deviation of the estimates provides an estimate of the standard error.
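A sketch of this group-wise bootstrap with an LDA rule; the simulated data and the choice of $B = 200$ trials are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 1], 1.0, size=(50, 2))])
y = np.repeat([1, 2], 50)
n, B = len(y), 200

errors = []
for _ in range(B):
    # Within each group, draw n_g units with replacement (probability 1/n_g each)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == g),
                                     size=np.sum(y == g), replace=True)
                          for g in (1, 2)])
    out = np.setdiff1d(np.arange(n), idx)  # observations not in the bootstrap sample
    if out.size == 0:
        continue
    lda = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    errors.append(np.mean(lda.predict(X[out]) != y[out]))

boot_error = np.mean(errors)      # average over trials: bootstrap estimate of the error rate
boot_se = np.std(errors, ddof=1)  # spread of the trial estimates: its standard error
```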

Significance of Probabilities of Misclassification
The assessment of the predictive performance of a classification model through uncertainty estimation is anchored on its probabilities of misclassification. A low probability of misclassification implies a low degree of misclassification, and hence high reliability. A high probability of misclassification indicates a high degree of uncertainty and a propensity to generate erroneous classifications.

Conclusion
The probability of misclassification is a decisive factor in evaluating a classification procedure. Different approaches have been designed and related to one another in order to find the best possible way of estimating the true probabilities of misclassification, and these methods have resulted in different types of probabilities of misclassification. The bootstrap method has the advantage that it not only furnishes estimates of the misclassification probabilities but also provides an estimate of their standard error.