An Analysis of the Impact of Various Sampling Designs on the Headcount Index: A Simulation Study Based on the EU-SILC

The analysis and comparison of poverty between regions and countries are important topics in the social sciences, and are in demand by many national (Cáritas, Intermón Oxfam, Cruz Roja, etc.) and international (UN, World Bank, OECD, Eurostat, IMF, etc.) agencies and organizations. One of the most common poverty indicators in practice is the headcount index, which measures the proportion of individuals considered poor in a population. In this paper, we first analyze the impact of different sampling designs on the headcount index. The study is based on real data sets taken from different countries of the European Union, and the empirical measures used for comparison are obtained from Monte Carlo simulation studies. We observe that stratified sampling performs best among the sampling designs considered. Post-stratification performs similarly to simple random sampling without replacement, and the use of auxiliary information provides results similar to those derived from stratified sampling. Second, we analyze the empirical performance of different variance estimators under these sampling designs. We conclude that they have similar empirical performance and that they provide, in general, confidence intervals with desirable coverage rates.


Introduction
The main aim of this manuscript is to analyze the impact of various common sampling designs on the estimation of the proportion of individuals classified as poor. This proportion is also known as the headcount index or low income proportion. The study is based on various real data sets extracted from the 2011 European Union Statistics on Income and Living Conditions (EU-SILC), which is carried out by the statistical agency Eurostat. References that discuss the EU-SILC include Eurostat (2010), Eurostat (2013), Muñoz et al. (2015), López-Escobar et al. (2016), Berger and Torres (2014), Goedemé (2010), Goedemé (2013), Van Kerm (2007) and Museux (2005), among others.
One of the main challenges for governments and for national and international R&D+i strategies is to eradicate poverty. In addition, many international (UN, World Bank, OECD, Eurostat, IMF, etc.) and national (Cáritas, Intermón Oxfam, Cruz Roja, etc.) organizations require a better understanding of poverty. This paper is therefore motivated, first of all, by the fact that the headcount index is a topic of increasing interest to all these institutions. Statistical agencies also show great interest in this field of study. For example, Eurostat (2010), Eurostat (2013), and Haughton and Khandker (2005) regularly report estimates of the headcount index.
In order to study the different situations that lead to poverty, the analysis must rely on information and indicators that allow an accurate and reliable assessment of reality. For this reason, it is crucial to have proper instruments to approach this problem correctly and, therefore, to support sound decision making in the political sphere. Although there is plentiful literature on poverty analysis and the use of indicators, at present there are no studies that focus on the impact of different sampling designs on one of the most important poverty indicators, the headcount index.
Therefore, this study is framed within the intense debate about the need to introduce improvements in the measurement, estimation and analysis stages (Eurostat, 2013; Goedemé, 2013). The problem has its root in the fact that the majority of social and economic indicators have to be computed from sample information (see Haughton and Khandker (2005); Eurostat (2010)), which means that estimation methods producing more accurate and higher quality estimates are needed.
The study aims to provide a reference for the measurement and estimation of the headcount index through different estimation methods, as well as a basis for the improvement of national poverty reduction strategies and policies.
We consider different sampling designs commonly used in practice: simple random sampling without replacement (SRSWOR), simple random sampling using auxiliary information, stratified random sampling, and post-stratified sampling. The point estimation of the headcount index is an important topic, but sampling errors arise in practice and it is important to measure their size, which is the purpose of variance estimation. For this reason, we also compute variance estimates for the different estimators under the corresponding sampling designs. Monte Carlo simulation studies are used to analyze the precision of the different estimators and variance estimators, which also allows the impact of the various sampling designs to be assessed. In particular, estimators are compared in terms of Relative Bias (RB) and Relative Root Mean Square Error (RRMSE). The precision of the variance estimators is analyzed via the RB and the coverage of the corresponding confidence intervals. Such empirical measures are defined and described by, for instance, Chambers and Dunstan (1986), Rao et al. (1990), Silva and Skinner (1995), Harms and Duchesne (2006), Berger and Munoz (2015), Muñoz et al. (2015) and López-Escobar et al. (2016).

The Headcount Index
From a social point of view, poverty is a complex phenomenon that is influenced by many factors and that should appear on the political and economic agenda. As Ravallion (1999) puts it: "A credible measure of poverty can be a powerful instrument for focusing the attention of policy makers on the living conditions of the poor; it is easy to ignore the poor if they are statistically invisible." Authors who specialize in this topic include, among others, Sen (1976), Ravallion (1999), Foster (1998), Foster et al. (1984), Khandker (2005), Khandker (2009) and Atkinson (1987).
Other important reasons to measure poverty, in spite of its complexity, are the following. First, it is necessary to know the characteristics and behavior of the populations in which poverty-correcting measures are to be implemented, so that the effect of those measures can be greater. Second, measurement makes it possible to predict the effects of poverty projects and policies, with the aim of improving them or cancelling them if they are not viable. Finally, evaluating the actions of governments and institutions is not possible without measures that summarize in a single value the information provided by the data.
To sum up, poverty measurement, together with the analysis and evaluation of programs and projects, is decisive for setting policies to reduce poverty worldwide. In this sense, accurate indicators are important in order to describe the distribution of poor people and their real proportion in the population.
Poverty, understood as a pronounced deprivation in wellbeing (Haughton and Khandker, 2005), can be studied from many different perspectives, so there are many ways to measure it, depending on the definition adopted. In this paper, we focus on the monetary approach, which is the most widely used and attributes a monetary value to poverty. For this purpose, it is necessary to establish a poverty line or income threshold. Individuals with income below the line are deemed to be poor. From this point of view, the measurement of poverty is, therefore, a measurement of the income necessary to meet certain basic needs (the poverty line), including food and nonfood needs (Haughton and Khandker, 2009).
The relative poverty line is defined as a point in the distribution of income or expenditure and, hence, the line is updated automatically over time as living standards change (Zheng, 2001). In practice, researchers often specify the relative poverty line as a percentage of mean or median income or expenditure. For instance, Eurostat uses a relative poverty line given by 60% of the national median equivalised disposable income. This means that the threshold below which a person is classified as poor is set at 60% of the national median equivalised disposable income. The OECD uses 50% of the national median equivalised disposable income as its poverty line. Regarding relative poverty lines, see also Thompson (2013).
After defining a poverty line, we need a statistical measure that assigns a value to the distribution of the poor. There exists a wide set of poverty measures (Haughton and Khandker, 2005). The measure used in this paper is the headcount index, which gives the proportion of people below an official poverty line or, equivalently, the proportion of people classified as poor in a population (Álvarez et al., 2014). Since this parameter is a proportion, it is denoted by P. Formally,

$$P = \frac{N_p}{N}$$

is the headcount index for a given population, where $N_p$ is the number of poor in this population and $N$ is the population size.
The greatest virtues of the headcount index are that it is simple to construct and easy to understand, and these are important qualities (Haughton and Khandker, 2009). These authors also point out that the index has some limitations: i) it does not take the intensity of poverty into account; ii) it does not indicate how poor the poor are, and hence does not change if people below the poverty line become poorer; and iii) the estimates should be calculated for individuals, not households.
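As an illustration of the two definitions above, the following Python sketch computes a relative poverty line and the corresponding headcount index for a complete population. The function names and the toy incomes are hypothetical and are not taken from the EU-SILC; a unit is counted as poor when its income is strictly below the line, which is an assumption about how ties are treated.

```python
from statistics import median


def poverty_line(incomes, fraction=0.6):
    """Relative poverty line: a fraction of the median income."""
    return fraction * median(incomes)


def headcount_index(incomes, line):
    """P = N_p / N: share of units whose income is below the line."""
    n_poor = sum(1 for y in incomes if y < line)
    return n_poor / len(incomes)


# Hypothetical toy population of equivalised incomes (not EU-SILC data).
incomes = [4, 6, 9, 12, 15, 18, 20, 22, 30, 45]
line = poverty_line(incomes)        # 60% of the median income
P = headcount_index(incomes, line)  # proportion of units below the line
```

Changing `fraction` to 0.5 gives the 50% threshold mentioned above for the OECD.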

Some Sampling Designs and Estimators of the Headcount Index
The purpose of this paper is to analyze the estimation of the headcount index under different sampling designs. In particular, we compare sampling designs based on stratification with sampling designs without stratification. In addition, we analyze numerically the gain from using auxiliary information at the estimation stage, with real data taken from different countries of the European Union. The idea is to observe numerically the effect on the estimation of this parameter when different estimation methods are applied. For this reason, in this section we describe various known estimation methods that could be used in practice.
In the previous section, the headcount index for a given population was denoted as $P$. This parameter is unknown, since it is not possible to know the value of the variable of interest y (the equivalised net income) at the population level. To remedy this, we can select a sample (denoted as s) of size n from this population, so that the variable of interest can be observed for the individuals selected in this sample. Let $\{y_i : i \in s\}$ denote the sampled values of the variable of interest. The various sampling designs considered in this study are briefly described as follows.

Simple Random Sampling without Replacement
Simple random sampling without replacement (Cochran, 1977; Särndal et al., 2003), SRSWOR, has the property that all the individuals in the population have the same probability of selection. Under this design, each individual is chosen at random and has the same probability of being chosen at every stage of the sampling process; this is its main principle. Sampling is carried out without replacement, which prevents any individual from being chosen more than once. SRSWOR is the most commonly used sampling method, and it is considered a basic type of sampling, since it can be a component of other, more complex sampling methods.
Assuming that the poverty line is given by L, the estimator of the headcount index under SRSWOR can be expressed as

$$\hat{P} = \frac{n_p}{n} = \frac{1}{n} \sum_{i \in s} \delta(y_i < L),$$

where $n_p = \sum_{i \in s} \delta(y_i < L)$ is the number of poor in the sample s, and $\delta(\cdot)$ is the indicator variable, which takes the value 1 if its argument is true and the value 0 otherwise. It is well known that $\hat{P}$ is an unbiased estimator, i.e., $E(\hat{P}) = P$. The problem of estimating the variance of estimators is discussed in Section 3.5.
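The SRSWOR estimator is simply the sample proportion of poor individuals, as the following Python sketch illustrates. The population is synthetic (log-normal incomes, an assumption made only for illustration), the function names are hypothetical, and the poverty line is computed once from the population and then treated as known.

```python
import random
from statistics import median


def srswor_indices(N, n, rng):
    """Draw n distinct unit labels, each subset being equally likely."""
    return rng.sample(range(N), n)


def headcount_estimate(sample_incomes, line):
    """Sample proportion of units with income below the poverty line."""
    n_poor = sum(1 for y in sample_incomes if y < line)
    return n_poor / len(sample_incomes)


rng = random.Random(2011)  # fixed seed, for reproducibility only
# Synthetic incomes standing in for a real population (assumption).
population = [rng.lognormvariate(10.0, 0.6) for _ in range(5000)]
line = 0.6 * median(population)
P_true = headcount_estimate(population, line)

n = 500
sample = srswor_indices(len(population), n, rng)
P_hat = headcount_estimate([population[i] for i in sample], line)
```

Repeating the last two lines with fresh samples gives estimates that scatter around `P_true`, which is the behavior that unbiasedness describes.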
Auxiliary information can also be used at the estimation stage. There exist many methods based on auxiliary information, of which the ratio and regression techniques are among the best known. For simplicity, we consider the ratio method. Let x be an auxiliary variable related to the variable of interest y. Our database, the EU-SILC, provides information on income, poverty, social inclusion and living conditions for a sample of households and individuals; in our context, the auxiliary variable can be wages or wage inequality, tax on income contributions, and so on (Berger and Munoz, 2015; Muñoz et al., 2015; Rueda and Muñoz, 2011).
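The ratio estimator of the headcount index can be calculated as

$$\hat{P}_r = \frac{\hat{P}}{\hat{P}_x} P_x,$$

where $\hat{P}_x = n^{-1} \sum_{i \in s} \delta(\hat{R} x_i < L)$ is the customary estimator of $P_x = N^{-1} \sum_{i=1}^{N} \delta(\hat{R} x_i < L)$, and $\hat{R} = \bar{y} / \bar{x}$ is the customary estimator of the population ratio $R = \bar{Y} / \bar{X}$, where $\bar{y}$ and $\bar{x}$ are, respectively, the sample means of the interest and auxiliary variables, and $\bar{Y}$ and $\bar{X}$ are the corresponding population means. It is well known that the ratio estimator is a biased estimator, but it is asymptotically unbiased.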
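The following Python sketch shows one way a ratio-type estimator of the headcount index can be computed. It assumes the form in the style of Rao et al. (1990), in which the sample proportion is rescaled by the ratio between the population and sample proportions of the scaled auxiliary values; this form, the function names and the toy data are assumptions made for illustration. In the toy data, y is exactly proportional to x, so the auxiliary information recovers the population value from a small sample.

```python
def share_below(values, line):
    """Proportion of values strictly below the line."""
    return sum(1 for v in values if v < line) / len(values)


def ratio_headcount(y_s, x_s, x_pop, line):
    """Ratio-type estimator: P_hat * P_x / P_x_hat (assumed form)."""
    r_hat = (sum(y_s) / len(y_s)) / (sum(x_s) / len(x_s))
    p_hat = share_below(y_s, line)
    px_hat = share_below([r_hat * x for x in x_s], line)
    px_pop = share_below([r_hat * x for x in x_pop], line)
    if px_hat == 0:
        # No scaled auxiliary value below the line in the sample:
        # fall back to the plain sample proportion (a choice of this sketch).
        return p_hat
    return p_hat * px_pop / px_hat


# Hypothetical population in which y is exactly proportional to x.
x_pop = list(range(1, 101))
y_pop = [2 * x for x in x_pop]
line = 61

x_s = [5, 17, 40, 63, 88]       # auxiliary values of a small sample
y_s = [2 * x for x in x_s]

P_true = share_below(y_pop, line)
P_srs = share_below(y_s, line)
P_ratio = ratio_headcount(y_s, x_s, x_pop, line)
```

With real data the relationship between y and x is not exact, so the gain is smaller and depends on how strongly the two variables are related.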

Stratified Simple Random Sampling
Stratified sampling (Cochran, 1977; Särndal et al., 2003) is a very popular sampling method in which the first step is to divide the whole population into different groups. Hence, the population of N units is first divided into H subpopulations called "strata". The sample is then selected through SRSWOR within each stratum, and the whole procedure is called stratified simple random sampling (SSRS). Strata can be formed in many ways, but the most common and simplest is to use a variable such as age, gender, socioeconomic status, geographic region, religion, nationality or educational attainment. The customary way to carry out this method is to estimate the parameters separately for each stratum, and then to combine them into a weighted estimator. The estimator of the headcount index based upon SSRS is given by

$$\hat{P}_{st} = \frac{1}{N} \sum_{h=1}^{H} N_h \hat{P}_h, \qquad (1)$$

where $N_h$ is the population size of the hth stratum, $\hat{P}_h = n_h^{-1} \sum_{i \in s_h} \delta(y_i < L)$ is the estimate of the headcount index for the hth stratum, $s_h$ is the sample selected from the hth stratum, and $n_h$ is the sample size for the hth stratum. In this paper we obtain the strata with the method "4E" suggested by Silva and Skinner (1995), which provided good results for those authors. This method consists of ordering the units by the auxiliary variable and forming strata with an equal (E) number of units. We consider a total of 4 strata, hence the name 4E. The sample sizes are obtained by using proportional allocation (see also Silva and Skinner (1995)).
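The stratified procedure can be sketched in Python as follows: units are ordered by the auxiliary variable and cut into four groups of equal size (the 4E idea), an SRSWOR sample is drawn in each stratum with proportional allocation, and the stratum proportions are combined with weights N_h/N. The synthetic data, the function names and the rounding rule used for the allocation are assumptions of this sketch.

```python
import random
from statistics import median


def equal_size_strata(x, n_strata=4):
    """Order units by x and cut them into groups of (nearly) equal size."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    N = len(order)
    cuts = [(h * N) // n_strata for h in range(n_strata + 1)]
    return [order[cuts[h]:cuts[h + 1]] for h in range(n_strata)]


def stratified_headcount(y, strata, n, line, rng):
    """Weighted sum of stratum sample proportions, weights N_h / N."""
    N = sum(len(units) for units in strata)
    estimate = 0.0
    for units in strata:
        n_h = max(1, round(n * len(units) / N))   # proportional allocation
        sample_h = rng.sample(units, n_h)          # SRSWOR within stratum
        p_h = sum(1 for i in sample_h if y[i] < line) / n_h
        estimate += (len(units) / N) * p_h
    return estimate


rng = random.Random(4)
# Synthetic auxiliary variable and a related income variable (assumption).
x = [rng.lognormvariate(8.0, 0.5) for _ in range(1000)]
y = [3.0 * xi + rng.gauss(0.0, 500.0) for xi in x]
line = 0.6 * median(y)
P_true = sum(1 for yi in y if yi < line) / len(y)

strata = equal_size_strata(x)
P_st = stratified_headcount(y, strata, 200, line, rng)
P_census = stratified_headcount(y, strata, len(y), line, rng)
```

Setting the total sample size to N, as in the last line, samples every unit in every stratum, so the estimator reproduces the population value; this is a convenient sanity check of the weights.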

Post-Stratified Sampling
Post-stratified sampling consists of applying the concept of stratification, according to auxiliary information about the sampled population, after the sample has been selected. Relevant references that discuss post-stratification include Särndal et al. (2003), Silva and Skinner (1995) and Valliant (1993), among others.
The expression of the estimator of the headcount index under post-stratified sampling is the same as the expression (1) given for SSRS. This estimator will be denoted as $\hat{P}_{ps}$. The difference is that the stratification is carried out after the sample is selected. It is interesting to observe the differences in accuracy when the stratification is considered before and after the sample is selected; this comparison is carried out in Section 4. Post-strata are obtained by using the method 4E described in Section 3.3 and suggested by Silva and Skinner (1995).
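A minimal Python sketch of the post-stratified estimator follows. The sample is assumed to have been drawn already (for instance by SRSWOR); sampled units are then assigned to post-strata through cut-points of the auxiliary variable, and the post-stratum proportions are weighted by the known population sizes N_h. The function name, the cut-points and the toy sample are hypothetical, and raising an error for an empty post-stratum is a choice of this sketch.

```python
def post_stratified_headcount(y_s, x_s, cutpoints, N_h, line):
    """Weighted sum of post-stratum sample proportions, weights N_h / N.

    cutpoints: upper bounds (on x) of the first H - 1 post-strata.
    N_h: known population sizes of the H post-strata.
    """
    H = len(N_h)
    counts = [0] * H
    poor = [0] * H
    for y, x in zip(y_s, x_s):
        h = sum(1 for c in cutpoints if x > c)  # post-stratum of this unit
        counts[h] += 1
        poor[h] += int(y < line)
    if min(counts) == 0:
        raise ValueError("a post-stratum received no sampled units")
    N = sum(N_h)
    return sum(N_h[h] / N * poor[h] / counts[h] for h in range(H))


# Hypothetical sample, already selected before the strata are applied.
x_s = [5, 8, 15, 25, 35, 38]
y_s = [1, 9, 1, 9, 9, 9]
cutpoints = [10, 20, 30]     # boundaries between the 4 post-strata
N_h = [25, 25, 25, 25]       # equal population sizes, as in 4E
line = 5

P_ps = post_stratified_headcount(y_s, x_s, cutpoints, N_h, line)
P_unweighted = sum(1 for y in y_s if y < line) / len(y_s)
```

Unlike in SSRS, the realized sample sizes per post-stratum are random, which is why the two estimates in the example differ even though the population strata have equal sizes.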

Variance Estimator
We have previously defined point estimators for the headcount index. However, estimating the variance of these estimators is also an important topic, since it provides information about the accuracy of a given estimate. An additional aim of this study is to analyze and compare various variance estimators under the sampling designs considered. In particular, we analyze the Horvitz-Thompson variance estimator, the Sen-Yates-Grundy variance estimator (Sen, 1953; Yates and Grundy, 1953), and the Hájek variance estimator (Hájek, 1964) for the Narain-Horvitz-Thompson (Horvitz and Thompson, 1952; Narain, 1951) point estimator of the headcount index. These expressions can also be found in López-Escobar et al. (2016), who created an R package to compute them in practice.

The Horvitz-Thompson Variance Estimator
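For a general sampling design, the Horvitz-Thompson variance estimator of the estimator of a population mean is given by

$$\hat{V}_{HT}(\hat{\bar{y}}) = \frac{1}{N^2} \sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \, \frac{y_i}{\pi_i} \, \frac{y_j}{\pi_j}, \qquad (2)$$

where $\pi_i$ and $\pi_{ij}$ are, respectively, the first and the second order inclusion probabilities, with $\pi_{ii} = \pi_i$. Assuming SRSWOR, these quantities are given by

$$\pi_i = \frac{n}{N}, \qquad \pi_{ij} = \frac{n(n-1)}{N(N-1)} \quad (i \neq j).$$

For the case of estimating the headcount index, the corresponding Horvitz-Thompson variance estimator is obtained by substituting $y_i$ by $\delta(y_i < L)$ into equation (2), i.e.,

$$\hat{V}_{HT}(\hat{P}) = \frac{1}{N^2} \sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \, \frac{\delta(y_i < L)}{\pi_i} \, \frac{\delta(y_j < L)}{\pi_j}. \qquad (3)$$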
Similarly, we can obtain the Horvitz-Thompson variance estimator of the ratio estimator of the headcount index. Assuming SSRS, the Horvitz-Thompson variance estimator of the estimator of the headcount index is given by

$$\hat{V}_{HT}(\hat{P}_{st}) = \frac{1}{N^2} \sum_{h=1}^{H} N_h^2 \, \hat{V}_{HT}(\hat{P}_h),$$

where $\hat{V}_{HT}(\hat{P}_h)$ is the variance given by the expression (3) but calculated at the hth stratum. Finally, a Horvitz-Thompson variance estimator can likewise be defined for the estimator of the headcount index under post-stratification.
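To make the double sum concrete, the following Python sketch evaluates the Horvitz-Thompson variance estimator of the headcount index under SRSWOR, using the inclusion probabilities n/N and n(n-1)/(N(N-1)), and compares it with the familiar closed form (1 - n/N) p(1 - p)/(n - 1), to which the double sum reduces under this design. The function name and the toy sample of poverty indicators are hypothetical.

```python
def ht_variance_headcount_srswor(poor_flags, N):
    """Horvitz-Thompson variance estimator of the sample proportion
    under SRSWOR, written as the general double sum over the sample."""
    n = len(poor_flags)
    pi_i = n / N                              # first order probability
    pi_ij = n * (n - 1) / (N * (N - 1))       # second order, i != j
    total = 0.0
    for i, z_i in enumerate(poor_flags):
        for j, z_j in enumerate(poor_flags):
            joint = pi_i if i == j else pi_ij  # pi_ii equals pi_i
            weight = (joint - pi_i * pi_i) / joint
            total += weight * (z_i / pi_i) * (z_j / pi_i)
    return total / N ** 2


# Hypothetical sample: 1 marks a unit with income below the line.
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
N = 100

v_ht = ht_variance_headcount_srswor(flags, N)

n = len(flags)
p_hat = sum(flags) / n
v_closed = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
```

The double sum costs on the order of n squared operations, so for equal-probability designs the closed form is the practical choice; the general form matters for designs with unequal inclusion probabilities.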

The Sen-Yates-Grundy Variance Estimator
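The Sen-Yates-Grundy variance estimator can be similarly obtained under the different sampling designs. However, we have to substitute the expression (2) by

$$\hat{V}_{SYG}(\hat{\bar{y}}) = -\frac{1}{2N^2} \sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2.$$

For example, the corresponding Sen-Yates-Grundy variance estimator of the headcount index under SRSWOR is given by

$$\hat{V}_{SYG}(\hat{P}) = -\frac{1}{2N^2} \sum_{i \in s} \sum_{j \in s} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \left( \frac{\delta(y_i < L)}{\pi_i} - \frac{\delta(y_j < L)}{\pi_j} \right)^2,$$

with $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/\{N(N-1)\}$ for $i \neq j$. Expressions for alternative estimation methods are similarly defined.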
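The Sen-Yates-Grundy form can be sketched in Python in the same way. The function below takes the first and second order inclusion probabilities as inputs, so it is not tied to one design; it is then applied to SRSWOR, where it is known to coincide with the closed form (1 - n/N) p(1 - p)/(n - 1). The function name and the toy indicators are hypothetical.

```python
def syg_variance_mean(z, pi, pi2, N):
    """Sen-Yates-Grundy variance estimator for the estimator of a mean.

    z: sampled values; pi: first order inclusion probabilities;
    pi2: matrix of second order probabilities (diagonal is not used).
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the squared difference is zero when i == j
            weight = (pi2[i][j] - pi[i] * pi[j]) / pi2[i][j]
            total += weight * (z[i] / pi[i] - z[j] / pi[j]) ** 2
    return -total / (2 * N ** 2)


# Hypothetical sample of poverty indicators drawn by SRSWOR.
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
N = 100
n = len(flags)

pi = [n / N] * n
pi_ij = n * (n - 1) / (N * (N - 1))
pi2 = [[pi_ij] * n for _ in range(n)]

v_syg = syg_variance_mean(flags, pi, pi2, N)

p_hat = sum(flags) / n
v_closed = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
```

For fixed-size designs with unequal probabilities the Sen-Yates-Grundy and Horvitz-Thompson forms generally give different values, which is one reason to compare them empirically.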

The Hájek Variance Estimator
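Finally, we also compute in this paper the Hájek variance estimator (Hájek, 1964). For the problem of estimating a population mean, this variance estimator is given by

$$\hat{V}_{H}(\hat{\bar{y}}) = \frac{1}{N^2} \, \frac{n}{n-1} \sum_{i \in s} (1 - \pi_i) \left( \frac{y_i}{\pi_i} - \hat{G} \right)^2,$$

where $\hat{d} = \sum_{i \in s} (1 - \pi_i)$ and $\hat{G} = \hat{d}^{-1} \sum_{i \in s} (1 - \pi_i) \, y_i / \pi_i$. Following the notation used for the Horvitz-Thompson variance estimator, the Hájek variance estimator of estimators of the headcount index can be easily defined. For this purpose, we just need to replace $y_i$ by $\delta(y_i < L)$ in the previous expressions.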
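A Python sketch of a Hájek-type variance estimator is given below. It assumes the commonly used form n/(n-1) times the sum of (1 - pi_i)(y_i/pi_i - G)^2, divided by N^2 for a mean, which needs only first order inclusion probabilities; this form, the function name and the toy data are assumptions of the sketch. Under SRSWOR the result agrees with the closed form (1 - n/N) p(1 - p)/(n - 1).

```python
def hajek_variance_mean(z, pi, N):
    """Hajek-type variance estimator for the estimator of a mean.
    Only first order inclusion probabilities are required."""
    n = len(z)
    d_hat = sum(1 - p for p in pi)
    if d_hat == 0:
        return 0.0  # every unit selected with certainty: no sampling error
    g_hat = sum((1 - p) * zi / p for zi, p in zip(z, pi)) / d_hat
    total = sum((1 - p) * (zi / p - g_hat) ** 2 for zi, p in zip(z, pi))
    return n / (n - 1) * total / N ** 2


# Hypothetical sample of poverty indicators drawn by SRSWOR.
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
N = 100
n = len(flags)
pi = [n / N] * n

v_hajek = hajek_variance_mean(flags, pi, N)

p_hat = sum(flags) / n
v_closed = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
```

Because no second order probabilities are involved, the computation is a single pass over the sample rather than a double sum.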

Monte Carlo Simulation Studies
In this section, we carry out various Monte Carlo simulation studies to analyze the aspects described in this paper. The studies are based on real data sets extracted from the 2011 EU-SILC for 8 countries of the European Union. In particular, we analyze numerically the bias and the efficiency of estimators of the headcount index when samples are selected under different sampling designs. In addition, we evaluate different variance expressions under the same sampling designs. For this purpose, the empirical coverages of the corresponding confidence intervals are computed.

Populations
The simulation studies are based on real survey data obtained from the 2011 EU-SILC. We considered data from 8 different countries, chosen because they differ with respect to various poverty indicators, i.e., they have quite different values for the headcount index, the poverty gap index, etc. The countries considered in this paper are Belgium, Bulgaria, Spain, Slovenia, Italy, Lithuania, Poland and the UK. The variable of interest, y, is the equivalised net income, and the auxiliary variable is the tax on income contributions. This choice is due to the fact that both variables have a strong linear relationship in the various populations used in this paper. The data associated with each country are treated as a population of size N, from which samples are selected to carry out our studies.

Empirical Measures
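Monte Carlo simulation studies are based on different empirical measures, which are described in this section. Point estimators of the headcount index are compared in terms of Relative Bias (RB) and Relative Root Mean Square Error (RRMSE), which are defined, respectively, by

$$RB(\tilde{P}) = \frac{E[\tilde{P}] - P}{P}, \qquad RRMSE(\tilde{P}) = \frac{\sqrt{MSE[\tilde{P}]}}{P},$$

where $\tilde{P}$ denotes a given estimator of the parameter P, and $E[\cdot]$ and $MSE[\cdot]$ are, respectively, the empirical expectation and mean square error based on R = 10000 simulation runs, i.e.,

$$E[\tilde{P}] = \frac{1}{R} \sum_{r=1}^{R} \tilde{P}_r, \qquad MSE[\tilde{P}] = \frac{1}{R} \sum_{r=1}^{R} (\tilde{P}_r - P)^2,$$

where $\tilde{P}_r$ denotes the value of $\tilde{P}$ at the rth simulation run. On the other hand, the accuracy of the different variance estimators is analyzed via the relative bias of such variance estimators and the coverage of confidence intervals. For the rth simulation run, the lower and the upper limits of the confidence interval for a given estimator $\tilde{P}$ are defined, respectively, by

$$L_r = \tilde{P}_r - z_{0.975} \sqrt{\hat{V}(\tilde{P}_r)}, \qquad U_r = \tilde{P}_r + z_{0.975} \sqrt{\hat{V}(\tilde{P}_r)},$$

where $z_{0.975} \approx 1.96$ is the 0.975 quantile of the standard normal distribution. We considered confidence intervals with a 95% confidence level, and for this reason the empirical coverages should take values close to the nominal level of 95%. The coverage rates at the 95% confidence level are defined as

$$CR(\tilde{P}) = \frac{1}{R} \sum_{r=1}^{R} \delta(L_r \leq P \leq U_r).$$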
In addition, we computed the percentage of runs in which the true parameter is below the lower limit and the percentage in which it is above the upper limit of the confidence intervals. The ideal situation is for both percentages to be close to 2.5%.
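The empirical measures of this section can be sketched with a small Monte Carlo loop in Python. The sketch uses SRSWOR only, a synthetic log-normal population, the closed-form SRSWOR variance estimate (1 - n/N) p(1 - p)/(n - 1) and 1000 runs to keep the run time short; all of these are assumptions made for illustration, whereas the study itself uses EU-SILC data and R = 10000.

```python
import math
import random
from statistics import median


def simulate(population, line, n, runs, rng, z=1.96):
    """Empirical RB, RRMSE, coverage and tail error rates under SRSWOR."""
    N = len(population)
    P = sum(1 for y in population if y < line) / N
    estimates = []
    covered = below = above = 0
    for _ in range(runs):
        sample = rng.sample(population, n)            # SRSWOR
        p_hat = sum(1 for y in sample if y < line) / n
        var_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
        half = z * math.sqrt(var_hat)
        lower, upper = p_hat - half, p_hat + half
        if P < lower:
            below += 1        # true value below the lower limit
        elif P > upper:
            above += 1        # true value above the upper limit
        else:
            covered += 1
        estimates.append(p_hat)
    mean_est = sum(estimates) / runs
    mse = sum((e - P) ** 2 for e in estimates) / runs
    return {
        "RB": (mean_est - P) / P,
        "RRMSE": math.sqrt(mse) / P,
        "coverage": covered / runs,
        "below_lower": below / runs,
        "above_upper": above / runs,
    }


rng = random.Random(10000)  # fixed seed, for reproducibility only
# Synthetic incomes standing in for one country's data (assumption).
population = [rng.lognormvariate(10.0, 0.6) for _ in range(2000)]
line = 0.6 * median(population)
results = simulate(population, line, n=200, runs=1000, rng=rng)
```

Other designs or variance estimators can be compared by replacing the three lines that draw the sample and compute `p_hat` and `var_hat`, keeping the rest of the loop unchanged.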

Discussion and Future Research
The analysis and measurement of poverty is an important topic that receives considerable attention from many agencies and institutions. One of the most important indicators is the headcount index, which is the poverty indicator analyzed in this work.
After analysing numerically the bias and the efficiency of estimators of the headcount index when samples are selected under different sampling designs, we conclude that the sampling design with the best properties is stratified random sampling, i.e., this sampling design gives the best results in terms of Relative Bias (RB) and Relative Root Mean Square Error (RRMSE). The second best sampling design is simple random sampling with auxiliary information.
In addition, we evaluated different variance expressions under the same sampling designs. We used the empirical coverage rates of the corresponding confidence intervals and the relative bias of the variance estimators. Our findings are that the various confidence intervals give desirable results, with reasonable coverage rates, except under the post-stratified sampling design, which provides variance estimators with large biases and confidence intervals with coverage rates close to 90%.
A further step from this study could consider alternative populations, or include additional sampling designs, such as systematic sampling, cluster sampling, etc. The simulation studies could also be based on a database other than the 2011 EU-SILC used in this paper. In addition, this paper focuses on the headcount index. The study could be extended to alternative poverty measures, such as the poverty gap index, the poverty severity index, etc. In particular, it would be interesting to analyze the effect of different sampling designs on multi-dimensional poverty indicators.
In summary, the results derived in this paper under different sampling designs indicate the advantages of using stratified random sampling when estimating the headcount index. Different variance estimators are also analyzed, with desirable results. More generally, the paper also highlights the importance of poverty measurement and the use of statistical instruments as a key factor in the fight against poverty.