# Intro to Sampling

## What is sampling?

A sample is a part of a population. For example, you might have information on 100 people (the sample) out of 10,000 people (the population). Using information about the sample, we can draw inferences about the population’s behaviour. Sampling is used when it is impractical to obtain information from every member of the population, as in biological or chemical analysis or in social surveys.

For example, imagine an experiment to test the effects of a new education technique on schoolchildren. It would be impossible to select the entire school-age population of a country, divide it into groups, and perform the research. This is where statistical sampling comes in: the idea is to take a representative section of the population, perform the experiment on it, and extend the results back to the population as a whole.
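To make the idea concrete, here is a minimal Python sketch of estimating a population value from a sample of 100 out of 10,000. The population is synthetic (randomly generated scores, not real data), and the variable names are only illustrative.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 synthetic test scores (not real data).
population = [rng.gauss(70, 10) for _ in range(10_000)]

# Draw 100 members without replacement and use them to estimate the mean.
sample = rng.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical; the gap between them is the sampling error that comes from looking at only part of the population.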

## Common types of sampling

**Probability sampling**

Probability sampling is a sampling technique in which the sample is gathered by a random process that gives every individual in the population a known, non-zero chance of being selected. In the simplest designs, that chance is equal for everyone.

**Simple random sampling –** The key to random selection is that no bias is involved in choosing the sample, because every member of the population has an equal chance of being picked. An example of simple random sampling would be drawing 50 students’ names out of a hat from a school of 1,000 students. In random sampling, any difference observed between the sample’s characteristics and the population’s characteristics is only a matter of chance.

**Stratified sampling –** In this method, the population is divided into subgroups (called strata) that are important for the research, for example by gender, social class, education level, or religion. A random sample is then drawn from within each stratum.

Stratified samples are often preferable to simple random samples because they ensure that each subgroup within the population receives proper representation. However, they require fairly detailed advance knowledge of the population’s characteristics and are therefore more difficult to construct.

**Cluster sampling –** In cluster sampling, the population is first divided into separate groups, called clusters. Then a simple random sample of clusters is selected. Finally, analysis is conducted on the data from the sampled clusters.

Cluster sampling is used when simple random sampling is nearly impossible because of the size of the population. For example, drawing a simple random sample from the entire population of Asia is not feasible, but cluster sampling is a practical alternative.

**Systematic sampling –** Systematic sampling resembles an arithmetic progression, in which the difference between any two consecutive numbers is the same. For example, imagine we have to choose subjects for research from a school with 2,000 students. First, we choose a random starting number, say three. Then we choose the interval between consecutive selections, say ten. Our sample will then be the students numbered 3, 13, 23, 33, 43, 53, 63, and so on.

**Mixed/multi-stage sampling –** This probability sampling technique combines two or more of the sampling techniques described above. In most complex research, whether in the field or in the lab, a single type of probability sampling is not sufficient. Most studies are carried out in stages, with each stage applying a different sampling technique.
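The probability sampling methods described above can be sketched with Python’s standard `random` module. This is a minimal illustration rather than a definitive implementation: the `students` list, its `grade` and `classroom` fields, and all function names are hypothetical and chosen to mirror the school example.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical school: 2,000 students, 4 grades, 80 classrooms of 25.
students = [
    {"id": i, "grade": 9 + (i - 1) % 4, "classroom": (i - 1) // 25}
    for i in range(1, 2001)
]

def group_by(population, key):
    """Split the population into groups keyed by key(item)."""
    groups = defaultdict(list)
    for item in population:
        groups[key(item)].append(item)
    return groups

def simple_random_sample(population, n):
    """Every member has the same chance of being drawn."""
    return rng.sample(population, n)

def stratified_sample(population, key, fraction):
    """Draw the same fraction at random from within every stratum."""
    sample = []
    for members in group_by(population, key).values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(population, key, n_clusters):
    """Pick whole clusters at random and keep everyone inside them."""
    clusters = group_by(population, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

def systematic_sample(population, start, step):
    """Take every step-th member, beginning at 1-based position start."""
    return population[start - 1::step]

def two_stage_sample(population, key, n_clusters, n_per_cluster):
    """Multi-stage: random clusters first, then a random sample in each."""
    clusters = group_by(population, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [s for c in chosen for s in rng.sample(clusters[c], n_per_cluster)]

srs = simple_random_sample(students, 50)
stratified = stratified_sample(students, lambda s: s["grade"], 0.05)
clustered = cluster_sample(students, lambda s: s["classroom"], 4)
systematic = systematic_sample(students, start=3, step=10)
two_stage = two_stage_sample(students, lambda s: s["classroom"], 4, 10)

print([s["id"] for s in systematic[:5]])  # [3, 13, 23, 33, 43]
```

Note the difference between the stratified and cluster versions: the former samples a few students from every grade, while the latter keeps all students from a few randomly chosen classrooms.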

**Non-probability sampling**

Non-probability sampling is a sampling technique in which selection does not rely on random chance, so individuals in the population do not all have a known or equal chance of being selected.

**Convenience sampling –** In convenience sampling, subjects are selected simply because they are accessible to the researcher and easy to recruit. This technique is considered the easiest, cheapest, and least time-consuming. However, it may introduce bias and sampling error, because the sample may not be representative of the population.

**Quota sampling –** In quota sampling, we take a tailored sample that is in proportion to some characteristic or trait of the population. For example, you could divide a population by state of residence, income, education level, or sex. The population is divided into groups, and samples are taken from each group using judgment (which is what makes it non-probability sampling) until a quota is met.

**Snowball sampling –** Snowball sampling is usually used when the population of interest is very small or hard to reach. In this type of sampling, the researcher asks the first subject to identify another potential subject who also meets the criteria of the research; the second subject then identifies another, and so on. As the sample builds up, enough data is gathered to be useful for research. The downside of a snowball sample is that it is unlikely to be representative of the population.
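The referral process behind snowball sampling can be sketched as a walk over a "who refers whom" map. The following is only an illustrative sketch: the `referrals` dictionary, the names in it, and the `snowball_sample` function are all hypothetical.

```python
from collections import deque

def snowball_sample(referrals, seed, max_size):
    """Start from one subject and follow referrals until max_size is reached."""
    sample = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(sample) < max_size:
        current = queue.popleft()
        for contact in referrals.get(current, []):
            if contact not in seen:
                seen.add(contact)
                sample.append(contact)
                queue.append(contact)
                if len(sample) == max_size:
                    break
    return sample

# Hypothetical referral map: each subject names the people they can refer.
referrals = {
    "Ana": ["Ben", "Cho"],
    "Ben": ["Dev"],
    "Cho": ["Ana", "Eli"],
    "Dev": [],
    "Eli": ["Fay"],
    "Gus": ["Hal"],
}

print(snowball_sample(referrals, "Ana", 5))
# ['Ana', 'Ben', 'Cho', 'Dev', 'Eli']
```

In this toy map, Gus and Hal can never be reached from Ana, which illustrates why a snowball sample tends to over-represent one connected social circle.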

## Sampling bias

Sampling bias occurs when a sample is collected in such a way that some members of the population are less likely to be included than others. Some bias is hard to avoid, because it is practically impossible to ensure perfect randomness in sampling. While some researchers deliberately use a biased sample to produce misleading results, bias usually arises from the difficulty of obtaining a truly representative sample.

For example, in 1936, the Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate, Alf Landon, would beat the incumbent president, Franklin D. Roosevelt, in the presidential election by a huge margin. However, the result was the exact opposite: Roosevelt won in a landslide. Bias was introduced because the survey was collected from magazine readers, registered automobile owners, and telephone users. The sample over-represented wealthier individuals, who were more likely to vote for the Republican candidate.
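The effect of a skewed sampling frame can be reproduced in a small simulation. Every number below is invented for illustration and is not taken from the 1936 election: the sketch assumes that 30% of voters are wealthy, that wealthy voters are far more likely to appear on the lists a pollster uses (the hypothetical `listed` flag), and that they favour a hypothetical candidate A more strongly.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is repeatable

def make_voter():
    """Build one synthetic voter; all probabilities are made-up assumptions."""
    wealthy = rng.random() < 0.30
    listed = rng.random() < (0.90 if wealthy else 0.20)   # on the pollster's lists
    votes_a = rng.random() < (0.70 if wealthy else 0.35)  # supports candidate A
    return {"wealthy": wealthy, "listed": listed, "votes_a": votes_a}

electorate = [make_voter() for _ in range(50_000)]

def share_for_a(voters):
    """Fraction of the given voters who support candidate A."""
    return sum(v["votes_a"] for v in voters) / len(voters)

true_share = share_for_a(electorate)

# Probability sample: every voter is equally likely to be polled.
random_poll = rng.sample(electorate, 2_000)

# Biased sample: only voters who appear on the lists can be polled.
sampling_frame = [v for v in electorate if v["listed"]]
biased_poll = rng.sample(sampling_frame, 2_000)

print(f"true support for A:   {true_share:.3f}")
print(f"random poll estimate: {share_for_a(random_poll):.3f}")
print(f"biased poll estimate: {share_for_a(biased_poll):.3f}")
```

Under these assumptions the random poll lands near the true share, while the biased poll overstates support for candidate A. Increasing the size of the biased poll would not help, since the error comes from who can be sampled rather than from how many are sampled.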

## Conclusion

A sample is essentially a subset of the population, selected so as to be representative of the larger population. Since we usually cannot study the entire population, we take a sample instead. Sampling reduces cost and speeds up research. However, care must be taken to avoid introducing bias or error. If the sample is not collected properly, it will not represent the system we are trying to analyze, and our analysis will be futile.

### About Rohan Joseph

Practicing the dark arts of data science. I am currently pursuing a Master's in Operations Research at Virginia Tech and working with Chartio to democratize analytics in every organization.