Descriptive Statistics 101
After collecting data, the first step of data analysis is to describe, summarize, and organize the characteristics of the data set. In this post, I will describe the main concepts of descriptive statistics and the implications of each one so you can have a more complete understanding of the topic.
Table of contents:
- Measures of Central Tendency
- Measures of Dispersion (or Variability)
- Key Concepts for distributions
Statistics is a subfield of mathematics concerned with collecting and interpreting data. Within Statistics there are two main categories:
- Descriptive Statistics: as the name says, you describe, present, summarize, and organize a data set through numerical calculations, graphs, or tables.
- Inferential Statistics: using a random sample of data to describe and make inferences about the population the sample was drawn from.
Descriptive statistics are based only on the data available and not based on any theory of probability.
The most common measurements are:
- Measures of Central Tendency
- Measures of Dispersion (or Variability)
Measures of Central Tendency:
Each of these measures (mean, median, and mode) finds the central location of a data set using different methods. These measures are really useful because by finding the central value of a data set, you can see where the data values typically fall and understand a data set much more quickly compared to looking at all of the individual values in the data set.
The mean, commonly known as the average, is the ratio of the sum of all observations (data points) to the total number of observations. It is the number around which the entire data set is spread. It is often considered the most reliable measure of central tendency for making inferences about a population from a single sample, although it is sensitive to outliers.
The median is the 'middle' value or midpoint of your sorted data set, also known as the 50th percentile. This measure is really useful because it is much less affected by outliers and skewed data than the mean, so the median is the better statistic to report in those cases.
The mode is the value with the highest frequency in the data set. The distribution could be:
- Unimodal: only one value with the highest frequency of the data set.
- Bimodal: two values with the highest frequency of the data set.
- Multimodal: multiple values with the highest frequency of the data set.
Note: In a normal distribution, the measures of central tendency are equal.
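To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The `salaries` list is hypothetical data I made up for illustration, with one deliberate outlier so you can see how differently the mean and the median react.

```python
import statistics

# Hypothetical sample (made up for illustration), with one outlier: 40.
salaries = [3, 3, 4, 5, 5, 5, 6, 40]

mean = statistics.mean(salaries)        # sum of values / number of values
median = statistics.median(salaries)    # midpoint of the sorted values
modes = statistics.multimode(salaries)  # every value tied for the highest frequency

print("mean:", mean)      # 8.875, pulled upward by the outlier
print("median:", median)  # 5.0, barely affected by the outlier
print("modes:", modes)    # [5], so this sample is unimodal

# A bimodal sample returns two modes.
bimodal_modes = statistics.multimode([1, 1, 2, 2, 3])
print("bimodal modes:", bimodal_modes)  # [1, 2]
```

Notice that only one of the eight values is above the mean, which is why the median is the safer summary for this sample.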
Measures of Dispersion (or Variability):
These measures describe the spread of the observations around the measures of central tendency. I talk about variability in the context of a distribution of values. Low dispersion indicates that the data points tend to be clustered tightly around the center. High dispersion signifies that they tend to fall further away.
Note: In statistics, variability, dispersion, and spread are synonyms that denote the width of the distribution.
1. Range:
The range is the difference between the maximum value and the minimum value of the data set. For example, take two data sets: data set 1 has a range of 38 - 20 = 18, while data set 2 has a range of 52 - 11 = 41. Data set 2 has a broader range and hence more variability than data set 1.
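As a quick sketch of the calculation, the snippet below computes the range for two data sets. The individual values are hypothetical; I only chose them so that the minimums and maximums match the example above. The helper name `value_range` is my own.

```python
# Hypothetical values; only the minimums and maximums come from the example.
data_set_1 = [20, 21, 24, 27, 30, 35, 38]
data_set_2 = [11, 16, 23, 29, 34, 45, 52]


def value_range(values):
    """Return the difference between the largest and the smallest value."""
    return max(values) - min(values)


range_1 = value_range(data_set_1)
range_2 = value_range(data_set_2)

print(range_1)  # 18
print(range_2)  # 41
```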
2. Interquartile Range (IQR) and other percentiles:
The IQR is the range of the middle half of the data, between the lower and upper quartiles. The quartiles divide the sorted data into quarters and are denoted, from low to high, Q1, Q2, and Q3. In other words, the IQR covers the 50% of data points that fall between Q1 and Q3, and it is calculated as Q3 - Q1.
The interquartile range is a robust measure of variability in the same way that the median is a robust measure of central tendency. Neither measure is influenced dramatically by outliers because they don't depend on every value.
When you have a skewed distribution, I find that reporting the median with the IQR is a particularly good combination.
Note: You can also use other percentiles to determine the spread of different proportions of the data. Just keep in mind that the broader these ranges are, the higher the variability in your data set.
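Here is a small sketch of the IQR and its robustness, using `statistics.quantiles` from Python's standard library (Python 3.8+). The data is hypothetical. Keep in mind that several quartile conventions exist; this uses the module's default "exclusive" method, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data, already sorted for readability.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
# Same data, but the largest value is replaced by an extreme outlier.
data_with_outlier = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 210]


def iqr(values):
    """Return Q3 - Q1 using the default 'exclusive' quantile method."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


iqr_clean = iqr(data)
iqr_outlier = iqr(data_with_outlier)
range_clean = max(data) - min(data)
range_outlier = max(data_with_outlier) - min(data_with_outlier)

print(iqr_clean, iqr_outlier)      # 12.0 12.0 -> the IQR ignores the outlier
print(range_clean, range_outlier)  # 20 209   -> the range does not
```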
3. Mean Absolute Deviation (MAD):
It is the mean absolute distance between each observation and the mean of the data set.
Let's see an easy example:
85, 75, 80
Mean = (85+75+80)/3 = 80
MAD = (5+5+0)/3 = 3.333
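The same arithmetic can be sketched in a few lines of Python. The function name `mean_absolute_deviation` is my own; the standard library does not ship one under that name as far as I know.

```python
import statistics

scores = [85, 75, 80]


def mean_absolute_deviation(values):
    """Average of the absolute distances between each value and the mean."""
    center = statistics.mean(values)
    return sum(abs(x - center) for x in values) / len(values)


mad = mean_absolute_deviation(scores)
print(round(mad, 3))  # 3.333
```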
4. Variance:
The variance measures how far the data points (observations) are spread out from the mean. It is calculated by finding the difference between every data point and the mean, squaring each difference, summing the squares, and dividing the sum by the number of observations (N).
The differences are squared for two reasons: squaring weights outliers more heavily than points near the mean, and it prevents differences above the mean from cancelling out those below the mean.
Note: because of the squaring, the variance is not in the same units of measurement as the original data, so it can be difficult to interpret intuitively. The standard deviation resolves this problem.
Another note: if your data set is a sample of a larger population, the formula is the same except that you divide by N-1 instead of N, to counteract the tendency of samples to underestimate the population variance.
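To illustrate the difference between dividing by N and by N-1, here is a minimal sketch that reuses the three scores from the MAD example. It computes the variance by hand and compares it with `statistics.pvariance` (population) and `statistics.variance` (sample) from Python's standard library.

```python
import statistics

scores = [85, 75, 80]
n = len(scores)
mean = statistics.mean(scores)

# Sum of squared differences from the mean: 25 + 25 + 0 = 50.
squared_diffs = sum((x - mean) ** 2 for x in scores)

population_variance = squared_diffs / n    # divide by N
sample_variance = squared_diffs / (n - 1)  # divide by N-1

print(population_variance)  # about 16.67
print(sample_variance)      # 25.0

# The standard library gives the same results.
lib_population = statistics.pvariance(scores)
lib_sample = statistics.variance(scores)
```

The sample variance is always the larger of the two, which is exactly the correction the N-1 is there to provide.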
5. Standard Deviation:
It is the square root of the variance. When you have a low standard deviation, your data points tend to be close to the mean. A high standard deviation means that your data points are spread out over a wide range.
Conveniently, the standard deviation uses the original units of the data, which makes interpretation easier. Consequently, the standard deviation is the most widely used measure of dispersion.
When you have normally distributed data, or approximately so, the standard deviation becomes more valuable. You can use it to determine the proportion of the values that fall within a specified number of standard deviations from the mean.
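As a sketch of both ideas, the snippet below first takes the standard deviation of the three scores used earlier, and then uses `statistics.NormalDist` (Python 3.8+) to compute the proportion of a normal distribution that lies within one, two, and three standard deviations of the mean. The helper name `share_within` is my own, and the proportions only apply when the data really is approximately normal.

```python
import statistics
from statistics import NormalDist

scores = [85, 75, 80]

population_sd = statistics.pstdev(scores)  # square root of the population variance
sample_sd = statistics.stdev(scores)       # square root of the sample variance
print(round(population_sd, 2))  # 4.08
print(sample_sd)                # 5.0

# A standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

This is often called the 68-95-99.7 rule.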
Which is the best — Range, IQR, or standard deviation?
I didn't include the variance because the squared unit doesn't provide an intuitive interpretation.
When you are comparing samples that are the same size, consider using the range as a measure of variability. It's a reasonably intuitive statistic, but be aware that a single outlier can throw the range off. The range is particularly suitable for small samples, when you don't have enough data to calculate the other measures reliably and the likelihood of obtaining an outlier is also lower.
When you have a skewed distribution, the median is a better measure of central tendency, and it makes sense to pair it with either the interquartile range or other percentile-based ranges because all of these statistics divide the dataset into groups with specific proportions.
For normally distributed data, or even data that aren't skewed, the tried-and-true combination of reporting the mean and the standard deviation is the way to go. This combination is by far the most common. You can still supplement this approach with percentile-based ranges as you need.
Key Concepts for distributions:
Modality is determined by the number of peaks the distribution contains.
Unimodal means that the distribution has only one peak, which means it has only one frequently occurring score, clustered at the top. A bimodal distribution has two values that occur frequently, and a multimodal distribution has several frequently occurring values.
Skewness is a measure of the asymmetry of a distribution. It describes how much a distribution departs from the symmetry of a normal distribution. It is important to know that a normal distribution has a skewness of zero.
- Positive skew is when the tail on the right side of the curve is longer than the tail on the left side. Here the mean is typically greater than the mode.
- Negative skew is when the tail on the left side of the curve is longer than the tail on the right side. Here the mean is typically smaller than the mode.
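One common way to put a number on skewness is the moment-based coefficient: the average cubed deviation from the mean divided by the standard deviation cubed. The sketch below implements that population version by hand. Other definitions (for example, sample-adjusted ones) exist and give slightly different values, and all the data here is hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 15]            # long tail on the right
left_skewed = [16 - x for x in right_skewed]     # mirror image: long tail on the left

print(skewness(symmetric))         # 0.0
print(skewness(right_skewed) > 0)  # True, positive skew
print(skewness(left_skewed) < 0)   # True, negative skew
```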
Kurtosis describes whether the data is light-tailed (few outliers) or heavy-tailed (many outliers) when compared to a normal distribution. The values below refer to excess kurtosis, which is the kurtosis minus 3, so that a normal distribution scores zero.
- Mesokurtic: when the excess kurtosis is zero, as in a normal distribution.
- Leptokurtic: when the tails of the distribution are heavy (many outliers), excess kurtosis > 0.
- Platykurtic: when the tails of the distribution are light, excess kurtosis < 0.
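The three cases can be sketched with a moment-based calculation of excess kurtosis: the average fourth-power deviation divided by the squared variance, minus 3. This is the population version written by hand; sample-adjusted formulas exist too, and the two data sets are hypothetical.

```python
import statistics


def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Most values sit at the center, with two extreme values far away.
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
# Values spread evenly, with no extreme observations.
light_tailed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(excess_kurtosis(heavy_tailed))  # 2.0, leptokurtic
print(excess_kurtosis(light_tailed))  # about -1.22, platykurtic
```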
Note: a histogram is an effective way to show both the skewness and the kurtosis of a data set, allowing you to easily spot whether something is wrong with your data.
You learned about three different measures of Central Tendency. Afterward, you learned about Range, Interquartile Range, Mean Absolute Deviation, Variance, and Standard Deviation. Then you learned about the three types of modality and how to describe, in terms of skewness and kurtosis, how much a distribution differs from a normal one.
This post is the first in a series in which I will cover a variety of topics in Statistics, Probability, Programming, Data Visualization, Analysis, and more, so if you found it useful, please consider subscribing and sharing it with people who might find it interesting. Give me some claps and connect with me via LinkedIn or GitHub.
Disclaimer: all the images displayed in the story were made by author.