This post is the first in a series of statistics posts providing essential knowledge for success in the field of AI/ML.
What is statistics?
Statistics is a field of study that helps us collect data and use that data meaningfully.
In the previous post, we discussed the importance and relevance of data in machine learning. However, data does not mean anything if not used properly.
As someone aiming to become an expert in AI/ML, your goal should NOT be to memorize the many formulas covered in this series of statistics posts (although memorization can be convenient). Instead, your goal should be to build an intuitive understanding of the logic and the practical value behind each and every statistics concept covered.
Central Tendency
Central tendency is a representation of a typical or "central" value of a set of data. Common measures of central tendency are the mean, the median, and the mode of a set of data (if you do not know what these terms mean, don't worry! We'll cover everything you need to know in these posts from the ground up :)).
Mean
The mean of a set of data refers to the arithmetic average of the data set, typically denoted by the Greek symbol μ (pronounced "mu"). The mean is calculated as μ = Σx / n, where x is a set of data values/terms/observations, Σx is the sum of those values, and n is the total number of observations in x. In other words, the mean is the sum of the terms in data set x divided by the total number of terms in x.
Let's demonstrate finding the mean of a data set using an example. Suppose x = [2, 4, 6, 8, 10]. To find the mean of this set of data, simply compute the sum of the observations in data set x divided by the number of terms in x, which is equivalent to (2+4+6+8+10)/5, simplifying to 6 as your final answer.
Median
The median of a set of data is the midpoint of a data set when all of the values are sorted in ascending order. For example, suppose x = [2, 1, 5, 3, 10]. If sorted in ascending order, then x = [1, 2, 3, 5, 10]. In this case, the median is 3. However, if a dataset contains an even number of values, then the median is simply the average of the two midmost values.
Mode
The mode represents the most frequent value in a data set. For example, if x = [1, 2, 3, 3, 3, 4, 10], then the mode would be 3. However, the mode is typically the least used measure of central tendency for a dataset.
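To make these three measures concrete, here is a minimal sketch using Python's built-in statistics module, reproducing the examples above:

```python
import statistics

x = [2, 4, 6, 8, 10]            # the mean example above
print(statistics.mean(x))        # 6

y = [2, 1, 5, 3, 10]             # the median example above
print(statistics.median(y))      # 3 (the values are sorted internally first)

z = [1, 2, 3, 3, 3, 4, 10]       # the mode example above
print(statistics.mode(z))        # 3
```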
Central Tendency: Mean or Median?
You may be wondering: when should we use the mean and when should we use the median to find the central tendency of a dataset? The answer depends on the dataset provided.
For example, suppose dataset x = [4, 6, 9, 13, 80]. The mean and median of x are 22.4 and 9, respectively. Why is there a significant difference between the mean and median of x? It's because of the inclusion of the value 80 in x, known as an extreme/abnormal value (covered later in this post). Suppose instead that x = [4, 6, 9, 13, 16]. Then, the mean and median would be 9.6 and 9, respectively. This all means that while the mean is easily affected by extreme/abnormal values, the median is not.
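A quick sketch makes this contrast concrete; the same statistics module reproduces the numbers above:

```python
import statistics

with_extreme = [4, 6, 9, 13, 80]
without_extreme = [4, 6, 9, 13, 16]

print(statistics.mean(with_extreme))      # 22.4 -- pulled upward by the 80
print(statistics.median(with_extreme))    # 9    -- unaffected by the 80
print(statistics.mean(without_extreme))   # 9.6
print(statistics.median(without_extreme)) # 9
```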
Therefore, when a dataset contains extreme/abnormal values, the median is usually the more trustworthy measure of central tendency; for datasets without such values, the mean works just as well.
Interquartile Range (IQR)
A set of numerical data can be divided into quartiles, each marked by a significant value. The middle value of the first half of a dataset is known as the first quartile (Q1), the median of the dataset is known as the second quartile (Q2), and the middle value of the second half of a dataset is known as the third quartile (Q3). These concepts are illustrated in the following example:
[Image (mathbits.com): an example dataset sorted in ascending order with an even number of values; its middle values are 32 and 40, with Q1 = 26.5, Q2 = 36, and Q3 = 51.]
In the above example, because the dataset includes an even number of values, Q2 (the dataset median) is the average of the dataset middle values (32 & 40), which is 36. Q1 and Q3 are calculated in the same manner.
The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1.
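The quartile logic above can be sketched in a few lines of Python. The dataset below is hypothetical but chosen so that its quartiles match the example above (middle values 32 and 40, Q1 = 26.5, Q2 = 36, Q3 = 51):

```python
import statistics

def quartiles(data):
    """Quartiles via the median-of-halves method described above."""
    s = sorted(data)
    half = len(s) // 2
    q2 = statistics.median(s)
    q1 = statistics.median(s[:half])   # first half (excludes the median when n is odd)
    q3 = statistics.median(s[-half:])  # second half
    return q1, q2, q3

data = [18, 25, 28, 32, 40, 45, 57, 60]  # hypothetical values matching the example's quartiles
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)        # 26.5 36.0 51.0
print("IQR =", q3 - q1)  # IQR = 24.5
```

Note that different tools use slightly different quartile conventions; NumPy's percentile function, for example, interpolates and can give slightly different values on small datasets.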
Outliers and Extreme Values
Outliers are values that do NOT fit the distribution of data followed by the majority of observations (data points) in a dataset. Mathematically, the outlier boundaries are calculated by the following formulas:

Lower Outlier = Q1 - (1.5 * IQR)
Higher Outlier = Q3 + (1.5 * IQR)

If a datapoint is less than or equal to the value given by the Lower Outlier equation, or greater than or equal to the value given by the Higher Outlier equation, then that datapoint is considered an outlier.
For the example provided above, the IQR = Q3 - Q1 = 51 - 26.5 = 24.5. Using the Lower Outlier equation, Q1 - (1.5 * IQR) = 26.5 - (1.5 * 24.5) = -10.25. Likewise, the Higher Outlier equation yields Q3 + (1.5 * IQR) = 51 + (1.5 * 24.5) = 87.75. Therefore, if a datapoint is less than or equal to -10.25 OR greater than or equal to 87.75, then that datapoint would be considered an outlier.
Extreme Values are essentially more extreme versions of outliers. The Lower Extreme Value equation is given by Q1 - (3 * IQR), and the Higher Extreme Value equation is given by Q3 + (3 * IQR); essentially, the extreme value equations are the same as the outlier equations except that the IQR is now being multiplied by 3 instead of 1.5.
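Both sets of boundaries can be wrapped in a small helper; plugging in the quartiles from the example above reproduces the numbers computed so far:

```python
def fences(q1, q3):
    """Outlier and extreme-value boundaries from the equations above."""
    iqr = q3 - q1
    return {
        "lower_outlier": q1 - 1.5 * iqr,
        "upper_outlier": q3 + 1.5 * iqr,
        "lower_extreme": q1 - 3.0 * iqr,
        "upper_extreme": q3 + 3.0 * iqr,
    }

print(fences(26.5, 51))
# {'lower_outlier': -10.25, 'upper_outlier': 87.75,
#  'lower_extreme': -47.0, 'upper_extreme': 124.5}
```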
Statistical and Contextual Outliers | Leverage Points
Suppose we have a dataset of players listing the games each played and the total points each scored. Looking only at points scored, which player, if any, would you consider an outlier? Undoubtedly, the player who scored 32,000 points, given just this single column of data. This is known as a statistical outlier: a datapoint that can be considered an outlier when you examine just one variable of data collection (one column).

Now, if we consider additional variables, which player(s), if any, would be outliers? The player who played 40 games and scored 3,000 points would be considered an outlier, because this player averages 75 points/game, which is far from the distribution of the other players. This is an example of a contextual outlier: a datapoint that can be considered an outlier when you consider multiple variables of data collection (multiple columns).

By contrast, the player who played 1,280 games and scored 32,000 points would be considered a leverage point: a point that lies far from the other points in its respective data distribution but, unlike an outlier, still follows the trend of the data distribution.
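To illustrate, here is a small sketch with hypothetical rows based on the description above (players A and B are invented "typical" players):

```python
# (games played, total points) -- hypothetical rows based on the description above
players = [
    ("A",  400, 10_000),   # 25.0 points/game -- typical
    ("B",  820, 19_000),   # ~23 points/game  -- typical
    ("C",   40,  3_000),   # 75.0 points/game -- contextual outlier
    ("D", 1280, 32_000),   # 25.0 points/game -- leverage point: extreme totals, typical rate
]

for name, games, points in players:
    print(f"{name}: {points / games:.1f} points/game")
```

Player C stands out only once the two columns are combined, while player D is extreme in both columns yet follows the overall points-per-game trend.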
[Images: left, a leverage point at an x value of approximately 5 that follows the data trend; right, an outlier at an x value of approximately 25 that does not.]
Normal Distribution
A normal distribution of data is a symmetric distribution resembling a bell curve, where the heaviest concentration of observations occurs at a central peak (the mean) and the concentration of values decreases steadily as you move away from the peak in either direction. Normal distributions approximately describe many common phenomena in our day-to-day lives, such as the heights and weights of a population.
For example, suppose we model the heights of a class of students, and the distribution of heights forms a normal distribution, as seen in the following image:
[Image (statology.org): a normal distribution of student heights with mean μ = 70 inches.]
As you can see, there is a central peak at 70 inches, where the heaviest concentration of students is observed. The concentration decreases symmetrically as you evaluate heights farther from the mean (μ = 70).
Standard Deviation & Variance
The standard deviation, denoted by σ (sigma), specifies how the data in a dataset is distributed/concentrated and can tell us whether the data is homogeneous or has lots of variety. It is essentially a measure of the typical distance of an observed value from the mean and is discussed in relation to μ. The higher the value of σ, the more spread out the dataset is. For our example above, σ = 2, which is why the heights on the axis are marked in increments of 2 inches away from the mean.
All datasets fitting a normal distribution will have the following characteristics in common (the values in inches only refer to the example provided above):
μ to μ + 1σ (from 70-72 inches) contains 34.1% of the observations in the dataset, and the same amount is contained from μ - 1σ to μ (68-70 inches).
μ + 1σ to μ + 2σ (72-74 inches) and μ - 1σ to μ - 2σ (66-68 inches) each contains 13.6% of the data.
μ + 2σ to μ + 3σ (74-76 inches) and μ - 2σ to μ - 3σ (64-66 inches) each contains 2.1% of the data.
These bullet points are illustrated in the following image:
[Image (nextviewventures.com): a bell curve marked with the percentage of data falling within each standard deviation band.]
Knowing these different percentages can also help answer a multitude of questions concerning normally distributed data. For example, using the distribution of student heights in the previous section, you can say with 95.4% confidence that a student, if randomly chosen from the dataset, will have a height between 66-74 inches (μ - 2σ to μ + 2σ), simply by adding the percentages of data points relevant to this range.
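If SciPy is available, these percentages can be verified directly from the normal cumulative distribution function (a minimal sketch, assuming SciPy is installed):

```python
from scipy.stats import norm

mu, sigma = 70, 2  # the student-height example above

# P(66 <= height <= 74), i.e. within two standard deviations of the mean
p = norm.cdf(74, mu, sigma) - norm.cdf(66, mu, sigma)
print(f"{p:.1%}")  # 95.4%
```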
To represent variation in a dataset quantitatively, we calculate the variance, denoted by σ^2, of the dataset. Variance is equal to the square of standard deviation, which also means that standard deviation is equal to the square root of variance. Variance is calculated using the following formula:
σ^2 = Σ(x - μ)^2 / n, where the sum runs over every observation x in the dataset, μ is the dataset mean, and n is the number of observations.
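Continuing the mean example from earlier, a minimal sketch computes the variance from this formula and checks it against the statistics module's population variance:

```python
import statistics

x = [2, 4, 6, 8, 10]
mu = statistics.mean(x)                         # 6
var = sum((xi - mu) ** 2 for xi in x) / len(x)  # the formula above
print(var)                                      # 8.0
print(statistics.pvariance(x))                  # 8.0 -- population variance, same result
print(statistics.pstdev(x))                     # ~2.83 -- the square root of the variance
```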
Other Data Distributions
The distribution of data in a dataset can take many forms, not just the normal distribution. Other distributions include left-skewed distributions, where the low-frequency tail of values stretches out to the left, and right-skewed distributions, where the tail stretches out to the right, as illustrated below:
[Image (mathwave.com): examples of left-skewed and right-skewed distributions.]
However, it is much easier to work with normally distributed data, and there are many techniques for transforming a skewed distribution toward a normal one, such as applying a logarithm (see below).
[Image (khanacademy.org): a logarithm transformation applied to skewed data.]
For example, if dataset x = [1, 10, 100], then applying a base-10 logarithm turns the dataset into x = [0, 1, 2].
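A one-liner with the standard library confirms this (a base-10 log is assumed):

```python
import math

x = [1, 10, 100]
print([math.log10(v) for v in x])  # [0.0, 1.0, 2.0]
```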
Z-Score
The z-score tells us how many standard deviations a value is away from the mean. It is calculated with the following formula, where x is the value ("score") being evaluated:

z = (x - μ) / σ
If a value does NOT fall in the range μ - 3σ to μ + 3σ, then the value is an outlier. Therefore, any z-score that does NOT fall in the range -3 to 3 is considered an outlier.
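That rule translates directly into code:

```python
def z_score(value, mu, sigma):
    """How many standard deviations `value` is from the mean."""
    return (value - mu) / sigma

def is_outlier(value, mu, sigma):
    """Flag values whose z-score falls outside the range -3 to 3."""
    return abs(z_score(value, mu, sigma)) > 3

print(z_score(73, 70, 2))     # 1.5
print(is_outlier(77, 70, 2))  # True  (z = 3.5)
print(is_outlier(73, 70, 2))  # False
```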
Application of Z-Score
Let's take a look at the example distribution of heights of a class of students again:
[Image (statology.org): the normal distribution of student heights with μ = 70 inches and σ = 2 inches.]
We will use the concept of z-score to answer the following question: What is the probability of a student having a height greater than 73?
The first step is to apply the z-score formula to find the z-score for a height of 73, which results in (73 - 70)/2 = 1.5. Then, we must use a z-table to convert this z-score into a probability, which corresponds to the shaded region of the bell curve:
[Image (z-table.com): a z-table lookup; the shaded region of the bell curve represents the returned probability.]
At z = 1.5, the value returned (given as a percentage) is 93.32%. However, this percentage represents the probability of a student having a height less than 73. To find the probability of a student having a height greater than 73, we compute 100% - 93.32% = 6.68% as our final answer.
Note: whenever you are using the z-table, remember that the probability is always referring to the shaded region of the bell curve.
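If you prefer code to a printed z-table, SciPy's normal CDF returns the same probability (a minimal sketch, assuming SciPy is installed):

```python
from scipy.stats import norm

z = (73 - 70) / 2    # 1.5
p_less = norm.cdf(z) # ~0.9332 -- what the z-table returns
print(f"P(height > 73) = {1 - p_less:.2%}")  # P(height > 73) = 6.68%
```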
Summary
In this post, we covered a number of statistics concepts that are important to AI/ML, including central tendency, the interquartile range, outliers and leverage points, different types of data distributions, standard deviation and variance, and the applicational value of the z-score. Many more statistics concepts will be covered in later posts, but it is important to review these fundamentals occasionally; doing so is the key to succeeding as a data scientist and an expert in AI/ML.
If you enjoyed this post, be sure to subscribe to receive notifications on more content delivered directly to your inbox :D.