This post is the second in a series of informational statistics posts providing essential knowledge for succeeding in the field of AI/ML. Topics in this post include:
Types of data variables
Hypothesis Testing
Degrees of Freedom
The ANOVA Test
Correlation
The Chi-Squared Test
Data Variable Types
In a general sense, all data variables can be categorized into categorical variables and numerical variables.
Categorical Variables
Categorical variables are variables that have only a limited number of possible values. These variables are not numeric but descriptive in nature; they tend to act like "labels."
Within the overarching umbrella of "categorical variables," there are ordinal and nominal variables. Ordinal variables are categorical variables whose possible values exist in a specific order. For example, if "Movie Rating" were an ordinal variable, its possible values might be "Very Bad," "Bad," "Okay," "Good," and "Very Good." Nominal variables are simply categorical variables whose values have no specific order, e.g., city names.
Numerical Variables
Numerical variables, as the name implies, are numeric in nature. Unlike categorical variables, they are not descriptive (no "labels").
It's very important to distinguish between the two main types of numerical variables: continuous and discrete. Continuous variables are numerical variables that can hold an infinite number of possible values within a range (e.g., income), whereas discrete variables are numerical variables that can only hold a limited number of possible values (e.g., number of shoes owned). Oftentimes, discrete variables are treated like categorical variables.
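To make these four variable types concrete, below is a minimal sketch using pandas (the column names and values are hypothetical, chosen to mirror the examples above):

```python
import pandas as pd

df = pd.DataFrame({
    # Nominal: categories with no inherent order
    "city": ["Austin", "Boston", "Austin"],
    # Ordinal: categories with a defined order
    "movie_rating": pd.Categorical(
        ["Good", "Very Bad", "Okay"],
        categories=["Very Bad", "Bad", "Okay", "Good", "Very Good"],
        ordered=True,
    ),
    # Continuous: any value within a range
    "income": [52000.50, 61000.75, 48000.00],
    # Discrete: a limited set of whole-number values
    "shoes_owned": [3, 5, 2],
})

print(df.dtypes)
print(df["movie_rating"].min())  # ordered categories support min/max: "Very Bad"
```

Marking movie_rating as an ordered categorical tells pandas the label order, which a nominal column like city does not have.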
Hypothesis Testing
A hypothesis is essentially a belief or assumption that is evaluated through further study and investigation of different pieces of evidence. A hypothesis test is a statistical test used to validate assumptions we make about datasets. Hypothesis testing includes the following key steps:
Make observations.
Create an alternative hypothesis (denoted by H1 or HA), which is the hypothesis that is directly formed from the observations.
Create the null hypothesis (denoted by H0), which serves as the "anti-hypothesis."
Test how likely our observations would be if H0 were true.
Obtain a result/verdict.
When we are using hypothesis testing, it is important to note that we are testing the null hypothesis, instead of directly testing the alternative hypothesis. Therefore, to accept H1, we must be able to reject H0. The test produces a p-value: the probability of obtaining results at least as extreme as our observations if H0 were true.
If the p-value is less than or equal to 5%, then H0 is considered invalid, and therefore, we can accept H1.
If the p-value is between 5% and 40%, then we do NOT have enough evidence to reject H0/accept H1.
If the p-value is greater than 40%, then H0 is retained and H1 is rejected.
Application of Hypothesis Testing
To apply this concept, suppose the residents of a small town had an average lifespan of 75 years in 2019. In 2020, a survey was completed by 50 residents of the small town, and an average lifespan of 80 years was calculated. Our task is to determine whether the town's average lifespan has increased.
Because the average lifespan calculated from the 2020 survey is greater than the town's average lifespan in 2019, our alternative hypothesis is that the average lifespan is now greater than 75 years. Therefore, our null hypothesis is that the average lifespan is 75 years or less.
Suppose the test yields a p-value of 2%. Because 2% is below our 5% threshold, we can reject H0 and accept HA. Therefore, the average lifespan of the town has increased.
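As an illustration, here is a minimal sketch of this example as a one-sample t-test in SciPy. The 50 survey values are synthetic stand-ins (the raw data isn't listed in the post), so the exact p-value will differ:

```python
import numpy as np
from scipy import stats

# Hypothetical 2020 survey: 50 lifespans centered near 80 years
rng = np.random.default_rng(0)
survey = rng.normal(loc=80, scale=10, size=50)

# H0: mean lifespan <= 75; HA: mean lifespan > 75 (one-sided test)
result = stats.ttest_1samp(survey, popmean=75, alternative="greater")

print(f"p-value: {result.pvalue:.4f}")
if result.pvalue <= 0.05:
    print("Reject H0: the average lifespan has increased.")
else:
    print("Not enough evidence to reject H0.")
```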
Degrees of Freedom
Degrees of freedom (df) refer to the number of values in a calculation that are free to vary. Typically, you will find that df = n - 1, where n is the total number of observations: once a statistic such as the mean is fixed, only n - 1 of the observations can vary freely, because the last value is then fully determined.
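A quick sketch of why that last value is "forced" once the mean is known (all numbers here are hypothetical):

```python
# With 4 observations and a fixed mean, only 3 values are free to vary.
values = [2, 5, 7]               # three freely chosen observations
mean = 6                         # the known mean of all 4 observations
forced = mean * 4 - sum(values)  # the 4th value has no freedom: 24 - 14
print(forced)                    # 10
```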
ANOVA
ANOVA (analysis of variance) tests if there is a significant amount of variance between 3 or more groups of data. The null hypothesis (H0) is that all group means are equal; the alternative hypothesis (HA) is that at least one group mean differs from the others.
ANOVA is best understood through an example.
For example, suppose you want to find out if there is a significant amount of variance between the ratings of 3 different movies, based on the ratings of 4 users who viewed and rated all the movies.
The group mean is the mean of values within one group, so here it is the mean of the ratings for each movie individually. The grand mean is the average of all three group means. All of this info is used to calculate the F statistic (denoted as Fcalc below), where m is the number of groups, n is the number of elements in each group, and N = m * n is the total number of observations. Fcalc is the ratio of the between-group variance, n * Σ(group mean - grand mean)^2 / (m - 1), to the within-group variance, Σ(value - its group mean)^2 / (N - m).
The next step is calculating the F critical value (denoted as Fcrit) by using an F-table. The df in the numerator is m - 1, and the df in the denominator is N - m; in our example, the numerator df = 3 - 1 = 2, and the denominator df = 12 - 3 = 9. At α = 0.05, Fcrit ≈ 4.26.
Fcrit and Fcalc are then compared: if Fcalc > Fcrit, we reject H0 and accept HA; otherwise, we cannot reject H0.
In our example, because Fcrit > Fcalc (4.26 > 1.31), we cannot reject H0/accept HA, meaning that the three movies' mean ratings do not differ significantly; the movies are rated quite similarly.
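For reference, here is a minimal sketch of this kind of one-way ANOVA with SciPy. The ratings are hypothetical (the post's original table isn't reproduced here), so the resulting statistics will differ from the worked example:

```python
from scipy import stats

# Hypothetical ratings of 3 movies by the same 4 users
movie_a = [4, 5, 3, 4]
movie_b = [3, 4, 4, 4]
movie_c = [5, 3, 4, 5]

# F statistic and p-value for the one-way ANOVA
f_calc, p_value = stats.f_oneway(movie_a, movie_b, movie_c)

# Critical value at alpha = 0.05 with df = (m - 1, N - m) = (2, 9)
f_crit = stats.f.ppf(0.95, dfn=2, dfd=9)

print(f"F_calc = {f_calc:.2f}, F_crit = {f_crit:.2f}, p = {p_value:.3f}")
if f_calc > f_crit:
    print("Reject H0: at least one movie's mean rating differs.")
else:
    print("Cannot reject H0: no significant difference between the movies.")
```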
Correlation
(Note: what is discussed in this section on correlation CANNOT be applied to categorical variables.)
In statistics, the term "correlation" refers to the statistical relationship between two continuous variables.
In order to calculate correlation, we must first calculate the covariance between the two continuous variables. Covariance is easily calculated if both variables have the same number of observations. For two continuous variables X and Y, where Xi and Yi represent the i-th paired observations, X̄ and Ȳ represent the means of the two distributions, and n is the number of observations in each distribution, the covariance is given by the following formula: Cov(X, Y) = Σ (Xi - X̄)(Yi - Ȳ) / n.
However, covariance as a statistic is not very useful by itself, because the relationship between continuous variables with different units of measurement (e.g., inches and pounds) can appear skewed from reality when expressed through covariance alone. This is why we calculate the correlation statistic (denoted by R), which shows a more realistic relationship between the two variables by dividing the covariance by the product of the standard deviations of distributions X and Y: R = Cov(X, Y) / (σX * σY).
Below is an example of calculating correlation for continuous variables X and Y; the calculated correlation was approximately 0.609. Correlation values always lie between -1 and 1. As R approaches 0, the correlation between the two variables becomes weaker and weaker; likewise, as R approaches either -1 or 1, the correlation becomes stronger and stronger.
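Here is a minimal sketch of both calculations with NumPy, using hypothetical X and Y values (so the result will differ from the 0.609 above):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

n = len(X)
covariance = np.sum((X - X.mean()) * (Y - Y.mean())) / n  # population covariance
correlation = covariance / (X.std() * Y.std())            # divide by both std devs

print(f"covariance  = {covariance:.3f}")
print(f"correlation = {correlation:.3f}")
print(np.corrcoef(X, Y)[0, 1])  # NumPy's built-in Pearson R gives the same value
```

Note that np.corrcoef matches the manual calculation: dividing by the standard deviations cancels out whichever n convention the covariance uses.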
Chi-Square
A chi-square test is used to test if two categorical variables are related to each other or not. In other words, we use the chi-square test to find out if categorical variables are dependent upon each other. The null hypothesis (H0) is that the two variables are independent; the alternative hypothesis (HA) is that they are dependent (related).
This test is also best understood through an example.
Let's say you are trying to find out if the personality type of a person affects their color preference. Let's also assume that you surveyed 200 people, who were either extroverts or introverts, and asked them to choose their favorite color from 4 color options. Let's denote this resulting table as the "actual values" table.
In order to calculate the values we need for the testing process, we must construct a table with all the numerical values cleared except for the grand, column, and row totals. We do this because we want to calculate the expected values for each personality-color combination if the two categorical variables had no association.
To calculate a particular personality-color combination in this "expected values" table, we multiply the row total for the specific personality type (denoted as "Row Total") by the column total for the specific color (denoted as "Column Total"). Then, we divide this value by the total number of people in the survey (denoted as "Grand Total"): Expected Value = (Row Total * Column Total) / Grand Total.
For example, in the "expected values" table above, the entry signifying how many introverts would prefer the color red if no association between personality and color existed would be the "introvert" row total (150) times the "red" column total (40), all divided by the "grand total" (200). This results in the value 30. Repeat this process for every possible combination between the two variables, and the resulting values are filled into the table below.
The next step is to create a "final results" table. Here, for each personality-color combination, we use the corresponding values from the "expected values" and "actual values" tables in the formula (Actual - Expected)^2 / Expected.
For example, for the introvert-red combination, you would take the corresponding value from the "expected values" table (30) and from the "actual values" table (28) to calculate that combination's final value: (28 - 30)^2 / 30 ≈ 0.133. Repeat for all of the possible combinations (results displayed below), and sum all of the results to obtain the chi-square statistic (denoted as "X2_Calc" below).
With our degrees of freedom value (for chi-square tests, df = (number of rows - 1) * (number of columns - 1), which in this case is (4 - 1) * (2 - 1) = 3), we can find our chi-square critical value from a chi-square table at α = 0.05 (see below).
For df = 3 and α = 0.05, the chi-square critical value (denoted in this blog post as "X2_Crit") is 7.815. As a reminder, the X2_Calc value we calculated earlier was approximately 4.468. The two values are compared as follows: if X2_Calc > X2_Crit, we reject H0 and accept HA; otherwise, we cannot.
In our case, X2_Crit (7.815) > X2_Calc (4.468), so we are not able to reject H0 or accept HA. This means our conclusion for the test is that there is most likely no relationship between personality type and color preference (based, of course, on the data that was collected).
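Finally, here is a minimal sketch of running the same kind of test with SciPy. The counts below are hypothetical stand-ins consistent with the totals mentioned above (introvert-red = 28, row totals of 150 and 50, a red column total of 40); since the post's full table isn't reproduced here, the resulting statistic won't match 4.468 exactly:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x4 "actual values" table: rows are personality types,
# columns are the 4 color options
observed = np.array([
    [28, 45, 52, 25],  # introverts (sums to 150)
    [12, 15, 18, 5],   # extroverts (sums to 50)
])

chi2_calc, p_value, dof, expected = stats.chi2_contingency(observed)
chi2_crit = stats.chi2.ppf(0.95, df=dof)  # critical value at alpha = 0.05

print(f"X2_Calc = {chi2_calc:.3f}, X2_Crit = {chi2_crit:.3f}, df = {dof}")
if chi2_calc > chi2_crit:
    print("Reject H0: the variables appear to be related.")
else:
    print("Cannot reject H0: no evidence of a relationship.")
```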
Summary
In this post, we covered another broad set of statistics concepts important to excelling in AI/ML, including types of data variables, hypothesis testing, degrees of freedom, the ANOVA test, correlation, and the chi-square test. Remember, it is important to review these fundamental concepts occasionally; this is key to succeeding as a data scientist and an expert in AI/ML. Your goal shouldn't necessarily be to memorize these concepts, though; instead, you should know what these newly learned tools can accomplish, applying them later when you deal with data analysis.
If you enjoyed this post, be sure to subscribe to receive notifications on more content delivered directly to your inbox :D.