Introduction to AI/ML and Data Science

This post is a brief introduction to the world of artificial intelligence and machine learning, and to the central role data science plays in this field.

What is Artificial Intelligence?

Artificial intelligence (AI) is the simulation of human intelligence by computers to carry out different tasks. Common areas of AI include artificial neural networks, natural language processing (interpreting information related to human language), and computer vision (interpreting information related to images and video).

What is Machine Learning?

Computer scientist Tom Mitchell provides a modern definition of machine learning:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

In essence, if a computer program improves its performance at a certain task as it gains more experience, then the program is using machine learning. Machine learning is a subset of AI.

What is Deep Learning?

Deep learning is a subset of machine learning that accomplishes tasks by loosely mimicking the structure of the human brain. Deep learning systems are centered around artificial neural networks (ANNs), which are computing systems inspired by how biological neural networks in the human brain process information. Because of deep learning, we are able to have self-driving cars, virtual assistants (e.g., Apple's Siri), facial recognition, and many other applications.

The Importance of Data

The terms "artificial intelligence" and "machine learning" have existed for over half a century, but what caused the field of AI/ML to rapidly evolve in recent years (hint: look at the subtitle of this section)?

Yep, you guessed it. This field has evolved so rapidly in recent years because of the massive amounts of data now available. With the advent of big data, storing that data is no longer a problem. In addition, the computational power of modern computer systems has been improving year after year, allowing computations to be done faster.

In fact, machine learning relies on data so heavily that its methods require expertise in data science as well (data science is a field of study that uses different scientific approaches and algorithms to extract knowledge and insights from data).

In order to be an expert in AI/ML, one must be an expert in analyzing data; essentially, one must become a good data scientist.

Data Data Data!

Now that we have established the importance of data in AI/ML, we must explore what is possible with data. Data can be used in descriptive, predictive, and prescriptive analytics, as summarized in the following graphic:

Descriptive Analytics

The table below shows different data associated with four different employees working for the same company.

An example of a descriptive analytics question would be "what are the names of the employees with a grade of at least 6?" According to the data in the table, the answer would be Rob, Joe, and Todd.

The purpose of descriptive analytics is to analyze the given data and extract meaningful information from it.
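A descriptive question like this one boils down to filtering and summarizing data that already exists. Here is a minimal Python sketch of the idea. The rows are a made-up stand-in for the table: the name "Ann" and every grade value are assumptions for illustration, chosen only so that the result matches the answer given above.

```python
# Hypothetical stand-in for the employee table. "Ann" and all grade
# values are invented for illustration only.
employees = [
    {"name": "Rob", "grade": 6},
    {"name": "Joe", "grade": 8},
    {"name": "Ann", "grade": 4},
    {"name": "Todd", "grade": 9},
]

# Descriptive question: which employees have a grade of at least 6?
at_least_6 = [e["name"] for e in employees if e["grade"] >= 6]
print(at_least_6)
```

Nothing is predicted here; the code only reports what the data already says.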

However, what would your answer be if the question were "who is the best employee?" Would it be Todd? Or maybe Joe? To answer this question effectively, it is very important to note that data scientists can NOT be biased; to prove something, there have to be numbers and evidence.

Predictive Analytics

The table below shows the score percentages on the same mathematics test from students studying for varied amounts of time.

A question requiring predictive analytics may be "Based on the data given, what would be the resulting score of a student studying for 5 hours?" You may answer with "100%", but how exactly did you come to this conclusion? First, you looked at the past data given. Then, you found patterns within that data and used those patterns to arrive at a solution.

This is the core of predictive analytics: analyzing past data, extracting patterns from that data, and then using those patterns to predict the future.

"The future is only the past again, entered through another gate."

- Arthur Wing Pinero

Of course, in many real-world cases, data won't be this simple to analyze. It will be much more complex, and we won't be able to compute a final answer by ourselves. Therefore, we get help from an algorithm (simply put, a series of mathematical formulas and calculations) that applies the same concept.

First, past data is taken and fed to an algorithm. Then, the algorithm will generate different patterns, which in turn can be used to make predictions.

Although this all seems rather complex, a basic implementation of an algorithm in Python takes not 100, not 50, but only 4 lines of code!

Prescriptive Analytics

The table above displays ratings that different users have given to different shows. A prescriptive analytics question may be "based on the ratings User 1 has given to the shows they watched, which show should be recommended to User 1 next?" Internally, a prescriptive analytics problem carries layers of predictive analytics.

Forming a recommendation based on past data points is the core of prescriptive analytics.
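The sketch below shows one simple way this could work, and how prediction sits inside the recommendation. It first predicts how User 1 would rate each unseen show by weighting other users' ratings by how similar those users are to User 1, and then recommends the show with the highest predicted rating. Everything in it is an assumption for illustration: the users, shows, and ratings are invented, and the `similarity` and `recommend` helpers are hypothetical names, not part of any library.

```python
# Hypothetical ratings on a 1-5 scale. All users, shows, and values
# are invented for illustration.
ratings = {
    "User 1": {"Show A": 5, "Show B": 1},
    "User 2": {"Show A": 5, "Show B": 2, "Show C": 5, "Show D": 1},
    "User 3": {"Show A": 1, "Show B": 5, "Show C": 2, "Show D": 5},
}

def similarity(a, b):
    """Closeness of two users on the shows both rated (1.0 = identical)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[s] - b[s]) for s in shared) / len(shared)
    return 1.0 / (1.0 + mean_gap)

def recommend(target, ratings):
    """Return the unseen show with the highest predicted rating."""
    seen = ratings[target]
    totals, weights = {}, {}
    for user, theirs in ratings.items():
        if user == target:
            continue
        w = similarity(seen, theirs)
        for show, rating in theirs.items():
            if show not in seen:
                totals[show] = totals.get(show, 0.0) + w * rating
                weights[show] = weights.get(show, 0.0) + w
    # Predictive step: similarity-weighted average rating per unseen show.
    predicted = {s: totals[s] / weights[s] for s in totals if weights[s] > 0}
    # Prescriptive step: pick the show with the best predicted rating.
    return max(predicted, key=predicted.get) if predicted else None

print(recommend("User 1", ratings))  # prints Show C
```

In this made-up data, User 2 rates shows much like User 1 does, so User 2's favorite unseen show ends up as the recommendation.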

Problems in Extracting Patterns

You may be wondering: "if it's this simple of a job, then why do people struggle so much in this field?" The truth is that the coding itself is typically not the most difficult part of solving problems requiring AI/ML. Difficulty often arises from the data itself: problems in the data can keep algorithms from learning patterns correctly. Several common scenarios cause this:

1. Data Predominantly Supporting One Subject

The following table, which shows the number of cars owned by households in different cities in Texas, illustrates this problem:

As you can see, a very large majority of the data points come from households in Dallas, which limits how accurately we can predict the number of cars owned by a household in a city other than Dallas, e.g., Houston or Austin.
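A quick way to spot this kind of imbalance is to count how many rows each group contributes before training anything. The sketch below does that with `collections.Counter`. The rows are invented for illustration, and the 50% cutoff is an arbitrary assumption, not a standard threshold.

```python
from collections import Counter

# Hypothetical rows of (city, cars owned), invented for illustration.
households = [
    ("Dallas", 2), ("Dallas", 1), ("Dallas", 3), ("Dallas", 2),
    ("Dallas", 2), ("Dallas", 1), ("Houston", 2), ("Austin", 1),
]

# Count how many rows each city contributes.
city_counts = Counter(city for city, _ in households)
total = len(households)
for city, n in city_counts.most_common():
    print(f"{city}: {n}/{total} rows ({n / total:.0%})")

# Flag the data set if a single city supplies more than half the rows.
# The 0.5 cutoff is an arbitrary choice for this sketch.
dominant_city, dominant_n = city_counts.most_common(1)[0]
is_imbalanced = dominant_n / total > 0.5
print(dominant_city, is_imbalanced)
```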

2. Duplicate Data

The table above shows names of countries and names of the nation associated with each country. These two columns contain exactly the same data points. Because the first column, titled "Country", already exists, the second column is redundant in this data set.
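Redundant columns like this can be detected by comparing column values directly. Below is a minimal sketch that keeps the first of any columns whose values are identical. The table contents, the "Nation" and "Continent" column names, and the `drop_duplicate_columns` helper are all assumptions made for illustration.

```python
# Hypothetical table stored as column name -> list of values.
# The rows and the extra "Continent" column are invented for illustration.
table = {
    "Country": ["France", "Japan", "Brazil"],
    "Nation": ["France", "Japan", "Brazil"],
    "Continent": ["Europe", "Asia", "South America"],
}

def drop_duplicate_columns(table):
    """Keep only the first of any columns whose values are identical."""
    kept = {}
    for name, values in table.items():
        if not any(values == other for other in kept.values()):
            kept[name] = values
    return kept

deduplicated = drop_duplicate_columns(table)
print(list(deduplicated))  # prints ['Country', 'Continent']
```

Note that this only catches exact duplicates. Columns that carry the same information in a different form (for example, a country name and a country code) would need a different check.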

3. Missing Data

The table above shows demographics of a sample population from a city. Because data points are missing, the algorithm has incomplete information to work with and cannot extract patterns accurately and effectively.
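Before handing data to an algorithm, it helps to measure how much of it is missing. The sketch below counts missing values per column and then keeps only the complete rows, which is just one of several possible ways to handle gaps. The sample rows and column names are invented for illustration, with `None` standing in for a missing value.

```python
# Hypothetical sample; None marks a missing value.
# All rows and column names are invented for illustration.
people = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 45, "income": 48000},
]

# Count the missing values in each column.
missing_per_column = {
    column: sum(1 for row in people if row[column] is None)
    for column in people[0]
}
print(missing_per_column)

# One simple option: keep only rows with no missing values.
complete_rows = [row for row in people if None not in row.values()]
print(len(complete_rows), "of", len(people), "rows are complete")
```

Dropping rows is the bluntest option, since it throws away the values that were present in those rows too. Here it discards half of this tiny sample.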

Summary

In this post, we briefly discussed what artificial intelligence and machine learning are, and the importance of data in this fast-growing field. We described the different types of data analytics (descriptive, predictive, and prescriptive) that are fundamental to understand in order to be an expert in the field of AI/ML.