
Fundamental Math for Data Science

Artificial Intelligence is transforming industries, automating tasks, and unlocking insights that were previously out of reach. At its core, AI aims to simulate human intelligence — enabling machines to learn, reason, and make decisions. However, behind these intelligent systems lies a fundamental building block: mathematics. From developing machine learning algorithms to optimizing decision-making models, math serves as the foundation for all AI technologies. Whether it’s understanding patterns in data or creating models that predict future outcomes, mathematics powers the entire AI process.

Without a solid grasp of key mathematical concepts, it’s nearly impossible to navigate the complexity of AI. In this post, we will provide an overview of the most crucial mathematical principles necessary for mastering AI and Data Science. By focusing on these core concepts, you will build a strong foundation that makes learning advanced techniques significantly more intuitive and manageable.


Basic Algebra

Basic Algebra forms the foundation of many mathematical concepts used in Data Science and AI. It involves understanding basic operations as well as working with exponents, radicals, summations, and factorials. These skills are essential because they underpin more complex operations like matrix manipulations, linear equations, and optimization problems—all of which are crucial for AI algorithms.


Addition, Subtraction, Multiplication, Division

 These are fundamental arithmetic operations necessary for data handling and basic calculations.

Example: Calculating the average (mean) of a dataset involves summing the values and dividing by the total number of entries. For a dataset [2, 4, 6], the mean is calculated as:

Mean = (2 + 4 + 6) / 3 = 12 / 3 = 4
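As a quick sketch, the same calculation in Python:

```python
# Mean: sum the values, then divide by the number of entries
data = [2, 4, 6]
mean = sum(data) / len(data)
print(mean)  # 4.0
```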


Exponents and Radicals

These operations are commonly used in algorithms that involve distance calculations or growth models.

Example: When working with natural language processing (NLP) in AI, one common task is representing words as vectors, a concept known as word embeddings. These vectors allow us to measure the semantic similarity or distance between words, which is crucial for tasks like text classification, sentiment analysis, or word clustering.


Distance = √((x2 - x1)² + (y2 - y1)²), where:


- x1, y1 are the coordinates of the first word in vector space,

- x2, y2 are the coordinates of the second word in vector space.

Let’s take the word vectors for “king” and “queen” as simplified 2D vectors:

king = (0.5, 1.2) 

queen = (0.7, 1.0) 

To calculate the Euclidean distance between “king” and “queen”:

Distance = √((0.7 - 0.5)² + (1.0 - 1.2)²) = √(0.04 + 0.04) = √0.08 ≈ 0.28

This distance provides a measure of how “close” or “similar” the words king and queen are in the vector space. Smaller distances indicate higher semantic similarity between the words.
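Here is a minimal sketch of this calculation in Python, using the simplified 2D vectors from the example (real embeddings typically have hundreds of dimensions):

```python
import math

# Simplified 2D word embeddings from the example above
king = (0.5, 1.2)
queen = (0.7, 1.0)

# Euclidean distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
distance = math.sqrt((queen[0] - king[0]) ** 2 + (queen[1] - king[1]) ** 2)
print(round(distance, 2))  # 0.28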

Summations (∑)

Summation notation is essential for representing the sum of a series of terms, particularly in optimization problems and statistical measures.

Example: When computing the cost function for a machine learning model (such as linear regression), summation is used to aggregate the errors across all data points. For instance, the sum of squared errors (SSE) is written as:

SSE = Σ (y_i - ŷ_i)²
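A short sketch of the SSE computation in Python; the actual values and predictions below are made up for illustration:

```python
# Sum of squared errors between actual values y and predictions y_hat
y = [3.0, 5.0, 7.0]      # actual values (illustrative)
y_hat = [2.8, 5.3, 6.5]  # model predictions (illustrative)

sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
print(round(sse, 2))  # 0.38
```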


Factorials (!)

Factorials play an important role in probability, particularly when dealing with permutations and combinations: for example, the number of ways to choose k items from n is C(n, k) = n! / (k! (n - k)!). In the context of AI, this kind of counting underlies probabilistic models like the Naive Bayes classifier, which is widely used in text classification tasks such as spam filtering, sentiment analysis, and topic categorization.

The key equation behind the Naive Bayes classifier is Bayes' theorem:

P(A|B) = (P(B|A) * P(A)) / P(B)

Where:

- P(A|B) refers to the probability that A will be true given that B is true.

- P(B|A) refers to the probability that B will be true given that A is true.

- P(A) and P(B) refer to the probabilities that A and B will be true, respectively.

For example, suppose we have two classes, walkers and drivers, and we want to predict whether a person will walk or drive based on their age and salary. A new data point (an age and salary pair) arrives, and we draw a small circle around it in the feature space to find similar observations. Suppose the dataset contains 20 walkers and 10 drivers (30 people in total), and 4 observations fall inside the circle: 1 walker and 3 drivers.

P(walkers) = “prior probability” = 20/30

P(drivers) = “prior probability” = 10/30

P(X) = “marginal likelihood” = probability of a point falling inside the circle = 4/30

P(X|walkers) = “likelihood” = walkers inside the circle / all walkers = 1/20

P(X|drivers) = “likelihood” = drivers inside the circle / all drivers = 3/10

P(drivers|X) = (3/10 × 10/30) / (4/30) = 0.75

P(walkers|X) = (1/20 × 20/30) / (4/30) = 0.25


This means that, given the data point is inside the circle, there is a 75% chance that the person is a driver.
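A minimal sketch of this calculation in Python, with the counts taken from the example above:

```python
# Bayes' theorem on the walkers/drivers example
walkers, drivers, total = 20, 10, 30
in_circle_walkers, in_circle_drivers = 1, 3  # observations inside the circle

p_walkers = walkers / total  # prior: 20/30
p_drivers = drivers / total  # prior: 10/30
p_x = (in_circle_walkers + in_circle_drivers) / total  # marginal likelihood: 4/30

# Likelihoods: fraction of each class falling inside the circle
p_x_given_walkers = in_circle_walkers / walkers  # 1/20
p_x_given_drivers = in_circle_drivers / drivers  # 3/10

p_drivers_given_x = p_x_given_drivers * p_drivers / p_x
p_walkers_given_x = p_x_given_walkers * p_walkers / p_x
print(round(p_drivers_given_x, 2), round(p_walkers_given_x, 2))  # 0.75 0.25
```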


Scientific Notation

This allows for the representation of very large or very small numbers, which are common in AI when dealing with large datasets or tiny probabilities.

Example: A probability of 1.5 x 10^-10 might appear in a machine learning model when calculating the likelihood of rare events.
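In Python, such values are written in scientific notation directly. One common practice (an assumption here, not part of the example above) is to work with log-probabilities, since multiplying many tiny values can underflow to zero:

```python
import math

p = 1.5e-10  # a tiny probability in scientific notation
print(f"{p:.1e}")  # 1.5e-10

# Products of tiny probabilities underflow, so sums of logs are used instead:
# log(a * b) = log(a) + log(b)
log_product = math.log(p) + math.log(p)
print(log_product)  # log of p * p, computed safely
```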


One real-world example where algebra is heavily used in AI is in linear regression models, which predict an outcome based on input data. Linear regression uses an algebraic equation to describe the relationship between the independent variable(s) and the dependent variable:

y = mx + b

Where:

• y is the predicted output,

• m is the slope (weight),

• x is the input data (feature),

• b is the y-intercept (bias).

Let’s say we have a linear model that predicts salary based on experience for a certain industry. We know that for every additional year of experience, the salary increases by $5,000. Also, the base salary (with 0 years of experience) starts at $40,000. The equation for predicting salary would then look like this:

Salary = 5000 * Years of Experience + 40000

So if someone has 5 years of experience, their predicted salary would be:

Salary = 5000 * 5 + 40000 = 65000

The slope m = 5000 represents the salary increase per year of experience, and the intercept b = 40000 is the base salary.
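As a quick sketch, the same model in Python:

```python
# Linear model: salary = m * years_of_experience + b
m, b = 5000, 40000  # slope (raise per year) and intercept (base salary)

def predict_salary(years):
    return m * years + b

print(predict_salary(5))  # 65000
```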


In this case, algebra helps you calculate the best-fit line by minimizing the sum of squared errors, as shown in the SSE formula above.


Calculus

Calculus is a fundamental area of mathematics that plays a pivotal role in Data Science and AI. It provides the tools to understand and model changes, rates of change, and accumulation of quantities—concepts that are essential for analyzing dynamic systems. Calculus covers topics such as differentiation, integration, and series, all of which are used in various AI algorithms. From optimizing models using gradient descent to interpreting complex data patterns with integrals, calculus is the mathematical backbone of many advanced operations in machine learning, neural networks, and data modeling.


Series

A series is the sum of the terms of a sequence of numbers. In mathematics, series often appear as infinite sums, where terms are added indefinitely. One of the most common types of series is a geometric series, defined as:

S = a + ar + ar^2 + ar^3 + ... + ar^n = Σ ar^k (summed over k = 0 to n)

Where:

- a is the first term,

- r is the common ratio,

- k is the index of summation, which runs from 0 to n.
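As a quick sketch, here is the series summed term by term in Python, checked against the closed-form result a(1 - r^(n+1)) / (1 - r), which holds when r ≠ 1:

```python
# Geometric series: a + ar + ar^2 + ... + ar^n
a, r, n = 1.0, 0.5, 10

term_sum = sum(a * r ** k for k in range(n + 1))
closed_form = a * (1 - r ** (n + 1)) / (1 - r)
print(term_sum, closed_form)  # both ≈ 1.99902
```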


The Fourier series is also widely used for analyzing signals, particularly in speech recognition and image processing. It helps decompose complex periodic signals into simpler sinusoidal components.

A time series represents a sequence of data points collected or recorded at successive time intervals.

Example: the daily closing price of a stock recorded over a year is a time series; forecasting its future values is a common Data Science task.

Derivatives

The derivative represents the rate of change of a function with respect to one of its variables. It is fundamental in calculus and is written as:

f'(x) = d/dx f(x)

Derivatives are fundamental to minimizing the loss function in machine learning models, particularly through algorithms like gradient descent. In training neural networks, the goal is to reduce the difference between predicted and actual values, a task that requires adjusting model parameters such as weights and biases. By calculating the derivative of the loss function with respect to each parameter, we can measure how small changes in these parameters affect the overall error. Gradient descent leverages this information by updating the parameters in the opposite direction of the gradient (which points toward the steepest ascent of the error), thus reducing the error iteratively.

When dealing with multiple variables, partial derivatives help compute how each parameter contributes to the loss. This process, often called backpropagation in the context of neural networks, ensures that the model converges toward an optimal set of parameters, effectively minimizing the loss.

Example:

Consider a simple linear regression model: y = mx + b. The cost function for this model is the mean squared error (MSE):

MSE = (1/n) * Σ (y_i - (mx_i + b))^2

To update m and b, we calculate the partial derivatives of the MSE with respect to m and b:

∂MSE/∂m and ∂MSE/∂b.

These partial derivatives tell us how to adjust m and b to reduce the error.

Let us consider a simple dataset where we want to predict the relationship between the number of hours studied (input, x) and test scores (output, y). Suppose the dataset is as follows:

x = [1, 2, 3]

y = [2, 4, 5]

We want to fit a line y = mx + b, where m is the slope and b is the intercept. Initially, let's assume the parameters are m = 0 and b = 0.


Step 1: Calculate the Mean Squared Error (MSE)

The Mean Squared Error (MSE) is calculated as follows:

MSE = (1/n) * Σ (y_i - (mx_i + b))^2

For our dataset (n = 3), with initial guesses for m and b:

MSE = (1/3) [(2 - (0 × 1 + 0))^2 + (4 - (0 × 2 + 0))^2 + (5 - (0 × 3 + 0))^2]

MSE = (1/3) [4 + 16 + 25] = (1/3) × 45 = 15


Step 2: Calculate the Partial Derivatives

Next, we calculate the partial derivatives of the MSE with respect to m and b to find how we should adjust these parameters.

Partial derivative with respect to m:

∂MSE/∂m = (2/n) Σ -x_i(y_i - (mx_i + b))

With the initial values of m = 0 and b = 0, the partial derivative becomes:

∂MSE/∂m = (2/3) [ -1(2 - 0) + -2(4 - 0) + -3(5 - 0) ]

∂MSE/∂m = (2/3) × (-25) = -16.67

Partial derivative with respect to b:

∂MSE/∂b = (2/n) Σ -(y_i - (mx_i + b))

With m = 0 and b = 0:

∂MSE/∂b = (2/3) [ -(2 - 0) + -(4 - 0) + -(5 - 0) ]

∂MSE/∂b = (2/3) × (-11) = -7.33


Step 3: Update Parameters

Now that we have the gradients, we can update m and b using gradient descent. Let’s assume a learning rate of 0.01:

New m:

m_new = m_old - α × ∂MSE/∂m

m_new = 0 - 0.01 × (-16.67) = 0.1667

New b:

b_new = b_old - α × ∂MSE/∂b

b_new = 0 - 0.01 × (-7.33) = 0.0733


Step 4: Recalculate MSE with Updated Parameters

With updated parameters m = 0.1667 and b = 0.0733, we can recalculate the MSE and repeat the process until the error is minimized.
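Putting the four steps together, here is a minimal gradient-descent loop in Python for this dataset; the number of iterations is an arbitrary choice for illustration:

```python
# Gradient descent for y = m*x + b on the dataset above
x = [1, 2, 3]
y = [2, 4, 5]
m, b = 0.0, 0.0
alpha = 0.01  # learning rate
n = len(x)

for _ in range(5000):
    # Partial derivatives of the MSE with respect to m and b
    grad_m = (2 / n) * sum(-xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    grad_b = (2 / n) * sum(-(yi - (m * xi + b)) for xi, yi in zip(x, y))
    m -= alpha * grad_m
    b -= alpha * grad_b

mse = (1 / n) * sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
print(round(m, 3), round(b, 3), round(mse, 4))  # m → 1.5, b → 0.667 (the least-squares fit)
```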


Integrals

An integral calculates the area under a curve, and it is the inverse process of differentiation. The most common integral is the definite integral, which is defined as:

∫[a,b] f(x) dx

Where:

- f(x) is the function to be integrated,

- a and b are the limits of integration,

- dx represents an infinitesimally small change in x.

For example, ∫[0,3] x^2 dx equals the area under the curve x^2 on the interval [0, 3]. Its value is x^3/3 evaluated from 0 to 3, which is 27/3 = 9.
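As a quick sketch, the same integral can be approximated numerically in Python with a Riemann sum:

```python
# Approximate the integral of x^2 over [0, 3] with a left Riemann sum
n = 100_000
a, b = 0.0, 3.0
dx = (b - a) / n
area = sum((a + i * dx) ** 2 * dx for i in range(n))
print(round(area, 3))  # ≈ 9.0, matching the exact value 3^3 / 3 = 9
```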

In computer vision, integrals are used in algorithms like convolutional neural networks (CNNs), where an integral-like operation called convolution is used to process image data, filtering key features in the process.

Convolution is a mathematical operation that combines two functions to produce a third function. In the context of AI, convolution is widely used in Convolutional Neural Networks (CNNs) for image processing. The process involves sliding a filter (kernel) over the input image, performing element-wise multiplication, and summing the results.

In terms of integrals, convolution can be written as:

(f * g)(t) = ∫ f(τ) g(t - τ) dτ

Where f and g are two functions, and * denotes convolution. In image processing, this operation helps extract features like edges or textures from an image.

Example:

In a 1D convolution applied to a signal, the kernel slides over the input signal, performing the convolution operation to create a transformed signal.


 In 2D convolution (used in CNNs), a filter slides over an image, applying the convolution operation to produce a feature map.
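Here is a minimal 1D sketch using NumPy, with an illustrative step-detecting kernel; 2D convolution in CNNs works the same way, with a small matrix sliding over the image:

```python
import numpy as np

# 1D convolution: a difference kernel slides over the signal and
# responds where neighboring values change (the "edges" of the signal)
signal = np.array([0, 0, 1, 1, 1, 0, 0])
kernel = np.array([1, -1])

print(np.convolve(signal, kernel, mode="valid"))  # [ 0  1  0  0 -1  0]
```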


Conclusions

While the mathematical concepts covered here form the backbone of AI and Data Science, they are by no means exhaustive. However, they represent the most critical tools that every AI practitioner should understand intuitively. Mastering these fundamentals equips you to grasp more complex topics, whether it’s developing machine learning models, analyzing data, or building neural networks. The journey of AI is deeply rooted in mathematics, and having a strong foundation allows you to explore new horizons with confidence. As you continue learning, you’ll encounter more advanced topics, but these core principles will always be essential.


