Iliya Valchanov 27 Jul 2021 8 min read
Learning how to include dummy variables in a regression is a great way to round off your introduction to the world of linear regressions. Another useful concept you can learn is Ordinary Least Squares. But now, onto dummy variables. Apart from its colloquial use, the word “dummy” has another meaning – an imitation or a copy that stands as a substitute.
In regression analysis, a dummy is a variable that is used to include categorical data in a regression model. In previous tutorials, we have only used numerical data. We did that when we first introduced linear regressions and again when we explored the adjusted R-squared. However, numerical data naturally sits on a scale, while categories such as gender or season do not. It’s time to find out how to include such variables in a regression.
Firstly, make sure to check the article where we took our first steps into the world of linear regressions. We will be using the SAT-GPA example from there. If you don’t have time to read it, here is a brief explanation: based on the SAT score of a student, we can predict their GPA. Now, we can improve our prediction by adding another regressor – attendance.
In the picture below, you can see a dataset that includes a variable that measures if a student attended more than 75% of their university lectures.
Keep in mind that this is categorical data, so we cannot simply put it in the regression.
We will start off by going through the process of using a dummy and explain it later.
The time has come to write some code. We can begin by importing the relevant libraries by writing:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
After that, let’s load the file ‘1.03. Dummies.csv’ into the variable raw_data. You can download the file from here. If you don’t know how to load it, here’s what you need to type:
raw_data = pd.read_csv('1.03. Dummies.csv')
Now, let’s simply write raw_data and see what happens.
As you can tell from the picture above, there is a third column named ‘Attendance’. It reflects if a student attended more than 75% of the lessons with two possibilities – Yes and No.
What we would usually do in such cases is to map the Yes/No values with 1s and 0s. In this way, if the student attended more than 75% of the lessons, the dummy will be equal to 1. Otherwise, it will be a 0.
So, we will have transformed our yes/no question into 0s and 1s. That’s what the dummy name stands for – we are imitating the categories with numbers.
In pandas, that’s done quite intuitively.
Let’s create a new variable data equal to a copy of raw_data. Then, we have to overwrite the series ‘Attendance’ in the data frame, using the proper syntax to map Yes to 1 and No to 0. Finally, we can display data and find out if we have done our job.
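In pandas, these steps can be sketched as follows. Since the CSV itself isn’t shown here, a tiny made-up frame stands in for raw_data; its values are illustrative:

```python
import pandas as pd

# Tiny made-up stand-in for raw_data; the real data comes from '1.03. Dummies.csv'
raw_data = pd.DataFrame({
    'SAT': [1714, 1664, 1760],          # illustrative scores
    'GPA': [2.40, 2.52, 2.54],          # illustrative GPAs
    'Attendance': ['No', 'Yes', 'Yes']  # the categorical series
})

# Work on a copy so the original frame stays untouched
data = raw_data.copy()

# Map the two categories to numbers: Yes -> 1, No -> 0
data['Attendance'] = data['Attendance'].map({'Yes': 1, 'No': 0})
print(data)
```

Note that map replaces each value in the series by looking it up in the dictionary, which is exactly the yes/no-to-1/0 substitution we described.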
As you can see in the picture above, we have successfully created a dummy variable! The categorical data in the series was replaced or mapped to numerical.
Let’s take a look at the descriptive statistics of the variables. We can do that by writing:
data.describe()
The mean of ‘Attended more than 75%’ is 0.46, as shown below.
The fact that the mean is less than 0.5 tells us that there are more 0s than 1s. Moreover, since the only values are 0 and 1, the mean equals the share of 1s: 46% of the students attended more than 75% of the lessons.
In any case, now we can create a regression that explains GPA taking both SAT scores and attendance into consideration.
We can load GPA into the variable y, and SAT and ‘Attendance’ into the variable x1. This is the code we need to run:
y = data['GPA']
x1 = data[['SAT', 'Attendance']]
We must use the statsmodels method for adding a constant. Then we can fit the regression and get the summary as before.
As you can see in the picture below, our overall model is significant,
the SAT score is significant, and the dummy variable is significant.
The adjusted R-squared of this model is 0.555, a solid improvement over the simple regression that uses the SAT score alone.
A model without the dummy variable would be:
GPA = 0.275 + 0.0017 * the SAT score of a student.
The model, including the dummy variable is:
GPA = 0.6439 + 0.0014 * the SAT score of a student + 0.2226 * the dummy variable.
Now, we said that the dummy is 0 or 1, so actually we can represent this equation with two others.
If the student did not attend, the dummy would be 0. So, 0.2226 * 0 is 0. The model becomes GPA = 0.6439 + 0.0014 * SAT.
If the student attended, the dummy variable would be 1, so the model becomes:
GPA = 0.6439 + 0.0014 * SAT + 0.2226.
Let’s add the intercept and the coefficient of the dummy together: 0.6439 + 0.2226 = 0.8665. So, for students who attended, we get GPA = 0.8665 + 0.0014 * SAT.
There will be two equations, which we can call yhat_no and yhat_yes. They will represent the two equations we just talked about. Certainly, we could parametrize these equations, but there is no need for that in such a simple example.
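With the coefficients from the summary, the two lines can be drawn like this (the SAT range here is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

sat = np.linspace(1600, 2000, 100)  # illustrative SAT range

# Coefficients taken from the fitted model
yhat_no = 0.6439 + 0.0014 * sat    # dummy = 0: did not attend
yhat_yes = 0.8665 + 0.0014 * sat   # dummy = 1: attended

plt.plot(sat, yhat_no, c='green', label='Did not attend')
plt.plot(sat, yhat_yes, c='red', label='Attended')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.legend()
plt.show()
```

Because the two lines share the slope 0.0014, the vertical gap between them is the dummy coefficient, 0.2226, at every SAT score.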
So, what we observe above are two equations that have the same slope but a different intercept. The students who attended are spread around the upper line.
On average, their GPA is 0.2226 higher than the GPA of students who did not attend.
We can even think about these as two separate regressions. We can color the points: the ones that refer to students who attended classes are spread around the red line, while those for students who did not attend sit around the green line.
You can clearly see the difference now.
Finally, we will put the original regression line on the graph.
As you can see, it is steeper and goes somewhat between the two lines of the dummies.
To use this model for prediction purposes, we need two pieces of information: an SAT score and whether a person attended more than 75% of their lectures.
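As a sketch, here is a hypothetical helper built from the rounded coefficients in the summary (the function name and the example score of 1700 are ours, not the article’s):

```python
# Hypothetical helper using the rounded coefficients from the fitted model
def predict_gpa(sat, attended):
    """Predict GPA; attended is 1 if the student attended >75% of lectures, else 0."""
    return 0.6439 + 0.0014 * sat + 0.2226 * attended

# Same SAT score, with and without attendance
print(predict_gpa(1700, 1))
print(predict_gpa(1700, 0))
```

The two predictions differ by exactly 0.2226 – the dummy coefficient.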
So, why do we even need dummy variables? We can simply use numerical data all the time and we won’t have any problems, right?
Well, not really. Whenever we have to represent categorical data, such as brands, a dummy is what we need.
As you can see, the process of creating a regression even with categorical data is not brain surgery.
Interested in learning more? You can take your skills from good to great with our statistics course!
Co-founder of 365 Data Science
Iliya is a finance graduate with a strong quantitative background who chose the exciting path of a startup entrepreneur. He demonstrated a formidable affinity for numbers during his childhood, winning more than 90 national and international awards and competitions through the years. Iliya started teaching at university, helping other students learn statistics and econometrics. Inspired by his first happy students, he co-founded 365 Data Science to continue spreading knowledge. He authored several of the program’s online courses in mathematics, statistics, machine learning, and deep learning.