Unit 3: Self-Check Assignment 3: Diabetes Forecasting |
This assignment builds on all of our previous work and introduces you to predictive analytics through a forecasting method called a binary classifier. We will then work on how to visualize and understand a binary classifier.
In this assignment, you will:
- Receive an introduction to binary classifiers, logistic regression, and the results, including true- positive, false-positive, true-negative, and false-negative results
- Run a binary classification algorithm on our diabetes data
- Visualize the results in Tableau
For this assignment, follow these steps:
- Download the diabetes dataset if you need it
- Learn about binary classifiers
- Perform binary classification using a logistic regression in Python (this has been written for you; all you need to do is press ‘run’ in Colab)
- Download the results
- Visualize the results in Tableau
Attachments:
- ipynb
- csv dataset
Download the Diabetes Dataset
If you need to download the dataset again, click on the following link:
Pima Indians Diabetes Database
(We just used this dataset in a previous assignment, so you very well may already have it handy.)
Learn About Binary Classifiers
The word “binary” in this context means “just two options.” Some common binary outcomes could be whether a consumer will respond to direct marketing outreach (binary outcomes: they buy or they don’t buy), whether a streaming subscriber will like a certain movie (binary outcomes: they give it thumbs-up or thumbs-down), or whether an attempted financial transaction is legitimate (binary outcomes: it’s legitimate, or it’s a fraud). The important part of a binary outcome is that there are exactly two options.
A classifier is an algorithm that takes as its input one or more input variables and, as its output, makes a prediction about the value of a different variable. The prediction values are constrained to be on a pre-selected list.
A binary classifier, then, is an algorithm that takes as its input one or more variables and, as its output, classifies the results into one of two mutually exclusive categories:
Problem Domain | Possible Input Variables (can have lots) | Binary Output Variable (2 values only) |
Direct marketing | Age, income, gender of the consumer | Consumer buys or does not buy |
Streaming subscriptions | Other movies they like, age of streamer, subscription price | Thumbs-up or thumbs-down for this movie |
Financial transactions | Dollar amount of transaction, country of origin, frequency of transaction, whether or not the person has bought from this vendor before | Transaction is marked as legitimate, or transaction is flagged as fraudulent |
Question 1: Understanding the Problem
In the diabetes dataset, what is/are the possible input variable(s)? (Input variables are the things we will use to make our prediction.) Select all that apply.
- Glucose
- Insulin
- BMI
- Age
- Blood pressure
- Outcome
Question 2: Understanding the Problem
In the diabetes dataset, what is/are the possible output variable(s)? (An output variable is the thing we want to predict.) Select all that apply.
- Glucose
- Insulin
- BMI
- Age
- Blood Pressure
- Outcome
There are many algorithms which can be used in data science for classification. Exactly how to determine which algorithm should be used, and how to evaluate its results, is beyond the scope of this course. But we will give you a very basic overview of how predictive analytics models work here. In the learning resources for this unit, we have provided a video from StatQuest about logistic regression. His example in predicting obesity in mice is very close to what we are doing here.
Question 3: What We Are Trying to Do Here with Logistic Regression
Which statement most closely resembles what we are trying to do here with our logistic regression binary classifier?
- We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we realize that the relationship might not be linear. If you double the BMI, you might not double the chances of having diabetes.
- We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we expect that the relationship will be linear for all variables. In other words, if you double glucose, you will double the diabetes. If you double insulin, you will double the diabetes. And if you double glucose and insulin, you will have four times the diabetes.
- We want to predict the BMI of a person based on their diabetes status. We want to use the logistic regression S-curve to determine what the 25th, 50th, 75th, and 99th percentiles of BMI for diabetic and non-diabetic people in this sample are.
- We want to predict the S-curve-shaped interrelationships between BMI, age, glucose, pregnancies, and other data. We want to be able to see, as age goes up, what happens to BMI, glucose, and pregnancies with a valid regression with a solid P-value.
- We want to predict the log odds of having diabetes because mathematically, this will solve the problem that a straight-line linear relationship will often exceed 100%, especially when some numbers are outliers (like age of 80+ years or BMI at age 50+).
With binary classifiers, we typically build the model on our training data and then test the model (to see how good the predictions actually were) on the testing data. We then collect the results of our testing in a confusion matrix. You will find a learning resource about confusion matrices from StatQuest.
Question 4: Our Diabetes Model Confusion Matrix
Let’s say we want to predict whether a person has diabetes, and we are using the following confusion matrix:
Person actually has diabetes | Person actually does not have diabetes | |
Person is predicted to have diabetes | A | B |
Person is predicted to not have diabetes | C | D |
Match the cell with its label
(True positive, or TP)
(False positive, or FP)
(False negative, or FN)
(True negative, or TN)
Question 5: Practicing Our TP/TN/FP/FN Terminology
Let’s say we have a person with a glucose of 136, insulin of 130, and BMI of 28.3, and they are 42 years old. Our logistic regression model predicts that this person will not have diabetes. However, their medical records indicate that they do indeed have diabetes. Which phrase should be used to describe this situation?
A True positive
B False positive
C False negative
D True negative
Perform Binary Classification Using Logistic Regression in Python
Now we are going to run a binary classification predictive analytics algorithm in Python and review the results. You won’t have to write any code, but you will be running code which has been written for you.
- Go to your browser and set up a new instance of Google Colab at Welcome to Colaboratory.
- Upload two files:
- Upload the “Diabetes_Classifier.ipynb” as a notebook:
- Upload the “diabetes.csv” as a file uploaded to session storage:
Alt text: Google Colab
- Run the first cell, the classifier model. You can ask ChatGPT to explain this to you more fully, but basically what we are doing here with this code is:
- Importing a bunch of other code written by other people to help us build the model
- Reading in the diabetes.csv dataset
- Splitting the data into a training dataset (which we will use to build our logistic regression prediction model) and a testing dataset (which we will use to tell how good our model really was)
- Running the model on our training data
- Evaluating the model on our testing data
- When the code in this cell has finished running, it gives a little confusion matrix. (Note this confusion matrix has its labels switched from the way StatQuest did them. If you are keeping close track of these things, you will notice that the matrix printed from this code has the actual values on the left and the predicted values on the top. If you are not keeping close track of these things, you don’t need to keep close track of this switch either.)
Alt
Reviews
There are no reviews yet.