Unit 3: Self-Check Assignment 3: Diabetes Forecasting

This assignment builds on all of our previous work and introduces you to predictive analytics through a forecasting method called a binary classifier. We will then work on how to visualize and understand a binary classifier.

In this assignment, you will:

Receive an introduction to binary classifiers, logistic regression, and the four kinds of results: true-positive, false-positive, true-negative, and false-negative
Run a binary classification algorithm on our diabetes data
Visualize the results in Tableau

For this assignment, follow these steps:

Download the diabetes dataset if you need it
Learn about binary classifiers
Perform binary classification using a logistic regression in Python (this has been written for you; all you need to do is press ‘run’ in Colab)
Download the results
Visualize the results in Tableau

Attachments:

ipynb
csv dataset

Download the Diabetes Dataset

If you need to download the dataset again, click on the following link:

Pima Indians Diabetes Database

(We just used this dataset in a previous assignment, so you very well may already have it handy.)

Learn About Binary Classifiers

The word “binary” in this context means “just two options.” Some common binary outcomes could be whether a consumer will respond to direct marketing outreach (binary outcomes: they buy or they don’t buy), whether a streaming subscriber will like a certain movie (binary outcomes: they give it thumbs-up or thumbs-down), or whether an attempted financial transaction is legitimate (binary outcomes: it’s legitimate, or it’s a fraud). The important part of a binary outcome is that there are exactly two options.

A classifier is an algorithm that takes one or more input variables and outputs a prediction about the value of a different variable. The predicted values are restricted to a pre-selected list.

A binary classifier, then, is an algorithm that takes as its input one or more variables and, as its output, classifies the results into one of two mutually exclusive categories:

Problem Domain	Possible Input Variables (can have lots)	Binary Output Variable (2 values only)
Direct marketing	Age, income, gender of the consumer	Consumer buys or does not buy
Streaming subscriptions	Other movies they like, age of streamer, subscription price	Thumbs-up or thumbs-down for this movie
Financial transactions	Dollar amount of transaction, country of origin, frequency of transaction, whether or not the person has bought from this vendor before	Transaction is marked as legitimate, or transaction is flagged as fraudulent
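To make the input/output shape concrete, here is a minimal sketch of a binary classifier for the financial-transactions row of the table. The rule and the 1,000-dollar threshold are invented purely for illustration; a real classifier would learn its rule from data.

```python
def classify_transaction(amount_usd, known_vendor):
    """Toy binary classifier: always returns exactly one of two labels."""
    # Hypothetical rule, for illustration only: large purchases from
    # vendors the person has never used before get flagged.
    if amount_usd > 1000 and not known_vendor:
        return "fraudulent"
    return "legitimate"


print(classify_transaction(25.0, True))      # legitimate
print(classify_transaction(5000.0, False))   # fraudulent
```

However many input variables go in, the output is always one of the same two mutually exclusive values.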

Question 1: Understanding the Problem

In the diabetes dataset, what is/are the possible input variable(s)? (Input variables are the things we will use to make our prediction.) Select all that apply.

Glucose
Insulin
BMI
Age
Blood pressure
Outcome

Question 2: Understanding the Problem

In the diabetes dataset, what is/are the possible output variable(s)? (An output variable is the thing we want to predict.) Select all that apply.

Glucose
Insulin
BMI
Age
Blood pressure
Outcome

There are many algorithms that can be used for classification in data science. How to determine which algorithm to use, and how to evaluate its results, is beyond the scope of this course, but here we give a very basic overview of how predictive analytics models work. In the learning resources for this unit, we have provided a video from StatQuest about logistic regression. Its example of predicting obesity in mice is very close to what we are doing here.
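As a rough illustration of the S-curve that logistic regression uses, the sketch below converts log odds to a probability. The intercept and glucose coefficient are hypothetical values chosen only so the numbers are easy to follow; they are not the coefficients the course notebook fits.

```python
import math


def sigmoid(log_odds):
    """Convert log odds (any real number) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, chosen only for illustration.
INTERCEPT = -6.0
GLUCOSE_COEF = 0.04


def predicted_probability(glucose):
    """Probability of the positive outcome for a given glucose value."""
    return sigmoid(INTERCEPT + GLUCOSE_COEF * glucose)


for glucose in (75, 150, 300):
    print(glucose, round(predicted_probability(glucose), 3))
```

The log odds are a straight-line function of glucose, but the probability is not: it stays between 0 and 1 no matter how large the input gets.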

Question 3: What We Are Trying to Do Here with Logistic Regression

Which statement most closely resembles what we are trying to do here with our logistic regression binary classifier?

We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we realize that the relationship might not be linear. If you double the BMI, you might not double the chances of having diabetes.
We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we expect that the relationship will be linear for all variables. In other words, if you double glucose, you will double the diabetes. If you double insulin, you will double the diabetes. And if you double glucose and insulin, you will have four times the diabetes.
We want to predict the BMI of a person based on their diabetes status. We want to use the logistic regression S-curve to determine what the 25th, 50th, 75th, and 99th percentiles of BMI for diabetic and non-diabetic people in this sample are.
We want to predict the S-curve-shaped interrelationships between BMI, age, glucose, pregnancies, and other data. We want to be able to see, as age goes up, what happens to BMI, glucose, and pregnancies with a valid regression with a solid P-value.
We want to predict the log odds of having diabetes because mathematically, this will solve the problem that a straight-line linear relationship will often exceed 100%, especially when some numbers are outliers (like age of 80+ years or BMI at age 50+).

With binary classifiers, we typically build the model on our training data and then test the model (to see how good the predictions actually were) on the testing data. We then collect the results of our testing in a confusion matrix. You will find a learning resource about confusion matrices from StatQuest.
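A confusion matrix is just a tally of how often each combination of predicted and actual label occurs in the testing data. The sketch below builds that 2x2 grid in plain Python; the six labels are made up for illustration, and the label strings "Diabetes" and "Not Diabetes" are assumptions.

```python
def confusion_matrix_counts(actual, predicted, labels=("Diabetes", "Not Diabetes")):
    """Count test-set people in each (predicted, actual) cell of a 2x2 grid."""
    grid = {p: {a: 0 for a in labels} for p in labels}
    for a, p in zip(actual, predicted):
        grid[p][a] += 1
    return grid


# Made-up labels for six test-set people, purely for illustration.
actual = ["Diabetes", "Diabetes", "Not Diabetes",
          "Not Diabetes", "Not Diabetes", "Diabetes"]
predicted = ["Diabetes", "Not Diabetes", "Not Diabetes",
             "Diabetes", "Not Diabetes", "Diabetes"]

grid = confusion_matrix_counts(actual, predicted)
for predicted_label, row in grid.items():
    print(f"predicted {predicted_label}: {row}")
```

Here the outer key is the predicted label and the inner key is the actual label, so every person lands in exactly one of four cells.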

Question 4: Our Diabetes Model Confusion Matrix

Let’s say we want to predict whether a person has diabetes, and we are using the following confusion matrix:

	Person actually has diabetes	Person actually does not have diabetes
Person is predicted to have diabetes	A	B
Person is predicted to not have diabetes	C	D

Match the cell with its label

(True positive, or TP)

(False positive, or FP)

(False negative, or FN)

(True negative, or TN)

Question 5: Practicing Our TP/TN/FP/FN Terminology

Let’s say we have a person with a glucose of 136, insulin of 130, and BMI of 28.3, and they are 42 years old. Our logistic regression model predicts that this person will not have diabetes. However, their medical records indicate that they do indeed have diabetes. Which phrase should be used to describe this situation?

A True positive

B False positive

C False negative

D True negative

Perform Binary Classification Using Logistic Regression in Python

Now we are going to run a binary classification predictive analytics algorithm in Python and review the results. You won’t have to write any code, but you will be running code which has been written for you.

In your browser, open a new Google Colab session from the “Welcome to Colaboratory” page.
Upload two files:
1. Upload the “Diabetes_Classifier.ipynb” as a notebook:
2. Upload the “diabetes.csv” as a file uploaded to session storage:

Alt text: Google Colab

Run the first cell, the classifier model. You can ask ChatGPT to explain this to you more fully, but basically what we are doing here with this code is:
1. Importing a bunch of other code written by other people to help us build the model
2. Reading in the diabetes.csv dataset
3. Splitting the data into a training dataset (which we will use to build our logistic regression prediction model) and a testing dataset (which we will use to tell how good our model really was)
4. Running the model on our training data
5. Evaluating the model on our testing data
When the code in this cell has finished running, it gives a little confusion matrix. (Note this confusion matrix has its labels switched from the way StatQuest did them. If you are keeping close track of these things, you will notice that the matrix printed from this code has the actual values on the left and the predicted values on the top. If you are not keeping close track of these things, you don’t need to keep close track of this switch either.)
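The five steps above can be sketched in plain Python. This is not the code in Diabetes_Classifier.ipynb: it assumes synthetic data in place of diabetes.csv, a single input variable, a 70/30 split, and simple gradient descent instead of a library routine, so that it runs anywhere without extra files.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_data(n, rng):
    """Stand-in for reading diabetes.csv: one scaled input, one 0/1 outcome."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)  # e.g. a standardized glucose value
        y = 1 if rng.random() < sigmoid(2.0 * x) else 0
        rows.append((x, y))
    return rows


def train(rows, epochs=500, lr=0.1):
    """Fit weight w and intercept b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(rows)
        b -= lr * grad_b / len(rows)
    return w, b


def predict(w, b, x):
    """Predict 1 when the modeled probability is at least 0.5, else 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


rng = random.Random(0)
data = make_synthetic_data(300, rng)              # steps 1-2 (synthetic stand-in)
split = len(data) * 7 // 10
train_rows, test_rows = data[:split], data[split:]  # step 3: train/test split
w, b = train(train_rows)                          # step 4: fit on training data

# Step 5: evaluate on the held-out testing data.
correct = sum(predict(w, b, x) == y for x, y in test_rows)
accuracy = correct / len(test_rows)
print(f"test rows: {len(test_rows)}, accuracy: {accuracy:.2f}")
```

The key design point is that the model never sees the testing rows while it is being fit, so the accuracy on those rows is an honest estimate of how well it predicts for new people.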

Alt text: StatQuest

Run the next cell to generate the output file we will use to visualize the results in Tableau. Your output should look something like this, and you should have a “diabetes_predicted.csv” file available for download. It may take a minute or two to run and another minute or two to refresh, and you can click the “refresh” icon if you want to see the output file the very minute it is available:

Alt text: Classifier

Let’s just look at the “diabetes_predicted.csv” file before we download it:

Alt text: csv file

Here, let’s look at the first row, Patient_ID 767. This person has a glucose of 126, BMI of 30.1, and an age of 47. This person also had an actual outcome of Diabetes (fourth column) but was predicted to have Not Diabetes (fifth column). The Model Results column classified this as a False Negative for this person (sixth column).
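The Model Results value follows mechanically from the actual and predicted outcomes. A minimal sketch of that rule is below; the function name and the exact label strings are assumptions, since the notebook's own code is not shown here.

```python
def model_result(actual, predicted, positive="Diabetes"):
    """Name the confusion-matrix cell for one person's actual and predicted labels."""
    truth = "True" if actual == predicted else "False"
    sign = "Positive" if predicted == positive else "Negative"
    return f"{truth} {sign}"


# The Patient_ID 767 row described above: actual Diabetes, predicted Not Diabetes.
print(model_result("Diabetes", "Not Diabetes"))  # False Negative
```

The second word comes from what the model predicted, and the first word says whether that prediction matched reality.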

Question 6: Interpreting the Output File

Look further through the diabetes_predicted.csv file. For Patient_ID 526, what was their outcome?

A True positive

B False positive

C False negative

D True negative

Download the diabetes_predicted.csv file to your computer. We are now ready to visualize it using Tableau.

Visualize the Results in Tableau

We can see that these sorts of output files can be difficult to interpret. Let’s use Tableau to help visualize them.

Fire up Tableau and import your diabetes_predicted.csv data file to Tableau. Be sure the file you import has both Actual Outcome Text and Predicted Outcome Text fields in it.
Check: You should have 231 total rows in this data source.
First, let’s make a basic bar graph: How many model results were true positives? False positives? Other values?
1. Drag the Model Results to the Columns bar and the diabetes_predicted.csv (Count) to the Rows. It should look a little bit like the skeleton below—but you should have bar charts here.

Alt text: csv file

Question 7: Interpreting the Output File

How did the model do? Of the 231 people in this dataset, what was the most frequent model result?

A True positive: 49% of the results were true positive

B False positive: 18 people had a false-positive result

C False negative: 32% of the results were a false negative

D True negative: 132 people had a true-negative result

Let’s take another look at these results, which are more akin to the confusion matrix we saw earlier.
1. Go to another worksheet
2. Put the Actual Outcome Text in the Rows area, and the Predicted Outcome Text in the Columns area:

Alt text: outcome

Then drag the diabetes_predicted.csv (Count) to the area with the “Abc” in it:

Alt text: csv file

You will now have the numbers of the actual and predicted outcomes summed up for you:

Alt text: predicted outcomes

Let’s make the Marks a bit fancier: drag diabetes_predicted.csv (Count) to Size, then drag diabetes_predicted.csv (Count) once more to Label. Drag Model Results to Label as well, and expand your graphics so you can see the whole thing. You should get something that looks like this:

Alt text: predicted csv

Question 8: Interpreting the Visual Confusion Matrix

Look at your visual matrix. Which statements would you agree with? Select all that apply.

A If a person actually has diabetes, their results would be found on the top row.

B If a person actually does not have diabetes, their results would be found on the bottom row.

C If the model predicts diabetes, the majority of the people in this category will turn out to have diabetes.

D If the model predicts not diabetes, the majority of the people in this category will not turn out to have diabetes.

E If a person has diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.

F If a person does not have diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.

Sometimes we want to see how a model’s predictions vary as certain variables change. Does this model predict differently for people of different ages?
1. Go to a new worksheet and make a histogram of the age. Set the bin size to 10. It should look like this:

Alt text: bar graph

Add the Predicted Outcome Text field in front of Age (bin). You will now see two histograms, one for each predicted outcome:

Alt text: bar graph
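For readers who want to see what the split histogram is counting, here is a rough Python equivalent. It assumes each bin is labeled by its lower edge (so ages 40 to 49 fall in bin 40), and the six rows are made up for illustration.

```python
from collections import Counter

BIN_SIZE = 10


def age_bin(age, bin_size=BIN_SIZE):
    """Return the lower edge of the bin containing age (e.g. 47 -> 40)."""
    return (age // bin_size) * bin_size


def split_histogram(rows, bin_size=BIN_SIZE):
    """Count people per (predicted outcome, age bin)."""
    return Counter((predicted, age_bin(age, bin_size)) for age, predicted in rows)


# Made-up (age, predicted outcome) pairs, for illustration only.
rows = [(23, "Not Diabetes"), (27, "Not Diabetes"), (47, "Diabetes"),
        (41, "Not Diabetes"), (52, "Diabetes"), (29, "Diabetes")]

histogram = split_histogram(rows)
for (predicted, start), count in sorted(histogram.items()):
    print(f"{predicted:13s} {start}-{start + BIN_SIZE - 1}: {count}")
```

Each bar in the Tableau view corresponds to one (predicted outcome, age bin) pair in this tally.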

Question 9: Interpreting the Split Histograms

Look at these two histograms. Which statements would you agree with? Select all that apply.

A Among those who are predicted not to have diabetes, the age distribution has a lot of younger people in it.

B In the age group 40–49, the model is predicting approximately the same number of people with and without diabetes.

C In the age group 40–49, the model is predicting approximately the same percentage of people with and without diabetes.

D In the group which is predicted to have diabetes, the ages are relatively evenly distributed between people in their 20s, 30s, 40s, and 50s, with a sharp drop-off at age 60 and older.

Sometimes the total head count does not give the whole picture, and a percentage is a better way to go. Let’s try to get our histograms to show us percentages of total.
1. Duplicate your paired Age histograms to a new sheet.
2. Under the Rows, CNT(Age), pull down the right arrow and Add Table Calculation.

Alt text: histogram

For your Table Calculation, choose Percent of Total, and have it compute using Table(down):

Alt text: table

Then put the Model Results on the Color so you can see what percentage of each age group has what sorts of model results:

Alt text: graph
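The percentage view can be approximated in Python as follows. This sketch assumes that, in this layout, the table calculation makes each bar (one predicted outcome and one age bin) add up to 100%; how Tableau actually partitions the calculation depends on how the view is laid out. The rows are made up for illustration.

```python
from collections import Counter, defaultdict


def percent_within_bar(rows):
    """For each (predicted outcome, age bin) bar, give each model result's share in percent."""
    counts = defaultdict(Counter)
    for predicted, age_bin_start, result in rows:
        counts[(predicted, age_bin_start)][result] += 1
    shares = {}
    for bar, results in counts.items():
        total = sum(results.values())
        shares[bar] = {r: 100.0 * n / total for r, n in results.items()}
    return shares


# Made-up (predicted outcome, age bin, model result) rows, for illustration only.
rows = [
    ("Not Diabetes", 20, "True Negative"),
    ("Not Diabetes", 20, "True Negative"),
    ("Not Diabetes", 20, "True Negative"),
    ("Not Diabetes", 20, "False Negative"),
    ("Diabetes", 20, "True Positive"),
    ("Diabetes", 20, "False Positive"),
]

shares = percent_within_bar(rows)
print(shares[("Not Diabetes", 20)])  # {'True Negative': 75.0, 'False Negative': 25.0}
```

Percentages make bars with very different head counts comparable, but a bar built from only a handful of people will still show large swings.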

The final touch: Often, culturally, we see green as “good/correct” and red as “bad/error.” Let’s go through and set the colors so the “true” outcomes are in the green family and the “false” outcomes are in the red family.

Alt text: graph

Now we can look at – for example – a person in their 20s who is predicted not to have diabetes. Do they need to worry?
1. The prediction is not diabetes, so we want the graph on the right (blue and red).
2. Find the bar which represents people in their 20s who are not predicted to have diabetes.

Alt text: graph

Let’s look at this bar a little more closely. We can drag the diabetes_predicted.csv (Count) onto the labels to have it show us the total number of people here. We can see that the model does pretty well (lots of true model outcomes) for people in their 20s who are predicted not to have diabetes.

Alt text: graph

Question 10: Interpreting the Stacked Percentage Bar Charts

Look at these charts. Which statements are accurate? Select all that apply.

A For people in their 40s (age 40–49), a model prediction of “no diabetes” is very good news because the model is nearly always correct, and they probably don’t have diabetes.

B For very elderly people (age 80–89), there is only one person in the dataset of this age. Because the model predicts “diabetes” for this person, it will always predict “diabetes” for all people in this age group, regardless of their BMI, glucose, or other variables.

C Say you have 10 people in their 20s who receive a model prediction of “diabetes.” Approximately 7 of those people will actually have diabetes, but 3 will be incorrectly predicted to have diabetes.

D Say you have 10 people in their 20s who receive a model prediction of “diabetes.” Approximately 4 of those people will actually have diabetes, and these are the false positives.

E There are relatively few people in either category (predicted diabetes, predicted no diabetes) who are age 60–69, so we should be cautious about interpreting these percentages for a broader population.
