Unit 3: Self-Check Assignment 2: Milligan, Chapter 9: Clusters and Distributions |
This assignment builds on our previous work through Milligan, Chapter 9, and takes a deeper dive into two key data visualization techniques: clustering and distribution analysis. Clustering can be a key technique for visualizing the relationships between two or more variables, moving from single dimensions into multidimensional analysis. Distribution analysis can highlight central tendencies in a dataset (such as mean or median) and allow visualization to show outliers clearly.
In this assignment, you will apply clustering and distribution analysis to analyze the relationships between several factors in a healthcare dataset by following these steps:
- Go through this document and use Tableau to answer all the questions listed below. Where applicable, paste screenshots into the template below.
- When you are ready, complete the online quiz, which verifies your homework. Use the answers you found in this document to answer the questions.
- When you have completed the online quiz, submit the Word document.
- Remember, you can always ask your instructor for help if needed.
- If you need to adjust the size of your visualizations to match the options in the questions, use the “Format”-> “Cell size” options. For example, “Ctrl+Shift+B” on a Windows computer will make the visualization bigger, and “Ctrl+Up” will make it taller.
Attachments:
- csv (dataset downloaded from Kaggle):
- Pima Indians Diabetes Database
For this assignment, follow these steps:
- Download the Pima Indians Diabetes dataset
- Perform exploratory data analysis (EDA) of the data in Tableau
- Perform clustering analysis of the data in Tableau
- Perform distribution analysis of the data in Tableau
Download the diabetes dataset
- Click to access the Pima Indians Diabetes Database. You may need to sign in or register if you don’t already have an account on Kaggle. Click the “Download” button on the linked page above to download the dataset.
Alt text: dataset
- You can scroll down on this web page to learn more about this dataset, including its original publication date and what the fields mean.
- When you download the data, it will save it as a .zip file, likely named something like “archive.zip” (a standard download protocol for Kaggle).
Alt text: standard download
- Double-click on the “archive” file. It should open to a screen like the one below (this is from a Windows machine; if you are on a Mac, it may vary slightly). You will see that there is one file there (named diabetes.csv, highlighted with a green arrow below). If you click on the “Extract all” option, it will extract all the files (in this case, just the one diabetes.csv file). Save the file in a location of your choice, where it will be easy to find in order to connect with Tableau.
Alt text: Tableau
- Open the dataset on your computer. It’s a .csv file, so it should open in Microsoft Excel, and it should look like the table below. Verify that you have these exact numbers showing up in your downloaded file:
Alt text: Microsoft Excel
Perform Exploratory Data Analysis (EDA) with Tableau
- Open the dataset in Tableau. Since this is a .csv file, and not an Excel file, you need to connect to a “more” kind of data (see image below):
Alt text: Connect
Alt text: Data file
- When you open the file in Tableau, you should see something like this:
Alt text: Tableau file
- Before we do any multidimensional clustering, it’s always a good idea to get an overview of each of the data fields:
- There are 768 data rows here, each corresponding to a different member of the Pima Indian Tribe. The table contains each member’s number of pregnancies, blood glucose, blood pressure, and other health measurements. All the way to the right, the table also contains an outcome variable:
- 0 if the tribe member was not diagnosed with diabetes
- 1 if the tribe member was diagnosed with diabetes
- For example, let’s look at the Outcome for the first member. This person has had six pregnancies with a glucose reading of 148. If we scroll all the way to the right, we see the age listed as 50 and the Outcome listed as 1. This means this tribe member was diagnosed with diabetes.
- There are 768 data rows here, each corresponding to a different member of the Pima Indian Tribe. The table contains each member’s number of pregnancies, blood glucose, blood pressure, and other health measurements. All the way to the right, the table also contains an outcome variable:
Alt text: outcome
Question 1: Understanding the Data
In the diabetes dataset, Row 8 contains a tribe member who reported ten pregnancies. Which other data fields correspond with this person?
- Glucose of 115, BMI 35.3, Outcome: No Diabetes
- Glucose of 115, BMI 35.3, Outcome: Diabetes
- Glucose of 168, BMI 38.0, Outcome: Diabetes
- Glucose of 168, BMI 38.0, Outcome: No Diabetes
- Glucose of 139, BMI 27.1, Outcome: No Diabetes
- None of these
Question 1 Answer: |
- Let’s explore the Pima Indians Diabetes Database a little bit more. Anytime you get a new dataset, it’s a good idea to run some general descriptives to see what the data look like. One of the major tools we use is a histogram.
- Make a histogram of the age. On a new worksheet (below, Sheet 1), we first pull the Age to the Rows, and then, in the “Show Me” tab, we select the histogram icon.
Alt text: histogram
Alt text: histogram
Alt text: histogram
Note that when it makes the histogram, under the Data tab, you can see a new variable under Tables called “Age (bin).” This is the histogram bin for your Age variable.
If we look at Sheet 1, we can see that most of our data points report an Age of between 20 and 30. We also have a few people in their 40s and 50s, and many fewer older people.
- Tableau does the best it can to guess at “good” bin sizes for your dataset, but it doesn’t always guess exactly the way a human would like. Go to the right of the green oval under the Data -> Tables -> Age (bin), pull that menu down, and select Edit to see the exact bin sizes Tableau is suggesting here.
Alt text: Data tables
We can see here that Tableau is suggesting we group people in age groups of 4.69 years, and that our first bin is suggested to start at age 18.76. So the first bin will contain ages 21, 22, and 23. The second bin, due to rounding issues, contains ages 24, 25, 26, 27, and 28! Let’s update the bins so that they are of bin size 5, so that our yearly increments go in nice, round age buckets.
Alt text: Edit bins
If we update the bin size to 5, we can now see that the histogram has adjusted slightly. The first bin contains ages 20, 21, 22, 23 and 24. The second bin contains ages 25, 26, 27, 28, and 29. The histogram is “smoother” in the jump between the 35–40 and 40–45 age bins.
Alt text: Bar graph
Sometimes, the axis labels aren’t obvious. To be sure of which bin you are viewing, you can click on a bin, and it will display the information.
Below, we have clicked on one of the bars and learned it is the bin for ages 35 (and contains ages 35, 36, 37, 38, and 39 but not age 40). If you are curious, you can right-click on the bin and choose “View Data” and then “Full Data” to see the exact data points which make up that bar.
Alt text: bar graph
Alt text: histogram
- On a new sheet in Tableau, make a histogram of the BMI variable. Adjust the bin sizes so that they are size 5.
Question 2: Understanding the Bin Sizes
In your BMI histogram, what values are in the most frequent bin?
- 30–34.9
- 30–35
- 40, 135
- 40–33.30
- none of these
Question 2 Answer: |
Question 3: Understanding Questionable Values
In your BMI histogram, are there any values that make you question the data and wonder if those should be filtered out?
- Yes; the data looks normally distributed, like a bell curve, and this is not expected for these sorts of measurements
- Yes; there are too many counts of a BMI of 40 and higher, which is much larger than expected
- Yes; there are 11 counts of a BMI of 0, which is probably not an accurate measure
- No; the data looks normally distributed, like a bell curve, and this is expected for these sorts of measurements
Question 3 Answer: |
- On a new sheet in Tableau, make a histogram of the Glucose variable. Adjust the bin sizes so they are size 10. (Don’t filter out any Glucose measurements at this point.)
Question 4: Understanding the most frequent bins
In your Glucose histogram, what values are in the most frequent bin?
- 100 through 107, including 100 and 107
- 100 through 109, including 100 and 109
- 100 through 110, including 100 and 110
- 110 through 120, including 110 and 120
- 120 through 129, including 120 and 129
Question 4 Answer: |
Question 5 – Understanding the bin values and counts
The most frequent Glucose measurement is about 100. What is the second most frequent Glucose measurement?
- bin 100, count 117
- bin 110, count 94
- bin 110, count 105
- bin 120, count 102
Question 5 Answer: |
Question 6: Understanding the shape of the data
In your Glucose histogram, how would you describe the overall shape of this data?
- The average/median is about 100, and symmetric. It’s not skewed at all.
- The average/median is about 100, and it’s skewed to the right.
- The average/median is about 100, and it’s skewed to the left.
- It is bimodal: there are two distinct centers of glucose measurements, probably one for diabetes diagnoses and one for non-diabetes diagnoses.
- None of these
Question 6 Answer: |
- On a new Sheet in Tableau, make a histogram of the Insulin variable. Adjust the bin sizes so that they are size 50.
Question 7: Understanding the Descriptives of the Data
In your Insulin histogram, how would you describe the overall descriptives of this data?
- Zero insulin is meaningless, so there must be some mistakes.
- This data is normally distributed and follows a bell curve symmetric shape.
- This data is bimodal. We can see two distinct populations: those with diabetes and those without.
- The most frequent amount of insulin is between 0 and 49, but a few people report large levels of insulin, at 400 units or above.
- None of these
Question 7 Answer: |
- In your Insulin histogram, drag Outcome to the Color under Marks. Remember that Outcome=0 means No Diabetes, and Outcome=1 means Diabetes. Now go to the leftmost bin, right-click on it to View Data, and then look at the Full Data. How would you describe the data points in this bin? Check all that apply.
Question 8: Two-Dimensional Descriptives of the Data
How would you describe the data points in the leftmost bin in terms of insulin and outcome?
- The most frequent amount of insulin is 0, but there are some values here between 0 and 49.
- All the Insulin values here are 0.
- All the Outcome values here are 0 (no Diabetes).
- The values here average to 25.
- This contains only values from people with No Diabetes as their Outcome. If you have diabetes, you need insulin.
- None of these
Question 8 Answer: |
- Now, let’s look at the Outcome variable. We could make a histogram of this binary variable, but that’s not very satisfying. Let’s recode it to be a text variable. Instead of having to remember “0 means No Diabetes,” wouldn’t it be easier to just have the words “No Diabetes” showing on the screen?
First, make a new worksheet. Then, under Outcome, choose Create -> Calculated Field.
Alt text: outcome text
Fill it out as follows: You want to create a new field called “Outcome_Text,” and we want it to be “Diabetes” if the Outcome variable was a 1, 0 otherwise, and “NA” if, for some weird reason, the Outcome variable was neither a 1 nor a 0.
If you want to copy and paste the formula with all the glorious brackets and parentheses, here it is:
IF ([Outcome]=1)
THEN “Diabetes”
ELSEIF ([Outcome]=0)
THEN “No Diabetes”
ELSE “NA”
END
Alt text: copy and paste formula
- Let’s check that it coded correctly:
- How many Outcome=0 do we have? Let’s look at a histogram of the original Outcome variable. Looks like we have 500 in the Outcome=0 bin and 268 in the Outcome=1 bin. (Those of you who are following along at home on your calculators will notice this is about 65% No Diabetes, 35% Diabetes.)
Alt text: outcome
- How does our Outcome_Text = “No Diabetes” or “Diabetes” stack up against our original binary variable? Let’s drag the Outcome_Text to the Color and
Reviews
There are no reviews yet.