Draw the picture

by Billy Caughey

Introduction

At the University of Utah, I had the chance to take Stochastic Processes. It was in this class I ran into the classic problems about queuing. These problems usually involved a grocery store clerk, a few customers, and a grand question like: 

What is the probability the store clerk completes the current customer's order before the next customer enters the line, given there are two customers already in line?

A simple distribution is stated and you have all the pieces you need. Well, except for one piece: a picture, or diagram, of the event. When you draw the picture, you realize this isn't just a question of going from two customers to one customer.



The question becomes three parts:
  • Moving from two customers to one customer: event where the clerk completes the order BEFORE the next customer arrives
  • Moving from two customers to three customers: event where the clerk completes the order AFTER the next customer arrives
  • Moving from two customers to two customers: event where the clerk completes the order WHEN the next customer arrives
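The three events above can be checked with a quick simulation. This is only a sketch under assumptions the problem statement here does not give: service and arrival times are taken to be exponential, and the rates (2.0 and 1.0) and the function names are hypothetical. It is written in Python with the standard library only. Under those assumptions the chance the clerk finishes first is service_rate / (service_rate + arrival_rate), and an exact tie essentially never happens because the times are continuous.

```python
import random


def simulate_transitions(service_rate, arrival_rate, trials=100_000, seed=42):
    """Estimate how often each of the three events occurs.

    Assumes exponential service and arrival times (an assumption made
    for this sketch, not something stated in the original problem).
    """
    rng = random.Random(seed)
    counts = {"two_to_one": 0, "two_to_three": 0, "two_to_two": 0}
    for _ in range(trials):
        service_time = rng.expovariate(service_rate)
        arrival_time = rng.expovariate(arrival_rate)
        if service_time < arrival_time:
            counts["two_to_one"] += 1    # order done BEFORE the arrival
        elif service_time > arrival_time:
            counts["two_to_three"] += 1  # order done AFTER the arrival
        else:
            counts["two_to_two"] += 1    # order done WHEN the customer arrives
    return {event: n / trials for event, n in counts.items()}


def exact_probability_service_first(service_rate, arrival_rate):
    """Closed form for P(service < arrival) with exponential times."""
    return service_rate / (service_rate + arrival_rate)


# Hypothetical rates: the clerk serves 2 customers per minute, 1 arrives per minute.
estimates = simulate_transitions(service_rate=2.0, arrival_rate=1.0)
exact = exact_probability_service_first(2.0, 1.0)
print(estimates)
print(f"exact P(two -> one) = {exact:.4f}")
```

The simulated share for "two_to_one" should land close to the exact value, and the "two_to_two" share should be zero or very nearly so, which is the kind of thing the picture makes obvious.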
One of the many tools we have as data folks is the ability to visualize our data to describe what is going on.

Example 

I have often read reports where the author presents three measured variables in a table of averages and standard deviations. At this point, statistical words like "t-test", "ANOVA", or "linear regression" are used. For the data beginner, these words may not make a lot of sense (but they will! We will get to these methods!). For the data veteran, these words come with set assumptions about normality, what the standard deviations are like, and how variables interact. In these cases, I find myself wanting to see a visual showing what is going on (even though I recognize the statistical words). Let me show you what I mean.

For this example, let's use the "Iris" data set in R. This data set contains measurements of the sepal and petal of three flower types. For more information on this data set, click here. Let's start with a table showcasing the flower type along with the averages and standard deviations of the measurements. The format of the table will be average ± standard deviation.
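A table in the average ± standard deviation format can be built along these lines. This is a minimal sketch, not the code behind this post: it uses Python's standard library rather than R, the helper names are hypothetical, and the rows are a small hand-typed sample in the shape of the iris data (two measurements, four flowers per type), not the full data set.

```python
from collections import defaultdict
from statistics import mean, stdev


def mean_sd(values):
    """Format a list of numbers as 'average ± standard deviation'."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


def summary_table(rows, group_key, measure_keys):
    """Group rows by one field and summarize each measurement per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: {key: mean_sd([r[key] for r in members]) for key in measure_keys}
        for group, members in groups.items()
    }


# Illustrative sample only: (species, sepal_length, petal_length).
sample = [
    ("setosa", 5.1, 1.4), ("setosa", 4.9, 1.4),
    ("setosa", 4.7, 1.3), ("setosa", 4.6, 1.5),
    ("versicolor", 7.0, 4.7), ("versicolor", 6.4, 4.5),
    ("versicolor", 6.9, 4.9), ("versicolor", 5.5, 4.0),
    ("virginica", 6.3, 6.0), ("virginica", 5.8, 5.1),
    ("virginica", 7.1, 5.9), ("virginica", 6.3, 5.6),
]
rows = [
    {"species": s, "sepal_length": sl, "petal_length": pl}
    for s, sl, pl in sample
]

table = summary_table(rows, "species", ["sepal_length", "petal_length"])
for species, cells in table.items():
    print(species, cells)
```

With the real data you would pass all 150 rows and all four measurements; the grouping and formatting stay the same.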




I see tables like this and begin to smile. There is a lot of information presented in a very efficient way. What is not shown is the actual distribution of the data. A large amount of analytics is focused on the tails of distributions. Seeing the distribution also reveals interesting points of research. The distributions of these fields are presented below.


Looking at the distributions, the presentation of averages and standard deviations may not be appropriate. There is enough skew in some of the distributions that medians and quantiles may be a better summary. Additionally, there are a lot of interesting things going on. One that sticks out to me is the petal width and length. I'm not familiar enough with the functionality of petals, but I may want to look into it now.
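To make the point about skew concrete, here is a rough sketch that puts the mean and standard deviation next to the median and quartiles, and draws a crude text histogram. It is Python with the standard library only, the function names are hypothetical, and the values are a made-up skewed sample chosen for illustration, not iris measurements.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Return mean/sd alongside median and quartiles for comparison."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "q1": q1,
        "q3": q3,
    }


def text_histogram(values, bins=5, width=30):
    """Draw a crude histogram with '#' characters, one line per bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # avoid a zero step when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / step), bins - 1)  # top value goes in last bin
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        bar = "#" * round(width * count / peak)
        lines.append(f"{lo + i * step:6.2f} | {bar} {count}")
    return "\n".join(lines)


# Made-up, right-skewed sample: one large value pulls the mean upward.
skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 12]
stats = describe(skewed)
print(stats)
print(text_histogram(skewed))
```

In this sample the single large value drags the mean well above the median, which is exactly what a table of averages hides and a picture shows.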

Conclusion

Am I saying don't use tables? Absolutely not. Tables have their value. What I am saying is take the time to draw a picture or use a visual. Oftentimes, the visual lives up to the phrase "a picture is worth a thousand words". If you take the time to visualize your data, you will see questions, insights, and solutions you may not have seen otherwise.

Code for this blog can be found here.





















