3 Data Distributions for Counts in Layman’s Terms

Nikola O.

Combines concepts from information science, humanities and social sciences. Enjoys pondering, science fiction and design.

How many phones will be sold in the next quarter? How many people will come down with the flu in the next month? How many times will the CPU crash? Count data regression can help us answer these questions. Counts are everywhere, so regardless of your background, these data distributions will come in handy.

What are Data Distributions?

A data distribution tells us what the possible values of a variable are and how often these values occur. We can take anything from a person’s height to IQ scores and see how common all the possible values are. Height and IQ scores are a great example of a so-called normal distribution that describes many other phenomena in the world.

However, when we work with counts, things get tricky.

Height is a continuous variable. That means every inch or centimeter of increase or decrease represents the same change in height. For instance, if someone is 1 inch (2.5 cm) taller than you, it’s the same difference whether you are 5 feet or 6 feet tall.

In the case of counts, this is different. Imagine you go for a walk in your favourite park and decide to count how many squirrels you see. You visit this park every day after lunch and walk around for an hour. After a few weeks of collecting this data, you’d see that on some days, there were none, and on other days you’d see 2 or 3, even 5. The daily count of squirrels is a discrete variable that could be described or predicted with count data regression.
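To make the idea of a distribution of counts concrete, here is a minimal sketch that tallies a few weeks of daily squirrel counts and shows how often each value occurred. The numbers and the function name `empirical_distribution` are made up for illustration.

```python
from collections import Counter

# Hypothetical daily squirrel counts from three weeks of walks (made-up numbers).
daily_counts = [0, 2, 1, 0, 3, 2, 0, 1, 5, 2, 0, 1, 3, 2, 1, 0, 2, 1, 0, 3, 1]

def empirical_distribution(counts):
    """Return {value: share of days on which that value was observed}."""
    tally = Counter(counts)
    total = len(counts)
    return {value: tally[value] / total for value in sorted(tally)}

distribution = empirical_distribution(daily_counts)
for value, share in distribution.items():
    print(f"{value} squirrels: {share:.0%} of days")
```

Only whole numbers appear as keys, which is exactly what makes the variable discrete.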

Count Data Regression

Linear regression is a way of describing a relationship between variables with a straight line. Imagine that you are a nature photographer and want to take pictures of the squirrels at your park or a different animal somewhere else. You want to optimize the amount of time you wait for squirrels to show up. In other words, you want to predict how many squirrels will show up in the next few days. You could start collecting information about daylight, the busyness of the park and so on. These would be your predictive variables.

So far, so good: you have the data, and you can feed it into a regression model, i.e. an equation. Now the data distributions come in; what are your options?
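Count models such as Poisson regression usually do not fit the straight line directly to the counts. They fit it to the logarithm of the expected count, which keeps predictions above zero. The sketch below shows what such an equation could look like; the coefficients are hypothetical and chosen only for illustration, not fitted to any data.

```python
import math

def expected_count(intercept, slope, predictor):
    """Expected count under a log link: exp(intercept + slope * predictor)."""
    return math.exp(intercept + slope * predictor)

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
INTERCEPT = 0.2
SLOPE_PER_DAYLIGHT_HOUR = 0.1

for daylight_hours in (8, 12, 16):
    predicted = expected_count(INTERCEPT, SLOPE_PER_DAYLIGHT_HOUR, daylight_hours)
    print(f"{daylight_hours} h of daylight -> about {predicted:.1f} squirrels expected")
```

With a log link, each extra unit of the predictor multiplies the expected count by the same factor instead of adding a fixed amount.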

I’ll cover 3* count data distributions:

  1. Poisson
  2. Negative Binomial
  3. Zero-inflated (Poisson or Negative Binomial)

(*there are other options though)

1. Poisson Distribution

The Poisson distribution is usually the starting point when you work with count data. It assumes that the mean value equals the variance. Variance tells you how much the values spread out around the mean.
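The claim that the mean equals the variance can be checked numerically. The sketch below writes out the Poisson probability formula and adds up the mean and variance over a wide range of counts. The helper names and the two averages (5 and 20) are assumptions for illustration.

```python
import math

def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the average is `rate`."""
    # Work in log space so that large k does not overflow.
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

def mean_and_variance(pmf, max_k):
    """Mean and variance of a count distribution, summed over 0..max_k."""
    mean = sum(k * pmf(k) for k in range(max_k + 1))
    variance = sum((k - mean) ** 2 * pmf(k) for k in range(max_k + 1))
    return mean, variance

for rate in (5, 20):
    m, v = mean_and_variance(lambda k: poisson_pmf(k, rate), max_k=400)
    print(f"rate={rate}: mean={m:.3f}, variance={v:.3f}")
```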

Let’s say you own two grocery stores, and you start counting how many customers come in every hour. You collected information for 10,000 hours from both of them, and you decided to assume the data follows the Poisson distribution. The average number of customers in the first store is 20 while 40 customers usually shop in the second one.

This is your model:

[Figure: two plots of the Poisson distribution, one with a mean of 20 and one with a mean of 40]

Overall, you can see that the Poisson distribution is a bit longer on the right side. Counts cannot drop below zero, but they have no upper limit, so the values above the mean (right of the blue line) stretch further out than the values below it. If you fold the distribution in the middle, the right side will be longer.

To compare what happens when the mean is different, notice the x-axis on both plots. As the mean increased, the maximum increased from around 40 to almost 80. This is because the Poisson distribution uses a single parameter for both the typical value and the variability in your data, so a higher mean automatically brings a wider spread.
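If you want to try the two-store example yourself, the following sketch simulates 10,000 hours for each store. It uses Knuth's simple multiplication method to draw Poisson counts, which is just one of several possible samplers. The seed and the function name `sample_poisson` are assumptions, and your busiest hours will not match the plots exactly.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    # Keep multiplying uniform random numbers until the product drops below e^-mean.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)  # fixed seed so the run is repeatable
HOURS = 10_000
store_1 = [sample_poisson(20, rng) for _ in range(HOURS)]
store_2 = [sample_poisson(40, rng) for _ in range(HOURS)]

for name, hours in (("store 1", store_1), ("store 2", store_2)):
    print(f"{name}: mean={sum(hours) / len(hours):.1f}, busiest hour={max(hours)}")
```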

However, this data distribution is often too simplistic. You might require more parameters to describe what is happening with your customers.

2. Negative Binomial Distribution

Count data regression with the negative binomial distribution is an excellent option if the variance in your data is higher than the mean. This often happens with medical or public health data. Statisticians call this overdispersion. In this case, you need another parameter to capture the dispersion of the data.
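A quick way to spot overdispersion is to divide the sample variance by the sample mean. A value near 1 is what the Poisson distribution expects, while a value well above 1 hints at overdispersion. The weekly case counts below are made up for illustration, and `dispersion_ratio` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 is what Poisson expects."""
    return variance(counts) / mean(counts)

# Hypothetical weekly case counts (made-up numbers for illustration).
weekly_cases = [0, 1, 0, 14, 2, 0, 3, 25, 1, 0, 7, 0]
ratio = dispersion_ratio(weekly_cases)
print(f"variance / mean = {ratio:.1f}")
```

This is only a rough first look, not a formal test.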

Another way of thinking about this second parameter is the number of successes. Going back to the store owner example, you could include the number of customers that buy something. We were talking about grocery shopping. To make this simulation more realistic, let’s count all the customers who spend over $100. How will the data distribution change if we see 1 vs 10 such customers?

Considering the first shop that serviced 20 customers in an hour on average, these are our models:

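[Figure: two plots of the negative binomial distribution with a mean of 20, one with a size of 1 and one with a size of 10]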

Both plots show the example with the mean of 20, but the dispersion (size) parameter is different. When we assume only 1 customer will spend over $100 in an hour, we see huge differences in how much customers spend. This model suggests that most customers spend less than $100, and fewer spend more than that. The second model, assuming 10 customers spend over $100, shows more balanced customer spending.
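In statistical terms, the size parameter controls how spread out the counts are. In one common parametrisation by mean and size, the variance is mean + mean² / size, so a small size means a lot of extra spread. The sketch below checks this numerically for sizes 1 and 10 with a mean of 20. The function names are assumptions for illustration.

```python
import math

def negative_binomial_pmf(k, mean, size):
    """P(count == k) for a negative binomial with the given mean and size."""
    p = size / (size + mean)
    log_coeff = math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
    return math.exp(log_coeff + size * math.log(p) + k * math.log1p(-p))

def variance_of(pmf, max_k=5000):
    """Variance of a count distribution, summed over 0..max_k."""
    mean = sum(k * pmf(k) for k in range(max_k + 1))
    return sum((k - mean) ** 2 * pmf(k) for k in range(max_k + 1))

for size in (1, 10):
    numeric = variance_of(lambda k: negative_binomial_pmf(k, 20, size))
    formula = 20 + 20 ** 2 / size
    print(f"size={size}: variance {numeric:.1f} (formula: {formula:.1f})")
```

As the size grows, the extra term shrinks and the distribution moves closer to Poisson.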

Can we focus on specific items?

3. Zero-inflated (Poisson or Negative Binomial)

Zero-inflated distributions are combinations of either Poisson or negative binomial with a peak at the value zero. The zero is inflated, hence the name.

This is a handy option in many industries. Think about insurance claims. You pay a monthly insurance fee, but you won’t make a claim most of the time. For the insurance company, every month you don’t claim anything will be recorded as zero. In fact, most people won’t claim their insurance. Those who claim their insurance get various amounts that are described by the Poisson or negative binomial distributions.

Similarly, this data distribution would be great for understanding spending on expensive items. Where I live, you can buy a TV or a Playstation in huge grocery stores. Most people will just buy food, and that’s it. To sell something bigger is rare, so most people spend zero on technology there.

The zero-inflated model captures this as the probability that the zero value will occur. The parameter is illustrated below with Poisson Distribution.

[Figure: two plots of the zero-inflated Poisson distribution with different probabilities of zero]

The probability of a zero value can also be understood as “how much of the original distribution we want to keep”: the higher this probability, the less of the original distribution remains. We can see that the mean highlighted with the blue colour is not 20 anymore. Additionally, the final mean is different between the two plots. In the second plot, we assume more zero values, so the mean is closer to zero than in the first plot.
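Here is a minimal sketch of the zero-inflated Poisson idea: with some probability the value is forced to zero, and otherwise it comes from the usual Poisson distribution. The overall mean then shrinks to (1 - zero probability) × original mean. The zero probabilities 0.2 and 0.5 are hypothetical and not taken from the plots.

```python
import math

def zero_inflated_poisson_pmf(k, mean, zero_prob):
    """P(count == k) when extra zeros occur with probability `zero_prob`."""
    poisson = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    kept = (1 - zero_prob) * poisson  # the share of the original distribution we keep
    return zero_prob + kept if k == 0 else kept

def zip_mean(mean, zero_prob, max_k=400):
    """Overall mean of the zero-inflated distribution, summed over 0..max_k."""
    return sum(k * zero_inflated_poisson_pmf(k, mean, zero_prob) for k in range(max_k + 1))

for zero_prob in (0.2, 0.5):  # hypothetical values
    print(f"zero probability {zero_prob}: overall mean = {zip_mean(20, zero_prob):.1f}")
```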

Conclusion

There are many other options for count data regression. It’s always best to visualize your data with a histogram and compare the shape with these essential distributions. If it doesn’t seem to fit, try to search for other options specific to your context, e.g. epidemiology, insurance, engineering, etc.
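A rough text version of the histogram check could look like the sketch below. It puts the observed frequency of each value next to what a Poisson distribution with the same mean would expect. The data and the function name `compare_with_poisson` are made up for illustration.

```python
import math
from collections import Counter

def compare_with_poisson(counts):
    """Return (value, observed, expected) rows, where expected comes from a
    Poisson distribution whose mean equals the sample mean."""
    n = len(counts)
    sample_mean = sum(counts) / n
    tally = Counter(counts)
    rows = []
    for value in range(max(counts) + 1):
        prob = math.exp(-sample_mean + value * math.log(sample_mean)
                        - math.lgamma(value + 1))
        rows.append((value, tally[value], n * prob))
    return rows

# Hypothetical counts with many zeros (made-up numbers).
observed = [0, 0, 0, 0, 0, 0, 4, 5, 0, 6, 0, 3, 0, 5, 0, 4]
for value, seen, expected in compare_with_poisson(observed):
    print(f"{value}: {'#' * seen:<10} observed={seen}, Poisson expects {expected:.1f}")
```

If the observed zeros far outnumber the expected ones, as in this made-up data, a zero-inflated model may be worth a look.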

If you feel like you need a bit more background on statistics in general, read:

Quick Intro to Statistics

For advice on how to clean your data, see:

Brace Yourself: Data Cleanup Is Coming
