Small Data Still Matters

Small data matters. It will always matter. No matter how awesome it is to implement machine learning models that rely on (dare I say?) “big” data, distributed processing, and hours of training, small data still matters. It always will.

Why will it always matter? Simply put, important business questions sometimes arise quickly, don’t come with a ton of data, and must be answered yesterday. We often must assemble data quickly to provide at least a partial answer to the question at hand. This process typically results in ‘small’ data as measured by the ability to easily do the analysis on a standard laptop.

No matter how many fancy models you learn about, to answer the types of business questions I describe above, you need to understand the problem you are trying to solve and how the data you have relates to it. Simply put, you need to think about the data generating process and how it relates to your question. This will often mean you must think hierarchically.

What do I mean by “think hierarchically”? First, thinking hierarchically means you recognize that data often come in groups with similarities among the groups. For example, geographical regions often provide natural groupings. Here, we often expect some group-level differences but also some similarities because the same phenomenon is being studied. Additionally, there can be nested groupings like cities within states. In this example, cities within the same state are expected to be more similar to each other than to cities in different states. The grouping for your particular analysis will often be obvious if you understand the data-generating mechanism and the different variables involved.

Second, thinking hierarchically means you recognize the varying levels of information you have on the various groups. If our groupings are by cities and states, we’d expect to have many data points from large cities (i.e. lots of information) and fewer for small cities (i.e. little information).

Once you recognize a hierarchical structure in your data, it often makes sense to account for it in your analysis. For example, instead of city-level means and standard deviations, you could take state-level means and standard deviations. While increasing sample sizes for smaller groups, this has the unfortunate property of coarsening your analysis. You treat all cities within the same state the same, which is not quite correct. We’d like our analysis to be flexible enough to detect meaningful differences between the groups.

Given this, the biggest level-up to your modeling skills is to know how to specify and fit hierarchical models.

The main benefit of hierarchical modeling is often stated as allowing individual group estimates to borrow (or pool) information from similar groups. This is most beneficial in groups with small sample sizes because of their highly variable sample means. Instead of using this (‘unpooled’) sample mean as the group estimate, the hierarchical model uses a ‘pooled’ mean. Without going into too much statistical detail, you can think of this as using a weighted mean across similar groups. In small groups, other similar groups are weighted more. In large groups, they are weighted less, converging to the sample mean as the sample size grows.
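The weighting described above can be sketched in a few lines. This is a minimal illustration, not the fitted model from this post: it assumes a simple normal model in which the within-group variance (`sigma2`) and the between-group variance (`tau2`) are already known, whereas a real hierarchical model estimates them from the data. The function name and the numbers are hypothetical.

```python
def partial_pool(group_mean, n, grand_mean, sigma2, tau2):
    """Precision-weighted compromise between a group's sample mean and the overall mean.

    Assumes known within-group variance (sigma2) and between-group variance (tau2).
    """
    # Weight on the group's own sample mean; it approaches 1 as n grows.
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
    return w * group_mean + (1.0 - w) * grand_mean


# Same sample mean (10) and overall mean (5), different group sizes.
for n in (3, 30, 300):
    print(n, round(partial_pool(10.0, n, 5.0, sigma2=4.0, tau2=1.0), 3))
```

With only 3 observations the estimate is pulled well toward the overall mean; with 300 it sits almost on top of the group's own sample mean.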

Let’s try to demonstrate this with some simulated data. I generated data from a bunch of different groups with sizes ranging from 3 to 200. The data were generated so that certain subgroups would have similar values. The figure displays the difference between the hierarchical estimates and sample means versus group size. For small groups the two estimates can differ a lot, but for large sample sizes they get closer together.
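A rough way to reproduce the flavor of this comparison is sketched below. It is not the simulation behind the figure: the group sizes, the parameter values (`mu`, `tau`, `sigma`), and the function names are all assumptions chosen for illustration, and the variances are again treated as known rather than estimated by a fitted hierarchical model.

```python
import random
import statistics


def simulate_groups(sizes, mu=50.0, tau=5.0, sigma=10.0, seed=0):
    """Draw one true mean per group around mu, then noisy observations around it."""
    rng = random.Random(seed)
    groups = []
    for n in sizes:
        theta = rng.gauss(mu, tau)  # the group's true mean
        groups.append([rng.gauss(theta, sigma) for _ in range(n)])
    return groups


def compare_estimates(groups, sigma=10.0, tau=5.0):
    """Return (n, sample mean, partially pooled mean, absolute gap) per group."""
    sample_means = [statistics.fmean(g) for g in groups]
    # Crude stand-in for the overall mean: the average of the group means.
    grand_mean = statistics.fmean(sample_means)
    rows = []
    for g, ybar in zip(groups, sample_means):
        n = len(g)
        w = (n / sigma**2) / (n / sigma**2 + 1.0 / tau**2)
        pooled = w * ybar + (1.0 - w) * grand_mean
        rows.append((n, ybar, pooled, abs(pooled - ybar)))
    return rows


sizes = [3, 5, 10, 25, 50, 100, 200] * 5
rows = compare_estimates(simulate_groups(sizes))
for n in sorted(set(sizes)):
    gaps = [gap for size, _, _, gap in rows if size == n]
    print(f"n={n:>3}  mean |pooled - sample mean| = {statistics.fmean(gaps):.2f}")
```

The printed gap is the quantity on the vertical axis of the comparison described above, averaged over groups of the same size.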

The two estimates are clearly different, but different doesn’t mean better. The advantage of the hierarchical estimates is their smaller variance. And this makes sense because they use more information from similar groups. We show this in the following graphic, which plots the standard errors of the two estimates against each other. The size of each point represents the group size, and the line is where the standard errors are equal. The cluster of points in the upper right is made up entirely of small groups. The standard error of the sample mean is much larger than that of the hierarchical estimate in these cases. The cluster of points in the lower left is bunched around the line. These are larger groups, and the standard errors are roughly the same between the two estimates.
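The same pattern shows up in a back-of-the-envelope calculation. The sketch below assumes the simple normal model with known `sigma` (within-group standard deviation), known `tau` (between-group standard deviation), and a known overall mean, so it understates the uncertainty a fully fitted hierarchical model would report. The function names and values are hypothetical.

```python
import math


def se_sample_mean(n, sigma):
    """Standard error of the unpooled sample mean."""
    return sigma / math.sqrt(n)


def se_partial_pool(n, sigma, tau):
    """Posterior standard deviation of the group mean under the assumed normal model."""
    return math.sqrt(1.0 / (n / sigma**2 + 1.0 / tau**2))


for n in (3, 10, 50, 200):
    print(n, round(se_sample_mean(n, 10.0), 2), round(se_partial_pool(n, 10.0, 5.0), 2))
```

For small `n` the pooled standard error is noticeably smaller; as `n` grows the two columns converge, which matches the points bunching around the line.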

It’s not all good, right? Well, no. It’s not. I can think of two main trade-offs. One, you have to specify a good model. This takes some practice, but if you understand the data-generating process, it is often doable. The second is that the pooled estimates may lose some information at the level of each group. We partially combine the groups to reduce the variance of the estimates at the cost of not getting completely independent estimates. Reducing variance is often worth it when there are similarities between the groups, and many examples show better predictive performance from hierarchical modeling when this is true.

Hierarchical modeling is not new; there are many good texts and articles that discuss the subject. I will undoubtedly leave some good ones out, but some of my favorites are below. I tried to rank them roughly in order of accessibility.
