datarich(ard)

a personal journey into data science

View My GitHub Profile

10 December 2020

Bin sizes

by datarich(ard)

Depicting large data

This is a brief examination of the effect of the bin design on data visualization (e.g., scatter plots).

HILDA provides income and household wealth data on approximately 10,000 individuals each year. When visualizing large datasets of n > 10,000 there are often too many points to distinguish patterns among the individuals or even the overall trend - a problem sometimes known as “over-plotting”. Models, or some kind of summary function, are needed to provide an accurate or fair depiction of trends and patterns. Measures of central tendency, or the expected value can also be used to reduce the number of indiivdual points, however then the data must be grouped or binned. Binning can occur by equal length or equal size.

Below we plot the relationship between income and happiness over five years in HILDA. In each case, a smooth estimate with 95% confidence intervals based on the entire dataset is shown in blue (a GAM - generalized additive model with integrated smoothness: formula = y ~ s(x, bs = “cs”) with method = “REML”).


Unsummarized plots

These plots show each individual point (e.g., n = 10,791 in the first panel).

unsum


It is difficult to distinguish any underlying trend over the x-axis (Household wealth), apart from the fact that observations tend to be bunched up at the lower end of the wealth scale. In the figures below, we summarise these 10,000 odd points by each of two methods: equal sized bins or equal interval (length) bins.

In each case, the smoothed trend line based on the underlying individual data is also overlaid, in order to compare the summary points against the (“true”) underlying trend.


Equal interval bins

These plots summarise the 10,000 individual points in each panel into ten equal sized intervals. Because the number of points in each interval can vary, the size of each point reflects the number of observations which contribute to each interval.

To construct the equal interval bins, the cut_interval() function defined unique boundaries for each year to cover the distribution of income. So each year had ten intervals but unique interval boundaries.

cut_interval


It is easy to see that as the income scale increases, the size of the points indicates fewer and fewer observations contribute to each interval. Some intervals do not appear because no observations are within the relevant range of income (see table below). The smaller points on the right side of the income scale are also more likely to fall outside the confidence interval of the smoothed trend line in blue. Only the points with substantial observations on the left side of each panel accurately represent the smoothed trend line (blue).

The ten bins have the following tallys in each year:

Equal interval bin sizes
dollar\_bin 2002 2006 2010 2014 2018
1 9256 9654 11151 21173 19161
2 7466 6510 5795 1788 3749
3 1246 1009 685 105 199
4 206 180 150 7 49
5 58 38 35 28 58
6 18 13 10 3 2
7 10 16 6 0 0
8 8 12 1 8 18
9 13 9 10 0 0
10 14 12 12 1 1


The tally in each bin dropped off quickly after the first few bins (sometimes as early as the 4th bin). In more recent years (2014, 2018) there are empty bins. This is a consequence of using equal length intervals over a long-tailed distribution such as income (which is especially long-tailed in recent years).


Decile bins

These plots summarise the 10,000 individual points in each panel into ten equal sized bins (i.e., decile bins, each containing ~1000 observations).

decile_plot


The ten equal sized decile bins produce points that tend to be bunched up on the left side of the income axis. However this appropriately reflects the portion of the income scale about which we can be most confident.

tags: