Population density bubbles
Recently my family and I were visiting Toledo, an impressive historical city in the center of Spain, when I started browsing a local magazine. I stopped at a short article with some facts about the main cities in the region: local customs, gastronomy, industry, services and so on. The article was followed by an infographic-style map of the main cities of the region (Castilla-La Mancha) with a bubble chart drawn on top. Each element was centered on a city's location, and two data series were displayed as bubbles: population and area. For each city, these two demographic values were shown as circles whose area was proportional to the value, both centered on the same position.
Well, nothing new under the sun: people have been using bubble charts for a long time, analysts work with them daily and even students know how to use them. The idea I found useful is displaying two series of values for each item in the same bubble chart, and the way the associated information is conveyed. In the example, we analyze demographics starting from area (surface) and population as raw measures and then introduce their ratio, population density, as a meaningful working metric.
This metric is easy to work with, very familiar, easy to plot and easy to understand. Even so, I understood the relationship between the raw metrics better when I saw them plotted on a bubble chart than on a bar or line chart.
To illustrate this point, the next charts show the same information (population, area and population density) in different ways. I hope you agree with me that the second plot depicts the measures better than the first.
In fact, in chart 2 we plot only two metrics; density is an inferred third metric that we can read from the apparent sizes of the bubbles. This is the main idea of this post and the reason for writing it.
Method and considerations
I used Excel 2010 to plot both charts. I chose it because it is familiar to me and available to most people (in case you would like to try it yourself). The first chart is just a column chart that is easy to build with little skill; the second is more complex and there are some points to consider.
I set the x and y values of every item inside the interval (-4, 4), just to ensure that I can plot all of them in a bounded area. In a real case, the x and y coordinates should be derived from latitude and longitude, conveniently scaled and normalized. One of my next objectives is to repeat this exercise with the real coordinates of towns in Spain (or any country whose basic demographic measures I can access) and plot that information in this way.
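As a rough idea of what "scaled and normalized" could look like, here is a minimal Python sketch that linearly maps latitude and longitude onto the edges of that interval. The town names and coordinates are hypothetical placeholders, and the `rescale` helper is my own illustration, not something Excel provides.

```python
def rescale(values, lo=-4.0, hi=4.0):
    """Linearly map a list of numbers onto the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # All points share this coordinate: place them in the middle.
        return [(lo + hi) / 2.0 for _ in values]
    span = vmax - vmin
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

# Hypothetical (latitude, longitude) pairs in degrees, for illustration only.
towns = {
    "TownA": (39.9, -4.0),
    "TownB": (39.0, -1.9),
    "TownC": (40.6, -3.2),
}

lats = [lat for lat, lon in towns.values()]
lons = [lon for lat, lon in towns.values()]

xs = rescale(lons)  # longitude becomes the x coordinate
ys = rescale(lats)  # latitude becomes the y coordinate

for name, x, y in zip(towns, xs, ys):
    print(f"{name}: x={x:.2f}, y={y:.2f}")
```

Scaling each axis independently stretches the map to fill the square, so distances are distorted; using one common scale factor for both axes would keep the shape of the region, at the cost of leaving part of the chart empty.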
To plot population and surface, we must normalize both data sets. The reason is that the absolute magnitudes of these metrics can be very different; normalization ensures that we can visualize most of the measures in the same chart.
In this example I considered a model of small towns whose urban area spans between 100 and 300 square km and whose population ranges between 5 and 52 thousand people (completely fictional). Below you'll find a table containing the data to play with.
The normalized values are marked with a colored background. This method lets us plot the two data sets in a bubble chart with similar sizes for each point, which makes them easier to compare. The maximum values for population (52 thousand people) and surface (300 km²) become the 1s in the normalized sets. (Remember that normalization collapses or expands the range of the data to the interval [0, 1]; in this case we avoid 0 and always work with a positive minimum value.)
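The normalization described above can be sketched in a few lines of Python. The three cities below are hypothetical values chosen only to respect the stated maxima (52 thousand people, 300 km²); they are not the rows of the table. The sketch also shows why density can be read from the chart: the ratio of the two normalized sizes is the density multiplied by the same constant for every city.

```python
# Hypothetical data for illustration (not the table from this post):
# population in thousands of people, area in square km.
cities = {
    "CityA": {"population": 52.0, "area": 150.0},
    "CityB": {"population": 20.0, "area": 300.0},
    "CityC": {"population": 5.0, "area": 100.0},
}

def normalize_to_max(values):
    """Divide every value by the maximum, so the largest becomes 1."""
    top = max(values)
    return [v / top for v in values]

names = list(cities)
pop_norm = normalize_to_max([cities[n]["population"] for n in names])
area_norm = normalize_to_max([cities[n]["area"] for n in names])

# Ratio of normalized sizes versus the real density (thousand people per km2).
ratios = {n: p / a for n, p, a in zip(names, pop_norm, area_norm)}
densities = {n: cities[n]["population"] / cities[n]["area"] for n in names}

for n, p, a in zip(names, pop_norm, area_norm):
    print(f"{n}: pop={p:.2f} area={a:.2f} "
          f"ratio={ratios[n]:.2f} density={densities[n]:.3f}")
```

Sorting the cities by `ratios` or by `densities` gives the same order, which is what lets the eye rank density from two bubbles without a third series.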
When we plot them in an Excel sheet, the bubble sizes of both series are scaled to the same maximum value. The final result in an Excel chart is:
From this chart we can easily conclude that City6 is the least inhabited, City1 the biggest, City4 the smallest, and City3 both the most inhabited and the most densely populated. So far, so good: we have a graphical representation that lets us infer a relation between two direct metrics without plotting it. But we also have an issue.
Issues and enhancements
As we noticed before, City3 is the most populated and also the one with the highest density, but what about its real size? We cannot compare it with the rest of the cities because the population bubble hides the area bubble. At most we can assume that it is smaller than City5, City6, City1 and City8, but there are no clues to compare it with the others.
I've found two ways to solve this issue, each with its pros and cons.
One possible solution is to normalize the data sets in an alternative way that ensures no population bubble covers its area bubble. We can do this by normalizing the population set to its highest value (the maximum will be 1) and the area set to its lowest value (the minimum will be 1).
The next table shows the normalized values after applying this trick: the highest population, 52 thousand people, is normalized to 1, and the lowest area, 60 km², is also normalized to 1. Only when both of these values belong to the same city will the two bubbles coincide.
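A minimal Python sketch of this alternative normalization follows. The numbers are hypothetical and only reuse the two extremes mentioned above (52 thousand people and 60 km²). Every normalized population ends up at or below 1 and every normalized area at or above 1, so no population value can exceed its area value.

```python
def normalize_to_max(values):
    """Divide by the maximum: results fall in (0, 1]."""
    top = max(values)
    return [v / top for v in values]

def normalize_to_min(values):
    """Divide by the minimum: results are all >= 1."""
    bottom = min(values)
    return [v / bottom for v in values]

# Hypothetical data, one entry per city (not the table from this post).
populations = [52.0, 20.0, 5.0]   # thousands of people
areas = [150.0, 300.0, 60.0]      # square km

pop_norm = normalize_to_max(populations)
area_norm = normalize_to_min(areas)

# Indices of cities whose population bubble would cover the area bubble.
hidden = [i for i, (p, a) in enumerate(zip(pop_norm, area_norm)) if p > a]

print(pop_norm)
print(area_norm)
print(hidden)
```

In the edge case where the most populated city is also the smallest one, both of its values are exactly 1 and the two bubbles coincide.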
With this new data, the bubble chart will display as follows:
The relative sizes of areas and populations are preserved, so they can be compared at first sight (cities 1, 8 and 5 are the biggest; cities 3 and 1 are the most crowded), and the population density is also easy to infer from the picture (city 3 is the densest).
This method also raises a new issue: the population bubbles are normalized to a smaller maximum value, so the range available to tell their sizes apart is shorter and less accurate. If we work with Excel, it is possible to enhance the display by setting the bubble parameters conveniently:
- Scaling bubble sizes to 120% enlarges the visual range of all the series.
- Setting the size value of the bubble to represent the width instead of the area increases the difference between both series of data.
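The effect of the second option can be seen with a little geometry. The Python sketch below is only an illustration of area-based versus width-based sizing, with proportionality constants dropped; it is an assumption about the geometry, not a description of how Excel computes its bubbles internally.

```python
import math

def radius_from_area(value):
    """Bubble whose AREA is proportional to value: radius grows as sqrt(value)."""
    return math.sqrt(value)

def radius_from_width(value):
    """Bubble whose WIDTH (diameter) is proportional to value: radius grows as value."""
    return value

# Two normalized values, one four times the other.
small, large = 0.25, 1.0

area_ratio = radius_from_area(large) / radius_from_area(small)
width_ratio = radius_from_width(large) / radius_from_width(small)

print(f"size = area : large bubble is {area_ratio:.1f}x as wide")
print(f"size = width: large bubble is {width_ratio:.1f}x as wide")
```

With area-based sizing the larger bubble is twice as wide; with width-based sizing it is four times as wide, so small values shrink faster and the two series separate more clearly. The price is that the visible area then grows with the square of the value, which exaggerates differences.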
The other, and easiest, way to enhance the visualization and avoid the overlapping issue is simply to apply transparency to the population series of bubbles. In Excel this is easy to do, and setting this fill property to 30% is satisfactory enough, as you can see in the next picture:
If you are an Excel power user you can also use macros to edit any property of a series according to any parameter or data value. For example, it is easy to set transparency only when the population bubble overlaps the area bubble, or to set a specific color in this situation. Attached you'll find a piece of code that you can use as a template or hint if you want to explore those possibilities.
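Independently of that attachment, the decision such a macro has to make can be sketched in Python. The function name, the sizes and the assumption that the population bubble is drawn on top of the area bubble are all mine; the 30% value is the one used earlier in this post.

```python
def fill_transparency(pop_size, area_size, overlap_value=0.30):
    """Return the transparency to apply to a population bubble.

    Assumes the population bubble is drawn on top of the area bubble at the
    same position, so it hides the area bubble whenever it is at least as large.
    """
    return overlap_value if pop_size >= area_size else 0.0

# Hypothetical normalized sizes per city: (population, area).
sizes = {
    "CityA": (1.00, 0.50),
    "CityB": (0.38, 1.00),
    "CityC": (0.10, 0.33),
}

styles = {name: fill_transparency(p, a) for name, (p, a) in sizes.items()}

for name, value in styles.items():
    print(f"{name}: transparency {value:.0%}")
```

Only CityA gets the transparent fill here; the other population bubbles stay opaque because the area bubble already shows around them.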
This post is not intended to rank among the most useful or deepest analyses, only to stress the importance of how we show our data. In my opinion, in this fast-changing environment, it is very useful and also fun to think about alternative ways of working, sharing and showing information.
I should say that I'm an Excel user, but I know there are many other tools on the market that are probably better, more usable and more powerful. I use Excel simply because it is available and because I have some (not much) skill with it.