Last Updated on September 22, 2021 by dgvause
In 2020, as the pandemic’s momentum gathered, my employer reduced our workforce to 50% by having us work alternating weeks. The goal of the reduction was to slow spread by keeping us more physically separated. When this had limited effect and as the pandemic spread, we went to a skeleton crew and only mission-essential staff went to work. In the months off, I had the time to investigate the latest technologies that were of interest to me.
My initial interest in data science arose while tracking the international Covid-related data. The infection rates were not per capita. Comparing absolute infection numbers for Italy and, say, the United States was meaningless. I remember coming across a chart showing the infections for “selected countries”. The U.S., of course, was present and ranking at or near the top. But what about the other “selected countries”? What was the selection heuristic? How would the U.S. fare in a ranking of all countries? The inability to find graphs showing this information hugely frustrated me. I began a search for a way to represent the data myself.
I discovered Python, the first scripting language that I actually liked. Perl can be amazingly arcane. Any language whose design motto is “There’s more than one way to do it” is going to present a neophyte with a bewildering array of choices for even simple tasks. The code is full of side effects that, while nifty for those who call themselves Perl Monks, are confusing to the rest of us. This makes for a steep learning curve. Ruby is an elegant, fully object-oriented language, but no one uses it. Then there’s Python. It took the world by storm, winning over beginners and experienced programmers alike. Since then, scientists and researchers have adopted it, implementing a huge number of packages to support their needs.
Learning Python brought me to Matplotlib. It’s a wonderful, hugely powerful visualization library. It can display nearly any kind of two-dimensional visualization in existence. I now recognize its output in many research papers, as it implements many statistical tools such as error bars. Armed with it, I wondered, after still another of America’s mass shootings, just what the relationship is between per capita gun ownership and per capita gun deaths. I downloaded the data from worldpopulation.com, a nice source of data in CSV format.
With Matplotlib, I produced the chart below. I used another software package, the scikit-learn machine learning library, to do a least-squares fit on the data. It produced a coefficient of correlation of 0.345, which denotes a weak positive association between gun ownership and deaths. I left out Wyoming because its gun ownership is extremely high, on the basis that it is legitimate to exclude outliers.
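For readers who want to see what a least-squares fit and a correlation coefficient actually compute, here is a minimal sketch. It is not my original script: it uses only the Python standard library instead of scikit-learn, the function name `least_squares_fit` is made up for illustration, and the numbers are invented sample points rather than the state data.

```python
from math import sqrt


def least_squares_fit(xs, ys):
    """Return (slope, intercept, r) for the best-fit line y = slope * x + intercept.

    r is the Pearson correlation coefficient between xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Invented sample points for illustration only (not real ownership or death rates).
ownership = [1.0, 2.0, 3.0, 4.0]
deaths = [3.0, 5.0, 7.0, 9.0]
slope, intercept, r = least_squares_fit(ownership, deaths)
```

One thing worth checking when using scikit-learn for this: the `score` method of `LinearRegression` reports R², which for a single predictor is the square of r, so the two numbers should not be confused.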
I liked my chart but found myself wondering about the individual states. Which are they?
I was tantalized by the wonderful visualizations on websites like ourworldindata.org, another great site for open-source data. To my delight, I found Bokeh. It is another visualization package, perhaps not able to produce as many graph types as Matplotlib, but it does one thing well that Matplotlib is not built for: user interaction in the browser. Bokeh supports interactive mouse-over events, so now I could mouse over my states and produce popups with information, called “tooltips”.
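A tooltip setup in Bokeh can be sketched roughly as follows. This is an illustration rather than my actual chart code: the column names (`state`, `ownership`, `deaths`) are my own choices for the example, and the three rows are placeholder values, not real statistics.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Placeholder rows for illustration only; not real state statistics.
data = {
    "state": ["State A", "State B", "State C"],
    "ownership": [10.0, 20.0, 30.0],
    "deaths": [5.0, 9.0, 12.0],
}
source = ColumnDataSource(data)

# Each tooltip pairs a label with a column reference; "@name" pulls the
# value from the matching column for the point under the mouse.
plot = figure(
    title="Gun ownership vs. gun deaths (placeholder data)",
    x_axis_label="Gun ownership",
    y_axis_label="Gun deaths",
    tooltips=[
        ("State", "@state"),
        ("Ownership", "@ownership"),
        ("Deaths", "@deaths"),
    ],
)
plot.scatter("ownership", "deaths", source=source, size=10)
```

Passing the plot to `show` from `bokeh.plotting` renders it to an HTML page where hovering over a point displays its state name and values.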
I believe that a good visualization encourages further questions. Of course, mine did. I moused over some of the states with the lowest death rates. Hmmm. They were mostly Democratic: Hawaii, Connecticut, New Jersey, even New York. Then I moused over some of the states with the highest death rates: Louisiana, Alaska, Montana. This led to my next thought. Why don’t I color-code by party affiliation? Thus, I produced this: