Visualization 101: Practicing Visualization

Ethango
8 min read · Jul 21, 2021


Most people who study Data Science come from a mathematical and statistical background, and they often dismiss the importance of Data Visualization.

Trying to improve or build on other people's visualizations is good practice for sharpening your visualization skills. Here is a step-by-step guide to improving your data visualization.

I came across an amazing visualization on Bloomberg (https://www.bloomberg.com/graphics/2015-whats-warming-the-world/) that reveals the effect of an array of different factors (natural and man-made) on global warming.

The Visualization, as seen on Bloomberg.

The visualization shows an animated timeline of the increase in temperature compared to the 1880–1910 average. The dots on the right side act as a filter for the effect of each factor on global warming.

In truth, I don’t think this visualization is one of the best visualizations you can make! I don’t believe a beginner can make these visualizations. The question for us is: How can we find flaws in an amazing visualization like the one above?

In this article, I will show how good visualizations can also be bad visualizations. In short, presentation isn’t as important as information in Data Science. I’ll show my step-by-step process of critiquing a good visualization and creating my own visualization that is “better” than the one above.

Step 0: Download Data

The first step is to download the dataset!

The original dataset for this visualization can be found on Makeover Monday: https://data.world/makeovermonday/2021w3.

To download the file, view the dataset and click download.

The download button after previewing the dataset on Makeover Monday

Step 1: Data Exploration

The first important step is to explore the data and uncover any limitations and biases in it. Here, the data dictionary is a good starting point, as it shows users what each column represents.

The data dictionary lets users see the data distribution and common statistics such as mean, median, and skewness. It should also provide a definition and description of each column. However, sometimes this is not the case, and further research is needed. My dataset is an example of this, as its data dictionary fails to explain clearly what each column represents.

Pic 1: Data Dictionary of the chosen dataset. Here, we can recognize some columns, such as the months, but the columns after them are tough to identify.

The data dictionary on data.world’s website does not make the descriptions easy to find. There is almost no description of what each column name means, and you would need to explore the original data source to know what each of these columns represents.

Pic 2: The Description column is missing, and I can’t deduce what the column names mean

Looking for the data definitions can be time-consuming. Thankfully, the users of these datasets are active and kind enough to provide answers to some of my questions in the discussion section, though I still couldn’t find answers for the other columns.
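When the data dictionary is thin, the same statistics can be computed directly in a notebook. The following is a minimal sketch, assuming pandas is available; the numbers are toy values made up for illustration, not values from the real dataset, and only the column name `j_d` comes from the dataset.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from the real dataset.
df = pd.DataFrame({
    "year": [2001, 2002, 2003, 2004, 2005],
    "j_d": [0.10, 0.20, 0.20, 0.30, 1.20],
})

value_columns = ["j_d"]  # the columns we want statistics for

# One row per column: the statistics a data dictionary usually reports.
summary = pd.DataFrame({
    "mean": df[value_columns].mean(),
    "median": df[value_columns].median(),
    "skew": df[value_columns].skew(),
    "missing": df[value_columns].isna().sum(),
})
print(summary)
```

A mean that sits well above the median, together with a positive skew, is a quick hint that a few large values are pulling the distribution to the right.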

Pic 3: Looking at the discussion for answers

After looking at the data source, I uncovered what most of the columns in our dataset represent.

The values in each month represent the Global, Land, and Ocean Temperature Index measured at 0.01 Celsius. Each value is the difference between the temperature of that particular month and the 1880–1910 average.

  1. hemisphere: Takes one of three values (Southern, Northern, Global). Each row holds the temperature index for the Southern Hemisphere, the Northern Hemisphere, or the globe as a whole.
  2. year: The year this data belongs to
  3. jan to dec: Temperature anomaly for each month from January to December, relative to the 1880–1910 average temperature, in Celsius
  4. j_d: J-D annual mean (January to December)
  5. d_n: D-N annual mean (unsure, but probably December to November)
  6. dfj (Winter): Average for Winter (the column name might be a misspelling of djf)
  7. mam (Spring): Average for Spring
  8. jja (Summer): Average for Summer
  9. son (Autumn): Average for Autumn

If you know the field of study and how these data are collected, you can provide better insights into the data dictionary, but these were what I could collect.
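One way to test guesses like these is to recompute the summary columns from the monthly ones and compare. The sketch below does that with toy numbers that are not from the real dataset. It assumes, without confirmation from the data source, that D-N runs from the previous year's December to the current November and that the winter average uses the previous December plus January and February.

```python
from statistics import mean

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def j_d(row):
    """Calendar-year mean: January to December of the same year."""
    return mean(row[m] for m in MONTHS)

def d_n(prev_row, row):
    """Assumed definition: previous December through current November."""
    return mean([prev_row["dec"]] + [row[m] for m in MONTHS[:11]])

def djf(prev_row, row):
    """Assumed definition: previous December, then January and February."""
    return mean([prev_row["dec"], row["jan"], row["feb"]])

# Toy rows (hypothetical values): one dict per year, keyed by month column.
prev_year = {m: 0.0 for m in MONTHS}
prev_year["dec"] = 1.2
this_year = {m: 0.1 for m in MONTHS}

print(round(j_d(this_year), 3))             # calendar-year mean
print(round(d_n(prev_year, this_year), 3))  # shifted by one month
print(round(djf(prev_year, this_year), 3))  # winter average
```

If the recomputed values match the `j_d`, `d_n`, and seasonal columns in the real file (up to rounding), the guessed definitions are probably right; if they do not, the original data source needs another look.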

Limitations and Biases

Do you notice a problem?

There is a gap between the columns in our dataset and the factors shown in the Bloomberg visualization!

Here’s an example:

The Land Use data is not in our Makeover Monday dataset

Where are the columns that represent Land Use? This is only one of several factors that do not exist in our dataset. The reason is probably that Bloomberg downloaded a much more detailed version of the dataset from the source website, and these additional details are not shown on Makeover Monday’s website. For this reason, we might have to explore the more detailed data later. For now, let’s work with what we have on Makeover Monday’s website!

The first limitation is that the dataset is limited in granularity. We only have information down to the month, so we are unable to see the change in temperature at the scale of weeks or days. This matters because each value is a mean change in temperature, and weekly or daily data could reveal variation that a monthly mean hides.

The second limitation is that the information is recorded as a change in temperature relative to a past average instead of the actual temperature itself. People may fail to grasp the normal temperature or the implications of temperature at certain levels (e.g., at x degrees Celsius/Fahrenheit, sea levels will rise by y cm).

A third limitation is the potential for bias in measurement. How are these data being measured? Where are they being measured? If the data are taken from different points, how do we get the final value? As seen from the data page below, each data point is a single number; this raises questions about how those values are calculated.

Pic 4: The column has one value per year; how this value is obtained is hard to understand.

The fourth limitation is outliers and deciding how to handle them in our dataset. Here, for the j_d column, we see several outliers on the right, which means our data is right-skewed.

Pic 5: Outliers in the Data
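For readers who want to check for outliers outside the data.world preview, here is a small sketch using the common 1.5 × IQR rule. This rule is my choice for illustration, not necessarily what the preview uses, and the values below are toy numbers rather than the real j_d column.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[outside]

# Toy anomalies (hypothetical): mostly near zero, with one large value.
j_d = pd.Series([-0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 1.5], name="j_d")

outliers = iqr_outliers(j_d)
print(outliers.tolist())  # values flagged by the rule
print(j_d.skew() > 0)     # a positive skew means a longer right tail
```

Flagging a value does not mean it should be removed. In a temperature series, the largest values may simply be the most recent years, which is exactly the story the chart should tell.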

A fifth limitation is that the definitions of some of the variables in our dataset are not clear. These variables may or may not be used depending on the type of visualization you aim to make, but the ambiguity limits our understanding of the data. Of course, if you find out what those variables mean, feel free to comment down below!

Another interesting limitation is that despite having excess data, the Bloomberg visualization doesn’t answer all the questions. It does not allow filtering the data by hemisphere. This will be one of the ways we can improve the visualization: by answering questions it wasn’t able to.

While looking for the limitations of the dataset, I found another Medium Article that explores the bias and limitations of the dataset. Here is a link to that article:

It is an additional resource on how you may explore the biases and limitations of the data that are not covered in this article.

Step 2: Defining the Problem Statement + Pre-Processing

The second step is to define a problem statement given the limitations and biases present in the data. Note, however, that you cannot always prevent these biases when visualizing the data, because sometimes you are using someone else's data. This is not a problem, since most datasets are assumed to be imperfect.

For this dataset, I have chosen to create a Data Dashboard. The dashboard will answer the following question: “How are the world’s temperatures changing in each hemisphere?”

Defining the problem statement allows me to plan how my data should be pre-processed. After downloading the data, I decided to separate it by hemisphere (Northern, Southern, and Global) and work from these segmented data.

Using Jupyter Notebook, I was able to segment the data by hemisphere and change its format: instead of having a column for each month, I now have that information stored in a new ‘date’ column.

Original Dataset
Changed Dataset for Northern Hemisphere

The transformation isn’t perfect, but you can see that each row now corresponds to one month between 1880 and 2020. I also have several DataFrames, one for each hemisphere. This makes it easier to create a visualization in Tableau or in online tools such as Flourish.
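The reshaping described above can be sketched roughly as follows. This is not the exact notebook code; it assumes pandas, lowercase month column names (`jan` to `dec`) as listed in the data dictionary, and two toy rows with made-up values. The names `to_long` and `anomaly` are ones I introduce here for illustration.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_NUMBER = {name: i + 1 for i, name in enumerate(MONTHS)}

# Toy wide-format rows (hypothetical values): one row per hemisphere and year.
wide = pd.DataFrame([
    {"hemisphere": "Northern", "year": 1880,
     **{m: 0.01 * i for i, m in enumerate(MONTHS)}},
    {"hemisphere": "Southern", "year": 1880,
     **{m: -0.01 * i for i, m in enumerate(MONTHS)}},
])

def to_long(frame: pd.DataFrame) -> pd.DataFrame:
    """Turn the twelve month columns into one row per month with a date."""
    long = frame.melt(id_vars=["hemisphere", "year"], value_vars=MONTHS,
                      var_name="month", value_name="anomaly")
    # Build a date for the first day of each month from year and month number.
    long["date"] = pd.to_datetime(pd.DataFrame({
        "year": long["year"],
        "month": long["month"].map(MONTH_NUMBER),
        "day": 1,
    }))
    long = long.sort_values(["hemisphere", "date"]).reset_index(drop=True)
    return long[["hemisphere", "date", "anomaly"]]

long = to_long(wide)

# One DataFrame per hemisphere, ready to export for Tableau or Flourish.
frames = {name: group.reset_index(drop=True)
          for name, group in long.groupby("hemisphere")}
print(frames["Northern"].head())
```

Each per-hemisphere frame can then be written out with `to_csv` and loaded into the visualization tool of your choice.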

Step 3: Creating Visualization

The third step is to create the visualization with the tool of your choice! In my case, I decided to work in Tableau, as it lets me easily combine different visualizations into a dashboard.

The visualization created through Tableau Public

So how does this visualization add to the one on Bloomberg's website? For one, we can see in which years the average global temperature increased compared to the 1880 to 1910 average.

The annotations and color choices help distinguish the years in which the average global temperature increased significantly. This is clearly seen from the choice of color, where blue indicates a decrease in average global temperature and orange/red indicates an increase.

One difference in our visualization is the story it tells. Makeover Monday has a lot of stories, but this visualization tells a different one by segmenting the data by hemisphere. In hindsight, it might have been better overall to compare only the Northern and Southern Hemispheres; that comparison can be seen in the side-by-side line chart.
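The color rule is simple enough to write down as a function if you want to reproduce it outside Tableau. This is only a sketch: the 0.5 cutoff between orange and red is a hypothetical threshold I picked for illustration, not the setting used in the dashboard.

```python
def anomaly_color(anomaly: float, strong: float = 0.5) -> str:
    """Map a temperature anomaly to a color name.

    Blue marks a decrease, orange a moderate increase, and red a strong
    increase. The `strong` cutoff is an assumed, adjustable threshold.
    """
    if anomaly < 0:
        return "blue"
    if anomaly < strong:
        return "orange"
    return "red"

# Toy anomalies (hypothetical values) and the colors they would receive.
for value in (-0.3, 0.2, 0.9):
    print(value, anomaly_color(value))
```

The returned names can be passed to most plotting libraries as colors, so the same encoding carries over if you rebuild the chart in code.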
