Real World Data Visualization With Python and matplotlib Part 1
This the first part in a multipart series on real world data visualization with python and matplotlib. I will add links to other posts as they become available.
Python and matplotlib are powerful tools for parsing and visualizing data. Python is an easy to use scripting language with awesome tools for working with data. Matplotlib is the most comprehensive and easiest to use data visualization tool in Python. Together they can help extract meaning from data.
Finding Data
Petabytes of data exist in the world today, so finding some data is not a problem. But finding the right data for you is the challenge.
When you are looking for data to visualize the data set you choose needs:
-
Data you are familiar with and understand
- This helps immensely when sanitizing the data. If you don't know what a bogus value looks like your plot can convey the wrong information.
-
A parseable data format like
json
orcsv
- Anything in html, xml, or other formats needs processing, which can be difficult and time consuming (for python xml parsing look into beautiful soup)
-
Customizable data intervals
- This allows you to iterate on a small data set before working with a larger data set, which can be slow.
Given these suggestions we are going to start with a data set that I am familiar with, weather data from the National Renewable Energy Lab National Wind Technology Center in Boulder Colorado.
Downloading NWTC Weather Data
From the NWTC website we are going to download temperature data (Temperature (2m)) from January 2019.
Select the Selected 1-Min Data (ZIP Compressed)
option and hit submit. Unzip the downloaded file, move it to the desired directory and rename it to 2019-01-nwtc-temp-2m.csv
.
For convenience you can download this data here.
Inspecting
Lets go ahead and see what this data looks like. We can use the head
command to see the first part of the file:
$ head 2019-01-nwtc-temp-2m.csv
DATE (MM/DD/YYYY),MST,Temperature @ 2m [deg C]
1/1/2019,00:00,-16.59
1/1/2019,00:01,-16.6
1/1/2019,00:02,-16.61
1/1/2019,00:03,-16.62
1/1/2019,00:04,-16.63
1/1/2019,00:05,-16.64
1/1/2019,00:06,-16.65
1/1/2019,00:07,-16.66
1/1/2019,00:08,-16.66
From this we can see the shape of the data which has 3 rows, the date, the time in MST and the temperature is Celsius degrees.
We can count the number of rows with the wc (word count) utility, passing the -l
argument to get the number of lines:
$ wc -l 2019-01-nwtc-temp-2m.csv
44641 2019-01-nwtc-temp-2m.csv
Lets verify this number. Given the first couple of rows it seems like they record the temperature every minute. Lets add it up, 60 minutes in an hour, 24 hours a day, and 31 days in January works out to (60 * 24 * 31
) 44640 minutes. Add in an extra row for the titles and 44641 proves that they recorded the temperature every minute.
Parsing Data
To get a feel for parsing lets find the min and max temperatures in degrees Fahrenheit.
First lets read the csv rows into an array using the built-in python csv package
import csv
data = []
with open("2019-01-nwtc-temp-2m.csv") as datafile:
reader = csv.reader(datafile)
for row in reader:
data.append(row)
print(data[1])
That prints:
['1/1/2019', '00:00', '-16.59']
Notice that the type for the temperature is a string.
print(type(data[1][2]))
<class 'str'>
Lets convert it to a float so we can do some math...
for row in data:
row[2] = float(row[2])
Traceback (most recent call last):
File "parse_data.py", line 12, in <module>
row[2] = float(row[2])
ValueError: could not convert string to float: 'Temperature @ 2m [deg C]'
Oops, we need to remove the labels in the first row.
data = data[1:-1]
for row in data:
row[2] = float(row[2])
print(type(data[1][2]))
<class 'float'>
Now that we have some numbers lets convert the temperature to Fahrenheit:
def convert_celsius_to_farenheit(celsius_deg):
return (celsius_deg * 9.0 / 5.0) + 32.0
temperature_data = []
for row in data:
temperature_data.append(convert_celsius_to_farenheit(float(row[2])))
print(temperature_data[0])
2.137999999999998
Now lets find the min and max temperatures:
min_temp = temperature_data[0]
max_temp = min_temp
for temp in temperature_data:
if temp < min_temp:
min_temp = temp
if temp > max_temp:
max_temp = temp
print("In January 2019 the temperature ranged from {} to {}".format(min_temp, max_temp))
In January 2019 the temperature ranged from -0.5799999999999983 to 61.232
It works! But lets fix the temperature format:
def format_temp(input):
return "{:.0f}°F".format(input)
print(
"In January 2019 the temperature ranged from {} to {}".format(
format_temp(min_temp), format_temp(max_temp)
)
)
In January 2019 the temperature ranged from -1°F to 61°F
Now lets plot the data...
First steps with matplotlib
Go ahead and install matplotlib and numpy.
$ pip install matplotlib
$ pip install numpy
And test your install. Your install should look something like the output below.
$ python3
Python 3.7.2 (default, Feb 12 2019, 08:15:36)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib
>>> import numpy
>>> matplotlib.__version__
'3.0.2'
>>> numpy.__version__
'1.16.0'
If you see any errors fix them before proceeding.
Not lets plot some dummy data:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use("Agg")
plt.plot([1, 2, 3, 4])
plt.savefig("first_plot.png")
And run it with:
$ python3 test_mpl.py
And you should have your first plot.
Plotting temperature data
From here we need to plug our temperature data array into matplotlib.
plt.plot(temperature_data)
plt.savefig("January-2019-NWTC-Temp-2m-Plot.png")
Whoa, the x axis looks all wrong. Thats because we haven't added our dates to the graph. Lets do that next.
Parsing Dates
To add the correct data to the x axis we need to create a new array with the dates parsed into datetime
objects. To do this we use the datetime package and datetime.strptime
to parse the date from a string, and plug this data into matplotlib as the x axis. Matplotlib understands datetime
objects and renders them accordingly:
from datetime import datetime
dates = []
for row in data:
date = datetime.strptime("{} {}".format(row[0], row[1]), "%m/%d/%Y %H:%M")
dates.append(date)
plt.plot(dates, temperature_data)
plt.savefig("January-2019-NWTC-Temp-2m-Plot-With-Date.png")
Which yields:
Next lets make it pretty.
Adding Labels
Here we are using matplotlib's built in methods to add labels to the data and make the x axis readable. We are also changing the font to Roboto.
plt.plot(dates, temperature_data)
plt.rcParams["font.family"] = "Roboto"
plt.title("NWTC - Temperature @ 2m - January 2019")
plt.ylabel("Temperature (°F)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("January-2019-NWTC-Temp-2m-Plot-Pretty.png")
Below are the contents of plot_nrel_data.py
:
import csv
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use("Agg")
filename = "2019-01-nwtc-temp-2m.csv"
data = []
with open(filename) as datafile:
reader = csv.reader(datafile)
for row in reader:
data.append(row)
# Remove the first row of data
data = data[1:-1]
def convert_celsius_to_farenheit(celsius_deg):
return (celsius_deg * 9.0 / 5.0) + 32.0
temperature_data = []
for row in data:
temperature_data.append(convert_celsius_to_farenheit(float(row[2])))
min_temp = temperature_data[0]
max_temp = min_temp
for temp in temperature_data:
if temp < min_temp:
min_temp = temp
if temp > max_temp:
max_temp = temp
def format_temp(input):
return "{:.0f}°F".format(input)
print(
"In January 2019 the temperature ranged from {} to {}".format(
format_temp(min_temp), format_temp(max_temp)
)
)
dates = []
for row in data:
# 1/1/2019 00:00
date = datetime.strptime("{} {}".format(row[0], row[1]), "%m/%d/%Y %H:%M")
dates.append(date)
plt.rcParams["font.family"] = "Roboto"
plt.plot(dates, temperature_data)
plt.title("NWTC - Temperature @ 2m - January 2019")
plt.ylabel("Temperature (°F)")
plt.xticks(rotation=45)
# Force matplotlib to recalculate boundaries to avoid text clipping after rotating xticks
plt.tight_layout()
plt.savefig("January-2019-NWTC-Temp-2m-Plot-Pretty.png", dpi=500)
In part 2 we are going to streamline the data acquisition process and add wind data.