Bodensee Level Analysis with Python
Analyzing the level of Lake Constance (Bodensee) with Python
In this post I want to discuss a data analysis experiment I recently worked on. My aim was to use AWS SageMaker and Python to visualize public data. This post is structured as follows: First, I am going to explain the goal. Then, I will briefly describe the setup. After that, I will walk through the individual steps of my analysis.
A short summary of the output can also be found in this Twitter thread:
Goal
In short: My goal is to recreate a graph showing the level of Lake Constance (Bodensee).
I like to dig into data. I enjoy questioning data. And I like data visualization.
A simple but interesting data visualization I keep coming back to can be found on the website of the Hochwasservorhersagezentrale Baden-Württemberg (HVZ BaWü). This is what it looks like:
The graph shows the current level of the Bodensee (Lake Constance) as a blue line. Furthermore, it shows the long-term mean for each day (green line), the maximum value (red line), and the minimum value (black line). Long-term in this case means that all values since 1850 are considered (update: they seem to have shortened the lookback window to the time frame 1980 to 2021). As a first step, I want to re-create this graph. Then I will dive deeper into the data by looking at certain years or aspects.
Setup
I built this analysis in AWS. I simply enjoy working in this cloud environment, and I want to learn more about it. SageMaker is the service of choice in this case; it was recommended to me. I have to admit that I always thought SageMaker was only for Machine Learning (ML). But of course, ML is all about data analysis, and with that, SageMaker contains all the tools you might want for data analysis. So that was my choice.
This was my first project with Jupyter Notebooks, and the experience was excellent. The data was analyzed with Pandas (a Python data analysis library); matplotlib was used for data visualization. But more on that later.
Data Analysis
Please note that this was my first data analysis of this kind in Python. What I am doing here might not be efficient, and it is certainly not elegant. For this pet project, I simply aimed to get the results I wanted.
Get source data
The mentioned HVZ is a public institution, and with that, the data is publicly available. You can download it from https://udo.lubw.baden-wuerttemberg.de/public/pages/home/welcome.xhtml (sorry, all in German).
Select Wasser -> Oberflächengewässer -> Hydrologische Landespegel -> Hydrologische Landespegel (2x)
On the next page, select
- Gewässer = Bodensee
- Station = Konstanz Bodensee
- Komponente = Wasserstand, W-Stand, cm
- Produkt = Tagesmaximalwerte
- Zeitraum = 01.01.1850 - 31.12.2022
You can define a future end date for ‘Zeitraum’; the source will simply return the data that is available.
Export the data. This is about 6 MB.
Some data cleanup
The output file needs some tweaks.
This is what I did:
- rows 1-6 contain the metadata; I removed them
- rows 7-8 are empty; deleted as well
- row 9 contains the column headers; I replaced the word “Gewässer” with “Gewasser”, just to get rid of the special character
We will rename the columns to English as one of the first steps of the data processing in Python.
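If you prefer to script these tweaks instead of editing the file by hand, the same cleanup can be done in pandas. A minimal sketch, assuming the raw export is named Max_KN_1850-bis-heute.csv (adjust skiprows and encoding to your export):
# skip the 8 metadata/empty rows so that row 9 becomes the header
import pandas as pd
raw = pd.read_csv('Max_KN_1850-bis-heute.csv', skiprows=8)
# replace the special character in the column name
raw = raw.rename(columns={'Gewässer': 'Gewasser'})
raw.to_csv('Max_KN_1850-bis-heute_clean.csv', index=False)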
Analysis Steps
I stored the data in an S3 bucket.
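One way to get the file there is boto3; a minimal sketch, assuming the bucket name s3davibec and the Bodensee/ prefix that the read code below suggests:
# upload the cleaned CSV to the S3 bucket used in the notebook
import boto3
s3 = boto3.client('s3')
s3.upload_file('Max_KN_1850-bis-heute_clean.csv',
               's3davibec',
               'Bodensee/Max_KN_1850-bis-heute_clean.csv')
Pandas can then read the s3:// path directly, provided the s3fs package is available in the notebook environment.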
Open SageMaker Studio and launch a Python 3 Notebook.
1/ Import Pandas
#import pandas
import pandas as pd
2/ Read csv file
# read csv file from S3 bucket
bucket = 's3davibec/Bodensee'
data_key = 'Max_KN_1850-bis-heute_clean.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)
# print
print(df.head())
Resulting in
3/ Rename to English
#Rename all columns to English
df = df.rename(columns={
'Messstellennummer': 'measuring_point_number',
'Stationsname': 'station_name',
'Gewasser': 'waters',
'Parameter': 'parameter',
'Datum / Uhrzeit': 'date_time',
'Zeitbezug': 'time_zone',
'Wert': 'value',
'Einheit': 'unit',
'Produkt': 'value_type'
})
# print
print(df.head())
Resulting in
4/ Add Date Conversions
We need some additional, derived data points, like the month-day combination “mm_dd”.
#add all date conversions
df['date'] = pd.to_datetime(df['date_time']).dt.date
df['year'] = pd.to_datetime(df['date_time']).dt.year
df['month'] = pd.to_datetime(df['date_time']).dt.month
df['day'] = pd.to_datetime(df['date_time']).dt.day
#convert Month and Day to two digit format
df["month"] = df.month.map("{:02}".format)
df["day"] = df.day.map("{:02}".format)
#new column
df["mm_dd"] = df["month"].astype(str) + "_" + df["day"].astype(str)
# print
print(df.head())
Resulting in
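As an aside: the mm_dd column can also be built in a single step with strftime; this is my shortcut, not the original notebook code.
# equivalent shortcut: format the timestamp directly as "mm_dd"
df['mm_dd'] = pd.to_datetime(df['date_time']).dt.strftime('%m_%d')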
5/ Reduce Data to the Minimum
Create a new dataset with a minimum of data to make things easier in the next couple of steps.
# creating a new dataset dfs with the minimum data I will need in the next couple of steps
dfs = df[['value','date','year','mm_dd']]
# print
dfs.head()
Resulting in
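One small caveat: dfs is a slice of df. If you plan to modify it in place later, taking an explicit copy avoids pandas' SettingWithCopyWarning; a defensive habit rather than something these steps strictly require.
# take an explicit copy so later in-place changes do not warn
dfs = df[['value', 'date', 'year', 'mm_dd']].copy()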
6/ Check Data Types
This might be a good time to check data types.
# check the data types of the various fields
# apply the dtype attribute
result = dfs.dtypes
print("Output:")
print(result)
Resulting in
Output:
value int64
date object
year int64
mm_dd object
dtype: object
7/ Get Min, Max, Mean
Now I am creating a new data set with the three long-term values. For every day of the year, we calculate the Max, the Min, and the Mean.
# create new dataframe with min max mean values for each month date combination
df_mmm = dfs.groupby('mm_dd').agg(value_min=('value', 'min'),
value_mean=('value', 'mean'),
value_max=('value', 'max')).reset_index()
# print
df_mmm.head()
Resulting in
8/ Bring the Month Info Back In
I reduced the data in step 5/, but I failed to keep the month info. For the data visualization, this info is handy, so I re-create a month column.
# extract the first two characters from field mm_dd in order to get the month information back
df_mmm['month'] = df_mmm['mm_dd'].str[:2]
# print
df_mmm.head()
Resulting in
9/ Visualize: Max, Min, Avg
As a first visualization, I simply plot the Max, Min, and Mean values. By comparing this to the original graph from the data source, I can check whether my processing is okay. I use matplotlib for this purpose.
# print min, max, and mean for long-term data
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
df_mmm.plot(x='month', kind='line')
plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021")
plt.xlabel("Month")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in
This looks all right.
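Since almost the same plotting code recurs in the following steps, it could be wrapped in a small helper function. A minimal sketch (my own refactoring, not part of the original notebook):
import matplotlib.pyplot as plt
def plot_levels(frame, x, title, xlabel="Month"):
    # line plot of all value columns against x, with the shared styling
    frame.plot(x=x, kind='line')
    plt.ylim(bottom=0, top=800)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel("Level [cm]")
    plt.grid()
    plt.show()
# usage, e.g.:
# plot_levels(df_mmm, 'month', "Level of Lake Constance: 1850 to 2021")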
10/ Add 2022 data
The intention of this step is to create a data frame with Max, Min, Mean, and 2022 data.
# create dataframe with 2022 data based on dfs
dfs2022 = dfs.loc[dfs['year']==2022]
# drop not needed columns
dfs2022 = dfs2022.drop(columns=['date','year'], axis=1)
#rename value column
dfs2022 = dfs2022.rename(columns={'value': 'value_2022'})
# print
dfs2022.head()
Resulting in
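Next, merge the 2022 column onto the long-term frame.
# merge the 2022 data onto the long-term min/max/mean dataframe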
df_mmm_2022 = pd.merge(df_mmm, dfs2022, how="left", on="mm_dd")
# print
df_mmm_2022.head()
Resulting in
Visualize the data.
# print min, max, and mean for long-term data
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
df_mmm_2022.plot(x='month', kind='line')
plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021")
plt.xlabel("Month")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in
11/ Check for years with low levels at that time of the year
In August, people thought that Lake Constance might see a record low for that time of the year. All of Germany saw a record drought. With that, I wanted to understand which years had seen even lower lake levels.
Get the lowest values for the fifth of August:
# get the lowest values on a certain day of the year
# ---
# filter data frame by value
filter_value = {'08_05'}
dfs_min = dfs.loc[dfs['mm_dd'].isin(filter_value)]
# print
dfs_min.head()
Resulting in
Now this data needs to be sorted.
dfs_min = dfs_min.sort_values(by='value', ascending=True)
# print
dfs_min.head(10)
Resulting in
Now we know that the water level was lower than in 2022 on four occasions before:
- 1949
- 2003
- 2006
- 1964
The list is sorted by value in ascending order; 1949 saw the lowest value ever recorded on 08/05.
It is worth noting that five of the years in this (negative) Top 10 list are after 2000.
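To repeat this check for any other day of the year, the two cells above could be folded into a small helper; my own addition, sketched under the same column names:
def lowest_years(frame, mm_dd, n=10):
    # return the n rows with the lowest level on the given day of the year
    day = frame.loc[frame['mm_dd'] == mm_dd]
    return day.sort_values(by='value', ascending=True).head(n)
# usage, e.g.:
# lowest_years(dfs, '08_05')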
12a/ Dive deeper
Let us have a closer look at the 1949 data.
# create dataframe with 1949 data based on dfs
dfs1949 = dfs.loc[dfs['year'] == 1949]
# drop not needed columns
dfs1949 = dfs1949.drop(columns=['date','year'], axis=1)
#rename value column
dfs1949 = dfs1949.rename(columns={'value': 'value_1949'})
# print
dfs1949.head()
Resulting in
# merge the 1949 data to the dataframe created for 2022
df_mmm_1949 = pd.merge(df_mmm_2022, dfs1949, how="left", on="mm_dd")
# print
df_mmm_1949.head()
Resulting in
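Steps 12a, 13, and 14 all repeat the same slice-and-rename pattern for a single year. A small helper could factor that out; a sketch of my own, while this post keeps the explicit cells:
def year_column(frame, year):
    # slice one year out of the data and rename its value column to value_<year>
    one_year = frame.loc[frame['year'] == year, ['mm_dd', 'value']]
    return one_year.rename(columns={'value': f'value_{year}'})
# usage, e.g.:
# df_mmm_1949 = pd.merge(df_mmm_2022, year_column(dfs, 1949), how="left", on="mm_dd")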
Create a visualization
# print min, max, and mean for long-term data
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
df_mmm_1949.plot(x='month', kind='line')
plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021 + 2022 and 1949")
plt.xlabel("Month")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in
12b/ Zoom in on Summer 1949
We need to have a closer look here.
# let us zoom in a little bit
# ---
# filter data frame by value
filter_value = {'05','06','07','08'}
df_mmm_1949_zoom = df_mmm_1949.loc[df_mmm_1949['month'].isin(filter_value)]
# drop not needed columns
df_mmm_1949_zoom = df_mmm_1949_zoom.drop(columns=['value_max'], axis=1)
# print
df_mmm_1949_zoom.head()
Resulting in
Probably good to check the number of entries.
# check number of rows in df
len(df_mmm_1949_zoom)
Resulting in
124
That sounds right.
Now we visualize this.
# print zoom
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
df_mmm_1949_zoom.plot(x='mm_dd', kind='line', rot=90)
# plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021 + 2022 and 1949")
plt.xlabel("MM_DD")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in
13/ All-Time Max
Next, we have a look into the all-time max value.
# now let us check out the max value
# ---
# get the max values for all columns
print(dfs.max())
Resulting in
value 576
date 2022-08-06
year 2022
mm_dd 12_31
dtype: object
But in which year?
# now we know that the highest value is 576, but in which year did that occur?
dfs_max = dfs.loc[dfs['value'] == 576]
# print
dfs_max
Resulting in
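As an aside, the max lookup and the year question can be answered in one go with idxmax; my shortcut, not the original cells:
# fetch the row holding the all-time maximum directly
dfs.loc[[dfs['value'].idxmax()]]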
The answer is 1890. Now we want to see what that year looked like.
# create dataframe with 1890 data based on dfs
dfs1890 = dfs.loc[dfs['year'] == 1890]
# drop not needed columns
dfs1890 = dfs1890.drop(columns=['date','year'], axis=1)
#rename value column
dfs1890 = dfs1890.rename(columns={'value': 'value_1890'})
# print
dfs1890.head()
Resulting in
Merge the data.
dfs_mmm_1890 = pd.merge(dfs1890, df_mmm, how="left", on="mm_dd")
# print
dfs_mmm_1890.head()
Resulting in
And now we visualize
# print min, max, and mean for long-term data
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
dfs_mmm_1890.plot(x='month', kind='line')
plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021 + 1890")
plt.xlabel("Month")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in
14/ Check year 1999
I still remember the flood of summer 1999; at that time, the military took us to school. Let us have a look at that year.
# and what was 1999
# ---
# create dataframe with 1999 data based on dfs
dfs1999 = dfs.loc[dfs['year'] == 1999]
# drop not needed columns
dfs1999 = dfs1999.drop(columns=['date','year'], axis=1)
#rename value column
dfs1999 = dfs1999.rename(columns={'value': 'value_1999'})
# print
dfs1999.head()
Resulting in
dfs_1890_1999 = pd.merge(dfs1890, dfs1999, how="left", on="mm_dd")
# print
dfs_1890_1999.head()
Resulting in
dfs_mmm_1890_1999 = pd.merge(dfs_1890_1999, df_mmm, how="left", on="mm_dd")
# print
dfs_mmm_1890_1999.head()
Resulting in
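With three frames to combine, the two chained merges could also be written as a single reduce; a sketch with identical behaviour:
# fold the list of frames into one dataframe with successive left merges
from functools import reduce
frames = [dfs1890, dfs1999, df_mmm]
dfs_mmm_1890_1999 = reduce(
    lambda left, right: pd.merge(left, right, how="left", on="mm_dd"),
    frames,
)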
And we visualize again.
# print min, max, and mean for long-term data
# ---
# import matplotlib
import matplotlib.pyplot as plt
# plot
dfs_mmm_1890_1999.plot(x='month', kind='line')
plt.ylim(bottom=0, top=800)
# description
plt.title("Level of Lake Constance: 1850 to 2021 + 1949 and 1999")
plt.xlabel("Month")
plt.ylabel("Level [cm]")
# add grid
plt.grid()
plt.show()
Resulting in