Forecasting Atmospheric CO2 with Python

How to Create Time Series Forecasting Models with Darts

Giannis Tolios
Towards Data Science


Photo by Andreas Felske on Unsplash

Climate change is undoubtedly one of the greatest challenges that humanity is facing, with experts considering it an existential threat for our species. In 2021, unprecedented heatwaves and severe wildfires were recorded worldwide, while floods brought destruction in Europe and Asia. According to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report, the increasing frequency of extreme weather events is linked to climate change, and drastic measures need to be imposed globally to deal with this issue¹. Otherwise, millions of people will be affected, with their quality of life significantly decreased in the coming years.

Greenhouse gases (GHG) such as carbon dioxide (CO2) and methane (CH4) trap heat in the atmosphere, hence keeping our planet warm and friendly to biological species. However, human activities such as burning fossil fuels lead to enormous quantities of GHG being emitted, thus excessively increasing the mean global temperature of Earth². Therefore, transitioning towards a sustainable global economy is imperative, so we can mitigate climate change and secure the prosperity of our species. In this article, we are going to apply time series forecasting to atmospheric CO2 concentration data, hence getting the opportunity to explore the intersection of machine learning and climate change.

The Darts Library

Darts Logo — Image by Unit8

The darts library developers aim to simplify time series analysis and forecasting with Python. Darts supports a variety of forecasting approaches, ranging from classical statistical models, such as ARIMA and exponential smoothing, to novel methods based on machine learning and deep learning. Furthermore, darts includes various functions that let us examine the statistical properties of time series, as well as evaluate the accuracy of forecasting models. If you want to learn more about darts, you can read this detailed introduction to the library, or refer to the official API documentation. The darts library will let us create various forecasting models based on the Mauna Loa CO2 dataset. Furthermore, we are going to compare the accuracy of those models and use the best one to forecast the atmospheric CO2 concentration values for 2022.

The Mauna Loa CO2 Dataset

The Mauna Loa Observatory — Photo by NOAA

The Mauna Loa volcano is home to the eponymous Mauna Loa Observatory (MLO), a research facility that has been monitoring the atmosphere since the 1950s, with its remote location providing ideal conditions to record climate data. In 1958, Charles David Keeling established the CO2 monitoring program of the MLO, and started recording scientific evidence for the rapidly increasing CO2 concentration of the atmosphere³. The Mauna Loa CO2 dataset was downloaded from the Scripps Institution of Oceanography, and includes monthly atmospheric CO2 concentration values in parts per million (ppm), from 1958 to 2021⁴. Furthermore, the dataset also contains seasonally adjusted and smoothed versions of the data, but our analysis will exclusively focus on the standard time series.

Time Series Analysis

In this section, we are going to extract some insights about the dataset and its statistical properties. This will be accomplished by using various types of plots and other time series analysis techniques.

We begin by importing the Python libraries that are necessary for our project, including pandas, Matplotlib, as well as various functions from the statsmodels and darts libraries. After doing that, we set the Matplotlib figure DPI to 300, so we get high-resolution plots for this article, though this step is optional.

After importing the libraries, we continue by loading the dataset to a pandas dataframe. We then apply some data preprocessing techniques to it, such as removing unnecessary columns, setting a monthly datetime index and dropping the null values.
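The preprocessing steps could be sketched as follows. Since the real Scripps CSV is not reproduced here, a synthetic dataframe stands in for the raw file, and its column names ("Date", "CO2", "Station") are hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw Mauna Loa CSV; the column names are
# assumptions, and "Station" plays the role of an unnecessary column.
dates = pd.date_range("1990-01-01", "2021-12-01", freq="MS")
raw = pd.DataFrame({
    "Date": dates,
    "CO2": np.linspace(354, 417, len(dates))
           + 3 * np.sin(2 * np.pi * dates.month / 12),
    "Station": "MLO",
})

# The preprocessing steps described above.
df = raw.drop(columns=["Station"])      # remove unnecessary columns
df = df.set_index("Date").asfreq("MS")  # set a monthly datetime index
df = df.dropna()                        # drop the null values
```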

Image by Author

After cleaning the dataset, we use the plot() pandas function to create a simple line plot of the time series. Evidently, there’s a clear upward trend, highlighting the fact that atmospheric CO2 concentration has been rapidly increasing in the past decades. Furthermore, there’s also seasonality due to the planet’s natural carbon cycle, as plants capture and release CO2 in different seasons of the year⁵. Specifically, when plants start growing in Spring, they remove CO2 from the atmosphere by photosynthesizing. In contrast, when deciduous trees lose their leaves in Autumn, CO2 is released due to respiration.
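The line plot described above can be reproduced with a one-liner; a synthetic series (linear trend plus a yearly sinusoid) stands in for the cleaned dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the preprocessed CO2 series (monthly ppm values).
idx = pd.date_range("1990-01-01", "2021-12-01", freq="MS")
co2 = pd.Series(np.linspace(354, 417, len(idx))
                + 3 * np.sin(2 * np.pi * idx.month / 12),
                index=idx, name="CO2")

# A simple line plot reveals both the upward trend and the seasonality.
ax = co2.plot(figsize=(10, 4), title="Atmospheric CO2 Concentration")
ax.set_ylabel("CO2 (ppm)")
```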

Image by Author

We use the seasonal_decompose() statsmodels function to apply decomposition on the time series, i.e. extract the trend, seasonal and residual components. After doing that, we plot those components and get a better understanding of the trend and seasonality that was mentioned before. In this particular case, we could already identify the components by visually inspecting the line plot, but seasonal decomposition is significantly helpful in more complex cases.

Image by Author

We use the plot_acf() statsmodels function to plot the autocorrelation function (ACF) of the time series, i.e. the linear relationship between its lagged values. Because of the time series trend, we can see that autocorrelation is high for small lags and gradually decreases after lag 5. The ACF should also highlight the seasonal component of the time series, but it is indiscernible in this case.

Time Series Forecasting

In this section, we are going to train various forecasting models on the CO2 dataset and compare their performance. After doing that, we’ll pick the most accurate model and create a forecast for 2022 based on it.

We begin by loading the pandas dataframe to a TimeSeries object, as it is required by the Darts library. After doing that, we create the plot_backtest() and print_metrics() utility functions that will let us plot the forecasts and display various model metrics, including MAE, RMSE, MAPE, SMAPE and R².

Creating a Naive Forecasting Model

Setting a baseline accuracy is standard practice, so we are going to do that by creating a naive model. This will help us evaluate the performance of more complex models, which should theoretically achieve higher accuracy than the baseline. In this case, we are going to create a naive seasonal model that always predicts the value of K steps ago, with K being equal to the seasonal period.

Image by Author

To test the naive seasonal model, we use the historical_forecasts() function with a forecast horizon of 12 months. This function consecutively trains the model on an expanding window, and keeps the last value of each forecast by default. This technique is known as backtesting or time series cross-validation, and it is more sophisticated than the typical train/test split. After generating the forecast, we display the plot and metrics, by using the utility functions. As we can see, the naive seasonal model performed reasonably well, with a SMAPE value of 0.59%.

Creating an Exponential Smoothing Forecasting Model

Now that we have set a baseline accuracy, we are going to create a forecasting model based on Holt-Winters exponential smoothing, a classical approach that has been successfully used since the 1960s⁶.

Image by Author

As previously, we use the historical_forecasts() function, as well as the utility functions to evaluate model performance. Evidently, the exponential smoothing model significantly outperforms the naive model, with a SMAPE value of 0.1%. We can also visually confirm this, by observing that the forecasted values are almost identical to the plotted dataset values.

Creating a Linear Regression Forecasting Model

In the past years, machine learning models have been widely used in time series forecasting, as an alternative to classical approaches. By simply adding the lagged values as features to the dataset, we can transform time series forecasting into a regression task⁷. Therefore, we are able to use any scikit-learn regression model or other libraries that have a compatible API, including XGBoost and LightGBM. In this case, we are going to create a linear regression model based on the scikit-learn library.

Image by Author

By using historical_forecasts() and the utility functions, we test the linear regression model and subsequently display the results. As we can see, the model has excellent performance, with a SMAPE value of 0.11%.

Creating a Temporal Convolutional Network Forecasting Model

In recent years, deep learning has become popular in time series forecasting, with recurrent neural networks being the standard choice in research, as well as practical applications. Regardless, the temporal convolutional network is an alternative architecture that offers promising results⁸, so we are going to test its performance.

Image by Author

As previously, we use historical_forecasts() and the utility functions to test the temporal convolutional network model and display the results. Notice that we first normalize the time series, by using the Scaler() class. Evidently, the TCN model has significantly better performance compared to the baseline, with a SMAPE value of 0.11%.

Creating a Forecast

As expected, all models outperformed the naive seasonal model, offering excellent accuracy. Nevertheless, we can see that exponential smoothing provided the best performance, hence our forecast for the atmospheric CO2 concentration of 2022 will be based on it.

Image by Author

We use the fit() function to fit the exponential smoothing model on the entire dataset, and afterwards display the results. As we can visually observe, the time series components have been successfully identified by the model. Furthermore, we can compare our result to the official CO2 forecast of the United Kingdom’s Meteorological Office (Met Office) for 2022. As expected, the two forecasts have nearly identical values, hence validating the accuracy of the exponential smoothing model. It may seem surprising that exponential smoothing provided the best results, instead of the machine learning and deep learning models. Nevertheless, we should keep in mind that newer models aren’t necessarily better in every case, compared to the classical methods. Furthermore, machine learning models can be optimized by applying hyperparameter tuning, a technique that is time-consuming and complicated, but can offer significant improvements.

Conclusion

In this article, we have explored one of the ways that data science can be utilized to tackle climate change. This is a thriving and diverse field of research that constantly grows⁹, so I encourage you to learn more about it. Sadly, we also discovered that atmospheric CO2 concentration is expected to increase this year, highlighting the fact that emissions must be reduced to ensure the prosperity of our species. Finally, the code and data of this article are available at this GitHub repository, so feel free to clone it. I also encourage you to share your thoughts in the comments, or follow me on LinkedIn, where I regularly post content about data science, climate change and other topics. You can also visit my personal website or check my latest book, titled Simplifying Machine Learning with PyCaret.

References

[1] Zhai, P., et al. “IPCC 2021: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change.” (2021).

[2] Houghton, John. “Global warming.” Reports on progress in physics 68.6 (2005): 1343.

[3] Harris, Daniel C. “Charles David Keeling and the story of atmospheric CO2 measurements.” Analytical chemistry 82.19 (2010): 7865–7870.

[4] Keeling, Charles D., et al. “Exchanges of atmospheric CO2 and 13CO2 with the terrestrial biosphere and oceans from 1978 to 2000. I. Global aspects.” (2001).

[5] Keeling, Charles D. “The concentration and isotopic abundances of atmospheric carbon dioxide in rural areas.” Geochimica et cosmochimica acta 13.4 (1958): 322–334.

[6] Winters, Peter R. “Forecasting sales by exponentially weighted moving averages.” Management science 6.3 (1960): 324–342.

[7] Dietterich, Thomas G. “Machine learning for sequential data: A review.” Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, Berlin, Heidelberg, 2002.

[8] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.” arXiv preprint arXiv:1803.01271 (2018).

[9] Rolnick, David, et al. “Tackling climate change with machine learning.” ACM Computing Surveys (CSUR) 55.2 (2022): 1–96.
