Air Quality Sensor Calibration

The dataset comes from a device that monitored air quality [1] at road level in a polluted area of an Italian city. It contains hourly measurements from five sensors that respond to different pollutants, such as carbon monoxide (CO), nitrogen oxides (NOx), and benzene, recorded over one year, from March 2004 to February 2005. It also includes ground-truth measurements from a certified analyzer, which gives us a reliable reference for calibration.

Exploratory Data Analysis

The dataset consists of concentrations of CO, non-methane hydrocarbons (NMHC), benzene (C6H6), nitrogen oxides (NOx), and NO2. Each pollutant was recorded both by the sensor to be calibrated and by a certified analyzer. We'll explore the daily-average and hourly time series for each pollutant, as measured by the sensor and by the analyzer.
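Here is a minimal sketch of the daily averaging step, assuming the hourly readings sit in a pandas DataFrame with a DatetimeIndex. The column names `CO(GT)` (analyzer) and `PT08.S1(CO)` (sensor) and the helper `daily_means` are assumptions for illustration, and the values are synthetic, not the real measurements.

```python
import numpy as np
import pandas as pd


def daily_means(hourly: pd.DataFrame) -> pd.DataFrame:
    """Average hourly readings into one row per calendar day."""
    # resample("D") groups rows by day using the DatetimeIndex;
    # mean() skips NaN values by default.
    return hourly.resample("D").mean()


# Synthetic stand-in for two days of hourly data (hypothetical values).
index = pd.date_range("2004-03-10", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {
        "CO(GT)": np.r_[np.full(24, 2.0), np.full(24, 4.0)],
        "PT08.S1(CO)": np.r_[np.full(24, 1000.0), np.full(24, 1400.0)],
    },
    index=index,
)

daily = daily_means(hourly)
print(daily)
```

If the real file marks missing readings with a sentinel value, those should be converted to NaN before averaging so they do not distort the daily means.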

Daily time series of Carbon Monoxide (CO)

Daily time series of Non-Methane Hydrocarbons (NMHC)

Daily time series of Nitrogen Dioxide (NO2)

Daily time series of Nitric oxide (NO)

Daily time series of Benzene (C6H6)

Daily time series of Ozone (O3)

Hourly time series of Carbon Monoxide (CO)

Hourly time series of Non-Methane Hydrocarbons (NMHC)

Hourly time series of Nitrogen Dioxide (NO2)

Hourly time series of Nitric oxide (NO)

Hourly time series of Benzene (C6H6)

Hourly time series of Ozone (O3)

The time series reveal that the pollutants exhibit similar behavior, suggesting they may be correlated.

Correlation heatmap.

As expected, the time series are correlated with each other.
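The numbers behind a correlation heatmap are just a pairwise correlation matrix. The sketch below computes one with pandas on synthetic series that share a common driver; the column names and the data are assumptions for illustration, not the real readings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical shared driver (for example, traffic intensity) plus noise
# per pollutant, so the synthetic columns end up correlated.
driver = rng.normal(size=500)
readings = pd.DataFrame(
    {
        "CO(GT)": driver + 0.3 * rng.normal(size=500),
        "C6H6(GT)": driver + 0.3 * rng.normal(size=500),
        "NOx(GT)": driver + 0.5 * rng.normal(size=500),
    }
)

# DataFrame.corr() returns the Pearson correlation matrix by default.
corr = readings.corr()
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting function such as `seaborn.heatmap` to draw the figure.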

Model (Calibration)

As shown in the correlation heatmap, many variables are interrelated. To account for this, we applied multivariable regression, ensuring all relevant variables contribute to estimating the pollutant levels. We used a Linear Regression model as a baseline and a Random Forest Regression as our primary model.
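To make the baseline-versus-primary setup concrete, here is a rough sketch using scikit-learn on synthetic data. The features, the target, the 75/25 train/test split, and the hyperparameters (`n_estimators=100`, fixed `random_state`) are all assumptions for illustration, not the configuration behind the results in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for four sensor channels (hypothetical data).
X = rng.uniform(0.0, 1.0, size=(600, 4))
# Target with a linear part, a nonlinear part, and noise.
y = 2.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.1 * rng.normal(size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error on held-out data.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}
    print(name, scores[name])
```

With several pollutants to calibrate, one simple option is to repeat this loop once per reference column, using all sensor channels as features each time.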

Linear Regression Summary

| Pollutant | Root Mean Squared Error (RMSE) | R-squared (R²) |
|---|---|---|
| CO(GT) | 0.287315 | 0.960344 |
| NMHC(GT) | 87.064515 | 0.827655 |
| C6H6(GT) | 0.753162 | 0.990470 |
| NOx(GT) | 23.115998 | 0.921530 |
| NO2(GT) | 12.136127 | 0.863646 |

Random Forest Regression

| Pollutant | Root Mean Squared Error (RMSE) | R-squared (R²) |
|---|---|---|
| CO(GT) | 0.276984 | 0.963145 |
| NMHC(GT) | 56.209143 | 0.928166 |
| C6H6(GT) | 0.655126 | 0.992790 |
| NOx(GT) | 22.569703 | 0.925195 |
| NO2(GT) | 11.456405 | 0.878492 |

The two tables let us compare the Linear Regression baseline with the Random Forest Regression model pollutant by pollutant, using RMSE (lower is better) and R² (higher is better).

As the correlation heatmap showed, the pollutants are strongly correlated with each other. Because the variables are interrelated, a multivariable regression is appropriate: every relevant variable can contribute to the estimate of each pollutant.

The Linear Regression model served as a reliable baseline, yielding strong R² values across pollutants. It did best on Benzene (C6H6), with an R² of 0.990, meaning it explained about 99% of the variance in Benzene concentrations. It also performed well for Carbon Monoxide (CO), with an R² of 0.960. It was least accurate for Non-Methane Hydrocarbons (NMHC), where the R² dropped to 0.828, indicating that the linear model did not fully capture the variability in NMHC levels.
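For reference, the two metrics can be written out directly. This is a small NumPy sketch of the standard definitions; the function names and the four sample values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Hypothetical analyzer readings and calibrated sensor outputs.
reference = [1.0, 2.0, 3.0, 4.0]
calibrated = [1.1, 1.9, 3.2, 3.8]

print(rmse(reference, calibrated))
print(r_squared(reference, calibrated))
```

Because RMSE carries the units of each pollutant, it is not comparable across pollutants, whereas R² is unitless and can be compared between rows of the tables.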

By contrast, the Random Forest Regression model performed better on every pollutant, with higher R² and lower RMSE in each case. The largest gain was for NMHC, where R² rose from 0.828 to 0.928 and RMSE fell by about 35% (from 87.1 to 56.2), which is consistent with the Random Forest capturing nonlinear relationships that the linear model misses. Benzene (C6H6) again had the best fit, with an R² of 0.993 and an RMSE about 13% lower. For CO, NOx, and NO2 the improvements were modest: RMSE dropped by roughly 2% to 6%, with NO2 showing the second-largest R² gain (0.864 to 0.878).
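To put the RMSE changes on a common scale, we can compute the relative reduction for each pollutant from the values in the two tables. The helper name `relative_reduction` is ours; the numbers are copied from the tables above.

```python
# RMSE values copied from the Linear Regression and Random Forest tables.
linear_rmse = {
    "CO(GT)": 0.287315,
    "NMHC(GT)": 87.064515,
    "C6H6(GT)": 0.753162,
    "NOx(GT)": 23.115998,
    "NO2(GT)": 12.136127,
}
forest_rmse = {
    "CO(GT)": 0.276984,
    "NMHC(GT)": 56.209143,
    "C6H6(GT)": 0.655126,
    "NOx(GT)": 22.569703,
    "NO2(GT)": 11.456405,
}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


reductions = {
    pollutant: relative_reduction(linear_rmse[pollutant], forest_rmse[pollutant])
    for pollutant in linear_rmse
}

# Print pollutants from largest to smallest relative improvement.
for pollutant, pct in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pollutant:9s} {pct:5.1f}%")
```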

In conclusion, Linear Regression provided a solid baseline for estimating pollutant levels, and Random Forest Regression improved on it for every pollutant, most clearly for NMHC; for CO, Benzene, and NOx the gains were small. This suggests that more flexible models are worth considering when the relationships in air quality data are multivariate and potentially nonlinear. More accurate calibration, in turn, supports more reliable monitoring and management of air pollution levels.

Thanks for reading! 🧑‍💻💕