PVOD Regression - Photovoltaic power output prediction based on weather data

PVOD is a public dataset containing measured photovoltaic (PV) power output together with weather data. A regression model trained on it can be used to forecast PV power.

The code is available in my GitHub repo SmallSquare/PVOD_Regression. Corrections are welcome😊.

Introduction

Why do this? I originally planned to use this dataset for my research project, but my supervisor then provided a better private dataset. The simple work I had already done on PVOD is shown here as a small example of data processing and modelling.

Why does photovoltaic power output need to be predicted? Unlike conventional power generation, photovoltaic power is very unstable, because it depends heavily on the day-night cycle and on weather conditions. That dependence is also what makes weather data a good basis for forecasting PV power.

Dataset

This dataset includes data from ten power stations in relatively close geographical proximity. I will use station03 as an example.

We can learn some basic information from the provided metadata. Here are columns of the metadata:

Metadata:

  • Station_ID
  • Capacity
  • PV_Technology
  • Panel_size
  • Module
  • Inverters
  • Layout
  • Panel_Number
  • Array_Tilt
  • Pyranometer
  • Longitude
  • Latitude

Next, let's look at the data itself. station03 has 14688 entries. Its columns fall into three groups:

Time:

  • datetime

Weather:

  • nwp_globalirrad
  • nwp_directirrad
  • nwp_temperature
  • nwp_humidity
  • nwp_windspeed
  • nwp_winddirection
  • nwp_pressure
  • lmd_totalirrad
  • lmd_diffuseirrad
  • lmd_temperature
  • lmd_pressure
  • lmd_winddirection
  • lmd_windspeed

Photovoltaics:

  • power
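As a rough sketch of how such a station file could be loaded with pandas: the helper name `load_station`, the sample rows, and the assumption that the timestamp column is literally called `datetime` (as in the list above) are all illustrative, so check the header of the actual CSV before relying on them.

```python
import io

import pandas as pd


def load_station(source, time_col="datetime"):
    """Read one station's CSV and index it by timestamp.

    `source` may be a file path or a file-like object.
    `time_col` is an assumption taken from the column list above;
    adjust it if the real file uses a different header.
    """
    df = pd.read_csv(source, parse_dates=[time_col])
    return df.set_index(time_col).sort_index()


# Two made-up rows standing in for the real file, only to show the shape.
sample = io.StringIO(
    "datetime,lmd_temperature,lmd_pressure,power\n"
    "2018-08-15 12:15:00,30.1,1001.2,9.7\n"
    "2018-08-15 12:00:00,29.8,1001.3,9.4\n"
)
station = load_station(sample)
print(station.shape)
print(station.index.min(), station.index.max())
```

With the real data, `source` would simply be the path to the station03 file.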

We now have a general view of what the dataset looks like. To set the stage for modelling, the next step is a correlation analysis.

Correlation Analysis

The heat map shows that the irradiance features have the highest correlation with the power output. We will nevertheless drop the irradiance columns later, for two reasons: irradiance is not easy to measure without a pyranometer, and it is almost linearly related to the power output, so a model that relies on it says little about how useful the other features are in an actual forecast.
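The numbers behind such a heat map are just a Pearson correlation matrix. Here is a minimal sketch on synthetic data; the values are randomly generated and are not taken from PVOD, only the column names are borrowed from the list above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a station table (NOT real PVOD values).
rng = np.random.default_rng(0)
n = 500
irrad = rng.uniform(0, 1000, n)
df = pd.DataFrame(
    {
        "lmd_totalirrad": irrad,
        "lmd_temperature": 15 + 0.01 * irrad + rng.normal(0, 3, n),
        "lmd_windspeed": rng.uniform(0, 10, n),
        "power": 0.02 * irrad + rng.normal(0, 1, n),
    }
)

# Pearson correlation between every pair of columns.
corr = df.corr()

# Rank the features by the strength of their correlation with the target.
ranking = corr["power"].drop("power").abs().sort_values(ascending=False)
print(ranking)
```

Passing `corr` to a plotting library's heat map function would then give a figure like the one described above.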

Also, we can observe the distribution of the power output.

Distribution of PV output

Feature engineering and feature selection

The datetime column is important, but it needs some engineering first. Date and time are cyclic: 23:45 is close to 00:00, yet as raw numbers they sit at opposite ends of the scale. To keep distances between values meaningful, we encode each cyclic feature as a sine/cosine pair. The exact method and rationale can be found in Pierre-Louis Bescond’s blog post, Cyclical features encoding, it’s about time. It is easy to follow and was a great inspiration.
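A minimal sketch of this sine/cosine encoding, using only the standard library. The function name `encode_cyclic` and the choice of periods (24 hours for the time of day, 365.25 days for the date) are my assumptions here and may differ from what the repository does.

```python
import math
from datetime import datetime


def encode_cyclic(ts):
    """Map a timestamp to (time_sin, time_cos, date_sin, date_cos)."""
    # Time of day as an angle on a 24-hour circle.
    second_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    time_angle = 2 * math.pi * second_of_day / (24 * 60 * 60)

    # Day of year as an angle on a one-year circle.
    # Assumption: a year is treated as 365.25 days, and 1 January maps to angle 0.
    day_of_year = ts.timetuple().tm_yday
    date_angle = 2 * math.pi * (day_of_year - 1) / 365.25

    return (
        math.sin(time_angle),
        math.cos(time_angle),
        math.sin(date_angle),
        math.cos(date_angle),
    )


late = encode_cyclic(datetime(2018, 8, 15, 23, 45))
midnight = encode_cyclic(datetime(2018, 8, 15, 0, 0))

# Distance between the two (time_sin, time_cos) points: small, as it should be
# for two times that are only 15 minutes apart on the clock.
gap = math.dist(late[:2], midnight[:2])
print(gap)
```

The raw hour values 23.75 and 0.0 would be almost a full day apart, while the encoded points are neighbours on the unit circle.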

We then keep only the locally measured (lmd_) weather features, plus the encoded time features. Note that datetime has now become time_sin, time_cos, date_sin and date_cos:

['lmd_temperature',
 'lmd_pressure',
 'lmd_winddirection',
 'lmd_windspeed',
 'time_sin',
 'time_cos',
 'date_sin',
 'date_cos']

We can now repeat the correlation analysis on the processed features and the target variable power.

Correlation Analysis of processed features

Modelling

Now we are finally ready to model. Since this is just a quick experiment on PVOD, I will use models from the scikit-learn package.
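For orientation, here is a minimal sketch of such a comparison in scikit-learn. It runs on synthetic data, not PVOD. The estimator classes behind "RF", "GBM" and "MLP", the hyperparameters, the random train/test split, and the scaling step in front of the MLP are all my assumptions rather than the settings used in the repository.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: power peaks around noon and is zero at night.
rng = np.random.default_rng(0)
n = 600
hour = rng.uniform(0, 24, n)
temperature = rng.normal(20, 5, n)
angle = 2 * np.pi * hour / 24
X = np.column_stack([np.sin(angle), np.cos(angle), temperature])
y = 10 * np.clip(-np.cos(angle), 0, None) + 0.05 * temperature + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
    # Neural networks are sensitive to feature scale, so the MLP gets a scaler.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on held-out data
    print(f"{name}: R2 = {scores[name]:.4f}")
```

All three estimators share the same `fit`/`score` interface, which is what makes this kind of side-by-side comparison so cheap to set up.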

Here is the performance comparison of the models. The numbers are for reference only, since the hyperparameters have not been tuned carefully.

Model   MAE      MSE      RMSE     R-square
RF      0.5162   1.1332   1.0646   0.9524
GBM     0.5841   1.1142   1.0556   0.9532
MLP     1.1181   3.7166   1.9278   0.8440
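For readers unfamiliar with the four columns, this is how the metrics are defined, written out in plain Python. The helper name `regression_metrics` and the toy numbers are made up for illustration. In practice `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error` and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same unit as the target

    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy values, not PVOD data.
metrics = regression_metrics([0.0, 2.0, 4.0, 6.0], [0.5, 2.0, 3.0, 6.5])
print(metrics)
```

MAE, MSE and RMSE are errors, so lower is better, while R-square is at most 1 and higher is better.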

Even without careful hyperparameter tuning, Random Forest still gives a good prediction, and GBM performs comparably.

In the plots, the blue line is the prediction and the black line is the ground truth. We can see how close the predictions made by RF and GBM are to the ground truth.

Prediction by RF
Prediction by GBM

Feature importance

Both RF and GBM report how important each feature is to the fitted model:

- RF Feature Importance -
lmd_temperature
> 9.6774 %
lmd_pressure
> 3.6530 %
lmd_winddirection
> 1.0908 %
lmd_windspeed
> 1.1318 %
time_sin
> 53.0341 %
time_cos
> 24.2303 %
date_sin
> 2.8041 %
date_cos
> 4.3784 %

- GBM Feature Importance -
lmd_temperature
> 9.3780 %
lmd_pressure
> 3.5278 %
lmd_winddirection
> 0.7219 %
lmd_windspeed
> 0.6464 %
time_sin
> 53.9076 %
time_cos
> 23.7690 %
date_sin
> 2.8473 %
date_cos
> 5.2021 %

The time-of-day features (time_sin and time_cos) are by far the most important, because irradiance varies over the course of the day. Temperature is the next most important feature.

Pressure and date also have an impact on the power output, while the influence of wind speed and wind direction is minimal.
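Percentages like the ones listed above can be read from the `feature_importances_` attribute of a fitted tree ensemble in scikit-learn. The sketch below uses synthetic data with a hypothetical target. The three feature names are borrowed from the list above, but the values and the resulting importances have nothing to do with PVOD.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
feature_names = ["time_cos", "lmd_temperature", "lmd_windspeed"]
X = rng.normal(size=(n, 3))

# Hypothetical target: driven mostly by the first column, weakly by the
# second, and not at all by the third.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in importances.items():
    print(f"{name}\n> {value * 100:.4f} %")
```

These impurity-based scores describe what the fitted model relies on, not a causal effect. `sklearn.inspection.permutation_importance` is a common cross-check.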

Tail

As a reminder, this is only a small experiment on PVOD and falls short in many respects. The code can be found in my repository SmallSquare/PVOD_Regression.

Thanks for reading.😉

References

PVOD on Science Data Bank: PVOD v1.0: A photovoltaic power output dataset - Science Data Bank

PVOD on Github: PVODataset

PVOD on ScienceDirect: A photovoltaic power output dataset: Multi-source photovoltaic power output dataset with Python toolkit

Machine Learning Modeling of Horizontal Photovoltaics Using Weather and Location Data

Predicting solar power output using machine learning techniques

Cyclical features encoding, it’s about time!

https://smallsquare.github.io/PVOD-regression/

Author

SmallSquare

Posted on

2022-07-01

Updated on

2023-02-28
