Pandas integration¶
This page describes the pandas integration of Energy Quantified’s Python client. We chose to integrate with pandas because it is quite popular and it works great with time series data. For full documentation on pandas, refer to their documentation.
Pandas is not required to use the energyquantified
package. You must
therefore install pandas separately to use the following operations.
Convert data series to data frames¶
You can convert any of these types to a pandas.DataFrame
:
They all have a method called to_dataframe()
. Periodseries
differs from
the two others in which you must supply a frequency parameter.
There also exists an alias of to_dataframe()
called to_df()
. The term
df
is a commonly used shorthand and variable name for DataFrame
’s.
Convert time series¶
Converting a time series is simple:
>>> df = timeseries.to_dataframe()
>>> df
DE Wind Power Production MWh/h 15min Actual
date
2020-01-01 00:00:00+01:00 8928.95
2020-01-02 00:00:00+01:00 16302.95
2020-01-03 00:00:00+01:00 32063.55
2020-01-04 00:00:00+01:00 33299.36
2020-01-05 00:00:00+01:00 13151.01
You can also set a custom name if you think the default one is a little too verbose:
>>> df = timeseries.to_dataframe(name='de wind')
>>> df
de wind
date
2020-01-01 00:00:00+01:00 8928.95
2020-01-02 00:00:00+01:00 16302.95
2020-01-03 00:00:00+01:00 32063.55
2020-01-04 00:00:00+01:00 33299.36
2020-01-05 00:00:00+01:00 13151.01
If you have a scenario-based time series (such as a time series loaded for an ensemble forecast), you will get one column per ensemble.
Notice that there are three column headers here. The first is the curve, the second is the instance, and the third is the scenarios. You can read more on this in the column headers section below. It is also possible to merge these three header levels onto one, see the section on single-level column headers for details.
>>> forecast.instance
<Instance: issued="2020-10-26 00:00:00+00:00", tag="ec-ens", scenarios=51>
>>> df = forecast.to_dataframe(name='wind forecast')
>>> df
wind forecast ...
2020-10-26 00:00 ec-ens ...
e00 e01 e02 ... e47 e48 e49 e50
date ...
2020-10-27 00:00:00+01:00 26575.48 26733.56 27500.18 26672.14 ... 24269.32 24301.24 30265.62 24280.31
2020-10-28 00:00:00+01:00 30657.37 30446.78 28420.88 37041.53 ... 28426.01 27353.77 32797.71 28044.18
2020-10-29 00:00:00+01:00 27776.44 27720.31 30748.11 28341.64 ... 30731.12 25900.96 29088.77 28441.85
2020-10-30 00:00:00+01:00 26984.86 23955.59 32940.16 27493.37 ... 38920.07 34470.99 26831.95 30003.82
2020-10-31 00:00:00+01:00 15179.69 14326.49 16155.63 16337.56 ... 16874.91 10602.34 8203.10 27192.68
...
There are 52 columns with data here. The first one, the one with a blank third column header, is the mean of all the other scenarios (also known as ensembles).
You can extract a single ensemble like so (here we extract scenario e48
from the 2020-10-26 00:00 ec-ens
instance:
>>> df['wind forecast']['2020-10-26 00:00 ec-ens']['e48']
date
2020-10-27 00:00:00+01:00 24301.24
2020-10-28 00:00:00+01:00 27353.77
2020-10-29 00:00:00+01:00 25900.96
2020-10-30 00:00:00+01:00 34470.99
...
Convert period-based series¶
Period-based series are converted almost the same as time series. The only
difference is that you must supply a frequency parameter to the
to_dataframe(frequency)
method. You should read the above section before
continuing.
Here we convert a REMIT series for German nuclear available capacity to a daily
average capacity pandas.DataFrame
:
>>> from energyquantified.time import Frequency
>>> periodseries.instance
<Instance: issued="2020-10-24 14:10:40+00:00", tag="a-PvMRn_EpOJtngkh4D06Q">
>>> df = periodseries.to_dataframe(
>>> frequency=Frequency.P1D,
>>> name='de nuclear remit'
>>> )
>>> df
de nuclear remit
2020-10-24 14:10 a-PvMRn_EpOJtngkh4D06Q
date
2020-10-20 00:00:00+02:00 6714.000000
2020-10-21 00:00:00+02:00 6709.812500
2020-10-22 00:00:00+02:00 6714.000000
2020-10-23 00:00:00+02:00 7145.572917
2020-10-24 00:00:00+02:00 7958.677083
2020-10-25 00:00:00+02:00 8124.000000
...
Notice that the second column header (the instance) is a little bit weird. That
is because it’s a combination of the publication date (as instance.issued
)
of the REMIT outage message for nuclear powerplants in Germany and the
ID of said message (as instance.tag
).
Convert OHLC data¶
When you have an OHLCList
, which is the
response type from eq.ohlc.load()
, you can do this:
>>> df = ohlc_list.to_dataframe()
>>> df
traded period front delivery open high low close settlement volume open_interest
0 2020-10-15 day 1 2020-10-16 NaN NaN NaN NaN 23.24 0.0 0.0
1 2020-10-15 day 2 2020-10-17 NaN NaN NaN NaN 19.00 0.0 0.0
2 2020-10-15 day 3 2020-10-18 NaN NaN NaN NaN 16.00 0.0 0.0
3 2020-10-15 month 1 2020-11-01 23.50 23.50 22.30 22.30 22.35 343.0 10104.0
4 2020-10-15 month 2 2020-12-01 25.65 25.65 24.40 24.40 24.40 68.0 9772.0
5 2020-10-15 month 3 2021-01-01 NaN NaN NaN NaN 28.65 0.0 192.0
6 2020-10-15 month 4 2021-02-01 NaN NaN NaN NaN 29.28 0.0 159.0
7 2020-10-15 month 5 2021-03-01 24.25 24.25 24.00 24.10 24.10 40.0 105.0
8 2020-10-15 month 6 2021-04-01 22.90 22.90 22.25 22.25 22.35 36.0 10.0
9 2020-10-15 quarter 1 2021-01-01 28.10 28.10 27.10 27.15 27.10 251.0 5731.0
10 2020-10-15 quarter 2 2021-04-01 20.25 20.25 19.75 19.75 19.75 86.0 1762.0
...
You can filter down further the contracts you want. Say that you only wish to work on front contracts, then do this:
>>> df[ df['front'] == 1 ]
traded period front delivery open high low close settlement volume open_interest
0 2020-10-15 day 1 2020-10-16 NaN NaN NaN NaN 23.24 0.0 0.0
3 2020-10-15 month 1 2020-11-01 23.5 23.5 22.3 22.30 22.35 343.0 10104.0
9 2020-10-15 quarter 1 2021-01-01 28.1 28.1 27.1 27.15 27.10 251.0 5731.0
15 2020-10-15 week 1 2020-10-19 21.5 21.5 20.0 20.00 20.00 310.0 200.0
21 2020-10-15 year 1 2021-01-01 23.5 23.5 22.9 23.00 22.95 89.0 9790.0
For more details on filtering, see the pandas documentation.
Convert a list of series to a data frame¶
Responses from eq.instances.load()
and eq.period_instances.load()
respectively return a TimeseriesList
and a
PeriodseriesList
.
Both list implementations subclasses Python’s built-in list, so you can call
append()
, extend()
, pop()
, remove()
and more on them. They
also have utility methods for converting all series contained in them to a
single pandas.DataFrame
.
Convert a time series list¶
Say that you have loaded three wind power forecasts in daily resolution
using eq.instances.load()
, then you can convert them to a
single pandas.DataFrame
like this:
>>> df = timeseries_list.to_dataframe()
>>> df
DE Wind Power Production MWh/h 15min Forecast
2020-10-25 00:00 gfs 2020-10-25 00:00 ec 2020-10-24 18:00 gfs
date
2020-10-25 00:00:00+02:00 NaN NaN 25723.21
2020-10-26 00:00:00+01:00 14148.87 15312.22 13579.25
2020-10-27 00:00:00+01:00 22220.05 22581.10 22010.06
2020-10-28 00:00:00+01:00 27906.20 29214.30 26829.98
2020-10-29 00:00:00+01:00 28905.48 26575.11 28152.93
...
You can also add more time series to timeseries_list
using the built-in
list methods. There is only one requirement: They must have the same frequency.
>>> timeseries_list.insert(0, wind_actual) # Add actual first
>>> timeseries_list.insert(1, wind_normal) # Add normal second
>>> df = timeseries_list.to_dataframe()
>>> df
DE Wind Power Production MWh/h 15min Actual DE Wind Power Production MWh/h 15min Normal ... DE Wind Power Production MWh/h 15min Forecast
... 2020-10-25 00:00 ec 2020-10-24 18:00 gfs
...
date ...
2020-10-23 00:00:00+02:00 13193.50 16133.94 ... NaN NaN
2020-10-24 00:00:00+02:00 22438.26 16291.00 ... NaN NaN
2020-10-25 00:00:00+02:00 24872.55 16465.75 ... NaN 25723.21
2020-10-26 00:00:00+01:00 NaN 16588.33 ... 15312.22 13579.25
2020-10-27 00:00:00+01:00 NaN 16721.59 ... 22581.10 22010.06
2020-10-28 00:00:00+01:00 NaN 16845.30 ... 29214.30 26829.98
2020-10-29 00:00:00+01:00 NaN 16958.63 ... 26575.11 28152.93
...
To get all instances for the forecast curve from the pandas.DataFrame
, use pandas’
built-in filtering capabilities:
>>> df['DE Wind Power Production MWh/h 15min Forecast']
2020-10-25 00:00 gfs 2020-10-25 00:00 ec 2020-10-24 18:00 gfs
date
2020-10-23 00:00:00+02:00 NaN NaN NaN
2020-10-24 00:00:00+02:00 NaN NaN NaN
2020-10-25 00:00:00+02:00 NaN NaN 25723.21
2020-10-26 00:00:00+01:00 14148.87 15312.22 13579.25
2020-10-27 00:00:00+01:00 22220.05 22581.10 22010.06
2020-10-28 00:00:00+01:00 27906.20 29214.30 26829.98
2020-10-29 00:00:00+01:00 28905.48 26575.11 28152.93
...
Notice that the first column header with the curve name disappeared. That is because pandas stores the data hierarchically. All columns with the same name are grouped together. So, in this case, we get the three instances for the wind power forecast curve.
Convert a period-based series list¶
Just like with Periodseries
, specify a
frequency to first convert to a fixed-interval time series in your
preferred resolution in a pandas.DataFrame
. Using the German nuclear REMIT
capacity example as before, we can see how the available nuclear capacity was
at different times:
>>> from energyquantified.time import Frequency
>>> df = periodseries_list.to_dataframe(frequency=Frequency.P1D)
>>> df
DE Nuclear Capacity Available MW REMIT
2020-10-24 14:10 a-PvMRn_EpOJtngkh4D06Q 2020-10-23 22:53 a-PvMRn_EpOJtngkh4D06Q 2020-10-23 22:32 foawy0rsE5VaMvg-JLbVbQ 2020-10-23 07:45 5mkc_POSQXzDGnTVSzsQiQ
date
2020-10-20 00:00:00+02:00 6714.000000 6714.000000 6714.000000 6714.000000
2020-10-21 00:00:00+02:00 6709.812500 6709.812500 6709.812500 6709.812500
2020-10-22 00:00:00+02:00 6714.000000 6714.000000 6714.000000 6714.000000
2020-10-23 00:00:00+02:00 7145.572917 7542.625000 6714.000000 7540.770833
2020-10-24 00:00:00+02:00 7958.677083 8124.000000 7147.750000 8124.000000
...
Column headers for time series data¶
The data frames created from time series data has three column header levels:
Curve name
Instance or contract
Scenario
Curve name is set the timeseries.curve.name
by default. If there is
no curve attribute on the Timeseries
object,
it defaults to be blank. The user can override this name by setting a custom
name (see below).
Instance or contract is set (defaults to blank) when the time series has is an instance (forecast) or when the response is an OHLC series converted to a time series:
For instances, this column header is set to
<issued> <tag>
, like so:2020-10-16 00:00 ec
.For contracts, it is set to
<period> <front|delivery> <field>
. Examples:month front-1 close
oryear 2024-01-01 settlement
.
Scenario is the scenario or ensemble ID. This header is blank unless you
load ensemble data or scenario time series. For ensembles, it is normally
named eNN
where NN is the zero-padded ensemble ID. ECMWF ensemble
forecasts, for example, have 51 scenarios, named from e00
, e01
, …,
e49
, e50
. Climate series uses underlying weather years. These column
headers are named after the weather year they are based on: y1980
,
y1981
, …, y2018
, y2019
.
Force single-level column headers¶
While the default behaviour is to create three levels of column headers, as seen above, you can tell the client to merge all the levels into one.
Do this by setting the parameter single_level_header=True
when
you invoke to_dataframe()
.
Using the wind forecast example from earlier on this page:
>>> forecast.instance
<Instance: issued="2020-10-26 00:00:00+00:00", tag="ec-ens", scenarios=51>
>>> df = forecast.to_dataframe(name='wind forecast')
>>> df
wind forecast ...
2020-10-26 00:00 ec-ens ...
e00 e01 e02 ... e47 e48 e49 e50
date ...
2020-10-27 00:00:00+01:00 26575.48 26733.56 27500.18 26672.14 ... 24269.32 24301.24 30265.62 24280.31
2020-10-28 00:00:00+01:00 30657.37 30446.78 28420.88 37041.53 ... 28426.01 27353.77 32797.71 28044.18
2020-10-29 00:00:00+01:00 27776.44 27720.31 30748.11 28341.64 ... 30731.12 25900.96 29088.77 28441.85
2020-10-30 00:00:00+01:00 26984.86 23955.59 32940.16 27493.37 ... 38920.07 34470.99 26831.95 30003.82
2020-10-31 00:00:00+01:00 15179.69 14326.49 16155.63 16337.56 ... 16874.91 10602.34 8203.10 27192.68
...
We can add the single_level_header
parameter. Notice that the headers,
which previously were three levels (curve name, instance and scenario), are
now merged into one row:
>>> df = forecast.to_dataframe(
>>> name='wind forecast',
>>> single_level_header=True # Merge column headers
>>> )
wind forecast 2020-10-26 00:00 ec-ens wind forecast 2020-10-26 00:00 ec-ens e00 ... wind forecast 2020-10-26 00:00 ec-ens e49 wind forecast 2020-10-26 00:00 ec-ens e50
date ...
2020-10-27 00:00:00+01:00 26575.48 26733.56 ... 30265.62 24280.31
2020-10-28 00:00:00+01:00 30657.37 30446.78 ... 32797.71 28044.18
2020-10-29 00:00:00+01:00 27776.44 27720.31 ... 29088.77 28441.85
2020-10-30 00:00:00+01:00 26984.86 23955.59 ... 26831.95 30003.82
2020-10-31 00:00:00+01:00 15179.69 14326.49 ... 8203.10 27192.68
...
Some functions and utilities in pandas work best when the DataFrame
has
a single level header. Setting single_level_header=True
makes it easier
than if you would have to merge the headers manually.
Set custom time series name¶
Energy Quantified’s curve names are made to be easy to understand but can be
quite long. So we made a set_name
method for Timeseries
: and
Periodseries
.
Use it to set your own custom name before converting to a pandas.DataFrame
:
>>> timeseries.name
'DE Wind Power Production MWh/h 15min Actual'
>>> timeseries.set_name('de wind actual')
>>> timeseries.name
'de wind actual'
The custom name is reflected in the pandas.DataFrame
column header:
>>> timeseries.to_dataframe()
de wind actual
date
2020-01-01 00:00:00+01:00 8928.95
2020-01-02 00:00:00+01:00 16302.95
2020-01-03 00:00:00+01:00 32063.55
2020-01-04 00:00:00+01:00 33299.36
2020-01-05 00:00:00+01:00 13151.01
You can also specify a name when invoking the to_dataframe()
method on
time series objects:
>>> timeseries.to_dataframe(name='my awesome name')
my awesome name
date
2020-01-01 00:00:00+01:00 8928.95
2020-01-02 00:00:00+01:00 16302.95
2020-01-03 00:00:00+01:00 32063.55
2020-01-04 00:00:00+01:00 33299.36
2020-01-05 00:00:00+01:00 13151.01