Pandas integration ================== This page describes the `pandas `_ integration of Energy Quantified's Python client. We chose to integrate with pandas because it is quite popular and it works great with time series data. For full documentation on **pandas**, refer to their `documentation `_. Pandas is **not required** to use the ``energyquantified`` package. You must therefore **install pandas** separately to use the following operations. Convert data series to data frames ---------------------------------- You can convert any of these types to a ``pandas.DataFrame``: * :py:class:`~energyquantified.data.Timeseries` * :py:class:`~energyquantified.data.Periodseries` * :py:class:`~energyquantified.data.OHLCList` They all have a method called ``to_dataframe()``. ``Periodseries`` differs from the two others in which you must supply a **frequency** parameter. There also exists an alias of ``to_dataframe()`` called ``to_df()``. The term ``df`` is a commonly used shorthand and variable name for ``DataFrame``'s. Convert time series ^^^^^^^^^^^^^^^^^^^ Converting a time series is simple: >>> df = timeseries.to_dataframe() >>> df DE Wind Power Production MWh/h 15min Actual date 2020-01-01 00:00:00+01:00 8928.95 2020-01-02 00:00:00+01:00 16302.95 2020-01-03 00:00:00+01:00 32063.55 2020-01-04 00:00:00+01:00 33299.36 2020-01-05 00:00:00+01:00 13151.01 You can also set a custom name if you think the default one is a little too verbose: >>> df = timeseries.to_dataframe(name='de wind') >>> df de wind date 2020-01-01 00:00:00+01:00 8928.95 2020-01-02 00:00:00+01:00 16302.95 2020-01-03 00:00:00+01:00 32063.55 2020-01-04 00:00:00+01:00 33299.36 2020-01-05 00:00:00+01:00 13151.01 If you have a scenario-based time series (such as a time series loaded for an ensemble forecast), you will get one column per ensemble. Notice that there are **three column headers** here. The first is the **curve**, the second is the **instance**, and the third is the **scenarios**. You can read more on this in the :ref:`column headers section` below. It is also possible to merge these three header levels onto one, see the section on :ref:`single-level column headers` for details. >>> forecast.instance >>> df = forecast.to_dataframe(name='wind forecast') >>> df wind forecast ... 2020-10-26 00:00 ec-ens ... e00 e01 e02 ... e47 e48 e49 e50 date ... 2020-10-27 00:00:00+01:00 26575.48 26733.56 27500.18 26672.14 ... 24269.32 24301.24 30265.62 24280.31 2020-10-28 00:00:00+01:00 30657.37 30446.78 28420.88 37041.53 ... 28426.01 27353.77 32797.71 28044.18 2020-10-29 00:00:00+01:00 27776.44 27720.31 30748.11 28341.64 ... 30731.12 25900.96 29088.77 28441.85 2020-10-30 00:00:00+01:00 26984.86 23955.59 32940.16 27493.37 ... 38920.07 34470.99 26831.95 30003.82 2020-10-31 00:00:00+01:00 15179.69 14326.49 16155.63 16337.56 ... 16874.91 10602.34 8203.10 27192.68 ... There are 52 columns with data here. The first one, the one with a blank third column header, is the mean of all the other scenarios (also known as ensembles). You can extract a single ensemble like so (here we extract scenario ``e48`` from the ``2020-10-26 00:00 ec-ens`` instance: >>> df['wind forecast']['2020-10-26 00:00 ec-ens']['e48'] date 2020-10-27 00:00:00+01:00 24301.24 2020-10-28 00:00:00+01:00 27353.77 2020-10-29 00:00:00+01:00 25900.96 2020-10-30 00:00:00+01:00 34470.99 ... Convert period-based series ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Period-based series are converted almost the same as time series. The only difference is that you must supply a **frequency** parameter to the ``to_dataframe(frequency)`` method. You should read the above section before continuing. Here we convert a REMIT series for German nuclear available capacity to a daily average capacity ``pandas.DataFrame``: >>> from energyquantified.time import Frequency >>> periodseries.instance >>> df = periodseries.to_dataframe( >>> frequency=Frequency.P1D, >>> name='de nuclear remit' >>> ) >>> df de nuclear remit 2020-10-24 14:10 a-PvMRn_EpOJtngkh4D06Q date 2020-10-20 00:00:00+02:00 6714.000000 2020-10-21 00:00:00+02:00 6709.812500 2020-10-22 00:00:00+02:00 6714.000000 2020-10-23 00:00:00+02:00 7145.572917 2020-10-24 00:00:00+02:00 7958.677083 2020-10-25 00:00:00+02:00 8124.000000 ... Notice that the second column header (the instance) is a little bit weird. That is because it's a combination of the publication date (as ``instance.issued``) of the REMIT outage message for nuclear powerplants in Germany and the ID of said message (as ``instance.tag``). Convert OHLC data ^^^^^^^^^^^^^^^^^ When you have an :py:class:`~energyquantified.data.OHLCList`, which is the response type from ``eq.ohlc.load()``, you can do this: >>> df = ohlc_list.to_dataframe() >>> df traded period front delivery open high low close settlement volume open_interest 0 2020-10-15 day 1 2020-10-16 NaN NaN NaN NaN 23.24 0.0 0.0 1 2020-10-15 day 2 2020-10-17 NaN NaN NaN NaN 19.00 0.0 0.0 2 2020-10-15 day 3 2020-10-18 NaN NaN NaN NaN 16.00 0.0 0.0 3 2020-10-15 month 1 2020-11-01 23.50 23.50 22.30 22.30 22.35 343.0 10104.0 4 2020-10-15 month 2 2020-12-01 25.65 25.65 24.40 24.40 24.40 68.0 9772.0 5 2020-10-15 month 3 2021-01-01 NaN NaN NaN NaN 28.65 0.0 192.0 6 2020-10-15 month 4 2021-02-01 NaN NaN NaN NaN 29.28 0.0 159.0 7 2020-10-15 month 5 2021-03-01 24.25 24.25 24.00 24.10 24.10 40.0 105.0 8 2020-10-15 month 6 2021-04-01 22.90 22.90 22.25 22.25 22.35 36.0 10.0 9 2020-10-15 quarter 1 2021-01-01 28.10 28.10 27.10 27.15 27.10 251.0 5731.0 10 2020-10-15 quarter 2 2021-04-01 20.25 20.25 19.75 19.75 19.75 86.0 1762.0 ... You can filter down further the contracts you want. Say that you only wish to work on **front contracts**, then do this: >>> df[ df['front'] == 1 ] traded period front delivery open high low close settlement volume open_interest 0 2020-10-15 day 1 2020-10-16 NaN NaN NaN NaN 23.24 0.0 0.0 3 2020-10-15 month 1 2020-11-01 23.5 23.5 22.3 22.30 22.35 343.0 10104.0 9 2020-10-15 quarter 1 2021-01-01 28.1 28.1 27.1 27.15 27.10 251.0 5731.0 15 2020-10-15 week 1 2020-10-19 21.5 21.5 20.0 20.00 20.00 310.0 200.0 21 2020-10-15 year 1 2021-01-01 23.5 23.5 22.9 23.00 22.95 89.0 9790.0 For more details on filtering, see the pandas documentation. Convert a list of series to a data frame ---------------------------------------- Responses from ``eq.instances.load()`` and ``eq.period_instances.load()`` respectively return a :py:class:`~energyquantified.data.TimeseriesList` and a :py:class:`~energyquantified.data.PeriodseriesList`. Both list implementations subclasses Python's built-in list, so you can call ``append()``, ``extend()``, ``pop()``, ``remove()`` and more on them. They also have utility methods for converting all series contained in them to a single ``pandas.DataFrame``. Convert a time series list ^^^^^^^^^^^^^^^^^^^^^^^^^^ Say that you have loaded three wind power forecasts in daily resolution using ``eq.instances.load()``, then you can convert them to a single ``pandas.DataFrame`` like this: >>> df = timeseries_list.to_dataframe() >>> df DE Wind Power Production MWh/h 15min Forecast 2020-10-25 00:00 gfs 2020-10-25 00:00 ec 2020-10-24 18:00 gfs date 2020-10-25 00:00:00+02:00 NaN NaN 25723.21 2020-10-26 00:00:00+01:00 14148.87 15312.22 13579.25 2020-10-27 00:00:00+01:00 22220.05 22581.10 22010.06 2020-10-28 00:00:00+01:00 27906.20 29214.30 26829.98 2020-10-29 00:00:00+01:00 28905.48 26575.11 28152.93 ... You can also add more time series to ``timeseries_list`` using the built-in list methods. There is only one requirement: They **must** have the **same frequency**. >>> timeseries_list.insert(0, wind_actual) # Add actual first >>> timeseries_list.insert(1, wind_normal) # Add normal second >>> df = timeseries_list.to_dataframe() >>> df DE Wind Power Production MWh/h 15min Actual DE Wind Power Production MWh/h 15min Normal ... DE Wind Power Production MWh/h 15min Forecast ... 2020-10-25 00:00 ec 2020-10-24 18:00 gfs ... date ... 2020-10-23 00:00:00+02:00 13193.50 16133.94 ... NaN NaN 2020-10-24 00:00:00+02:00 22438.26 16291.00 ... NaN NaN 2020-10-25 00:00:00+02:00 24872.55 16465.75 ... NaN 25723.21 2020-10-26 00:00:00+01:00 NaN 16588.33 ... 15312.22 13579.25 2020-10-27 00:00:00+01:00 NaN 16721.59 ... 22581.10 22010.06 2020-10-28 00:00:00+01:00 NaN 16845.30 ... 29214.30 26829.98 2020-10-29 00:00:00+01:00 NaN 16958.63 ... 26575.11 28152.93 ... To get all instances for the forecast curve from the ``pandas.DataFrame``, use pandas' built-in filtering capabilities: >>> df['DE Wind Power Production MWh/h 15min Forecast'] 2020-10-25 00:00 gfs 2020-10-25 00:00 ec 2020-10-24 18:00 gfs date 2020-10-23 00:00:00+02:00 NaN NaN NaN 2020-10-24 00:00:00+02:00 NaN NaN NaN 2020-10-25 00:00:00+02:00 NaN NaN 25723.21 2020-10-26 00:00:00+01:00 14148.87 15312.22 13579.25 2020-10-27 00:00:00+01:00 22220.05 22581.10 22010.06 2020-10-28 00:00:00+01:00 27906.20 29214.30 26829.98 2020-10-29 00:00:00+01:00 28905.48 26575.11 28152.93 ... Notice that the first column header with the curve name disappeared. That is because pandas stores the data hierarchically. All columns with the same name are grouped together. So, in this case, we get the three instances for the wind power forecast curve. Convert a period-based series list ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Just like with :py:class:`~energyquantified.data.Periodseries`, specify a **frequency** to first convert to a fixed-interval time series in your preferred resolution in a ``pandas.DataFrame``. Using the German nuclear REMIT capacity example as before, we can see how the available nuclear capacity was at different times: >>> from energyquantified.time import Frequency >>> df = periodseries_list.to_dataframe(frequency=Frequency.P1D) >>> df DE Nuclear Capacity Available MW REMIT 2020-10-24 14:10 a-PvMRn_EpOJtngkh4D06Q 2020-10-23 22:53 a-PvMRn_EpOJtngkh4D06Q 2020-10-23 22:32 foawy0rsE5VaMvg-JLbVbQ 2020-10-23 07:45 5mkc_POSQXzDGnTVSzsQiQ date 2020-10-20 00:00:00+02:00 6714.000000 6714.000000 6714.000000 6714.000000 2020-10-21 00:00:00+02:00 6709.812500 6709.812500 6709.812500 6709.812500 2020-10-22 00:00:00+02:00 6714.000000 6714.000000 6714.000000 6714.000000 2020-10-23 00:00:00+02:00 7145.572917 7542.625000 6714.000000 7540.770833 2020-10-24 00:00:00+02:00 7958.677083 8124.000000 7147.750000 8124.000000 ... .. _column-headers: Column headers for time series data ----------------------------------- The data frames created from time series data has three column header levels: 1. **Curve name** 2. **Instance or contract** 3. **Scenario** **Curve name** is set the ``timeseries.curve.name`` by default. If there is no curve attribute on the :py:class:`~energyquantified.data.Timeseries` object, it defaults to be blank. The user can override this name by setting a custom name (see below). **Instance or contract** is set (defaults to blank) when the time series has is an instance (forecast) or when the response is an OHLC series converted to a time series: * For *instances*, this column header is set to `` ``, like so: ``2020-10-16 00:00 ec``. * For *contracts*, it is set to `` ``. Examples: ``month front-1 close`` or ``year 2024-01-01 settlement``. **Scenario** is the scenario or ensemble ID. This header is blank unless you load ensemble data or scenario time series. For ensembles, it is normally named ``eNN`` where `NN` is the zero-padded ensemble ID. ECMWF ensemble forecasts, for example, have 51 scenarios, named from ``e00``, ``e01``, ..., ``e49``, ``e50``. Climate series uses underlying weather years. These column headers are named after the weather year they are based on: ``y1980``, ``y1981``, ..., ``y2018``, ``y2019``. .. _single-column-headers: Force single-level column headers --------------------------------- While the default behaviour is to create three levels of column headers, as seen above, you can tell the client to merge all the levels into one. Do this by setting the parameter ``single_level_header=True`` when you invoke ``to_dataframe()``. Using the wind forecast example from earlier on this page: >>> forecast.instance >>> df = forecast.to_dataframe(name='wind forecast') >>> df wind forecast ... 2020-10-26 00:00 ec-ens ... e00 e01 e02 ... e47 e48 e49 e50 date ... 2020-10-27 00:00:00+01:00 26575.48 26733.56 27500.18 26672.14 ... 24269.32 24301.24 30265.62 24280.31 2020-10-28 00:00:00+01:00 30657.37 30446.78 28420.88 37041.53 ... 28426.01 27353.77 32797.71 28044.18 2020-10-29 00:00:00+01:00 27776.44 27720.31 30748.11 28341.64 ... 30731.12 25900.96 29088.77 28441.85 2020-10-30 00:00:00+01:00 26984.86 23955.59 32940.16 27493.37 ... 38920.07 34470.99 26831.95 30003.82 2020-10-31 00:00:00+01:00 15179.69 14326.49 16155.63 16337.56 ... 16874.91 10602.34 8203.10 27192.68 ... We can add the ``single_level_header`` parameter. Notice that the headers, which previously were three levels (curve name, instance and scenario), are now merged into one row: >>> df = forecast.to_dataframe( >>> name='wind forecast', >>> single_level_header=True # Merge column headers >>> ) wind forecast 2020-10-26 00:00 ec-ens wind forecast 2020-10-26 00:00 ec-ens e00 ... wind forecast 2020-10-26 00:00 ec-ens e49 wind forecast 2020-10-26 00:00 ec-ens e50 date ... 2020-10-27 00:00:00+01:00 26575.48 26733.56 ... 30265.62 24280.31 2020-10-28 00:00:00+01:00 30657.37 30446.78 ... 32797.71 28044.18 2020-10-29 00:00:00+01:00 27776.44 27720.31 ... 29088.77 28441.85 2020-10-30 00:00:00+01:00 26984.86 23955.59 ... 26831.95 30003.82 2020-10-31 00:00:00+01:00 15179.69 14326.49 ... 8203.10 27192.68 ... Some functions and utilities in pandas work best when the ``DataFrame`` has a single level header. Setting ``single_level_header=True`` makes it easier than if you would have to merge the headers manually. Set custom time series name --------------------------- Energy Quantified's curve names are made to be easy to understand but can be quite long. So we made a :py:meth:`~energyquantified.data.base.Series.set_name` method for :py:class:`~energyquantified.data.Timeseries`: and :py:class:`~energyquantified.data.Periodseries`. Use it to set your own custom name before converting to a ``pandas.DataFrame``: >>> timeseries.name 'DE Wind Power Production MWh/h 15min Actual' >>> timeseries.set_name('de wind actual') >>> timeseries.name 'de wind actual' The custom name is reflected in the ``pandas.DataFrame`` column header: >>> timeseries.to_dataframe() de wind actual date 2020-01-01 00:00:00+01:00 8928.95 2020-01-02 00:00:00+01:00 16302.95 2020-01-03 00:00:00+01:00 32063.55 2020-01-04 00:00:00+01:00 33299.36 2020-01-05 00:00:00+01:00 13151.01 You can also specify a name when invoking the ``to_dataframe()`` method on time series objects: >>> timeseries.to_dataframe(name='my awesome name') my awesome name date 2020-01-01 00:00:00+01:00 8928.95 2020-01-02 00:00:00+01:00 16302.95 2020-01-03 00:00:00+01:00 32063.55 2020-01-04 00:00:00+01:00 33299.36 2020-01-05 00:00:00+01:00 13151.01