climetlab
Feature preparation for weather data to use in AI/ML applications
I am following an invitation from @floriankrb and Peter Dueben to share some insights into how I tackle some issues when working with weather data and AI.
We do not know yet whether this is helpful, but I think all users need well-structured, unified preprocessing, and climetlab could be the place for it.
The two points where I think you can treat weather data in the wrong way are:
- aligning forecast data with measurement/observation/target data
- structuring data for algorithms that require sequences
For both issues I have built solutions, and I hope you can validate my approach.
Alignment of features
from typing import Tuple

import pandas as pd

COLUMN_DT_FORE = 'dt_fore'


def align_features(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    takes both predictors and target values and derives intersection
    of both to create two matching dataframes by using dt_fore
    forecast_data contains MultiIndex with dt_calc, dt_fore, positional_index
    dt_calc: INIT/calculation run timestamp
    dt_fore: leading forecast timestamp
    positional_index: location based indexer
    """
    _target_data = []
    _target_index = []
    _rows_to_take = []
    # look up the target value for the valid time of every forecast row
    for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
        try:
            _target_data.append(target_data.loc[dt_fore, :].values)
            _target_index.append(dt_fore)
            _rows_to_take.append(True)
        except KeyError:
            # no observation for this valid time: drop the forecast row
            _rows_to_take.append(False)
    forecast_features = forecast_data.loc[_rows_to_take, :]
    target = pd.DataFrame(_target_data, index=_target_index)
    return forecast_features, target
Preprocess data according to sequences
This topic is relevant if you would like to use models that consume sequences, such as recurrent neural networks (e.g. LSTM) or convolutional layers.
import pandas as pd

COLUMN_POSITIONAL_INDEX = 'positional_index'
COLUMN_DT_CALC = 'dt_calc'


def pre_process_lstm_dataframe_with_forecast_data(
    data: pd.DataFrame,
    lstm_sequence_length: int,
) -> pd.DataFrame:
    """
    This preprocessing step builds sequences of length lstm_sequence_length for data that contains forecasts.
    A forecast dataset is characterized by a number of dt_calc with several dt_fore for each dt_calc.
    Note: This function requires equally spaced time intervals.
    Args:
        data: pd.DataFrame with MultiIndex
        lstm_sequence_length: historical length of sequence in the dimension of time_frequency
    Returns:
        dataframe with list objects as entries
    """
    def seq(n):
        """generator object to pre process data for use in lstm"""
        df = data.reset_index()
        # rolling windows are built per location and per forecast run
        for g in df.groupby(
            [COLUMN_POSITIONAL_INDEX, COLUMN_DT_CALC], sort=False
        ).rolling(n):
            yield g[data.columns].to_numpy().T if len(g) == n else []

    return pd.DataFrame(
        seq(lstm_sequence_length), index=data.index, columns=data.columns
    ).dropna()
As you can see, I am working with historical point forecasts, but I think this should work for arrays as well. In the end, any 2D data can be transformed into such a DataFrame, although for array data pandas is probably not the best tool. I am pretty sure there are smarter solutions than the ones I am presenting here.
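For array-like data, the same sequence building can be sketched directly in NumPy. The following is only a minimal illustration, not the implementation above: the function name build_sequences is hypothetical, and it assumes the MultiIndex levels dt_calc, dt_fore and positional_index described earlier, rows ordered by dt_fore within each run, and NumPy 1.20 or newer for sliding_window_view.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def build_sequences(data: pd.DataFrame, sequence_length: int) -> np.ndarray:
    """Return an array of shape (n_windows, sequence_length, n_features)."""
    windows = []
    # One group per location and forecast run, so windows never mix runs.
    for _, group in data.groupby(level=["positional_index", "dt_calc"], sort=False):
        values = group.to_numpy()
        if len(values) < sequence_length:
            continue  # run too short for a single full window
        # sliding_window_view appends the window axis last:
        # (n_windows, n_features, sequence_length)
        view = sliding_window_view(values, sequence_length, axis=0)
        windows.append(np.swapaxes(view, 1, 2))
    if not windows:
        return np.empty((0, sequence_length, data.shape[1]))
    return np.concatenate(windows)


# Toy data (hypothetical): 2 runs with 5 hourly steps each, 1 location, 2 features.
runs = pd.date_range("2021-04-01", periods=2, freq="D")
rows = [(run, run + pd.Timedelta(hours=h), 0) for run in runs for h in range(5)]
index = pd.MultiIndex.from_tuples(rows, names=["dt_calc", "dt_fore", "positional_index"])
toy = pd.DataFrame(np.arange(20, dtype=float).reshape(10, 2), index=index, columns=["temp", "wind"])

sequences = build_sequences(toy, sequence_length=3)
print(sequences.shape)
```

The result is a plain 3D array (samples, time steps, features), which is the layout most sequence models expect, instead of a DataFrame holding list objects.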
From my point of view, these are the most important steps and the main differences from ordinary ML applications. Please let me know what you think about the topic.
Please note that I am working hard to establish our company alitiq, so my time to contribute operational code to climetlab is limited. I will do my best to share knowledge and best practices.
I am really looking forward to discussing this with you.
Hi Daniel,
I'm a colleague of @floriankrb. I can't comment on the scope of ClimetLab to provide a unified pre-processing interface, but I had a quick look at the code. I think preprocessing, data cleaning, and data assimilation are always important steps in any machine learning application, so it's great to have examples of this code available for others.
I have one comment on a code smell I noticed. I'm always careful when I see for-loops and .append() statements in pandas, as they tend to be very slow compared to vectorized operations like .apply().
Is there a specific reason that you don't use the index.intersection() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.intersection.html
I can imagine it takes a few steps to figure out with a MultiIndex, but overall the vectorization should speed up the calculation significantly. I assume that target_data.index.intersection(forecast_data.index) might already go in a good direction otherwise.
Dear @JesperDramsch, thanks for your reply.
First of all, you are totally right that preprocessing etc. is always important. What I wanted to say is that these are specific issues that arise when working with meteorological data.
The for loop is not over pandas, and .append() is applied to lists.
intersection will not work properly because it will cause a lack of data. My goal is to assign a target value to each forecast, so I have to expand my target time series to the dimension of the forecast.
For example, a daily run with forecast timestamps up to 48 hours ahead increases the size of the target data by a factor of 2.
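The factor of 2 can be checked with a small counting sketch. It assumes a hypothetical setup of ten daily runs with 48 hourly lead times each; the variable names are illustrative and not part of the code in this thread.

```python
import pandas as pd

# Assumed setup: 10 daily runs, each with 48 hourly lead times (0 h to 47 h).
runs = pd.date_range("2021-04-01", periods=10, freq="D")
leads = pd.to_timedelta(list(range(48)), unit="h")
index = pd.MultiIndex.from_product([runs, leads], names=["dt_calc", "lead"])

# Valid time of each forecast row: run time plus lead time.
dt_fore = index.get_level_values("dt_calc") + index.get_level_values("lead")

n_rows = len(dt_fore)          # one target value is needed per forecast row
n_unique = dt_fore.nunique()   # distinct observation timestamps
counts = dt_fore.value_counts()

# Away from the edges of the period, every hour is covered by two runs.
interior = counts[(counts.index >= runs[1]) & (counts.index < runs[-1] + pd.Timedelta(days=1))]
print(n_rows, n_unique, n_rows / n_unique)
```

Here 480 forecast rows meet 264 distinct observation times, and every hour away from the edges of the period is needed twice, once for each run that covers it. A set-like intersection of timestamps can return at most the distinct values, which is why the target has to be expanded instead.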
This is what a forecast DataFrame can look like:
pd.DataFrame(
[[1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan]],
index=pd.MultiIndex.from_tuples([(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 00:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 01:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 02:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 03:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 04:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 05:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 02:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 03:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 04:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 05:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 06:00:00'), 0),
],
names=['dt_calc', 'dt_fore', 'positional_index']),
columns=[1, 2, 'temp'])
In this case the target_data would look like:
pd.DataFrame({"target": [1, 2, 6, 7, 3, 35, 36]},
index=[pd.Timestamp('2021-04-01 00:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 06:00:00'),
])
After processing the data with the proposed align_features method, the forecast data would be the same, but the target_data would look like this:
pd.DataFrame({"target": [1, 2, 6, 7, 3, 35, 2, 6, 7, 3, 35, 36]},
index=[pd.Timestamp('2021-04-01 00:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 06:00:00'),
])
Dear @meteoDaniel,
thank you for the example inputs and outputs. It seems there is a bug in the align_features method: it implicitly assumes the existence of a forecast_features variable that is not passed or declared in the function. Regardless, I cannot reproduce the example target output, as the function produces a different result than the one you provided.
Dear @JesperDramsch,
I am sorry for the bug. The function is part of a class, and the bug originates from extracting it into a standalone function for this example. The reason for the wrong result is a typo in the target data frame: the index pd.Timestamp('2021-04-01 02:00:00') appeared twice.
Note that this only works if the dimension of positional_index is 1. In all other cases you have to add an additional loop over the positional indices (each equivalent to the location of a weather station or similar).
So here we are:
from typing import Tuple
import pandas as pd
import numpy as np
COLUMN_DT_FORE = 'dt_fore'
forecast = pd.DataFrame(
[[1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.],
[4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan], [1., 2., 3.], [4, 5, np.nan]],
index=pd.MultiIndex.from_tuples([(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 00:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 01:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 02:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 03:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 04:00:00'), 0),
(pd.Timestamp('2021-04-01 00:00:00'), pd.Timestamp('2021-04-01 05:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 01:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 02:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 03:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 04:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 05:00:00'), 0),
(pd.Timestamp('2021-04-01 01:00:00'), pd.Timestamp('2021-04-01 06:00:00'), 0),
],
names=['dt_calc', 'dt_fore', 'positional_index']),
columns=[1, 2, 'temp'])
tar = pd.DataFrame({"target": [1, 2, 6, 7, 3, 35, 36]},
index=[pd.Timestamp('2021-04-01 00:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 06:00:00'),
])
def align_features(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    takes both predictors and target values and derives intersection
    of both to create two matching dataframes by using dt_fore
    forecast_data contains MultiIndex with dt_calc, dt_fore, positional_index
    dt_calc: INIT/calculation run timestamp
    dt_fore: leading forecast timestamp
    positional_index: location based indexer
    """
    _target_data = []
    _target_index = []
    _rows_to_take = []
    # look up the target value for the valid time of every forecast row
    for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
        try:
            _target_data.append(target_data.loc[dt_fore, :].values)
            _target_index.append(dt_fore)
            _rows_to_take.append(True)
        except KeyError:
            # no observation for this valid time: drop the forecast row
            _rows_to_take.append(False)
    forecast_features = forecast_data.loc[_rows_to_take, :]
    target = pd.DataFrame(_target_data, index=_target_index, columns=['target'])
    return forecast_features, target

pd.testing.assert_frame_equal(
align_features(forecast, tar)[1],
pd.DataFrame({"target": [1, 2, 6, 7, 3, 35, 2, 6, 7, 3, 35, 36]},
index=[pd.Timestamp('2021-04-01 00:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 01:00:00'),
pd.Timestamp('2021-04-01 02:00:00'),
pd.Timestamp('2021-04-01 03:00:00'),
pd.Timestamp('2021-04-01 04:00:00'),
pd.Timestamp('2021-04-01 05:00:00'),
pd.Timestamp('2021-04-01 06:00:00'),
])
)
My intention in contacting Peter was to talk about topics like this and to see how others can benefit from them. Maybe we can start thinking about a climetlab function that does this for DataFrames and xarray objects.
This approach increases the amount of data to train on, which has both good and bad sides. Normally I add a feature to capture how the error increases further into the future.
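One possible way to encode this is a lead-time feature, i.e. the distance between dt_fore and dt_calc. The sketch below is an assumption about what such a feature could look like, not the implementation used in the pipeline described here; the helper add_lead_time and the column name lead_time_h are hypothetical.

```python
import numpy as np
import pandas as pd


def add_lead_time(forecast_data: pd.DataFrame, column: str = "lead_time_h") -> pd.DataFrame:
    """Return a copy with the forecast lead time in hours as an extra column."""
    dt_calc = forecast_data.index.get_level_values("dt_calc")
    dt_fore = forecast_data.index.get_level_values("dt_fore")
    out = forecast_data.copy()
    # Dividing the time difference by one hour gives the lead time as a float.
    out[column] = np.asarray((dt_fore - dt_calc) / pd.Timedelta(hours=1), dtype=float)
    return out


index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2021-04-01 00:00"), pd.Timestamp("2021-04-01 00:00"), 0),
        (pd.Timestamp("2021-04-01 00:00"), pd.Timestamp("2021-04-01 03:00"), 0),
        (pd.Timestamp("2021-04-01 01:00"), pd.Timestamp("2021-04-01 06:00"), 0),
    ],
    names=["dt_calc", "dt_fore", "positional_index"],
)
forecast = pd.DataFrame({"temp": [3.0, 4.0, 5.0]}, index=index)
with_lead = add_lead_time(forecast)
print(with_lead)
```

With such a column, a model can learn that the same valid time is predicted with different expected accuracy depending on how old the run is.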
In general, what do you think about the way I am working with forecast data?
Best regards and have a nice weekend.
It still seems to me that a solution without multiple lists, append calls, and nested for-loops (when the positional index comes into play) would be more efficient.
We can get the common dataframe with a simple join followed by dropping the rows without a target, which is equivalent to an inner join:
x = forecast_data.join(target_data, on=[COLUMN_DT_FORE])
x = x[x['target'].notna()]
For the forecast_features you then simply drop the target column:
forecast_features = x.drop(['target'], axis='columns')
and for the target dataframe you need to drop the extra MultiIndex levels, so it should be:
out = x.loc[:, ["target"]].reset_index(level=[0,2], drop=True)
out.index.name = None
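When several locations come into play, the same idea can be written as a merge on both dt_fore and positional_index. This is only a sketch under assumptions: the helper name align_by_merge is hypothetical, the target frame is assumed to carry a MultiIndex with levels named dt_fore and positional_index, and the forecast and target column names are assumed not to collide.

```python
from typing import Tuple

import pandas as pd

KEYS = ["dt_fore", "positional_index"]
LEVELS = ["dt_calc", "dt_fore", "positional_index"]


def align_by_merge(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Align forecasts and targets per valid time and location with one inner merge."""
    merged = (
        forecast_data.reset_index()
        .merge(target_data.reset_index(), on=KEYS, how="inner")
        .set_index(LEVELS)
    )
    target_columns = list(target_data.columns)
    return merged.drop(columns=target_columns), merged[target_columns]


t0 = pd.Timestamp("2021-04-01 00:00")
hour = pd.Timedelta(hours=1)
# Hypothetical data: two runs, three lead times each, two locations.
rows = [(t0 + r * hour, t0 + (r + h) * hour, loc) for r in range(2) for h in range(3) for loc in range(2)]
forecast = pd.DataFrame(
    {"temp": [float(i) for i in range(len(rows))]},
    index=pd.MultiIndex.from_tuples(rows, names=LEVELS),
)
# Observations exist for hours 0 to 2 only, so hour 3 of the second run is dropped.
target_rows = [(t0 + h * hour, loc) for h in range(3) for loc in range(2)]
target = pd.DataFrame(
    {"target": [10 * h + loc for h in range(3) for loc in range(2)]},
    index=pd.MultiIndex.from_tuples(target_rows, names=KEYS),
)

features, aligned_target = align_by_merge(forecast, target)
print(aligned_target)
```

Both returned frames keep the full three-level index, so rows can still be traced back to their run and location; dropping levels afterwards, as in the snippet above, remains possible.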
I think @floriankrb is better placed to talk about the actual scope of ClimetLab and whether processing functions like this should be included. Right now I understand it more as a data retrieval tool, but I'm just one of its users.
Dear @JesperDramsch, I was able to test both versions of the alignment.
- I tested them with a tiny test function. My implementation was faster, but this test script is not representative of my use case.
- So I also tested on a representative dataset with 14000 rows: my solution took 5:50 minutes and yours 6:10 minutes. The pipeline I used contains some other steps, so I think the relative difference is a bit higher.
In the past I had similar problems and tried using built-in functions like .apply(), but I think these functions are not always the best choice for heavy-load tasks.
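Timings like these depend heavily on the data and on the surrounding pipeline. A minimal sketch of how the two styles could be compared in isolation is shown below; the synthetic data, the function names align_loop and align_vectorized, and the number of repetitions are all assumptions, and no particular outcome is implied.

```python
import timeit

import numpy as np
import pandas as pd


def align_loop(forecast, target):
    """Row-by-row lookup, in the spirit of align_features."""
    rows, index, keep = [], [], []
    for dt_fore in forecast.index.get_level_values("dt_fore"):
        try:
            rows.append(target.loc[dt_fore, :].values)
            index.append(dt_fore)
            keep.append(True)
        except KeyError:
            keep.append(False)
    return forecast.loc[keep, :], pd.DataFrame(rows, index=index, columns=target.columns)


def align_vectorized(forecast, target):
    """Same alignment via a boolean mask and one label-based lookup."""
    dt_fore = forecast.index.get_level_values("dt_fore")
    keep = dt_fore.isin(target.index)
    return forecast[keep], target.loc[dt_fore[keep]]


# Synthetic data: 50 hourly runs with 24 hourly lead times each, one location.
start = pd.Timestamp("2021-04-01")
hour = pd.Timedelta(hours=1)
rows = [(start + r * hour, start + (r + h) * hour, 0) for r in range(50) for h in range(24)]
index = pd.MultiIndex.from_tuples(rows, names=["dt_calc", "dt_fore", "positional_index"])
forecast = pd.DataFrame({"temp": np.arange(len(rows), dtype=float)}, index=index)
# Targets cover only the first 60 hours, so later forecast rows are dropped.
target = pd.DataFrame(
    {"target": np.arange(60)},
    index=pd.DatetimeIndex([start + h * hour for h in range(60)]),
)

loop_features, loop_target = align_loop(forecast, target)
vec_features, vec_target = align_vectorized(forecast, target)
same = bool(
    np.array_equal(loop_target.to_numpy(), vec_target.to_numpy())
    and loop_features.equals(vec_features)
)

t_loop = timeit.timeit(lambda: align_loop(forecast, target), number=3)
t_vec = timeit.timeit(lambda: align_vectorized(forecast, target), number=3)
print(f"same result: {same}, loop: {t_loop:.3f} s, vectorized: {t_vec:.3f} s")
```

Checking that both variants return the same values before timing them keeps the comparison honest, and timing the alignment alone separates it from the rest of the pipeline.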
Thank you for going deeper in explaining your specific use case.
I agree with @JesperDramsch about avoiding the loop. Assuming you actually know the structure of your time axes, you could drop the multi-index (adding some safety checks), filter the underlying numpy array appropriately, and recreate a pandas DataFrame with a single-dimension index. Also, I would intuitively expect xarray to be a better fit than pandas for this task, even if 14000 rows is not that much.
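A rough sketch of this idea follows, assuming a single location, a unique target index, and purely numeric forecast columns. The helper name align_with_numpy is hypothetical, and the checks are only examples of the kind of safety checks that could be added.

```python
from typing import Tuple

import pandas as pd


def align_with_numpy(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Align on dt_fore by filtering the underlying arrays, then rebuild flat frames."""
    if not target_data.index.is_unique:
        raise ValueError("target index must be unique")
    if forecast_data.index.get_level_values("positional_index").nunique() != 1:
        raise ValueError("this sketch handles a single location only")

    dt_fore = forecast_data.index.get_level_values("dt_fore")
    # Position of each forecast valid time in the target index, -1 if missing.
    positions = target_data.index.get_indexer(dt_fore)
    keep = positions >= 0

    flat_index = dt_fore[keep].rename(None)  # single-level index of valid times
    features = pd.DataFrame(
        forecast_data.to_numpy()[keep], index=flat_index, columns=forecast_data.columns
    )
    target = pd.DataFrame(
        target_data.to_numpy()[positions[keep]], index=flat_index, columns=target_data.columns
    )
    return features, target


t0 = pd.Timestamp("2021-04-01 00:00")
hour = pd.Timedelta(hours=1)
# Hypothetical data: two runs with three lead times each, one location.
rows = [(t0 + r * hour, t0 + (r + h) * hour, 0) for r in range(2) for h in range(3)]
forecast = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pd.MultiIndex.from_tuples(rows, names=["dt_calc", "dt_fore", "positional_index"]),
)
target = pd.DataFrame(
    {"target": [10, 11, 12]},
    index=pd.DatetimeIndex([t0 + h * hour for h in range(3)]),
)

features, aligned = align_with_numpy(forecast, target)
print(aligned)
```

The lookup happens once for all rows through get_indexer, so no Python-level loop over timestamps is needed; the price is that the run information (dt_calc) is no longer part of the index.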
It looks like we could include this feature in climetlab, especially if more people speak up in favor of it.