Time series data with a combination of continuous and categorical features
Hi,
I have a dataset with sequences of continuous and categorical variables. The aim is to classify these sequences. I searched the tutorial notebooks but couldn't find an example where this scenario is discussed.
In my opinion, a strategy similar to what fastai's tabular_learner does could be considered: the categorical tensors are first passed through an embedding and then concatenated with the continuous features. Next, the resulting tensor can be passed through the rest of the network.
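For illustration, here is a minimal sketch of this strategy in plain PyTorch. This is not tsai's API; the class, names, and shapes are made up:

```python
import torch
import torch.nn as nn

class EmbedConcat(nn.Module):
    "Embed each categorical channel per time step, then concat with continuous channels."
    def __init__(self, cardinalities, emb_dims, n_cont):
        super().__init__()
        self.embeds = nn.ModuleList(
            [nn.Embedding(c, d) for c, d in zip(cardinalities, emb_dims)])
        self.n_out = sum(emb_dims) + n_cont  # channels fed to the rest of the network

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, n_cat, seq_len) int64; x_cont: (batch, n_cont, seq_len) float
        embs = [e(x_cat[:, i]).permute(0, 2, 1)  # -> (batch, emb_dim, seq_len)
                for i, e in enumerate(self.embeds)]
        return torch.cat(embs + [x_cont], dim=1)  # (batch, n_out, seq_len)
```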
If this scenario has been considered already, I would be grateful if you could point me to the associated notebook. If not, any guidance on how it could be implemented would be very useful. I would be keen to submit a PR dealing with this as I am sure some users may have similar datasets.
Thanks
Yes, one also has this problem, although one is trying to do regression instead. Currently one (stupidly) reshapes the data into tabular shape, takes part of the TabularModel section, passes the data through it, and reshapes it again so it can go into, say, an LSTM.
Hi, @tkharrat and @Wabinab,
This is a very timely request. I'm currently working on it. I created a separate branch a few days ago called static data to address this same issue. The coding is almost finished, although I still have a few questions on the input, and it also needs to be tested. It'd be great if you could help.
What I'm planning to do is to expand the get_ts_dls arguments to also accept cat and cont features. The output would be a MixedDataLoaders object that would create batches containing the time series, cat and cont features, and the target. This could be used to handle any type of task (classification, regression, etc.). It all depends on the target you pass to the dataloaders.
As there are many different input data formats, I'd rather keep it simple. I'd prefer to have X, y, cat and cont as arrays (in memory or on disk). It'd be up to the user to create them (although I have added, or may add, some helper functions in tsai to facilitate data preparation). This means that each user would be responsible for ensuring the integrity of the data, that is, that X, y, cat and cont are of the same length and belong to the same instances. This would provide flexibility to handle different scenarios. Would this be acceptable to you? If not, please explain why not.
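A hedged sketch of how this planned API might be used (the cat/cont arguments are the proposal above, not a released signature, and the names are illustrative):

```python
# Hypothetical usage of the planned get_ts_dls extension; argument names are not final.
# X:    (n_samples, n_vars, seq_len) float array, in memory or on disk
# cat:  (n_samples, n_cat)  integer-encoded static categorical features
# cont: (n_samples, n_cont) static continuous features
# y:    (n_samples,) targets (labels for classification, floats for regression)
dls = get_ts_dls(X, y, splits=splits, cat=cat, cont=cont, bs=64)
xb, yb = dls.one_batch()  # batches would bundle the time series, cat and cont features
```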
If this is acceptable, I'd need to make a few final updates, and it'd be ready for testing. This won't take too long. I'd like to test it on the 2nd branch before merging with the main one. Do you have data to test the updated functionality? Would you be willing to help with testing?
Hi @oguiza,
Thanks very much for your reply.
Just to be on the safe side, are we saying that the batch will be a tuple of size 4: (X, cat, cont, y)? I think this setup is great if cat and cont are static features, but what happens when one of the columns in the time series data is actually categorical? For example, assume we have weather data in X (p variables observed across t time steps, hence X is a p x t matrix) and some of the p variables are categorical (giving the pressure level, for example, as low, medium, or high). How do we deal with this in the setup you are suggesting?
I am happy to get involved as much as needed. I can write some tests, provide test data, or anything else. Just let me know what works best for you.
Hi @oguiza, yes, sure, one is willing to do some testing as well, and one will open up issues if there are any problems encountered. Do tell when you need help. The tabular chapter from fastbook has a nice small dataset that we could use for some basic testing as well. (One's current working dataset is really large: a subset doesn't work really well with a baseline model, and training on the whole dataset is time-consuming.)
@tkharrat No, X would be split into cat and cont, so we only have three elements: (cat, cont, y) (or some other order). Then, the model must be able to take two inputs. So, for example:
```python
class Model(nn.Module):
    def __init__(self): ...
    def forward(self, cont, cat): ...
```
Then, you could handle the embeddings for cat separately before concatenating it with cont in forward (just like what you suggested in the first post).
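A self-contained sketch of such a two-input model, assuming a single categorical feature and made-up sizes:

```python
import torch
import torch.nn as nn

class TwoInputModel(nn.Module):
    "Toy example: embed one categorical feature, concat with cont, feed an LSTM."
    def __init__(self, n_classes, cardinality=10, emb_dim=4, n_cont=3, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(cardinality, emb_dim)
        self.lstm = nn.LSTM(emb_dim + n_cont, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, cont, cat):
        # cont: (batch, seq_len, n_cont) float; cat: (batch, seq_len) int64
        x = torch.cat([self.emb(cat), cont], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # use the last hidden state for classification
```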
Or do you have some other methods to deal with them?
Update: The approach one posted before, using TabularModel's embedding before reshaping, doesn't work. The training loss does not decrease, nor does it explode.
Ok, I think I may have misinterpreted both of you. The solution I'm working on is applicable to time series of continuous data that, in addition to the time series data, has some categorical or continuous data for each time series. Let me share with you a fully made-up example:
- The dataset contains data for 1 yr from 10 cities (with different start and end dates). Each sample contains data for 1 day.
- X contains hourly temperature and atmospheric pressure. shape: (3650 days*cities, 2 variables, 24 hours).
- cat contains the name of the city where data is collected. We have data for 10 cities: shape: (3650 days*cities, 1 variable).
- cont contains the altitude of the city and the date when data was collected: shape: (3650 days*cities, 2 variables).
- y (target) are hourly temperatures for the next 6h.
In this scenario, X is 3d but cat and cont are 2d. We'd like to train a single model using this dataset.
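For concreteness, the arrays above could be mocked up with dummy data like this:

```python
import numpy as np

n = 3650  # days * cities
X    = np.random.rand(n, 2, 24).astype('float32')  # hourly temperature & pressure
cat  = np.random.randint(0, 10, size=(n, 1))       # city (integer-encoded, 10 cities)
cont = np.random.rand(n, 2).astype('float32')      # altitude & date (as a number)
y    = np.random.rand(n, 6).astype('float32')      # temperatures for the next 6h
```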
Could each of you, @tkharrat and @Wabinab, please provide an example like this to make sure we understand your needs?
@oguiza I think the problem you are describing is different from what I had in mind. If I modify your example a bit, that data would be:
- The dataset contains data for 1 yr of observation from, say, the same city. Each sample contains data for 1 day. X contains hourly temperature, wind speed and atmospheric pressure measured continuously, and humidity given as a categorical feature with 5 levels, say: very low, low, medium, high, very high. shape: (365 days, 4 variables: 3 cont + 1 cat, 24 hours).
- y (target) are the hourly temperatures for the next 6h, or maybe an independent categorical variable if we want to solve a classification problem.
@oguiza I think solving this problem is independent from what you are doing. Maybe it can be done by having a preprocessing head in the network that deals with the cat columns, passes them through the embedding, and then applies the rest of the network. Of course, the dataloaders will have to be adjusted to pass two tensors (cat, cont) instead of one.
@Wabinab can you confirm that you had the same requirement?
Oh right, @oguiza, the problem you describe is different from what one encountered. Sorry for the misinterpretation. My problem/requirement is similar to @tkharrat's. The time series data is of shape (seq_len, num_features), so that at each time step there are num_features describing that particular time step. So one would have X of size (batch_size, seq_len, num_features) and y (target) of shape (batch_size, seq_len) (at least for one's problem, but it might well be another shape depending on what regression/classification problem one is trying to solve, so y's shape isn't too rigid).
So the shape of X would be as below (or its transpose):
| | time step 1 | time step 2 | time step 3 | ... |
|---|---|---|---|---|
| feature 1 | ... | ... | ... | ... |
| feature 2 | ... | ... | ... | ... |
| feature 3 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
One hasn't thought of a good way to solve the categorical problem. Though NLP uses embeddings, one hasn't encountered an NLP problem with multiple features: usually a sentence/paragraph is passed in as input, so that's really one single feature.
Hi @tkharrat, have you managed to solve it? One managed to get a working model, but it is a hybrid between tsai, fastai, and plain PyTorch. If you would like, one could make a dummy notebook to demonstrate how one did it.
Hi @tkharrat and @Wabinab, I'm working on a solution that would require just a single dataloader. This means a preprocessing step is necessary to transform the categories so that a batch contains both the X_cat and X_cont data (as a single tensor). The model would then split the tensor into X_cat and X_cont. It'd be good to have a dataset where we can test the approach. Are you aware of any public datasets that reflect your needs? Alternatively, we could build a dummy dataset.
Hi @oguiza, thanks for the update. As promised, one made the notebook here: https://colab.research.google.com/drive/1rwA8lzCoz_PpIb9qceADSM8OUuwxIwBX?usp=sharing — you can check out how one does it for the moment. And @oguiza, no, one hasn't found any dataset that resembles this. If we cannot use a competition dataset, then perhaps we need to dig deeper or, more easily, just make one ourselves as you mentioned.
Thanks again for your work @oguiza
I came across an interesting dataset in this public repo. It is not strictly speaking a time series but rather a sequence of events. Nevertheless, it has the properties we need: a sequence of categorical and continuous features, and hence most of the models in the library can be applied.
They show how they obtain and prepare the data in this notebook. I used the examples therein to create this colab notebook that could be used as the start of a tutorial we could add to the tutorials_nbs. Given that the data preparation may take some time, I saved the resulting objects here so people can use them. @oguiza, let me know if there is a better place to save the data. In particular, the object we care about is this one, which is a list of tuples:
- The first element in the tuple is a `pandas.DataFrame` giving the sequence.
- The second element is the binary target (a string). Obviously this data will have to be processed further to be in a suitable format.
This data set is interesting for several reasons:
- The data has a mixture of categorical and continuous features as mentioned above
- Similar to an NLP problem, the sequences (sentences in NLP) have different lengths. This will allow us to show users how to prepare data in this case and how to pad (short) or truncate (long) sequences. We could also take advantage of fastai's SortedDL.
- The authors of the repo have written a paper where they describe some learners they derived for the same data/problem. This offers us a benchmark to test our learners against published work. I think they even share their implementation of these learners, but I am not sure (they don't use sequence models but rather standard tabular ML learners with hand-crafted features).
@oguiza @Wabinab Let me know what you think. If you are happy with it, I am keen to help @oguiza turn the notebook I started into a more detailed tutorial.
Hi @tkharrat, thanks for sharing this repo. It's very interesting, and I'm very keen on adding this type of functionality to tsai. (I love football as well, and I've always been interested in this type of data :) ) I'm quite busy at the moment though and won't be able to start working on this until the week of Nov 8th.
> It is not strictly speaking a time series but rather a sequence of events.
This is fine. tsai is focused on time series and sequential data. The index doesn't even need to be time. It could be something else.
I have a question about the data. Will we be able to create a 3d array from it? [samples x features x steps]
There's already a function in tsai you may use to preprocess a df (preprocess_df). You can pass processors (like Categorify, which will convert any categorical variable to an integer that can then be converted into an encoding).
Once you have that, you should be able to use df2Xy to create your input and target data.
Once you build the data, you should be able to use a normal TSDataLoaders to create batches. This process would use a single dataloader, and batches would contain both cat and cont features.
Please take a look at it and let me know if that helps build the data. We may need to consider other options (like handling the data with a tabular dataloader). I'd like to analyze the pros and cons when I have time.
This process could certainly be streamlined to make it easier to use. What's also missing are models that can take both categorical and continuous variables, create embeddings, and concatenate the data. There are many options in this case. We'll need to see what works best, although it may depend on the use case.
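A rough sketch of that pipeline. The function names are the ones mentioned above, but the exact signatures are assumptions and may differ between tsai versions, so treat this as pseudocode:

```python
from tsai.all import *

# df is a dataframe with a mix of categorical and continuous columns plus a target.
# 1. Integer-encode categorical columns (signature assumed, not guaranteed):
df = preprocess_df(df, procs=[Categorify])
# 2. Build the 3d input array [samples x features x steps] and the target:
X, y = df2Xy(df, target_col='target')  # argument names are illustrative
# 3. Create batches with a single dataloader; cat and cont travel together in X:
dls = get_ts_dls(X, y, splits=splits)
```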
Hi @oguiza
The data provided is a list of tuples as described above. One thing to note though is that the sequences of play (the first element in each tuple) are of different lengths (i.e. the number of rows in the DataFrame). That's why I mentioned padding in my previous post (which will happen in the before_batch step). Therefore, I am not sure we can rely on the combination of preprocess_df and df2Xy, which assumes the same number of time steps in every example. Besides, preprocess_df does not handle the splits and processes the full DataFrame, which may create future information leakage.
I think for this task, it may be easier to write a custom TfmdLists that has a proper show method and creates a batch with 3 tensors (cat, cont, target). If we store cat and cont in the same tensor, we will have to convert cat to float (from integer) and then re-split it into 2 tensors, because cat will have to go through an embedding first.
However, if we present the data in a 3-tensor format, I think we could use any tsai learner with a small amendment:
- Inherit from the desired Learner class.
- Slightly amend the `forward()` method: it will have to take the `cat` tensor, pass it through the embedding, concatenate it with `cont`, and then call the parent class's `forward()` method.
The main drawback of this approach is that it may not be generalizable to any sequence data with cat and cont features (users will have to prepare their own TfmdLists), but it could be used as an example of how it can be done.
I am happy to make an attempt and share the notebook. Maybe @oguiza it can help you generalise it even further.
Hi @tkharrat, I've been looking at the dataset and notebook you have created. I find it too complicated for a notebook tutorial. The goal of tutorials is to demonstrate how you can easily apply a model or method to your own dataset. So I'd like to propose something that might still be useful:
- Create a dataframe using one of the UCR multivariate datasets.
- Modify the scale. This will allow us to demonstrate how to scale the data using different techniques (standardization, normalization, etc.) while preventing leakage.
- Add categorical data. This will allow us to demonstrate the use of categorify. And later demonstrate the use of embeddings.
- We can also modify the time series length so that they are different. This will allow the use of padding and truncating.
- We can also delete some data to simulate nan values. This will require df2Xy to be modified to allow the padding/truncating.
Once the data is ready, we can create the dataloaders. I believe minor changes will be necessary. In my opinion, we'll be able to use a single dataloader for both categorical and continuous data.
This approach will require a change in the models (which I'm willing to make anyway) to adapt them to separate categorical from continuous data (I've tested it with an attribute cat_pos for categorical positions and cont_pos for continuous positions, and it works well). I've also built a MultiEmbedding layer that needs to be added to the models (this will allow the creation of multiple, concatenated embeddings).
I believe all this is something that can be done fairly quickly. We can create a simple tutorial with this approach. You can then test this approach with the large dataset you proposed. And if you find any issues, we may need to make adjustments.
As I said, I am willing to make this available in tsai as I think it is a common scenario. Please let me know what you think and if you are still willing to collaborate on this.
Edit: I've created a gist to create the dataset I mentioned above containing categorical and continuous variables. It just takes a few seconds to create it.
Hi @oguiza,
The plan above sounds good. I just want to add a few comments:
1. If you want to load your data in a single tensor and use an attribute cat_pos for categorical positions and cont_pos for continuous positions, that's fine, but you will have to convert int to float in the cat_pos locations. No big deal obviously, but it does not seem to be the fastai philosophy.
2. On the other hand, if we follow the fastai tabular approach and return our data as tuples of length 3 (cats, conts, target), then I think modifying the models becomes super easy. We just need to create a head that applies the embedding to cats and concatenates the result with conts. The resulting tensor can be processed by the forward() method of any model already defined in tsai.
I made an attempt at implementing option 2 here. It does not seem too bad.
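For contrast, option 1 (a single batch tensor plus position attributes) would only need a small split inside forward. A sketch, reusing the hypothetical EmbedConcat layer from earlier in the thread, with cat_pos/cont_pos following the attribute names @oguiza mentioned:

```python
import torch
import torch.nn as nn

class SplitByPos(nn.Module):
    "Option 1 sketch: one batch tensor; the model splits cat from cont by position."
    def __init__(self, base_model, embed_concat, cat_pos, cont_pos):
        super().__init__()
        self.embed_concat = embed_concat  # e.g. the EmbedConcat layer sketched earlier
        self.base, self.cat_pos, self.cont_pos = base_model, cat_pos, cont_pos

    def forward(self, x):
        # x: (batch, n_vars, seq_len); cat channels were stored as floats in the batch
        x_cat  = x[:, self.cat_pos].long()  # back to int64 for the embeddings
        x_cont = x[:, self.cont_pos]
        return self.base(self.embed_concat(x_cat, x_cont))
```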
I am happy with whichever path you want to take. Just let me know how to proceed and how I can help; I am more than keen to contribute.
Hi @tkharrat,
Thanks for sharing your code. I think it looks very good.
Personally, I think both options are quite similar. Making a model split a tensor into categorical and continuous variables, or take a tuple, is easy. I can adapt the tsai models to either option (or both). The most difficult part is to create a pipeline that transforms a dataframe with categorical and continuous data into batches. I can see 2 options:
- Transform the df into a numpy array using any type of sklearn preprocessors. Once we have the array we pass it to a tsai dataloader.
- Create a dataloader using fastai's tabular data approach.
One of the reasons I created tsai's TSDataLoaders is that fastai's way of handling arrays was very slow. It processes each sample individually and applies the transforms on the fly. tsai preprocesses all samples during dataset creation and then manages all samples in the batch at the same time. That's why batch creation is about 100 times faster using tsai compared to the vanilla DataBlock process. I don't know, however, how fast the tabular data process is, but it's something to bear in mind. Here's what I propose:
- Let's use the dataset I shared in the gist. I think it has all the required features.
- Let's try several methods to create a dataloader, and let's measure the dataloader performance. One of the methods could be the one you have used in your notebook. But ideally, we should avoid customizing the solution as much as we can, although some transforms will be required (such as the padding transform).
- When we have that, I'll make the required updates to tsai's models to allow them to take batches of categorical & continuous data (as a single tensor or a tuple of tensors).
Once we get this to work well, you'll need to create a custom Tensor subclass with a show method to display your data.
Please, let me know if you are ok with this approach.
I'll start working on #2 now.
Hi @oguiza,
I think this plan makes sense. Actually, you already solved point 3 with MultiEmbedding. With minimal coding, I think we can extend any tsai model to fit the current setup.
I will carry on experimenting on my side and report back.
Hi @tkharrat,
I've made some progress on this task. Here's the approach I'm working on. I think it'd be good to split data preparation into separate components:
- initial data preparation: this encompasses all tasks at the dataframe level. These transforms are usually applied once and don't need to be reversible. In fastai, these types of tasks are handled with separate functions (like make_date, add_datepart, add_elapsed_times, df_shrink, etc.), but there are many others, like feature engineering, missing data encoding, etc. All this will happen before a tsai dataset is built. This will give the user maximum flexibility.
- transform the dataframe into X and y arrays (using SlidingWindowPanel). The output of this could be a numpy array in memory or on disk (useful for large datasets that don't fit in memory). And sequences may have different lengths.
- once the arrays are converted into tensors using a TSDataset, apply batch transforms. There are several functions already available in tsai, like Standardize, Normalize, etc., that can be used to scale the data.
TODO:
- initial data preparation: there are many functions in sklearn that can be used to build a pipeline, but other functions may be required. I've already prepared a few that are not available in sklearn.
- df to Xy: SlidingWindowPanel may be used to create the X and y arrays from the data. However, I've identified a couple of areas of improvement that I'll implement soon, and I'd like to know your opinion on them. Let's analyze different scenarios, with a window length of 30 in all of them (see the sketch after this list):
  - a sequence is 100 steps long. How should we apply the sliding window? I'd say create 3 windows at positions 11-40, 41-70, and 71-100, and discard 1-10. This is not the way it works now (1-30, 31-60, and 61-90, discarding 91-100).
  - a sequence is 20 steps long. Option 1: discard the sample since it's too short. Option 2: use it, padding at the end. Option 3: use it, padding at the beginning. I think option 3 makes more sense.
  - a sequence has 2 steps. Should a minimum number of steps be set? I'd say it makes sense to set a minimum percentage.
- The models will be updated to accept both categorical and continuous data.
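A toy illustration of the end-anchored windowing proposed in the second item above (plain Python; tsai's actual SlidingWindow logic differs):

```python
def window_starts(seq_len, window=30):
    "End-anchored windows: any remainder is the oldest part of the sequence."
    n = seq_len // window                     # number of full windows
    remainder = seq_len - n * window          # oldest steps left over (pad or discard)
    return remainder, [remainder + i * window for i in range(n)]

window_starts(100)  # -> (10, [10, 40, 70]): windows cover steps 11-40, 41-70, 71-100
window_starts(20)   # -> (20, []): too short; pad at the beginning or discard
```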
I'd love to have your view on all this.
> I think option 3 makes more sense.
I wonder, why does it make more sense to pad at the beginning than at the end?
> a sequence has 2 steps. Should a minimum number of steps be set? I'd say it makes sense to set a minimum percentage.
Absolutely!
Thanks for sharing your thoughts, @vrodriguezf! My rationale is that if we are preparing a dataset with a window length of 30 but only have 20 time steps for a given instance, I would think we are missing some history (at the beginning) and only have the 20 last steps (you may want to predict the next one, for example, in forecasting). It's true that it may not matter much in a classification or regression task. What do you think?
Mmm, ok, I see the point: when you have to choose what to keep, you give priority to the future because, in the end, that's generally what matters most in the time series world.
Anyway, for the first case you propose (100 time steps), there is no need to discard, just padding, right?
> Anyway, for the first case you propose (100 time steps), there is no need to discard, just padding, right?
It depends on whether a minimum unpadded length is established or not. Bear in mind that the "remainder" might be just 1 time step, and you don't want to use a single step padded up to 30. I proposed setting a limit (which you seemed to agree with). So in the end the solution might be to keep the pad_remainder arg but add a min_remainder_size (or similar), and change the logic to start from the end so that the remainder is always the oldest part of the sequence.
oh yes sure, it's the same situation in both cases :) looks coherent to me!!
cc: @tkharrat, @vrodriguezf
I'd like to update the status of this enhancement request. I believe all the different components required to perform this type of task are in place now. I've started to use it on some proprietary datasets and it seems to work well.
I've made the following additions to tsai:
- Updated SlidingWindow and SlidingWindowPanel so they can now pad sequences starting from either end.
- Added TSCategoricalEncoder that allows the creation of categories for 1 or more columns in a pandas dataframe. It has a state so you can fit on the train set, and then transform train, validation, and test data.
- Added MultiEmbedding layer that applies an embedding layer to each variable passed as categorical, concatenating the output to continuous features.
- Added cat_pos (categorical variable position), n_embeds (number of embeddings) and emb_szs (embedding sizes) to LSTMPlus and TSiTPlus. This may be extended to other models if necessary.
The main task that is remaining IMO is a tutorial to demonstrate how this functionality can be used. I will try to do this over the coming weeks.
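In the meantime, here is a hedged sketch of how the new pieces might fit together. The names follow the additions listed above, but the method and argument signatures are assumptions and may differ from the released code:

```python
from tsai.all import *

# 1. Integer-encode categorical dataframe columns; fit on train, reuse elsewhere
#    (method and argument names are assumptions, not the released API):
encoder = TSCategoricalEncoder(columns=['city', 'weather'])
encoder.fit(train_df)
train_df, valid_df = encoder.transform(train_df), encoder.transform(valid_df)

# 2. Point the model at the categorical variable positions within X:
model = LSTMPlus(c_in=5, c_out=2,
                 cat_pos=[0, 1],             # categorical variable positions
                 emb_szs=[(10, 4), (7, 3)])  # assumed: (cardinality, emb size) pairs
```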
Wow, amazing work Ignacio!!! This is a huge addition to the library.
> It has a state so you can fit on the train set, and then transform train, validation, and test data.
What does this mean? Does this have to do with the situation where some categories of a variable may not appear in training but do at test time?
> Wow, amazing work Ignacio!!! This is a huge addition to the library.
I think so. I've started to use categorical embeddings on some datasets, and it works pretty well.
> It has a state so you can fit on the train set, and then transform train, validation, and test data.
>
> What does this mean? Does this have to do with the situation where some categories of a variable may not appear in training but do at test time?
Exactly. You need to ensure you calculate categorical embeddings in the same way you calculated them for training. It's not just that they may be missing. They might also be in a different order.
Hi @oguiza, is there a chance you could post here a short chunk of code showing the TSCategorical functionality side by side with a TSRegression, for instance? I understand a fully fleshed-out tutorial would be better, but a few guiding steps could be very helpful.
Closing this issue due to lack of activity and progress. If necessary, please create a new one.