[FEAT] Add tsmixer-basic
TSMixer was originally reported as two different models: tsmixer-basic (which allows for past covariates and is called simply TSMixer in the paper) and tsmixer-ext, which allows for past, future, and static covariates. All results in the paper except for the M5 dataset used tsmixer-basic. The darts implementation is based on tsmixer-ext.
However, tsmixer-ext isn't identical to tsmixer-basic even when there are no static or future covariates. The key difference is that tsmixer-basic projects to output_chunk_length in the final layer, effectively encoding the historical data while maintaining its time dimension, whereas tsmixer-ext projects the historical and static data to output_chunk_length in the first layer. I don't think this is optimal, as it limits the usefulness of the residual connections. My testing with the original google-research source code shows a performance regression of roughly 10% higher MAE and MSE on the weather dataset when the temporal projection step is moved to the top of the model.
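To make the distinction concrete, here is a minimal, hypothetical PyTorch sketch of where the temporal projection sits in each variant (ProjectionToy and its plain nn.Linear "mixing blocks" are stand-ins for illustration, not the darts internals):

    import torch.nn as nn

    class ProjectionToy(nn.Module):
        """Toy model showing only where the temporal projection happens."""

        def __init__(self, input_len, output_len, hidden, num_blocks, project_first):
            super().__init__()
            self.project_first = project_first
            # stand-ins for the stacked time/feature-mixing blocks
            self.blocks = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_blocks)])
            self.temporal_proj = nn.Linear(input_len, output_len)

        def _project(self, x):
            # (batch, time, hidden) -> (batch, output_len, hidden)
            return self.temporal_proj(x.transpose(1, 2)).transpose(1, 2)

        def forward(self, x):  # x: (batch, input_len, hidden)
            if self.project_first:      # tsmixer-ext style: residual blocks only ever
                x = self._project(x)    # see the compressed output_chunk_length view
            for block in self.blocks:
                x = x + block(x)        # residual mixing
            if not self.project_first:  # tsmixer-basic style: mix the full history,
                x = self._project(x)    # project only at the very end
            return x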
If the maintainers think this would be valuable, I can implement it. I think the most sensible way to do so would be to add a project_first=True keyword argument.
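Something along these lines (illustrative only; project_first is the proposed argument and does not exist yet):

    from darts.models import TSMixerModel

    # current darts behaviour (tsmixer-ext: project in the first layer)
    model_ext = TSMixerModel(input_chunk_length=96, output_chunk_length=24, project_first=True)

    # proposed tsmixer-basic behaviour (project in the final layer)
    model_basic = TSMixerModel(input_chunk_length=96, output_chunk_length=24, project_first=False)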
Hi @eschibli,
If you think that you can elegantly make the tsmixer-basic architecture available through the TSMixerModel API/constructor, it would certainly be valuable to have a variant of this model that performs better when no future covariates are available, which is the case in many situations. I would probably call the argument first_layer_projection instead of just project_first, but we can discuss it in your PR.
You will also need to add checks in the fit() method so that an error is raised if first_layer_projection=False and future/static covariates are provided.
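Something along these lines would be enough (just a sketch; the exact attribute names and the single-series handling are assumptions):

    def fit(self, series, past_covariates=None, future_covariates=None, **kwargs):
        # guard: the basic variant consumes neither future nor static covariates
        if not self.first_layer_projection:
            if future_covariates is not None:
                raise ValueError(
                    "`first_layer_projection=False` (tsmixer-basic) does not support future covariates."
                )
            if getattr(series, "has_static_covariates", False):  # single-series case, for brevity
                raise ValueError(
                    "`first_layer_projection=False` (tsmixer-basic) does not support static covariates."
                )
        return super().fit(
            series,
            past_covariates=past_covariates,
            future_covariates=future_covariates,
            **kwargs,
        )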
Hi @madtoinou - in my work I have seen significantly better performance on problems with both past and future covariates using an encoder-decoder layout, projecting in the middle of the backbone so that past and future values are each mixed in their original time space. This was not reported in the original paper but seems like a logical extension, and it reduces to tsmixer-ext with project_after=0 and to tsmixer-basic with project_after=num_blocks.
i.e.,
class TSMixerModel(MixedCovariatesTorchModel):
    def __init__(
        self,
        ...,
+       project_after=0,
    )

class _TSMixerModule(...):
    def __init__(self, ...):
        ...
-       self.conditional_mixer = ...
+       # mixing blocks that operate in input_chunk_length space (the "encoder")
+       self.encoder_mixer = self._build_mixer(
+           sequence_length=self.input_chunk_length,
+           num_blocks=project_after,
+           future_cov_dim=0,
+           hidden_size=hidden_size,
+           static_cov_dim=static_cov_dim,
+           **mixer_params,
+       )
+       # mixing blocks that operate in output_chunk_length space (the "decoder")
+       self.decoder_mixer = self._build_mixer(
+           sequence_length=self.output_chunk_length,
+           num_blocks=(num_blocks - project_after),
+           future_cov_dim=future_cov_dim,
+           hidden_size=hidden_size,
+           static_cov_dim=static_cov_dim,
+           **mixer_params,
+       )
+       # temporal projection between the two stacks
+       self.encoder_to_decoder = nn.Linear(
+           self.input_chunk_length, self.output_chunk_length
+       )

    def forward(self, x_in):
        ...
+       # mix the history in its original input_chunk_length time space
+       for mixing_layer in self.encoder_mixer:
+           x = mixing_layer(x, x_static_hist)
+       # project to output_chunk_length space
+       x = x.transpose(1, 2)
+       x = self.encoder_to_decoder(x)
+       x = x.transpose(1, 2)
        ...
        if self.future_cov_dim:
            x_future = self.feature_mixing_future(x_future)
            x = torch.cat([x, x_future], dim=-1)
-       for mixing_layer in self.conditional_mixer:
+       for mixing_layer in self.decoder_mixer:
+           x = mixing_layer(x, x_static_future)
        ...
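Hypothetical usage once such an argument exists (names and values illustrative only):

    from darts.models import TSMixerModel

    # project_after=0 keeps the current behaviour (tsmixer-ext: project first),
    # project_after=num_blocks reproduces tsmixer-basic (project last),
    # and intermediate values give the encoder-decoder variant sketched above
    model = TSMixerModel(
        input_chunk_length=96,
        output_chunk_length=24,
        num_blocks=4,
        project_after=2,
    )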
Is this something we would consider merging?