STAM-pytorch
regression
Beautiful work as usual, thanks for this implementation.
I'm curious if you have tried using this for a regression task. I have tried using TimeSformer without success so far. I know the signal is there because I can learn it with a small 3D CNN trained from scratch, so I suspect my understanding of how and where to modify the transformer is the culprit. The output is a 1D vector with len == num_frames. Any suggestions very appreciated!
This is a pure code implementation; there are no experiments, training code, or tests.
I am currently using this and TimeSformer. For regression, you don't need to modify anything: just set num_classes to the number of regression targets and use MSELoss.
The output of these types of models comes from the cls_token attending to the other inputs. You can see that the head is super simple:
self.mlp_head = nn.Linear(dim, num_classes)
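For example, here is a minimal sketch of a one-variable regression setup with this repo's STAM (constructor arguments follow the README; shapes and hyperparameters are placeholders, adjust to your data):

```python
import torch
import torch.nn as nn
from stam_pytorch import STAM

model = STAM(
    dim = 512,
    image_size = 256,
    patch_size = 32,
    num_frames = 5,
    space_depth = 12,
    space_heads = 8,
    space_mlp_dim = 2048,
    frame_depth = 4,
    frame_heads = 8,
    frame_mlp_dim = 2048,
    num_classes = 1       # number of values to regress, not classes
)

frames = torch.randn(2, 5, 3, 256, 256)  # (batch, frames, channels, height, width)
targets = torch.randn(2, 1)              # one continuous target per clip
loss = nn.MSELoss()(model(frames), targets)
loss.backward()
```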
@tcapelle
What do you mean by "number of regressors"?
I initially had a classification-based transformer code and then converted it to a regressor.
I am not sure if the following is correct. Is 1 correct here? If not, what should I set it to?
self.mlp_head = nn.Sequential(
nn.LayerNorm(emb_dim),
nn.Linear(emb_dim, 1) # is this 1 correct for regression?
)
Previously, it was:
nn.Linear(emb_dim, num_classes)
Hi, did you figure out how to use TimeSformer for regression tasks? I am trying to do the same but have had no luck.
Yeah, that's it! You will put as many outputs as variables to regress. If you have only one-dimensional regression, then 1 is it.
My only takeaway is that most regression problems can be converted to classification problems by binning the outputs. Instead of predicting the price of a good in, let's say, a range of [0, 100], you will predict the probability of the value falling in bins: [0,10], [10,20], ..., [90,100]. This way you get a probabilistic model that can be trained with standard cross entropy loss. It's a very useful trick. The tricky part is creating a data pipeline to train this model; good luck 👍.
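A rough sketch of the binning in plain PyTorch (the bin edges here are just the [0, 100] example above):

```python
import torch
import torch.nn.functional as F

num_bins = 10
edges = torch.linspace(0.0, 100.0, num_bins + 1)  # [0, 10, 20, ..., 100]

def to_bin(y):
    # map a continuous target in [0, 100] to a class index in [0, num_bins - 1]
    return torch.clamp(torch.bucketize(y, edges) - 1, 0, num_bins - 1)

targets = torch.tensor([3.2, 57.9, 99.5])
labels = to_bin(targets)                # tensor([0, 5, 9])
logits = torch.randn(3, num_bins)       # model output with num_classes = num_bins
loss = F.cross_entropy(logits, labels)  # standard cross entropy on the bins
```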
Thank you for the quick response. So let's say that I am hoping to use the pre-trained TimeSformer model for regression instead of classification, for example using negative Pearson loss, with each frame of the video having a unique numeric label/ground truth. So essentially the training data would be a 60-second video broken into frames with corresponding values/labels for each frame. In this case we will only have one-dimensional regression, am I right?
I think that TimeSformer expects a fat tensor of the type:
frames = torch.randn(2, 5, 3, 256, 256) # (batch x frames x channels x height x width)
So you have to construct a dataloader that generates this. When I used these models, I trained from scratch, so I was not carefully checking what input the model expects; I used the model as an architecture.
For training, construct a dataloader that, for each batch of videos, gives you a batch of values. Think about how you label these snippets of video (you will have to subsample or reduce the input size, as the model cannot ingest inputs that are too long). I was training on 10 frames of video that came from a camera with one image per minute, so a 10-minute sequence, and estimating the average movement speed. So I predicted one value for this 10-frame tensor (bs, 10, 128, 128).
I hope that clarifies the strategy to follow.
Another quick tip: you can create a super simple dataloader by stacking the full video together and then just slicing it randomly; here you have an example.
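For illustration, an untested sketch of that stacking-and-slicing idea (all names and shapes are placeholders):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class VideoSliceDataset(Dataset):
    # one long video stacked as (total_frames, C, H, W), with one target value
    # per frame; each item is a random fixed-length clip plus its targets
    def __init__(self, video, signal, clip_len=32, samples_per_epoch=1000):
        self.video, self.signal = video, signal
        self.clip_len, self.samples = clip_len, samples_per_epoch

    def __len__(self):
        return self.samples

    def __getitem__(self, _):
        start = torch.randint(0, len(self.video) - self.clip_len + 1, (1,)).item()
        clip = self.video[start : start + self.clip_len]     # (clip_len, C, H, W)
        target = self.signal[start : start + self.clip_len]  # (clip_len,)
        return clip, target

video = torch.randn(600, 3, 128, 128)  # e.g. a full video stacked into one tensor
signal = torch.randn(600)              # per-frame ground truth
loader = DataLoader(VideoSliceDataset(video, signal), batch_size=4)
frames, targets = next(iter(loader))   # frames: (4, 32, 3, 128, 128)
```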
Thank you so much for the quick and detailed response, and I am sorry for asking so many questions; I am new to the whole video transformer domain. I just have a follow-up question. My dataloader looks something like this, containing video frames and the pulse signal corresponding to them. The frames are put in a 4D tensor of size [c x d x w x h]:
train_loader = torch.utils.data.DataLoader(
    pulse,  # dataset whose items are (frames, labels)
    batch_size=args.batch_size,
    shuffle=False,
    num_workers=args.workers,
    pin_memory=True,
    sampler=sampler)
Hope this clarifies my idea.
@tcapelle Hi, thanks for all the help regarding the dataloader. I am sorry to bother you yet again, but I was having some trouble understanding where this issue arises from and why, as the only thing I changed is the dataloaders.
I have pinpointed where the issue is: it seems my train dataloader does not provide the _, meta values in for cur_iter, (inputs, labels, _, meta) in enumerate(train_loader). I don't understand how to resolve this, though, as I am not using their dataloaders. The dataloader I am using works as follows: pulse_3d returns sample = (frames, labels).
Sorry, I can't help you with this. Maybe ask on the PyTorch forums?
I will try asking there, but I don't think it's a PyTorch issue, is it? I believe it comes from the dataloader; apparently the dataloader should yield inputs, labels, _, meta, as seen in the following snippet from train_net.py (TimeSformer).
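One workaround I am considering (an untested sketch, where pulse_3d is my own dataset) is wrapping the dataset so each sample is padded with dummy fields to match that unpacking:

```python
from torch.utils.data import Dataset

class TimeSformerCompatible(Dataset):
    # wraps a dataset that yields (frames, labels) so each item matches the
    # (inputs, labels, _, meta) unpacking in train_net.py
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        frames, labels = self.base[idx]
        return frames, labels, idx, {}  # dummy index and empty meta dict
```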
Sorry, I don't know.
Thank you for all the help. Just a tiny follow-up: for TimeSformer, did you use the code provided by Facebook, or did you manage to find some other script?
I used @lucidrains' implementation.
But @lucidrains' implementation doesn't include training code, does it?
Hi @tcapelle, using TimeSformer (orange line) for regression compared to a 3D CNN (pink line), my results are quite weird. I am attaching a screenshot of the loss (MSE) vs. epoch graph for training and validation. Note: each video is broken into chunks of 32 consecutive frames, each with their corresponding ground-truth values. The model predicts 1 value per frame fed, so for 32 frames it outputs 32 values.