Transformers-Tutorials
Need help understanding hidden_states of computer vision models
I am having trouble interpreting the hidden_states and last_hidden_state outputs of transformer models for computer vision:
which layer's output is the last_hidden_state? For example, for Swin-Tiny, hidden_states is a tuple of 5 tensors with sizes 3136x96, 784x192, 196x384, 49x768, and 49x768 respectively. I inspected them, but I was not able to match any of them to last_hidden_state. I ran into the same problem with ViT models. Please help me understand these embeddings from the model output class, especially for computer vision transformers, as I am trying to get some interpretability from the model outputs. The index of the optional initial embedding output is also confusing.
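For reference, here is a minimal sketch of how I am inspecting the outputs. It assumes a randomly initialised SwinModel with the default SwinConfig (which matches Swin-Tiny: embed_dim 96, depths [2, 2, 6, 2]), so no checkpoint download is needed just to look at shapes:

```python
import torch
from transformers import SwinConfig, SwinModel

# Default SwinConfig corresponds to Swin-Tiny (assumption: random weights are
# fine here, since we only want to inspect output shapes and indexing).
config = SwinConfig()
model = SwinModel(config)
model.eval()

pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

# hidden_states[0] is the patch-embedding output (the "optional initial
# embedding"); hidden_states[1..4] are the outputs of the four Swin stages.
for i, h in enumerate(outputs.hidden_states):
    print(i, tuple(h.shape))

# last_hidden_state has the same shape as hidden_states[-1]; in SwinModel the
# final stage output is passed through a last LayerNorm, so the values differ.
print(tuple(outputs.last_hidden_state.shape))
```

From this it looks like last_hidden_state is hidden_states[-1] after the model's final LayerNorm rather than a separate entry in the tuple, but I would appreciate confirmation.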
Thanks in advance.