Image captioning clarification and pretrained model
Hello, thank you for your work! I have a few questions about it.
- The BLIP-2 model is used to create captions of images that serve as prompts for the LMTraj-SUP model. As far as I understand, the image captioning is performed on a single reference image for each scenario. Is this correct? If so, what is the role of the `*_oracle.png` and `*_bg.png` files? Also, I don't see the caption sentence in the preprocessed test data I downloaded. Why?
- In the paper (Section 4.3, Model Size) you state that you tried three different model sizes and that the smallest one achieves real-time performance. Which size are the released pretrained models?
- As far as I understand, the homography matrices stored under `datasets/homography/` are used to map pixel coordinates to world coordinates. However, when checking the original homography matrix (downloaded here), for example for biwi eth, I noticed that the first two columns are swapped. Why?
# original matrix
2.8128700e-02 2.0091900e-03 -4.6693600e+00
8.0625700e-04 2.5195500e-02 -5.0608800e+00
3.4555400e-04 9.2512200e-05 4.6255300e-01
# your matrix
2.0091900e-03 2.8128700e-02 -4.6693600e+00
2.5195500e-02 8.0625700e-04 -5.0608800e+00
9.2512200e-05 3.4555400e-04 4.6255300e-01
- The released pretrained models are:
exp3-ct-eth-pixel-multimodal-best
exp3-ct-univ-pixel-multimodal-best
exp3-ct-zara2-pixel-multimodal-best
exp3-ct-hotel-pixel-multimodal-best
exp3-ct-zara1-pixel-multimodal-best
Are these intended to predict only with observed trajectories expressed in pixels? Do we need a different model to predict directly in the meter coordinate system? If so, are those models available?
- I want to test the released pretrained models on a small custom test set from a new scenario. Do I need anything more than the homography and the data in the format `<frame_id> <ped_id> <x_meter> <y_meter>`?
Hi @vittoriacav, Thank you for your interest in my paper!
- (1) You are correct. We used a single reference image, similar to Y-Net. Since the camera remained fixed in the dataset, BLIP-2 consistently produced almost the same result across sequential video frames, with only occasional minor word changes. Note that, unlike BLIP-2, ChatGPT tends to generate more varying details from run to run. (2) The `*_bg.png` file represents the temporal median value across all video frames and was used for visualization during development. It is deprecated, so you can delete it. The `*_oracle.png` file was taken from Y-Net and is used to post-process the predicted paths. (3) That's strange. I double-checked by downloading the dataset from the dataset zoo, and the caption sentences were included at the end of the observation prompt. Could you please check it again?
# datasets/preprocessed/eth-test-8-12-pixel.json
{
"id": 0,
...
"observation": "question: What trajectory does pedestrian 0 follow for the next 12 frames? context: Pedestrian 0 moved along the trajectory [(84, 92), (85, 87), (85, 82), (86, 77), (87, 73), (87, 69), (87, 65), (88, 62)] for 8 frames. Pedestrian 1 moved along the trajectory [(88, 106), (89, 102), (89, 96), (89, 91), (89, 86), (88, 81), (88, 76), (88, 71)] for 8 frames. a view of people walking on the snow covered ground. answer:",
"forecast": "Pedestrian 0 will move along the trajectory [(89, 60), (91, 58), (89, 56), (92, 55), (92, 52), (93, 48), (93, 44), (91, 41), (89, 37), (87, 34), (83, 31), (81, 28)] for the next 12 frames."
}
...
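For reference, here is a minimal sketch of how such a scene caption can be produced with BLIP-2 through the Hugging Face transformers API. The checkpoint, the image file name, and the generation length below are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative checkpoint and file name; adjust to your own setup.
model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id)

# One fixed reference frame per scene, since the camera does not move.
image = Image.open("reference_frame.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)  # appended to the end of the observation prompt, as in the JSON above
```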
- The released pretrained models and the metrics reported in the main table of the paper are all based on the smallest model (170 MB).
- The first two columns are swapped because of differences in how the ETH and UCY datasets handle image coordinate systems. For ETH, the homography matrix is applied to [W, H] coordinates, while for UCY, it's applied to [H, W]. While Y-Net handled this with exceptions in the code, I used a simple approach: swapping the first two columns of the homography matrix.
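To make the equivalence concrete, here is a small NumPy sketch (the matrix values are the biwi eth ones quoted above; variable and function names are only illustrative):

```python
import numpy as np

# Original biwi eth homography, applied to pixels in [W, H] order.
H_eth = np.array([[2.81287e-02, 2.00919e-03, -4.66936e+00],
                  [8.06257e-04, 2.51955e-02, -5.06088e+00],
                  [3.45554e-04, 9.25122e-05,  4.62553e-01]])

# Swapping the first two columns lets the same matrix be applied to pixels
# in [H, W] order, matching the UCY convention used in the repository.
H_repo = H_eth[:, [1, 0, 2]]

def to_world(H, px):
    """Project a homogeneous pixel coordinate to metric world coordinates."""
    w = H @ np.array([px[0], px[1], 1.0])
    return w[:2] / w[2]

# Both conventions map to the same world point.
assert np.allclose(to_world(H_eth, (100, 200)), to_world(H_repo, (200, 100)))
```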
- The released models and config files are optimized for predictions in pixel coordinates. Training directly in meter coordinates can slightly degrade performance, likely due to the transposition issue in ETH and the difficulty of maintaining a one-to-one match between positions in meter coordinates and the image descriptions. Due to their large size, I don't currently have pretrained models for the meter coordinate system. Note that you might normalize the trajectory by subtracting the last observed coordinate to improve generality in the meter coordinate system.
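As an illustration of that normalization trick (a sketch, not code from the repository):

```python
import numpy as np

def normalize_by_last_obs(obs_traj, pred_traj=None):
    """Shift trajectories so the last observed position becomes the origin.
    obs_traj: (T_obs, 2) array in meters; pred_traj: optional (T_pred, 2) array.
    Returns the shifted arrays and the origin needed to undo the shift."""
    origin = obs_traj[-1].copy()
    obs_norm = obs_traj - origin
    pred_norm = None if pred_traj is None else pred_traj - origin
    return obs_norm, pred_norm, origin

# Predictions made in the normalized frame are mapped back by adding `origin`.
```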
- Strictly speaking, scene captions are needed for optimal performance, but in simple cases there isn't a significant difference without them. If you don't have the `oracle.png` walkable-area segmentation map, you can fill all pixels with 1 (0: blocked, 1: walkable).
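For a custom scene with no segmentation map, an all-walkable placeholder can be generated like this (the frame size and whether the loader expects 0/1 values are assumptions, so double-check against your own data):

```python
import numpy as np
from PIL import Image

# All-walkable placeholder oracle map (0: blocked, 1: walkable), sized to
# match the scene's reference image (720x576 is the ETH/UCY frame size).
width, height = 720, 576
oracle = np.ones((height, width), dtype=np.uint8)
Image.fromarray(oracle).save("custom_oracle.png")
```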
Dear @InhwanBae, thank you for your detailed answers. I can confirm that the downloaded dataset includes the caption in the preprocessed test data as well; it was my mistake.
I tried testing the pretrained models on a small custom sequence; however, the performance is quite poor:
# eth model
ADE: 9.347091674804688
FDE: 8.883607864379883
# hotel model
ADE: 8.766213417053223
FDE: 8.24624252319336
# univ model
ADE: 8.766213417053223
FDE: 8.24624252319336
# zara1 model
ADE: 8.59603214263916
FDE: 8.084837913513184
# zara2 model
ADE: 9.095739364624023
FDE: 8.382474899291992
I also tried plotting the past, ground-truth, and predicted trajectories, and I noticed the predicted trajectory often starts far away from the last position of the observed trajectory. Here is an example with the eth model:
In this folder you can find everything you need to replicate this custom dataset test (I'll give you access upon request). You'll find:
- the raw dataset under `/custom/test/custom.txt`
- the preprocessed data under the `preprocessed` folder
- the caption obtained with the `Salesforce/blip2-opt-2.7b` model (6.7b was too big to run inference on my laptop) in the `custom_caption.txt` file
- the homography matrix and oracle map at `custom_H.txt` and `custom_oracle.png`
- finally, in the `plots` folder you can visually see the behavior of the eth model for each step of the sequence
Do you have any explanation for this behavior or any suggestions to improve the performance?
Hi @vittoriacav,
This issue commonly occurs when the scale-down factor used in the data preprocessor differs from the one used in the evaluator. It can also happen if the homography is incorrect, so it's good to check that all datasets have been transformed into the pixel coordinate system and that the coordinates fit within the image size (720×576).
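As a quick sanity check along those lines, something like the sketch below can be used. The file names follow the custom dataset described above, and whether the resulting pixel order comes out as [x, y] or [y, x] depends on the column-swap convention discussed earlier, so treat this as an assumption-laden illustration rather than the repository's own evaluation code.

```python
import numpy as np

# Load the custom scene: metric trajectories and the pixel -> world homography.
pts = np.loadtxt("custom/test/custom.txt")   # columns: <frame_id> <ped_id> <x_meter> <y_meter>
H = np.loadtxt("custom_H.txt")               # pixel -> world homography

# Map metric positions back to pixels with the inverse homography.
homog = np.c_[pts[:, 2:4], np.ones(len(pts))]   # homogeneous world coordinates
pix = (np.linalg.inv(H) @ homog.T).T
pix = pix[:, :2] / pix[:, 2:3]

# All points should fall inside the reference image used by the released models.
width, height = 720, 576
inside = (pix[:, 0] >= 0) & (pix[:, 0] < width) & (pix[:, 1] >= 0) & (pix[:, 1] < height)
print(f"{inside.mean():.1%} of the transformed points fall inside the {width}x{height} frame")
```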
I'm closing this issue for now. Feel free to open another issue if you have any further questions!