Image captioning clarification and pretrained model
Hello, thank you for your work! I have a few questions about it.
- The BLIP-2 model is used to create captions of images that serve as prompts for the LMTraj-SUP model. As far as I understand, the image captioning is performed on a single reference image for each scenario. Is this correct? If so, what is the role of the `*_oracle.png` and `*_bg.png` files? Also, I don't see the caption sentence in the preprocessed test data I downloaded. Why?
- In the paper (Section 4.3, Model Size) you state that you tried three different model sizes and that the smallest one achieves real-time performance. Which size are the released pretrained models?
- As far as I understand, the homography matrices stored under `datasets/homography/` are used to map pixel coordinates to world coordinates. However, when checking the original homography matrix (downloaded here), for example for biwi eth, I noticed that the first two columns are swapped. Why?
# original matrix
2.8128700e-02 2.0091900e-03 -4.6693600e+00
8.0625700e-04 2.5195500e-02 -5.0608800e+00
3.4555400e-04 9.2512200e-05 4.6255300e-01
# your matrix
2.0091900e-03 2.8128700e-02 -4.6693600e+00
2.5195500e-02 8.0625700e-04 -5.0608800e+00
9.2512200e-05 3.4555400e-04 4.6255300e-01
- The released pretrained models are:
exp3-ct-eth-pixel-multimodal-best
exp3-ct-univ-pixel-multimodal-best
exp3-ct-zara2-pixel-multimodal-best
exp3-ct-hotel-pixel-multimodal-best
exp3-ct-zara1-pixel-multimodal-best
Are these intended to predict only with observed trajectories expressed in pixels? Do we need a different model to predict directly in the meter coordinate system? If so, are those models available?
- I want to test the released pretrained models on a small custom test set from a new scenario. Do I need anything more than the homography and the data in the format `<frame_id> <ped_id> <x_meter> <y_meter>`?
Hi @vittoriacav, Thank you for your interest in my paper!
- (1) You are correct. We used a single reference image, similar to Y-Net. Since the camera remained fixed in the dataset, BLIP-2 consistently produced almost the same result across sequential video frames, with only occasional minor word changes. Note that, unlike BLIP-2, ChatGPT tends to generate more varying details from run to run. (2) The `*_bg.png` file represents the temporal median value across all video frames and was used for visualization during development. It is deprecated, so you can delete it. The `*_oracle.png` file was taken from Y-Net and is used to post-process the predicted paths. (3) That's strange. I double-checked by downloading the dataset from the dataset zoo, and the caption sentences were included at the end of the observation prompt. Could you please check it again?
# datasets/preprocessed/eth-test-8-12-pixel.json
{
"id": 0,
...
"observation": "question: What trajectory does pedestrian 0 follow for the next 12 frames? context: Pedestrian 0 moved along the trajectory [(84, 92), (85, 87), (85, 82), (86, 77), (87, 73), (87, 69), (87, 65), (88, 62)] for 8 frames. Pedestrian 1 moved along the trajectory [(88, 106), (89, 102), (89, 96), (89, 91), (89, 86), (88, 81), (88, 76), (88, 71)] for 8 frames. a view of people walking on the snow covered ground. answer:",
"forecast": "Pedestrian 0 will move along the trajectory [(89, 60), (91, 58), (89, 56), (92, 55), (92, 52), (93, 48), (93, 44), (91, 41), (89, 37), (87, 34), (83, 31), (81, 28)] for the next 12 frames."
}
...
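For reference, here is a minimal sketch of how such a scene caption can be produced with BLIP-2 through the Hugging Face transformers API. The checkpoint, the image file name, and the generation length below are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative checkpoint and file name; adjust to your own setup.
model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id)

# One fixed reference frame per scene, since the camera does not move.
image = Image.open("reference_frame.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)  # appended to the end of the observation prompt, as in the JSON above
```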
- The released pretrained models and the metrics reported in the main table of the paper are all based on the smallest model (170 MB).
- The first two columns are swapped because of differences in how the ETH and UCY datasets handle image coordinate systems. For ETH, the homography matrix is applied to [W, H] coordinates, while for UCY, it's applied to [H, W]. While Y-Net handled this with exceptions in the code, I used a simple approach: swapping the first two columns of the homography matrix.
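To make the equivalence concrete, here is a small NumPy sketch (the matrix values are the biwi eth ones quoted above; variable and function names are only illustrative):

```python
import numpy as np

# Original biwi eth homography, applied to pixels in [W, H] order.
H_eth = np.array([[2.81287e-02, 2.00919e-03, -4.66936e+00],
                  [8.06257e-04, 2.51955e-02, -5.06088e+00],
                  [3.45554e-04, 9.25122e-05,  4.62553e-01]])

# Swapping the first two columns lets the same matrix be applied to pixels
# in [H, W] order, matching the UCY convention used in the repository.
H_repo = H_eth[:, [1, 0, 2]]

def to_world(H, px):
    """Project a homogeneous pixel coordinate to metric world coordinates."""
    w = H @ np.array([px[0], px[1], 1.0])
    return w[:2] / w[2]

# Both conventions map to the same world point.
assert np.allclose(to_world(H_eth, (100, 200)), to_world(H_repo, (200, 100)))
```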
- The released models and config files are optimized for predictions in pixel coordinates. Training directly in meter coordinates can slightly degrade performance, likely due to the transposition issue in ETH and the difficulty of maintaining a one-to-one match between positions in meter coordinates and the image descriptions. Due to their large size, I don't currently have pretrained models for the meter coordinate system. Note that you might normalize the trajectory by subtracting the last observed coordinate to improve generality in the meter coordinate system.
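As an illustration of that normalization trick (a sketch, not code from the repository):

```python
import numpy as np

def normalize_by_last_obs(obs_traj, pred_traj=None):
    """Shift trajectories so the last observed position becomes the origin.
    obs_traj: (T_obs, 2) array in meters; pred_traj: optional (T_pred, 2) array.
    Returns the shifted arrays and the origin needed to undo the shift."""
    origin = obs_traj[-1].copy()
    obs_norm = obs_traj - origin
    pred_norm = None if pred_traj is None else pred_traj - origin
    return obs_norm, pred_norm, origin

# Predictions made in the normalized frame are mapped back by adding `origin`.
```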
- Strictly speaking, scene captions are needed for optimal performance, but in simple cases there isn't a significant difference without them. If you don't have the `oracle.png` walkable-area segmentation map, you can fill all pixels with 1 (0: blocked, 1: walkable).
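For a custom scene with no segmentation map, an all-walkable placeholder can be generated like this (the frame size and whether the loader expects 0/1 values are assumptions, so double-check against your own data):

```python
import numpy as np
from PIL import Image

# All-walkable placeholder oracle map (0: blocked, 1: walkable), sized to
# match the scene's reference image (720x576 is the ETH/UCY frame size).
width, height = 720, 576
oracle = np.ones((height, width), dtype=np.uint8)
Image.fromarray(oracle).save("custom_oracle.png")
```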
Dear @InhwanBae, thank you for your detailed answers. I can confirm that the downloaded dataset includes the caption in the preprocessed test data as well; it was my mistake.
I tried testing the pretrained models on a small custom sequence; however, the performance is quite poor:
# eth model
ADE: 9.347091674804688
FDE: 8.883607864379883
# hotel model
ADE: 8.766213417053223
FDE: 8.24624252319336
# univ model
ADE: 8.766213417053223
FDE: 8.24624252319336
# zara1 model
ADE: 8.59603214263916
FDE: 8.084837913513184
# zara2 model
ADE: 9.095739364624023
FDE: 8.382474899291992
I also tried plotting the past, ground-truth, and predicted trajectories, and I noticed the predicted trajectory often starts far away from the last position of the observed trajectory. Here is an example with the eth model:
In this folder you can find everything you need to replicate this custom dataset test (I'll give you access upon request). You'll find:
- the raw dataset under `/custom/test/custom.txt`
- the preprocessed data under the `preprocessed` folder
- the caption obtained with the `Salesforce/blip2-opt-2.7b` model (6.7b was too big to run inference on my laptop) in the `custom_caption.txt` file
- the homography matrix and oracle map at `custom_H.txt` and `custom_oracle.png`
- finally, in the `plots` folder you can visually see the behavior of the eth model for each step of the sequence
Do you have any explanation for this behavior or any suggestions to improve the performance?
Hi @vittoriacav,
This issue commonly occurs when the scale-down factor used in the data preprocessor differs from the one used in the evaluator. It can also happen if the homography is incorrect, so it's good to check that all datasets have been transformed into the pixel coordinate system and that the coordinates fit within the image size (720×576).
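As a quick sanity check along those lines, something like the sketch below can be used. The file names follow the custom dataset described above, and whether the resulting pixel order comes out as [x, y] or [y, x] depends on the column-swap convention discussed earlier, so treat this as an assumption-laden illustration rather than the repository's own evaluation code.

```python
import numpy as np

# Load the custom scene: metric trajectories and the pixel -> world homography.
pts = np.loadtxt("custom/test/custom.txt")   # columns: <frame_id> <ped_id> <x_meter> <y_meter>
H = np.loadtxt("custom_H.txt")               # pixel -> world homography

# Map metric positions back to pixels with the inverse homography.
homog = np.c_[pts[:, 2:4], np.ones(len(pts))]   # homogeneous world coordinates
pix = (np.linalg.inv(H) @ homog.T).T
pix = pix[:, :2] / pix[:, 2:3]

# All points should fall inside the reference image used by the released models.
width, height = 720, 576
inside = (pix[:, 0] >= 0) & (pix[:, 0] < width) & (pix[:, 1] >= 0) & (pix[:, 1] < height)
print(f"{inside.mean():.1%} of the transformed points fall inside the {width}x{height} frame")
```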
I'm closing this issue for now. Feel free to open another issue if you have any further questions!