ijepa
Some questions about context and target
Hi authors, this is amazing work; the idea is new and the results are impressive.
When I read the paper, I was confused about how you get the context and the targets.
(1) In the paper you mention that the image first goes through a ViT to get a sequence of patch-level features, and that you randomly sample M patch features. Up to here I follow, but then you describe sampling the blocks with a random aspect ratio in the range (0.75, 1.5) and a random scale in the range (0.15, 0.2). My understanding is that you first use this aspect ratio and scale to pick a mask, and then use that mask to select the features of the patches inside it. Is that right?
(2) In the Context section you refer to a block. My understanding is that a block should be rectangular, but in Figure 4 it does not appear to be. You also mention: "Since the target blocks are sampled independently from the context block, there may be significant overlap." Why are the target blocks sampled from the context block? Aren't they sampled from the patch-level representations of the original image?
Thanks for answering my questions!
Hi @lezhang7
Targets:
- The image first goes through a ViT to get a sequence of patch-level features.
- Next, we sample M=4 blocks (with the aspect ratio and scale that you mentioned), so you end up with M=4 sets of patch-level features (one set for each block).
Context:
- Sample a large block (rectangular).
- Remove the patches that overlap with the target blocks (so the context is no longer rectangular), and process only the remaining patches with the context encoder.
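
The two procedures above can be sketched at the level of patch indices. This is only an illustrative sketch, not the repository's implementation: the function names `sample_block` and `sample_masks`, the 14x14 patch grid, and the context scale range (0.85, 1.0) with unit aspect ratio are assumptions made for the example; only M=4, the target aspect-ratio range (0.75, 1.5), and the target scale range (0.15, 0.2) come from this thread. The sketch covers which patches are selected, not the encoders themselves.

```python
import math
import random


def sample_block(grid_h, grid_w, scale_range, aspect_range, rng):
    """Sample one rectangular block; return it as a set of flat patch indices."""
    scale = rng.uniform(*scale_range)    # fraction of all patches the block covers
    aspect = rng.uniform(*aspect_range)  # block height divided by block width
    area = scale * grid_h * grid_w
    # Solve h * w = area and h / w = aspect, then clamp to the grid.
    h = max(1, min(grid_h, round(math.sqrt(area * aspect))))
    w = max(1, min(grid_w, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid_h - h)     # randint is inclusive on both ends
    left = rng.randint(0, grid_w - w)
    return {r * grid_w + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(grid_h=14, grid_w=14, num_targets=4, seed=0):
    """Return (target blocks, rectangular context block, final context patches)."""
    rng = random.Random(seed)
    # Targets: M rectangular blocks, each sampled independently.
    targets = [
        sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    # Context: one large rectangular block (scale range assumed for this sketch).
    context_block = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0), rng)
    # Drop every patch that overlaps a target; the result is usually not a rectangle.
    context = context_block - set().union(*targets)
    return targets, context_block, context


if __name__ == "__main__":
    targets, context_block, context = sample_masks()
    print("target block sizes:", [len(t) for t in targets])
    print("context block size:", len(context_block))
    print("context patches kept:", len(context))
```

In this sketch each target index set would be used to select features from the target encoder's output, while the `context` index set would select the input patches given to the context encoder.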
Does this clarify things?