About the CIR task
Hi,
In the paper, for the CIR task, an empty image and the target description are encoded to produce a target-item embedding, which is then concatenated with part of the outfit. In your code, however, I see that the design is to encode the target and the outfit together. I think your design is very clever, but I want to know why it is task_emb = torch.cat([self.task_emb, self.embed_emb], dim=-1). Is it to keep the shape consistent with task_emb = torch.cat([self.task_emb, self.predict_emb], dim=-1)?
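To make the shape question concrete, here is a minimal sketch of the two concatenations, assuming a hypothetical embedding width of 128 (the real value comes from self.item_enc.d_embed):

```python
import torch
import torch.nn as nn

d_embed = 128  # assumed width; in the repository this is self.item_enc.d_embed

# Half-width learnable pieces, mirroring the repository's parameters
task_emb = nn.Parameter(torch.randn(d_embed // 2) * 0.02)
predict_emb = nn.Parameter(torch.randn(d_embed // 2) * 0.02)
embed_emb = nn.Parameter(torch.randn(d_embed // 2) * 0.02)

# CP task token: shared half + CP-specific half
cp_token = torch.cat([task_emb, predict_emb], dim=-1)
# CIR task token: shared half + CIR-specific half
cir_token = torch.cat([task_emb, embed_emb], dim=-1)

# Both tokens have the same full width, so the transformer sees an
# identically shaped input regardless of the task.
print(cp_token.shape == cir_token.shape)  # True
```

Either way the task token is d_embed wide, so the rest of the model does not need to know which task produced it.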
Look forward to your reply.
In addition to what is mentioned above, I also want to know why the outfit token in the paper is designed as the concatenation of two embeddings. Is it to share some global features for CIR processing?
In previous studies, the embedding of the image representing "?" was concatenated with the query text embedding. So the reason you pointed out, namely maintaining shape consistency with task_emb = torch.cat([self.task_emb, self.predict_emb], dim=-1), is indeed correct.
However, since the embedding of the “?” image is a fixed value, we believed it might not be optimal. To address this, we replaced it with a learnable parameter that can be trained during the learning process.
This approach is actually aligned with common practices in other Transformer-based models, such as BERT, where learnable tokens are often used in a similar manner.
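The difference between the fixed "?" embedding and the learnable replacement can be sketched as follows (the widths and the zero vector standing in for the encoded blank image are assumptions for illustration):

```python
import torch
import torch.nn as nn

d = 64  # hypothetical width of the image-embedding half

# Fixed approach from prior work: the "?" image is encoded once and the
# resulting vector never changes during training (zeros as a stand-in here).
blank_image_emb = torch.zeros(d)

# Learnable alternative: a parameter the optimizer updates, in the same
# spirit as BERT's [CLS]/[MASK] tokens.
learned_emb = nn.Parameter(torch.randn(d) * 0.02)

query_text_emb = torch.randn(d)  # stand-in for the encoded target description

fixed_query = torch.cat([blank_image_emb, query_text_emb], dim=-1)
learned_query = torch.cat([learned_emb, query_text_emb], dim=-1)

# Same shape either way; only the learnable version receives gradients.
print(fixed_query.shape == learned_query.shape, learned_emb.requires_grad)
```

The swap costs nothing in terms of interface, since both query tokens have the same width; the only change is that the image half can now adapt during training.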
I have checked all the issues you sent me earlier; thank you for your hard work. I will revise them after the school exam period is over.
Thanks for your reply. I have only just started with machine learning and don't know much about these common practices. My confusion is this: for the CIR problem, the target item token mentioned in the paper is composed of a blank image and the encoded target description, and you mentioned that a learnable variable can replace the blank image. My understanding is therefore that there are three learnable variables in the entire framework: the outfit token for CP, the learnable variable that replaces the blank image and is concatenated with the encoded text description (i.e., the target item token), and a learnable pad. But your implementation is:
self.task_emb = nn.Parameter(
    torch.randn(self.item_enc.d_embed // 2) * 0.02, requires_grad=True
)
self.predict_emb = nn.Parameter(
    torch.randn(self.item_enc.d_embed // 2) * 0.02, requires_grad=True
)
self.embed_emb = nn.Parameter(
    torch.randn(self.item_enc.d_embed // 2) * 0.02, requires_grad=True
)
self.pad_emb = nn.Parameter(
    torch.randn(self.item_enc.d_embed) * 0.02, requires_grad=True
)
This puzzles me a little. My understanding is that through the shared variable task_emb, the model can capture some global features so that CIR can draw on the experience learned from the CP task; predict_emb lets the model distinguish between the CP and CIR tasks; and embed_emb is there to keep the shape consistent. In addition, in PolyvoreTripletDataset the description (category) of the target item is placed in FashionComplementaryQuery, but it is not used in the model (outfit_transformer.py, fn: embed_query).
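Putting the four parameters together, a minimal sketch of how the full input sequence could be assembled (widths, item counts, and the random item embeddings are all assumptions; this is not the repository's exact forward pass):

```python
import torch
import torch.nn as nn

d = 128       # assumed d_embed
n_items = 4   # real items in the outfit
max_len = 8   # fixed sequence length the model pads to

# The four parameters from the snippet above (dimensions as in the code)
task_emb = nn.Parameter(torch.randn(d // 2) * 0.02)
predict_emb = nn.Parameter(torch.randn(d // 2) * 0.02)
embed_emb = nn.Parameter(torch.randn(d // 2) * 0.02)
pad_emb = nn.Parameter(torch.randn(d) * 0.02)  # full width: it stands in for whole items

item_embs = torch.randn(n_items, d)  # stand-in for encoded outfit items

# CP input: one task token, the real items, then learnable padding
cp_token = torch.cat([task_emb, predict_emb], dim=-1).unsqueeze(0)
padding = pad_emb.unsqueeze(0).expand(max_len - n_items, d)
cp_input = torch.cat([cp_token, item_embs, padding], dim=0)

# Swapping predict_emb for embed_emb yields the CIR input with identical shape
cir_token = torch.cat([task_emb, embed_emb], dim=-1).unsqueeze(0)
cir_input = torch.cat([cir_token, item_embs, padding], dim=0)

print(cp_input.shape == cir_input.shape)  # True
```

Note that pad_emb is full width because it replaces an entire missing item, while the other three are half width because two of them are always concatenated to form one task token.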
Hi, sorry for the late reply. I’ve been busy with midterms, so my response is delayed.
Since my code was written quite a while ago, I was a bit confused myself, but here are my answers:
| Q1: My understanding is that through the shared variable task_emb, the model can capture some global features so that CIR can draw on the experience left by the CP task.
Answer: Yes, that’s correct. However, there are a few differences from the original paper.
- The task embedding helps the model understand which part of the input is related to which task.
- The predict_emb and embed_emb variables are used to allow the model to recognize each specific task: CP (Compatibility Prediction) and CIR (Complementary Item Retrieval).
| Q2: In PolyvoreTripletDataset, the description (category) of the target item is placed in FashionComplementaryQuery, but it is not used in the model (outfit_transformer.py, fn: embed_query).
Answer:
- Initially, I planned to implement the model exactly as described in the original paper, so I included the query for future compatibility and to make it easier for other researchers to build on this work.
- In summary, it's currently not used in my code.
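Since the query field is already threaded through the dataset, one hedged sketch of how the encoded description could eventually be consumed is to project it into the half-width slot that embed_emb currently fills (the projection head, its name, and all widths here are hypothetical, not part of the repository):

```python
import torch
import torch.nn as nn

d = 128       # assumed d_embed
d_text = 256  # assumed output width of the text encoder

task_emb = nn.Parameter(torch.randn(d // 2) * 0.02)

# Hypothetical head: map the encoded category description into the half
# that embed_emb fills today, making the CIR token query-aware.
text_proj = nn.Linear(d_text, d // 2)
desc_emb = torch.randn(d_text)  # stand-in for the encoded category text

cir_token = torch.cat([task_emb, text_proj(desc_emb)], dim=-1)
print(cir_token.shape)  # torch.Size([128])
```

Because the resulting token keeps the same width, the rest of the model would not need to change; only the source of the second half of the CIR token does.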
Thanks for your reply; the following question does not need to be answered in a hurry: I plan to use the description (category). Do you have any good suggestions? Wish you success in your exams.