Architectural Question Regarding Autoregressive Prediction Starting Position in Image-to-SVG Training
Thank you for your excellent work! StarVector has achieved impressive results in the image-to-SVG generation task.
While delving into the code implementation, I came across an architectural question and would like to understand the authors' design considerations:
🤔 Problem Description
In the `embed_im_to_svg` function, the current training flow is:
```python
inputs_embeds = torch.cat([conditioning_embeds, svg_tokens_embeds], dim=1)  # image prefix followed by SVG token embeddings
targets = torch.cat([empty_targets, svg_targets], dim=1)                    # labels aligned with inputs_embeds
```
where `empty_targets` is filled entirely with `-100` (the ignore index used by the loss), so the resulting layout is:
```
Input:  [Image_embed1, Image_embed2, ..., Image_embedN, SVG_token1, SVG_token2, ...]
Target: [        -100,         -100, ...,         -100, SVG_token1, SVG_token2, ...]
```
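For context, here is a minimal sketch of how such a layout could be assembled. The shapes and values are toy placeholders of my own; only the variable names mirror the snippet above, and this is not the actual StarVector code:

```python
import torch

B, N, M, D, V = 2, 4, 6, 768, 1000          # toy sizes: batch, image positions, SVG tokens, hidden dim, vocab
conditioning_embeds = torch.randn(B, N, D)  # stand-in for the image embeddings
svg_tokens_embeds = torch.randn(B, M, D)    # stand-in for the embedded SVG tokens
svg_targets = torch.randint(0, V, (B, M))   # SVG token ids used as labels

empty_targets = torch.full((B, N), -100)    # image span: ignored by the loss
inputs_embeds = torch.cat([conditioning_embeds, svg_tokens_embeds], dim=1)  # (B, N+M, D)
targets = torch.cat([empty_targets, svg_targets], dim=1)                    # (B, N+M)
```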
🚨 Potential Issue
Although the loss for the image part is masked out, during autoregressive training, the model still performs the following:
- Position 0: predicts the first image embedding from an empty sequence.
- Position 1: predicts the second image embedding from [Image_embed1].
- ...
- Position N-1: predicts the Nth image embedding from [Image_embed1 ... Image_embedN-1].
- Position N: predicts the first SVG token from [Image_embed1 ... Image_embedN] ← this is where meaningful prediction for SVG begins.

In effect, over the image span the model still spends its forward pass on "predicting the complete image representation from an empty/partial image representation", which doesn't seem to be the intended goal for SVG generation.
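To make that computation pattern concrete, here is a small self-contained sketch of the standard next-token shift plus cross-entropy with `ignore_index=-100`. This is my own assumption of how the loss is wired, with toy tensors rather than the repository code: logits are produced at every position, including the image span, but only the SVG targets are supervised.

```python
import torch
import torch.nn.functional as F

B, N, M, V = 1, 4, 6, 1000                    # toy sizes: N image positions, M SVG tokens
logits = torch.randn(B, N + M, V)             # stand-in for the decoder's per-position logits
targets = torch.cat([
    torch.full((B, N), -100),                 # image span: ignore_index
    torch.randint(0, V, (B, M)),              # SVG token ids
], dim=1)

# Shift so position i predicts target i+1, then mask with ignore_index=-100.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),
    targets[:, 1:].reshape(-1),
    ignore_index=-100,
)
# Gradients flow only through the SVG-token targets; the predictions the model
# makes over the image-embedding span are computed in the forward pass but
# never supervised.
```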
💭 Questions/Thoughts
1. Could this design impact training efficiency and model performance?
2. Have you considered a prefix-style training approach instead (rough sketch below)? For example:
   - treating the image embeddings as a fixed context prefix, and
   - performing autoregressive training only on the SVG part.
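For reference, here is a rough sketch of the kind of prefix-LM attention mask I have in mind. `prefix_lm_attention_mask`, `prefix_len`, and `total_len` are hypothetical names of mine; this is only meant to illustrate the idea, not to prescribe an implementation:

```python
import torch

def prefix_lm_attention_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Positions < prefix_len (the image embeddings) attend bidirectionally
    within the prefix; positions >= prefix_len (the SVG tokens) attend
    causally to the prefix and to earlier SVG tokens.
    """
    mask = torch.tril(torch.ones(total_len, total_len)).bool()  # standard causal mask
    mask[:prefix_len, :prefix_len] = True                       # bidirectional image prefix
    return mask

# Example: 4 image positions followed by 6 SVG tokens.
mask = prefix_lm_attention_mask(prefix_len=4, total_len=10)
```

Combined with the existing `-100` labels, this would keep the loss on the SVG tokens only, while the image embeddings act purely as context rather than as autoregressive prediction targets.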