Architectural Question Regarding Autoregressive Prediction Starting Position in Image-to-SVG Training
Thank you for your excellent work! StarVector has achieved impressive results in the image-to-SVG generation task.
While delving into the code implementation, I came across an architectural question and would like to understand the authors' design considerations:
🤔 Problem Description
In the `embed_im_to_svg` function, the current training flow is:
```python
inputs_embeds = torch.cat([conditioning_embeds, svg_tokens_embeds], dim=1)  # image prefix followed by SVG token embeddings
targets = torch.cat([empty_targets, svg_targets], dim=1)                    # labels aligned with inputs_embeds
```
where `empty_targets` is filled entirely with `-100` (the ignore index used by the loss), so the resulting layout is:
```
Input:  [Image_embed1, Image_embed2, ..., Image_embedN, SVG_token1, SVG_token2, ...]
Target: [        -100,         -100, ...,         -100, SVG_token1, SVG_token2, ...]
```
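For context, here is a minimal sketch of how such a layout could be assembled. The shapes and values are toy placeholders of my own; only the variable names mirror the snippet above, and this is not the actual StarVector code:

```python
import torch

B, N, M, D, V = 2, 4, 6, 768, 1000          # toy sizes: batch, image positions, SVG tokens, hidden dim, vocab
conditioning_embeds = torch.randn(B, N, D)  # stand-in for the image embeddings
svg_tokens_embeds = torch.randn(B, M, D)    # stand-in for the embedded SVG tokens
svg_targets = torch.randint(0, V, (B, M))   # SVG token ids used as labels

empty_targets = torch.full((B, N), -100)    # image span: ignored by the loss
inputs_embeds = torch.cat([conditioning_embeds, svg_tokens_embeds], dim=1)  # (B, N+M, D)
targets = torch.cat([empty_targets, svg_targets], dim=1)                    # (B, N+M)
```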
🚨 Potential Issue
Although the loss for the image part is masked out, during autoregressive training, the model still performs the following:
- Position 0: predicts the first image embedding from an empty sequence.
- Position 1: predicts the second image embedding from [Image_embed1].
- ...
- Position N-1: predicts the Nth image embedding from [Image_embed1 ... Image_embedN-1].
- Position N: predicts the first SVG token from [Image_embed1 ... Image_embedN] ← this is where meaningful prediction for SVG begins.

In effect, over the image span the model still spends its forward pass on "predicting the complete image representation from an empty/partial image representation", which doesn't seem to be the intended goal for SVG generation.
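To make that computation pattern concrete, here is a small self-contained sketch of the standard next-token shift plus cross-entropy with `ignore_index=-100`. This is my own assumption of how the loss is wired, with toy tensors rather than the repository code: logits are produced at every position, including the image span, but only the SVG targets are supervised.

```python
import torch
import torch.nn.functional as F

B, N, M, V = 1, 4, 6, 1000                    # toy sizes: N image positions, M SVG tokens
logits = torch.randn(B, N + M, V)             # stand-in for the decoder's per-position logits
targets = torch.cat([
    torch.full((B, N), -100),                 # image span: ignore_index
    torch.randint(0, V, (B, M)),              # SVG token ids
], dim=1)

# Shift so position i predicts target i+1, then mask with ignore_index=-100.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),
    targets[:, 1:].reshape(-1),
    ignore_index=-100,
)
# Gradients flow only through the SVG-token targets; the predictions the model
# makes over the image-embedding span are computed in the forward pass but
# never supervised.
```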
💭 Questions/Thoughts
1. Could this design impact training efficiency and model performance?
2. Have you considered a prefix-style training approach instead (rough sketch below)? For example:
   - treating the image embeddings as a fixed context prefix, and
   - performing autoregressive training only on the SVG part.
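For reference, here is a rough sketch of the kind of prefix-LM attention mask I have in mind. `prefix_lm_attention_mask`, `prefix_len`, and `total_len` are hypothetical names of mine; this is only meant to illustrate the idea, not to prescribe an implementation:

```python
import torch

def prefix_lm_attention_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Positions < prefix_len (the image embeddings) attend bidirectionally
    within the prefix; positions >= prefix_len (the SVG tokens) attend
    causally to the prefix and to earlier SVG tokens.
    """
    mask = torch.tril(torch.ones(total_len, total_len)).bool()  # standard causal mask
    mask[:prefix_len, :prefix_len] = True                       # bidirectional image prefix
    return mask

# Example: 4 image positions followed by 6 SVG tokens.
mask = prefix_lm_attention_mask(prefix_len=4, total_len=10)
```

Combined with the existing `-100` labels, this would keep the loss on the SVG tokens only, while the image embeddings act purely as context rather than as autoregressive prediction targets.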