
environment issue

Open · ParkSungHin opened this issue 1 year ago · 3 comments

I'm using a 3090 GPU, and everything else installed fine except for the "Detectron2" package. Is "Detectron2" a must-have package?

ParkSungHin · Dec 03 '24 13:12

Sorry for the confusion. When exporting the environment, I included all the libraries I commonly use. However, for the StoryGen project, Detectron2 is not a necessary dependency and can be excluded. In fact, you only need to focus on the key libraries and their versions: torch, diffusers, accelerate, xformers, and transformers.
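As a quick sanity check (just a sketch, not a script from the StoryGen repository), you can confirm that the key libraries import correctly and compare their versions against the ones pinned in the environment file:

# Minimal environment sanity check (a sketch, not part of the StoryGen repo):
# confirm the key dependencies import and print their versions for comparison
# with the versions pinned in the project's environment file.
import importlib

for name in ["torch", "diffusers", "accelerate", "xformers", "transformers"]:
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown')}")
    except ImportError as exc:
        print(f"{name}: not installed ({exc})")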

haoningwu3639 · Dec 04 '24 05:12

Thank you for your kind response. Thanks to your support, I was able to run the experiments successfully. However, I have a few questions regarding the experimental results. Could you help me with them?

To reproduce your GitHub results, I trained the model using only the StorySalon dataset (ebook) available on Hugging Face, on eight NVIDIA 3090 GPUs, following exactly the code provided on GitHub. The inference.py script was also run with the unmodified code.

From the results, it seems that the model recognizes "cat" to some extent, but it struggles to recognize "The black-haired man."

[Attached images: boy2, whitecat2]

[Attached image: infer_result]

I am curious whether the root cause of this issue is:

  1. a lack of sufficient training data,
  2. the warning shown below that occurred during training, or
  3. the modifications I made to the stage2_config file to enable multi-GPU training.

Could you clarify or help identify the most likely cause?

The warning during training:

2024-12-07 09:46:46,212 model.pipeline [WARNING] - The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['dere banunii " suggests they are in a remote or isolated location, possibly in a desert or mountainous area. the phrase " aamtya a lee kaapi j 1 1 nana joree " could refer to a specific event or situation that led to their current situation.']


stage2_config.yml

validation_sample_logger:
  num_inference_steps: 20
  guidance_scale: 5
gradient_accumulation_steps: 24
train_steps: 50000
train_batch_size: 4
validation_steps: 500
checkpointing_steps: 1000
seed: 6666
mixed_precision: 'fp16'
learning_rate: 1e-5
scale_lr: false
lr_scheduler: cosine
lr_warmup_steps: 500
use_8bit_adam: true
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1.0e-08
max_grad_norm: 0.5
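For reference, since I run on eight GPUs and (as far as I understand) Accelerate applies gradient accumulation per process, the effective batch size per optimizer step with this config should be:

train_batch_size × num_GPUs × gradient_accumulation_steps = 4 × 8 × 24 = 768 samples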

ParkSungHin · Dec 08 '24 03:12

Sorry for the late reply. For the questions you raised, I have the following suggestions:

  1. First of all, introducing more and higher-quality data will improve the generation results. Because this project is relatively old, the data quality of our StorySalon dataset is not particularly advantageous compared with current related work, but our proposed data processing pipeline can help scale up high-quality data.
  2. The warning means that your text prompt is too long: CLIP's text encoder only supports inputs of up to 77 tokens, and anything beyond that is truncated, which can also cause semantic problems in the text embedding (see the token-count sketch after this list).
  3. The changes you made to the config for multi-GPU training will not impact quality.
  4. The model we proposed is better at single-object-driven generation. It has demonstrated some ability to generate multi-object combinations, but due to the lack of specific training, the quality of multi-object compositional generation is worse than single-object generation.
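To make point 2 concrete, here is a small sketch (not part of the StoryGen codebase; the tokenizer checkpoint is an assumption, the standard Stable Diffusion text encoder uses openai/clip-vit-large-patch14) that flags prompts longer than CLIP's 77-token limit before they are silently truncated:

# Sketch: detect prompts exceeding CLIP's 77-token limit before they are truncated.
# The tokenizer checkpoint below is an assumption; use the one your pipeline loads.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def check_prompt(prompt: str, max_len: int = 77) -> None:
    ids = tokenizer(prompt, truncation=False).input_ids
    if len(ids) > max_len:
        # Tokens beyond position max_len are dropped; ids[0] is the BOS token.
        kept = tokenizer.decode(ids[1:max_len - 1])
        print(f"{len(ids)} tokens; only the first {max_len} are used:")
        print(kept)
    else:
        print(f"Prompt fits within the limit ({len(ids)} tokens).")

check_prompt("The black-haired man and the white cat stand in front of the house.")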

haoningwu3639 · Jan 06 '25 06:01