How can I set up the environment?
Hello, I'm trying to build the environment from environment.yaml on Windows, and a lot of things aren't running. My GPU is an RTX 4070 Ti Super, so I suspect the PyTorch and CUDA versions need to be different. How should I approach this?
If you encounter problems when using environment.yaml, I suggest installing the key dependencies I recommend, including torch, accelerate, xformers, diffusers, and transformers. Since the diffusers library iterates quickly and there can be obvious incompatibilities between versions, it is recommended to install the versions I provide. If you can share more error information, I will be glad to give more guidance.
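For what it's worth, a quick sanity check like the minimal sketch below (my illustration, not part of the repository) can confirm that your installed PyTorch build actually sees the 4070 Ti Super and which CUDA version it was compiled against; RTX 40-series cards need a PyTorch build compiled with CUDA 11.8 or newer.

```python
import torch

# Print the installed PyTorch version and the CUDA version it was built with.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)

# Check that the GPU is visible; if this fails, the PyTorch build and the
# installed NVIDIA driver/CUDA toolkit likely do not match.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available -- check that your PyTorch build matches your driver.")
```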
Thanks to this, I finished setting up the environment on Linux. Do I have to download all the files listed in metadata.json? I have been downloading the data, but it seems difficult to download all of them because of errors like the ones below.
ERROR: [youtube] QCJyJup0qcc: Private video. Sign in if you've been granted access to this video
ERROR: [youtube] n5p24NNdycc: Video unavailable. This video has been removed by the uploader
ERROR: [youtube] eo1TV_1KZsE: Video unavailable
Is it expected that the download proceeds like this, with some videos failing?
Sorry for the late reply, I was on vacation last week. You don't necessarily need to download all the videos in metadata.json, because some of them may have been removed due to YouTube's restrictions. The YouTube videos do not account for a large proportion of our StorySalon dataset, so you can focus on the data from the open-source libraries. You can also search for suitable YouTube videos yourself and use the data processing pipeline we provide to expand the dataset further.
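In case it helps, here is a minimal sketch (not the repository's download script) of how unavailable videos can simply be skipped when downloading with yt-dlp's Python API. The video ID list is hypothetical and would come from metadata.json, whose exact structure you should adapt to.

```python
from yt_dlp import YoutubeDL
from yt_dlp.utils import DownloadError

# Hypothetical list of video IDs extracted from metadata.json; adapt the
# extraction step to the actual structure of that file.
video_ids = ["QCJyJup0qcc", "n5p24NNdycc", "eo1TV_1KZsE"]

ydl_opts = {
    "outtmpl": "videos/%(id)s.%(ext)s",  # where downloaded videos are stored
    "writesubtitles": True,              # also fetch subtitles (.vtt) when available
}

failed = []
with YoutubeDL(ydl_opts) as ydl:
    for vid in video_ids:
        try:
            ydl.download([f"https://www.youtube.com/watch?v={vid}"])
        except DownloadError:
            # Private or removed videos are expected; record them and move on.
            failed.append(vid)

print(f"Skipped {len(failed)} unavailable videos:", failed)
```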
Thank you! I'm now working through the data processing, and a couple of questions came up along the way:
- Is it intended that extract.py extracts only two pairs out of the many storybook videos and .vtt files?
- Can I use the yolov7.pt file referenced in human_ocr_mask.py as the pre-trained model for detecting real people provided by the yolov7 GitHub repository? Or should I fine-tune it to match the illustrated storybook images?
- Sorry, that code was added for the purpose of testing script accuracy. We will update the repository with a corrected version that removes it.
- Yes, you can directly use the pre-trained model.
Thank you, that solved the problem!
However, the next step, inpaint.py, throws an error like the one in the screenshot below. Can you tell me how to solve it?
Since the inpainting pipeline is borrowed entirely from the Stable Diffusion implementation, we did not include that part of the code in our repository. You can follow our README.md to download the related code and dependencies from https://github.com/CompVis/stable-diffusion
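In case it is useful while you set up the CompVis code, here is a minimal sketch of an equivalent inpainting call using the diffusers library instead. This is my substitution, not the pipeline the authors use; the checkpoint stabilityai/stable-diffusion-2-inpainting and the file paths are assumptions, and any Stable Diffusion inpainting checkpoint should behave similarly.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Assumed checkpoint; swap in whichever SD inpainting weights you actually use.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("frames/00001.jpg").convert("RGB")  # original frame
mask = Image.open("masks/00001.png").convert("L")       # white = region to inpaint

result = pipe(
    prompt="clean storybook illustration background",
    image=image,
    mask_image=mask,
).images[0]
result.save("inpainted/00001.jpg")
```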
Thank you, I was able to complete the data preprocessing. Now I'm trying to run train_StorySalon_stage1,2, but I don't know how to set up the dataset. There are folders named with a series of numbers that contain the episode frame images, and there is a text file with the same name as each number. As I understand it, the dataset path in train_StorySalon_stage1,2 consists of one dataset path each for training and validation.
- Should the inpainted image, mask, and caption folders (the ones named with a series of numbers) all be placed under one parent folder?
- What ratio of the entire dataset should I use for the training and validation sets?
- What are "PDF_test_set.txt" and "video_test_set.txt"?
- Since the SDM inpainting performance is not great and many regions are not inpainted properly depending on the mask, is it okay to inpaint the images manually instead? And if I inpaint a region myself, do I still need to apply the mask over it?
- The data preprocessing scripts we provide construct the dataset structure correctly; please refer to that code.
- We use 95% of the data for training and 5% for testing.
- These are two files that record the indices of the test-set samples; we have already provided them.
- Of course; we just want to remove the noisy parts that are not conducive to generation. However, you should still stack the masks, because we do not compute the loss on the masked regions during training (see the sketch below).
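To make the last point concrete, here is a minimal sketch (my illustration, not the repository's training code) of how a loss can be masked so that text boxes, detected people, or manually inpainted regions do not contribute to the gradient; the tensor names and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(pred, target, mask):
    """MSE loss that ignores masked regions.

    pred, target: (B, C, H, W) tensors (e.g. predicted vs. target noise).
    mask: (B, 1, H, W) tensor where 1 marks regions to EXCLUDE from the loss
          (text boxes, real people, manually inpainted areas).
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")  # (B, C, H, W)
    keep = 1.0 - mask                                        # 1 where loss is counted
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)

# Hypothetical usage inside a training step:
# loss = masked_mse_loss(noise_pred, noise, mask)
# loss.backward()
```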
Thank you for your detailed answers, but I have a few more questions:
- According to the StorySalonDataset class in dataset.py, pdf_testset.txt lists (image, text) pairs extracted from e-books, and video_testset.txt lists pairs extracted from videos. If I follow the suggested metadata.json pipeline, can I use only the video_testset.txt part?
- I've collected up to 1176 stories extracted from the videos so far, while video_testset seems to have been created with a range of 1458 in mind. Can I use it as it is?
- If my understanding in question 1 is correct, how should I extract the e-book data to make use of pdf_testset.txt?
- For the inpainting I do myself, I work in a web application, so no mask is extracted for those regions. Should I create the mask myself, or is it okay to use the mask generated by human_ocr_mask.py as it is?
- Of course, that works.
- Since YouTube videos may be taken down, some videos might no longer be available. However, if you collected them based on the metadata we provided, you should theoretically be able to use them.
- The PDF-related data is open-source and available. We have already provided a curated dataset, which you can download from the following link (see the download sketch after this list): https://huggingface.co/datasets/haoningwu/StorySalon
- I believe the mask extracted using the human_ocr_mask.py script should work fine.
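As a pointer for the PDF data mentioned above, downloading the curated dataset from the Hugging Face Hub can be done along these lines. This is a minimal sketch; the local directory is hypothetical, and the layout of the downloaded files is whatever the dataset page provides.

```python
from huggingface_hub import snapshot_download

# Download the curated StorySalon dataset from the Hugging Face Hub.
# "./StorySalon" is an arbitrary local directory; adjust as needed.
local_path = snapshot_download(
    repo_id="haoningwu/StorySalon",
    repo_type="dataset",
    local_dir="./StorySalon",
)
print("Dataset downloaded to:", local_path)
```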