
Does it make sense to train a model from scratch on a specific product category (such as computer keyboards)?

dustyny opened this issue on Dec 20, 2022 · 4 comments

I am running some experiments with product photography to test if SD can solve a specific business use case.

Does it make sense to train a model from scratch on a specific product category? Say I want a model that is trained only on computer keyboards. Does the model need other/random image data to get good results, or could I reduce the dataset and create a very narrowly focused model?

If so, what would you guess is the minimum number of images?

I have 2,700 products and 3,800 images (which can be flipped to give 7,600). The products are on a white background with no other objects in frame, shot from 1-4 different angles, and all positioned in the same place in the middle of the image.

The text descriptions have been generated in natural language from the specifications, and they all use exactly the same terminology. I generated the text from anything that should be visible in the image: the different parts of the keyboard (using the correct industry terminology), what color each part is, what type of keyboard it is, the number of keys, the keyboard layout, language, material type, height, width, depth, etc. If needed, I can get higher-resolution versions of the same images and slice them up to produce higher-detail sections of the product, which could increase the dataset to as many as 13,800 images.
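Roughly what I mean by flipping and slicing, as a sketch (the paths and tile size are placeholders, and Pillow is just one way to do it):

```python
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("images")            # original product shots (placeholder path)
DST = Path("images_augmented")  # placeholder output path
DST.mkdir(exist_ok=True)

TILE = 512  # SD v1 models train at 512x512

for img_path in SRC.glob("*.png"):
    img = Image.open(img_path).convert("RGB")

    # 1) Horizontal flips double the set (3,800 -> 7,600 images).
    img.save(DST / img_path.name)
    ImageOps.mirror(img).save(DST / f"{img_path.stem}_flipped.png")

    # 2) Slice the shot into 512x512 detail crops; in practice this would be
    #    run on the higher-resolution version of the same image.
    w, h = img.size
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            img.crop((left, top, left + TILE, top + TILE)).save(
                DST / f"{img_path.stem}_{left}_{top}.png"
            )
```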

Would appreciate anyone's thoughts on whether this makes sense to try and, if it does, what type of settings I should be considering with such a small dataset.

dustyny · Dec 20, 2022

Same problem here. I want to train a diffusion model on a set of 1,000 images containing different supermarket products. The problem for me is that the condition I give is not natural language, just some attributes or features, so it does not match the dimension of the embeddings given by the pre-trained model (d = 768). I am trying to train from scratch, but I think it's computationally costly and may need more images.
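One workaround I'm considering, only a sketch (the attribute count, token count, and module names below are made up for illustration): project the attribute vector into a short sequence of 768-dim tokens so it can stand in for the text-encoder embeddings the UNet expects.

```python
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Projects a flat attribute vector into a short sequence of
    768-dim tokens shaped like text-encoder embeddings."""
    def __init__(self, n_attributes: int, n_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.proj = nn.Sequential(
            nn.Linear(n_attributes, n_tokens * dim),
            nn.LayerNorm(n_tokens * dim),
        )

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (batch, n_attributes) -> (batch, n_tokens, 768)
        return self.proj(attrs).view(-1, self.n_tokens, self.dim)

# Example: 12 numeric / one-hot product attributes per image.
encoder = AttributeEncoder(n_attributes=12)
cond = encoder(torch.randn(4, 12))  # (4, 8, 768), usable as cross-attention conditioning
```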

That said, I have tried fine-tuning the pre-trained models, and I think the thousands of images you describe are more than enough for fine-tuning. At least for me, both Pokémon and product images produced reasonable results. If you do not expect the model to learn new representations, then you can probably just fine-tune to get high-quality keyboards. In my case, I want the model to learn rendering knowledge such as light intensity or camera pitch angle, which is not captured by the pre-trained model.
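For anyone trying the fine-tuning route, the standard objective looks roughly like the sketch below, assuming a checkpoint in the Hugging Face diffusers layout (the model path, learning rate, and data handling are placeholders, not something from this thread):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "path/to/stable-diffusion-v1-5"  # placeholder checkpoint path
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(pixel_values, captions):
    """One denoising step on a batch of images (scaled to [-1, 1]) and captions."""
    with torch.no_grad():
        # Only the UNet is trained here; VAE and text encoder stay frozen.
        latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215
        tokens = tokenizer(captions, padding="max_length",
                           max_length=tokenizer.model_max_length,
                           truncation=True, return_tensors="pt")
        encoder_hidden_states = text_encoder(tokens.input_ids)[0]

    # Standard objective: add noise to the latents, train the UNet to predict it.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```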

Liam-Tian · Dec 26, 2022

What I've done is create a translator from machine-generated data to natural language. You'll need to do some binning of the data to categorize it, but once you work out the mapping, you can come up with good phrasing that the NLU should be able to understand.

So in the case of keyboards, I might say that a width of 43.18 cm is a "full sized keyboard". In some cases I need logic that says: if the keyboard has a width of 30.0 cm AND it has 87 keys, it is a "ten keyless". With this approach, I've been able to go from a bunch of product specifications to phrases that read like:

"A Razer mechanical keyboard, gaming, low profile, ten keyless, black body, black keycaps with blue letters, Windows OS, QWERTY layout".

One thing I considered is whether the specifications contain information that would not be apparent in a photo; the model wouldn't be able to tell that the keyboard has a metal weighted base plate or that it is made of a specific type of plastic. My other premise is that the most prominent features should come earlier in the description.

I'm not sure yet which features will be meaningful and which ones won't, but I remember seeing that there is a way to get weights for the keywords to see how much they impact the diffusion. So that's something to test after the model is built; I just need to know the best way forward.
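In the meantime, one crude way to test a keyword's impact once the model is built is a fixed-seed A/B comparison (this is not the keyword-weighting feature I mentioned; the sketch assumes the diffusers library and a placeholder model path):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/finetuned-keyboard-model", torch_dtype=torch.float16
).to("cuda")

base = "A Razer mechanical keyboard, gaming, ten keyless, black body"
variants = {
    "with_low_profile": base + ", low profile",
    "without_low_profile": base,
}

for name, prompt in variants.items():
    # Same seed for every variant, so differences come only from the keyword.
    generator = torch.Generator("cuda").manual_seed(42)
    pipe(prompt, generator=generator).images[0].save(f"{name}.png")
```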

dustyny · Dec 31, 2022

Great intuition to categorize the products. However, in my case I need to keep things numerical. I wonder how we can specify a continuous property; in your case, say we want keyboards whose width is in the range (42.18, 43.18). For example, if we specify a width of 42.43, the model should at least output a keyboard that is wider than 42 and narrower than 43. The task is to get the model to understand numerical input in the condition. It would help me a lot if I could make that work.
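One rough idea, only a sketch and not something I've validated: embed the scalar with a small MLP and append it to the text-conditioning sequence, so the UNet can attend to the value directly (all names and the normalization below are made up for illustration).

```python
import torch
import torch.nn as nn

class ScalarConditioner(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, text_embeds: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, seq_len, 768); value: (batch,), e.g. width,
        # ideally normalized to roughly 0-1 before embedding.
        tok = self.mlp(value[:, None]).unsqueeze(1)   # (batch, 1, 768)
        return torch.cat([text_embeds, tok], dim=1)   # (batch, seq_len + 1, 768)

cond = ScalarConditioner()
widths = torch.tensor([42.43, 43.10]) / 50.0           # crude normalization
extended = cond(torch.randn(2, 77, 768), widths)       # (2, 78, 768)
```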

Liam-Tian · Jan 3, 2023

I'd say take a look at BERT to better understand what the limits of the NLU model are. It's a statistical model of how words are combined and ordered; it doesn't have an understanding of what they mean as an abstraction. You're more likely to get the number that most commonly occurs in that sentence context, which is most likely going to be incorrect.

So it might know that the height range of a US male is between 5'8" and 6', as that fact is probably in the training data (many times), but it might not know that 5'10" is in that range, since it doesn't understand that this is a range of numbers or what falls between 5'8" and 6'.

That of course is an oversimplification; BERT was trained on a massive amount of data and has so many word patterns in it that just about anything you think up is going to be covered. But I think it gets the point across: you can't think about it in terms of logic and deterministic responses, it's statistical.
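You can see the limitation concretely by looking at how the tokenizer splits numbers into sub-word pieces that carry no numeric meaning (sketch below uses the Hugging Face transformers library):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
for text in ['a keyboard 42.43 cm wide', 'a man who is 5\'10" tall']:
    print(text, "->", tok.tokenize(text))
# e.g. "42.43" becomes pieces like ['42', '.', '43']: the model sees tokens,
# not a point on a number line, so range reasoning is statistical at best.
```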

dustyny · Jan 3, 2023