
Information about the training data

Open LeeDoYup opened this issue 2 years ago • 5 comments

Hello, thanks for the awesome project! I love the results and respect the authors' efforts.

By the way, I cannot find any information about the training data. I would like to know details such as how the training data was collected.

I think many researchers are also curious about the details of the training data and the data-gathering process.

Thank you :)

LeeDoYup avatar Dec 21 '22 06:12 LeeDoYup

Here is an excerpt from the Point-E paper (link in the README) regarding the dataset used:

4.1. Dataset

We train our models on several million 3D models. We found that data formats and quality varied wildly across our dataset, prompting us to develop various post-processing steps to ensure higher data quality.

To convert all of our data into one generic format, we rendered every 3D model from 20 random camera angles as RGBAD images using Blender (Community, 2018), which supports a variety of 3D formats and comes with an optimized rendering engine. For each model, our Blender script normalizes the model to a bounding cube, configures a standard lighting setup, and finally exports RGBAD images using Blender’s built-in realtime rendering engine.

We then converted each object into a colored point cloud using its renderings. In particular, we first constructed a dense point cloud for each object by computing points for each pixel in each RGBAD image. These point clouds typically contain hundreds of thousands of unevenly spaced points, so we additionally used farthest point sampling to create uniform clouds of 4K points. By constructing point clouds directly from renders, we were able to sidestep various issues that might arise from attempting to sample points directly from 3D meshes, such as sampling points which are contained within the model or dealing with 3D models that are stored in unusual file formats.

This is what I've gathered from an initial scan of the paper. I would love to know more about exactly how the dataset was assembled, especially where the "several million 3D models" came from. I'll look into it further when I get a chance to really delve into the paper and related works.
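In case it helps make the excerpt concrete, here is a rough NumPy sketch of the two steps described in 4.1: back-projecting RGBAD renders into a dense colored point cloud, then downsampling with farthest point sampling. This is my own illustration, not the authors' code; the intrinsics `K`, the `cam_to_world` pose, and the alpha threshold are assumptions, and Blender's actual depth/camera conventions would need to be matched in practice.

```python
# Sketch only (not the Point-E authors' code): back-project RGBAD renders
# into a world-space colored point cloud, then downsample with farthest
# point sampling (FPS) to a uniform 4K-point cloud.
import numpy as np

def backproject_rgbad(rgb, alpha, depth, K, cam_to_world):
    """Turn one H x W RGBAD render into world-space points + colors.

    rgb: (H, W, 3) colors, alpha: (H, W), depth: (H, W) in camera units,
    K: (3, 3) pinhole intrinsics, cam_to_world: (4, 4) camera pose.
    All conventions here are assumptions for illustration.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    mask = alpha > 0.5                               # keep pixels covered by the object
    pix = np.stack([u[mask], v[mask], np.ones(mask.sum())], axis=0)
    cam_pts = np.linalg.inv(K) @ pix * depth[mask]   # unproject to camera space
    cam_pts_h = np.concatenate([cam_pts, np.ones((1, cam_pts.shape[1]))], axis=0)
    world_pts = (cam_to_world @ cam_pts_h)[:3].T     # (N, 3) world-space points
    return world_pts, rgb[mask]

def farthest_point_sampling(points, colors, k=4096):
    """Greedily pick k points that are mutually far apart."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)
    dists = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)
    for i in range(1, k):
        # distance from every point to the most recently chosen point
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dists = np.minimum(dists, d)
        chosen[i] = np.argmax(dists)                 # farthest from the chosen set so far
    return points[chosen], colors[chosen]
```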

AdamSlay avatar Dec 21 '22 14:12 AdamSlay

@AdamSlay Thank you. I had also seen that description, but I posted this issue to raise the question of where the "several million 3D models" come from.

LeeDoYup avatar Dec 22 '22 00:12 LeeDoYup

Is there any plan to release the training data, i.e., the "several million 3D models"? If not, could you describe how these training samples were collected? Thanks~

YBZh avatar Jan 15 '23 01:01 YBZh

The supplementary material says: "We train our SDF regression model on a subset of 2.4 million manifold meshes from our dataset." However, I cannot find any information about where those 2.4 million manifold meshes come from. Maybe there are licensing issues around where they were scraped from?

matd3m0n avatar Jun 03 '23 15:06 matd3m0n

TL;DR: as with SD, they almost certainly won't release the training data itself; if they release anything, it will be links to it. That part is up to them, and they might simply not feel like doing it.

A quick search finds "Cults 3D", a database of free models for 3D printers, which supposedly hosts 1.1M models. Most appear to be CC-Attribution, though there are plenty of CC-NonCommercial ones as well. Since all of the ML models are released "research only", non-commercial data doesn't affect the release of the network itself; it will almost certainly carry a non-commercial-use clause that everyone will immediately ignore.

CMU.edu has 6 hours of point cloud data captured with 10 Kinects across 54 different scenes of people and objects. If there's a fast and clean way of auto-generating meshes from point clouds now, tons of human poses could be pulled from there. The full database also includes 31 synced HD video streams from multiple angles and ~480 VGA streams. One of the Australian universities has a large point cloud database of objects from around Sydney. Both of those are publicly accessible.
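For what it's worth, "auto-generating meshes" from raw point clouds is fairly routine these days. Here is a rough sketch using Open3D's Poisson surface reconstruction; this is just my example, not anything the Point-E authors describe, and the file names are hypothetical:

```python
# Rough sketch: turn a raw point cloud into a triangle mesh with Open3D's
# Poisson surface reconstruction. "scan.ply" / "scan_mesh.ply" are hypothetical paths.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")

# Poisson reconstruction needs oriented normals; estimate them from local neighborhoods.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30)
)
pcd.orient_normals_consistent_tangent_plane(30)

# Reconstruct the mesh; higher depth = more detail (and more memory).
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9
)

o3d.io.write_triangle_mesh("scan_mesh.ply", mesh)
```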

As far as distribution goes:

  1. Regardless of license, if something is on the public internet, giving out a link to it isn't something any license on the third-party server can really say anything about. The internet can't exist without links. If a host is really opposed to it, their only recourse is to set up their server to require logins or interaction for each download.
  2. If the training data is released, it's highly likely it'll be in the same form as diffusers: a big database of links to the images that were pulled. This keeps Stability AI away from the enormous legal issues that they don't bother mentioning. Tons of the images probably carry copyright EXIF data that was ignored, or was stripped by some shadier photo-sharing service (Facebook, and probably Instagram, started removing all EXIF years ago and changing the embedded ICC profile to the non-standard Display P3 on upload, which is when I quit uploading anything I took to Facebook, before I quit using it altogether). Yes, whoever downloads an image can strip that out trivially, but I also had my contact email in there in case someone wanted to negotiate commercial use. Lots more images have copyright watermarks. They have to assume they can't legally release those. At this point some people might even be releasing models free for non-commercial use with a clause that they can't be used to train generative AI. Since the people training these networks never read the pages the files are on or check the file EXIF (a quick sketch of what checking the copyright EXIF tag would look like follows this list), terms like that could require retraining the entire model to remove items whose licenses ruled them out entirely. US law says EULAs are valid even if they weren't presented or you never read them, so there's no case for not knowing those terms were there.
  3. The worst item, mentioned even less often (but thankfully incredibly unlikely with 3D models, and not a legal issue in most places for those, although you might have issues with 3D Simpsons porn in Australia ;P ), is that a set of images as huge as the ones they're using has a high probability of containing at least a few instances of child pornography, incest, bestiality, or other images that are illegal depending on the country the downloader is in. Given the varying laws on practically everything in different regions of the world, there's almost a guarantee that somewhere in the pile are images illegal where you are, although (in the US at least) if you luck out and none of it is child porn, it probably hasn't been turned into a honeypot since the list was compiled and nobody is going to care anyway. Just hope some 17-year-old girl's shirtless selfie from some social media / iCloud leak didn't get pulled in from somewhere. In any case, the model originators are probably fully aware that this is a very real risk, and it likely factored into not directly releasing training data as much as or more than the copyright concerns.
  4. Last of all, there's the bandwidth issue for the people making the model. Huge amounts of storage plus outgoing data per month are expensive on any web host, unless you're a university research group hosting on the university's site. Since all of the data is public, which it pretty much has to be, giving people the links probably saves enough in monthly hosting fees over the course of a year to buy 1/10th to 1/30th of another H100 module. Meshes compress far more easily than images, though, so unless the training data includes normal / displacement maps (you'd hope they sorted out the issue of some models being designed to need an applied displacement map to reach full detail), it's likely to be much smaller than what SD 1.5 was trained on.
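As referenced in point 2, here is a rough illustration of what "checking the file EXIF" for rights metadata could look like before an image goes into a scraped dataset. This is a sketch using Pillow's standard EXIF reader; the file path is hypothetical:

```python
# Sketch: read the standard EXIF Copyright/Artist tags from an image before
# including it in a scraped dataset. Path is hypothetical.
from PIL import Image

COPYRIGHT_TAG = 0x8298   # standard EXIF "Copyright" tag ID
ARTIST_TAG = 0x013B      # standard EXIF "Artist" tag ID

def exif_rights_info(path):
    """Return any copyright/artist strings embedded in the image, or None."""
    exif = Image.open(path).getexif()
    notice = exif.get(COPYRIGHT_TAG)
    artist = exif.get(ARTIST_TAG)
    return {"copyright": notice, "artist": artist} if (notice or artist) else None

info = exif_rights_info("scraped/example.jpg")
if info:
    print("Has rights metadata; maybe skip it or contact the owner:", info)
```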

NeedsMoar avatar Aug 26 '23 04:08 NeedsMoar