
dev: do not store images in Git

Open shcheklein opened this issue 4 years ago • 22 comments

Clone takes forever, at some point we will hit the GH limits.


As an experiment, we can see what it would take to integrate DVC to store images and use GraphQL to fetch them. POC here - https://github.com/iterative/blog/pull/93

After that we can amend the Git repo to remove all image objects.

Any other ideas?

shcheklein avatar Jan 28 '20 00:01 shcheklein

@shcheklein where should we store the images? Should we use an S3 bucket with public access and serve them from there?

fabiosantoscode avatar Jan 28 '20 14:01 fabiosantoscode

@fabiosantoscode for the blog.dvc.org specifically? pick any remote storage to experiment with. Probably will be storing on S3 when it's done.

shcheklein avatar Jan 28 '20 14:01 shcheklein

@shcheklein I've looked deeply into gatsby hooks, the source-filesystem plugin (which can generate files from Buffer objects as well as real files) and decided to simplify to the max and work up from there.

So I used dvc import-url to place an example image in a subdirectory of static. I can query it with GraphQL just fine. I place the DVC files in a single subdirectory "dvc-img".

Plus, there's no need to restart gatsby develop when I import a new image: since the images are in the real filesystem, they are taken in by gatsby just like any other images would be, and therefore they are available from GraphQL.

All that's missing, I think, is running dvc pull using some git hook (using husky), or whenever gatsby starts a build or a dev server.
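
For reference, the manual flow looks roughly like this (URL and paths are just examples, and a default DVC remote is assumed to be configured):

```
# fetch an external image into static/ and track it with DVC
dvc import-url https://example.com/photos/hero.png static/uploads/hero.png

# commit the generated .dvc pointer file and .gitignore entry; the image itself stays out of Git
git add .
git commit -m "blog: track hero image with DVC"

# on a fresh clone (or from a husky/git hook), fetch the actual image bytes before building
dvc pull
gatsby develop
```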

Am I on the right track? I don't see too many problems with this approach, and it's pretty simple so it shouldn't be too hard to maintain.

Edit: I'm not tackling image optimization itself since gatsby has plugins for that. I think it's OK to store large images on S3 and have them optimized at build time.

Edit 2: It might be better to place the DVC files next to the images, but that might just be my dislike of complex directory structures talking.

fabiosantoscode avatar Jan 28 '20 18:01 fabiosantoscode

An interesting problem we have now is third-party contributions. If a regular user were to create a post, they would have to give us the image out-of-band.

<aside> Sending data over a PR could be a cool problem for iterative and DVC to solve: storing images in some secure iterative-controlled content-addressable storage and submitting a PR with a "data PR DVC file" which contains its sha512.

Once the PR is merged that "data PR DVC file" can be turned into a regular old import-url DVC file by an insider with s3/google drive permissions (or a github action!), who simply runs a command, which downloads it from iterative and places it in the desired remote. It might also help coworkers "branch off" their large data files, and then have them imported back into master with ease.

It's still possible for a team to do this by hand by making the files available on some personal HTTP server or public S3 bucket and adding an import-url DVC file with the public S3 URL to the file. It does add some additional steps on the merge side though. </aside>

fabiosantoscode avatar Jan 28 '20 18:01 fabiosantoscode

So I used dvc import-url to place an example image in a subdirectory of static. I can query it with GraphQL just fine. I place the DVC files in a single subdirectory "dvc-img".

not sure I understand why you would need dvc import-url - why not just dvc add directory-with-images?

All that's missing, I think, is running dvc pull using some git hook (using husky), or whenever gatsby starts a build or a dev server.

yep, that's the most interesting part - how to automate it and hook it nicely into the workflow. Write a plugin that detects that images are stored in DVC and resolves/pulls them programmatically? Also, how do we add more images ... can we make it part of the build (detect changes and dvc add / dvc push them)?
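
e.g. the simplest version could just be wired into the existing scripts/CI; a rough sketch (the directory name is made up):

```
# before any build or dev-server start (local or CI), make sure the images exist
dvc pull && gatsby build

# after adding or editing images, record and upload the changes
dvc add static/uploads
dvc push
git add static/uploads.dvc static/.gitignore
git commit -m "blog: update images"
```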

Sending data over a PR could be a cool problem for iterative and DVC to solve

Agreed! Let's keep this in mind and think. But probably for now can solve this for the internal team first.

shcheklein avatar Jan 29 '20 01:01 shcheklein

@shcheklein dvc add is probably better, yes.

I can create this plugin. If the image DVC files are stored next to the images it's a lot easier (or if the DVC file represents the whole image subdirectory). I can detect changes in the DVC file(s) and run dvc pull automatically. This can detect branch changes and whatnot.

Automatic pull has a drawback though. If you change a DVC file after changing an image, you lose the changes you made. After reading through the docs a bit, I don't think there's a way to pull a DVC file that avoids overwriting novel data that might be in the working directory. It's probably safe to assume people will have an extra copy of the images they're adding, so this is probably fine to ignore for now.

As for writing images, I don't think it would be very good to track images automatically at build time (if you crop, resize or color correct an image 10 times before committing and your dev server is running, we get 9 unused versions of that image sitting on S3).

Instead, simply asking users to run dvc install and making sure it has been installed as a pre-commit check, might be enough. If I correctly understand what the hooks do, I think they can effectively make sure that if you're pushing any git changes, changes to data files are pushed as well. If not, we can probably augment them with our own hooks.
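
For example (assuming the images live in a single DVC-tracked directory; hook behavior as I understand it from the docs):

```
dvc add static/uploads     # one .dvc file tracking the whole image directory
dvc push                   # upload the image data to the remote
dvc install                # set up Git hooks: post-checkout runs `dvc checkout`,
                           # pre-commit runs `dvc status`, pre-push runs `dvc push`
```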

fabiosantoscode avatar Jan 29 '20 13:01 fabiosantoscode

or if the DVC file represents the whole image subdirectory

probably this would scale better - DVC is not very well suited for thousands and thousands of DVC-files, it might become a bottleneck. Also, dvc pull can do "granular" partial pulls for a file in the directory.

Automatic pull has a drawback though. If you change a DVC file after changing an image, you lose the changes you made.

not sure why. You don't change the DVC-file by hand - it's updated by running dvc add or dvc run.

we get 9 unused versions of that image sitting on S3

this is fine, we can do dvc gc from time to time
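
for example (flag spelling may differ between DVC versions):

```
# drop cached image versions that are no longer referenced in the workspace,
# both from the local cache and from the remote storage
dvc gc --workspace --cloud
```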

if you crop, resize or color correct an image 10 times before committing

btw, it's a pretty regular thing in Gatsby and it takes a lot of time. Can we use DVC pipelines there to save the prepared/processed images into the cache?
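
e.g. something like this, with a hypothetical optimization script and the `dvc run` syntax of that time:

```
# declare image optimization as a DVC stage so its outputs are cached and can be pushed/pulled
dvc run -d scripts/optimize-images.js -d static/uploads \
        -o static/optimized \
        node scripts/optimize-images.js
```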

Instead, simply asking users to run dvc install and making sure it has been installed as a pre-commit check, might be enough

we can go with any simple option first for the update experience.

shcheklein avatar Jan 30 '20 00:01 shcheklein

The case where you lose your image changes is mostly when you change branches, which changes the DVC file. It shouldn't be a big deal.

Using DVC pipelines sounds like a good idea, but I think there are some gatsby plugins for compressing images which use the gatsby cache to keep things up to date. Either way I don't want to tackle this before the rest of the experience is solid.

Thanks for clearing up my questions, I'll get right on this.

fabiosantoscode avatar Jan 30 '20 16:01 fabiosantoscode

@fabiosantoscode DVC won't allow you to do dvc checkout if it means losing uncached files.

use the gatsby cache to keep things up to date.

yep, the question is - can the Netlify and/or Gatsby CI/CD be set up in a way to use this cache easily? In the case of DVC, the cache is saved to S3, so even other developers/editors can potentially get it and benefit.

Either way I don't want to tackle this before the rest of the experience is solid.

yep, that's right.

shcheklein avatar Jan 30 '20 18:01 shcheklein

To fix this I think we should cache the .cache and public folders across builds. It's trivial in most CIs but it doesn't seem to be documented on Netlify. They do seem to cache node_modules, so we might just piggy-back on that and copy the cache to and from node_modules/.tmp before and after each build. That might just work. Some Netlify forum discussions mention this repository.
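
Something along these lines in the build command might work (paths and exact commands are a guess):

```
# restore Gatsby caches saved by a previous build, if Netlify's node_modules cache has them
if [ -d node_modules/.tmp/.cache ]; then cp -r node_modules/.tmp/.cache .cache; fi
if [ -d node_modules/.tmp/public ]; then cp -r node_modules/.tmp/public public; fi

gatsby build

# stash the fresh caches back so they ride along with the node_modules cache
mkdir -p node_modules/.tmp
rm -rf node_modules/.tmp/.cache node_modules/.tmp/public
cp -r .cache public node_modules/.tmp/
```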

fabiosantoscode avatar Jan 30 '20 19:01 fabiosantoscode

Hi @iterative/websites ! I just realized this p1 issue has been waiting for a while. Should we plan it up?

Option A - Use DVC (probably with an S3 bucket)
Option B - Consider Git-LFS since it's integrated into GitHub and we're aiming at the specific, narrow use case of large file storage (not really a data management need).
Option C - ?

Either way it'd be nice to have it set up automatically with yarn for dev envs.

Cc @casperdcl didn't you propose a related solution for another repo? Thanks

jorgeorpinel avatar Mar 15 '22 01:03 jorgeorpinel

Thanks @jorgeorpinel for bringing this up - we need more discussion on this. We have already discussed it a little on Slack and have also mentioned it in the #docs meeting, but haven't reached a conclusion overall.

Option A - Use DVC (probably with an S3 bucket)

  • As @casperdcl mentioned we would need to address this issue:

one problem is making sure the S3 etc. storage that DVC uses is publicly readable but only writable by us. Then the next problem is how do we handle external contributions? How do they write to our storage?

B - Consider Git-LFS since it's integrated to Github and we're aiming to use the specific, narrow use case of large file storage (not really a data management need).

For this we need to be okay with: why-you-shouldn't-use-git-lfs

  • Still the blog may not be 100% right
    • https://github.com/git-lfs/git-lfs#uninstalling
  • another list of limitations by git-lfs : https://github.com/git-lfs/git-lfs/wiki/Limitations
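
For comparison, the minimal setup for this narrow use case would be roughly the following (patterns and paths are just examples):

```
git lfs install
git lfs track "static/uploads/**/*.png" "static/uploads/**/*.jpg"
git add .gitattributes
git commit -m "Track blog images with Git LFS"

# optionally rewrite existing history so old image blobs move to LFS as well
git lfs migrate import --include="*.png,*.jpg"
```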

C - ?

A possible option C, as I mentioned, is using some content management tools (a CMS for content as well as media management): we have pros and cons and a little discussion on Slack

If we choose to go with one of the above:

Option A:

  • I can collaborate with someone who knows DVC well and submit a pr

Option B:

  • I can submit a pr on this

Option C:

  • If we do consider a CMS, we would need to find a suitable one: popular headless cms with Gatsby
  • This one will have other advantages as well but would take some significant development time.

yathomasi avatar Mar 15 '22 11:03 yathomasi

propose a related solution for another repo

As I recall I'd also suggested an option of using a submodule (e.g. https://github.com/iterative/static) for images (so that the base repo doesn't bloat in size).

Cons: would require people to shallow-clone/checkout/init submodules, but maybe it's the easiest? From discussion with @shcheklein I also recall the gatsby build will have to check out the submodule in order to perform image optimisations.

casperdcl avatar Mar 15 '22 17:03 casperdcl

one problem is making sure the S3 etc. storage that DVC uses is publicly readable but only writable by us

Quite doable: you set up 2 remotes pointing to the same storage, the read-only one via HTTP in .dvc/config and the one for writing in .dvc/config.local (which needs to be set up manually as part of the dev env setup and should be documented in the README and/or here).
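
Concretely, something like this (the bucket path is just an example):

```
# committed in .dvc/config: public read-only remote over HTTPS, used by default
dvc remote add -d public-read https://dvc-public.s3.amazonaws.com/dvc-org

# written to .dvc/config.local (not committed): S3 remote for maintainers
dvc remote add --local s3-write s3://dvc-public/dvc-org

# contributors/CI only ever need `dvc pull`; maintainers upload with:
dvc push -r s3-write
```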

Indeed the LFS integration with GH makes this easier for devs though.

how do we handle external contributions? How do they write to our storage?

Good question. It's not a common case though, so maybe not worth worrying about? In the rare cases it happens, they'd have to upload it to the PR during review and we'd process the images appropriately once it's approved, before squash & merge.

Maybe GH/LFS also makes this easier? If so, I'd be inclined toward LFS at this point.

we need to be okay with : why-you-shouldn't-use-git-lfs

We'd want to rewrite the whole repo history to remove image files with either option I think. I don't think it's a problem for a web app -- we never deploy much older versions?

Possible option C was I mentioned using some content management tools(cms

Interesting but how does it integrate with the existing engine? Do we want to manage 2 web apps? Sorry, I can't open the Slack links (feel free to @ me directly on Slack though). BTW we'd still want to rewrite history and remove all images from the repo, I believe.

option of using a submodule

Ugh submodules. A reason not to use Git is that it's "not supposed" to manage binaries anyway, right? The submodule may still take a long time to clone, etc. and complicates things for devs/contributors anyway.

jorgeorpinel avatar Mar 15 '22 20:03 jorgeorpinel

Possible option C was I mentioned using some content management tools(cms

Interesting but how does it integrate with the existing engine? Do we want to manage 2 web apps?

💡 idea to explore: (bit hacky but) we could try using our existing Discuss site to upload images and load them from there into docs. We do something like that for all blog posts: to create a forum for comments. I believe the images would be served up by our CDN anyway (we should check).

jorgeorpinel avatar Mar 15 '22 20:03 jorgeorpinel

Maybe GH/LFS also makes this easier? If so I'd have to incline for LFS at this point.

Since we deploy our website on Heroku (I couldn't find any mention of support for git-lfs), there might be a problem at build time, so we might need some workaround on Heroku to support git-lfs. But that does come with a limitation.

Shall we try with LFS and submit a pr to see how it goes?

Interesting but how does it integrate with the existing engine?

Currently, the contents are kind of separate too. Gatsby has the flexibility to source them from different places.

Do we want to manage 2 web apps?

Depends. We can rely on the cloud services most of them provide.

Possible option C ...

💡 idea to explore: (bit hacky but) we could try using our existing Discuss site to upload images and load them from there into docs...

Or, we can use image/media management services like Cloudinary, which has easy integration with Gatsby too. There are alternatives as well: Imagekit.io, Imgix.

yathomasi avatar Mar 21 '22 10:03 yathomasi

Even if there is no other option, I'd not use Git-LFS for any solution. It's like using Mercedes trucks in a BMW factory. If we're going to use competitors' solutions, git-annex may be better, as it doesn't require server-side changes 😃

DVC can be used with multiple remote configurations like we do in example repositories. Something like reading from https://dvc-public.s3.amazonaws.com and writing to s3://dvc-public/ might be doable. PRs will still contain images and the final versions will be uploaded by the team to S3 with their credentials.

iesahin avatar Mar 21 '22 15:03 iesahin

It's just an idea - need to discuss its feasibility.

  1. Setup Github action on pull request
  2. If there are any images in some specified folder, upload them to some image storage implementation
  3. The image storage implementation can be one of
    1. DVC to s3 storage
    2. Some images as service(with API)
    3. Directly to some s3 bucket
      1. Github Action https://github.com/marketplace/actions/s3-sync
      2. and serve through CloudFront
  4. Once the images are uploaded, create a pull request that replaces the path of the image with a new image path and also completely removes those images (I assume it's possible to use git-filter-branch or git-filter-repo)
  5. The idea of a new pull request is similar to the restyled-pull requests we are using.
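
Roughly, steps 2-4 could boil down to something like this inside the action (all names, paths and buckets are hypothetical):

```
# 2. upload any new images from the designated folder to storage
aws s3 sync static/uploads "s3://our-image-bucket/uploads/"

# 4. rewrite the branch so the image blobs themselves are dropped from Git
#    (plus a search-and-replace of image paths in the markdown, e.g. with sed)
git filter-repo --path static/uploads --invert-paths --force

# then push the rewritten branch / open the follow-up PR (e.g. via a bot or the gh CLI)
```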

cc: @iterative/websites @casperdcl

yathomasi avatar Mar 28 '22 11:03 yathomasi

If we convert dvc.org into a DVC repository, it will be possible to add image files with dvc add and we'll dvc push them to S3 with our credentials.

In the publish phase, a dvc pull in the repository is required to get the files before image transformations. This can use an HTTPS remote to avoid leaking S3 credentials to Heroku or GitHub.

In our PRs, we can upload the files ourselves. Outside contributors won't be able to push though. They can include the images in PRs, and we can upload them.

Once we have a workflow to use DVC in the repository, we can automate as much as possible.

iesahin avatar Mar 28 '22 15:03 iesahin

If there are any images on some specified folder, upload them to some image storage implementation

before this step I would have:

  1. cml send-github-check --status in_progress
  2. manual approval (see https://github.com/iterative/cml/blob/f3406ac28cb98aeef5cc47f4754b4c8cac428b14/.github/workflows/test-deploy.yml#L10-L19)
  3. cml send-github-check --conclusion success

Also the storage could be similar to https://assets.cml.dev :)

/CC @iterative/cml

casperdcl avatar Mar 28 '22 19:03 casperdcl

we can use image/media management services like Cloudinary which has easy integration with Gatsby... Imagekit.io Imgix

Sounds interesting. I also like the idea of trying this manually first, and automating via hooks or actions once the workflow is clear.

  1. Once the images are uploaded, create a pull request that replaces the path of the image with a new image path and also completely removes those images

It'd be ideal that the transformation can happen locally though, so that there's no need to cascade PRs, some of which rewrite the history of what was just merged in (sounds fragile). The question about outside contributors is a good one, but it's not very common that they include images in PRs, so it's not a pressing one to answer.

jorgeorpinel avatar Mar 29 '22 01:03 jorgeorpinel

Lowering to p2 due to lack of resources for now.

dberenbaum avatar Feb 27 '23 21:02 dberenbaum