Fail while pushing with --separate-weights

Open wnakano opened this issue 1 year ago • 17 comments

Today I started hitting the following issue when running cog push --separate-weights, although I was able to push the model without the --separate-weights flag.

In the error output below, I replaced the project and model names with <project-name> and <model-name>, respectively.

$ cog push --separate-weights
⚠ Cog doesn't know if CUDA 11.2.2 is compatible with PyTorch 1.13.1. This might cause CUDA problems.
Building Docker image from environment in cog.yaml as r8.im/<project-name>/<model-name> ...
Weights unchanged, skip rebuilding and use cached image...
[+] Building 4.0s (7/7) FINISHED                                                      docker:default
 => [internal] load .dockerignore                                                               0.0s
 => => transferring context: 22.25kB                                                            0.0s
 => [internal] load build definition from Dockerfile                                            0.0s
 => => transferring dockerfile: 4.41kB                                                          0.0s
 => resolve image config for docker.io/docker/dockerfile:1.4                                    1.6s
 => CACHED docker-image://docker.io/docker/dockerfile:1.4@sha256:9ba7531bd80fb0a858632727cf7a1  0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04          1.2s
 => ERROR [internal] load metadata for r8.im/<project-name>/<model-name>  2.2s
 => [auth] <project-name>/<model-name> -weights:pull token for r8.im    0.0s
------
 > [internal] load metadata for r8.im/<project-name>/<model-name>-weights:latest:
------
Dockerfile:2
--------------------
   1 |     #syntax=docker/dockerfile:1.4
   2 | >>> FROM r8.im/<project-name>/<model-name>-weights AS weights
   3 |     FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04
   4 |     ENV DEBIAN_FRONTEND=noninteractive
--------------------
ERROR: failed to solve: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://r8.im/_token?scope=repository%3A<project-name>%2F<model-name>-weights%3Apull&service=us-docker.pkg.dev: 404 Not Found
ⅹ Failed to build runner Docker image: Failed to build Docker image: exit status 1

wnakano avatar Sep 29 '23 18:09 wnakano

same here!

andreemic avatar Oct 16 '23 15:10 andreemic

facing same issue

usamaehsan avatar Nov 11 '23 16:11 usamaehsan

Facing the same issue today, but a week ago it worked fine with --separate-weights.

a-sane avatar Dec 11 '23 17:12 a-sane

can we get some help on this?

ynie avatar Dec 14 '23 16:12 ynie

@hongchaodeng I saw you implemented this feature. Do you know what's going on? Thank you so much!!

ynie avatar Dec 14 '23 16:12 ynie

I got a similar error, but deleting "path/to/your/cog_project/.dockerignore" and "path/to/your/cog_project/.dockerignore/.cog" files solved it for me.

masahiro-koga-jai avatar Jan 31 '24 09:01 masahiro-koga-jai

I faced a similar issue too. The docker build was failing to find the copied data.

 => ERROR [1/4] COPY checkpoints/canny /src/checkpoints/canny            0.0s
 => ERROR [2/4] COPY checkpoints/ip_adapter /src/checkpoints/ip_adapter  0.0s
 => ERROR [3/4] COPY checkpoints/tile /src/checkpoints/tile              0.0s
 => ERROR [4/4] COPY checkpoints/vae /src/checkpoints/vae
...
Dockerfile:11
--------------------
   9 |     COPY checkpoints/canny /src/checkpoints/canny
  10 |     COPY checkpoints/ip_adapter /src/checkpoints/ip_adapter
  11 | >>> COPY checkpoints/vae /src/checkpoints/vae
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 46e45d4e-74bc-4316-b8d3-ef813683c1c8::umpry926pu2og534hz3uqwpxt: "checkpoints/vae": not found

even though the path actually existed.

hervenivon avatar May 13 '24 16:05 hervenivon

I stopped using replicate due to the poor tech support and framework.

ynie avatar May 13 '24 16:05 ynie

What are you using as a replacement?

hervenivon avatar May 13 '24 18:05 hervenivon

Runpod is way better with better support.

ynie avatar May 13 '24 18:05 ynie

PS: like @masahiro-koga-jai, deleting the .dockerignore solved it for me. The .dockerignore is updated during cog build, and it obviously conflicts.

hervenivon avatar May 13 '24 18:05 hervenivon

@ynie @hervenivon This and some other issues lead to a frustrating DX on Replicate, but YMMV building on Runpod. Personally my experience matches the reports here https://www.reddit.com/r/LocalLLaMA/comments/17il9n3/experience_on_runpod/

(I would definitely prefer Runpod's 4090s over A40s for image gen; they're half the price and twice as fast.)

emcmanus avatar May 14 '24 00:05 emcmanus

You may also need to rm -r .cog/. I believe I got this error after a bad cog push --separate-weights.

My guess is that r8.im/<project-name>/<model-name>-weights only gets created on the first invocation.

Deleting Cog's build folder seems to have forced it to create the missing image.
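
A minimal sketch of the cleanup discussed in this thread, for anyone who wants to script it. It assumes the project root contains the .dockerignore and .cog/ folder mentioned above. The function name reset_cog_build_state and the .dockerignore.bak backup name are hypothetical, not part of Cog, and moving the file aside instead of deleting it is just a cautious choice in case it holds hand-written rules.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cog): clear the local state that this
# thread reports as interfering with `cog push --separate-weights`.
reset_cog_build_state() {
  local project_dir="${1:-.}"

  # Keep a backup instead of deleting outright, in case the file held
  # hand-written ignore rules (the backup name is an arbitrary choice).
  if [ -f "$project_dir/.dockerignore" ]; then
    mv "$project_dir/.dockerignore" "$project_dir/.dockerignore.bak"
    echo "moved .dockerignore to .dockerignore.bak"
  fi

  # Remove Cog's local build folder, as suggested above.
  if [ -d "$project_dir/.cog" ]; then
    rm -r "$project_dir/.cog"
    echo "removed .cog/"
  fi
}

# Usage, from the project root:
#   reset_cog_build_state . && cog push --separate-weights
```

Whether this actually forces the missing -weights image to be recreated is only what the reports here suggest, so treat it as something to try, not a confirmed fix.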

emcmanus avatar May 14 '24 03:05 emcmanus

I'm still shocked that this is still an issue after so many months. I remember wasting so many hours trying to fix this. Does anyone working at Replicate care?

ynie avatar May 14 '24 04:05 ynie

Based on their Discord, my sense is they're absolutely swamped by end-users who mostly want to use the web frontends for various tools. Ideally Replicate knows this is not their core business, but I'm not so sure. I suspect they're feeling stronger PMF on the front-end than on the infra side of things.

emcmanus avatar May 14 '24 05:05 emcmanus

Actually, I find cog super convenient for some of the projects I'm working on, but I do agree that the UX has some flaws.

Glad to find support in the community. Thanks! 🙏

hervenivon avatar May 14 '24 12:05 hervenivon

Yes, same as @hervenivon and @masahiro-koga-jai: I had added .cog/ to my .dockerignore file, and removing that entry solved the problem for me.
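
For anyone who would rather check for that entry than delete the whole file, here is a rough sketch. The helper name find_cog_ignore_entries is made up for illustration, and the pattern only catches plain entries such as .cog, .cog/, /.cog or ./.cog/. It is not a full implementation of .dockerignore matching, so wildcard rules that also cover .cog will not be reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cog or Docker): print the numbered lines of
# a .dockerignore file that plainly exclude the .cog folder.
# Returns non-zero if the file is missing or no such line is found.
find_cog_ignore_entries() {
  local ignore_file="${1:-.dockerignore}"
  [ -f "$ignore_file" ] || return 1
  # Matches ".cog", ".cog/", "/.cog", "./.cog/" with optional surrounding
  # whitespace; wildcard patterns are deliberately not handled.
  grep -nE '^[[:space:]]*(\./|/)?\.cog/?[[:space:]]*$' "$ignore_file"
}

# Usage, from the project root:
#   find_cog_ignore_entries .dockerignore && echo "remove the lines listed above"
```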

narendraadloid avatar Jun 04 '24 13:06 narendraadloid