syskn

8 comments by syskn

_RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory_ When this message shows up, it usually implies that one of the checkpoint files is incomplete (e.g. broken during transfer)....
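A quick way to locate the incomplete file is to test each checkpoint shard as a zip archive, since the "central directory" in the error is the index at the end of a zip file. The following is a minimal sketch, assuming the checkpoints were saved in PyTorch's zip-based format (it does not apply to safetensors or legacy pickle files); `find_broken_checkpoints` is a hypothetical helper name, not part of any library.

```python
import zipfile
from pathlib import Path


def find_broken_checkpoints(paths):
    """Return the paths that cannot be read as complete zip archives."""
    broken = []
    for path in paths:
        path = Path(path)
        # A truncated file has no end-of-central-directory record,
        # so is_zipfile() returns False for it.
        if not zipfile.is_zipfile(path):
            broken.append(path)
            continue
        # testzip() re-reads every member and returns the name of the
        # first one with a bad CRC, or None if all members are intact.
        with zipfile.ZipFile(path) as archive:
            if archive.testzip() is not None:
                broken.append(path)
    return broken
```

For example, `find_broken_checkpoints(Path("model_dir").glob("*.bin"))` would list the shards worth re-downloading, where `model_dir` is a placeholder for the local checkpoint directory.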

I too can confirm that this issue persists with the default settings of 4GB swap space, in the first release version and the most recent versions.

Might be related: https://github.com/vllm-project/vllm/issues/667

I'm having the same issue: checkpointing simply hangs with multiple GPUs after just over 100 steps with ZeRO 1. I tried various batch sizes and allgather bucket sizes.

I noticed this with a NeoX model I quantized (Pythia 128g, desc_act = False, CUDA). Inference is at least 4x slower than usual on A100-80GB with both CPU (single core) and...

@TheBloke Interesting, so a severe performance dip **should not** happen unless desc_act is True. It's strange because I had it explicitly set to False and still experienced a severe slowdown. H100 means you...

@Ph0rk0z I had to quantize it with a very large number of examples (3072) to get the final avg loss on attn_out/attention.dense below 40 and the qkv loss below 100. cache_examples_on_gpu must...

Probably this: https://github.com/vllm-project/vllm/issues/546 For the record, I wasn't able to fix this particular issue.