Randy Gobbel
I changed the default value of `max_steps` (in `doorkey.py`) from `10 * size * size` to `100 * size * size`, and it works fine. I also increased the number...
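For anyone who wants the same change without patching `doorkey.py`, here is a minimal sketch, assuming the `gymnasium` and `minigrid` packages; `DoorKeyEnv` accepts `max_steps` as a constructor keyword that `gym.make` passes through, and the environment id and random-action loop below are just illustrative:

```python
import gymnasium as gym
import minigrid  # noqa: F401 -- importing registers the MiniGrid-* envs

size = 16
# DoorKeyEnv defaults to max_steps = 10 * size**2, which truncates episodes
# before the agent can reliably reach the goal on larger grids;
# 100 * size**2 leaves it enough room.
env = gym.make("MiniGrid-DoorKey-16x16-v0", max_steps=100 * size * size)

obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    # Random policy, just to exercise the longer episode limit.
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```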
> Seems like this should be reported to the Jupyter repo, not IJulia?

I thought this was specifically an IJulia problem, but as far as I can tell everything works...
I'm not sure of the complete set, but `cuda-python` and `mlc` were definitely failing. There are only a few packages in the entire collection that have these: `cuda-python`, `mlc`, `onnxruntime`,...
I'm running JetPack 6 (L4T r36.2.0).
> Can you elaborate on why you want this? Unlike TF and PyTorch, Flux runs just fine on machines without CUDA-enabled GPUs and even functionality like moving arrays to GPU...
> > ... the dependency on CUDA also makes it impossible to install up-to-date versions of other packages that have a (legacy, IMO) dependence on CUDA, e.g. `TensorOperations.jl`.
> > ...
I ran it by hand through the stages that had been handled by `mlc_llm.build`, and with a little extra massaging (specifying the chat template on the command line, for example), it's working!...
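In case it helps anyone else, this is roughly the per-stage flow that replaces the old `mlc_llm.build` entry point. A sketch only: it assumes a current `mlc_llm` wheel with the `convert_weight` / `gen_config` / `compile` subcommands, and the model path, output directory, quantization, and `--conv-template` value (where the chat template gets specified) are placeholders for whatever your model needs:

```python
import subprocess

MODEL = "dist/models/Mixtral-8x7B-Instruct-v0.1"  # placeholder HF checkout
OUT = "dist/Mixtral-8x7B-Instruct-q4f16_1"
QUANT = "q4f16_1"

def run(*args):
    # Echo and run one stage, failing loudly if it errors out.
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Stage 1: quantize and convert the weights (formerly done inside mlc_llm.build).
run("mlc_llm", "convert_weight", MODEL, "--quantization", QUANT, "-o", OUT)

# Stage 2: generate mlc-chat-config.json; this is where the chat template
# has to be given explicitly (--conv-template in current MLC releases).
run("mlc_llm", "gen_config", MODEL, "--quantization", QUANT,
    "--conv-template", "mistral_default", "-o", OUT)

# Stage 3: compile the model library for the local GPU.
run("mlc_llm", "compile", f"{OUT}/mlc-chat-config.json",
    "--device", "cuda", "-o", f"{OUT}/Mixtral-8x7B-q4f16_1-cuda.so")
```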
> Oh that's great @rgobbel! What kind of tokens/sec do you get out of it? On Llama-2-70B I get a max of ~5 tokens/sec on AGX Orin 64GB. I haven't...
Ok, here's what I got:

model | quantization | input tokens | output tokens | prefill time | prefill rate | decode time | decode rate | memory
-- | -- | -- | -- | -- | -- | -- | -- | --
...
Ok, exactly which `local_llm` image is that (with Mixtral support working correctly)? The default image (`dustynv/local_llm:r36.2.0`) tries to use `mlc_llm.build`, which then errors out, partly because Mixtral is not in...