```
from airllm import AirLLMLlamaMlx
import mlx.core as mx

MAX_LENGTH = 128
# could use hugging face model repo id:
model = AirLLMLlamaMlx("Qwen/Qwen-7B-Chat", layer_shards_saving_path='.cache')

input_text = [
    'I like',...
```

future work



I am using the following code to load microsoft-phi2, `from airllm import AutoModel`, and it fails with this error:

```
270, in split_and_save_layers
    if max(shards) > shard:
       ^^^^^^^^^^^
ValueError: max() arg is an empty sequence
```
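The `ValueError` in this traceback is the exception Python raises whenever `max()` is called on an empty sequence, which suggests that `shards` ended up empty for this model. The sketch below only illustrates that generic failure mode and one possible guard. It is not AirLLM's actual code: `highest_shard` and its `default` value are hypothetical names chosen for the example, and why the list is empty for phi-2 is not known from the report.

```python
def highest_shard(shards, default=-1):
    """Return the largest shard index, or `default` when the list is empty.

    Hypothetical helper for illustration only; not part of AirLLM.
    """
    # The built-in max() accepts a `default` keyword that is returned
    # instead of raising ValueError when the iterable is empty.
    return max(shards, default=default)


# Calling max() on an empty list raises ValueError, as in the traceback.
raised = False
try:
    max([])
except ValueError:
    raised = True

print(raised)
print(highest_shard([]))
print(highest_shard([0, 3, 1]))
```

A guard like this only avoids the crash. If no shards were discovered at all, the more useful fix is probably to find out why the checkpoint's layer names did not match what the splitter expected.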

future work

Does this support the Flan-T5 model? Thanks.


Hello, I can't help but ask whether you have ever tried to implement any parallelism strategies in this program to help inference in general, as far as being able...


Is there a way to quantize on macOS? bitsandbytes is not supported on Apple silicon. Can we use GGUF models?


```
from sys import platform
from airllm import AutoModel
import mlx.core as mx

assert platform == "darwin", "this example is supposed to be run on mac os"
# model =...
```


Mac M1 Max 32GB user here, without the ability to quantize with bitsandbytes. Is there a way to configure the chunk size so that inference is quicker? I think the 32GB...


I am attempting to run Llama13b using an NVIDIA GeForce RTX 3090, but the model never completes loading.

