llama-cpp-python
Workflow update
Add CPU wheels with AVX, AVX2, and AVX512 with OpenBLAS, and remove unnecessary 32-bit wheels (a sketch of the corresponding build matrix follows the list below):
- Without AVX: Ubuntu, Windows => 32 bits, macOS => 64 bits
- AVX: Ubuntu, Windows, macOS => 32/64 bits
- AVX2: Ubuntu, Windows, macOS => 64 bits
- AVX512: Ubuntu, Windows, macOS => 64 bits
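A minimal sketch of how that matrix could be laid out in a GitHub Actions job, assuming the wheels are built by passing llama.cpp's instruction-set options through CMAKE_ARGS; the job name, runner list, and exact flag spellings are illustrative, not the PR's actual workflow (recent llama.cpp versions also renamed the LLAMA_* options to GGML_*):

  build-cpu-wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        isa:
          # each entry selects the instruction-set flags fed to llama.cpp's CMake build
          - { name: avx,    cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=OFF -DLLAMA_AVX512=OFF" }
          - { name: avx2,   cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF" }
          - { name: avx512, cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=ON" }
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Build ${{ matrix.isa.name }} wheel
        env:
          # OpenBLAS flags are likewise illustrative; option names depend on the vendored llama.cpp
          CMAKE_ARGS: "${{ matrix.isa.cmake }} -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
        run: python -m pip wheel . -w dist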
- CUDA compiled with AVX
- Remove Python 3.8
- Remove macos-11 (deprecated)
- Add Python 3.9 when missing
- Upgrade macos-13 to macos-latest in tests
- Upgrade ubuntu-20.04 to ubuntu-latest
- Upgrade windows-2019 to windows-latest
- Refactoring of Metal building
Tests (11 May 2024):
- CPU Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044907773
- CUDA Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044908928
- Metal Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044910394
https://github.com/abetlen/llama-cpp-python/issues/1342#issuecomment-2054099460
I'll paste my comment here, and maybe we can open a new discussion. Basically, I'm concerned about the size of releases ballooning with the number of prebuilt wheel variants. I had some suggestions for long-term solutions there, but I'm not sure what the right approach is.
Anecdotally, @oobabooga claims to have run into issues with GitHub throttling his prebuilt-wheel repo because of this.
If you generate too many wheels, there is a 100% chance you will hit a storage quota, and GitHub will ask you to start paying for storage or else your wheels will fail to upload. It's not too expensive (a few $ a month at most), but it's worth keeping in mind.
I avoided the API rate-limit problems by adding a timer step in my YAML:
- name: ⌛ rate 1
  shell: pwsh
  run: |
    # add random sleep since we run on a fixed schedule
    sleep (get-random -max 1200)
    # get the currently authenticated user's rate limit info
    $rate = gh api rate_limit | convertfrom-json | select -expandproperty rate
    # if we don't have at least 400 requests left, wait until reset
    if ($rate.remaining -lt 400) {
        $wait = ($rate.reset - (Get-Date (Get-Date).ToUniversalTime() -UFormat %s))
        echo "Rate limit remaining is $($rate.remaining), waiting for $($wait) seconds to reset"
        sleep $wait
        $rate = gh api rate_limit | convertfrom-json | select -expandproperty rate
        echo "Rate limit has reset to $($rate.remaining) requests"
    }
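One note on the step above: gh refuses to call the API from a runner without a token, so the step also needs one exposed through its environment, for example:

- name: ⌛ rate 1
  shell: pwsh
  env:
    # standard Actions token; without it, gh api cannot authenticate on the runner
    GH_TOKEN: ${{ github.token }}
  run: |
    # ... same script as above ...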
https://github.com/Smartappli/serge-wheels/actions
Not enabling AVX penalizes llama-cpp-python performance for both the CPU and CUDA builds.
Maybe the list can be shrunk down a bit. For example:
- Not many people have AVX512, remove until there's enough demand.
- Make AVX support the minimum?
- Remove python3.8, it's EOL in a few months.
@Smartappli Your changes are adding AVX for the CUDA wheels, is that needed? At that point the user is using the GPU.
It makes sense for the basic wheels to have AVX and AVX2 variants, not so much for the CUDA ones.
I copy that, thx @gaby
In summary: AVX and AVX2 on CPU are enough.
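For illustration only (llama.cpp has used both LLAMA_CUBLAS and, later, LLAMA_CUDA for the GPU option, so treat the flags as an assumption rather than the PR's exact change), keeping just the AVX baseline on the CUDA wheels could look like this:

- name: Build CUDA wheel
  env:
    # the GPU does the heavy lifting, so only the AVX baseline is kept on the CPU side
    CMAKE_ARGS: "-DLLAMA_CUBLAS=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=OFF -DLLAMA_AVX512=OFF"
  run: python -m pip wheel . -w dist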
@abetlen workflow update done
@gaby @oobabooga @abetlen What do you think?
Up to @abetlen. I was going to mention that the current CI (outside of this PR) is building i386 and win32 wheels; is that even necessary?
@gaby Makes sense
@gaby Before: no AVX -> 32 bits, AVX -> 32+64 bits, AVX2 -> 64 bits, AVX512 -> 64 bits
@gaby @oobabooga @abetlen What do you think?
@gaby thx for the code review
Ping @gaby
It's up to @abetlen :-)
Hey @Smartappli will review soon.
@abetlen I found another improvement: if you look at the release for Metal wheels https://github.com/abetlen/llama-cpp-python/releases/tag/v0.2.71-metal
it's publishing x86_64 wheels, but that's not a platform with Metal. It should only be aarch64/arm64.
@gaby Can you not run Metal on Intel Macs? I assumed that was possible. Additionally, the Metal wheels are actually fairly small / fast to build.
@abetlen It only works with Intel UHD and some AMD GPUs. But Apple devices are arm64.
@gaby @abetlen x86_64 architecture removed
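For reference, if the Metal wheels were built through cibuildwheel (an assumption; the actual workflow may drive CMake directly), restricting them to Apple Silicon is a one-line setting:

- name: Build Metal wheels (Apple Silicon only)
  env:
    # Metal wheels target Apple GPUs only, so skip the x86_64 build entirely
    CIBW_ARCHS_MACOS: "arm64"
    CIBW_ENVIRONMENT: 'CMAKE_ARGS="-DLLAMA_METAL=ON"'
  run: |
    python -m pip install cibuildwheel
    python -m cibuildwheel --output-dir wheelhouse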
ping @gaby
@abetlen can you review plz?
Hey @Smartappli, thanks for your patience and the PR. Busy month, so I'm just catching up on open PRs right now. Do you mind splitting this one up into two, with one that includes the following:
CUDA compiled with AVX
Remove Python 3.8
Remove macos-11 deprecated
Add python 3.9 when missing
Upgrade macos-13 to macos-latest in tests
Upgrade ubuntu-20.04 to ubuntu-latest
Upgrade windows-2019 to windows-latest
refactoring of metal building
and another just for the CPU wheel changes?
@abetlen Done: https://github.com/abetlen/llama-cpp-python/pull/1515
Has anyone managed to fix the CUDA workflows? Mine keep failing with this error:
C:\Miniconda3\envs\build\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk. [C:\Users\runneradmin\AppData\Local\Temp\tmpwbsbwtdg\build\CMakeFiles\CMakeScratch\TryCompile-uh6ciq\cmTC_cbbed.vcxproj]
See https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/actions/runs/9457447475/job/26051277254
I see that @abetlen's workflow also fails with the same error: https://github.com/abetlen/llama-cpp-python/actions/runs/9457182450/job/26051175939
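Not a confirmed fix, but since the error message itself names nvcc's -allow-unsupported-compiler escape hatch, one thing to try is forwarding that flag through the build (at your own risk, per the nvcc warning), or alternatively pinning the Windows runner/MSVC toolset to a version the installed CUDA toolkit supports. A hedged sketch of the first option:

- name: Build CUDA wheel (workaround sketch, untested)
  env:
    # forwards the override flag named in the error; may mask real compiler incompatibilities
    CMAKE_ARGS: "-DLLAMA_CUBLAS=ON -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler"
  run: python -m pip wheel . -w dist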