lecture-jax
MAINT: use GitHub-based GPU instance
This PR makes use of the Tesla T4 GPU instance now available on GitHub Actions as a beta offering.
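For anyone testing the new runner, a minimal sanity check (just a sketch, not part of this PR) that the T4 is actually visible to JAX at the start of a build could look like:

```python
# Minimal sanity check that JAX sees the GPU on the GitHub Actions runner.
# Sketch only -- not part of this PR.
import jax

print(jax.__version__)
print(jax.default_backend())   # expect "gpu" on the T4 runner
print(jax.devices())           # expect a single CUDA device, e.g. [CudaDevice(id=0)]
```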
Deploy Preview for incomparable-parfait-2417f8 ready!
| Name | Link |
|---|---|
| Latest commit | a672c9736c4dc74b8a8dec800b0abf16ebf1b02c |
| Latest deploy log | https://app.netlify.com/sites/incomparable-parfait-2417f8/deploys/666a2c387abc8e0008e222c7 |
| Deploy Preview | https://deploy-preview-181--incomparable-parfait-2417f8.netlify.app |
🚀 Deployed on https://666a36597c14415a37c64edb--incomparable-parfait-2417f8.netlify.app
- [x] re-enable build cache
- [x] remove `docker` dependency and test local builds using `anaconda` (simpler)
The results between EC2 (left) and GitHub Actions (right)
@jstac there is a really interesting mix of timing results here between the V100 on EC2 and the T4 on GitHub. Many times are lower, with a few exceptions such as wealth_dynamics. I will try to understand the root causes.
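One caveat when comparing these timings: JAX dispatches asynchronously and the first call to a jitted function includes compilation, so hardware comparisons are only meaningful if compile time is separated out and results are blocked on. A rough sketch of the kind of timing harness I have in mind (the workload and array sizes are arbitrary placeholders):

```python
# Sketch: time a JAX computation while separating JIT compile time from
# steady-state run time. Asynchronous dispatch means we must call
# block_until_ready() before stopping the clock.
import time
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4000, 4000))

@jax.jit
def f(a):
    return (a @ a.T).sum()

# First call includes compilation.
t0 = time.perf_counter()
f(x).block_until_ready()
print("compile + run:", time.perf_counter() - t0)

# Subsequent calls measure the kernel itself.
t0 = time.perf_counter()
f(x).block_until_ready()
print("run only:     ", time.perf_counter() - t0)
```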
Just triggered a new publish so we are comparing like for like. (https://github.com/QuantEcon/lecture-jax/releases/tag/publish-2024may09)
This is interesting @mmcky! Now that we are trialling GA's GPU, shall we compare the costs we were incurring through AWS with what we now pay on GA? We may need to figure out how to compare the costs, since I believe it will depend on the frequency of commit pushes.
that's right @kp992 -- the pricing is:
| Service | Cost | Units |
|---|---|---|
| EC2 (p2.xlarge) | $0.90 | per instance hour |
| GA (Ubuntu GPU 4-core) | $0.07 | per minute |
So if we have a 10 minute job then:

| Service | Cost |
|---|---|
| EC2 (p2.xlarge) | $0.90 |
| GA (Ubuntu GPU 4-core) | $0.70 |
So the pricing really depends on the frequency of long runs vs short runs. Honestly, while the per-hour price on GA is a LOT higher, I think the totals will work out to be pretty similar.
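For reference, a back-of-envelope break-even under the billing models above (a sketch; assumes EC2 bills a full instance hour, as in the table):

```python
# Break-even between per-hour EC2 billing and per-minute GA billing,
# using the prices listed above.
EC2_PER_HOUR = 0.90   # p2.xlarge, per instance hour
GA_PER_MIN = 0.07     # GA Ubuntu GPU 4-core, per minute

break_even = EC2_PER_HOUR / GA_PER_MIN
print(f"GA is cheaper for jobs shorter than ~{break_even:.1f} minutes")
# ~12.9 minutes -- e.g. a 10 minute job is $0.70 on GA vs $0.90 on EC2
```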
@kp992 this is the like-for-like time comparison now with the current live site.
Still an interesting mix of performance differences.
Machine Details:
EC2:
| Name | GPUs | vCPUs | RAM (GiB) | Network Bandwidth | Price/Hour | RI Price/Hour |
|---|---|---|---|---|---|---|
| p2.xlarge | 1 | 4 | 61 | High | $0.900 | $0.425 |
GitHub:
| CPU | GPU | GPU card | Memory (RAM) | GPU memory (VRAM) | Storage (SSD) | Operating system (OS) |
|---|---|---|---|---|---|---|
| 4 | 1 | Tesla T4 | 28 GB | 16 GB | 176 GB | Ubuntu, Windows |
So it appears we are running on a machine with less RAM which is interesting.
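If the smaller memory footprint turns out to matter, one knob worth knowing about is JAX's GPU memory preallocation (by default it grabs roughly 75% of VRAM on startup), which can be tuned through standard XLA environment variables. A sketch:

```python
# These environment variables must be set before jax is imported.
import os

os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # allocate GPU memory on demand
# or, alternatively, cap the preallocated fraction of the T4's 16 GB:
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"

import jax
print(jax.devices())
```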
- [x] remove the docker container layer to see if that speeds up compute times
Currently the kernel is dying when installing directly onto the VM provided by GitHub (rather than using our docker container). It would be quicker and more efficient to get this route working.
Thanks @mmcky, are we moving forward with moving to GitHub Actions VMs for all our repos that currently use AWS?
I would like to if we can -- that is less to maintain. But currently I am getting issues with kernels dying, which suggests that the jax install isn't working properly (without a container).
The driver versions under docker are:
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.3
and when using the native VM
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
so the CUDA version is likely causing the issue?
@kp992 any ideas on why the jupyter kernel is dying when running directly on the VM but the docker container is OK?
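One way to narrow this down might be to run a small script in a plain `python` process on the VM (outside Jupyter) so the failure is not hidden by the kernel dying. A sketch, assuming a recent jax that provides `print_environment_info()`:

```python
# Run directly with `python check_gpu.py` on the VM so that any crash is
# visible on the console instead of just killing the Jupyter kernel.
import jax

jax.print_environment_info()   # jax/jaxlib versions, devices, nvidia-smi output

x = jax.numpy.ones((1000, 1000))
print((x @ x).sum())           # forces the CUDA backend to actually execute something
```

Running it under `python -X faulthandler` should also dump a traceback if the process is dying on a fatal signal rather than a Python exception.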
- [x] @mmcky can we host the docker container on GitHub to speed up the compute?
Thanks @mmcky, I will try to look into it. I will create a new PR on top of these commits so I can test and play around separately.
@mmcky working on using GitHub's container registry to store the docker container here:
https://github.com/QuantEcon/lecture-python.docker/pull/4
@kp992 the fetch from GitHub's container registry is about 10 min. That is pretty good, right?
@kp992 it looks like these instances may have CUDA=12.3 installed. Our docker is configured for CUDA=12.5, so there are a lot of ptxas warnings. We may need to adjust the Docker container to match this context (or upgrade the CUDA drivers). I think a CUDA upgrade would require a reboot, but I am looking into it.
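A quick way to confirm the mismatch behind the ptxas warnings is to compare what the driver advertises with the toolkit shipped in the container. A rough sketch, assuming both `nvidia-smi` and `ptxas` are on the container's PATH:

```python
# Compare the driver's supported CUDA version (nvidia-smi) with the
# toolkit version inside the container (ptxas --version).
import subprocess

for cmd in (["nvidia-smi"], ["ptxas", "--version"]):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    print(" ".join(cmd) + ":")
    print("\n".join(line for line in out.splitlines()
                    if "Version" in line or "release" in line))
```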
@kp992 looks like the newer CUDA driver is working. Will post a speed comparison with the current live site once I get the preview.
@jstac and @kp992 here are the latest results from moving our computations to the GitHub-based GPU instance. LHS = current live site (built on EC2) and RHS = this PR (built on GitHub Actions using the CUDA=12.5 driver). Many times are improved except for Wealth Dynamics (@kp992 would you mind reviewing this lecture to see why this might be?)
Thanks @mmcky , good to know.
Thanks @mmcky, this looks great. I can look at the wealth dynamics timings difference.
Thanks @jstac and @kp992. I am doing one final round of review on this, and then I will migrate to GitHub instances for this lecture series as well.
- [x] check this closely as `nvidia-smi` is reporting the following and the docker container is using `CUDA=12.5`:
NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.3
AH HA! That page hasn't been re-executed, as the date is from the 6th of June. This will refresh in a full build.
@kp992 I think this is ready. If you can cast your eye over it one more time then I'll merge.