
MAINT: use Github based GPU instance

Open mmcky opened this issue 1 year ago • 14 comments

This PR makes use of the Tesla T4 instance now available on GitHub Actions as a beta instance.
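For context, a minimal sanity check of the kind the workflow could run, a sketch only, assuming JAX is already installed on the runner:

```python
# Hypothetical sanity check (not part of this PR): confirm that JAX can see
# the GPU exposed by the GitHub Actions T4 runner before building lectures.
import jax

print("JAX backend:", jax.default_backend())  # expect "gpu"/"cuda" on the T4 runner
print("Devices:", jax.devices())

# Fail early if JAX silently fell back to the CPU.
if jax.default_backend() == "cpu":
    raise SystemExit("No GPU visible to JAX -- aborting the build.")
```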

mmcky avatar May 09 '24 04:05 mmcky

Deploy Preview for incomparable-parfait-2417f8 ready!

| Name | Link |
| --- | --- |
| Latest commit | a672c9736c4dc74b8a8dec800b0abf16ebf1b02c |
| Latest deploy log | https://app.netlify.com/sites/incomparable-parfait-2417f8/deploys/666a2c387abc8e0008e222c7 |
| Deploy Preview | https://deploy-preview-181--incomparable-parfait-2417f8.netlify.app |

netlify[bot] avatar May 09 '24 04:05 netlify[bot]

🚀 Deployed on https://666a36597c14415a37c64edb--incomparable-parfait-2417f8.netlify.app

github-actions[bot] avatar May 09 '24 05:05 github-actions[bot]

  • [x] re-enable build cache
  • [x] remove docker dependency and test local builds using anaconda (simpler)

mmcky avatar May 09 '24 05:05 mmcky

The results between EC2 (left) and GitHub Actions (right)

[Screenshot: timing comparison, EC2 vs GitHub Actions, 2024-05-09]

@jstac there is a really interesting mix of timing results here between the V100 on EC2 and the T4 on GitHub. Many times are lower, with a few exceptions such as wealth_dynamics. I will try to understand the root causes.
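For anyone digging into the root causes, a rough sketch of how a single computation could be timed consistently on both GPUs (illustrative only; `demo_kernel` is a stand-in, not a lecture function):

```python
# Illustrative timing harness: warm up once to exclude JIT compilation,
# then time a steady-state run with block_until_ready so asynchronous
# dispatch doesn't hide the real GPU time.
import time
import jax
import jax.numpy as jnp

@jax.jit
def demo_kernel(x):
    # Stand-in workload; any lecture computation could be dropped in here.
    return jnp.linalg.norm(x @ x.T)

x = jax.random.normal(jax.random.PRNGKey(0), (4000, 4000))

demo_kernel(x).block_until_ready()  # warm-up run (includes compile time)

start = time.perf_counter()
demo_kernel(x).block_until_ready()  # steady-state run
print(f"steady-state time: {time.perf_counter() - start:.3f}s")
```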

mmcky avatar May 09 '24 06:05 mmcky

Just triggered a new publish so we are comparing like for like. (https://github.com/QuantEcon/lecture-jax/releases/tag/publish-2024may09)

mmcky avatar May 09 '24 07:05 mmcky

This is interesting @mmcky! Now that we are trialling GA's GPU runners, shall we compare the costs we were incurring through AWS with the costs on GA? We may need to figure out how to compare them, since I believe it will depend on the frequency of commit pushes.

kp992 avatar May 09 '24 10:05 kp992

> This is interesting @mmcky! Now that we are trialling GA's GPU runners, shall we compare the costs we were incurring through AWS with the costs on GA? We may need to figure out how to compare them, since I believe it will depend on the frequency of commit pushes.

that's right @kp992 -- the pricing is:

| Service | Cost | Units |
| --- | --- | --- |
| EC2 (p2.xlarge) | $0.90 | per instance hour |
| GA (Ubuntu GPU 4-core) | $0.07 | per minute |

So if we have a 10 minute job then

| Service | Cost |
| --- | --- |
| EC2 (p2.xlarge) | $0.90 |
| GA (Ubuntu GPU 4-core) | $0.70 |

so the pricing really depends on the mix of long runs vs short runs. Honestly, while the per-hour price on GA is a LOT higher, I think it will work out to be pretty similar.
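As a rough back-of-the-envelope check using the prices quoted above (and assuming, as the table does, that a short EC2 job still incurs the full hourly charge):

```python
# Back-of-the-envelope cost comparison based on the prices quoted above.
EC2_PER_HOUR = 0.90   # p2.xlarge, on-demand
GA_PER_MINUTE = 0.07  # GitHub Actions Ubuntu GPU 4-core

def job_cost(minutes):
    # Assumption: EC2 is charged for at least one full hour per job,
    # matching the $0.90 figure for the 10-minute example above.
    ec2 = EC2_PER_HOUR * max(1, minutes / 60)
    ga = GA_PER_MINUTE * minutes
    return ec2, ga

for minutes in (10, 13, 30, 60):
    ec2, ga = job_cost(minutes)
    print(f"{minutes:>3} min job: EC2 ${ec2:.2f} vs GA ${ga:.2f}")

# Inside the first hour the break-even point is 0.90 / 0.07 ≈ 12.9 minutes,
# so GA is cheaper for jobs shorter than roughly 13 minutes.
```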

mmcky avatar May 09 '24 23:05 mmcky

@kp992 here are the like-for-like time comparisons with the current live site.

[Screenshot: like-for-like timing comparison vs the current live site, 2024-05-10]

still an interesting mix of performance differences.

Machine Details:

EC2:

| Name | GPUs | vCPUs | RAM (GiB) | Network Bandwidth | Price/Hour* | RI Price/Hour** |
| --- | --- | --- | --- | --- | --- | --- |
| p2.xlarge | 1 | 4 | 61 | High | $0.900 | $0.425 |

Github:

| CPU | GPU | GPU card | Memory (RAM) | GPU memory (VRAM) | Storage (SSD) | Operating system (OS) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 1 | Tesla T4 | 28 GB | 16 GB | 176 GB | Ubuntu, Windows |

So it appears we are running on a machine with less RAM, which is interesting.
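If it helps to confirm what the runner actually provides, here is a small sketch (assuming `psutil` is available in the build environment) that logs RAM and VRAM from inside the job:

```python
# Sketch: log the runner's RAM and GPU memory so machine specs can be
# compared across EC2 and GitHub Actions builds. Assumes psutil is installed.
import subprocess
import psutil

ram_gib = psutil.virtual_memory().total / 2**30
print(f"System RAM: {ram_gib:.1f} GiB")

# nvidia-smi reports the GPU name and total VRAM (in MiB).
gpu = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("GPU:", gpu.stdout.strip())
```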

mmcky avatar May 09 '24 23:05 mmcky

  • [x] remove the docker container layer to see if that speeds up compute times

mmcky avatar May 10 '24 01:05 mmcky

Currently the kernel is dying when installing directly onto the VM provided by GitHub (rather than using our docker container). It would be quicker and more efficient to get this route working.

mmcky avatar May 10 '24 02:05 mmcky

Thanks @mmcky, are we moving forward with moving all our repos that use AWS over to the GitHub Actions VM?

kp992 avatar May 13 '24 05:05 kp992

> Thanks @mmcky, are we moving forward with moving all our repos that use AWS over to the GitHub Actions VM?

I would like to if we can -- that is less to maintain. But currently I am getting issues with kernels dying, which suggests the jax install isn't working properly (without a container).
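One option, a sketch rather than an actual workflow step, would be a small smoke test run before executing the notebooks, so a broken CUDA install fails loudly in the log instead of killing the Jupyter kernel mid-build:

```python
# Hypothetical pre-build smoke test: force a real GPU kernel launch so a
# broken jax/CUDA install surfaces here rather than as a dead Jupyter kernel.
import jax
import jax.numpy as jnp

print("jax", jax.__version__)
print("devices:", jax.devices())   # should list a CUDA device

x = jnp.ones((1024, 1024))
y = (x @ x).block_until_ready()    # actually runs on the GPU
print("matmul OK, sum =", float(y.sum()))
```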

mmcky avatar May 13 '24 05:05 mmcky

The driver versions under docker are:

NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3    

and when using the native VM

 NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2  

so the difference in CUDA version is likely causing the issue?
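One way to check this from Python (assuming a reasonably recent JAX release) is the built-in environment dump, which prints the jax/jaxlib versions, detected devices, and the local nvidia-smi output side by side:

```python
# Environment dump to compare the CUDA version jaxlib was built against
# with what the driver reports (assumes a recent JAX release).
import jax

jax.print_environment_info()
```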

mmcky avatar May 21 '24 00:05 mmcky

@kp992 any ideas on why the jupyter kernel is dying when running directly on the VM but the docker container is OK?

  • [x] @mmcky can we host the docker container on github to speed up the compute?

mmcky avatar May 21 '24 01:05 mmcky

Thanks @mmcky, I will try to look into it. I will create a new PR on top of these commits so I can test and play around separately.

kp992 avatar May 21 '24 12:05 kp992

@mmcky working on using GitHub's container registry to store the docker container here: https://github.com/QuantEcon/lecture-python.docker/pull/4

mmcky avatar Jun 10 '24 06:06 mmcky

@kp992 the fetch from GitHub's container registry is about 10 min. That is pretty good, right?

mmcky avatar Jun 11 '24 04:06 mmcky

@kp992 it looks like these instances may have CUDA=12.3 installed. Our docker is configured for CUDA=12.5, so there are a lot of ptxas warnings. We may need to adjust the Docker container to match this environment (or upgrade the CUDA drivers). I think a CUDA upgrade would require a reboot, but I'm looking into it.

mmcky avatar Jun 11 '24 04:06 mmcky

@kp992 looks like the newer CUDA driver is working. Will post a speed comparison with the current live site once I get the preview.

mmcky avatar Jun 11 '24 05:06 mmcky

@jstac and @kp992 here are the latest results from moving our computations to the GitHub based GPU instance. LHS = current live site (built on EC2) and RHS = this PR (built on GitHub, using the CUDA=12.5 driver). Many times are improved except for Wealth Dynamics (@kp992 would you mind reviewing this lecture to see why this might be?)

[Screenshot: timing comparison, live site (EC2) vs this PR (GitHub Actions), 2024-06-11]

mmcky avatar Jun 11 '24 06:06 mmcky

Thanks @mmcky , good to know.

jstac avatar Jun 12 '24 00:06 jstac

Thanks @mmcky, this looks great. I can look at the wealth dynamics timings difference.

kp992 avatar Jun 12 '24 02:06 kp992

thanks @jstac and @kp992. I am doing one final round of review on this and then I will migrate to use github instances for this lecture series as well.

mmcky avatar Jun 12 '24 04:06 mmcky

  • [x] check this closely, as nvidia-smi is reporting the following while the docker container is using CUDA=12.5

NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.3

AH HA! That page hasn't been re-executed, as the date is from the 6th of June. This will refresh in a full build.

mmcky avatar Jun 12 '24 05:06 mmcky

@kp992 I think this is ready. If you can cast your eye over it one more time then I'll merge.

mmcky avatar Jun 12 '24 23:06 mmcky