
MAINT: use Github based GPU instance

Open mmcky opened this issue 1 year ago • 14 comments

This PR makes use of the Tesla T4 instance now available on GitHub Actions as a beta instance.
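For context, a minimal sanity check of the kind the workflow could run, a sketch only, assuming JAX is already installed on the runner:

```python
# Hypothetical sanity check (not part of this PR): confirm that JAX can see
# the GPU exposed by the GitHub Actions T4 runner before building lectures.
import jax

print("JAX backend:", jax.default_backend())  # expect "gpu"/"cuda" on the T4 runner
print("Devices:", jax.devices())

# Fail early if JAX silently fell back to the CPU.
if jax.default_backend() == "cpu":
    raise SystemExit("No GPU visible to JAX -- aborting the build.")
```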

mmcky avatar May 09 '24 04:05 mmcky

Deploy Preview for incomparable-parfait-2417f8 ready!

| Name | Link |
| --- | --- |
| Latest commit | a672c9736c4dc74b8a8dec800b0abf16ebf1b02c |
| Latest deploy log | https://app.netlify.com/sites/incomparable-parfait-2417f8/deploys/666a2c387abc8e0008e222c7 |
| Deploy Preview | https://deploy-preview-181--incomparable-parfait-2417f8.netlify.app |

netlify[bot] avatar May 09 '24 04:05 netlify[bot]

🚀 Deployed on https://666a36597c14415a37c64edb--incomparable-parfait-2417f8.netlify.app

github-actions[bot] avatar May 09 '24 05:05 github-actions[bot]

  • [x] re-enable build cache
  • [x] remove docker dependency and test local builds using anaconda (simpler)

mmcky avatar May 09 '24 05:05 mmcky

The results between EC2 (left) and GitHub Actions (right)

[Screenshot: timing comparison, EC2 vs GitHub Actions, 2024-05-09]

@jstac there is a really interesting mix of timing results here between the V100 on EC2 and the T4 on GitHub. Many times are lower, with a few exceptions such as wealth_dynamics. I will try to understand the root causes.
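For anyone digging into the root causes, a rough sketch of how a single computation could be timed consistently on both GPUs (illustrative only; `demo_kernel` is a stand-in, not a lecture function):

```python
# Illustrative timing harness: warm up once to exclude JIT compilation,
# then time a steady-state run with block_until_ready so asynchronous
# dispatch doesn't hide the real GPU time.
import time
import jax
import jax.numpy as jnp

@jax.jit
def demo_kernel(x):
    # Stand-in workload; any lecture computation could be dropped in here.
    return jnp.linalg.norm(x @ x.T)

x = jax.random.normal(jax.random.PRNGKey(0), (4000, 4000))

demo_kernel(x).block_until_ready()  # warm-up run (includes compile time)

start = time.perf_counter()
demo_kernel(x).block_until_ready()  # steady-state run
print(f"steady-state time: {time.perf_counter() - start:.3f}s")
```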

mmcky avatar May 09 '24 06:05 mmcky

Just triggered a new publish so we are comparing like for like. (https://github.com/QuantEcon/lecture-jax/releases/tag/publish-2024may09)

mmcky avatar May 09 '24 07:05 mmcky

This is interesting @mmcky! Now that we are trialling GA's GPU runners, shall we compare the costs we were incurring through AWS with the costs on GA? We may need to figure out how to compare them, since I believe it will depend on the frequency of commit pushes.

kp992 avatar May 09 '24 10:05 kp992

> This is interesting @mmcky! Now that we are trialling GA's GPU runners, shall we compare the costs we were incurring through AWS with the costs on GA? We may need to figure out how to compare them, since I believe it will depend on the frequency of commit pushes.

that's right @kp992 -- the pricing is:

| Service | Cost | Units |
| --- | --- | --- |
| EC2 (p2.xlarge) | $0.90 | per instance hour |
| GA (Ubuntu GPU 4-core) | $0.07 | per minute |

So if we have a 10 minute job then

| Service | Cost |
| --- | --- |
| EC2 (p2.xlarge) | $0.90 |
| GA (Ubuntu GPU 4-core) | $0.70 |

so the pricing really depends on the mix of long runs vs short runs. Honestly, while the per-hour price on GA is a LOT higher, I think it will work out to be pretty similar.
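As a rough back-of-the-envelope check using the prices quoted above (and assuming, as the table does, that a short EC2 job still incurs the full hourly charge):

```python
# Back-of-the-envelope cost comparison based on the prices quoted above.
EC2_PER_HOUR = 0.90   # p2.xlarge, on-demand
GA_PER_MINUTE = 0.07  # GitHub Actions Ubuntu GPU 4-core

def job_cost(minutes):
    # Assumption: EC2 is charged for at least one full hour per job,
    # matching the $0.90 figure for the 10-minute example above.
    ec2 = EC2_PER_HOUR * max(1, minutes / 60)
    ga = GA_PER_MINUTE * minutes
    return ec2, ga

for minutes in (10, 13, 30, 60):
    ec2, ga = job_cost(minutes)
    print(f"{minutes:>3} min job: EC2 ${ec2:.2f} vs GA ${ga:.2f}")

# Inside the first hour the break-even point is 0.90 / 0.07 ≈ 12.9 minutes,
# so GA is cheaper for jobs shorter than roughly 13 minutes.
```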

mmcky avatar May 09 '24 23:05 mmcky

@kp992 here are the like-for-like time comparisons with the current live site.

[Screenshot: like-for-like timing comparison vs the current live site, 2024-05-10]

still an interesting mix of performance differences.

Machine Details:

EC2:

| Name | GPUs | vCPUs | RAM (GiB) | Network Bandwidth | Price/Hour* | RI Price/Hour** |
| --- | --- | --- | --- | --- | --- | --- |
| p2.xlarge | 1 | 4 | 61 | High | $0.900 | $0.425 |

Github:

| CPU | GPU | GPU card | Memory (RAM) | GPU memory (VRAM) | Storage (SSD) | Operating system (OS) |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 1 | Tesla T4 | 28 GB | 16 GB | 176 GB | Ubuntu, Windows |

So it appears we are running on a machine with less RAM, which is interesting.
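If it helps to confirm what the runner actually provides, here is a small sketch (assuming `psutil` is available in the build environment) that logs RAM and VRAM from inside the job:

```python
# Sketch: log the runner's RAM and GPU memory so machine specs can be
# compared across EC2 and GitHub Actions builds. Assumes psutil is installed.
import subprocess
import psutil

ram_gib = psutil.virtual_memory().total / 2**30
print(f"System RAM: {ram_gib:.1f} GiB")

# nvidia-smi reports the GPU name and total VRAM (in MiB).
gpu = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print("GPU:", gpu.stdout.strip())
```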

mmcky avatar May 09 '24 23:05 mmcky

  • [x] remove the docker container layer to see if that speeds up compute times

mmcky avatar May 10 '24 01:05 mmcky

Currently the kernel is dying when installing directly onto the VM provided by GitHub (rather than using our docker container). It would be quicker and more efficient to get this route working.

mmcky avatar May 10 '24 02:05 mmcky

Thanks @mmcky, are we moving forward with moving all our repos that use AWS over to the GitHub Actions VM?

kp992 avatar May 13 '24 05:05 kp992

> Thanks @mmcky, are we moving forward with moving all our repos that use AWS over to the GitHub Actions VM?

I would like to if we can -- that is less to maintain. But currently I am getting issues with kernels dying, which suggests the jax install isn't working properly (without a container).
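One option, a sketch rather than an actual workflow step, would be a small smoke test run before executing the notebooks, so a broken CUDA install fails loudly in the log instead of killing the Jupyter kernel mid-build:

```python
# Hypothetical pre-build smoke test: force a real GPU kernel launch so a
# broken jax/CUDA install surfaces here rather than as a dead Jupyter kernel.
import jax
import jax.numpy as jnp

print("jax", jax.__version__)
print("devices:", jax.devices())   # should list a CUDA device

x = jnp.ones((1024, 1024))
y = (x @ x).block_until_ready()    # actually runs on the GPU
print("matmul OK, sum =", float(y.sum()))
```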

mmcky avatar May 13 '24 05:05 mmcky

The driver versions under docker are:

NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3    

and when using the native VM

 NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2  

so the difference in CUDA version is likely causing the issue?
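One way to check this from Python (assuming a reasonably recent JAX release) is the built-in environment dump, which prints the jax/jaxlib versions, detected devices, and the local nvidia-smi output side by side:

```python
# Environment dump to compare the CUDA version jaxlib was built against
# with what the driver reports (assumes a recent JAX release).
import jax

jax.print_environment_info()
```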

mmcky avatar May 21 '24 00:05 mmcky

@kp992 any ideas on why the jupyter kernel is dying when running directly on the VM but the docker container is OK?

  • [x] @mmcky can we host the docker container on github to speed up the compute?

mmcky avatar May 21 '24 01:05 mmcky

Thanks @mmcky, I will try to look into it. I will create a new PR on top of these commits so I can test and play around separately.

kp992 avatar May 21 '24 12:05 kp992

@mmcky working on using GitHub's container registry to store the docker container here: https://github.com/QuantEcon/lecture-python.docker/pull/4

mmcky avatar Jun 10 '24 06:06 mmcky

@kp992 the fetch from GitHub's container registry is about 10 min. That is pretty good, right?

mmcky avatar Jun 11 '24 04:06 mmcky

@kp992 it looks like these instances may have CUDA=12.3 installed. Our docker is configured for CUDA=12.5, so there are a lot of ptxas warnings. We may need to adjust the Docker container to match this environment (or upgrade the CUDA drivers). I think a CUDA upgrade would require a reboot, but I'm looking into it.

mmcky avatar Jun 11 '24 04:06 mmcky

@kp992 looks like the newer CUDA driver is working. Will post a speed comparison with the current live site once I get the preview.

mmcky avatar Jun 11 '24 05:06 mmcky

@jstac and @kp992 here are the latest results from moving our computations to the GitHub based GPU instance. LHS = current live site (built on EC2) and RHS = this PR (built on GitHub, using the CUDA=12.5 driver). Many times are improved except for Wealth Dynamics (@kp992 would you mind reviewing this lecture to see why this might be?)

[Screenshot: timing comparison, live site (EC2) vs this PR (GitHub Actions), 2024-06-11]

mmcky avatar Jun 11 '24 06:06 mmcky

Thanks @mmcky , good to know.

jstac avatar Jun 12 '24 00:06 jstac

Thanks @mmcky, this looks great. I can look at the wealth dynamics timings difference.

kp992 avatar Jun 12 '24 02:06 kp992

thanks @jstac and @kp992. I am doing one final round of review on this and then I will migrate to use github instances for this lecture series as well.

mmcky avatar Jun 12 '24 04:06 mmcky

  • [x] check this closely, as nvidia-smi is reporting the following while the docker container is using CUDA=12.5

NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 12.3

AH HA! That page hasn't been re-executed, as the date is from the 6th of June. This will refresh in a full build.

mmcky avatar Jun 12 '24 05:06 mmcky

@kp992 I think this is ready. If you can cast your eye over it one more time then I'll merge.

mmcky avatar Jun 12 '24 23:06 mmcky