cerebros-core-algorithm-alpha

120: Add use examples: EfficientNet fine-tuning on CIFAR-100

Open sashakolpakov opened this issue 2 years ago • 10 comments

Added EfficientNetV2 (small model) fine-tuning on CIFAR-100 using Cerebros, provided as an ipynb notebook and a .py script.
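
For context, here is a minimal sketch (not the notebook's exact code) of how an EfficientNetV2-S backbone can be set up for CIFAR-100 fine-tuning in Keras; the 224x224 resize, frozen backbone, and single dense head are illustrative assumptions:

```python
import tensorflow as tf

NUM_CLASSES = 100  # CIFAR-100

# Pretrained EfficientNetV2-S backbone with the ImageNet head removed.
base = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze for the initial fine-tuning stage

inputs = tf.keras.Input(shape=(32, 32, 3))       # raw CIFAR-100 images
x = tf.keras.layers.Resizing(224, 224)(inputs)   # upscale to a size the backbone handles well
x = tf.keras.applications.efficientnet_v2.preprocess_input(x)
x = base(x, training=False)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```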

sashakolpakov avatar Oct 31 '23 21:10 sashakolpakov

I added a CICD test for this benchmark. Let's pray that this will run on the GitHub test server in a workable time. If not, we may need to make a miniaturized version of it for the CICD demos. https://github.com/david-thrower/cerebros-core-algorithm-alpha/pull/123/files#diff-cc8c65daed8907e6bb50ac1769d49c05f5f48bdbe8b5cfd3b24b7c5e56ceb8dc

david-thrower avatar Oct 31 '23 22:10 david-thrower

Thanks! Should we try to update / add other CNNs? By the way, in one EfficientNet CIFAR-10 notebook I can see an orphaned computation (interrupted, never finished, and never made it to testing).

Alex

sashakolpakov avatar Oct 31 '23 22:10 sashakolpakov

  1. Needs tensorflow_datasets to be installed, otherwise the check fails. My bad! Updated requirements.txt, rerunning.
  2. It also complains about there being no CUDA. Please advise on the course of action.

sashakolpakov avatar Oct 31 '23 22:10 sashakolpakov

Here is what I did: I added tensorflow-datasets to a separate requirements file, which I should also do later on with tensorflow-text and other ancillary requirements ... I want to avoid bloating the core by separating the use-case-specific packages from the core packages.
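
Sketch of the intended layout (file names here are illustrative, not necessarily the ones used in the PR):

```
requirements.txt          # core packages only, kept lean
requirements-extra.txt    # tensorflow-datasets and other use-case-specific packages

# install the extras only when running the CIFAR-100 example:
pip install -r requirements.txt -r requirements-extra.txt
```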

david-thrower avatar Oct 31 '23 23:10 david-thrower

Thanks! I'm gonna follow this structure in all future pull requests.

sashakolpakov avatar Oct 31 '23 23:10 sashakolpakov

To reply to your question ("Also complains no CUDA. Please instruct the course of action."): this should be only a warning, which TF will throw whenever it is running on a CPU-only machine. By default, it will JIT compile (except on text classification). This will speed it up on CPUs almost as much as an inexpensive GPU will. It leverages XLA, TensorFlow's optimizing compiler, which fuses tandem linear algebra operations so they complete in one step (basically, the multiply and the add of a multiply-add are issued as a single fused instruction instead of two separate ones).

https://www.tensorflow.org/xla

Since we are poor, this approach is preferable to GPUs anyway.

https://keras.io/api/models/model_training_apis/

jit_compile: If True, compile the model training step with XLA (https://www.tensorflow.org/xla). XLA is an optimizing compiler for machine learning. jit_compile is not enabled by default. Note that jit_compile=True may not necessarily work for all models. For more information on supported operations please refer to the XLA documentation (https://www.tensorflow.org/xla). Also refer to known XLA issues (https://www.tensorflow.org/xla/known_issues) for more details.
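
Concretely, enabling it is a single argument on compile; a generic Keras sketch (not Cerebros-specific code):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# jit_compile=True asks Keras to compile the train/eval/predict steps with XLA,
# which also speeds things up on CPU-only machines.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"],
              jit_compile=True)
```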

david-thrower avatar Oct 31 '23 23:10 david-thrower

@sashakolpakov

No problem, I was loading everything into the same requirements.txt as well. This commit just happened to be the one where I caught on to the fact that I need to stop adding more and more to it. Once I package this and put it on PyPI, I think the requirements will install automatically with a pip install, so for that reason I need to separate them ... and I need to package this for PyPI ...
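
A rough sketch of what that separation could look like at packaging time (the package name, versions, and extras names are hypothetical):

```python
# setup.py (hypothetical sketch)
from setuptools import setup, find_packages

setup(
    name="cerebros-core-algorithm-alpha",
    version="0.0.1",
    packages=find_packages(),
    install_requires=["tensorflow"],          # core packages install automatically with pip install
    extras_require={                          # use-case-specific packages stay optional
        "datasets": ["tensorflow-datasets"],  # pip install "cerebros-core-algorithm-alpha[datasets]"
        "text": ["tensorflow-text"],
    },
)
```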

david-thrower avatar Oct 31 '23 23:10 david-thrower

It did fail, though I'm not sure whether by a narrow margin or not. I think I will do the following: prepare a smaller dataset in which each class has a small, fixed number of images (I need to determine what number). I will make sure it is balanced, too. Taking just a random subset would leave very few or no samples in some categories (even though subsampling 15-20k entries out of the 50k total training images, with no regard to keeping the dataset balanced, already gives us reasonably good accuracy!).
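
A minimal sketch of drawing such a balanced subset from CIFAR-100 (the per-class count of 100 is an illustrative assumption):

```python
import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar100.load_data()
y_train = y_train.flatten()

PER_CLASS = 100  # illustrative; pick a number so the CICD run finishes in time
rng = np.random.default_rng(42)

# sample the same number of images from each of the 100 classes
idx = np.concatenate([
    rng.choice(np.where(y_train == c)[0], size=PER_CLASS, replace=False)
    for c in range(100)
])
rng.shuffle(idx)

x_small, y_small = x_train[idx], y_train[idx]
```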

sashakolpakov avatar Nov 01 '23 07:11 sashakolpakov

@sashakolpakov, this is what I had to do on the EfficientNet CIFAR-10 example. For showcase examples, full scale is definitely awesome, but for the CICD tests, the test must complete in a timeframe that fits.

What I think would be a good solution to this problem is to make an environment variable like CICD_TEST, have all the Python scripts look for it, and default to False if the variable does not exist.

If the execution environment the script runs in has CICD_TEST set to true, then a small subset of the data is used in the training jobs. If the variable is absent or set to false, then the full data set runs.
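
A minimal sketch of that pattern (the subset size and epoch counts are assumptions, not settings from the repo):

```python
import os

# Default to False when the CICD_TEST variable does not exist.
CICD_TEST = os.environ.get("CICD_TEST", "False").strip().lower() in ("1", "true", "yes")

if CICD_TEST:
    train_samples = 2_000   # tiny balanced subset so the GitHub runner finishes in time
    epochs = 1
else:
    train_samples = None    # use the full CIFAR-100 training set
    epochs = 20
```

In the workflow, the variable would then be set with something like CICD_TEST: "true" in the job's env block.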

david-thrower avatar Nov 01 '23 16:11 david-thrower

Approved to merge this use case in, but given the scale of compute required, it may be infeasible to have as a routine CICD test for now.

david-thrower avatar Nov 27 '23 07:11 david-thrower