FluxML-Community-Call-Minutes
Refresh Model Zoo
This issue will track long term progress for the model-zoo. Here are the steps to get there (in order):
- [x] Test all models: run through all the tutorials to check whether they work on the latest release and document problems
- [x] Address outstanding external PRs: take care of PRs such as
  - [x] Metalhead.jl
  - [x] Flux.jl `outdims`
  - ~~LearnBase.jl~~
  - ~~yet to be filed MLDataPattern.jl~~
- [x] Issue/PR clean-up: close stale issues and PRs
- [ ] Update models: get models working with latest release (no new tutorials!)
- [ ] Add new tutorials: add new tutorials (especially ones that make use of the new ecosystem packages)
- [ ] Trim old tutorials: remove tutorials that are superseded by new tutorials
- [ ] Set up benchmarking CI: convert tutorials to Weave.jl like SciML
- [ ] Translate to benchmarks: convert as many tutorials as possible to use the benchmark model
Original Text
Collecting these issues and PRs here for now. Eventually we may want to split some out for triage.
- [ ] https://github.com/FluxML/model-zoo/pull/241 update to Flux 0.11 (note: this is already done for the tutorial)
- [ ] https://github.com/FluxML/model-zoo/issues/235 incorrect dimension order for some CNNs
- [ ] https://github.com/FluxML/model-zoo/pull/199 use the non-deprecated TagBot Action
- [ ] https://github.com/FluxML/model-zoo/issues/154 rm bitrotted diffeq examples since they appear to be covered under SciML
- [ ] https://github.com/FluxML/model-zoo/issues/244 fix model non-convergence
- [ ] https://github.com/FluxML/model-zoo/pull/203 clean up unnecessary use of global scope
- [ ] https://github.com/FluxML/model-zoo/issues/218, https://github.com/FluxML/model-zoo/pull/192 many models are missing READMEs
- [ ] [TODO issue] MNIST batch size is too aggressive and causes frequent OOMs
- [ ] [TODO issue] add CI (and possibly convergence testing?)
What is the overall vision for the model zoo? What attributes would make a good zoo model? I'm going to throw out some ideas; let me know if they match what you have in mind:
- single `.jl` file + `Project.toml`: self-contained, no need to have a bunch of files, easier setup (just `] instantiate`, then run the Julia file). Not easy to document what the different pieces are doing without long comments
- Jupyter notebook: an example like this gives a full tutorial to the ecosystem (I'm working on one right now to learn myself, but it could be useful to show newcomers the whole pipeline and its options). But this is harder to get up and running and hard to build a benchmark pipeline around
- optimized to be extremely fast: `CuIterator` and carefully thought-out optimizations, no global scope. Harder for a Flux/Julia newbie to do
- Everything up to date: keep the examples relevant, identify bugs and holes before other people try them, actively prevent bitrot
- What to do about the data? Does every model need to implement its own method for downloading data? Is there a best practice?
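For the "no global scope" point, the style being advocated could be sketched roughly like this (a hypothetical skeleton using the Flux 0.11-era `params`/`update!` API with placeholder data, not an actual zoo script):

```julia
using Flux

# Build the model inside a function rather than at top level.
build_model() = Chain(Dense(784, 32, relu), Dense(32, 10))

# All training state is local to the function: no globals for the
# loss closure to accidentally capture, and a type-stable inner loop.
function train(; epochs = 2)
    model = build_model()
    opt = ADAM()
    ps = Flux.params(model)
    # Placeholder batches standing in for a real dataset such as MNIST.
    data = [(rand(Float32, 784, 16), Flux.onehotbatch(rand(0:9, 16), 0:9))
            for _ in 1:10]
    for _ in 1:epochs, (x, y) in data
        gs = Flux.gradient(() -> Flux.logitcrossentropy(model(x), y), ps)
        Flux.update!(opt, ps, gs)
    end
    return model
end
```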
I read somewhere else that someone thought getting a working model zoo up to best practices would set the stage for growing the ML/DL/Flux ecosystem, because it then becomes easier to benchmark and progressively improve.
As someone newer to the ecosystem and Julia, it would probably be good to have a list of best practices for implementing a model: a checklist with explanations. It would promote consistency across the model zoo models and be a good jumping-off point for newcomers to the community. There's a PR about avoiding the global scope, but someone coming from Python won't necessarily know not to do that in their model.
I think there is also likely an opportunity to improve these pages as well, specifically the performance one: https://fluxml.ai/Flux.jl/stable/ecosystem/ https://fluxml.ai/Flux.jl/stable/performance/
As for my first post above, I just found Literate.jl, which I think could be a solution to maintaining a single `.jl` file while still taking advantage of the richness of notebooks. If they were all done in that style, there could be unit tests and CI, yet they could easily be converted to notebooks. Just a thought.
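For reference, a Literate.jl source is just a plain `.jl` file in which lines starting with `# ` become markdown and everything else becomes code cells. A minimal sketch (the example itself is made up):

```julia
# # A toy linear fit
# We recover `y = 3x + 1` from noisy samples; everything not in a
# `# ` comment below is an ordinary, runnable code cell.

xs = collect(0f0:0.1f0:1f0)
ys = 3f0 .* xs .+ 1f0 .+ 0.01f0 .* randn(Float32, length(xs))

# ## Closed-form least squares
# A one-liner keeps the tutorial focused on the idea, not the plumbing.

w, b = hcat(xs, ones(Float32, length(xs))) \ ys
```

Calling `Literate.notebook("this_file.jl", "out/")` on such a file emits the Jupyter notebook form, and `Literate.markdown` emits docs pages, so one source can feed scripts, notebooks, and the website.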
These are great comments. I'll try and give some insight on some of them.
- Julia files vs. Jupyter notebooks: I believe Chris has given us some references on how SciML uses Literate.jl (or maybe it was Weave.jl) to generate Julia + notebooks as well as benchmark with CI. (link)
- Staying up to date: This is absolutely crucial, and I think that using the zoo for benchmarking will help ensure this.
- Data management: All zoo models should use MLDatasets.jl, or implement the custom dataset interface.
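As an illustration of the MLDatasets.jl route, loading a standard dataset is a couple of lines (a sketch using the API as it stood around the releases discussed here; it has since changed):

```julia
using MLDatasets

# Downloads on first use (after a DataDeps prompt), then caches locally,
# so zoo scripts don't need their own download logic.
train_x, train_y = MLDatasets.MNIST.traindata(Float32)
test_x,  test_y  = MLDatasets.MNIST.testdata(Float32)

size(train_x)   # a 28×28×60000 array of images
```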
It's all done by https://github.com/SciML/RebuildAction
> As someone newer to the ecosystem and Julia, it would probably be good to have a list of best practices when implementing a model - a checklist with explanations.
Totally agree with this. Once the Metalhead.jl PR is resolved, I can write up a section on this in the docs for Flux.jl.
RE documenting best practices, there's an old list here that might be a good jumping-off point. I think it's also worth a look at the GSOD PRs for overlap and/or opportunities for collaboration.
Thanks for all the responses. It sounds like there are a handful of lists that could be put together generally on the topics of performance and best practices, and I think putting them in the Flux Docs probably makes sense.
@darsnack , when you said "the custom dataset interface" it sounds like you were referring to something in particular?
It also sounds like I need to familiarize myself with RebuildAction, CI, and the SciML Benchmarks, likely after this semester ends.
Feel free to ask for help, and feel free to ping me to join the next ML Fast AI coordination call. I am getting Dhariya involved as well.
> @darsnack , when you said "the custom dataset interface" it sounds like you were referring to something in particular?
Yes, since we've adopted MLDataPattern.jl for iterating over datasets, any custom dataset needs to implement the `getobs` interface. Here is a good overview of what that entails.
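A custom dataset under that interface might look roughly like this (a hypothetical type; note the module that owns `getobs`/`nobs` has moved between LearnBase and MLUtils across versions, so treat the import as era-specific):

```julia
import LearnBase: getobs, nobs

# A dataset whose observations are individual files on disk.
struct FileDataset
    paths::Vector{String}
end

nobs(d::FileDataset) = length(d.paths)              # how many observations
getobs(d::FileDataset, i::Int) = read(d.paths[i])   # load the i-th one lazily
```

With just these two methods defined, MLDataPattern utilities such as `shuffleobs`, `splitobs`, and `eachbatch` work on `FileDataset` for free.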
I wrote a wonderful Documenter plugin, DemoCards.jl, for JuliaImages's demo page. It uses Literate.jl, so you can write in plain Julia files. It maps the folder structure into the page structure, so making changes is very flexible.
I'm willing to put some effort into updating the model zoo. I've already started testing the models, and I'm keeping a list of issues here.
So we already have the `scripts` folder, which does the conversion for this exact use case, plus FluxBot.jl. We need only go through the models to add the literate content. This is Literate-based as well, so it should be easy to test things out.
cc @sophb
Another thing that makes the model zoo less useful is that recently many models include argument handling as part of every script, which distracts from what the scripts are meant to demonstrate. Removing that and using simpler constructs that point to exactly the aspect being discussed would make a whole lot of difference.
Yes, I noticed this while testing the models, some of them have taken a bit too much of a kitchen sink approach.
IMHO the model zoo should find the right balance between three things:
- Impress with what can be done with Flux
- Guide people trying to learn Flux
- Benchmark the Flux package
My guess is that highlighting just one or maybe two features or ecosystem packages per model, and using each of these features or packages in only one or two models, will strike a good balance. Overdoing it will only distract from the great things that can be done with Flux, produce harder-to-grasp tutorials, and lengthen the benchmark running times.
Some examples:
- Use the dcgan model to highlight a more complex custom training loop and try to use a vanilla training loop in all the other models.
- The language detection model can be used to highlight a custom dataset and let all the others use MLDatasets
- The VGG CIFAR10 model could be the only one using custom logging with TensorBoardLogger.jl
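For the logging example, TensorBoardLogger.jl plugs into Julia's standard logging frontend, roughly like this (a sketch; the run directory name is made up):

```julia
using TensorBoardLogger, Logging

lg = TBLogger("runs/vgg_cifar10")   # writes TensorBoard event files here

with_logger(lg) do
    for step in 1:100
        # Each @info call logs scalars under the "train" tag group.
        @info "train" loss = 1 / step  acc = 1 - 1 / step
    end
end
```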
> IMHO the model zoo should find the right balance between three things:
>
> - Impress with what can be done with Flux
> - Guide people trying to learn Flux
> - Benchmark the Flux package
I think if we use @ChrisRackauckas's approach with SciML, we should be able to write Literate.jl scripts for all our model zoo examples. And I agree that some examples should throw out some of the kitchen sink and focus on the essence of what that example is trying to teach.
Right now, the problem is that all the examples exist to be consumed in script form. Instead, if they were written with Literate.jl, we could have a Publish.jl website for the entire zoo, with the pages of that site being the tutorials. I think just that change of writing for a different audience makes a difference in the produced result.
We are working on establishing a "Flux" Publish theme for all the ecosystem packages. I can put together a sample PR where I translate a couple of the zoo examples and showcase the website.
I have been saying for a while that if we can get the literate part into the model scripts, the conversion to push them to the site is trivial.
> We are working on establishing a "Flux" Publish theme for all the ecosystem packages. I can put together a sample PR where I translate a couple of the zoo examples and showcase the website.
A PR would be welcome, have you seen the tutorial pages on the site already?
> we should be able to write Literate.jl scripts for all our model zoo examples.
They already are; check the `scripts` directory in the model zoo.
Also check the dg/zygote branch to see how the zoo used to look, without the kitchen-sink approach.
~~Ah, I didn't realize that the model-zoo was already capable of feeding the tutorials on the website. Does this happen automatically on release?~~
I'm thinking of something like the tutorials page on the website, but automatically updated by CI on every "release" of the model zoo. That way everything in the zoo is consumed as either a tutorial on the website, a runnable script you can download, or a script that the benchmarking CI can run.
In the setup I was describing, the model zoo would have its own website that hosted all these tutorials. Is there a way to tie that to the Flux website? I don't think GH Actions can trigger events on other repos?
> I'm thinking of something like the tutorials page on the website, but it is automatically updated by CI on every "release" of the model zoo
So the idea is exactly that, and we have a working example of that tied with FluxBot.jl, plus RebuildAction would allow for benchmarking. We are already setting up a benchmarking suite for GPU performance.
> the model zoo would have its own website that hosted all these tutorials. Is there a way to tie that to the Flux website
We should have that happen as part of the flux website, for sure.
Could we add an item to the tracker to move the site off Jekyll? Maybe Publish.jl, PkgPage.jl, or Franklin.jl?
Yeah moving off Jekyll to a Julia-based static website generator would make this all easier for sure.
Can we replace every occurrence of `train!` in the model-zoo with a custom loop? Discussion in https://github.com/FluxML/Flux.jl/issues/1461 is not converging, and maybe this is something we could all agree on.
> replace any occurrence of train! in the model-zoo with a custom loop?
I don't think that's a great idea, except maybe for a model which specifically intends to show the loop, or something that benefits from it.
> Discussion in FluxML/Flux.jl#1461 is not converging, and maybe this is something we could all agree on
I'm not sure what you mean here? There is ongoing and active discussion, and it's best to have it play out properly.
> I don't think that's a great idea, maybe for a model which specifically intends to show the loop or something that benefits from it.
I suggest the other way around: we give a single example of `train!`.
> I'm not sure what you mean here? There is ongoing and active discussion and best to have it properly
It's surely good to have the discussion; I'm just saying it is not converging and could go on for months. My suggestion is to reverse things as they currently stand: primarily point users to the pattern that is more informative and more flexible, and only secondarily to `train!`.
Let me rephrase this. The pattern that we have established with Flux.jl is what is represented in the examples, and we shouldn't change things only to change them back; a clear answer will come when the discussion converges.
There is no need to change anything back once you have custom loops.
If most of the examples use `train!`, and we converge on removing or downplaying it, then we've taught a bunch of users a non-preferred API.
The for-loop will always be a first class API, so you can't go wrong by using custom loops everywhere.
Not really; the for loop is under the same considerations and open to the same changes as any other API Flux has had or will have. I prefer the for loop to `train!` personally, but it isn't cogent to push it when it's not settled.
If the examples are better served by not needing a for loop, then we did the right thing by teaching users to look for similar constructs in other packages that provide loops for more complex cases, and also by educating them about how the loops work through examples and docs meant to teach the for-loop API.
API design shouldn't prefer one set of assumptions for every case, but offer options.
I don't want everyone copying the same for loop everywhere; the average case does not need any more complication. If there are average cases not well served, improve the API to catch those cases.
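For concreteness, the two patterns being weighed differ by only a few lines. A hedged sketch in the Flux 0.11-era implicit-`params` style (not actual zoo code):

```julia
using Flux

model = Dense(2, 1)
loss(x, y) = Flux.mse(model(x), y)
ps = Flux.params(model)
opt = ADAM()
data = [(rand(Float32, 2, 8), rand(Float32, 1, 8)) for _ in 1:4]

# Convenience wrapper: terse, but hides the update step.
Flux.train!(loss, ps, data, opt)

# Equivalent explicit loop: more typing, but every step is visible
# and easy to customise (logging, schedules, early stopping, ...).
for (x, y) in data
    gs = Flux.gradient(() -> loss(x, y), ps)
    Flux.update!(opt, ps, gs)
end
```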