Metalhead.jl
Docs restructuring, part 1
This is the PR for landing the first half of the docs restructuring, which will include, amongst other things:
- more docstrings
- a proper export of the Layers API from Metalhead (or at least the layers that are stable), with documentation added for them
- some tutorials to start working with Metalhead (including pre-trained models and using them as feature extractors, as well as a user's guide to working with the mid-level ResNet API)
This PR is a work in progress, so if some things seem off or incomplete, they'll probably be filled in as time goes on. The initial commit is just to check whether doc previews work.
Once the documentation build has completed, you can preview any updated documentation at this URL: https://fluxml.ai/Metalhead.jl/previews/PR201/
P.S. there really needs to be a way to just run docs previews if it's only a docs change - the Metalhead CI load is quite heavy
> P.S. there really needs to be a way to just run docs previews if it's only a docs change - the Metalhead CI load is quite heavy
Separate it into a different .yml? Though I think the documentation job should run in parallel? You don't get a green check mark immediately, but the preview job should finish separately.
A couple quick thoughts (I know you might already be considering some of this):
- the model API reference should split the docstrings into sub-headings based on the model type (ResNet-like, etc.)
- the layer API reference should split into public (exported) and private functions
> Separate it into a different .yml? Though I think the documentation job should run in parallel? You don't get a green check mark immediately, but the preview job should finish separately.
Yep, I meant more in terms of the load of running so many unnecessary tasks.
I am considering the other two comments, yes 😄 Those are next in the process
> P.S. there really needs to be a way to just run docs previews if it's only a docs change - the Metalhead CI load is quite heavy
Something like this should work, but IMO this will make it unnecessarily complicated ☹️
| label(s) | triggered CI | why? |
|---|---|---|
| neither `documentation` nor `test-ci` | tests and docs CI | for PRs with no labels |
| `documentation` | only the docs CI + DocBot | for PRs only changing the documentation |
| `documentation` and `test-ci` | tests and docs CI + DocBot | for PRs that require DocBot's comment along with running the tests |
I've seen a `skip ci` label used successfully before, but I'm not sure how to set that up with GH Actions.
Is it affecting queuing? As long as we can get machines... it's GH's load, not ours 😄.
Some weird errors pop up in the docs CI:
https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:18 doesn't seem to be the right error for the pre-trained ResNet loading - this one seems to suggest that there's no weights for ResNet-18 at all, when the error should be a loading error because of the architecture difference.
https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:39 is even weirder - Flux does seem to be in the environment, so how is it not defined?
> https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:39 is even weirder - Flux does seem to be in the environment, so how is it not defined?
Maybe try `using Flux`? Also, I think it would be better to use `onecold` from OneHotArrays.jl?
> https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:18 doesn't seem to be the right error for the pre-trained ResNet loading - this one seems to suggest that there's no weights for ResNet-18 at all, when the error should be a loading error because of the architecture difference.
This took me some time to debug, but the issue originates from here - https://github.com/FluxML/Metalhead.jl/blob/cc486bf00c60874de426dece97956528ce406564/src/pretrain.jl#L9
```julia
julia> artifact = BSON.load(path, @__MODULE__)
ERROR: UndefVarError: TaskLocalRNG not defined
Stacktrace:
  [1] #31
    @ C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:21 [inlined]
  [2] (::Base.BottomRF{BSON.var"#31#32"})(acc::Module, x::String)
    @ Base .\reduce.jl:81
  [3] _foldl_impl(op::Base.BottomRF{BSON.var"#31#32"}, init::Module, itr::Vector{Any})
    @ Base .\reduce.jl:62
  [4] foldl_impl
    @ .\reduce.jl:48 [inlined]
  [5] mapfoldl_impl
    @ .\reduce.jl:44 [inlined]
  [6] _mapreduce_dim
    @ .\reducedim.jl:315 [inlined]
  [7] #mapreduce#675
    @ .\reducedim.jl:310 [inlined]
  [8] #reduce#677
    @ .\reducedim.jl:359 [inlined]
  [9] resolve(fs::Vector{Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:21
 [10] (::BSON.var"#35#36")(d::Dict{Symbol, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:64
 [11] _raise_recursive(d::Dict{Symbol, Any}, cache::IdDict{Any, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:80
 [12] raise_recursive(d::Dict{Symbol, Any}, cache::IdDict{Any, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:93
 [13] (::BSON.var"#23#24"{IdDict{Any, Any}, Module})(x::Dict{Symbol, Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:98
 [14] applychildren!(f::BSON.var"#23#24"{IdDict{Any, Any}, Module}, x::Vector{Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\BSON.jl:26
 [15] raise_recursive
    @ C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:98 [inlined]
 [16] (::BSON.var"#17#20"{IdDict{Any, Any}, Module})(x::Vector{Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:80
 [17] applychildren!(f::BSON.var"#17#20"{IdDict{Any, Any}, Module}, x::Dict{Symbol, Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\BSON.jl:19
```
The actual error (`ERROR: UndefVarError: TaskLocalRNG not defined`) is suppressed by this `catch` block -
https://github.com/FluxML/Metalhead.jl/blob/cc486bf00c60874de426dece97956528ce406564/src/pretrain.jl#L15-L17
Note: Julia v1.6.* does not have `TaskLocalRNG()`. Flux does this -
https://github.com/FluxML/Flux.jl/blob/8bc0c35932c4a871ac73b42e39146cd9bbb1d446/src/utils.jl#L46-L60
"""
rng_from_array([x])
Create an instance of the RNG most appropriate for `x`.
The current defaults are:
- `x isa CuArray`: `CUDA.default_rng()`, else:
- `x isa AbstractArray`, or no `x` provided:
- Julia version is < 1.7: `Random.GLOBAL_RNG`
- Julia version is >= 1.7: `Random.default_rng()`
"""
rng_from_array(::AbstractArray) = default_rng_value()
rng_from_array(::CuArray) = CUDA.default_rng()
if VERSION >= v"1.7"
@doc """
default_rng_value()
Create an instance of the default RNG depending on Julia's version.
- Julia version is < 1.7: `Random.GLOBAL_RNG`
- Julia version is >= 1.7: `Random.default_rng()`
"""
default_rng_value() = Random.default_rng()
else
default_rng_value() = Random.GLOBAL_RNG
end
to deal with this difference, but I don't really know how to fix this in Metalhead. Also, doctests run on Julia v1.6.*, so all of the errors will be reproducible locally only if you run the doctests on Julia v1.6.*.
The RNG error is odd. I'll try and test on 1.6 to see if I can narrow something down.
> Maybe try `using Flux`? Also I think it would be better to use `onecold` from OneHotArrays.jl?
Just `using Flux: onecold` should work in theory though, right? The reason I'm not using OneHotArrays is that this is meant as a beginner tutorial, and the package segregation is something that easily confuses someone starting out with Flux/Julia. Since Flux exports `onecold` anyway, this should work, I think.
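For context, the import pattern being discussed would look roughly like this (a minimal sketch; the model construction and input below are placeholders, not the actual tutorial code):

```julia
using Flux       # Flux re-exports `onecold` (from OneHotArrays) and `softmax`
using Metalhead

# Placeholder model and input purely for illustration; the tutorial uses a
# pre-trained model and a properly preprocessed image.
model = ResNet(18)
img = rand(Float32, 224, 224, 3, 1)

probs = softmax(model(img))
onecold(probs)   # index of the highest-scoring class for each image in the batch
```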
> The actual error (`ERROR: UndefVarError: TaskLocalRNG not defined`) is suppressed by this `catch` block -
That's a nasty place for an unrelated error to hide 😬 I wonder if we could somehow filter out model-loading-specific errors and throw more informative messages for them, while letting any other errors pass through.
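Something along these lines might work (a hypothetical sketch, not the actual `pretrain.jl` code; the function name and the error classes checked are assumptions):

```julia
using BSON

# Hypothetical helper: only translate failures we recognise as "weights not
# available" into a friendly message, and let everything else (such as the
# UndefVarError above) propagate unchanged.
function load_pretrained(path::AbstractString, name::AbstractString)
    try
        return BSON.load(path, @__MODULE__)
    catch e
        if e isa SystemError || e isa Base.IOError
            error("Could not find pre-trained weights for `$name` at `$path`.")
        else
            rethrow()
        end
    end
end
```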
A couple comments:
- the RNG stored with the model is different in pre-1.7 vs. post-1.7; Metalhead.jl should probably have artifacts for both if it supports 1.6. Can you take care of porting those manually to HF?
- the Flux error was because you had `using Flux: onecold` but not just `using Flux`. You need both in this case. I agree that `using Flux: onecold` is preferable here vs. OneHotArrays.jl.
- I'll fix up the error handling and also add support for Julia versions
Actually let's fix the RNG issue directly in Flux. I'll whip up a PR soon.
Okay, now that I am at my desk and can see the error in more detail, this isn't something that we need to fix in Flux. Flux will skip trying to load the RNGs. The issue is that BSON errors because `TaskLocalRNG` is not defined pre-1.7, as Saransh pointed out.
@theabhirath I think the appropriate fix here is to always save the pre-trained weights using the minimum Julia compat. So in this case, re-save them under Julia 1.6.
> @theabhirath I think the appropriate fix here is to always save the pre-trained weights using the minimum Julia compat. So in this case, re-save them under Julia 1.6.
Gotcha, I'll whip up something soon to handle this - this should fix the nightly failures as well.
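A rough sketch of what that re-save step could look like (run under Julia 1.6; the file name is illustrative, and in practice the weights would be loaded into the model from wherever they were originally ported):

```julia
# Run under Julia 1.6 so that no 1.7-only types (e.g. TaskLocalRNG) end up in
# the serialized artifact.
using BSON, Flux, Metalhead

model = ResNet(18)                # weights would be loaded into this model
                                  # before saving, in practice
BSON.@save "resnet18.bson" model  # illustrative artifact name
```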
It's a pity for this work to go stale.
@theabhirath if you have bandwidth to rebase, that would be great. Just merging this PR will be a good starting point for me or someone else to finish up getting 0.8 ready.
Hey, sorry I've been inactive for some time, college has been a little busy! I can definitely rebase this PR. The reason I haven't just finished this up is that there was a blocker I felt needed a little time to investigate (the porting of weights from torchvision that I brought up on Zulip), which I think is the real blocker for v0.8. I was hoping to find time over the summer, but if someone can find time before that and get that sorted, I think the docs PRs should be an easy merge.
I agree that the weights are the main blocker. I can make time for that, but I think merging this PR would be a strict improvement in the amount of work left before we can release. So I say we just merge after rebasing.