
Docs restructuring, part 1

Open · theabhirath opened this pull request 3 years ago · 14 comments

This is the PR for landing the first half of the docs restructuring, which will include, amongst other things:

  1. more docstrings
  2. a proper export of the Layers API from Metalhead (or at least the layers that are stable), with documentation added for them
  3. some tutorials to get started with Metalhead (including loading pre-trained models and using them as feature extractors, as well as a user's guide to the mid-level ResNet API)

This PR is a work in progress, so if some things seem off or incomplete, they'll probably be filled in as time goes on. The initial commit is just to check that doc previews work.

theabhirath · Sep 07 '22 12:09

Once the documentation build has completed, you can preview any updated documentation at this URL: https://fluxml.ai/Metalhead.jl/previews/PR201/

github-actions[bot] · Sep 07 '22 12:09

P.S. There really needs to be a way to run just the docs preview when a PR only changes the docs - the Metalhead CI load is quite heavy.

theabhirath · Sep 07 '22 12:09

P.S. There really needs to be a way to run just the docs preview when a PR only changes the docs - the Metalhead CI load is quite heavy.

Separate it into a different .yml? Though I think the documentation job should run in parallel? You don't get a green check mark immediately, but the preview job should finish separately.

A couple quick thoughts (I know you might already be considering some of this):

  • the model API reference should split the docstrings into sub-headings based on the model type (ResNet-like, etc.)
  • the layer API reference should split into public (exported) and private functions

darsnack · Sep 07 '22 13:09

Separate it into a different .yml? Though I think the documentation job should run in parallel? You don't get a green check mark immediately, but the preview job should finish separately.

Yep, I meant more in terms of the load from running so many unnecessary tasks.

I am considering the other two comments, yes 😄 Those are next in the process

theabhirath · Sep 07 '22 13:09

P.S. There really needs to be a way to run just the docs preview when a PR only changes the docs - the Metalhead CI load is quite heavy.

Something like this should work, but IMO this will make it unnecessarily complicated ☹️

| label(s) | triggered CI | why? |
| --- | --- | --- |
| neither `documentation` nor `test-ci` | tests and docs CI | for PRs with no labels |
| `documentation` | only the docs CI + DocBot | for PRs only changing the documentation |
| `documentation` and `test-ci` | tests and docs CI + DocBot | for PRs that require DocBot's comment along with running the tests |
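For instance, the test workflow could gate on the PR labels like this (a rough, untested sketch; the job and label names follow the table above):

```yaml
# .github/workflows/ci.yml (sketch) -- skip the heavy test job for
# docs-only PRs, i.e. those carrying the `documentation` label without
# `test-ci`.
jobs:
  test:
    if: ${{ !contains(github.event.pull_request.labels.*.name, 'documentation') || contains(github.event.pull_request.labels.*.name, 'test-ci') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # ... the existing Julia test steps ...
```

(IIRC the trigger would also need `types: [labeled, unlabeled]` so that adding or removing a label re-evaluates the condition.)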

Saransh-cpp · Sep 07 '22 14:09

I've seen a `skip ci` label used successfully before, but I'm not sure how to set that up with GH Actions.

ToucheSir · Sep 07 '22 14:09

Is it affecting queuing? As long as we can get machines...it's GH's load not ours 😄.

darsnack · Sep 07 '22 15:09

Some weird errors pop up in the docs CI:

https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:18 doesn't seem to be the right error for the pre-trained ResNet loading - this one suggests that there are no weights for ResNet-18 at all, when it should be a loading error caused by the architecture difference.

https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:39 is even weirder - Flux does seem to be in the environment, so how is it not defined?

theabhirath · Sep 08 '22 02:09

https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:39 is even weirder - Flux does seem to be in the environment, so how is it not defined?

Maybe try `using Flux`? Also, I think it would be better to use `onecold` from OneHotArrays.jl?

https://github.com/FluxML/Metalhead.jl/runs/8228349979?check_suite_focus=true#step:6:18 doesn't seem to be the right error for the pre-trained ResNet loading - this one suggests that there are no weights for ResNet-18 at all, when it should be a loading error caused by the architecture difference.

This took me some time to debug, but the issue is originating from here - https://github.com/FluxML/Metalhead.jl/blob/cc486bf00c60874de426dece97956528ce406564/src/pretrain.jl#L9

```
julia> artifact = BSON.load(path, @__MODULE__)
ERROR: UndefVarError: TaskLocalRNG not defined
Stacktrace:
  [1] #31
    @ C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:21 [inlined]
  [2] (::Base.BottomRF{BSON.var"#31#32"})(acc::Module, x::String)
    @ Base .\reduce.jl:81
  [3] _foldl_impl(op::Base.BottomRF{BSON.var"#31#32"}, init::Module, itr::Vector{Any})
    @ Base .\reduce.jl:62
  [4] foldl_impl
    @ .\reduce.jl:48 [inlined]
  [5] mapfoldl_impl
    @ .\reduce.jl:44 [inlined]
  [6] _mapreduce_dim
    @ .\reducedim.jl:315 [inlined]
  [7] #mapreduce#675
    @ .\reducedim.jl:310 [inlined]
  [8] #reduce#677
    @ .\reducedim.jl:359 [inlined]
  [9] resolve(fs::Vector{Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:21
 [10] (::BSON.var"#35#36")(d::Dict{Symbol, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\extensions.jl:64 
 [11] _raise_recursive(d::Dict{Symbol, Any}, cache::IdDict{Any, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:80       
 [12] raise_recursive(d::Dict{Symbol, Any}, cache::IdDict{Any, Any}, init::Module)
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:93       
 [13] (::BSON.var"#23#24"{IdDict{Any, Any}, Module})(x::Dict{Symbol, Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:98       
 [14] applychildren!(f::BSON.var"#23#24"{IdDict{Any, Any}, Module}, x::Vector{Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\BSON.jl:26
 [15] raise_recursive
    @ C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:98 [inlined]  
 [16] (::BSON.var"#17#20"{IdDict{Any, Any}, Module})(x::Vector{Any})        
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\read.jl:80       
 [17] applychildren!(f::BSON.var"#17#20"{IdDict{Any, Any}, Module}, x::Dict{Symbol, Any})
    @ BSON C:\Users\Saransh\.julia\packages\BSON\rOaki\src\BSON.jl:19
```

The actual error (`ERROR: UndefVarError: TaskLocalRNG not defined`) is suppressed by this catch block - https://github.com/FluxML/Metalhead.jl/blob/cc486bf00c60874de426dece97956528ce406564/src/pretrain.jl#L15-L17
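In effect, the pattern is (paraphrasing, not the verbatim source):

```julia
# Every failure inside the try block, including BSON's `UndefVarError`,
# is replaced by a generic "no weights" error, hiding the real cause.
artifact = try
    BSON.load(path, @__MODULE__)
catch
    throw(ArgumentError("No pre-trained weights available for this model."))
end
```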

Note: Julia v1.6.* does not have `TaskLocalRNG`. Flux does this -

https://github.com/FluxML/Flux.jl/blob/8bc0c35932c4a871ac73b42e39146cd9bbb1d446/src/utils.jl#L46-L60

"""
    rng_from_array([x])
Create an instance of the RNG most appropriate for `x`.
The current defaults are:
- `x isa CuArray`: `CUDA.default_rng()`, else:
- `x isa AbstractArray`, or no `x` provided:
  - Julia version is < 1.7: `Random.GLOBAL_RNG`
  - Julia version is >= 1.7: `Random.default_rng()`
"""
rng_from_array(::AbstractArray) = default_rng_value()
rng_from_array(::CuArray) = CUDA.default_rng()

if VERSION >= v"1.7"
  @doc """
      default_rng_value()
  Create an instance of the default RNG depending on Julia's version.
  - Julia version is < 1.7: `Random.GLOBAL_RNG`
  - Julia version is >= 1.7: `Random.default_rng()`
  """
  default_rng_value() = Random.default_rng()
else
  default_rng_value() = Random.GLOBAL_RNG
end
```

to deal with this difference, but I don't really know how to fix this in Metalhead. Also, the doctests run on Julia v1.6.*, so these errors are reproducible locally only if you run the doctests on Julia v1.6.* too.

Saransh-cpp · Sep 08 '22 07:09

The RNG error is odd. I'll try and test on 1.6 to see if I can narrow something down.

Maybe try `using Flux`? Also, I think it would be better to use `onecold` from OneHotArrays.jl?

Just `using Flux: onecold` should work in theory though, right? The reason I'm not using OneHotArrays is that this is intended as a beginner tutorial, and the way functionality is split across packages easily confuses someone starting out with Flux/Julia. Since Flux exports `onecold` anyway, this should work, I think.
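i.e. the tutorial snippet would look something like this (a sketch; the labels file is a stand-in):

```julia
using Flux       # re-exports `onecold`, so no separate OneHotArrays import
using Metalhead

model = ResNet(18; pretrain = true)

# In the real tutorial, `labels` comes from the ImageNet class list;
# the filename here is only a placeholder.
labels = readlines("imagenet_classes.txt")

x = rand(Float32, 224, 224, 3, 1)  # dummy 224×224 RGB image, batch of 1
onecold(model(x), labels)          # label of the highest-scoring class
```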

The actual error (`ERROR: UndefVarError: TaskLocalRNG not defined`) is suppressed by this catch block -

That's a nasty place for an unrelated error to hide 😬 I wonder if we could somehow filter out model-loading-specific errors and throw more informative messages for them, while letting any other errors pass through.
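Roughly along these lines, perhaps (untested sketch; the exact set of "loading" error types is a guess):

```julia
using BSON

function load_pretrained(path, mod)
    try
        return BSON.load(path, mod)
    catch e
        # Translate only errors we can attribute to a weights/architecture
        # mismatch into a friendly message; everything else (like BSON's
        # `UndefVarError` above) surfaces unchanged instead of being masked.
        if e isa DimensionMismatch || e isa KeyError
            throw(ArgumentError("Pre-trained weights at $path do not match the model."))
        else
            rethrow()
        end
    end
end
```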

theabhirath · Sep 08 '22 08:09

A couple comments:

  • the RNG stored with the model is different pre-1.7 vs. post-1.7; Metalhead.jl should probably have artifacts for both if it supports 1.6. Can you take care of porting those manually to HF?
  • the Flux error was because you had `using Flux: onecold` but not a plain `using Flux`. You need both in this case. I agree that `using Flux: onecold` is preferable here vs. OneHotArrays.jl.
  • I'll fix up the error handling and also add support for both Julia versions

darsnack · Sep 08 '22 11:09

Actually let's fix the RNG issue directly in Flux. I'll whip up a PR soon.

darsnack · Sep 08 '22 12:09

Okay, now that I am at my desk and can see the error in more detail, this isn't something that we need to fix in Flux. Flux will skip trying to load the RNGs. The issue is that BSON errors because `TaskLocalRNG` is not defined pre-1.7, like Saransh pointed out.

@theabhirath I think the appropriate fix here is to always save the pre-trained weights using the minimum Julia compat. So in this case, re-save them under Julia 1.6.
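Concretely, something like this, run on Julia 1.6 (a sketch; the artifact filename is a placeholder):

```julia
# On Julia 1.6 (the minimum supported version), nothing in the serialized
# model can reference 1.7-only types such as `TaskLocalRNG`.
using BSON, Metalhead

model = ResNet(18)                         # build the architecture fresh
# ... load the ported weights into `model` here ...
BSON.bson("resnet18.bson"; model = model)  # placeholder artifact name
```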

darsnack · Sep 08 '22 13:09

@theabhirath I think the appropriate fix here is to always save the pre-trained weights using the minimum Julia compat. So in this case, re-save them under Julia 1.6.

Gotcha, I'll whip up something soon to handle this - this should fix the nightly failures as well

theabhirath · Sep 08 '22 14:09

It's a pity for this work to go stale.

CarloLucibello · Apr 21 '23 17:04

@theabhirath if you have the bandwidth to rebase, that would be great. Just merging this PR will be a good starting point for me or someone else to finish up getting 0.8 ready.

darsnack · Apr 21 '23 18:04

Hey, sorry I've been inactive for some time - college has been a little busy! I can definitely rebase this PR. The reason I haven't just finished this up is that there was a blocker I felt needed some time to investigate (the porting of weights from torchvision that I brought up on Zulip), which I think is the real blocker for v0.8. I was hoping to find time over the summer, but if someone can get it sorted before then, the docs PRs should be an easy merge.

theabhirath · Apr 22 '23 01:04

I agree that the weights are the main blocker. I can make time for that, but I think merging this PR would be a strict improvement in the amount of work left before we can release. So I say we just merge after rebasing.

darsnack · Apr 22 '23 02:04