MLUtils.jl
Archiving DataLoaders.jl
Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.
We should:
- [ ] add a guide for users transitioning that explains the differences between `DataLoaders.DataLoader` and `MLUtils.DataLoader`
- [ ] port and adapt relevant documentation, including:
  - [ ] notice for people that are looking for `torch.utils.data.DataLoader` and the comparison to PyTorch
  - [ ] the collating/batching guide
  - [ ] guide on larger-than-memory datasets (should be updated to use `MLDatasets.FileDataset` and `mapobs` instead of a custom struct)
- [ ] go through DataLoaders.jl issues to search for feature requests that we can move to MLUtils.jl, e.g. https://github.com/lorenzoh/DataLoaders.jl/issues/32 -> https://github.com/JuliaML/MLUtils.jl/issues/68
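For the transition guide, the main API difference to document is roughly the following (a hedged sketch; the exact keyword names and defaults should be checked against each package's released docstrings before it goes into the docs):

```julia
# Illustrative only: contrasting the two DataLoader call styles.
X = rand(Float32, 10, 100)   # 100 observations of dimension 10

# DataLoaders.jl: batch size is positional, and loading is
# parallel/buffered by default.
# using DataLoaders
# dl = DataLoader(X, 16)

# MLUtils.jl: batch size is a keyword, and parallelism is opt-in.
using MLUtils
dl = DataLoader(X; batchsize = 16, shuffle = true, parallel = false)

for xb in dl
    # each batch is a view of up to 16 observations
    @assert size(xb, 2) <= 16
end
```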
I'm having second thoughts on this, since I heard some complaints about MLUtils.jl now being too heavy for someone who needs only basic utilities like a train/test split. For instance, the latency of `using MLUtils` is halved by removing the parallel stuff:
```julia
julia> @time using MLUtils  # on master
  2.090915 seconds (2.10 M allocations: 124.960 MiB, 3.50% gc time, 68.02% compilation time)

julia> @time using MLUtils  # after removing FLoops and FoldsThreads
  0.928274 seconds (890.85 k allocations: 53.444 MiB, 2.36% gc time, 73.03% compilation time)
```
Maybe we can keep DataLoaders.jl, move it to the JuliaML org, copy and paste the implementation from here, and tag a breaking release. Then we have Flux reexport it.
For MLUtils, we keep eachobs without the parallel option.
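That non-parallel path would look something like this minimal sketch (assuming the current `eachobs` signature; the `batchsize` keyword should be verified against the released docs):

```julia
# Serial observation iteration with eachobs, no parallel machinery.
using MLUtils

X = rand(Float32, 4, 10)   # 10 observations of dimension 4

for x in eachobs(X)                  # one observation at a time
    @assert size(x) == (4,)
end

for xb in eachobs(X; batchsize = 5)  # or in batches
    @assert size(xb) == (4, 5)
end
```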
As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo. But that can come later; with the excision of the parallel dataloader, MLUtils is a lean dependency for someone who wants to implement the obs interface.
Or we just go on with the plan outlined in the OP. What do you think @lorenzoh @darsnack @ToucheSir ?
Right, I didn't think of the latency. I think what you're proposing makes sense, and makes for an easier migration as well.
My opinion is that we should really consider how many people are annoyed by this latency issue and if this is a real demand that justifies dividing work again into multiple repositories.
This premature modularization is a serious issue in Julia organizations, where people split things into packages because one user commented about load time. At some point we all hope that Julia will fix the latency issue, so let's think carefully before deciding to go separate ways.
Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.
How much can we reduce import times without removing core deps (FLoops can probably go though since it's mostly used for syntax sugar)? I don't think this package has any precompilation, for example.
I agree with Brian that we should look into reducing the load time contributed by FLoops first. It would be nicer as a user to not have to load different packages to set up the data pipeline.
I think the main complaint re: load times is packages that want to extend the interface, not user load times. For this, let's see how far people get with the getindex and length interface. If we find lots of types are needing to define getobs and numobs specifically, then we can split the interface out. I think that makes more sense than splitting out the parallel loading.
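For reference, the "just define `getindex` and `length`" path relies on the generic fallbacks in MLUtils (`numobs` falls back to `Base.length`, `getobs` to `Base.getindex`); a sketch with a hypothetical container type:

```julia
# Hypothetical data container: only Base methods are defined,
# no explicit getobs/numobs overloads.
struct RangeData
    n::Int
end
Base.length(d::RangeData) = d.n
Base.getindex(d::RangeData, i) = 10 .* i   # illustrative payload

using MLUtils
d = RangeData(8)
numobs(d)      # works via the length fallback
getobs(d, 3)   # works via the getindex fallback
```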
As another data point, the import time difference is also quite big on my machine:
```julia
julia> @time using MLUtils
  0.222900 seconds (468.17 k allocations: 27.956 MiB, 1.72% compilation time)                # no parallel deps
  1.010500 seconds (1.67 M allocations: 95.612 MiB, 1.94% gc time, 43.90% compilation time)  # master
  0.995093 seconds (1.56 M allocations: 88.791 MiB, 46.31% compilation time)                 # without FLoops
```
Edit: added time without FLoops.jl
> Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.
I can think of many users of MLUtils not being users of DataLoaders.
> It would be nicer as a user to not have to load different packages to set up the data pipeline.
True. This is mitigated, though, by the fact that Flux's users just import Flux and have everything they need, without having to know anything about MLUtils or DataLoaders. Non-deep-learning users can just use `eachobs` without any parallel option.
> Edit: added time without FLoops.jl

So removing FLoops alone doesn't help much.
As for the maintenance burden of another repo, I expect DataLoaders.jl to be very low activity; we will hardly change anything until we start experimenting with other parallel loading strategies.
What about precompilation? I don't think any packages in JuliaML use it?
Precompilation would help with TTFX, but not with speeding up `using`, right?
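Right; for illustration, a package-level `precompile` directive moves method compilation to precompile time (helping time-to-first-call for those signatures), but it does little for the `using` time itself, which is dominated by code loading and any invalidations. A hypothetical module:

```julia
# Hypothetical package module with a precompile directive.
module ThePkg

# stack a vector of Float32 matrices along a new third dimension
mybatch(xs::Vector{Matrix{Float32}}) = cat(xs...; dims = 3)

# executed once, when the package image is built: compiles mybatch
# for this signature ahead of first use
precompile(mybatch, (Vector{Matrix{Float32}},))

end
```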
I thought it was part of https://github.com/SciML/DifferentialEquations.jl/issues/786, but reading back through https://github.com/SciML/DifferentialEquations.jl/issues/786#issuecomment-1000786329 it appears invalidations + Requires were more important for import times. Has anyone tried running @time_imports from 1.9 on MLUtils? I don't have a nightly Julia installed locally.
This is what I get:
```julia
julia> @time_imports using MLUtils
     10.4 ms  ┌ MacroTools
     17.8 ms  ┌ ZygoteRules 34.58% compilation time
      0.6 ms  ┌ DefineSingletons
      0.3 ms  ┌ IteratorInterfaceExtensions
      0.9 ms  ┌ TableTraits
      1.1 ms  ┌ DelimitedFiles
      3.3 ms  ┌ Compat
      1.2 ms  ┌ Requires
      0.4 ms  ┌ DataValueInterfaces
      1.0 ms  ┌ DataAPI
     11.2 ms  ┌ Tables
      1.0 ms  ┌ ConstructionBase
     16.2 ms  ┌ Setfield 29.96% compilation time (22% recompilation)
     13.6 ms  ┌ InitialValues
    127.6 ms  ┌ BangBang 55.14% compilation time (2% recompilation)
     10.0 ms  ┌ FunctionWrappers
     51.4 ms  ┌ DataStructures
      0.4 ms  ┌ CompositionsBase
     15.4 ms  ┌ Accessors 24.46% compilation time
      1.5 ms  ┌ ContextVariablesX
     60.8 ms  ┌ ChainRulesCore
     62.5 ms  ┌ ChangesOfVariables
      0.8 ms  ┌ InverseFunctions
      2.8 ms  ┌ DocStringExtensions 52.48% compilation time
      3.6 ms  ┌ IrrationalConstants
      0.7 ms  ┌ StatsAPI
      0.7 ms  ┌ SortingAlgorithms
      1.4 ms  ┌ LogExpFunctions
      6.1 ms  ┌ Missings
     38.9 ms  ┌ StatsBase 3.78% compilation time
      6.9 ms  ┌ MicroCollections
      0.7 ms  ┌ Adapt
      0.7 ms  ┌ ArgCheck
      6.4 ms  ┌ SplittablesBase
     34.7 ms  ┌ Baselet
     85.8 ms  ┌ Transducers 10.24% compilation time
     28.4 ms  ┌ MLStyle
      1.1 ms  ┌ PrettyPrint
      5.1 ms  ┌ ShowCases
    265.3 ms  ┌ FoldsThreads 86.56% compilation time
      1.2 ms  ┌ FLoopsBase
      0.7 ms  ┌ NameResolution
      3.4 ms  ┌ JuliaVariables
      6.1 ms  ┌ FLoops
    747.3 ms  MLUtils 43.14% compilation time (<1% recompilation)
```
We could do the lazy-import (or Requires-based) trick we use in MLDatasets.jl here as well for FoldsThreads (see https://github.com/johnnychen94/LazyModules.jl), but it is a hackish solution that we should avoid if possible.
Some finer-grained timing locally suggests that the majority of import overhead for BangBang and FoldsThreads is self time and not from dependencies (unless there is significant invalidation happening, but that does not appear to be the case). It's not clear to me why this is though, @tkf do you have any thoughts?
I'm waiting for the outcome of this discussion before tagging a new release that adds the `parallel` option to DataLoader, which we may have to break in the future if the parallel implementation is moved out. But a few changes have accumulated by now, most importantly bug fixes like #97, so we should reach a decision soon.
Would it be possible to add parallel as an experimental option? Could be removed later by simply ignoring it.
Thinking through it, I think it would be better to have DataLoader in MLUtils.jl. If criticisms mount after, it can always be removed.
Re being able to implement the data container interface without taking on a large dep: I think getindex and length should suffice in most use cases; where they don't is mostly for builtin types like Array, Tuple, and Dict, which are already covered in MLUtils.jl.
I share Lorenz's view here about waiting to see what people say. Also, keeping it in means we can release now, so I suggest doing that and moving this discussion to the next cycle. The long load times and DataLoader as a type are already released, right?
Yes, the load time is already released.
Ok, let's release then
> As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo.
Yes, please.