MLUtils.jl
Archiving DataLoaders.jl
Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.
We should:
- [ ] add a guide for users transitioning that explains the differences between `DataLoaders.DataLoader` and `MLUtils.DataLoader`
- [ ] port and adapt relevant documentation, including:
  - [ ] notice for people that are looking for `torch.utils.data.DataLoader` and the comparison to PyTorch
  - [ ] the collating/batching guide
  - [ ] guide on larger-than-memory datasets (should be updated to use `MLDatasets.FileDataset` and `mapobs` instead of a custom struct)
- [ ] go through DataLoaders.jl issues to search for feature requests that we can move to MLUtils.jl, e.g. https://github.com/lorenzoh/DataLoaders.jl/issues/32 -> https://github.com/JuliaML/MLUtils.jl/issues/68
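For the transition guide, the main API difference to document is roughly the following (a hedged sketch; the exact keyword names and defaults should be checked against each package's released docstrings before it goes into the docs):

```julia
# Illustrative only: contrasting the two DataLoader call styles.
X = rand(Float32, 10, 100)   # 100 observations of dimension 10

# DataLoaders.jl: batch size is positional, and loading is
# parallel/buffered by default.
# using DataLoaders
# dl = DataLoader(X, 16)

# MLUtils.jl: batch size is a keyword, and parallelism is opt-in.
using MLUtils
dl = DataLoader(X; batchsize = 16, shuffle = true, parallel = false)

for xb in dl
    # each batch is a view of up to 16 observations
    @assert size(xb, 2) <= 16
end
```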
I'm having second thoughts on this, since I heard some complaints about MLUtils.jl now being too heavy for someone who needs only basic utilities like a train/test split. For instance, the latency of `using MLUtils` is halved by removing the parallel stuff:
```julia
julia> @time using MLUtils  # on master
  2.090915 seconds (2.10 M allocations: 124.960 MiB, 3.50% gc time, 68.02% compilation time)

julia> @time using MLUtils  # after removing FLoops and FoldsThreads
  0.928274 seconds (890.85 k allocations: 53.444 MiB, 2.36% gc time, 73.03% compilation time)
```
Maybe we can keep DataLoaders.jl, move it to the JuliaML org, copy and paste the implementation from here, and tag a breaking release. Then we have Flux reexport it.
For MLUtils, we keep eachobs without the parallel option.
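That non-parallel path would look something like this minimal sketch (assuming the current `eachobs` signature; the `batchsize` keyword should be verified against the released docs):

```julia
# Serial observation iteration with eachobs, no parallel machinery.
using MLUtils

X = rand(Float32, 4, 10)   # 10 observations of dimension 4

for x in eachobs(X)                  # one observation at a time
    @assert size(x) == (4,)
end

for xb in eachobs(X; batchsize = 5)  # or in batches
    @assert size(xb) == (4, 5)
end
```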
As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo. But that can come later; with the excision of the parallel dataloader, MLUtils is a lean dependency for someone who wants to implement the obs interface.
Or we just go on with the plan outlined in the OP. What do you think @lorenzoh @darsnack @ToucheSir ?
Right, I didn't think of the latency. I think what you're proposing makes sense, and makes for an easier migration as well.
My opinion is that we should really consider how many people are annoyed by this latency issue and if this is a real demand that justifies dividing work again into multiple repositories.
This premature modularization is a serious issue in Julia organizations, where people split things into packages because one user commented about load time. At some point we all hope that Julia will fix the latency issue, so let's think carefully before deciding to go separate ways.
Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.
How much can we reduce import times without removing core deps (FLoops can probably go though since it's mostly used for syntax sugar)? I don't think this package has any precompilation, for example.
I agree with Brian that we should look into reducing the load time contributed by FLoops first. It would be nicer as a user to not have to load different packages to set up the data pipeline.
I think the main complaint re: load times is packages that want to extend the interface, not user load times. For this, let's see how far people get with the getindex and length interface. If we find lots of types are needing to define getobs and numobs specifically, then we can split the interface out. I think that makes more sense than splitting out the parallel loading.
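For reference, the "just define `getindex` and `length`" path relies on the generic fallbacks in MLUtils (`numobs` falls back to `Base.length`, `getobs` to `Base.getindex`); a sketch with a hypothetical container type:

```julia
# Hypothetical data container: only Base methods are defined,
# no explicit getobs/numobs overloads.
struct RangeData
    n::Int
end
Base.length(d::RangeData) = d.n
Base.getindex(d::RangeData, i) = 10 .* i   # illustrative payload

using MLUtils
d = RangeData(8)
numobs(d)      # works via the length fallback
getobs(d, 3)   # works via the getindex fallback
```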
As another data point, the import time difference is also quite big on my machine:
```julia
julia> @time using MLUtils
  0.222900 seconds (468.17 k allocations: 27.956 MiB, 1.72% compilation time)                # no parallel deps
  1.010500 seconds (1.67 M allocations: 95.612 MiB, 1.94% gc time, 43.90% compilation time)  # master
  0.995093 seconds (1.56 M allocations: 88.791 MiB, 46.31% compilation time)                 # without FLoops
```
Edit: added time without FLoops.jl
> Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.
I can think of many users of MLUtils not being users of DataLoaders.
> It would be nicer as a user to not have to load different packages to set up the data pipeline.
True. This is mitigated, though, by the fact that Flux's users just import Flux and have everything they need, without having to know anything about MLUtils or DataLoaders. Non-deep-learning users can just use `eachobs` without any parallel option.
> Edit: added time without FLoops.jl

So removing FLoops alone doesn't help much.
As for the maintenance burden of another repo, I expect DataLoaders.jl to be very low activity; we will hardly change anything until we start experimenting with other parallel loading strategies.
What about precompilation? I don't think any packages in JuliaML use it?
Precompilation would help with TTFX, but not with speeding up `using`, right?
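Right; for illustration, a package-level `precompile` directive moves method compilation to precompile time (helping time-to-first-call for those signatures), but it does little for the `using` time itself, which is dominated by code loading and any invalidations. A hypothetical module:

```julia
# Hypothetical package module with a precompile directive.
module ThePkg

# stack a vector of Float32 matrices along a new third dimension
mybatch(xs::Vector{Matrix{Float32}}) = cat(xs...; dims = 3)

# executed once, when the package image is built: compiles mybatch
# for this signature ahead of first use
precompile(mybatch, (Vector{Matrix{Float32}},))

end
```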
I thought it was part of https://github.com/SciML/DifferentialEquations.jl/issues/786, but reading back through https://github.com/SciML/DifferentialEquations.jl/issues/786#issuecomment-1000786329 it appears invalidations + Requires were more important for import times. Has anyone tried running @time_imports from 1.9 on MLUtils? I don't have a nightly Julia installed locally.
This is what I get:
```julia
julia> @time_imports using MLUtils
     10.4 ms  ┌ MacroTools
     17.8 ms  ┌ ZygoteRules 34.58% compilation time
      0.6 ms  ┌ DefineSingletons
      0.3 ms  ┌ IteratorInterfaceExtensions
      0.9 ms  ┌ TableTraits
      1.1 ms  ┌ DelimitedFiles
      3.3 ms  ┌ Compat
      1.2 ms  ┌ Requires
      0.4 ms  ┌ DataValueInterfaces
      1.0 ms  ┌ DataAPI
     11.2 ms  ┌ Tables
      1.0 ms  ┌ ConstructionBase
     16.2 ms  ┌ Setfield 29.96% compilation time (22% recompilation)
     13.6 ms  ┌ InitialValues
    127.6 ms  ┌ BangBang 55.14% compilation time (2% recompilation)
     10.0 ms  ┌ FunctionWrappers
     51.4 ms  ┌ DataStructures
      0.4 ms  ┌ CompositionsBase
     15.4 ms  ┌ Accessors 24.46% compilation time
      1.5 ms  ┌ ContextVariablesX
     60.8 ms  ┌ ChainRulesCore
     62.5 ms  ┌ ChangesOfVariables
      0.8 ms  ┌ InverseFunctions
      2.8 ms  ┌ DocStringExtensions 52.48% compilation time
      3.6 ms  ┌ IrrationalConstants
      0.7 ms  ┌ StatsAPI
      0.7 ms  ┌ SortingAlgorithms
      1.4 ms  ┌ LogExpFunctions
      6.1 ms  ┌ Missings
     38.9 ms  ┌ StatsBase 3.78% compilation time
      6.9 ms  ┌ MicroCollections
      0.7 ms  ┌ Adapt
      0.7 ms  ┌ ArgCheck
      6.4 ms  ┌ SplittablesBase
     34.7 ms  ┌ Baselet
     85.8 ms  ┌ Transducers 10.24% compilation time
     28.4 ms  ┌ MLStyle
      1.1 ms  ┌ PrettyPrint
      5.1 ms  ┌ ShowCases
    265.3 ms  ┌ FoldsThreads 86.56% compilation time
      1.2 ms  ┌ FLoopsBase
      0.7 ms  ┌ NameResolution
      3.4 ms  ┌ JuliaVariables
      6.1 ms  ┌ FLoops
    747.3 ms  MLUtils 43.14% compilation time (<1% recompilation)
```
We could do the lazy-import (or Requires-based) trick we use in MLDatasets.jl here as well for FoldsThreads (see https://github.com/johnnychen94/LazyModules.jl), but it is a hackish solution that we should avoid if possible.
Some finer-grained timing locally suggests that the majority of import overhead for BangBang and FoldsThreads is self time and not from dependencies (unless there is significant invalidation happening, but that does not appear to be the case). It's not clear to me why this is though, @tkf do you have any thoughts?
I'm waiting for the outcome of this discussion before tagging a new release that adds the `parallel` option to DataLoader, which we may have to break in the future if the parallel implementation is moved out. But a few changes have accumulated by now, most importantly bug fixes like #97, so we should reach a decision soon.
Would it be possible to add parallel as an experimental option? Could be removed later by simply ignoring it.
Thinking through it, I think it would be better to have DataLoader in MLUtils.jl. If criticisms mount after, it can always be removed.
Re being able to implement the data container interface without taking on a large dep: I think getindex and length should suffice in most use cases; where they don't is mostly for builtin types like Array, Tuple, and Dict, which are already covered in MLUtils.jl.
I share Lorenz's view here about waiting to see what people say. Also, keeping it in means we can release now, so I suggest doing that and moving this discussion to the next cycle. The long load times and DataLoader as a type are already released, right?
Yes, the load time is already released.
Ok, let's release then
> As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo.
Yes, please.