
PyTorch feature parity

Open CarloLucibello opened this issue 4 years ago • 94 comments

A list of PyTorch 1.7 features. Items are checked if we have something more or less equivalent in Flux, or in the Julia ecosystem and supported by Flux. This list is not complete; it comes from a rough scan of PyTorch's documentation. Please feel free to add anything I missed in the comments, and anyone with write access is welcome to edit the list. Related issue: https://github.com/FluxML/ML-Coordination-Tracker/issues/16, and more generally anything in https://github.com/FluxML/ML-Coordination-Tracker/issues

PyTorch Features

Conv Layers

  • [x] Conv1d, Conv2d, Conv3d.
  • [x] ConvTranspose1d, ConvTranspose2d, ConvTranspose3d.
  • [x] groups in convolution layers
  • [ ] Fold, Unfold. In progress: https://github.com/FluxML/NNlib.jl/pull/444

Pooling Layers

  • [x] MaxPool1d, MaxPool2d, MaxPool3d
  • [ ] MaxUnPool1d, MaxUnPool2d, MaxUnPool3d
  • [x] AvgPool1d, AvgPool2d, AvgPool3d
  • [ ] FractionalMaxPool2d
  • [ ] LPPool1d, LPPool2d
  • [x] AdaptiveAvgPool1d, AdaptiveAvgPool2d, AdaptiveAvgPool3d
  • [x] AdaptiveMaxPool1d, AdaptiveMaxPool2d, AdaptiveMaxPool3d

Padding Layers

  • [x] ReflectionPad (1d,2d)
  • [x] ReplicationPad (1d,2d,3d) ( NNlib.pad_repeat)
  • [x] ZeroPad (2d)
  • [x] ConstantPad (1d,2d,3d)
  • [ ] ~Add corresponding layers for all of the above, wrapping the NNlib functions~ keep as functions; they still need to be added to Flux's docs (see the sketch below).
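
For reference, a minimal sketch of the NNlib padding functions (keyword handling may differ slightly across NNlib versions):

using NNlib

x = reshape(Float32.(1:9), 3, 3)
NNlib.pad_zeros(x, (1, 1); dims = 1)        # zero padding: 1 before and 1 after along dim 1
NNlib.pad_constant(x, (1, 1), 7; dims = 2)  # constant padding with value 7 along dim 2
NNlib.pad_repeat(x, (2, 2); dims = 1)       # replication padding
NNlib.pad_reflect(x, (1, 1); dims = 2)      # reflection padding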

Activations

  • [x] ... . NNlib has an extensive collection of activation functions, and any Julia function can be used as an activation (see the sketch below).
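
A small sketch (newer Flux syntax; older versions write Dense(4, 8, gelu)):

using Flux

m = Chain(
    Dense(4 => 8, gelu),              # any NNlib activation
    Dense(8 => 1, x -> x * tanh(x)),  # ...or any plain Julia function, applied elementwise
)
m(rand(Float32, 4, 16))               # 1×16 output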

Normalization Layers

  • [x] BatchNorm1d, BatchNorm2d, BatchNorm3d
  • [x] LayerNorm
  • [x] GroupNorm
  • [x] InstanceNorm1d,InstanceNorm2d,InstanceNorm3d
  • [ ] SyncBatchNorm
  • [x] LocalResponseNorm. Very old unfinished PR #312. It is an outdated technique; we can probably live without it.
  • [ ] Move the functional implementations to NNlib.jl (https://github.com/FluxML/NNlib.jl/issues/19)

Recurrent Layers

  • [x] RNN
  • [x] GRU
  • [x] LSTM

Attention Layers

  • [ ] Transformer. Well-maintained implementations in Transformers.jl.
  • [x] MultiHeadAttention ~Should be moved from Transformers.jl to Flux.jl~ (ensure hitting cudnn kernels). PR #2146

Linear Layers

  • [x] Identity
  • [x] Linear
  • [x] Bilinear

Dropout Layers

  • [x] Dropout
  • [x] Dropout2d, Dropout3d (#1490)
  • [x] AlphaDropout

Sparse Layers

  • [x] Embedding PR #1516
  • [x] EmbeddingBag PR #2031

Distance Functions

  • [ ] CosineSimilarity. We have this in Distances.jl, and it is also easy to handcode (see the sketch below). TODO: check whether it is AD- and GPU-friendly.
  • [ ] PairwiseDistance. We have this in Distances.jl. TODO: check whether it is AD- and GPU-friendly (could use Tullio.jl to achieve both).
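
A hand-coded version, as a sketch (broadcasting only, so it should be AD- and GPU-friendly; not the Distances.jl API):

# cosine similarity along dims = 1, i.e. column-wise for feature × batch matrices
cosine_similarity(x, y; dims = 1) =
    sum(x .* y; dims = dims) ./
    (sqrt.(sum(abs2, x; dims = dims)) .* sqrt.(sum(abs2, y; dims = dims)) .+ eps(eltype(x)))

x, y = rand(Float32, 128, 32), rand(Float32, 128, 32)
cosine_similarity(x, y)   # 1×32 row of similarities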

Loss Functions

  • [x] .... . We should be well covered here.
  • [x] CTCLoss. Being implemented in #1287 (TODO: remove the separate GPU case, integrate with cuDNN)

Vision Layers

  • [x] PixelShuffle. #1468
  • [ ] Upsample (for 1d, 2d, and 3d). (partially done in #1468)
    • [x] 'nearest'
    • [ ] 'linear' (CPU version merged in NNlib, CUDA PR still to come)
    • [x] 'bilinear'
    • [ ] 'bicubic'
    • [x] 'trilinear' (CPU version merged in NNlib, CUDA PR still open)

Initialization

  • [x] xavier_uniform, xavier_normal. Called glorot here.
  • [x] kaiming_normal kaiming_uniform
  • [x] sparse
  • [x] orthogonal (#1496)
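
For the name mapping, a minimal sketch (newer Flux syntax; these functions are typically passed as a layer's init keyword):

using Flux

Flux.glorot_uniform(64, 32)    # ≈ torch.nn.init.xavier_uniform_
Flux.kaiming_normal(64, 32)    # ≈ torch.nn.init.kaiming_normal_
Flux.orthogonal(64, 32)        # added in #1496
Dense(32 => 64; init = Flux.kaiming_uniform)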

Parallelism and Distributed

  • [ ] DataParallel
  • [ ] DistributedDataParallel (solved by https://github.com/DhairyaLGandhi/DaggerFlux.jl)
  • [x] set_num_threads, set_num_interop_threads. Not sure which operations are parallelized in PyTorch; here we have parallelization only in BLAS operations.
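
For reference, the closest Julia-side knobs (a sketch):

using LinearAlgebra

BLAS.set_num_threads(4)   # parallelism for BLAS-backed ops (dense matmul, etc.)
Threads.nthreads()        # Julia-level threads, set by starting julia with --threads=N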

Distributions

  • [x] diff rules for logpdf offered by DistributionsAD.jl
  • [x] rsample. Differentiability of parameters through sampling is supported by many distributions: gradient(mu -> rand(Normal(mu, 1)), 0) == (1,).
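
The inline example above, spelled out as a sketch (assuming Distributions, DistributionsAD, and Zygote are loaded):

using Distributions, DistributionsAD, Zygote

# reparameterized sampling: rand(Normal(mu, 1)) is mu + randn(), so the derivative w.r.t. mu is 1
Zygote.gradient(mu -> rand(Normal(mu, 1)), 0.0)   # (1.0,), as claimed above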

ONNX

FFT

  • [x] ... . Zygote has the adjoints for AbstractFFTs.
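
A quick check of those adjoints, as a sketch (assuming FFTW and Zygote):

using FFTW, Zygote

x = rand(Float32, 8)
Zygote.gradient(x -> sum(abs2, fft(x)), x)   # gradient ≈ 2 * length(x) .* x, by Parseval's theorem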

Quantization

  • [ ] ...

Pruning

  • [ ] WIP pruning package here

Optim

  • [ ] schedulers #1434 and #1506, also see ParameterSchedulers.jl
    • [ ] Integrate with Flux's optimizers? (See https://github.com/FluxML/Optimisers.jl/pull/15)
    • [x] Document in Flux (see #1511 and #1513) ~- [ ] Reexport in Flux (see #1506)~ (TBD)
    • [x] LambdaLR (handled in ParameterSchedulers.jl)
    • [x] MultiplicativeLR (handled in ParameterSchedulers.jl)
  • [x] optimizers
    • [x] SGD (+ momentum)
    • [x] Adam
    • [x] AdaGrad
    • [x] AdaDelta
    • [x] RMSprop
    • [ ] LBFGS. Integration with Optim.jl

LinAlg

  • [x] det
  • [x] norm

Tensorboard

XLA

Misc

  • [ ] PyTorch has both layers and their functional counterparts.
  • [x] einsum. AD- and CUDA-compatible Einstein summation is provided by Tullio.jl and other packages (see the sketch after this list).
    • [ ] add documentation to Flux.jl
  • [ ] LazyModuleMixin (pytorch 1.8) PR #2078
  • [ ] weight_norm. Attempt in #1005 , PR #2053
  • [x] modules iterator. #1444
  • [ ] spectral_norm. Old attempt in #115
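
A minimal Tullio sketch (Tullio composes with Zygote for AD, and with CUDA + KernelAbstractions on GPU):

using Tullio

A, B = rand(32, 16), rand(16, 8)
@tullio C[i, j] := A[i, k] * B[k, j]   # matrix multiplication, einsum-style
@tullio n[i] := A[i, k] * A[i, k]      # row-wise squared norms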

PyTorch Extras

Torchvision

  • [ ] datasets. Some are implemented in DLDatasets.jl (unreleased), some in FastAI.jl, some in MLDatasets.jl, many are missing.
    • Will consolidate in MLDatasets.jl (see https://github.com/lorenzoh/DLDatasets.jl/issues/1)
  • [x] models. Some are implemented in Metalhead.jl, but it is a bit stale and not comprehensive.
    • [x] Metalhead's PR should add a bunch of models and generally revive the repo
    • [ ] We should expose the possibility to load pretrained weights
  • [ ] io
  • [ ] transforms. Some ~~unreleased~~ work in DataAugmentation.jl

Torchaudio ...

Torchtext ...

CarloLucibello avatar Dec 19 '20 01:12 CarloLucibello

Do you mind if I try to implement the support in Flux corresponding to Dropout2D in pytorch?

gxyd avatar Dec 19 '20 07:12 gxyd

yes please, essentially this is all up for grabs

CarloLucibello avatar Dec 19 '20 07:12 CarloLucibello

Note that we shouldn't add all these layers here; e.g. pixel shuffle already has an implementation, as does Transformers, and upsampling and embedding are direct Julia operations, etc.

DhairyaLGandhi avatar Dec 19 '20 09:12 DhairyaLGandhi

  • where is pixel shuffle?
  • upsampling is not trivial, we already have a few unfinished attempts here (#1180 )
  • I was surprised to find transformers in PyTorch; they were requested here https://github.com/pytorch/pytorch/issues/10459. I guess it makes sense though to have some basic components in Flux, in the same way we have RNN.
  • I'm not an NLP guy, but it looks like learnable embeddings https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html are not entirely trivial to implement in a way that plays nicely with other layers. I think they are very common, so they should be implemented in Flux.

Maybe @chengchingwen could provide some suggestions on the last two items

CarloLucibello avatar Dec 19 '20 10:12 CarloLucibello

@CarloLucibello About ONNX: This exists which I guess (hope?) is better than nothing: https://github.com/DrChainsaw/ONNXmutable.jl

I haven't registered it because 1) the name sucks and I can't think of anything better and 2) I'm thinking of splitting the import and export into two separate packages. 1 is the main blocker though :)

I'd be happy to donate it to FluxML, or parts of it (e.g. import/export primitives).

DrChainsaw avatar Dec 19 '20 11:12 DrChainsaw

Yeah upsampling is non-trivial to get right and be performant on the GPU as well (last time I tried it, I had to ask in #gpu on Slack to get a good implementation).

For ONNX, is it possible to hand control of ONNX.jl to @DrChainsaw? It seems like ONNXmutable.jl should really supersede that package.

For vision models, there is this Metalhead PR which I think will bring us much closer to PyTorch parity. I am planning on training some of the simpler ones this weekend, but I would appreciate the help to add pre-trained weights from anyone with a GPU.

Lastly, for hyperparameter/learning rate schedules, I just started ParameterSchedulers.jl to break the functionality out of FluxTraining.jl. This is quite a simple package, and I want to finish it this weekend for a project. I am happy to transfer ownership to FluxML.

darsnack avatar Dec 19 '20 14:12 darsnack

I tried implementing WeightNorm before, but it's harder than I thought without doing a per-layer implementation; see #1005. Doing a per-layer implementation is actually easy, but it's maintenance hell at the same time.

bhvieira avatar Dec 19 '20 14:12 bhvieira

@DrChainsaw what are the limitations of ONNXmutable?

CarloLucibello avatar Dec 19 '20 15:12 CarloLucibello

@CarloLucibello From the ML-Coordination issue it seems like there are a lot of ways to look at ONNX import/export, so what counts as a limitation appears to be a bit more subjective than I thought.

Here are some things I can think of

Only a subset of ops is supported. This is imo not a big deal, as I have made an effort to make it easy to add more, and even easy for users to just hack in their own versions locally. Most ops are trivial to add, but I have intentionally not added more than what I happen to need, in the hope that it would encourage contribution.

It has capabilities which perhaps only a small subset of users have use for w.r.t. model manipulation. This translates to dependencies like JuMP and Cbc (used to solve the problem of keeping all parameter shapes aligned when changing the model structure) as well as metadata used to formulate the shape constraints. This may appear as bloat to users who only want to import a model and use it. The annoying part here is that Chain can't represent an arbitrary graph, and even things like what is proposed in #1289 seem very hard to translate to from a more standard graph format such as the one used in ONNX. NaiveNASlib has an internal graph format which does not have the extra functionality for shape alignment, which could perhaps be used, but there seems to be a desire for a 'Flux native' format.

RNNs are currently a bit limited, although this is more on Flux than on ONNXmutable, since ONNX wants RNN to have 3D input while Flux wants 2D (in a loop, see the sketch below). I have worked around this to some extent by just changing the model shape to 3D if a recurrent layer is found and then folding the time dimension into the batch dimension if a Dense layer is encountered. This only works for a few model architecture types though.
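
For reference, the Flux convention being described, as a sketch (Flux 0.13-style syntax; the recurrent API differs between Flux versions):

using Flux

m  = RNN(3 => 5)                              # one 2-D (features × batch) slice per call
xs = [rand(Float32, 3, 8) for _ in 1:10]      # 10 time steps, batch of 8
ys = [m(x) for x in xs]                       # vector of 5×8 outputs; ONNX instead expects one 3-D array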

Exporting functionality can't handle 1) non-primitive functions with type constraints (e.g. function thewholemodel(x::AbstractArray)) and 2) non-function control flow (e.g. if/else/for; functions like ifelse/map/reduce or let_onnx_know_this_is_a_loop(f, n) could be solvable, I think). The first can probably be hacked around with IRTools, but I think the latter would require an abstract interpreter or similarly sophisticated code analysis/transformation, e.g. Mjolnir.

Ecosystem-wise it would be better to refactor at least the export primitives to use NNlib, as that would make them usable from other libraries which use NNlib (Knet, Avalon, etc.). Perhaps not so much a limitation in itself though, and it can always be broken out later down the road. For export there is no limit on how many ways one can choose to translate a Julia function to an ONNX node.

Btw, I think it would be better to try to remove ONNX.jl from the general registry and use a name like OnnxFlux.jl to clearly state that it translates between ONNX and Flux.

DrChainsaw avatar Dec 19 '20 18:12 DrChainsaw

Btw, I think it would be better to try to remove ONNX.jl from the general registry and use a name like OnnxFlux.jl to clearly state that it translates between ONNX and Flux.

Unfortunately we can't remove packages from the registry. But if ONNXFlux.jl makes more sense, then we can just archive the ONNX.jl repo.

darsnack avatar Dec 19 '20 19:12 darsnack

I don't think it's unreasonable to expect anyone looking to use transformer layers to use Transformers.jl. One potential reason for PyTorch to add them is that there is no canonical library for transformers in that ecosystem (or really for any other domain...).

RE ONNX, why not give that repo name over to ONNXMutable and then consider how best to refactor/reorganize? I highly doubt anyone is using the existing functionality, given that it's broken on most recent versions of Julia that Flux supports.

RE XLA, I presume this is covered by the work Keno and Tim are doing? Not sure if there's a link to any details there.

ToucheSir avatar Dec 19 '20 21:12 ToucheSir

Regarding embeddings, although I haven't dealt with the potential caveats from weight norm and such, are there challenges I'm overlooking compared to doing a fairly trivial matrix indexing? Example:

using Flux: @functor, glorot_uniform

# a minimal embedding layer: the table is an (out × vocab) matrix and the
# forward pass is plain column indexing, e.g. Embed(1000, 16)(rand(1:1000, 32)) is 16×32
struct Embed{T}
    w::T
end

@functor Embed
Embed(in::Integer, out::Integer; initW=glorot_uniform) = Embed(initW(out, in))
(m::Embed)(x::AbstractVector) = m.w[:,x]

jeremiedb avatar Dec 19 '20 23:12 jeremiedb

My understanding is that the trivial indexing triggers scalar indexing on GPU arrays. Transformers.jl has custom implementations for both CPU and CUDA, so in that sense the hard work is already done.
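
For illustration, a sketch of the gather-based lookup (assuming an NNlib version that provides gather, as discussed further down in this thread):

using NNlib

W   = rand(Float32, 16, 1000)   # (embedding dim, vocab size)
idx = rand(1:1000, 32)
NNlib.gather(W, idx)            # 16×32, same result as W[:, idx] but GPU-friendly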

ToucheSir avatar Dec 20 '20 01:12 ToucheSir

Something else I'd like to submit for consideration is an equivalent to the upcoming LazyModuleMixin. Not a 1-1 port, but some mechanism to avoid specifying intermediate sizes during model construction.

ToucheSir avatar Dec 20 '20 03:12 ToucheSir

Are Embeddings something of general utility besides Transformers, worth moving to Flux.jl?

cc @chengchingwen @jeremiedb @ToucheSir

CarloLucibello avatar Dec 20 '20 10:12 CarloLucibello

My understanding is that the trivial indexing triggers scalar indexing on GPU arrays. Transformers.jl has custom implementations for both CPU and CUDA, so in that sense the hard work is already done.

Is that gather similar to GeometricFlux's one? Worth having it as a primitive in Flux.jl or CUDA.jl? @yuehhua

CarloLucibello avatar Dec 20 '20 10:12 CarloLucibello

Flux is lacking attention modules. That would be good to have (and PyTorch does have it).

bhvieira avatar Dec 20 '20 15:12 bhvieira

Is that gather similar to GeometricFlux's one? Worth having it as a primitive in Flux.jl or CUDA.jl?

Note that there's also a very similar implementation in ScatterNNlib (gather, scatter, and their gradients). It would be great to have them in NNlib and CUDA so other packages (like Avalon of my own) could use them.

dfdx avatar Dec 20 '20 15:12 dfdx

the trivial indexing triggers scalar indexing on GPU arrays

I recently used this approach for embedding and can confirm good performance on GPU; maybe there have been recent improvements in CUDA.jl explaining why it doesn't resort to scalar operations. A benchmark against Transformers.jl would be interesting though.

jeremiedb avatar Dec 20 '20 17:12 jeremiedb

@darsnack

I would appreciate the help to add pre-trained weights from anyone with a GPU.

I would want to help with that if possible, though I'm not really sure of the process. I do have access to a GPU (GTX 1080), so let me know if I can be of any help. I'll try to figure out the procedure.

gxyd avatar Dec 21 '20 05:12 gxyd

@gxyd Take a look at the PR linked above. Someone already posted a training script (I haven't had the time to check if it works). I would just ping that PR thread if you manage to get something to train.

darsnack avatar Dec 21 '20 16:12 darsnack

I think something that needs to be mentioned together with Embedding is the one-hot encoding implementation. The problem with Embedding/OneHotEncoding is to maintain semantics and composability without hurting performance on GPU. Currently the implementation of OneHotVector is not that handy, so I have a custom one-hot implementation in Transformers.jl.
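
For reference, a sketch of the two formulations being weighed here (assuming Flux's onehotbatch and an NNlib gather are available):

using Flux, NNlib

W = rand(Float32, 16, 1000)        # embedding table
x = rand(1:1000, 32)               # token indices
W * Flux.onehotbatch(x, 1:1000)    # one-hot matmul (OneHotMatrix has a specialized method)
NNlib.gather(W, x)                 # direct lookup, equivalent to W[:, x]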

I do think they are worth moving to Flux/NNlib, but there are some questions that need to be discussed. The semantics of gather/scatter in Transformers.jl and ScatterNNlib.jl are different: I follow the definition in TF, and @yuehhua follows the one in the pytorch_scatter package. The decision needs to be made before we treat them as a basic building block in Flux/NNlib.

chengchingwen avatar Dec 23 '20 00:12 chengchingwen

@CarloLucibello I would like to add Einstein summation and tensor product to the discussion list. They are quite useful in some novel model design.

chengchingwen avatar Dec 23 '20 00:12 chengchingwen

@CarloLucibello I would like to add Einstein summation and tensor product to the discussion list. They are quite useful in some novel model design.

I added them as covered by Tullio.jl. Possibly we just have to add references and examples in Flux.

CarloLucibello avatar Dec 23 '20 05:12 CarloLucibello

@chengchingwen could you open an issue here about OneHotVector's limitations?

CarloLucibello avatar Dec 23 '20 05:12 CarloLucibello

I think the issue with ONNX implementations in general isn't writing the package initially, but the additional ops that need to be added regularly. We need a solution to that problem, which is more pressing imo.

I agree we need more attention modules.

I would want to gather the related issues with upsampling

@CarloLucibello https://github.com/FluxML/NNlib.jl/pull/112/files

DhairyaLGandhi avatar Dec 23 '20 06:12 DhairyaLGandhi

Hi -

I have used Flux recently to do some NLP tasks and wrote my own (maybe terrible?) embeddings layer.

It took the table of embeddings from the Embeddings.jl package and turned a sentence of text into a matrix of embeddings given the data I was using.

With some guidance and some time I might be able to turn it into something like the PyTorch version, for a more general collection of inputs than just text. If anyone wants to take a look at what I did to evaluate how suitable it might be, the layer I wrote is here.

I'd be happy to contribute if someone could give me a bit of feedback to modify/improve what I've already done!

austinbean avatar Dec 30 '20 16:12 austinbean

@austinbean did you check the embeddings in Transformers.jl? If those are general enough and already battle-tested, we could port them here.

CarloLucibello avatar Jan 19 '21 07:01 CarloLucibello

@CarloLucibello the embeddings layer there looks similar to what I did, but much better done and more thorough!

austinbean avatar Jan 20 '21 14:01 austinbean

Who deleted Fold, Unfold, Dropout2d and Dropout3d? The reason should at least be stated.

CarloLucibello avatar Jan 30 '21 17:01 CarloLucibello