DL4S
You need some help?
Hey there @palle-k, I really love this project and I think it is really awesome! I'd really like to help out and contribute, but DL4S seems pretty complete to me. Is there anything that I could help with? I'm really familiar with Swift, and I'm a bit new to open source.
Hey! Thank you for your interest in contributing to this project. I really appreciate it.
The biggest missing thing in DL4S is GPU support. The experimental ArrayFire implementation that I built is far from ideal, initial performance benchmarks are not great and ArrayFire uses OpenCL, which doesn't run on iOS and isn't well maintained on macOS.
Ideally, this library would have a way to compile dynamic compute graphs using a combination of Metal Performance Shaders Graph (MPSGraph) or ML Compute operations and manual Metal code generation (so basically a neural network compiler operating on dynamic graphs).
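To make that concrete, such a backend would lower each tensor operation onto graph nodes, roughly like this (illustrative only; no such backend exists yet):

```swift
import MetalPerformanceShadersGraph

// Illustrative only: what lowering y = relu(x • w) onto MPSGraph could
// look like. Each DL4S operation would map to one or more graph nodes,
// and the resulting graph would be compiled and cached for reuse.
let graph = MPSGraph()
let x = graph.placeholder(shape: [2, 3], dataType: .float32, name: "x")
let w = graph.placeholder(shape: [3, 4], dataType: .float32, name: "w")
let y = graph.reLU(
    with: graph.matrixMultiplication(primary: x, secondary: w, name: nil),
    name: nil
)
```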
Besides that, there are a few missing features, and the convolution and pooling operations need some performance tuning.
Missing features besides GPU support that come to mind right now are:
- Interfacing with other deep learning frameworks by allowing the import and export of ONNX models.
- 1D convolutions, dilated convolutions
- A simple method for loading audio data into tensors
- An easy way to automatically load images from directories to be used as training sets by using folder names as labels
- Image resizing as tensor operations
- A dedicated library for working with text (text tokenization; vocabulary creation, saving and loading; batched encoding, etc.)
- A generally applicable broadcast-to-shape operator
- Support for boolean/uint8 tensors with corresponding boolean ops (and, or, not, xor) and the ability to generate them through the comparison of other (float, int, ...) tensors (==, !=, <, >, <=, >=, ...)
- Better tensor indexing: Tensors should be indexable through boolean masks, index tensors, ranges, and integer indices (and especially combinations thereof), similar to what is possible in NumPy.
- Operators: reduce product, cumulative sum, cumulative product, ... (PyTorch and TensorFlow have a lot of operators that DL4S does not have right now)
- Support for packed sequences for batching variable-length sequences in RNNs (LSTM, GRU, vanilla RNN, bidirectional RNNs)
- Support for non-max suppression for object detection
- A bigger model zoo (more networks for vision tasks, networks for end-to-end audio processing or preprocessing operators)
- A way to quickly suppress gradient computation, like PyTorch's with torch.no_grad(): (a hypothetical sketch follows below)
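For that last item, a hypothetical Swift shape for such an API could be a scoped function. Neither withoutGradient nor GradientRecordingContext exists in DL4S; they only illustrate the idea:

```swift
// Hypothetical sketch - neither withoutGradient nor
// GradientRecordingContext exists in DL4S today.
enum GradientRecordingContext {
    // A flag that tensor operations would consult before recording
    // themselves into the compute graph. A real implementation would
    // make this thread-local.
    static var isEnabled = true
}

func withoutGradient<Result>(_ body: () throws -> Result) rethrows -> Result {
    let previous = GradientRecordingContext.isEnabled
    GradientRecordingContext.isEnabled = false
    defer { GradientRecordingContext.isEnabled = previous }
    return try body()
}

// Usage: nothing inside the closure is recorded for backpropagation.
// let prediction = withoutGradient { model(input) }
```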
There's really a lot of stuff that can be added to DL4S. Let me know if you want to work on any one of these or if you find anything else missing that you want to add.
I think I'll first start small with 1D convolutions and dilated convolutions. Also, in the future, should I open an issue with a proposal of the changes I plan to make before creating a pull request, or should I just create a pull request directly?
Great! You can directly create a PR, no need to create an issue.
You can use the existing code that I wrote for 2D convolutions (im2col/col2im) and introduce the dilation factor. Please make sure to add a few test cases to ensure that your code behaves as expected and that performance is good.
The THNN library behind PyTorch has an implementation with dilation here.
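To illustrate where the dilation factor enters an im2col-style gather, here is a 1D sketch (names are illustrative, not actual DL4S internals):

```swift
// With dilation d, the kernel samples every d-th input element; d = 1
// reproduces a standard convolution.
func im2colIndices1D(
    inputLength: Int, kernelSize: Int, stride: Int, dilation: Int
) -> [[Int]] {
    // The kernel's effective extent grows with dilation.
    let effectiveKernel = (kernelSize - 1) * dilation + 1
    let outputLength = (inputLength - effectiveKernel) / stride + 1
    return (0 ..< outputLength).map { out in
        (0 ..< kernelSize).map { k in
            out * stride + k * dilation  // the only dilation-specific change
        }
    }
}

// Example: length 7, kernel 3, stride 1, dilation 2
// -> [[0, 2, 4], [1, 3, 5], [2, 4, 6]]
```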
Awesome! Thanks for the tip, I'll get started working on it!
I'm very happy to see that people still want to push Swift forward in writing deep learning algorithms (even after Google stopped the S4TF project). 🙂
I definitely want to help! Check out ARHeadsetKit - it makes heavy use of GPGPU and is entirely in Swift. I have an aspiration to revive Swift for TensorFlow, and I had no idea this repo existed!
I'm currently working on tutorials and publicity for ARHeadsetKit, but several months down the road, I think this would be an epic way to delve into machine learning!
Lack of (real) enthusiasm and the inability to use S4TF for everyday tasks were the reasons it failed. I'm sure we can change its fate and get something like Tensor<Half, GPU> running on an iPhone!
MTLLayeredBuffer will be the key to making this possible!
Thanks for the interest. As my time is pretty limited right now, I am actually quite happy to accept additional maintainers for this project. Let me know if you are interested. Metal support was always on my list, though my experimental Metal implementation (feature/metal) isn't up to the task and I unfortunately had to shift my focus to other projects.
There is definitely room to improve on your Metal support. Just scrolling through the only shader file, I can see a lot of areas where the code can be cleaned up. I have a thousand hours of experience using Metal for GPGPU, and I may be the only person who can pull this off.
Again, this is months in the future. However, I might just need a break from AR some day and start exploring your repo.
Also, do you have Google Cardboard?
If you need anything, let me know.
And yes, I do own a few Cardboards.
You could be the first person outside my home to ever experience ARHeadsetKit on Google Cardboard! Would you mind downloading AR MultiPendulum?
App Link: https://apps.apple.com/app/ar-multipendulum/id1583322801
Unless I'm mistaken, this would make DL4S the first ML scripting API besides Python TensorFlow 2.5 to have hardware acceleration on a Mac, and the first ever for iOS.
I've been following this project too. It has already made good progress in terms of the constructs of DL algorithms. And yes, for the open-source world, it will be the first DL library with Metal support on macOS and iOS.
But Apple has a few Metal-accelerated DL frameworks, like Metal Performance Shaders, Metal Performance Shaders Graph, and ML Compute. And I've worked heavily with the Metal & MPS Graph frameworks. Lol.
By the way, about me: I authored the Deep Learning with Swift for TensorFlow book this year. I wanted S4TF to succeed (especially Swift in the DL world). Although I don't have much time to contribute, I'd love to make contributions whenever possible!
One suggestion about API design: unlike S4TF, where you have to specify the generic type during initialization, you could follow a much more elegant NumPy-style data type specification (which provides a dtype argument). The same approach is used in MPS, where the argument is named dataType (an enumeration type).
Regards, Rahul Bhalley
> Unless I'm mistaken, this would make DL4S the first ML scripting API besides Python TensorFlow 2.5 to have hardware acceleration on a Mac, and the first ever for iOS.
There have been other Metal-accelerated libraries, like Serrano. Though none of the libraries I found have a particularly Swifty API or support the dynamic behaviors that DL4S does.
Also, there are neural network compilers that target Metal, like TVM.
> One suggestion about API design: unlike S4TF, where you have to specify the generic type during initialization, you could follow a much more elegant NumPy-style data type specification (which provides a dtype argument). The same approach is used in MPS, where the argument is named dataType (an enumeration type).
Having a generic type solves multiple pain points:
- It eliminates some types of errors at compile time. The type system works to our advantage. Erasing the type wouldn't be Swifty.
- It allows for different overloads to be chosen at compile time. For example, subscripting with a vector of booleans (UInt8) would result in a different method call than subscripting with an index vector (Int32).
- Having the data type available at compile time allows for generic specialization, which reduces overhead (which is one of the key reasons why DL4S outperforms other libraries on the CPU in certain models)
- The data type of a tensor conveys information about its role to the developer. A float tensor is used in other ways than an integer tensor.
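As a toy illustration of the overload point (the types and functions below are deliberately simplified and not DL4S's actual API):

```swift
// Toy sketch: because the element type is a generic parameter, the
// compiler picks the right overload statically instead of branching
// on a runtime dtype tag.
struct Tensor<Element> {
    var elements: [Element]
}

// Boolean-mask overload: keeps the elements where the mask is nonzero.
func gather(_ values: Tensor<Float>, _ mask: Tensor<UInt8>) -> Tensor<Float> {
    Tensor(elements: zip(values.elements, mask.elements)
        .filter { $0.1 != 0 }
        .map { $0.0 })
}

// Index-vector overload: a classic gather.
func gather(_ values: Tensor<Float>, _ indices: Tensor<Int32>) -> Tensor<Float> {
    Tensor(elements: indices.elements.map { values.elements[Int($0)] })
}

// The call sites look identical; the compiler chooses statically.
let v = Tensor<Float>(elements: [10, 20, 30])
let byMask = gather(v, Tensor<UInt8>(elements: [1, 0, 1]))  // [10, 30]
let byIndex = gather(v, Tensor<Int32>(elements: [2, 0]))    // [30, 10]
```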
I've rarely used MPS - just for some debugging cases when authoring a shader would take more time than calling an MPSKernel. Depending on the nature of how tensors are allocated, I might end up making DL4S a wrapper over MPS. This is just speculation, though, because I haven't taken a serious look at the source code yet.
I would also add support for Float16 and UInt64/Int64 generic types for the GPU, even though they might never be used in practice. I lack knowledge on this, but there might not be MPS support for half-precision NN operations. @RahulBhalley - could you give me some insight into this, since you have worked heavily with MPS before?
Also, I snagged the “s4tf” organization name, if you want to join it just for kicks.
If there is a massive overhead to communicating with a discrete GPU on Intel Macs, would it be acceptable to have something like “iGPU” and “eGPU” device names (in addition to CPU and GPU)?
The reason is that integrated GPUs might be faster than either a discrete GPU or a CPU on some models, and developers would have an easy way to choose either one if available (e.g. Tensor<Float, GPU> -> Tensor<Float, eGPU>). If an eGPU is not available, it would act the same as just saying "GPU".
The main difference would simply be that with iGPU, certain buffers use system shared memory, while with eGPU, they are GPU-private. You might want to mix and match CPU and GPU tensors as well on certain models, which further adds the need for flexibility in workflows.
Alternatively, this might be resolved by setting a static type property on GPU (whether integrated/external is preferred) - although that isn’t thread safe and takes away from the flexibility I mentioned above. I just want to throw around ideas and see what options I have to explore a few months down the road.
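To sketch the idea (iGPU, eGPU, and prefersSharedMemory below are hypothetical, loosely following DL4S's phantom device type pattern):

```swift
// Hypothetical extension of DL4S's device phantom types. The flag
// steers whether buffers live in memory shared with the CPU
// (integrated GPUs) or in GPU-private storage (discrete/external GPUs).
protocol DeviceType {
    static var prefersSharedMemory: Bool { get }
}

enum CPU: DeviceType { static let prefersSharedMemory = true }
enum iGPU: DeviceType { static let prefersSharedMemory = true }
enum eGPU: DeviceType { static let prefersSharedMemory = false }

struct Tensor<Element, Device: DeviceType> {
    var elements: [Element]  // real storage would be device-specific
}

// Call sites choose statically; an eGPU type could fall back to the
// default GPU when no external GPU is present.
// let activations = Tensor<Float, eGPU>(elements: [1, 2, 3])
```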
> I might end up making DL4S a wrapper over MPS.
That'll be amazing.
> I would also add support for Float16 and UInt64/Int64 generic types for the GPU, even though they might never be used in practice. I lack knowledge on this, but there might not be MPS support for half-precision NN operations. @RahulBhalley - could you give me some insight into this, since you have worked heavily with MPS before?
@philipturner you should be able to add FP16 precision via MPS (see MPSDataType). I haven't used low-precision computation in MPS but you should be able to provide UInt8-based quantized computation like PyTorch.
> Also, I snagged the "s4tf" organization name, if you want to join it just for kicks.
Sure, I would love to join it.
> Having a generic type solves multiple pain points:
> - It eliminates some types of errors at compile time. The type system works to our advantage. Erasing the type wouldn't be Swifty.
> - It allows for different overloads to be chosen at compile time. For example, subscripting with a vector of booleans (UInt8) would result in a different method call than subscripting with an index vector (Int32).
> - Having the data type available at compile time allows for generic specialization, which reduces overhead (which is one of the key reasons why DL4S outperforms other libraries on the CPU in certain models)
> - The data type of a tensor conveys information about its role to the developer. A float tensor is used in other ways than an integer tensor.
Sure @palle-k.
I was just concerned that I had to write Tensor<Float> every time I initialized a variable in S4TF, so I always wrote typealias TFloat = Tensor<Float>. It just felt cleaner. I think the S4TF team was also talking about improving this syntax.
But never mind, it's fine if the generic type specification gives the compiler an optimization advantage.
I sent you an invitation to the organization!
Thanks, joined.
@RahulBhalley do you have enough time to try learning my framework ARHeadsetKit? It requires Xcode and I have 3 hours worth of tutorials so far. The people who have already tried the tutorials said they were really fun, and I think you might enjoy them as well.
Before you read this comment, note that it's monstrously large and maybe a bit overboard on detail.
I have some experience emulating double precision on GPUs, in a past attempt to accelerate MultiPendulum. The attempt failed because the dynamic range was the real problem, not precision (in the end, I couldn't even use two parallel CPU cores to accelerate the simulation because of cache incoherence). However, I gained a lot of experience with emulating higher-precision arithmetic.
Metal does not allow double-precision arithmetic, even on GPUs that support it natively. Under emulation, it would take dozens of GPU clock cycles to do a single addition or multiplication. However, devices with native double precision already run it at a fraction of single-precision throughput, so emulation might actually achieve comparable performance.
On devices with a weak GPU or workloads that underutilize the GPU, it will be faster to run double-precision arithmetic on the CPU. This would most likely be the case on the Intel Mac Mini and iPhone 8.
However, when the GPU is massive like the M1 Max or AMD 6000 series, running large-scale double-precision computations could be faster than the CPU, even under emulation. There might be some special use cases for DL4S where Tensor<Double, GPU> could benefit performance. For example, it could offload work from the CPU in a mixed CPU-GPU tensor workflow. This means there is a reason to bring Tensor<Double, GPU> to this framework.
Side Note: What are your opinions on mixing CPU and GPU tensors in DL4S? It might require redesigning some code internal to DL4S, which I haven't had the chance to study yet.
A few months down the road, I might make a Metal library for emulating double precision or quad precision on iOS and macOS GPUs. The setup for compiling the GPU shaders is very complicated (see the "Xcode Framework" option for ARHeadsetKit's configuration). So, I would post the open-source MSL code in this repository, but compile it locally and upload the .metallib file to DL4S. At runtime, DL4S would load shader functions from the library.
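The runtime-loading side would be simple. A sketch, assuming a hypothetical "DoublePrecision.metallib" resource and a placeholder function name dd_add:

```swift
import Metal

// Load a precompiled .metallib bundled with the package and build a
// compute pipeline from one of its functions. "DoublePrecision" and
// "dd_add" are placeholder names, and Bundle.module assumes the
// library ships as a SwiftPM resource.
guard let device = MTLCreateSystemDefaultDevice(),
      let url = Bundle.module.url(forResource: "DoublePrecision",
                                  withExtension: "metallib")
else { fatalError("Metal device or shader library unavailable") }

let library = try device.makeLibrary(URL: url)
let function = library.makeFunction(name: "dd_add")!
let pipeline = try device.makeComputePipelineState(function: function)
```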
The double-precision shaders would have to be compiled separately anyway. This is because Metal's fast math (-ffast-math) breaks the careful tracking of rounding errors that is crucial to emulating double precision. Furthermore, this workflow eliminates the headache of linking an ARHeadsetKit-like Metal library to DL4S.
Having support for 64-bit integers and 64-bit floating point numbers on the GPU could be a unique advantage to DL4S, since not many ML frameworks have it outside of the CPU (if any). Just having that feature might motivate people to try DL4S when they wouldn't otherwise, or make DL4S vital to some rare ML use cases.
With bfloat16, it is proven that the bottleneck for ML is dynamic range, not precision. Emulated double precision will not let you expand past 8 bits of exponent. If developers expect otherwise, they may meet some unfortunate surprises (it broke MultiPendulum).
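For reference, here is a minimal Swift sketch of the float-float ("double-float") technique in question; the real kernels would be written in MSL. Note that the error terms rely on exact IEEE rounding (which is precisely what -ffast-math breaks), and that the exponent range stays Float's 8 bits:

```swift
// A value stored as an unevaluated sum hi + lo. This roughly doubles
// Float's 24-bit significand (to ~49 bits) but keeps its 8-bit
// exponent, so the dynamic range problem described above remains.
struct DoubleFloat {
    var hi: Float
    var lo: Float
}

// Knuth's two-sum: computes a + b and the exact rounding error.
// Requires strict IEEE arithmetic; fast math optimizes `err` to zero.
func twoSum(_ a: Float, _ b: Float) -> DoubleFloat {
    let s = a + b
    let bb = s - a
    let err = (a - (s - bb)) + (b - bb)
    return DoubleFloat(hi: s, lo: err)
}

// Double-float addition: add the high parts exactly, fold both low
// parts into the error term, then renormalize.
func add(_ x: DoubleFloat, _ y: DoubleFloat) -> DoubleFloat {
    let s = twoSum(x.hi, y.hi)
    return twoSum(s.hi, s.lo + x.lo + y.lo)
}
```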
We would need to at least briefly document this behavior if I add support for Tensor<Double, GPU>. My personal repository for double-precision arithmetic would go into greater detail with edge cases and performance limitations. @palle-k, would this scenario be acceptable for DL4S?
@RahulBhalley could you make your s4tf membership public? I can't change your visibility myself.
Done.
I may also add DocC documentation to enhance the developer experience in Xcode and complement the documentation generated with Jazzy. I would also make some super awesome Xcode tutorials for this framework, like the ones for ARHeadsetKit. What do both of you think about this?
I wish that @palle-k would at least respond to say he read my comments. I have thrown several major ideas there and asked for his feedback, but all I have got is silence. I really don’t mean to be rude, but could you at least add a comment to this thread if you have the time?
@philipturner Sorry for not replying that often, I really wish I had more time to invest in this project.
All these ideas sound great. Though I guess the most important part would be getting the basic GPU computation pipeline running to the point where we can get performance comparable to other GPU accelerated frameworks.
I've also listed a bunch of things that need to be done further up this thread. These especially include improved subscripting.
Double precision is rarely used in ML. I don't have any specific use cases that would benefit from this amount of precision. However, if it is something that is important to a specific use case of yours, then feel free to implement it.
As you are familiar with GPU programming I would assume your time brings the most value there.
Thank you so much!