GPU Support
It's not an issue - I'm just here to say I love this.
I want to see this get to the point where it will run large transformers.
For acceleration, you might want to consider something like https://github.com/gpujs/gpu.js, maybe?
Best of luck.
This is not an area I know much about, but I would love to know more about the approach. @eduardoleao052, is this something you're thinking of using a dependency for, or building from scratch specifically for matrix operations?
@medic-code after completing the TypeScript support, I'll study the best option to add GPU support. This means that the heavy matrix multiplications will be able to be run on a GPU, speeding up the ML models.
If I may be so bold, I would suggest the following approach. I'm wondering if we can outdo Python if we build from the ground up with a focus on GPU integration and performance. My understanding is that Torch and Python mostly have mapped corresponding routines (kernels) in the GPU, so in theory, if we take a ground-up performance approach, I would like to think we can outperform classical Torch.
We add decorations inside the Torch-ish objects that allow a visitor object to assemble an entire compute pipeline, kind of a compute abstract syntax tree of what is to be executed that we can map data into.
The pipeline needs to be able to be processed as a whole or chunked along I/O and compute-friendly boundaries to allow us to optimise when running with different GPU availability.
If the whole pipeline will fit in a big fat GPU then we want it all in there. GPUs are getting bigger and fatter. Let's design for that. If we need to break it up into discrete steps with a small or multiple small GPUs, then we need to be able to do that as well.
```
┌────────────┐    ┌────────────┐    ┌──────────┐    ┌──────────────┐    ┌─────────────┐
│  Torchish  │    │  Pipeline  │    │ Pipeline │    │ Orchestrator │    │     GPU     │
│  Objects   ├───►│  Assembler ├───►│   AST    ├───►│              ├───►│   Gateway   │
└────────────┘    └────────────┘    └──────────┘    └──────────────┘    └─────────────┘
```
- Torchish Objects: the executing objects.
- Pipeline Assembler: takes the planned execution and turns it into the Pipeline AST.
- Pipeline AST: a representation of what needs to be computed.
- Orchestrator: takes the pipeline and determines how it will be executed against the GPU.
- GPU Gateway: takes chunks of the AST and data and executes them on the GPU.
Make use of `ReadableStream<X>`, `WritableStream<X>` and `TransformStream<X>` for compatibility with browser and Node, and to keep everything async. We need to stream everything to minimise memory usage: we need to be able to stream data and models from storage into the GPU, and data back out, without maxing out V8's limited memory.
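A minimal sketch of that shape, assuming an environment where the Web Streams globals exist (modern browsers, Node 18+): a compute step is just a `TransformStream` that chunks are piped through, so nothing has to be fully materialised in memory.

```ts
// Minimal sketch: a compute step as a TransformStream over Float32Array
// chunks, so data flows source -> compute -> sink one chunk at a time.
const scaleByTwo = new TransformStream<Float32Array, Float32Array>({
  transform(chunk, controller) {
    controller.enqueue(chunk.map((v) => v * 2))
  },
})

// `source` and `destination` are hypothetical endpoints; any
// ReadableStream/WritableStream pair works:
// await source.pipeThrough(scaleByTwo).pipeTo(destination)
```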
As a kind of proof of concept, and possibly even the solution, for version 0 we take the AST for the pipeline and build a function that can be passed into https://github.com/gpujs/gpu.js. Let it do the heavy lifting for now. If we need to do something else in the future, we can replace this step with something else and keep the rest of the design and implementation.
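To get a feel for what gpu.js would be doing for us, this is roughly the matrix-multiply kernel from its README, written with the TypeScript typings it ships (the 512 sizes are purely illustrative):

```ts
import { GPU, IKernelFunctionThis } from 'gpu.js'

const gpu = new GPU()

// gpu.js transpiles this function body into a GPU kernel.
function matMul(this: IKernelFunctionThis, a: number[][], b: number[][]): number {
  let sum = 0
  for (let i = 0; i < 512; i++) {
    sum += a[this.thread.y][i] * b[i][this.thread.x]
  }
  return sum
}

const multiplyMatrix = gpu.createKernel(matMul).setOutput([512, 512])

// const c = multiplyMatrix(a, b) // one call, heavy lifting on the GPU
```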
@metawrap-dev I like the spirit, and you clearly have a great understanding of hardware manipulation on Pipelines. I think we could make it work, and adding at least the gpu.js is a clear first step in my opinion. If a simple GPU acceleration is put in place, it could already allow a much wider range of applications for the library.
Thanks!
To give a toy example: add(mult(w,x), mult(y,z))
We really want to build an object tree that represents the promise to execute this.
That way, instead of passing (w * x) and then (y * z) and then waiting for the result of these and then passing r1 + r2 to the GPU, we could more efficiently pass (w * x) + (y * z) in one go and get one value back.
So we go from
```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: w * x
    GPU->>JavaScript: r1
    JavaScript->>GPU: y * z
    GPU->>JavaScript: r2
    JavaScript->>GPU: r1 + r2
    GPU->>JavaScript: a
```
to
```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: (w * x) + (y * z)
    GPU->>JavaScript: a
```
This will work efficiently when we pass a torch-ish object into another's method as a parameter.
Not so much if we pass in just a scalar/primitive, unless we make a special class, e.g. a Number class, that would allow a whole tree to be linked. We would, of course, make it only as granular as is useful (no point in building the internal AST of a matrix operation).
If the library is skewed towards being able to build the largest connected Pipeline tree, then we have the ability to pass the whole intention into the GPU and optimizers and have them work their magic.
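To make the toy example concrete, here is a minimal sketch of such a promise-to-execute tree (all names hypothetical, not the library's API). Building the tree does no arithmetic; a visitor can then transpile the whole intention into a single expression for the GPU.

```ts
// Hypothetical lazy expression tree: constructing it computes nothing.
type Expr =
  | { kind: 'leaf'; name: string }
  | { kind: 'add' | 'mult'; left: Expr; right: Expr }

const leaf = (name: string): Expr => ({ kind: 'leaf', name })
const add = (left: Expr, right: Expr): Expr => ({ kind: 'add', left, right })
const mult = (left: Expr, right: Expr): Expr => ({ kind: 'mult', left, right })

// A visitor turns the whole tree into one GPU-side expression.
function transpile(e: Expr): string {
  switch (e.kind) {
    case 'leaf':
      return e.name
    case 'add':
      return `(${transpile(e.left)} + ${transpile(e.right)})`
    case 'mult':
      return `(${transpile(e.left)} * ${transpile(e.right)})`
  }
}

transpile(add(mult(leaf('w'), leaf('x')), mult(leaf('y'), leaf('z'))))
// => "((w * x) + (y * z))" (one round trip instead of three)
```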
Cool! I could see that working, building upon the Tensor and Module objects. The Module.forward() method is always spelled out before the model is actually run. Then we could build this "graph" with all the operations, and send them in parallel to the GPUs!
I'm currently finishing adding TypeScript support to the library, and will push it to main soon (currently it's in the develop branch).
Would you by any chance be willing to work on a PR, so that we can try to implement the GPU.js support? I'd love to take on that challenge!
Sure, once you complete the TypeScript support! I do a lot of prototyping in TypeScript, so it's my happy place.
In the meantime, I will look into implementing the toy example above and try out some design ideas.
These two features in gpu.js are likely going to be load-bearing for us.
https://github.com/gpujs/gpu.js/blob/develop/README.md#combining-kernels https://github.com/gpujs/gpu.js/blob/develop/README.md#create-kernel-map
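For flavour, combining kernels looks roughly like this in the gpu.js README (treat this as a sketch of the API, with illustrative sizes, rather than our final usage):

```ts
import { GPU, IKernelFunctionThis } from 'gpu.js'

const gpu = new GPU()

function addFn(this: IKernelFunctionThis, a: number[], b: number[]): number {
  return a[this.thread.x] + b[this.thread.x]
}

function multiplyFn(this: IKernelFunctionThis, a: number[], b: number[]): number {
  return a[this.thread.x] * b[this.thread.x]
}

const addKernel = gpu.createKernel(addFn).setOutput([20])
const multiplyKernel = gpu.createKernel(multiplyFn).setOutput([20])

// combineKernels fuses the steps so the intermediate (a + b) stays on
// the GPU instead of round-tripping through JavaScript.
const superKernel = gpu.combineKernels(
  addKernel,
  multiplyKernel,
  function (a: number[], b: number[], c: number[]) {
    return multiplyKernel(addKernel(a, b), c)
  },
)

// superKernel(a, b, c) computes (a + b) * c elementwise in one pass.
```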
Some notes here for things I want to investigate.
- How to combine async/Promise and direct reference. (P,R) or (R{P})
- Data vs Function - how to model so we maximise "streamability"
- How do we stream "big" data in the pipeline? Chunk data along with execution nodes? Transform pipeline like map/reduce repeating F with D1...DN then combining (where)?
- How to chunk the pipeline graph? Assuming not chunking into big echoey GPU is the easy case. Pinch the pipe at the "thinnest" place. Define "thinnest" with the dimensions we have.
- Future - Is it possible to extend gpu.js to support "Persistent Kernels"
- Object reference equivalence optimisation?
- Object hash equivalence optimisation? (overkill)
- It feels like being monad-like is going to play a big part in this working, so make everything "complete".
I seem to have a proof of concept for item 1 on my list.
I can build a toy pipeline graph, display it unresolved and resolve it. I will put this up in a repository tomorrow.
Tomorrow, let's see if I can pass some of it into gpu.js.
That's awesome, looks like a great start! Btw I'll merge the TypeScript branch into main today, I'll let you know.
@metawrap-dev Merged the TS into the main branch, but there are still some mistakes, I'll solve them today.
Thanks. I will take a look. If you have any typescript questions I'm happy to help there as well.
I'm going to continue in a different repository until I have the major questions I outlined above answered.
Today I'm going to tackle the streaming and batching big-data question. Big as in more than can fit into memory. Goal (i) will be to have a toy version of a model load from disk, with chained execution operations, and then output to disk.
I want to get to the point where execution is triggered only when we want to save the result. So, complete lazy loading and execution, with the ability to inspect the pipeline graph.
After that, goal (ii) will be to have a combination of granularities, so that some operations are handled by a pre-positioned kernel and some are woven together with permutations in code. The latter will be handled by less granular code generation.
At that point it will be possible to create a visitor class that can analyse and chop/chunk and transpile the pipeline graph.
I can probably only get some of (i) done today given the time I have as just coding up the testing and plumbing is going to be a majority of the work.
Cool! If you want any help at some point, just let me know. About the TypeScript, the only issues left to solve are a couple of `any` types and unused variables. Just a matter of keeping the code cleaner.
If you can avoid `any`, it would be a Good Thing™️.
If you don't yet know the type at that point in the flow of execution and it will be resolved later on, then using `unknown` rather than `any` is preferred.
If it is just a case of trying to get things to match, and `any` is used to get past that, then I can definitely help with that.
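A quick illustration of the difference (a hypothetical snippet, not code from the library):

```ts
// `any` silently disables checking; `unknown` forces narrowing first.
function parseConfig(json: string): unknown {
  return JSON.parse(json)
}

const config = parseConfig('{"learningRate": 0.01}')

// config.learningRate  // compile error: 'config' is of type 'unknown'

if (typeof config === 'object' && config !== null && 'learningRate' in config) {
  // Narrowed: now it is safe to assert the shape we just checked for.
  console.log((config as { learningRate: number }).learningRate)
}
```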
Here is my work so far.
https://github.com/metawrap-dev/streaming-compute-poc
I'm currently removing all the anys, don't worry!
Most can be corrected simply; if there's some syntax on a specific part that I just don't know, I'll let you know. Thanks for offering the help! I took a look over the code so far, really interesting stuff. I'll actually read it properly tomorrow, but from what I understand you're implementing a Number class.
Will this in the future compose a Tensor class for tensor operations?
The DataNumber class is just a start and just a minimalist toy model. Yes, higher order objects like tensors will be modelled the same way. I'm just starting very small as there is a lot of type semantics to work out to make everything composable before I move onto something more complex.
I've updated the repository. I've also implemented some more complex composition involving sources. However, I'm still only using primitive number types and the multiply operation for now. My goal is to get the streaming working before I start doing anything more complex or more granular.
In-memory Source and Destination were added. I haven't yet tackled I/O and streaming, but the functionality I have so far models a lot of the problem and solution in memory.
E.g. in one test case,
```
{DestinationMemory(0 stored, 1 in buffer, 5 batch size) <= [( multiply {SourceMemory(4 elements, 0 index, 1 batch size) <= [10,10,10,10]} => unresolved )]=>[]}
```
resolves to
```
{DestinationMemory(1 stored, 0 in buffer, 5 batch size) <= []=>[10000]}
```
This is, in effect, loading, computing, and saving the result.
We get to see the pipeline graph before it is resolved.
I'm not sure if I should strictly type these or use the "sausage machine principle" and assemble the correct type from arbitrary length and format input arrays.
`resolve()` is going to be what invokes any transpilation. At the moment I am just executing everything on the CPU.
Optimal batch sizes can be discovered on the fly. At the moment, they are there to force the pipeline to execute when data limits are reached, to emulate a shadow of some kind of real-world behaviour. I'm thinking of some back-pressure signalling up from each kernel and from the GPU in general.
When complete and if it can be demonstrated to be usefully performant, the goal will be to take the learnings from my repository and transfer them, in style, to yours.
If it all plays out well, we should have something that could, in theory, outperform Python, or at least be within striking distance.
The next step is creating some actual disk I/O-driven Source and Destination implementations.
After that, some matrix operations. One could be implemented as a granular kernel, and the other could be built up from primitive operations. That should allow some serious testing with code generation.
I'm super busy until the end of the week, but I will see what I can do in between things.
Gathering some more thoughts for (i)
I'm close to being able to serialise and deserialise the execution state. Source and Destination batch themselves now. What is missing is generalising this to the Compute elements. I'm going to tease out the internal run-state of each element into something common that can be marshalled.
What will that get us?
- Batching and chunking of execution where possible.
- Execution back-pressure signalling on the running pipeline can self-optimise.
- Able to freeze and store long-running execution to storage and later on thaw and continue execution from where it left off.
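On the freeze/thaw point, a hypothetical sketch of what a common, marshallable run-state could look like (invented names, not the repo's actual types):

```ts
// Hypothetical: per-element run-state, teased out so it can be
// serialised (freeze) and restored (thaw) mid-execution.
interface ElementState {
  id: string
  index: number      // how far through the input stream we are
  batchSize: number
  buffer: number[]   // values received but not yet flushed
}

interface Freezable {
  saveState(): ElementState
  loadState(state: ElementState): void
}

// Freezing a whole pipeline then reduces to serialising every element.
function freeze(elements: Freezable[]): string {
  return JSON.stringify(elements.map((e) => e.saveState()))
}
```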
I'm also considering adding an unrolling strategy as the same kind of parameter as batch size. With this, the system that is trying to optimise the pipeline graph will also be able to experiment with unrolling.
And that implies that metrics are going to be needed.
The optimising system observing the execution needs to be able to get metrics at each point in the pipeline and maintain some previous data points to make a comparison. Enough to give it the ability to converge, but not too much that it overwhelms the task at hand.
- Common `Strategy` object? Something that can be introspected and modified, even just using dumb random search. The element can take a strategy and apply it to `State`.
- Common `State` object? Something that reflects the concrete application of `Strategy`, or separate `Config` for this? State = Runtime + Config?
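As a sketch of one way that split could land (hypothetical shapes; the experiments will decide the real ones):

```ts
// State = Runtime + Config, with Strategy as the knob an optimiser turns.
interface Config {
  batchSize: number   // fixed for a given run
}

interface Runtime {
  index: number       // mutates as the element executes
  buffer: number[]
}

interface State {
  config: Config
  runtime: Runtime
}

// Introspectable and modifiable, even by dumb random search.
interface Strategy {
  batchSize: number
  unrollFactor: number
}

// An element takes a Strategy and applies it to its State.
function apply(strategy: Strategy, state: State): State {
  return { ...state, config: { ...state.config, batchSize: strategy.batchSize } }
}
```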
Implemented `Strategy`, `Config`, and `State` across all the elements as described above. `Strategy` still needs to be put into practice, so what is there is a placeholder for now.
So close to 100% :)
Tentative flow control, so that operations will asynchronously block until data becomes available.
It is time to add more complex Data and Compute elements to refine the design, as single-number flow seems done. I've exhausted what I can do with these now.
The goal is to make sure that execution only occurs after resolve, but there are situations where populating Source or Destination will start to trigger flows.
The design looks like it is holding together and will do what I want. More complex objects should now shake out any bugs or design issues in the way I want data to flow and obey batch/chunk sizes. It should also allow me to test the roll/unroll strategy ideas, as I need a Compute element with at least a few layers of execution to support that.
This is really solid groundwork. I've been looking through the repo, and it's impressive stuff. I think adding more complex objects is the next big step. If you need any help, let me know!
Thanks!
Yes, now that I have the flow and semantics thought out and lazy execution working, more complex objects are now needed.
Just thinking some things out loud again here.
Code-Gen/Optimizer:
- Just take the pipeline graph (PG), break it into sub-graphs of the right size and transform each sub-graph into a GPU kernel.
- "Where" it is broken up will be affected by
Strategy. At a point that maximises throughput. Treat the GPU as I/O. - "How" we size.. Good question.
- If the same PG "shape" is being built, executed and then torn down over and over, then how do we associate it with past metrics?
- Need a hash that bubbles up to the root that reflects the underlying execution pattern. Each identical sub-graph will have the same hash. Is it worth it? We'll see. Likely yes. (See the sketch after this list.)
- Add metrics now. It's too important to leave until later.
- How do we cleanly associate optimiser with compute element without using globals?
- Some `Compute` elements will have a 1:1 correspondence with inbuilt (hardware ops) and pre-positioned (software) kernels. Nothing inside. Just an Element.
- Some `Compute` elements will have control over unrolling based on `Strategy`. The built PG will depend on `Strategy`.
- Can we model flow-based/persistent computation in the design now? Build a pipeline where you feed data into a `Source` and it feeds into a `Destination`? Reset vs revisit `resolve()`/resolved semantics (self-clocking on batch). We may need to mock behaviour for now, but it is important that this is designed in at ground level now. Two modes: complete reset + cumulative.
- Naive vs easy number packing? Is it better to store the matrix in a 1D or 2D array if the whole flow from coder to GPU is taken into account? Not obvious. Measure.
- How much will the optimizer handle, and how much will we need to pre-empt? E.g. data types: UB,U... UV,V data types? Do we need to be aware, or can we assume the GL optimizer will handle it? Tackle this after v1.
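On the hash idea above, a hypothetical sketch of a shape key that bubbles up from the leaves (a real implementation would digest the string rather than return it raw):

```ts
// Hypothetical: a structural "shape" key for a pipeline graph node.
// Identical sub-graph shapes yield identical keys, so metrics from
// past runs can be associated with a recurring shape.
interface PGNode {
  op: string
  children: PGNode[]
}

function shapeKey(node: PGNode): string {
  // Reflects the execution pattern only; data values are ignored.
  return `${node.op}(${node.children.map(shapeKey).join(',')})`
}

// e.g. for add(mult(input, input), mult(input, input)):
// "add(mult(input(),input()),mult(input(),input()))"
```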
Mock Compute and Data elements.
- DataVectorN
- DataMatrixMxN
- ComputeDotN (Dot Product)
- ComputeDot4GPU (Dot Product GPU)
- ComputeAverage (Average Pooling)
- ComputeAverage4x4GPU (Average Pooling) (Hardware Op)
- ComputeMultiplyMatrix
- ComputeMultiplyMatrix4x4GPU
This should be enough. The plan for today is just to get the new mock data elements in. Probably all I have time for.
I suspect the design won't survive the first data element without changes, but here goes. : )
DotProduct4
A lot to unpack here.
Great success.. but..
From a flow aspect, there are several ways to compute dotN. Not all of them are obviously nonsense. This is also going to be the same for all operations.
1. Supply two vectors. Output a single number. OK.
2. Supply an array of two vectors, batch compute and return a vector. (OK... makes sense... we want that.)
3. Stream in two huge vectors of known same length. Stream compute. Output a single number. (Nice. See the sketch after this list.)
4. Same as 3, but we arbitrarily decide when to cut the sausage and output a number. (Can't really think of a use case.)
5. Hold my beer. Stream in a stream of huge vectors. Stream compute. Stream out a vector of numbers. (Very nice.)
6. Same as 5, but we arbitrarily decide when to cut the sausage and output a number. (Can't really think of a use case, or how, unless we signal end of vector somehow?)
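As flagged in option 3, a minimal sketch of the streaming dot product, assuming two async iterables of equal-length chunks (names hypothetical):

```ts
// Stream two huge vectors chunk by chunk; accumulate a single number.
async function streamingDot(
  a: AsyncIterable<Float32Array>,
  b: AsyncIterable<Float32Array>,
): Promise<number> {
  let sum = 0
  const bIter = b[Symbol.asyncIterator]()
  for await (const chunkA of a) {
    const { value: chunkB } = await bIter.next()
    if (!chunkB || chunkB.length !== chunkA.length) {
      throw new Error('vectors must stream equal-length chunks')
    }
    for (let i = 0; i < chunkA.length; i++) sum += chunkA[i] * chunkB[i]
  }
  return sum
}
```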
This adds a lot of fragile complexity in having generalized arguments.
I'm wondering if I can create `IStreamify<T>` so as to formally transform a non-streaming version into a streaming one. Otherwise, it's obvious I am going to end up in array hell. (Is it a matrix, or an array of vectors, to a streaming output of vectors, etc.?)
Time to dive deep into some of the latest TypeScript features.
The plan:
- `ISourceable<T>`: takes a `Compute` element and adapts it for streaming inputs in from a source. May require TS type magic or specialised classes for each element.
- `IDestinationable<T>`: takes a `Compute` element and adapts it for streaming outputs out to a destination.
I can probably create some generalised versions of these and, worst case, we end up with a pair per Compute element, but it will move the complexity of the different streaming styles into these distinct adapter objects. (A Good Thing™️)
Same approach with `IPersistent` for creating persistent kernels: same as above, but for the whole pipeline graph.
I tried the above. It works. I ended up with the more sensibly named `IStreamer`, as we don't seem to need one for `Destination`, and what it in effect does is stream the data in.
The new method simplifies the code, because Compute elements now don't need to be aware of how to batch and don't need to deal with argument overloading with the `ISource`.
```ts
const streamer = new Streamer<v4, number>(
  new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]),
  new ComputeLength4<v4>(),
)
```
This creates a pipeline graph of
```
{Streamer{SourceMemory(4 elements, atoms 4, 0 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
=> {ComputeLength4{SourceMemory(0 elements, atoms 0, 0 index, 1 batch size)
<= []}=>unresolved}}
```
which resolves to
```
{Streamer{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
=> {ComputeLength4{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}=>{DataNumber
<= 2}}}
```
The alternative, original form is
```ts
const compute = new ComputeLength4<v4>(
  new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]),
)
```
where we pass in the source just like any other parameter.
For multiply, we would get something like
```
(multiply{SourceMemory(3 elements, atoms 3, 0 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 2, 0 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10}]}=>unresolved)]}
=> unresolved)
```
...which resolves to
```
(multiply{SourceMemory(3 elements, atoms 0, 3 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 0, 2 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10}]}=>{DataNumber <= 100})]}
=> {DataNumber <= 10000})
```
The advantage of either approach may only become clear when it comes to code generation.
So I will sit on it for a while as I try more complex Data and Compute elements combinations and then make a decision when I move to code generation.
TypeScript's type system is doing some heavy lifting here.
```ts
type v4 = Vector<number, 4> | DataVectorN | (number | IData<number>)[]

const source = new SourceMemory([1, 1, 1, 1], new DataVectorN([1, 1, 1, 1]), [1, new DataNumber(1), 1, 1], [1, 1, 1, 1])
const compute = new ComputeLength4()
const streamer = new Streamer<v4, number>(source, compute)
```
Some more thoughts.
Streaming in a way that engages elegantly with TypeScript's type system but remains performant: there are a few ways to go.
The main issue is argument cardinality, and the marshalling stage into a Compute element with different argument sizes.
Either the Streamer needs to know how many arguments it is dealing with, or the Compute element needs to know.
`Source(2) => Streamer(2) => Compute(2)`

`ICompute<I, N, O>`

This needs to be expressed in the type; the type system will flag a mismatch, so N needs to propagate properly. It will be used to drive `Vector<N>`, but we also want it as a cardinal for use at runtime.
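A hypothetical sketch of carrying N in the type while keeping it as a runtime cardinal (the intersection trick is just one way to model `Vector<T, N>`):

```ts
// A fixed-length vector: an array whose length is pinned to N in the type.
type Vector<T, N extends number> = T[] & { length: N }

// N lives in the type (mismatches fail to compile) and as a runtime
// value (so a Streamer knows how many arguments to marshal per call).
interface ICompute<I, N extends number, O> {
  readonly inputCardinality: N
  evaluate(args: Vector<I, N>): O
}

class ComputeDot implements ICompute<number[], 2, number> {
  readonly inputCardinality = 2
  evaluate([a, b]: Vector<number[], 2>): number {
    return a.reduce((sum, v, i) => sum + v * b[i], 0)
  }
}
```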
The code is a bit of a mess right now, because I have a few ideas in play simultaneously and not all of them will be the winner.
Streamer vs no Streamer: Compute input becomes Vector vs Source.
Sausage Machine vs Structured.
Sausage machine (or some variant) may actually be the best way forward and the most flexible.
This is all about being able to chunk up code and data into big blobs that we can blat into the GPU without a lot of back and forth. Ultimately all the design decisions are driven by that goal.
@metawrap-dev it's really fascinating to see your thought process. I've just been wondering whether you have any ideas for a better way to document this other than an open issue?
For example, maybe an open PR with a README.md, or a wiki page, or something? It's okay if you feel this way of documenting your thoughts is fine. I just imagine that after a while you might want to collect your thoughts into a summary and rationale so far. I realise that this is still part of discovery as you go, and you may want to postpone documenting in a more formalised way. Just some thoughts.
Yep, eventually. At the moment, I'm still experimenting. I assumed here was good because others could chip in. It just kind of happened organically, as I write down most of these thoughts while mobile and then refer back to them as I work, so it's very convenient for me.
I will gather my thoughts elsewhere.
I'm definitely not suggesting you should gather your thoughts elsewhere outside of GitHub; as I said, it's very useful to see your thought process and it allows others to contribute. I would imagine that you may find trying to scroll back through more than 29 posts a little more challenging as you keep documenting, that is all.