mesozoic-egg

Results: 20 comments by mesozoic-egg

When changing to use vm_allocate, I saw that the macOS 14 (arm) runner worked as expected, but not the macOS 12 (x86) runner. macOS 14 arm: https://github.com/mesozoic-egg/newbufferwithbytesnocopy/actions/runs/11055124079/job/30713571130 macOS 12...
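For context, a minimal sketch of the allocation path in question, assuming the usual ctypes route to libSystem (this is not the exact code under test); newBufferWithBytesNoCopy requires page-aligned memory, which vm_allocate guarantees:

```python
import ctypes, ctypes.util

# macOS-only sketch (assumption: ctypes bindings, not the exact code under test).
libsys = ctypes.CDLL(ctypes.util.find_library("c"))
libsys.vm_allocate.argtypes = [ctypes.c_uint32, ctypes.POINTER(ctypes.c_uint64), ctypes.c_uint64, ctypes.c_int]
mach_task_self = ctypes.c_uint32.in_dll(libsys, "mach_task_self_")  # the caller's task port
VM_FLAGS_ANYWHERE = 1

def vm_alloc(size: int) -> int:
  # vm_allocate hands back page-aligned memory, which is what
  # newBufferWithBytesNoCopy needs to wrap the region without copying.
  addr = ctypes.c_uint64(0)
  ret = libsys.vm_allocate(mach_task_self, ctypes.byref(addr), size, VM_FLAGS_ANYWHERE)
  assert ret == 0, f"vm_allocate failed: kern_return_t {ret}"
  return addr.value

ptr = vm_alloc(4096)
# ptr would then be passed to MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:
# through the Metal bindings; that step is where the two runners behaved differently.
```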

Access to tinybox would be great!

Not done yet, but I have some preliminary results: training a GPT2 model (I arbitrarily changed the size so that it OOMs without the shard but fits with FSDP). Model...
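Roughly, the setup looks like the sketch below (a stand-in module instead of the actual GPT2 config; device count and sizes are assumptions):

```python
from tinygrad import Device
from tinygrad.nn import Linear
from tinygrad.nn.state import get_parameters
from tinygrad.nn.optim import AdamW

# Sketch of the FSDP-style setup (stand-in module, not the real GPT2 config):
# split every parameter along axis 0 across the devices so no single device
# has to hold the full set of weights.
GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(4))

model = Linear(4096, 4096)
for p in get_parameters(model):
  p.shard_(GPUS, axis=0)            # each device keeps a 1/len(GPUS) slice of the parameter

opt = AdamW(get_parameters(model))  # the optimizer state for the sharded params is what the PR targets
```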

The memory savings seem to be around 30%; I should probably compare against the pytorch implementation... I added some extra profiling code but will delete it before review
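The profiling was nothing fancy; roughly along these lines (a sketch, assuming GlobalCounters.mem_used is a good-enough proxy, with a toy step standing in for the real one):

```python
from tinygrad import Tensor, GlobalCounters

# Throwaway profiling sketch: track the peak of GlobalCounters.mem_used (bytes
# currently held by allocated buffers) across steps. The training step here is
# a toy stand-in, not the GPT2 step.
def train_step() -> Tensor:
  return (Tensor.rand(256, 256) @ Tensor.rand(256, 256)).sum().realize()

peak = 0
for _ in range(10):
  train_step()
  peak = max(peak, GlobalCounters.mem_used)
print(f"peak allocated: {peak/1e6:.1f} MB")
```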

Once in a while there seems to be an error about the axes not matching, depending on the model implementation; I think it boils down to this example: ```python x = Tensor.empty(4,...
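Since the snippet above is cut off, here is a hypothetical reconstruction of the kind of mismatch being described (shapes, device count, and the second tensor are assumptions):

```python
from tinygrad import Tensor, Device

# Hypothetical reconstruction: two tensors sharded along different axes being
# combined elementwise is the kind of thing that trips the axis-not-matching error.
GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.empty(4, 4).shard(GPUS, axis=0)   # each device holds 2 rows
y = Tensor.empty(4, 4).shard(GPUS, axis=1)   # each device holds 2 columns
out = x + y                                  # the shard axes disagree here
```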

One solution I arrived at is checking the gradient's axis before assigning: ```python def step(self): for x in self.params: if isinstance(x.grad.lazydata, MultiLazyBuffer) and (axis:=x.grad.lazydata.axis) is not None and axis !=...
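The snippet is truncated above; a hypothetical completion is sketched below. The gather-then-reshard in the mismatch branch is one way to make the axes line up, not necessarily what the PR does, and the subclass name is made up:

```python
from tinygrad import Tensor, Device
from tinygrad.multi import MultiLazyBuffer
from tinygrad.nn.optim import AdamW

class ShardSafeAdamW(AdamW):  # hypothetical subclass, not from the PR
  def step(self):
    for x in self.params:
      # If the gradient came back sharded on a different axis than its parameter,
      # gather it onto one device and reshard it along the parameter's axis.
      if (isinstance(x.grad.lazydata, MultiLazyBuffer) and isinstance(x.lazydata, MultiLazyBuffer)
          and (axis := x.grad.lazydata.axis) is not None and axis != x.lazydata.axis):
        full = Tensor(x.grad.lazydata.copy_to_device(Device.DEFAULT), device=Device.DEFAULT)
        x.grad = full.shard(x.device, axis=x.lazydata.axis)
    return super().step()  # then run the unmodified AdamW update
```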

@rggs: pytorch's all_gather refers to two things: in the context of a reduce operation (sum), this is handled by all_reduce inside multi.py; in the context of an elementwise operation (e.g. multiplication), this...
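A small sketch of the reduce case (device count and shape are arbitrary):

```python
from tinygrad import Tensor, Device

# Reduce case: summing across the sharded axis is where multi.py's all_reduce
# comes in. Device count and shape here are arbitrary.
GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

x = Tensor.ones(8, 8).shard(GPUS, axis=0)  # each device holds 4 of the 8 rows
total = x.sum()                            # the reduction crosses the shard boundary
print(total.item())                        # 64.0
```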

Sharding across four devices for MNIST with AdamW:
Parameter size (with optimizer): 14.4 MB
Device 0 peak mem: 65 MB
Devices 1 - 3 peak: 17 MB
Without sharding the...