Dtype casting behavior differs between runtimes
When using cstyle runtimes like METAL, an intermediate cast op has no effect, leading to behavior that differs from other runtimes like TORCH. The generated kernels use the initial dtype and only apply the cast once the tensor is realized.
TORCH example
>>> x=Tensor.randn(10)*10
>>> x.numpy()
array([ -1.1966271, 16.344046 , 13.719806 , -14.306839 , 0.4940065,
        -0.7837186, 1.9311037, 1.0995939, 11.713716 , 3.4509966],
      dtype=float32)
>>> x.cast(dtypes.int32).cast(dtypes.float32).numpy()
array([ -1., 16., 13., -14., 0., 0., 1., 1., 11., 3.],
      dtype=float32)
METAL example
>>> x=Tensor.randn(10)*10
>>> x.numpy()
array([-6.686873 , 12.219945 , 2.935935 , -7.987997 , 11.411002 ,
       11.297409 , 7.143289 , -4.76507 , 0.8306983, 10.898464 ],
      dtype=float32)
>>> x.cast(dtypes.int32).cast(dtypes.float32).numpy()
array([-6.686873 , 12.219945 , 2.935935 , -7.987997 , 11.411002 ,
       11.297409 , 7.143289 , -4.76507 , 0.8306983, 10.898464 ],
      dtype=float32)
Hmm, that's interpreted vs compiled. Which do you think is correct?
Not sure how prevalent this kind of intermediate casting is, but it might lead to some confusion if a cast op is called but internally never happens.
For example, floor/ceil is currently blocked by this, since we would have to realize the int cast on the input for it to apply, which isn't ideal.
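For context, here is a rough sketch (not tinygrad's actual code, and the helper names are made up) of how floor/ceil could be built from an int-cast round trip; it only gives the right answer if the intermediate downcast is actually honored:

from tinygrad import Tensor, dtypes  # import paths may differ between versions

def trunc(x: Tensor) -> Tensor:
  # round-trip through int32 truncates toward zero, but only if the middle cast is applied
  return x.cast(dtypes.int32).cast(x.dtype)

def floor(x: Tensor) -> Tensor:
  t = trunc(x)
  return t - (x < t).cast(x.dtype)  # step down for negative non-integers

def ceil(x: Tensor) -> Tensor:
  t = trunc(x)
  return t + (x > t).cast(x.dtype)  # step up for positive non-integers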
I think the answer might be something like a required flag on the cast, so it won't optimize out. contiguous might already do this.
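As a possible stopgap along those lines, something like the following might keep the intermediate int32 buffer alive; this assumes .contiguous() (or an explicit .realize()) stops the two casts from being fused away, which may depend on the backend and version:

x = Tensor.randn(10) * 10
y = x.cast(dtypes.int32).contiguous().cast(dtypes.float32)  # intermediate int32 kept as its own buffer
z = x.cast(dtypes.int32).cast(dtypes.float32)               # fused version; on METAL the downcast is skipped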
Maybe always enforce it if it's a downcast? Having a required flag for a function can be confusing, as the question then becomes when it will decide not to do it.
Closing as stale - on master, both the interpreted and compiled backends honor the middle downcast.