MaskRCNN Inference
So far I've created base classes based on the reference implementation (thanks @wozeparrot), and I'm able to load the weights @geohot
https://github.com/mlcommons/training/tree/master/object_detection/pytorch/maskrcnn_benchmark
TODO:
- [x] Load weights of the saved model
- [x] Load reference model in torch and verify the same parameters are present in the tinygrad model (see the sketch after this list)
- [x] Add call to each module and verify it works by comparing it with the reference torch call
  - [x] Add call to ResNetFPN
  - [x] Add call to RPN
  - [x] Add call to RoIHeads
- [x] Test model call works end to end
  - [x] test call to ResNetFPN
  - [x] test call to RPN
  - [x] test call to RoIHeads
- [x] Add inference code
- [x] Remove torch functions, lower usage of .numpy()
- [x] Run model on test dataset, Box AP should be similar
- [x] Calculate inference time (s/im)
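A minimal sketch of that parameter check, assuming a hypothetical `tinygrad_state_dict()` helper that returns `{name: Tensor}` for the tinygrad model (the checkpoint path is illustrative):

```python
import torch

# illustrative path; the real weights come from the mlcommons reference model
ref = torch.load("maskrcnn_r50_fpn.pth", map_location="cpu")
ref_sd = ref.get("model", ref)  # reference checkpoints often nest under "model"

ref_keys = set(ref_sd.keys())
tg_keys = set(tinygrad_state_dict().keys())  # hypothetical helper

print("missing in tinygrad:", sorted(ref_keys - tg_keys))
print("extra in tinygrad:  ", sorted(tg_keys - ref_keys))
```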
I started the same project today but you're ahead of me. Maybe you need to drop the last fc layer of the backbone, right?
> I started the same project today but you're ahead of me. Maybe you need to drop the last fc layer of the backbone, right?
Yes, they can be removed, but I won't be using them anyway in the forward call.
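To illustrate (class and attribute names here are illustrative, not the actual tinygrad code): the FPN consumes the stage outputs C2-C5, so the backbone forward never reaches avgpool/fc and those layers can simply be dropped:

```python
# sketch: a ResNet trunk used as a detection backbone. avgpool/fc are never
# called; the forward returns the four stage outputs the FPN consumes.
class ResNetBackbone:
    def __init__(self, resnet):
        self.stem = resnet.stem  # conv1 + bn1 + relu + maxpool
        self.stages = [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]
        # resnet.avgpool / resnet.fc are intentionally unused here

    def __call__(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # C2, C3, C4, C5
        return feats
```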
Changes made in `tinygrad/`:
```
------------------------------------------------------------
files                    insertions    deletions
------------------------------------------------------------
tinygrad/tensor.py       2             1
------------------------------------------------------------
lines added in the tinygrad folder: 1
```
@geohot So there are still some torch functions that need to be removed, but here's an example output.
Reference output for the same image
I'm aware that the results aren't exactly the same; this is because the ResNet block output doesn't exactly match the reference implementation (it matches with atol=1e-3).
If I use the ResNet output from the reference and everything else from my implementation, the results match exactly end to end.
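That check amounts to something like the following (the backbone calls and the input are placeholders, not the actual code):

```python
import numpy as np
import torch
from tinygrad.tensor import Tensor

x = np.random.randn(1, 3, 800, 800).astype(np.float32)  # dummy image batch

out_tg = backbone_tinygrad(Tensor(x)).numpy()                   # placeholder call
out_ref = backbone_torch(torch.from_numpy(x)).detach().numpy()  # placeholder call

# the backbone only matches to atol=1e-3; with the reference backbone output
# swapped in, everything downstream matches exactly
print("max abs diff:", np.abs(out_tg - out_ref).max())
assert np.allclose(out_tg, out_ref, atol=1e-3)
```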
`confidence_threshold=0.6`
*(images: bbox outputs from tinygrad vs. from maskrcnn_benchmark)*
@geohot try the model: `python examples/mask_rcnn.py --image <path>`
Is it matching the required score on MLPerf?
@geohot I'm checking that next.
@geohot Currently it's not meeting it, but it's close:
Box Average Precision (AP) = 0.330
Mask Average Precision (AP) = 0.309
MLPerf criterion: 0.377 Box min AP and 0.339 Mask min AP
The run also took too long (4.68 s/image × 5000 images ≈ 6.5 hours total).
I think I first need to make sure the ResNet in tinygrad matches the reference implementation (I found the backbone results to be slightly different, only matching with atol=1e-3). I'll also check the reference implementation scores.
@geohot So I checked the reference implementation; strangely, it also has lower scores than claimed (the model is R-50 FPN Mask):
Box Average Precision (AP) = 0.331
Mask Average Precision (AP) = 0.309
The results are almost the same as for my implementation; however, the reference ran much faster (~2 s per image).
Found this issue, which could explain the difference in performance
The reference implementation actually gets 0.377 Box AP and 0.342 Mask AP (turns out I was doing something very wrong). I'm going to switch to a V100 for further testing of my implementation, as my laptop is too slow for 5k-image runs.
The results should exactly match the reference now. The ResNet used in the reference implementation has strides in the first layer of the bottleneck, which caused the difference in outputs (see the sketch below). Now I'm doing a full run with the tinygrad model on 5k images. BTW, GCP VMs are much slower than my laptop itself (probably because the CPUs in those GPU VMs suck), so I'll have to use my laptop :(
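For reference, the discrepancy is about where the downsampling stride lives inside the bottleneck (a sketch, not the actual code): maskrcnn_benchmark's backbone strides in the 1x1 conv1, while many ResNet implementations stride in the 3x3 conv2:

```python
import torch.nn as nn

# stride_in_1x1=True  -> stride on the 1x1 conv1 (maskrcnn_benchmark backbone)
# stride_in_1x1=False -> stride on the 3x3 conv2 ("ResNet v1.5" style)
def bottleneck_convs(c_in, c_mid, stride, stride_in_1x1):
    s1, s2 = (stride, 1) if stride_in_1x1 else (1, stride)
    conv1 = nn.Conv2d(c_in, c_mid, kernel_size=1, stride=s1, bias=False)
    conv2 = nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=s2, padding=1, bias=False)
    return conv1, conv2
```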
@geohot getting Box AP 0.374 and Mask AP 0.342, so close now!
@geohot Added the model in eval: `MODEL=mrcnn python examples/mlperf/model_eval.py`
So we fall short of the MLPerf bbox requirement by only 0.001 points. I strongly believe such a small difference is due to float approximations.
Inference on 5k images ran in 5 hours 13 mins with OPENCL on an Nvidia RTX 3060 Mobile. The CUDA backend does work but is much slower.
bbox result
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.376
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.589
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.409
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.212
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.408
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.497
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.310
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.487
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.511
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.323
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
```
mask result
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.342
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.560
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.363
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.293
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.448
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.468
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
```
I started facing the issue of kernels having too many args after I removed numpy from hot paths. For now, I used the fix in https://github.com/geohot/tinygrad/issues/953, which isn't fully correct but works. @geohot results match now :)
Inference on 5k images ran in 9 hours 43 mins with OPENCL on an Nvidia RTX 3060 Mobile (the time increased, probably because gathers aren't efficient).
bbox result
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.377
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.591
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.411
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.213
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.410
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.499
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.312
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.489
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.513
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.325
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.651
```
mask result
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.342
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.560
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.363
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.293
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.448
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.468
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
```
Here's the remaining work:
- Implement ROI Align using tinygrad
- Implement floor; remove numpy from LevelMapper and Pooler (the rule LevelMapper computes is sketched after this list)
- A correct fix for https://github.com/geohot/tinygrad/issues/953
- Other uses of numpy/torch are in preprocessing/postprocessing, so I'm skipping them for now
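For context, LevelMapper implements the FPN level-assignment rule (Eq. 1 of the FPN paper). A numpy sketch under assumed canonical constants, not the exact code:

```python
import numpy as np

def map_rois_to_levels(boxes, k_min=2, k_max=5, canonical_scale=224, canonical_level=4):
    # boxes: (N, 4) as x1, y1, x2, y2 -> pyramid level index per RoI
    # k = floor(k0 + log2(sqrt(w * h) / canonical_scale)), clamped to [k_min, k_max]
    scales = np.sqrt((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))
    levels = np.floor(canonical_level + np.log2(scales / canonical_scale + 1e-6))
    return np.clip(levels, k_min, k_max).astype(np.int64) - k_min
```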
Updates:
- ROI Align is now implemented, but it still has to use numpy for gathers, because the gathers are on huge tensors. A single forward pass needs to do all of these gathers in the masked_index step of the bilinear interpolation in ROI Align (a sketch of that step follows this log):
```
gathering (33868800,) from (15564800,)
gathering (33868800,) from (15564800,)
gathering (33868800,) from (15564800,)
gathering (33868800,) from (15564800,)
gathering (9683968,) from (3891200,)
gathering (9683968,) from (3891200,)
gathering (9683968,) from (3891200,)
gathering (9683968,) from (3891200,)
gathering (5318656,) from (972800,)
gathering (5318656,) from (972800,)
gathering (5318656,) from (972800,)
gathering (5318656,) from (972800,)
gathering (1304576,) from (243200,)
gathering (1304576,) from (243200,)
gathering (1304576,) from (243200,)
gathering (1304576,) from (243200,)
gathering (18866176,) from (15564800,)
gathering (18866176,) from (15564800,)
gathering (18866176,) from (15564800,)
gathering (18866176,) from (15564800,)
gathering (602112,) from (3891200,)
gathering (602112,) from (3891200,)
gathering (602112,) from (3891200,)
gathering (602112,) from (3891200,)
```
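A numpy sketch of what each group of four gathers does in the bilinear-interpolation step (names and shapes are illustrative; the real code batches channels and RoIs):

```python
import numpy as np

def bilinear_gather(feat, y, x):
    # feat: (H, W) feature map; y, x: float sample coordinates, shape (N,)
    H, W = feat.shape
    y0, x0 = np.floor(y).astype(np.int64), np.floor(x).astype(np.int64)
    y1, x1 = np.clip(y0 + 1, 0, H - 1), np.clip(x0 + 1, 0, W - 1)
    y0, x0 = np.clip(y0, 0, H - 1), np.clip(x0, 0, W - 1)
    flat = feat.ravel()
    # the four big gathers, cf. the "gathering (N,) from (H*W,)" log above
    v00, v01 = flat[y0 * W + x0], flat[y0 * W + x1]
    v10, v11 = flat[y1 * W + x0], flat[y1 * W + x1]
    wy, wx = y - y0, x - x0
    return (v00 * (1 - wy) * (1 - wx) + v01 * (1 - wy) * wx +
            v10 * wy * (1 - wx) + v11 * wy * wx)
```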
- Added floor and ceil via cast + realize (the cast needs enforcing); the cast trick is sketched below
- For now, OPT=1 seems to work without limiting kernel args
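A numpy sketch of the floor/ceil-via-cast idea (the exact tinygrad code may differ):

```python
import numpy as np

def floor_via_cast(x):
    t = x.astype(np.int32).astype(np.float32)  # cast truncates toward zero
    return t - (x < t).astype(np.float32)      # fix up negative non-integers

def ceil_via_cast(x):
    t = x.astype(np.int32).astype(np.float32)
    return t + (x > t).astype(np.float32)      # fix up positive non-integers

print(floor_via_cast(np.array([-1.5, 1.5], dtype=np.float32)))  # [-2.  1.]
print(ceil_via_cast(np.array([-1.5, 1.5], dtype=np.float32)))   # [-1.  2.]
```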
The master merge seems to have broken something; the model isn't working now. Looking into it. EDIT: fixed now.
@geohot, latest results from `GPU=1 OPT=1 MODEL=mrcnn python examples/mlperf/model_eval.py`:
Inference on 5k images ran in 9 hours 59 mins with OPENCL on an Nvidia RTX 3060 Mobile.
bbox
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.378
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.593
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.411
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.215
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.411
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.499
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.313
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.490
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.514
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.327
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.551
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.652
```
mask
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.342
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.560
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.363
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.293
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.448
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.468
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623
```
Is this ready for me to test? Will run on 7900XTX and confirm it meets the target
> Is this ready for me to test? Will run on 7900XTX and confirm it meets the target

Yes @geohot, a run on the 7900XTX should take around 3-4 hours:
`GPU=1 MODEL=mrcnn python examples/mlperf/model_eval.py`
Testing now; it required `mkdir datasets/COCO`, but it looks like it's running.
Made it to:
```
3%|████      | 136/5000 [1:35:07<56:42:24, 41.97s/it]
```
and got:
```
pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE
```
7900XTX with 24GB of VRAM.
> Made it to `3%|████      | 136/5000 [1:35:07<56:42:24, 41.97s/it]` and got `pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE` (7900XTX with 24GB of VRAM)
So I've been using OPT=1 because of the kernel-fusion issue; I usually get ~8 s per image on the RTX 3060 Mobile, so I suspect this behaviour is because of OPT=2. Could you please try OPT=1, @geohot?
Pulled, and rerunning with `OPT=1 PYTHONPATH="." GPU=1 MODEL=mrcnn python examples/mlperf/model_eval.py`:
```
3%|████▋     | 142/5000 [27:45<15:49:38, 11.73s/it]
pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE
```
I think this is actually because you are running out of kernel program space; maybe try with the method cache disabled?
At 150 now with method cache disabled, but this is brutally slow. ETA is over 24 hours.