
MaskRCNN Inference

Open kunwar31 opened this issue 1 year ago • 30 comments

So far I've created base classes based on the reference implementation (thanks @wozeparrot), and I'm able to load the weights @geohot

https://github.com/mlcommons/training/tree/master/object_detection/pytorch/maskrcnn_benchmark

TODO:

  • [x] Load weights of the saved model

  • [x] Load reference model in torch and verify same parameters present in tinygrad model

  • [x] Add call to each module and verify it works by comparing it with reference torch call

    • [x] Add call to ResNetFPN
    • [x] Add call to RPN
    • [x] Add call to RoIHeads
  • [x] Test model call works end to end

    • [x] test call to ResNetFPN
    • [x] test call to RPN
    • [x] test call to RoIHeads
  • [x] Add inference code

  • [x] Remove torch functions, lower usage of .numpy()

  • [x] Run model on test dataset, Box AP should be similar

  • [x] Calculate inference time (s/im)

kunwar31 avatar May 31 '23 20:05 kunwar31

I started the same project today but you are ahead of me. Maybe you need to drop the last fc layer of the backbone, right?

Marcelo5444 avatar May 31 '23 20:05 Marcelo5444

I started the same project today but you are ahead of me. Maybe you need to drop the last fc layer of the backbone, right?

Yes, they can be removed, but I won't be using them in the forward call anyway.

kunwar31 avatar May 31 '23 21:05 kunwar31

Changes made in tinygrad/:

------------------------------------------------------------
files                             insertions       deletions
------------------------------------------------------------
tinygrad/tensor.py                         2               1
------------------------------------------------------------
lines added in the tinygrad folder: 1

tinyb0t avatar Jun 01 '23 17:06 tinyb0t

@geohot So there are still some torch functions that need to be removed, but here's an example output: [image]

kunwar31 avatar Jun 02 '23 03:06 kunwar31

Reference output for the same image: [image]

kunwar31 avatar Jun 02 '23 03:06 kunwar31

I'm aware that the results aren't exactly the same. This is because the ResNet block output doesn't exactly match the reference implementation; it only matches with atol=1e-3.
If I use the ResNet output from the reference and everything else from my implementation, the results match exactly end to end.
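
For reference, here's a minimal sketch of that kind of check (a hypothetical helper, not the actual test code): compare a tinygrad module's output against the torch reference under a tolerance.

```python
import numpy as np

def assert_close(tiny_out, torch_out, atol=1e-3):
    # tiny_out: numpy array from tinygrad (tensor.numpy())
    # torch_out: torch.Tensor from the reference implementation
    ref = torch_out.detach().cpu().numpy()
    max_diff = np.abs(tiny_out - ref).max()
    assert np.allclose(tiny_out, ref, atol=atol), f"max abs diff {max_diff} > atol {atol}"

# e.g. assert_close(tiny_backbone(x).numpy(), torch_backbone(x_torch))
```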

kunwar31 avatar Jun 02 '23 03:06 kunwar31

confidence_threshold=0.6

Bbox outputs from tinygrad: [image]

Bbox outputs from maskrcnn_benchmark: [image]

kunwar31 avatar Jun 02 '23 11:06 kunwar31

@geohot try the model: python examples/mask_rcnn.py --image <path>

kunwar31 avatar Jun 02 '23 16:06 kunwar31

Is it matching the required score on mlperf?

geohot avatar Jun 02 '23 16:06 geohot

@geohot I'm checking that next

kunwar31 avatar Jun 02 '23 16:06 kunwar31

@geohot currently it's not meeting it, but it's close

Box Average Precision  (AP) = 0.330
Mask Average Precision  (AP) = 0.309

MLPerf criterion: 0.377 Box min AP and 0.339 Mask min AP

The run also took too long (4.68 s per image × 5000 images ≈ 6.5 hours total).

I think I first need to make sure the ResNet in tinygrad matches the reference implementation (I found the backbone results to be slightly different, only matching with atol=1e-3). I'll also check the reference implementation's scores.

kunwar31 avatar Jun 03 '23 12:06 kunwar31

@geohot So I checked the reference implementation; strangely, it also scores lower than claimed (the model is R-50 FPN Mask)

Box Average Precision  (AP) = 0.331
Mask Average Precision  (AP) = 0.309

The results are almost the same as my implementation's; however, the reference ran much faster (~2 s per image)

kunwar31 avatar Jun 03 '23 19:06 kunwar31

Found this issue, which could explain the difference in performance

kunwar31 avatar Jun 03 '23 19:06 kunwar31

The reference implementation actually scores 0.377 Box AP and 0.342 Mask AP (turns out I was doing something very wrong). Going to switch to a V100 for further testing of my implementation, as my laptop is too slow for 5k-image runs.

kunwar31 avatar Jun 05 '23 16:06 kunwar31

The results should exactly match the reference now: the ResNet used in the reference implementation has the stride in the first layer of the bottleneck, which caused the difference in outputs. Now I'm doing a full run with the tinygrad model on 5k images. BTW, GCP VMs are much slower than my laptop (probably because the CPUs in those GPU VMs suck), so I'll have to use my laptop :(
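
To illustrate the difference (a minimal torch sketch, not the actual code from either repo): the reference R-50 puts the downsampling stride on the first 1x1 conv of the bottleneck, while torchvision-style ResNet (v1.5) strides the 3x3 conv; both give the same output shape but different activations, even with identical weights.

```python
import torch.nn as nn

def bottleneck_stem(cin, cmid, stride, stride_in_1x1):
    # stride_in_1x1=True matches the reference (Caffe2-style) R-50:
    # the downsampling stride sits on the first 1x1 conv.
    # stride_in_1x1=False matches torchvision's ResNet v1.5: stride on the 3x3.
    s1, s3 = (stride, 1) if stride_in_1x1 else (1, stride)
    return nn.Sequential(
        nn.Conv2d(cin, cmid, kernel_size=1, stride=s1, bias=False),
        nn.Conv2d(cmid, cmid, kernel_size=3, stride=s3, padding=1, bias=False),
    )

# same output shapes, different values: the 3x3 conv sees a 2x-downsampled input in one case
ref_stem = bottleneck_stem(256, 64, stride=2, stride_in_1x1=True)
tv_stem  = bottleneck_stem(256, 64, stride=2, stride_in_1x1=False)
```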

kunwar31 avatar Jun 05 '23 20:06 kunwar31

@geohot getting Box AP 0.374 and Mask AP 0.342, so close now!

kunwar31 avatar Jun 06 '23 03:06 kunwar31

@geohot Added the model to eval: MODEL=mrcnn python examples/mlperf/model_eval.py. We're short of the MLPerf requirement on bbox by only 0.001 points; I strongly believe such a small difference is due to float approximations.

Inference on 5k images ran in 5 hours 13 mins with OPENCL and an Nvidia RTX 3060 Mobile.
The CUDA backend does work but is much slower.

bbox result

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.376
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.589
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.409
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.212
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.408
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.497
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.310
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.487
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.511
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648

mask result

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.342
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.560
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.363
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.293
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.448
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.468
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623

kunwar31 avatar Jun 07 '23 17:06 kunwar31

I started facing the issue of kernels having too many args after I removed numpy from hot paths. For now I used the fix from https://github.com/geohot/tinygrad/issues/953, which isn't fully correct but works. @geohot, results match now :)

Inference on 5k images ran in 9 hours 43 mins with OPENCL and an Nvidia RTX 3060 Mobile (the time increased, probably because gathers aren't efficient; see the sketch below).
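
For context on why the gathers hurt: a gather in tinygrad can be expressed as a one-hot mask plus a reduce, whose intermediate scales with len(idx) × len(src). A minimal 1-D sketch (a hypothetical helper, assuming the current Tensor API, not the actual fix):

```python
from tinygrad import Tensor

def gather_1d(src: Tensor, idx: Tensor) -> Tensor:
    # Build a (len(idx), len(src)) one-hot mask by comparing idx against an arange,
    # then reduce. The intermediate is len(idx) * len(src) elements, which is why
    # huge gathers like the ones in RoI Align are slow.
    n = src.shape[0]
    onehot = Tensor.arange(n).reshape(1, n) == idx.reshape(-1, 1)
    return onehot.where(src.reshape(1, n), 0).sum(axis=1)

# e.g. gather_1d(Tensor([10., 20., 30.]), Tensor([2, 0])).numpy() -> [30., 10.]
```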

bbox result

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.377
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.591
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.411
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.213
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.410
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.499
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.489
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.513
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.325
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.550
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.651

mask result

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.342
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.560
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.363
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.293
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.448
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.468
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623

Here's the remaining work:

  1. Implement RoI Align using tinygrad
  2. Implement floor, remove numpy from LevelMapper and Pooler (see the floor sketch after this list)
  3. A correct fix for https://github.com/geohot/tinygrad/issues/953
  4. The remaining uses of numpy and torch are in preprocessing/postprocessing, so I'm skipping them for now
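
For item 2, a minimal sketch of floor built from casts (a hypothetical helper, assuming tinygrad's dtypes API): casting to int truncates toward zero, which equals floor only for non-negative values, so negative fractional inputs need a -1 correction.

```python
from tinygrad import Tensor, dtypes

def floor_via_cast(x: Tensor) -> Tensor:
    t = x.cast(dtypes.int32).cast(x.dtype)  # truncate toward zero
    return t - (x < t).cast(x.dtype)        # shift down by 1 where x was negative and fractional

# floor_via_cast(Tensor([-1.5, -0.2, 0.7, 2.0])).numpy() -> [-2., -1., 0., 2.]
```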

kunwar31 avatar Jun 09 '23 06:06 kunwar31

Updates:

  1. RoI Align is now implemented, but it still has to use numpy for the gathers, because the gathers are on huge tensors. A single forward pass needs to do all of the following gathers in the masked_index step of the bilinear interpolation in RoI Align (see the sketch after this list):
gathering  (33868800,)  from  (15564800,)
gathering  (33868800,)  from  (15564800,)
gathering  (33868800,)  from  (15564800,)
gathering  (33868800,)  from  (15564800,)
gathering  (9683968,)  from  (3891200,)
gathering  (9683968,)  from  (3891200,)
gathering  (9683968,)  from  (3891200,)
gathering  (9683968,)  from  (3891200,)
gathering  (5318656,)  from  (972800,)
gathering  (5318656,)  from  (972800,)
gathering  (5318656,)  from  (972800,)
gathering  (5318656,)  from  (972800,)
gathering  (1304576,)  from  (243200,)
gathering  (1304576,)  from  (243200,)
gathering  (1304576,)  from  (243200,)
gathering  (1304576,)  from  (243200,)
gathering  (18866176,)  from  (15564800,)
gathering  (18866176,)  from  (15564800,)
gathering  (18866176,)  from  (15564800,)
gathering  (18866176,)  from  (15564800,)
gathering  (602112,)  from  (3891200,)
gathering  (602112,)  from  (3891200,)
gathering  (602112,)  from  (3891200,)
gathering  (602112,)  from  (3891200,)
  2. Added floor and ceil using cast + realize (the cast needs enforcing)
  3. For now, OPT=1 seems to work without limiting kernel args
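
For reference on item 1, here's a minimal numpy sketch of the bilinear-interpolation gathers (a hypothetical helper, not the actual masked_index code): each sample point reads its four neighboring pixels, which is why the log above shows the gathers in groups of four identical shapes.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    # feat: (C, H, W) feature map; ys, xs: flat arrays of fractional sample coords
    C, H, W = feat.shape
    y0, x0 = np.floor(ys).astype(np.int64), np.floor(xs).astype(np.int64)
    ly, lx = ys - y0, xs - x0                        # interpolation weights
    y0c, x0c = np.clip(y0, 0, H - 1), np.clip(x0, 0, W - 1)
    y1c, x1c = np.clip(y0 + 1, 0, H - 1), np.clip(x0 + 1, 0, W - 1)
    flat = feat.reshape(C, H * W)
    v00 = flat[:, y0c * W + x0c]                     # four gathers per sample point
    v01 = flat[:, y0c * W + x1c]
    v10 = flat[:, y1c * W + x0c]
    v11 = flat[:, y1c * W + x1c]
    return (v00 * (1 - ly) * (1 - lx) + v01 * (1 - ly) * lx
            + v10 * ly * (1 - lx) + v11 * ly * lx)   # (C, N) interpolated values
```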

kunwar31 avatar Jun 13 '23 07:06 kunwar31

The master merge seems to have broken something; the model isn't working now. Looking into it. EDIT: fixed now

kunwar31 avatar Jun 13 '23 07:06 kunwar31

@geohot, Latest results

GPU=1 OPT=1 MODEL=mrcnn python examples/mlperf/model_eval.py

Inference on 5k images ran in 9 hours 59 mins with OPENCL and Nvidia RTX 3060 Mobile

bbox

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.378
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.593
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.411
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.215
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.411
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.499
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.313
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.490
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.514
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.327
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.551
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.652

mask

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.342
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.560
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.363
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.155
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.293
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.448
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.468
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.272
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.623

kunwar31 avatar Jun 14 '23 04:06 kunwar31

Is this ready for me to test? Will run on 7900XTX and confirm it meets the target

geohot avatar Jun 21 '23 17:06 geohot

Is this ready for me to test? Will run on 7900XTX and confirm it meets the target

Yes @geohot, a run on the 7900XTX should take around 3-4 hours: GPU=1 MODEL=mrcnn python examples/mlperf/model_eval.py

kunwar31 avatar Jun 22 '23 17:06 kunwar31

Testing now; it required mkdir datasets/COCO, but it looks like it's running

geohot avatar Jun 23 '23 03:06 geohot

Made it to: 3%|████ | 136/5000 [1:35:07<56:42:24, 41.97s/it]

and got

pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE

7900XTX with 24GB of VRAM

geohot avatar Jun 23 '23 06:06 geohot

Made it to: 3%|████ | 136/5000 [1:35:07<56:42:24, 41.97s/it]

and got

pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE

7900XTX with 24GB of VRAM

So I've been using OPT=1 because of the kernel fusion issue; I usually get 8 s per image on an RTX 3060 Mobile. I suspect this behaviour is because of OPT=2. Could you please try OPT=1, @geohot?

kunwar31 avatar Jun 23 '23 14:06 kunwar31

Pulled, and rerunning with OPT=1

geohot avatar Jun 23 '23 17:06 geohot

OPT=1 PYTHONPATH="." GPU=1 MODEL=mrcnn python examples/mlperf/model_eval.py

3%|████▋ | 142/5000 [27:45<15:49:38, 11.73s/it]

pyopencl._cl.MemoryError: create_buffer failed: MEM_OBJECT_ALLOCATION_FAILURE

geohot avatar Jun 23 '23 23:06 geohot

I think this is actually because you are running out of kernel program space; maybe try with the method cache disabled?

wozeparrot avatar Jun 24 '23 00:06 wozeparrot

At 150 now with method cache disabled, but this is brutally slow. ETA is over 24 hours.

geohot avatar Jun 24 '23 02:06 geohot