
Any plan to support bfloat16?

Open pycoco opened this issue 1 year ago • 5 comments

pycoco avatar Dec 07 '23 10:12 pycoco

@ys-2020, could you please take a look at this issue when you have time? Thanks!

zhijian-liu avatar Dec 11 '23 04:12 zhijian-liu

Hi @pycoco, thanks for your interest. bfloat16 is typically used for training jobs. However, we have launched many training jobs and found that float16 does not affect accuracy. That is why we do not support bfloat16 at the moment.

If you find a job where bfloat16 yields better training results, please let us know, and we will plan to implement it.

ys-2020 avatar Dec 11 '23 15:12 ys-2020

@ys-2020 Thanks for your quick reply and great work. I found that a model trained with float16 can run into NaN loss in certain scenarios, probably caused by underflow/overflow. So supporting bfloat16 for training would be a good choice.
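To illustrate the kind of failure I mean, here is a minimal numpy sketch (not torchsparse code; the `to_bf16` mantissa-truncation helper is my own rough emulation of bfloat16, since numpy has no native bfloat16 dtype):

```python
import numpy as np

# float16 has only 5 exponent bits (max finite value ~65504), so even a
# moderate squared quantity overflows to inf, and inf - inf gives NaN.
with np.errstate(over="ignore", invalid="ignore"):
    sq = np.float16(300.0) * np.float16(300.0)  # 90000 > 65504 -> inf
    loss = sq - sq                              # inf - inf -> nan

# bfloat16 keeps float32's 8 exponent bits (range up to ~3.4e38) and
# trades away mantissa precision instead. Emulate round-toward-zero
# bfloat16 by zeroing the low 16 bits of the float32 bit pattern:
def to_bf16(x):
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

bf_sq = to_bf16(90000.0)  # finite (89600.0): coarser, but no overflow
```

The point is that fp16 loses *range* while bf16 loses *precision*; NaN losses from overflow are exactly the failure mode bf16 avoids.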

pycoco avatar Dec 12 '23 04:12 pycoco

@pycoco Hi! Thank you for the feedback. Can you provide more details about the 'certain scenarios'? We have actually launched many training jobs on segmentation/detection tasks across many different datasets and never encountered a NaN loss. (Also, you can switch to fp32 as a backup plan for now.)

ys-2020 avatar Dec 12 '23 04:12 ys-2020

@ys-2020 In my scenario, I use VoxelNeXt with voxel size [0.05, 0.05, 0.15], range [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5], and my own dataset. FP32 trains normally, but the training time is too long. I actually use spconv at the moment; maybe I should port my code to your library and give it a try (though I don't think the library itself is the cause of the problem).
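As a side note on scale, plugging those numbers into a quick back-of-the-envelope check shows why fp32 is slow here (`grid_shape` below is a hypothetical helper, assuming the usual [x_min, y_min, z_min, x_max, y_max, z_max] range convention):

```python
# Hypothetical helper: derive the dense voxel grid shape implied by a
# point-cloud range and voxel size. round() absorbs float rounding like
# 6.0 / 0.15 == 40.000000000000005.
def grid_shape(pc_range, voxel_size):
    mins, maxs = pc_range[:3], pc_range[3:]
    return [round((hi - lo) / v) for lo, hi, v in zip(mins, maxs, voxel_size)]

shape = grid_shape(
    [-100.0, -100.0, -1.5, 100.0, 100.0, 4.5],  # range from the comment above
    [0.05, 0.05, 0.15],                         # voxel size
)
# shape == [4000, 4000, 40]
```

A 4000 x 4000 x 40 grid is large even for sparse convolution, so a working half-precision path would make a real difference for this workload.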

pycoco avatar Dec 12 '23 05:12 pycoco