
Model checkpointing in Object Detection

Open abhishekpratapa opened this issue 6 years ago • 10 comments

  • Model Checkpointing in OD

abhishekpratapa avatar May 04 '18 03:05 abhishekpratapa

Feedback from user (from dup #1120):

Great work with Turi. I've loved the fixes, features, and ease of use that have been added throughout the 5.0 beta and now in the official release. Well done!

This might be a feature request, as I am not sure how to implement this in my current Object Detector project.

I have a large project running on an 18-core Intel 7980XE CPU and 5 GTX 1080Ti GPUs. While training, I made it through 20k+ iterations over the course of several weeks. The power at my house flickered, I lost everything, and I realized that I hadn't set up checkpoints of any kind like what is done automatically with TensorFlow.

I set up another training run passing "model_checkpoint_interval=10" and "model_checkpoint_path='/tmp/model_checkpoints'", but that doesn't seem to save any checkpoints, and per the documentation the object_detector doesn't actually appear to support checkpoints (although checkpoints do seem to be supported by other Turi APIs).
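For reference, those two keyword arguments are documented for Turi Create's tree-ensemble toolkits rather than for the object detector. A minimal sketch of where they do apply (the dataset path, target column, and checkpoint file name below are illustrative):

```python
import turicreate as tc

data = tc.SFrame('training_data.sframe')  # illustrative dataset

# Checkpointing is documented for the tree-ensemble toolkits, e.g.:
model = tc.boosted_trees_classifier.create(
    data,
    target='label',
    max_iterations=200,
    model_checkpoint_interval=10,                 # save every 10 iterations
    model_checkpoint_path='/tmp/model_checkpoints')

# Those toolkits can also resume training from a saved checkpoint
# (the exact checkpoint file name here is illustrative):
model = tc.boosted_trees_classifier.create(
    data,
    target='label',
    max_iterations=400,
    resume_from_checkpoint='/tmp/model_checkpoints/model_checkpoint_10')
```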

Is there something that I can do to ensure my progress gets saved so that I do not waste my resources if a process crashes or the power goes out after weeks of processing time?

As a bonus question: if I set my max_iterations to 20,000 but decide to call it off earlier and save the results I have up to that point, is there a way to interrupt training without killing the entire Python script?

Thanks everyone.

gustavla avatar Oct 04 '18 14:10 gustavla

I'd like to second this feature request. Recently at Skafos we've been using the Object Detection toolkit a lot. I've run into situations where I'd make it through hours of training and then realize the model had converged much earlier, so I'd have to restart with updated parameters. Additionally, leaving a long-running job like the one mentioned above training overnight is scary without the ability to exit early (and safely).

tylerhutcherson avatar Sep 06 '19 19:09 tylerhutcherson

Any updates on this? My training stopped after several hours because the disk (not RAM) filled up. Does turicreate save any checkpoints or intermediate models? If not, what exactly is being written to disk that consumed it? I'm using a Colab session.

ShreshthSaxena avatar Jul 29 '20 13:07 ShreshthSaxena

I don't think we should be writing any models or anything like that during training. Could you see what files are being written by the process? Perhaps it's from SFrame manipulations for the dataset. @hoytak How large is the dataset and how much free disk space did you have before you began training?

nickjong avatar Jul 29 '20 23:07 nickjong

I read more on this, and yes, it looks to be the SFrame manipulations. A possible way around this could be to resize the images and annotations beforehand, so I'm looking into that. The dataset is about 50k images (over 10 GB). I get a total of 150 GB of disk space in Colab; I didn't check the exact free space at the time.

ShreshthSaxena avatar Jul 30 '20 08:07 ShreshthSaxena

Yes, in general, I would recommend resizing images to something not too much bigger than the network's input size, which by default is 416 by 416. (You might want to go somewhat bigger, since each image can be randomly cropped each time it is used, before being resized to the network's input size.) This is something we've long talked about doing, but we would want to investigate any accuracy impact for different ranges of resizing before just doing this automatically.
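A rough, untested sketch of such a pre-resizing pass, assuming the standard object detector annotation format (a list of dicts per image with 'label' and center-based pixel 'coordinates'); the target size, paths, and column names are illustrative:

```python
import turicreate as tc

TARGET_LONG_SIDE = 640  # illustrative: a bit larger than the 416x416 input

data = tc.SFrame('train.sframe')  # illustrative: 'image' + 'annotations' columns

# Per-row scale factor; never upscale small images.
data['scale'] = data['image'].apply(
    lambda img: min(1.0, float(TARGET_LONG_SIDE) / max(img.width, img.height)))

# Shrink the images.
data['image'] = data.apply(
    lambda row: tc.image_analysis.resize(
        row['image'],
        int(row['image'].width * row['scale']),
        int(row['image'].height * row['scale']),
        channels=3))

# Scale the bounding boxes ({'x', 'y', 'width', 'height'} in pixels) to match.
data['annotations'] = data.apply(
    lambda row: [{'label': a['label'],
                  'coordinates': {k: v * row['scale']
                                  for k, v in a['coordinates'].items()}}
                 for a in row['annotations']])

data = data.remove_column('scale')
data.save('train_resized.sframe')
```

Doing this once up front and saving the shrunken SFrame avoids repeating the work (and the temporary disk usage) on every training run.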

This particular issue might be addressed by #3282, which would defer the image loading until later in the pipeline. The SFrame manipulations would then be manipulating image paths, not full images.

nickjong avatar Jul 30 '20 16:07 nickjong

Any progress regarding this issue?

waheed0332 avatar Oct 01 '20 09:10 waheed0332

Any progress regarding this issue?

There has not been any progress on this issue so far.

I actually think the majority of the work is determining the best user experience (e.g. how do we update the API, is more than one checkpoint kept, where are the checkpoints saved, what is the default behavior, etc.). If you have thoughts here, please let me know.

If you want to get something working for your own purposes, I think all you have to do is periodically call the save method in the main training loop.
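To illustrate the pattern being described, here is a minimal sketch of periodic saving inside a training loop. Every name in it (the stub step function, the dummy model, the paths) is a placeholder for illustration, not Turi Create's actual internal loop:

```python
import os

CHECKPOINT_DIR = '/tmp/od_checkpoints'   # hypothetical location
CHECKPOINT_INTERVAL = 100                # save every N iterations
MAX_ITERATIONS = 1000

os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def train_one_iteration(step):
    """Placeholder for a single step of the object detector training loop."""
    return 0.0  # pretend loss

class DummyModel:
    """Placeholder for whatever object exposes a save() method."""
    def save(self, path):
        with open(path, 'w') as f:
            f.write('checkpoint')

model = DummyModel()
for step in range(1, MAX_ITERATIONS + 1):
    loss = train_one_iteration(step)
    if step % CHECKPOINT_INTERVAL == 0:
        model.save(os.path.join(CHECKPOINT_DIR, 'checkpoint_%06d' % step))
```

A real version would also need a way to rebuild a model from the most recent checkpoint, which is the resume question discussed below.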

TobyRoseman avatar Oct 01 '20 19:10 TobyRoseman

After thinking it over a bit more, my previous comment may have overlooked a sizable amount of work. Should any checkpointing functionality include the ability to resume training from a checkpoint? Is there enough value in checkpointing, without the ability to resume, for us to release that as a standalone feature?

TobyRoseman avatar Oct 03 '20 00:10 TobyRoseman

I don't see any value in checkpoints without the ability to resume. For me, the point of a checkpoint here is to be able to resume training after a crash, or simply to load that checkpoint and export the model for deployment.

waheed0332 avatar Oct 04 '20 13:10 waheed0332