
TransT tracker integration

Open dschoerk opened this issue 1 year ago • 7 comments

PR to integrate the single object tracker TransT as an AI tool into CVAT.

also see here: https://github.com/cvat-ai/cvat-opencv/issues/14

dschoerk avatar Aug 31 '22 07:08 dschoerk

Resolved #4768

sizov-kirill avatar Aug 31 '22 12:08 sizov-kirill

Hi @dschoerk, thanks for this really cool integration. This is very inspiring! Can I ask how you wrote the code for annotating video (MP4) in CVAT? I'm trying to implement a different task with endoscopy videos, and your video shows CVAT processing one frame at a time until all frames are tracked/inferred.

How are the images fed/transformed? And how are those requests kept posting to the nuclio server until all frames (images) are done?

Thank you in advance, and it would be super appreciated if there were sample code for this video integration! Thanks!

tangy5 avatar Sep 17 '22 04:09 tangy5

at this time CVAT only supports tracking one step at a time. a bounding box (seed) is drawn on the initial frame, and each time you press the "f" key to step to the next frame, the objects are tracked. afaik there is no functionality at the moment to track multiple frames at once. in the mentioned video i just keep pressing the "f" key. i hope this answers your question.

dschoerk avatar Sep 17 '22 05:09 dschoerk

Thank you, this is very helpful and helped me understand CVAT much better. One step further: do you think there could be functionality to automate the "next step" and "prediction" for each frame without pressing "f", until all frames are done? Thank you again for the reply! Very appreciated. CVAT is a great tool.

tangy5 avatar Sep 17 '22 06:09 tangy5

to extremely simplify things: AI tools are integrated as serverless functions, i.e. they get called via a REST interface on the nuclio platform like a webservice. an image is sent from CVAT to the service, and it responds with the tracked location and the state of the tracker. within this PR i have implemented such a service. from an implementation perspective this is great because of its simplicity, but performance when tracking over multiple frames is not amazing with this approach.
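the request/response flow described above could be sketched roughly like this (the payload keys, the `track_one_step` stub, and the handler shape are illustrative assumptions, not the exact CVAT/nuclio/TransT interface):

```python
# rough sketch of a per-frame tracking service; payload keys and the
# stub tracker are illustrative, not the exact CVAT/nuclio/TransT interface
import base64
import json


def track_one_step(image_bytes, prev_box, state):
    """stand-in for the real TransT inference step: a real tracker would
    decode the frame and run the network here."""
    new_state = (state or 0) + 1  # e.g. template features / frame counter
    return prev_box, new_state    # dummy: box is returned unchanged


def handle_request(body):
    """what a serverless handler would do with one HTTP request body."""
    data = json.loads(body)
    image_bytes = base64.b64decode(data["image"])  # frame arrives base64-encoded
    box, state = track_one_step(image_bytes, data["shape"], data.get("state"))
    # the state is sent back so the *next* request can resume the tracker
    return json.dumps({"shape": box, "state": state})
```

the key point is that the service itself is stateless between calls: the tracker state makes a round trip through every request and response.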

implementing what you're looking for is not trivial. the simplest approach i can imagine is to call the tracking service repeatedly until the required number of frames is tracked. BUT this is not very performant: each frame is sent in a separate http request, so tracking n frames requires n requests. a better solution would be a service that is capable of tracking multiple frames, keeps its state internally, and doesn't require the images to be sent in the request, but instead accesses them from a docker mount. all of that would reduce flexibility but increase performance.

dschoerk avatar Sep 17 '22 08:09 dschoerk

Thank you for the reply. My AI function also predicts one frame at a time (batch size = 1). I can imagine the simplest approach is to repeatedly send HTTP requests, one after another automatically, until all MP4 frames are sent. Thank you!

tangy5 avatar Sep 17 '22 16:09 tangy5

I can't push to this thread. The requested URL returned error: 403. Opened another PR.

yasakova-anastasia avatar Nov 03 '22 09:11 yasakova-anastasia