cvat
cvat copied to clipboard
Per segment chunks
Motivation and context
- Changed chunk generation from per-task chunks to per-segment chunks
- Fixed a memory leak in video reading on the server side (only in media_extractors, so there are several more left)
- Fixed a potential hang in
importworker or the server process on process shutdown - Disabled multithreading in video reading in endpoints (not in static chunk generation)
- Refactored static chunk generation code (moved after job creation)
- Refactored various server internal APIs for frame retrieval
- Updated UI logic to access chunks
- Added a new server configuration option
CVAT_ALLOW_STATIC_CACHE(boolean) to enable and disable static cache support. The option is disabled by default (it's changed from the previous behavior) - Added tests for the changes made
- Added missing original chunk type field in job responses
- Fixed invalid kvrocks cleanup in tests for Helm deployment
When this update is applied to the server, there will be a data storage setting migration for the tasks. Existing tasks using static chunks (task.data.storage_method == FILE_SYSTEM) will be switched to the dynamic cache (i.e. to == CACHE)). The remaining files should be removed manually, there will be a list of such tasks in the migration log file.
After this update, you'll have an option to enable or disable static cache use during task creation. This allows, in particular, prohibit new tasks using the static cache. With this option, any tasks using static cache will use the dynamic cache instead on data access.
User-observable changes:
- Job chunk ids now start from 0 for each job instead of using parent task ids
- The
use_cache = falseorstorage_method = filesystemparameters in task creation can be ignored by the server - Task chunk access may be slower for some chunks (particularly, for tasks with overlap configured, for chunks on segment boundaries, and for tasks previously using static chunks)
- The last chunk in a job will contain only the frames from the current job, even if there are more frames in the task
How has this been tested?
Checklist
- [ ] I submit my changes into the
developbranch - [ ] I have created a changelog fragment
- [ ] I have updated the documentation accordingly
- [ ] I have added tests to cover my changes
- [ ] I have linked related issues (see GitHub docs)
- [ ] I have increased versions of npm packages if it is necessary (cvat-canvas, cvat-core, cvat-data and cvat-ui)
License
- [ ] I submit my code changes under the same MIT License that covers the project. Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
Summary by CodeRabbit
-
New Features
- Introduced a new server setting to disable media chunks on the local filesystem.
- Enhanced frame prefetching with a
startFrameparameter for improved chunk calculations. - Added a new property,
data_original_chunk_type, for enhanced job differentiation in the metadata.
-
Bug Fixes
- Resolved memory management issues to prevent leaks during video processing.
- Corrected naming inconsistencies related to the
prefetchAnalyzer.
-
Documentation
- Included configuration for code formatting tools to ensure consistent code quality across the project.
-
Refactor
- Restructured classes and methods for improved clarity and maintainability, particularly in media handling and task processing.
-
Chores
- Updated formatting scripts to include additional directories for automated code formatting.
[!IMPORTANT]
Review skipped
Auto incremental reviews are disabled on this repository.
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
Walkthrough
This update introduces several enhancements across various components, particularly focusing on improved media handling, frame processing, and task management. Notable features include new configuration options for disabling local media storage, refined frame prefetching logic, and updated test frameworks for better reliability. Memory management issues have been addressed, ensuring smoother operations during video processing, while modular code improvements enhance overall maintainability.
Changes
| Files | Change Summary |
|---|---|
| changelog.d/..._mzhiltso_job_chunks.md | New server setting to disable local media chunks; distinct chunk IDs for jobs; memory leak fixes. |
| cvat-core/src/frames.ts | FrameData and PrefetchAnalyzer updates for improved prefetching logic and constructor enhancements. |
| cvat-data/src/ts/cvat-data.ts | FrameDecoder updates for handling startFrame and chunkNumber adjustments. |
| cvat/apps/dataset_manager/bindings.py | Replaced FrameProvider with TaskFrameProvider, reflecting task-specific handling changes. |
| cvat/apps/dataset_manager/formats/cvat.py | Refactored dump_media_files for better modularity; frame retrieval changes. |
| cvat/apps/dataset_manager/tests/test_formats.py | Removed outdated functions and classes; enhanced test reliability through inheritance. |
| cvat/apps/engine/apps.py | Enhanced ready method for dynamic configuration management. |
| cvat/apps/engine/cache.py | Improved MediaCache class with new methods for cache management. |
| cvat/apps/engine/default_settings.py | New setting for static media caching based on environment variables. |
| cvat/apps/engine/frame_provider.py | Major refactor of frame loading capabilities with modular design changes. |
| cvat/apps/engine/log.py | Enhanced logging functionality for migration processes. |
| cvat/apps/engine/media_extractors.py | New classes/methods for media processing and improved type annotations. |
| cvat/apps/engine/models.py | Updated method signatures for chunk names/paths. |
| cvat/apps/engine/pyproject.toml | New configuration settings for code formatting tools. |
| cvat/apps/engine/serializers.py | Added a new read-only field to JobReadSerializer for additional information. |
| cvat/apps/engine/task.py | Enhanced task processing logic for segment creation and job management. |
| cvat/apps/engine/tests/... | Various tests updated for improved validation and functionality. |
| cvat/apps/lambda_manager/tests/test_lambda.py | Refactored test code structure for better maintainability. |
| cvat/requirements/base.in | Added comments for clarity on library dependencies. |
| cvat/schema.yml | New property data_original_chunk_type added to enhance schema expressiveness. |
| tests/python/shared/utils/helpers.py | Updated image/video functions and added a new utility for reading video files. |
Sequence Diagram(s)
sequenceDiagram
participant User
participant Server
participant MediaCache
participant FrameProvider
User->>Server: Request Media
Server->>MediaCache: Check Cache
MediaCache-->>Server: Return Media (if available)
Server->>FrameProvider: Retrieve Frames
FrameProvider-->>Server: Return Frames
Server-->>User: Deliver Media
🐰 In the meadow, changes bloom bright,
New features hop with pure delight!
With frames and tasks now dancing in line,
Our code, like carrots, is truly divine!
So celebrate with a joyful cheer,
For every fix brings us near! 🌿
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
--
I pushed a fix in commit <commit_id>, please review it.--Generate unit testing code for this file.Open a follow-up GitHub issue for this discussion.
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples: --@coderabbitai generate unit testing code for this file.--@coderabbitai modularize this function. - PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples: --@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.--@coderabbitai read src/utils.ts and generate unit testing code.--@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.--@coderabbitai help me debug CodeRabbit configuration file.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (Invoked using PR comments)
@coderabbitai pauseto pause the reviews on a PR.@coderabbitai resumeto resume the paused reviews.@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository.@coderabbitai full reviewto do a full review from scratch and review all the files again.@coderabbitai summaryto regenerate the summary of the PR.@coderabbitai resolveresolve all the CodeRabbit review comments.@coderabbitai configurationto show the current CodeRabbit configuration for the repository.@coderabbitai helpto get help.
Other keywords and placeholders
- Add
@coderabbitai ignoreanywhere in the PR description to prevent this PR from being reviewed. - Add
@coderabbitai summaryto generate the high-level summary at a specific location in the PR description. - Add
@coderabbitaianywhere in the PR title to generate the title automatically.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
Codecov Report
Attention: Patch coverage is 80.79051% with 243 lines in your changes missing coverage. Please review.
Project coverage is 74.34%. Comparing base (
f227718) to head (3eea60d).
Additional details and impacted files
@@ Coverage Diff @@
## develop #8272 +/- ##
===========================================
- Coverage 83.21% 74.34% -8.87%
===========================================
Files 397 394 -3
Lines 41907 42216 +309
Branches 3882 3896 +14
===========================================
- Hits 34873 31386 -3487
- Misses 7034 10830 +3796
| Components | Coverage Δ | |
|---|---|---|
| cvat-ui | 78.73% <82.70%> (+0.03%) |
:arrow_up: |
| cvat-server | 70.47% <80.56%> (-16.76%) |
:arrow_down: |
@coderabbitai review
Actions performed
Review triggered.
Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
[!NOTE] Currently processing new changes in this PR. This may take a few minutes, please wait...
Commits
Files that changed from the base of the PR and between ac9ff3368684bf146385e8174315b1bff736c7db and d2b138564f080371d62a96b27e401a9c8cc7fb69.
Files selected for processing (40)
- changelog.d/20240812_161617_mzhiltso_job_chunks.md (1 hunks)
- changelog.d/20240812_161734_mzhiltso_job_chunks.md (1 hunks)
- changelog.d/20240812_161912_mzhiltso_job_chunks.md (1 hunks)
- cvat-core/src/frames.ts (10 hunks)
- cvat-data/src/ts/cvat-data.ts (4 hunks)
- cvat/apps/dataset_manager/bindings.py (11 hunks)
- cvat/apps/dataset_manager/formats/cvat.py (3 hunks)
- cvat/apps/dataset_manager/tests/test_formats.py (4 hunks)
- cvat/apps/dataset_manager/tests/test_rest_api_formats.py (7 hunks)
- cvat/apps/engine/apps.py (1 hunks)
- cvat/apps/engine/cache.py (4 hunks)
- cvat/apps/engine/default_settings.py (1 hunks)
- cvat/apps/engine/frame_provider.py (1 hunks)
- cvat/apps/engine/log.py (1 hunks)
- cvat/apps/engine/media_extractors.py (18 hunks)
- cvat/apps/engine/migrations/0083_move_to_segment_chunks.py (1 hunks)
- cvat/apps/engine/models.py (2 hunks)
- cvat/apps/engine/pyproject.toml (1 hunks)
- cvat/apps/engine/serializers.py (2 hunks)
- cvat/apps/engine/task.py (13 hunks)
- cvat/apps/engine/tests/test_rest_api.py (15 hunks)
- cvat/apps/engine/tests/test_rest_api_3D.py (4 hunks)
- cvat/apps/engine/tests/utils.py (2 hunks)
- cvat/apps/engine/views.py (11 hunks)
- cvat/apps/lambda_manager/tests/test_lambda.py (4 hunks)
- cvat/apps/lambda_manager/views.py (2 hunks)
- cvat/requirements/base.in (1 hunks)
- cvat/schema.yml (1 hunks)
- dev/format_python_code.sh (1 hunks)
- tests/python/rest_api/test_jobs.py (5 hunks)
- tests/python/rest_api/test_tasks.py (5 hunks)
- tests/python/sdk/test_auto_annotation.py (1 hunks)
- tests/python/sdk/test_datasets.py (1 hunks)
- tests/python/sdk/test_jobs.py (1 hunks)
- tests/python/sdk/test_projects.py (1 hunks)
- tests/python/sdk/test_pytorch.py (1 hunks)
- tests/python/sdk/test_tasks.py (1 hunks)
- tests/python/shared/assets/jobs.json (27 hunks)
- tests/python/shared/fixtures/init.py (1 hunks)
- tests/python/shared/utils/helpers.py (4 hunks)
______________________________________________________________________________________________________________________________________________________________________ < Refactor early, refactor often. Just as you might weed and rearrange a garden, rewrite, rework, and re-architect code when it needs it. Fix the root of the problem. > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- \ \ \ \ /\ ( ) .( o ).
[!TIP]
You can disable sequence diagrams in the walkthrough.
Disable the
reviews.sequence_diagramssetting in your project's settings in CodeRabbit to remove sequence diagrams from the walkthrough.
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Example:
I pushed a fix in commit <commit_id>.Generate unit testing code for this file.Open a follow-up GitHub issue for this discussion.
- Files and specific lines of code (under the "Files changed" tab): Tag
@coderabbitaiin a new review comment at the desired location with your query. Examples:@coderabbitai generate unit testing code for this file.@coderabbitai modularize this function.
- PR comments: Tag
@coderabbitaiin a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:@coderabbitai generate interesting stats about this repository and render them as a table.@coderabbitai show all the console.log statements in this repository.@coderabbitai read src/utils.ts and generate unit testing code.@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.@coderabbitai help me debug CodeRabbit configuration file.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (invoked as PR comments)
@coderabbitai pauseto pause the reviews on a PR.@coderabbitai resumeto resume the paused reviews.@coderabbitai reviewto trigger an incremental review. This is useful when automatic reviews are disabled for the repository.@coderabbitai full reviewto do a full review from scratch and review all the files again.@coderabbitai summaryto regenerate the summary of the PR.@coderabbitai resolveresolve all the CodeRabbit review comments.@coderabbitai configurationto show the current CodeRabbit configuration for the repository.@coderabbitai helpto get help.
Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a
.coderabbit.yamlfile to the root of your repository. - Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
/check
:heavy_check_mark: All checks completed successfully :page_facing_up: See logs here
@zhiltsov-max, I don't remember whether we decided to validate passed/calculated chunk_size based on segment_size or not. However, If we are not going to prohibit task creation (or adapt chunk_size) when chunk_size > segment_size, then GET /api/jobs/<id>/data/meta should return min(chunk_size, segment_size).
@zhiltsov-max, I don't remember whether we decided to validate passed/calculated chunk_size based on segment_size or not. However, If we are not going to prohibit task creation (or adapt chunk_size) when chunk_size > segment_size, then GET /api/jobs/
/data/meta should return min(chunk_size, segment_size).
We decided to optionally implement chunk size restrictions based on the data size, but this PR was already quite big, so I didn't add it. The existing implementation should handle the situation you're talking about in this function for tasks and here for static chunks for segments.
Upd: I don't think we need to change how chunk_size is reported. It should be used to determine chunk ids. If it's bigger than segment size, it means there's just 1 chunk, which is correct. It can be read as max_chunk_size.
/check
:heavy_check_mark: All checks completed successfully :page_facing_up: See logs here
If REST API for each job enumerates chunks from zero (Job View Set), we probably should follow the same approach for context images and frames. Hovewer I do not see any changes related to it. E.g. when we request: /api/jobs/
/data?type=context_image&number= , probably should be relative to the job start.
It might be a good idea, but it wasn't planned for the feature. Only chunks were supposed to change. I can do a separate PR later for this, once it's needed.
Have you checked the impact of this change on the dataset layer in the SDK? I would assume that some adaptation would be needed, since it relies on the chunk layout to find frames by index.
Have you checked the impact of this change on the dataset layer in the SDK? I would assume that some adaptation would be needed, since it relies on the chunk layout to find frames by index.
Yes, we only use task chunks in SDK, their API is not changed by this PR. The tests pass.
One more idea is that before releasing it, we may somehow notify users to force them to update client immediately after release. It may be especially important in this case.
Perhaps need to prepare a pull request in private repos to update code according to new changes (as I see now, some private code uses outdated imports)
I will suggest to make a global replacement, like this to go to the same naming approach
If to press Play in a GT job
I think the problem is somewhere in this code:
As for a ground truth job this condition may not to work.
@zhiltsov-max, Mostly LGTM, but I've found 2 small bugs:
-
If I create a task with 4 images and start_frame 2, I'll receive a 400 code on the task preview request.
-
I'm not sure whether it is related to the current PR or not, but I guess the progress bar can show an incorrect interval of loaded chunks/frames for a GT job (e.g. there were 2 frames 3 and 6 in the GT job):
- I see that now we will have a "3rd type" of created tasks on the server: tasks with images initially created with static chunks, migrated to dynamic chunks but without manifest file. I'm not sure right now whether there can be some problems with that or it becomes a waste of time. However, it is probably some kind of inconsistent state.
If I create a task with 4 images and start_frame 2, I'll receive a 400 code on the task preview request.
Fixed
I'm not sure whether it is related to the current PR or not, but I guess the progress bar can show an incorrect interval of loaded chunks/frames for a GT job (e.g. there were 2 frames 3 and 6 in the GT job):
Fixed
Quality Gate passed
Issues
33 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
1.9% Duplication on New Code