Add `mcap du` command

Open defunctzombie opened this issue 2 years ago • 1 comments

Problem: When looking at mcap files I want to understand how much space particular topics take up. I might want to know if I have some heavyweight topics I need to trim or if some debug topic is taking a large amount of space relative to its worth. Having a command that shows me how much "space" is used by each topic can help me answer such questions.

This change introduces a new `mcap du` command. This command reads an mcap file and outputs "disk usage" statistics about it.

Below I show invocations of this command on some sample mcap files. The output shows the space taken by each kind of record and then the space per topic (relative to all topics). For me, the primary use case for this command is to show the "topic" information, but I wanted to offer both sets of "usage" for discussion. I would probably hide the "record" breakdown under a flag or remove it entirely.

$ mcap du ~/Downloads/demo_2023-11-18_08-59-16.mcap
RECORD KIND    | SUM BYTES | % OF TOTAL FILE BYTES  
-----------------+-----------+------------------------
header         |        21 |              0.000035  
data end       |         4 |              0.000007  
schema         |       872 |              0.001466  
channel        |        95 |              0.000160  
statistics     |        76 |              0.000128  
chunk          |  59343216 |             99.748138  
message index  |     50860 |              0.085489  
chunk index    |     97828 |              0.164436  
summary offset |        68 |              0.000114  
footer         |        20 |              0.000034  

TOPIC       | SUM BYTES | % OF TOTAL MESSAGE BYTES  
--------------+-----------+---------------------------
camera_h264 |   3723297 |                 6.288593  
mouse       |     29340 |                 0.049555  
camera_jpeg |  55454516 |                93.661850
$ mcap du ~/Downloads/NuScenes-v1.0-mini-scene-0061-f4fbf7b.mcap
RECORD KIND    | SUM BYTES | % OF TOTAL FILE BYTES  
-----------------+-----------+------------------------
statistics     |       456 |              0.000089  
unknown        |        30 |              0.000006  
summary offset |       102 |              0.000020  
header         |        21 |              0.000004  
chunk          | 511377621 |             99.861298  
message index  |    582526 |              0.113755  
schema         |     15152 |              0.002959  
channel        |      1786 |              0.000349  
metadata       |       211 |              0.000041  
data end       |         4 |              0.000001  
chunk index    |    109996 |              0.021480  
footer         |        20 |              0.000004  

TOPIC                                  | SUM BYTES | % OF TOTAL MESSAGE BYTES  
-----------------------------------------+-----------+---------------------------
/CAM_FRONT_RIGHT/image_rect_compressed |  31133352 |                 4.014049  
/CAM_BACK_LEFT/image_rect_compressed   |  36039727 |                 4.646631  
/CAM_BACK_LEFT/camera_info             |     62359 |                 0.008040  
/RADAR_FRONT_LEFT                      |    643465 |                 0.082962  
/CAM_FRONT/camera_info                 |     62870 |                 0.008106  
/CAM_FRONT/lidar                       |  38053088 |                 4.906216  
/CAM_FRONT_RIGHT/annotations           |    977461 |                 0.126025  
/RADAR_FRONT                           |   1408531 |                 0.181603  
/RADAR_BACK_LEFT                       |   1443803 |                 0.186151  
/CAM_BACK/image_rect_compressed        |  29820298 |                 3.844755  
/CAM_BACK_LEFT/lidar                   |  51820479 |                 6.681257  
/CAM_FRONT_LEFT/image_rect_compressed  |  36577185 |                 4.715926  
/CAM_FRONT_LEFT/camera_info            |     63989 |                 0.008250  
/gps                                   |       702 |                 0.000091  
/drivable_area                         |   3999277 |                 0.515630  
/CAM_FRONT_RIGHT/camera_info           |     62225 |                 0.008023  
/CAM_BACK_RIGHT/annotations            |    820174 |                 0.105746  
/CAM_BACK_LEFT/annotations             |    157148 |                 0.020261  
/CAM_FRONT_LEFT/annotations            |    288426 |                 0.037187  
/markers/annotations                   |    846641 |                 0.109158  
/markers/car                           |      5794 |                 0.000747  
/diagnostics                           |   3512567 |                 0.452878  
/RADAR_BACK_RIGHT                      |   1386978 |                 0.178824  
/CAM_FRONT/annotations                 |   1284639 |                 0.165630  
/CAM_FRONT/image_rect_compressed       |  32778271 |                 4.226129  
/CAM_BACK_RIGHT/image_rect_compressed  |  33640191 |                 4.337257  
/CAM_BACK_RIGHT/camera_info            |     62294 |                 0.008032  
/CAM_BACK/lidar                        |  55684474 |                 7.179444  
/CAM_BACK/annotations                  |   1606465 |                 0.207123  
/odom                                  |    371586 |                 0.047909  
/map                                   |  15768687 |                 2.033070  
/tf                                    |    268080 |                 0.034564  
/CAM_BACK_RIGHT/lidar                  |  44188116 |                 5.697210  
/CAM_FRONT_LEFT/lidar                  |  43207431 |                 5.570769  
/pose                                  |      1465 |                 0.000189  
/imu                                   |    583740 |                 0.075262  
/semantic_map                          |     59500 |                 0.007671  
/CAM_FRONT_RIGHT/lidar                 |  40571903 |                 5.230968  
/CAM_BACK/camera_info                  |     60430 |                 0.007791  
/RADAR_FRONT_RIGHT                     |    986438 |                 0.127182  
/LIDAR_TOP                             | 265299545 |                34.205288

Implementation notes:

  • This implementation reads the entire file. There is an opportunity for "optimization" by using the MessageIndex records to figure out the size of records without decompressing chunks. I view this as out of scope for the v1.
  • This is introduced as a separate command. I can see this feeling natural under the info command as a flag.
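
As a rough illustration of the per-topic table above, the aggregation step of a full scan can be sketched as follows. This is not the code from the PR: `messageSize`, `topicUsage`, and `summarize` are hypothetical names, the input stands in for whatever the reader yields while scanning, and sorting rows by size is an added choice (the sample output above is unsorted).

```go
package main

import (
	"fmt"
	"sort"
)

// messageSize is a hypothetical stand-in for one message seen during a scan:
// the topic it belongs to and the number of bytes attributed to it.
type messageSize struct {
	Topic string
	Bytes uint64
}

// topicUsage is one row of the per-topic report.
type topicUsage struct {
	Topic   string
	Bytes   uint64
	Percent float64 // share of total message bytes, 0 to 100
}

// summarize sums bytes per topic and returns rows sorted largest first,
// breaking ties by topic name so the order is deterministic.
func summarize(msgs []messageSize) []topicUsage {
	totals := make(map[string]uint64)
	var total uint64
	for _, m := range msgs {
		totals[m.Topic] += m.Bytes
		total += m.Bytes
	}
	rows := make([]topicUsage, 0, len(totals))
	for topic, n := range totals {
		pct := 0.0
		if total > 0 {
			pct = float64(n) * 100 / float64(total)
		}
		rows = append(rows, topicUsage{Topic: topic, Bytes: n, Percent: pct})
	}
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].Bytes != rows[j].Bytes {
			return rows[i].Bytes > rows[j].Bytes
		}
		return rows[i].Topic < rows[j].Topic
	})
	return rows
}

func main() {
	// Made-up sizes, only to exercise the function.
	msgs := []messageSize{
		{"camera_jpeg", 3000}, {"mouse", 50}, {"camera_jpeg", 4000}, {"camera_h264", 950},
	}
	for _, r := range summarize(msgs) {
		fmt.Printf("%-12s | %8d | %10.6f\n", r.Topic, r.Bytes, r.Percent)
	}
}
```

The percentages are relative to the sum over all topics, which matches the "% OF TOTAL MESSAGE BYTES" column in the sample output.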

defunctzombie avatar Nov 19 '23 00:11 defunctzombie

Are there plans to finish this PR and merge? An mcap du command would be very useful.

jhurliman avatar Oct 25 '24 00:10 jhurliman

I got notified by John's comment. I don't want to kick a hornet's nest but my feedback on the concept is,

  • The principled way to handle this (consistent with MCAP design goals) is with a backward-compatible evolution of the statistics concept (probably via new records) to allow writers to record this information to the summary section. This would make "du" operations cheap on remote files, which is a goal met by all the other summarization operations supported by the format and tool. I think supporting "du" with a full scan doesn't play to the strengths of the format, and I think it's usually unwise to introduce features that can't be well-supported without additional design/work.
  • If extending the statistics concept feels like too much, I think this patch can still be improved by basing the statistics on the message index records and skipping all decompression. The timestamps/offsets in those indexes should be sufficient, and will require a lot less data to be inspected. However, it will still require a seek per chunk, which will make it untenably slow on NFS (and possibly worse on S3 etc) - unlike "mcap info" and the other related commands. It would suffer from the same issues as "rosbag info" in this article: https://foxglove.dev/blog/mcap-vs-ros1-bag-index-performance. For that reason I don't think this optimization is a "good" approach, for the same reason a full scan isn't.
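
To make the second option concrete, here is a minimal sketch of how sizes might be estimated from message index offsets alone. It assumes the offsets are positions within the uncompressed chunk data and that the uncompressed chunk length is known from the chunk index. The `chunkIndex` type and `estimateChannelBytes` function are hypothetical and not part of the mcap library. The estimate is approximate: each gap includes record framing, and any schema or channel records stored inside the chunk get attributed to the preceding message.

```go
package main

import (
	"fmt"
	"sort"
)

// chunkIndex is a hypothetical, simplified view of the index data for one
// chunk: the uncompressed chunk length and, per channel ID, the offsets of
// that channel's messages within the uncompressed chunk.
type chunkIndex struct {
	UncompressedSize uint64
	Offsets          map[uint16][]uint64
}

// estimateChannelBytes attributes the gap between each indexed offset and the
// next one (or the end of the chunk) to the channel that owns the offset.
func estimateChannelBytes(c chunkIndex) map[uint16]uint64 {
	type entry struct {
		offset  uint64
		channel uint16
	}
	var entries []entry
	for ch, offs := range c.Offsets {
		for _, o := range offs {
			entries = append(entries, entry{offset: o, channel: ch})
		}
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].offset < entries[j].offset })

	sizes := make(map[uint16]uint64)
	for i, e := range entries {
		end := c.UncompressedSize
		if i+1 < len(entries) {
			end = entries[i+1].offset
		}
		if end > e.offset { // guard against malformed indexes
			sizes[e.channel] += end - e.offset
		}
	}
	return sizes
}

func main() {
	// Made-up offsets: channel 1 has messages at 0 and 60, channel 2 at 40.
	c := chunkIndex{
		UncompressedSize: 100,
		Offsets:          map[uint16][]uint64{1: {0, 60}, 2: {40}},
	}
	fmt.Println(estimateChannelBytes(c))
}
```

This avoids decompression but, as noted above, still needs the message index records for every chunk, so the seek-per-chunk cost on remote storage remains.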

Accordingly, my vote would be to close this and instead approach it as a side-effect of custom statistics support (there are multiple tickets/discussions on that around already). Once a better model for statistics exists, du can be implemented trivially with great performance, or even folded into "mcap info" (possibly just another column in the output?).

Writers would then need to update to get the benefit, but that's no different from writers needing to opt in to various features to get the full data out of the current "info" command.

Some previous custom stats ideas - https://github.com/foxglove/mcap/issues/723 https://github.com/foxglove/mcap/issues/384

edit: minor added detail

wkalt avatar Oct 28 '24 02:10 wkalt

Another thought: lots of tools allow you to write custom plugins for operations that are convenient but don't quite rise to the standard of inclusion in the tool itself. Two examples are git and cargo.

One possible way to split the difference on this would instead be to add a plugin extension mechanism, and point users who want this kind of feature either toward that or toward a plugin in some Foxglove repo. This would keep the door open for the required format changes to support this within the proper tool, without being a blocker for users who would accept the poor performance.

Two approaches I have seen for this:

  • scan the user's path for executables with a known prefix, and treat these as subcommands. IIRC this is how cargo and git both work.
  • for better integration with the cobra lib and help tooling, take a look at https://github.com/spf13/cobra/issues/691 and related discussions. This is how I handle plugins for the dp3 binary. Getting this working cross-language is probably more of a pain.
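
The first approach could look roughly like the sketch below. The `mcap-` prefix and the function names are assumptions for illustration; nothing like this exists in the mcap CLI today, and the sketch ignores platform details such as `.exe` suffixes on Windows.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// subcommandName returns the subcommand implied by an executable's file name,
// e.g. "mcap-du" -> "du". The prefix is an assumed naming convention.
func subcommandName(fileName, prefix string) (string, bool) {
	name := strings.TrimPrefix(fileName, prefix)
	if name == fileName || name == "" {
		return "", false
	}
	return name, true
}

// findPlugins scans each directory in pathEnv and maps subcommand names to the
// first matching executable found, so earlier PATH entries take precedence.
func findPlugins(pathEnv, prefix string) map[string]string {
	plugins := make(map[string]string)
	for _, dir := range filepath.SplitList(pathEnv) {
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue // skip missing or unreadable PATH entries
		}
		for _, e := range entries {
			name, ok := subcommandName(e.Name(), prefix)
			if !ok {
				continue
			}
			full := filepath.Join(dir, e.Name())
			info, err := os.Stat(full) // Stat follows symlinks
			if err != nil || info.IsDir() || info.Mode().Perm()&0o111 == 0 {
				continue // not a regular executable
			}
			if _, seen := plugins[name]; !seen {
				plugins[name] = full
			}
		}
	}
	return plugins
}

func main() {
	for name, path := range findPlugins(os.Getenv("PATH"), "mcap-") {
		fmt.Printf("%s -> %s\n", name, path)
	}
}
```

With something like this, a user could drop an `mcap-du` executable on their path and have it picked up as a `du` subcommand without any change to the core tool.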

wkalt avatar Oct 28 '24 02:10 wkalt

My perspective is that this patch is a simple improvement that provides a level of valuable information. We can caveat it that different topics compress differently but there's no particular reason to hold up the feature on some idealized changes or optimizations. If later someone proposes alternatives or wants to implement proposals - we can evaluate those separately.

Custom stats puts the requirements on writers to add support for features. This allows a reader to provide insight into a file without writers having to be updated.

defunctzombie avatar Oct 28 '24 04:10 defunctzombie

+1 to moving this specific patch forward. My own personal interest in this is for MCAPs that already exist, not a future version of the spec for files that haven’t been serialized yet.

jhurliman avatar Oct 28 '24 06:10 jhurliman

Well, I think I have clearly expressed why this is not a good feature - it can't be made to play well with remote or large files without physical changes to the format, which subjects users to a poor UX.

The plugin mechanism I suggested also handles the backward compatibility issue, and unlike this approach is fully generic (plugin writers can implement whatever functionality they want, including making use of their own internal MCAP conventions -- something not possible in the tool today). In contrast, with this approach we get a single expensive operation that does one thing.

Anyway, my stake in this is not too big so I'll appeal to @james-rms and bow out.

wkalt avatar Oct 28 '24 12:10 wkalt

@james-rms I've updated the PR to use utils.FormatTable and humanBytes as suggested. I've also made it clearer that the message size is the uncompressed size by adding (uncompressed) to the sum bytes heading item.

@wkalt The PR description does note that this implementation is "slower" because it reads the entire file rather than using message index records. That would be a good follow-on improvement that doesn't need to hold up the v1 of this command.

defunctzombie avatar Oct 30 '24 11:10 defunctzombie