Write a compressor/decompressor for the trace log file

Open gaogaotiantian opened this issue 2 years ago • 17 comments

Currently the trace log file is huge, which is fine on a local machine. However, the size makes it difficult to share the trace file over a network or to store it in the cloud.

Most of the information in the trace file is duplicated, so we should be able to get a very decent compression ratio for it.

gaogaotiantian avatar Jun 10 '22 06:06 gaogaotiantian
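
To illustrate the point about duplication, here is a minimal sketch that compresses a synthetic trace with Python's general-purpose `zlib` module. This is not the custom compressor proposed in this issue; the event fields and values are made up for illustration and only loosely follow the Chrome Trace Event style.

```python
import json
import zlib


def make_synthetic_trace(n_events):
    """Build a fake trace; every field value here is illustrative."""
    events = [
        {
            "pid": 1234,
            "tid": 5678,
            "ts": 1000.0 + i * 0.5,
            "dur": 0.3,
            "name": "fib (example.py:3)",
            "ph": "X",
            "cat": "FEE",
        }
        for i in range(n_events)
    ]
    return {"traceEvents": events}


def compression_ratio(trace):
    """Return raw size divided by zlib-compressed size for a JSON trace."""
    raw = json.dumps(trace).encode("utf-8")
    packed = zlib.compress(raw, level=9)
    # Sanity check: decompressing must give back the same trace.
    assert json.loads(zlib.decompress(packed)) == trace
    return len(raw) / len(packed)


ratio = compression_ratio(make_synthetic_trace(10_000))
print(f"compression ratio: {ratio:.1f}x")
```

Because keys such as `pid`, `tid` and `name` repeat in every event, even a generic compressor shrinks this kind of data considerably. A format that knows the structure of the trace can avoid writing the repeated fields at all.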

I'm trying to write a compressor for VizTracer, but I don't know how to get started with a project that mixes C and Python. Could you tell me a simple way to set up a debug environment for "vcompressor"? As I see it, I only need to focus on implementing the compression process itself.

Nu1l998 avatar Aug 27 '22 15:08 Nu1l998

You can start with the trace-log-compressor branch, then follow this documentation to set up the environment. You should be able to run viztracer --compress <your_result.json> to trigger the existing function.

gaogaotiantian avatar Aug 27 '22 16:08 gaogaotiantian

The example test case for vcompressor simply fails because the input vdb_multithread.json is missing. I tried to generate this file using tests/data/vdb_multithread.py with the --vdb option and only got a 2.35 KB JSON file. I would have expected it to be larger than the existing multithread.json (105.22 KB), since the help text for --vdb says it adds overhead.

By the way, it seems that CI is disabled for the new trace-log-compressor branch in this repo. Is that intentional?

Sefank avatar Aug 27 '22 19:08 Sefank

vdb_multithread.json is checked in, and the example test case uses get_json_file_path("vdb_multithread.json") to locate it in the tests/data directory. Did the test fail when you ran it?

--vdb is actually deprecated, and the overhead mentioned in the help message is time overhead, not file size. You can check the JSON file and figure out what is missing: FEE or file info.

CI is disabled for pushes to any branch other than master. However, it is enabled for all pull requests to any branch.

gaogaotiantian avatar Aug 27 '22 19:08 gaogaotiantian

> vdb_multithread.json is checked in, and the example test case uses get_json_file_path("vdb_multithread.json") to locate it in the tests/data directory. Did the test fail when you ran it?

I've double-checked and confirmed that there is no vdb_multithread.json in tests/data/, which is where get_json_file_path tries to locate it.

[Screenshot: files in tests/data/]

Sefank avatar Aug 28 '22 05:08 Sefank

You are correct; this is an error on my side. I will fix it soon.

gaogaotiantian avatar Aug 28 '22 05:08 gaogaotiantian

The latest fix has been pushed to the trace-log-compressor branch. There's no need to use vdb_multithread.json; the test can simply use multithread.json. Please pull from the branch into your repo.

gaogaotiantian avatar Aug 28 '22 06:08 gaogaotiantian

Should we consider using protobuf? I think it is good for serializing and de-serializing, and it is popular in RPC cases.

Milkve avatar Aug 28 '22 15:08 Milkve

protobuf is not a compression tool. It's an alternative to JSON.

gaogaotiantian avatar Aug 28 '22 23:08 gaogaotiantian

The current basic test fails on Windows and I'd love to leave this bug to you guys as a first issue to work on!

gaogaotiantian avatar Aug 29 '22 05:08 gaogaotiantian

The document doesn't seem to describe the storage format for other events, e.g. instant events. Is this for me to decide?

LorewalkerZhou avatar Aug 29 '22 15:08 LorewalkerZhou

> The document doesn't seem to describe the storage format for other events, e.g. instant events. Is this for me to decide?

I would suggest making a PR for the protocol first. We can discuss the design, and you can implement it after the design is accepted.

gaogaotiantian avatar Aug 29 '22 16:08 gaogaotiantian

For instant events, I'm considering the following format: header(header) - pid(pid) - tid(tid) - name(str) - count(uint64) - [start(ts) - scopes(str)]*

LorewalkerZhou avatar Aug 30 '22 15:08 LorewalkerZhou

> For instant events, I'm considering the following format: header(header) - pid(pid) - tid(tid) - name(str) - count(uint64) - [start(ts) - scopes(str)]*

header(header) - pid(pid) - tid(tid) - name(str) - scopes(str) - count(uint64) - [start(ts)]* may be a better protocol.

LorewalkerZhou avatar Aug 30 '22 16:08 LorewalkerZhou

> For instant events, I'm considering the following format: header(header) - pid(pid) - tid(tid) - name(str) - count(uint64) - [start(ts) - scopes(str)]*
>
> header(header) - pid(pid) - tid(tid) - name(str) - scopes(str) - count(uint64) - [start(ts)]* may be a better protocol.

You should probably get a better understanding of instant events first. For example, args is a critical part of an instant event: it contains the data that the user logs.

gaogaotiantian avatar Aug 30 '22 18:08 gaogaotiantian

> You should probably get a better understanding of instant events first. For example, args is a critical part of an instant event: it contains the data that the user logs.

Since args has to be a JSON-serializable object, I think we could simply dump args to a string and save it.

header(header) - pid(pid) - tid(tid) - name(str) - scopes(str) - count(uint64) - [start(ts) - args(str)]*

LorewalkerZhou avatar Aug 31 '22 15:08 LorewalkerZhou
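
The layout proposed above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the thread does not define the binary encoding of `header`, `pid`, `tid`, `str` or `ts`, so the header byte `INSTANT_HEADER`, the 64-bit pid/tid, the uint32 length-prefixed strings and the float64 timestamps are all hypothetical choices, as are the function names.

```python
import io
import json
import struct
from collections import defaultdict

INSTANT_HEADER = 0x11  # hypothetical marker byte, not from the real protocol


def _write_str(buf, text):
    data = text.encode("utf-8")
    buf.write(struct.pack("<I", len(data)))  # assumed: uint32 length prefix
    buf.write(data)


def _read_str(buf):
    (length,) = struct.unpack("<I", buf.read(4))
    return buf.read(length).decode("utf-8")


def encode_instant_events(events):
    # Group events that share pid, tid, name and scope so those fields
    # are written once per group instead of once per event.
    groups = defaultdict(list)
    for event in events:
        key = (event["pid"], event["tid"], event["name"], event["s"])
        groups[key].append((event["ts"], json.dumps(event.get("args", {}))))

    buf = io.BytesIO()
    for (pid, tid, name, scope), items in groups.items():
        buf.write(struct.pack("<BQQ", INSTANT_HEADER, pid, tid))
        _write_str(buf, name)
        _write_str(buf, scope)
        buf.write(struct.pack("<Q", len(items)))  # count(uint64)
        for ts, args in items:
            buf.write(struct.pack("<d", ts))  # assumed: ts stored as float64
            _write_str(buf, args)  # args dumped to a JSON string
    return buf.getvalue()


def decode_instant_events(blob):
    buf = io.BytesIO(blob)
    events = []
    while True:
        head = buf.read(17)  # 1 + 8 + 8 bytes for header, pid, tid
        if not head:
            break
        header, pid, tid = struct.unpack("<BQQ", head)
        if header != INSTANT_HEADER:
            raise ValueError(f"unexpected header byte: {header:#x}")
        name = _read_str(buf)
        scope = _read_str(buf)
        (count,) = struct.unpack("<Q", buf.read(8))
        for _ in range(count):
            (ts,) = struct.unpack("<d", buf.read(8))
            args = json.loads(_read_str(buf))
            events.append({"ph": "i", "pid": pid, "tid": tid, "name": name,
                           "s": scope, "ts": ts, "args": args})
    return events


sample = [
    {"ph": "i", "pid": 1, "tid": 2, "name": "checkpoint", "s": "g",
     "ts": 10.5, "args": {"step": 1}},
    {"ph": "i", "pid": 1, "tid": 2, "name": "checkpoint", "s": "g",
     "ts": 20.25, "args": {"step": 2}},
]
blob = encode_instant_events(sample)
assert decode_instant_events(blob) == sample
```

Grouping by name and scope means the decoder returns events group by group, so the original interleaving across groups is not preserved; a consumer that needs time order would have to sort by `ts` afterwards.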

> You should probably get a better understanding of instant events first. For example, args is a critical part of an instant event: it contains the data that the user logs.
>
> Since args has to be a JSON-serializable object, I think we could simply dump args to a string and save it.
>
> header(header) - pid(pid) - tid(tid) - name(str) - scopes(str) - count(uint64) - [start(ts) - args(str)]*

This is probably doable. You can check in the README first, then start prototyping it.

gaogaotiantian avatar Aug 31 '22 17:08 gaogaotiantian

Maybe it's better to store arg_name and arg_value separately. For example, if a function is called many times with the '--log_func_args' option enabled, storing the args string directly may be redundant because the same arg_name is stored many times.

TTianshun avatar Nov 03 '22 02:11 TTianshun
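
One way to picture this suggestion is a small name table: each distinct tuple of argument names is stored once, and every call stores only an index into that table plus its values. The sketch below is illustrative only; `split_args` and `join_args` are hypothetical helpers, and the sample calls are made-up data rather than real VizTracer output.

```python
import json


def split_args(calls):
    """Split per-call arg dicts into a shared name table and value rows."""
    name_table = []  # each distinct tuple of arg names, stored once
    index_of = {}    # tuple of arg names -> position in name_table
    rows = []        # one (name_table index, values) pair per call
    for args in calls:
        names = tuple(args)
        if names not in index_of:
            index_of[names] = len(name_table)
            name_table.append(names)
        rows.append((index_of[names], list(args.values())))
    return name_table, rows


def join_args(name_table, rows):
    """Rebuild the original list of arg dicts."""
    return [dict(zip(name_table[idx], values)) for idx, values in rows]


# Made-up data: one function called 1000 times with the same arg names.
calls = [{"n": i, "memo": "cache"} for i in range(1000)]
name_table, rows = split_args(calls)

assert join_args(name_table, rows) == calls
direct_size = len(json.dumps(calls))
split_size = len(json.dumps([name_table, rows]))
print(f"direct: {direct_size} bytes, split: {split_size} bytes")
```

The saving grows with the number of calls and the length of the argument names, since the names are written once per distinct signature rather than once per call.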