KV pair ir stream (IR_v2) --> clp_s archive format
Description
This PR:
- Adds IR V2 to archive format conversion.
- Exposes JSON to IRV2 parsing to the user through the command line.
- Enables users to write the IRV2 format to a file.
Validation performed
- Validated IR conversion, compression, and extraction to JSON
- Generated IR V2 format for all 5 JSON public datasets
- ex) ./clp-s r elasticsearch_ir elasticsearch/
- Compressed those IRs into archives
- ex) ./clp-s i elasticsearch_archive elasticsearch_ir/
- Extracted those archives back to JSON
- ex) ./clp-s x elasticsearch_archive elasticsearch_out/
- Compressed and extracted the JSON using clp-s for comparison
- ./clp-s c elasticsearch_clp-s_archive elasticsearch/
- ./clp-s x elasticsearch_clp-s_archive elasticsearch_clp-s_out/
- Sorted the extracted JSON for the spark, elasticsearch, and postgresql datasets and compared it against both the original JSON and the clp-s output for an exact match
- ex)
- jq -S -c '.' elasticsearch_out/original | sort > elasticsearch_sorted.json
- jq -S -c '.' elasticsearch_clp-s_out/original | sort > elasticsearch_clp-s_sorted.json
- diff elasticsearch_clp-s_sorted.json elasticsearch_sorted.json | diffstat
- Spot-checked that the output matches for the mongodb and cockroachdb datasets
- Notes:
- I noticed that floats extracted from both clp-s archives and IRv2-derived archives can have a different number of significant digits than in the original JSON.
- ex.) original: 1.0; clp_s: 1.00000; IRv2: 1.00000
- ex.) original: 0.9758767; clp_s: 0.9758767; IRv2: 0.975877
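The float differences in the notes above look like what fixed-precision formatting produces. The following is a minimal sketch of that effect in Python, not the clp-s implementation: `fmt_sig` is a hypothetical helper, and the choice of 6 significant digits is an assumption inferred from the example values, not confirmed from the code.

```python
def fmt_sig(value: float, digits: int = 6, keep_trailing_zeros: bool = False) -> str:
    """Format a float with a fixed number of significant digits.

    Hypothetical helper for illustration only; `digits=6` is an assumption
    chosen because it reproduces the values quoted in the notes.
    """
    # The '#' flag keeps trailing zeros that the 'g' conversion would drop.
    spec = f"{'#' if keep_trailing_zeros else ''}.{digits}g"
    return format(value, spec)


# Shortest round-trip representation, as a JSON writer typically emits it.
print(repr(0.9758767))
# Six significant digits lose the last digit of the original value.
print(fmt_sig(0.9758767))
# Keeping trailing zeros pads 1.0 out to six significant digits.
print(fmt_sig(1.0, keep_trailing_zeros=True))
```

If this is the mechanism, the values differ only in their textual form beyond the sixth significant digit, which may be worth keeping in mind when interpreting the diff results.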
Benchmarking Info
ElasticSearch : ~1.6x longer than clp-s
$ time ./clp-s c elasticsearch_clp-s_archive elasticsearch
real 18m4.676s
user 18m1.334s
sys 0m3.140s
$ time ./clp-s i elasticsearch_archive elasticsearch_ir/
real 29m31.376s
user 29m26.950s
sys 0m4.224s
Postgresql: ~1.6x longer than clp-s
$ time ./clp-s c postgresql_clp-s_archive postgresql
real 1m37.820s
user 1m37.445s
sys 0m0.232s
$ time ./clp-s i postgresql_archive postgresql_ir/
real 2m41.273s
user 2m40.924s
sys 0m0.172s
Spark : ~2.2x longer than clp-s
$ time ./clp-s c spark_archive spark-event-logs
real 4m18.178s
user 4m16.497s
sys 0m1.444s
$ time ./clp-s i spark_archive spark_ir/
real 9m38.949s
user 9m36.601s
sys 0m2.148s
Cockroach : ~1.45x longer than clp-s
$ time ./clp-s c cockroachdb_clp-s_archive cockroachdb
real 34m42.541s
user 33m27.539s
sys 0m11.856s
$ time ./clp-s i cockroachdb_archive cockroachdb_ir/
real 50m30.246s
user 50m20.945s
sys 0m8.311s
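The slowdown factors quoted above can be rechecked from the `real` times. This is a small sketch in Python; `parse_time` and `slowdown` are hypothetical helper names, and it assumes the durations keep the `XmY.YYYs` shape that `time` printed in these runs.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '29m31.376s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown(baseline: str, candidate: str) -> float:
    """Return how many times longer `candidate` took than `baseline`."""
    return parse_time(candidate) / parse_time(baseline)


# (clp-s `c` real time, IR-to-archive `i` real time), copied from the runs above.
runs = {
    "elasticsearch": ("18m4.676s", "29m31.376s"),
    "postgresql": ("1m37.820s", "2m41.273s"),
    "spark": ("4m18.178s", "9m38.949s"),
    "cockroachdb": ("34m42.541s", "50m30.246s"),
}

for name, (clp_s_real, ir_real) in runs.items():
    print(f"{name}: {slowdown(clp_s_real, ir_real):.2f}x")
```

The ratios use wall-clock (`real`) time only; since `user` time dominates in every run, using `user` instead should give nearly the same factors.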
Postgres Perf Breakdown
CLP-S
Total: 1m 37s
64% in parse_line() - 1m 2s
18% in m_archive_writer->append_message() - 17.5s
5.2% JSON I/O - 5s
IRV2 -> Archive
Total: 2m 41s
53% in parse_kv_log_event() (includes m_archive_writer->append_message()) - 1m 25s
35% deserializing IR (equivalent to the JSON I/O) - 56.5s
Summary: Deserializing the IR adds significantly more overhead than the JSON I/O does (56.5s versus 5s on this dataset). We are essentially reconstructing the information twice: once back into the format that was written out to the IR file, and then again into the archive format by walking over the IRV2 structures.
Summary by CodeRabbit
New Features
- Introduced new command options for JSON to IR format conversion and IR format compression in the command line interface.
- Added functionalities for compressing and generating IR data.
- Expanded the source file organization to accommodate new features and improvements.
- Enhanced serialization capabilities with new methods for managing and processing data.
- Improved command handling with additional error logging and validation for new functionalities.
- Enhanced JSON parsing capabilities with new methods for processing various data types.
Bug Fixes
- Improved command input parsing logic to validate metadata database configurations.
Documentation
- Updated command line help output to include new options and usage instructions.