KV pair ir stream (IR_v2) --> clp_s archive format

Open AVMatthews opened this issue 5 months ago • 3 comments

Description

This PR:

- Adds IR V2 to archive format conversion.
- Exposes the JSON to IR V2 parsing to the user through the command line.
- Enables users to write the IR V2 format to a file.

Validation performed

Validated IR conversion, compression, and extraction to JSON
- Generated IR V2 format for all 5 JSON public datasets
  - ex) ./clp-s r elasticsearch_ir elasticsearch/
- Compressed those IRs into Archive
  - ex) ./clp-s i elasticsearch_archive elasticsearch_ir/
- Extracted those archives back to JSON
  - ex) ./clp-s x elasticsearch_archive elasticsearch_out/
- Compressed and extracted the same JSON using clp-s directly, for comparison
  - ./clp-s c elasticsearch_clp-s_archive elasticsearch/
  - ./clp-s x elasticsearch_clp-s_archive elasticsearch_clp-s_out/
- Sorted the JSON extracted for the spark, elasticsearch, and postgresql datasets and compared it against both the original JSON and the clp-s output for an exact match
  - ex)
    - jq -S -c '.' elasticsearch_out/original | sort > elasticsearch_sorted.json
    - jq -S -c '.' elasticsearch_clp-s_out/original | sort > elasticsearch_clp-s_sorted.json
    - diff elasticsearch_clp-s_sorted.json elasticsearch_sorted.json | diffstat
- Spot-checked that the mongodb and cockroachdb datasets match
- Notes:
  - For some floats, both clp-s and the IRv2-to-archive path output a different number of significant digits than the original JSON.
  - ex) original: 1.0; clp_s: 1.00000; IRv2: 1.00000
  - ex) original: 0.9758767; clp_s: 0.9758767; IRv2: 0.975877
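
Because of the float formatting differences noted above, a byte-for-byte diff can flag records that are numerically equivalent. The following is a minimal sketch of a tolerant comparison, not part of this PR: the function names and the choice of 6 significant digits are assumptions made for illustration (6 matches the IRv2 example above, but the source does not state the precision actually used).

```python
import json
import math


def normalize(value, sig_digits=6):
    """Recursively round every float to `sig_digits` significant digits.

    The default of 6 is an assumption chosen to match the IRv2 example
    (0.9758767 -> 0.975877); it is not taken from the clp-s code.
    """
    if isinstance(value, float):
        if value == 0.0 or not math.isfinite(value):
            return value
        return float(f"{value:.{sig_digits}g}")
    if isinstance(value, dict):
        return {k: normalize(v, sig_digits) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v, sig_digits) for v in value]
    return value


def canonical_lines(lines, sig_digits=6):
    """Return sorted, key-sorted JSON strings, one per non-empty input line."""
    out = []
    for line in lines:
        line = line.strip()
        if line:
            record = normalize(json.loads(line), sig_digits)
            out.append(json.dumps(record, sort_keys=True))
    return sorted(out)


def same_records(a_lines, b_lines, sig_digits=6):
    """True if both inputs hold the same records up to float rounding."""
    return canonical_lines(a_lines, sig_digits) == canonical_lines(b_lines, sig_digits)
```

Passing two open newline-delimited JSON files, for example the original and the extracted output, to `same_records` plays the same role as the `jq -S -c '.' ... | sort` and `diff` steps, except that floats are rounded before comparison.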

Benchmarking Info

ElasticSearch: ~1.6x longer than clp-s
    $ time ./clp-s c elasticsearch_clp-s_archive elasticsearch
    real 18m4.676s
    user 18m1.334s
    sys 0m3.140s
    $ time ./clp-s i elasticsearch_archive elasticsearch_ir/
    real 29m31.376s
    user 29m26.950s
    sys 0m4.224s

Postgresql: ~1.6x longer than clp-s
    $ time ./clp-s c postgresql_clp-s_archive postgresql
    real 1m37.820s
    user 1m37.445s
    sys 0m0.232s
    $ time ./clp-s i postgresql_archive postgresql_ir/
    real 2m41.273s
    user 2m40.924s
    sys 0m0.172s

Spark: ~2.2x longer than clp-s
    $ time ./clp-s c spark_archive spark-event-logs
    real 4m18.178s
    user 4m16.497s
    sys 0m1.444s
    $ time ./clp-s i spark_archive spark_ir/
    real 9m38.949s
    user 9m36.601s
    sys 0m2.148s

Cockroach: ~1.45x longer than clp-s
    $ time ./clp-s c cockroachdb_clp-s_archive cockroachdb
    real 34m42.541s
    user 33m27.539s
    sys 0m11.856s
    $ time ./clp-s i cockroachdb_archive cockroachdb_ir/
    real 50m30.246s
    user 50m20.945s
    sys 0m8.311s
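
For reference, the slowdown factors quoted above can be recomputed from the `real` times. This is a small sketch rather than anything shipped in the PR; `to_seconds`, `REAL_TIMES`, and `slowdowns` are names made up here, and the durations are copied from the runs above.

```python
import re


def to_seconds(duration):
    """Convert a `time`-style duration such as '18m4.676s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (clp-s JSON compression, IR V2 -> archive) wall-clock times per dataset,
# copied from the benchmark output in this description.
REAL_TIMES = {
    "elasticsearch": ("18m4.676s", "29m31.376s"),
    "postgresql": ("1m37.820s", "2m41.273s"),
    "spark": ("4m18.178s", "9m38.949s"),
    "cockroachdb": ("34m42.541s", "50m30.246s"),
}


def slowdowns(times=REAL_TIMES):
    """Return the IR-to-archive time divided by the clp-s time, per dataset."""
    return {
        name: to_seconds(ir_time) / to_seconds(json_time)
        for name, (json_time, ir_time) in times.items()
    }


if __name__ == "__main__":
    for name, factor in slowdowns().items():
        print(f"{name}: {factor:.2f}x")
```

The factors come out near 1.63, 1.65, 2.24, and 1.46, consistent with the rounded figures quoted for each dataset.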

Postgres Perf Breakdown

CLP-S Total: 1m 37s
- 64% in parse_line() - 1m 2s
- 18% in m_archive_writer->append_message() - 17.5s
- 5.2% JSON I/O - 5s

IRV2 -> Archive Total: 2m 41s
- 53% parse_kv_log_event() ... includes m_archive_writer->append_message() - 1m 25s
- 35% deserializing IR (equivalent to the JSON I/O) - 56.5s

Summary: Deserializing the IR adds significantly more overhead than the JSON I/O does (56.5s versus 5s). We are essentially reconstructing the information twice: once back into the format that was written out to the IR file, and then again into the archive format by walking over the IRV2 structures.
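
To put the summary in numbers, the sketch below works out how much of the Postgres slowdown the input stage accounts for. It is back-of-the-envelope arithmetic on the rounded figures in the breakdown above, not a new measurement; the dictionary keys and `io_gap_summary` are names invented for this example.

```python
# Times in seconds, copied from the Postgres perf breakdown above.
CLP_S = {"total": 97.0, "json_io": 5.0}
IRV2 = {"total": 161.0, "deserialize_ir": 56.5}


def io_gap_summary(clp_s=CLP_S, irv2=IRV2):
    """Compare IR deserialization against JSON I/O for the Postgres run."""
    total_gap = irv2["total"] - clp_s["total"]
    io_gap = irv2["deserialize_ir"] - clp_s["json_io"]
    return {
        "total_gap_s": total_gap,  # extra wall-clock time of the IR path
        "io_gap_s": io_gap,  # extra time spent reading the input
        "io_ratio": irv2["deserialize_ir"] / clp_s["json_io"],
        "io_share_of_gap": io_gap / total_gap,
    }


if __name__ == "__main__":
    for key, value in io_gap_summary().items():
        print(f"{key}: {value:.2f}")
```

On these figures, IR deserialization takes about 11x as long as JSON I/O, and the 51.5s difference covers roughly 80% of the 64s total gap.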

Summary by CodeRabbit

New Features
- Introduced new command options for JSON to IR format conversion and IR format compression in the command line interface.
- Added functionalities for compressing and generating IR data.
- Expanded the source file organization to accommodate new features and improvements.
- Enhanced serialization capabilities with new methods for managing and processing data.
- Improved command handling with additional error logging and validation for new functionalities.
- Enhanced JSON parsing capabilities with new methods for processing various data types.
Bug Fixes
- Improved command input parsing logic to validate metadata database configurations.
Documentation
- Updated command line help output to include new options and usage instructions.

Sep 21 '24 11:09 AVMatthews
