fix(file source): Fix checksum calculation

Open Ilmarii opened this issue 2 years ago • 9 comments

The checksum is now calculated only from the bytes actually read, not from the entire buffer. This also adds an automatic update procedure for checksums produced by the previous version.

Resolves: #15700

Ilmarii avatar Jan 11 '23 07:01 Ilmarii
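
The fix described above (checksum only the bytes that were read, not the whole buffer) can be sketched roughly as follows. This is an illustration, not the code from the PR: `crc64` and `fingerprint` are hypothetical names, and the ECMA-182 polynomial with a zero initial value is an assumption about which CRC-64 variant is meant.

```rust
use std::io::{self, Read};

// ASSUMPTION: CRC-64 with the ECMA-182 polynomial, init 0, no reflection.
const POLY: u64 = 0x42F0_E1EB_A9EA_3693;

/// Bitwise (slow but dependency-free) CRC-64 over `data`.
fn crc64(data: &[u8]) -> u64 {
    let mut crc: u64 = 0;
    for &byte in data {
        crc ^= (byte as u64) << 56;
        for _ in 0..8 {
            crc = if crc & (1u64 << 63) != 0 {
                (crc << 1) ^ POLY
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Hypothetical helper: fill `buf` from `reader` as far as possible, then
/// checksum only the prefix that was actually filled.
fn fingerprint<R: Read>(mut reader: R, buf: &mut [u8]) -> io::Result<u64> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..])? {
            0 => break, // end of input before the buffer was full
            n => filled += n,
        }
    }
    // Checksumming `buf` instead of `&buf[..filled]` would mix in whatever
    // bytes were already sitting in the unused tail of the buffer.
    Ok(crc64(&buf[..filled]))
}
```

With an input shorter than the buffer, `fingerprint` returns the same value as `crc64` over the input alone, regardless of what the rest of the buffer contained beforehand.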

Deploy Preview for vector-project ready!

| Name | Link |
| --- | --- |
| Latest commit | 9969f9c238ad9ade2e37587bfacde89bd91f0264 |
| Latest deploy log | https://app.netlify.com/sites/vector-project/deploys/63be6538697ed0000b8979f8 |
| Deploy Preview | https://deploy-preview-15899--vector-project.netlify.app |

netlify[bot] avatar Jan 11 '23 07:01 netlify[bot]

Deploy Preview for vrl-playground ready!

| Name | Link |
| --- | --- |
| Latest commit | 9969f9c238ad9ade2e37587bfacde89bd91f0264 |
| Latest deploy log | https://app.netlify.com/sites/vrl-playground/deploys/63be65389141400008697e6f |
| Deploy Preview | https://deploy-preview-15899--vrl-playground.netlify.app |

netlify[bot] avatar Jan 11 '23 07:01 netlify[bot]

Regression Test Results

Run ID: 413bc53b-ab3c-4b3e-abbf-a76f889ad3e7
Baseline: 1727e729487c4075c29d1ca30cda5053def52085
Comparison: 9969f9c238ad9ade2e37587bfacde89bd91f0264
Total vector CPUs: 7

Explanation

A regression test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether vector performance is changed by a pull request, and to what degree. Where appropriate, units are scaled per core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their bytes_written_per_cpu_second performance between baseline and comparison SHAs, with 90.0% confidence, OR have been detected as newly erratic. Negative values mean that the baseline is faster; positive values mean that the comparison is faster. Results that do not exhibit more than a ±5% change in mean bytes_written_per_cpu_second are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting changes are observed.

Changes in bytes_written_per_cpu_second with confidence ≥ 90.00% and |Δ mean %| ≥ 5%:

| experiment | Δ mean | Δ mean % | confidence |
| --- | --- | --- | --- |
| syslog_regex_logs2metric_ddmetrics | 263.55KiB/CPU-s | 7.42 | 100.00% |

Fine details of change detection per experiment.

| experiment | Δ mean | Δ mean % | confidence | baseline mean | baseline stdev | baseline stderr | baseline outlier % | baseline CoV | comparison mean | comparison stdev | comparison stderr | comparison outlier % | comparison CoV | erratic | declared erratic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| syslog_regex_logs2metric_ddmetrics | 263.55KiB/CPU-s | 7.42 | 100.00% | 3.47MiB/CPU-s | 389.66KiB/CPU-s | 5.03KiB/CPU-s | 0.0 | 0.109752 | 3.72MiB/CPU-s | 455.44KiB/CPU-s | 5.88KiB/CPU-s | 0.0 | 0.119415 | True | False |
| syslog_log2metric_splunk_hec_metrics | 300.18KiB/CPU-s | 3.19 | 100.00% | 9.19MiB/CPU-s | 256.99KiB/CPU-s | 3.32KiB/CPU-s | 0.0 | 0.0273 | 9.49MiB/CPU-s | 165.4KiB/CPU-s | 2.14KiB/CPU-s | 0.0 | 0.017027 | False | False |
| syslog_splunk_hec_logs | 249.73KiB/CPU-s | 2.75 | 100.00% | 8.86MiB/CPU-s | 218.54KiB/CPU-s | 2.82KiB/CPU-s | 0.0 | 0.024082 | 9.11MiB/CPU-s | 196.91KiB/CPU-s | 2.54KiB/CPU-s | 0.0 | 0.021117 | False | False |
| socket_to_socket_blackhole | 146.49KiB/CPU-s | 1.05 | 100.00% | 13.63MiB/CPU-s | 226.08KiB/CPU-s | 2.92KiB/CPU-s | 0.0 | 0.016202 | 13.77MiB/CPU-s | 105.81KiB/CPU-s | 1.37KiB/CPU-s | 0.0 | 0.007504 | False | False |
| syslog_loki | 77.26KiB/CPU-s | 0.86 | 100.00% | 8.79MiB/CPU-s | 233.0KiB/CPU-s | 3.01KiB/CPU-s | 0.0 | 0.025893 | 8.86MiB/CPU-s | 144.31KiB/CPU-s | 1.86KiB/CPU-s | 0.0 | 0.015901 | False | False |
| syslog_humio_logs | 76.29KiB/CPU-s | 0.82 | 100.00% | 9.09MiB/CPU-s | 173.82KiB/CPU-s | 2.24KiB/CPU-s | 0.0 | 0.018673 | 9.16MiB/CPU-s | 191.95KiB/CPU-s | 2.48KiB/CPU-s | 0.0 | 0.020454 | False | False |
| splunk_hec_route_s3 | 93.07KiB/CPU-s | 0.77 | 100.00% | 11.78MiB/CPU-s | 544.83KiB/CPU-s | 7.03KiB/CPU-s | 0.0 | 0.045168 | 11.87MiB/CPU-s | 513.67KiB/CPU-s | 6.63KiB/CPU-s | 0.0 | 0.042258 | False | False |
| datadog_agent_remap_datadog_logs_acks | 242.15KiB/CPU-s | 0.7 | 100.00% | 33.67MiB/CPU-s | 1.13MiB/CPU-s | 14.87KiB/CPU-s | 0.0 | 0.033426 | 33.91MiB/CPU-s | 1.03MiB/CPU-s | 13.61KiB/CPU-s | 0.0 | 0.030362 | False | False |
| http_to_http_json | 52.15KiB/CPU-s | 0.38 | 100.00% | 13.57MiB/CPU-s | 299.12KiB/CPU-s | 3.86KiB/CPU-s | 0.0 | 0.021522 | 13.62MiB/CPU-s | 211.65KiB/CPU-s | 2.73KiB/CPU-s | 0.0 | 0.015172 | False | False |
| otlp_grpc_to_blackhole | 3.52KiB/CPU-s | 0.33 | 99.99% | 1.04MiB/CPU-s | 42.8KiB/CPU-s | 565.79B/CPU-s | 0.0 | 0.040296 | 1.04MiB/CPU-s | 53.18KiB/CPU-s | 702.61B/CPU-s | 0.0 | 0.049904 | False | False |
| splunk_hec_to_splunk_hec_logs_noack | 8.05KiB/CPU-s | 0.06 | 95.27% | 13.62MiB/CPU-s | 249.92KiB/CPU-s | 3.22KiB/CPU-s | 0.0 | 0.017922 | 13.63MiB/CPU-s | 190.71KiB/CPU-s | 2.46KiB/CPU-s | 0.0 | 0.013668 | False | False |
| enterprise_http_to_http | 4.46KiB/CPU-s | 0.03 | 60.93% | 13.61MiB/CPU-s | 311.24KiB/CPU-s | 4.02KiB/CPU-s | 0.0 | 0.022325 | 13.62MiB/CPU-s | 255.56KiB/CPU-s | 3.3KiB/CPU-s | 0.0 | 0.018325 | False | False |
| splunk_hec_to_splunk_hec_logs_acks | 1.63KiB/CPU-s | 0.01 | 20.86% | 13.62MiB/CPU-s | 331.58KiB/CPU-s | 4.28KiB/CPU-s | 0.0 | 0.023777 | 13.62MiB/CPU-s | 342.41KiB/CPU-s | 4.42KiB/CPU-s | 0.0 | 0.024551 | False | False |
| fluent_elasticsearch | 547.95B/CPU-s | 0.0 | 67.82% | 45.41MiB/CPU-s | 29.99KiB/CPU-s | 392.11B/CPU-s | 0.0 | 0.000645 | 45.41MiB/CPU-s | 29.84KiB/CPU-s | 389.94B/CPU-s | 0.0 | 0.000642 | False | False |
| splunk_hec_indexer_ack_blackhole | -1.87KiB/CPU-s | -0.01 | 31.73% | 13.62MiB/CPU-s | 246.25KiB/CPU-s | 3.18KiB/CPU-s | 0.0 | 0.017658 | 13.62MiB/CPU-s | 255.16KiB/CPU-s | 3.29KiB/CPU-s | 0.0 | 0.018299 | False | False |
| file_to_blackhole | -16.37KiB/CPU-s | -0.03 | 54.34% | 54.49MiB/CPU-s | 1.08MiB/CPU-s | 14.28KiB/CPU-s | 0.0 | 0.019842 | 54.48MiB/CPU-s | 1.27MiB/CPU-s | 16.72KiB/CPU-s | 0.0 | 0.02326 | False | False |
| http_to_http_noack | -6.24KiB/CPU-s | -0.04 | 68.35% | 13.61MiB/CPU-s | 307.37KiB/CPU-s | 3.97KiB/CPU-s | 0.0 | 0.022047 | 13.61MiB/CPU-s | 372.1KiB/CPU-s | 4.8KiB/CPU-s | 0.0 | 0.026702 | False | False |
| http_to_http_acks | -34.44KiB/CPU-s | -0.63 | 49.14% | 5.31MiB/CPU-s | 2.81MiB/CPU-s | 37.07KiB/CPU-s | 0.0 | 0.527922 | 5.28MiB/CPU-s | 2.77MiB/CPU-s | 36.6KiB/CPU-s | 0.0 | 0.524621 | True | False |
| otlp_http_to_blackhole | -12.94KiB/CPU-s | -0.82 | 100.00% | 1.55MiB/CPU-s | 106.46KiB/CPU-s | 1.37KiB/CPU-s | 0.0 | 0.067179 | 1.53MiB/CPU-s | 117.85KiB/CPU-s | 1.52KiB/CPU-s | 0.0 | 0.074974 | False | False |
| http_text_to_http_json | -271.19KiB/CPU-s | -1.04 | 100.00% | 25.54MiB/CPU-s | 638.5KiB/CPU-s | 8.24KiB/CPU-s | 0.0 | 0.02441 | 25.28MiB/CPU-s | 562.64KiB/CPU-s | 7.26KiB/CPU-s | 0.0 | 0.021735 | False | False |
| datadog_agent_remap_datadog_logs | -426.22KiB/CPU-s | -1.23 | 100.00% | 33.94MiB/CPU-s | 1.39MiB/CPU-s | 18.37KiB/CPU-s | 0.0 | 0.040968 | 33.52MiB/CPU-s | 1.45MiB/CPU-s | 19.2KiB/CPU-s | 0.0 | 0.043362 | False | False |
| datadog_agent_remap_blackhole | -636.49KiB/CPU-s | -2.0 | 100.00% | 31.02MiB/CPU-s | 1.0MiB/CPU-s | 13.24KiB/CPU-s | 0.0 | 0.032282 | 30.4MiB/CPU-s | 1.4MiB/CPU-s | 18.5KiB/CPU-s | 0.0 | 0.046042 | False | False |
| syslog_log2metric_humio_metrics | -165.0KiB/CPU-s | -2.63 | 100.00% | 6.13MiB/CPU-s | 171.1KiB/CPU-s | 2.21KiB/CPU-s | 0.0 | 0.027258 | 5.97MiB/CPU-s | 265.52KiB/CPU-s | 3.43KiB/CPU-s | 0.0 | 0.043443 | False | False |
| datadog_agent_remap_blackhole_acks | -1.14MiB/CPU-s | -3.59 | 100.00% | 31.77MiB/CPU-s | 519.56KiB/CPU-s | 6.71KiB/CPU-s | 0.0 | 0.015968 | 30.63MiB/CPU-s | 762.63KiB/CPU-s | 9.85KiB/CPU-s | 0.0 | 0.024311 | False | False |

github-actions[bot] avatar Jan 11 '23 08:01 github-actions[bot]
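
The selection rule stated in the regression report's explanation (confidence of at least 90%, more than a ±5% change in the mean, erratic when the coefficient of variation exceeds 0.1) might be expressed as in the sketch below. The struct and function names are hypothetical and this is not the actual tooling behind the report. In particular, reading "newly erratic" as "erratic in the comparison run but not in the baseline" is an assumption.

```rust
/// Only the fields of a detail-table row that the rule needs.
struct Experiment {
    delta_mean_pct: f64, // the "Δ mean %" column
    confidence: f64,     // as a fraction, 1.0 == 100%
    baseline_cov: f64,   // coefficient of variation, baseline run
    comparison_cov: f64, // coefficient of variation, comparison run
}

const ERRATIC_COV: f64 = 0.1;
const MIN_CONFIDENCE: f64 = 0.90;
const MIN_ABS_DELTA_PCT: f64 = 5.0;

/// An experiment is erratic if its coefficient of variation exceeds 0.1.
fn is_erratic(cov: f64) -> bool {
    cov > ERRATIC_COV
}

/// Statistically significant and larger than a ±5% change in the mean.
fn is_significant_change(e: &Experiment) -> bool {
    e.confidence >= MIN_CONFIDENCE && e.delta_mean_pct.abs() > MIN_ABS_DELTA_PCT
}

/// ASSUMPTION: "newly erratic" means the comparison is erratic while the
/// baseline was not. The report does not spell this out.
fn is_newly_erratic(e: &Experiment) -> bool {
    is_erratic(e.comparison_cov) && !is_erratic(e.baseline_cov)
}

/// Whether a row would appear in the abbreviated table.
fn is_listed(e: &Experiment) -> bool {
    is_significant_change(e) || is_newly_erratic(e)
}
```

Under these assumptions, the syslog_regex_logs2metric_ddmetrics row (Δ mean % 7.42, confidence 100.00%) is listed, while http_to_http_acks (Δ mean % -0.63, confidence 49.14%, erratic in both runs) is not, which agrees with the abbreviated table in the report.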

Thanks for the contribution @Ilmarii - it struck us that this could also be a good opportunity to switch to crc32fast to improve performance here as well (while we're doing the migration of what we're checksumming).

Is that something you're interested in tackling, and if not do you mind if I/we push some commits to this PR to introduce that as well?

spencergilbert avatar Jan 12 '23 18:01 spencergilbert

Hi! Yes, I think I can integrate crc32fast here.

Ilmarii avatar Jan 13 '23 08:01 Ilmarii

> Hi! Yes, I think I can integrate crc32fast here.

Awesome, thanks so much! Please let us know if you need a hand or need to hand over the PR for us to finish.

spencergilbert avatar Jan 13 '23 13:01 spencergilbert

Hi! I looked at the crate and found that it only supports CRC32, i.e. the result is 32-bit. Since CRC64, which has a 64-bit result, is currently used, this replacement will increase the probability of collisions. @spencergilbert Please let me know whether you are aware of this and whether it is OK.

Ilmarii avatar Jan 13 '23 17:01 Ilmarii
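
To put a rough number on the collision concern raised above, the standard birthday approximation can be used. The sketch below assumes checksums behave like uniformly random values, which a CRC only approximates, and `collision_probability` is a hypothetical helper, not part of any crate mentioned in this thread.

```rust
/// Approximate probability that at least two of `n` uniformly random
/// `bits`-bit values are equal: 1 - exp(-n(n-1) / 2^(bits+1)).
fn collision_probability(n: u64, bits: u32) -> f64 {
    let n = n as f64;
    let exponent = n * (n - 1.0) / 2f64.powi(bits as i32 + 1);
    // exp_m1 keeps precision when the probability is tiny.
    -(-exponent).exp_m1()
}
```

For example, with about 77,000 distinct fingerprints the estimate is close to 50% for a 32-bit checksum, and on the order of 1e-10 for a 64-bit one.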

@Ilmarii Thinking about it a little more, given what you brought up.... I think we can leave things as-is for now. Looking closer at all of this code, I realized we only use the checksum/fingerprint for identifying a file... instead of identifying it by path + fingerprint.

If we also included the filepath, I think using CRC32 would be totally fine, but without it... it definitely doesn't feel great to make the change.

I appreciate you pointing out that fact. We'll review the code as-is.

tobz avatar Jan 13 '23 18:01 tobz
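
The distinction drawn above between identifying a file by its checksum alone and by path plus checksum can be illustrated with a small sketch. The `Observed` type and both helper functions are hypothetical and do not reflect how the file source is actually structured.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Hypothetical record: where a file lives and the checksum of its
/// leading bytes.
#[derive(Debug, Clone)]
struct Observed {
    path: PathBuf,
    checksum: u32,
}

/// Number of files considered distinct when only the checksum is the identity.
fn distinct_by_checksum(files: &[Observed]) -> usize {
    files.iter().map(|f| f.checksum).collect::<HashSet<_>>().len()
}

/// Number of files considered distinct when (path, checksum) is the identity.
fn distinct_by_path_and_checksum(files: &[Observed]) -> usize {
    files
        .iter()
        .map(|f| (&f.path, f.checksum))
        .collect::<HashSet<_>>()
        .len()
}
```

Two files at different paths whose checksums happen to collide count as one file under the first scheme and as two under the second. The flip side, in this sketch, is that a file that keeps its content but changes its path counts as a new file under the second scheme.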

👍 thanks y'all - I'll try to give this a review before the end of the day today, Tuesday at the latest.

spencergilbert avatar Jan 13 '23 19:01 spencergilbert