Adds location support with `LazyValue` API
Issue #909
Description of changes:
This PR works on adding support for location metadata in LazyValue API.
List of changes:
- Modifies top level matching logic in buffer to preserve
rowandprev_newline_offsetcount. - Adds a way to store row and
prev_newline_offsetvalue in theRawReaderState. This helps in preserving saved state whenever resume is used on the streaming reader. - Since streaming reader uses readers for each encoding get these values for encoding
Readertypes. - Add location (usize, usize) to
LazyValue - Adds unit tests for location support
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Codecov Report
Attention: Patch coverage is 85.54913% with 25 lines in your changes missing coverage. Please review.
Project coverage is 77.91%. Comparing base (
edf2321) to head (671cd75). Report is 17 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #922 +/- ##
==========================================
+ Coverage 77.87% 77.91% +0.03%
==========================================
Files 138 138
Lines 34818 34979 +161
Branches 34818 34979 +161
==========================================
+ Hits 27115 27253 +138
- Misses 5701 5724 +23
Partials 2002 2002
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Benchmarks by disabling and enabling location metadata for LazyValue
LazyValue support and with source-location feature disabled:
Gnuplot not found, using plotters backend
Text Ion 1.0 data size: 3690000 bytes
Bin Ion 1.0 data size: 2670089 bytes
text 1.0/scan all time: [28.825 ms 28.988 ms 29.147 ms]
change: [+58.126% +58.957% +59.878%] (p = 0.00 < 0.05)
Performance has regressed.
text 1.0/read all time: [26.999 ms 27.078 ms 27.163 ms]
change: [+2.0400% +3.2512% +4.1874%] (p = 0.00 < 0.05)
Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
14 (14.00%) high mild
5 (5.00%) high severe
text 1.0/read 'format' field
time: [22.133 ms 22.194 ms 22.261 ms]
change: [+2.9679% +3.3273% +3.6755%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
=== v1.1: maximally compact ===
Binary data size: 620000 bytes
Text data size: 890000 bytes
maximally compact text 1.1/scan all
time: [8.1125 ms 8.1636 ms 8.2191 ms]
change: [-12.529% -11.686% -10.873%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
maximally compact text 1.1/read all
time: [19.497 ms 19.558 ms 19.620 ms]
change: [-7.2783% -6.8479% -6.4477%] (p = 0.00 < 0.05)
Performance has improved.
maximally compact text 1.1/read 'format' field
time: [9.4405 ms 9.4808 ms 9.5267 ms]
change: [-11.395% -10.733% -10.121%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
=== v1.1: moderately compact ===
Binary data size: 1120000 bytes
Text data size: 1360000 bytes
moderately compact text 1.1/scan all
time: [8.0763 ms 8.0853 ms 8.0948 ms]
change: [-13.695% -13.311% -12.976%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
moderately compact text 1.1/read all
time: [16.860 ms 17.047 ms 17.357 ms]
change: [-10.334% -9.0061% -7.2801%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe
moderately compact text 1.1/read 'format' field
time: [9.7043 ms 9.7375 ms 9.7724 ms]
change: [-9.3995% -8.8850% -8.4146%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
=== v1.1: moderately compact w/length-prefixed top level ===
Binary data size: 1140000 bytes
Text data size: 1360000 bytes
moderately compact w/length-prefixed top level text 1.1/scan all
time: [8.1667 ms 8.1939 ms 8.2284 ms]
change: [-14.549% -13.700% -12.941%] (p = 0.00 < 0.05)
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
7 (7.00%) high mild
9 (9.00%) high severe
moderately compact w/length-prefixed top level text 1.1/read all
time: [16.795 ms 16.833 ms 16.872 ms]
change: [-9.2602% -8.7515% -8.3216%] (p = 0.00 < 0.05)
Performance has improved.
moderately compact w/length-prefixed top level text 1.1/read 'format' field
time: [9.5801 ms 9.5935 ms 9.6092 ms]
change: [-11.235% -10.807% -10.464%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
1 (1.00%) high mild
3 (3.00%) high severe
LazyValue support and with source-location feature enabled:
Gnuplot not found, using plotters backend
Text Ion 1.0 data size: 3690000 bytes
Bin Ion 1.0 data size: 2670089 bytes
text 1.0/scan all time: [24.931 ms 25.045 ms 25.183 ms]
change: [-14.226% -13.603% -12.917%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
text 1.0/read all time: [31.497 ms 31.548 ms 31.598 ms]
change: [+16.092% +16.505% +16.896%] (p = 0.00 < 0.05)
Performance has regressed.
text 1.0/read 'format' field
time: [26.873 ms 26.964 ms 27.076 ms]
change: [+20.944% +21.496% +22.129%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
1 (1.00%) low mild
7 (7.00%) high mild
8 (8.00%) high severe
=== v1.1: maximally compact ===
Binary data size: 620000 bytes
Text data size: 890000 bytes
maximally compact text 1.1/scan all
time: [9.1104 ms 9.1191 ms 9.1289 ms]
change: [+10.946% +11.705% +12.418%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
maximally compact text 1.1/read all
time: [20.958 ms 20.993 ms 21.028 ms]
change: [+6.9559% +7.3383% +7.7246%] (p = 0.00 < 0.05)
Performance has regressed.
maximally compact text 1.1/read 'format' field
time: [10.495 ms 10.529 ms 10.573 ms]
change: [+10.389% +11.061% +11.729%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) high mild
9 (9.00%) high severe
=== v1.1: moderately compact ===
Binary data size: 1120000 bytes
Text data size: 1360000 bytes
moderately compact text 1.1/scan all
time: [9.2199 ms 9.2506 ms 9.2918 ms]
change: [+13.998% +14.412% +14.860%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) high mild
10 (10.00%) high severe
moderately compact text 1.1/read all
time: [18.236 ms 18.269 ms 18.305 ms]
change: [+5.2620% +7.1710% +8.3892%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
moderately compact text 1.1/read 'format' field
time: [10.620 ms 10.632 ms 10.645 ms]
change: [+8.7749% +9.1862% +9.5837%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
=== v1.1: moderately compact w/length-prefixed top level ===
Binary data size: 1140000 bytes
Text data size: 1360000 bytes
moderately compact w/length-prefixed top level text 1.1/scan all
time: [9.2430 ms 9.2693 ms 9.2992 ms]
change: [+12.563% +13.125% +13.659%] (p = 0.00 < 0.05)
Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
moderately compact w/length-prefixed top level text 1.1/read all
time: [18.280 ms 18.322 ms 18.366 ms]
change: [+8.4830% +8.8496% +9.1932%] (p = 0.00 < 0.05)
Performance has regressed.
moderately compact w/length-prefixed top level text 1.1/read 'format' field
time: [10.673 ms 10.699 ms 10.725 ms]
change: [+11.189% +11.524% +11.860%] (p = 0.00 < 0.05)
Performance has regressed.
Above results show 16% regression for this implementation to support location in LazyValue.
As I've been reading the code, I think that we could simplify things by storing the data as something like this:
#[cfg(not(feature = "source-location"))] pub type LocationInternalState = PhantomData<()>; #[cfg(feature = "source-location")] pub type LocationInternalState = Option<(NonZero<usize>, usize)>;Then, we don't need to have
#[cfg(feature = "source-location")]all over the codebase. We can have these fields always present with no performance penalty because when the feature is not enabled, it's a zero-sized type.
I think having it with a LocationInternbalState would give the same performance results as this current implementation. But yes, for code readability this suggested solution would be better.
@desaikd This PR was replaced by #941, right? Can we close this?
Yes, this has been replaced. Closing it with completion of #941