JSON tokenizer memory optimizations
Description
The full pushdown automaton that tokenizes the input JSON string, as well as the bracket-brace FST, over-estimates the total buffer size required for the translated output and indices. This PR splits the transduce call for each FST into two invocations: the first invocation estimates the sizes of the translated buffer and the translated indices, and the second performs the actual DFA run.
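To make the pattern concrete, below is a minimal sketch of the same count-then-materialize idea in Thrust. It is not cudf's FST API: `is_token` is a hypothetical stand-in for the FST's decision of whether an input symbol emits output, but the two-pass structure mirrors what this PR does.

```cpp
// Illustrative only: a generic two-pass "size, then materialize" pattern in
// Thrust, standing in for the two FST transduce invocations. is_token is a
// hypothetical stand-in for "does this input symbol emit a token?".
#include <thrust/copy.h>
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

#include <string>

struct is_token {
  __host__ __device__ bool operator()(char c) const
  {
    return c == '{' || c == '}' || c == '[' || c == ']';
  }
};

int main()
{
  std::string const json = R"({"a":[1,2,{"b":3}]})";
  thrust::device_vector<char> d_in(json.begin(), json.end());

  // Pass 1: traverse the input only to compute the exact output size, instead
  // of over-allocating for the worst case (one output symbol per input byte).
  auto const out_size =
    thrust::count_if(thrust::device, d_in.begin(), d_in.end(), is_token{});

  // Allocate exactly out_size elements; pass 2 traverses again and writes.
  thrust::device_vector<char> d_out(out_size);
  thrust::copy_if(thrust::device, d_in.begin(), d_in.end(), d_out.begin(), is_token{});

  return 0;
}
```

The trade-off is inherent to the pattern: the input is traversed twice, so the reduced allocation is paid for with redundant computation, which is the concern raised in the review below.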
Checklist
- [X] I am familiar with the Contributing Guidelines.
- [X] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
Benchmark used for the profiling study: https://github.com/karthikeyann/cudf/blob/enh-profile_memusage_json/wm_benchmark.py
Profiles before and after optimization:
We see that peak memory usage comes down from 20.8 GiB to 10.3 GiB, and the runtime of `get_token_stream` also drops from 1.028 s to 825.578 ms.
Debugged further: This issue is unrelated to this PR.
The allocation size goes negative because an int32_t value greater than 2**31 wraps around to a negative number.
json_in.size() = 2167967896 is more than 2 GiB (2.019 GiB).
We should have safeguards to prevent inputs larger than 2 GiB from being passed to `read_json`, given the int32_t limitations in cub algorithms; a sketch of such a guard follows below.
Created issue https://github.com/rapidsai/cudf/issues/17017.
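As a rough illustration, a guard along these lines could sit at the `read_json` entry point. This is an assumption, not existing cudf code; the real implementation would likely use cudf's own error-checking conventions rather than a plain exception.

```cpp
// Hypothetical safeguard, not current cudf code: reject inputs whose byte
// count exceeds what the int32_t offsets inside cub algorithms can represent.
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

void check_json_input_size(std::size_t input_bytes)
{
  constexpr auto max_bytes =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  if (input_bytes > max_bytes) {
    throw std::invalid_argument(
      "read_json: input exceeds 2 GiB; int32_t offsets used by cub would overflow");
  }
}
```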
PR #17057 has been opened to resolve this overflow error in the benchmarks. To summarize the fix: batching is enabled only for JSON Lines inputs, not regular JSON, so running the benchmark on regular JSON inputs larger than 2 GiB triggers overflow bugs in the parser.
The changes look good to me. My one concern is the potential performance downside of repeatedly running the FSTs and redundantly computing some values, such as the state-transition vectors and write offsets, especially in scenarios where the cost of the redundant computation is not compensated by the reduced allocations this PR achieves.
For the end-to-end runs on a V100 for `JSON_READER_NVBENCH.nested_json_gpu_parser_depth`, I get the following performance numbers:
| depth | string_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|---------------|------------|-------------|------------|-------------|------------|---------|----------|
| 2^1 | 2^20 | 2.848 ms | 10.80% | 2.964 ms | 10.46% | 115.818 us | 4.07% | PASS |
| 2^2 | 2^20 | 2.817 ms | 8.62% | 2.952 ms | 8.29% | 135.483 us | 4.81% | PASS |
| 2^3 | 2^20 | 8.325 ms | 4.53% | 8.367 ms | 4.19% | 41.742 us | 0.50% | PASS |
| 2^4 | 2^20 | 10.352 ms | 3.47% | 10.475 ms | 3.89% | 122.863 us | 1.19% | PASS |
| 2^1 | 2^22 | 3.707 ms | 4.67% | 3.977 ms | 3.27% | 269.364 us | 7.27% | FAIL |
| 2^2 | 2^22 | 3.694 ms | 4.71% | 3.982 ms | 4.77% | 288.939 us | 7.82% | FAIL |
| 2^3 | 2^22 | 9.137 ms | 2.21% | 9.411 ms | 2.80% | 273.779 us | 3.00% | FAIL |
| 2^4 | 2^22 | 11.413 ms | 2.87% | 11.683 ms | 2.62% | 269.444 us | 2.36% | PASS |
| 2^1 | 2^24 | 7.800 ms | 2.31% | 8.490 ms | 0.46% | 689.604 us | 8.84% | FAIL |
| 2^2 | 2^24 | 7.805 ms | 2.38% | 8.512 ms | 2.10% | 707.728 us | 9.07% | FAIL |
| 2^3 | 2^24 | 12.846 ms | 1.73% | 13.629 ms | 1.94% | 782.743 us | 6.09% | FAIL |
| 2^4 | 2^24 | 16.278 ms | 1.88% | 17.111 ms | 1.46% | 832.984 us | 5.12% | FAIL |
| 2^1 | 2^26 | 23.387 ms | 0.31% | 25.726 ms | 0.33% | 2.339 ms | 10.00% | FAIL |
| 2^2 | 2^26 | 23.339 ms | 0.41% | 25.718 ms | 0.20% | 2.380 ms | 10.20% | FAIL |
| 2^3 | 2^26 | 28.909 ms | 0.89% | 31.439 ms | 1.14% | 2.530 ms | 8.75% | FAIL |
| 2^4 | 2^26 | 37.511 ms | 1.30% | 39.965 ms | 0.45% | 2.454 ms | 6.54% | FAIL |
| 2^1 | 2^28 | 85.176 ms | 0.08% | 94.190 ms | 0.09% | 9.014 ms | 10.58% | FAIL |
| 2^2 | 2^28 | 85.214 ms | 0.50% | 94.241 ms | 0.06% | 9.027 ms | 10.59% | FAIL |
| 2^3 | 2^28 | 110.835 ms | 0.22% | 119.999 ms | 0.49% | 9.164 ms | 8.27% | FAIL |
| 2^4 | 2^28 | 142.120 ms | 0.25% | 151.918 ms | 0.41% | 9.798 ms | 6.89% | FAIL |
So, for larger inputs, we see a performance regression in the ballpark of 10% on the end-to-end timings. I just want to make sure that lowering the memory footprint is pressing enough for us to take this performance penalty for the time being?
@elstehle I created issue https://github.com/rapidsai/cudf/issues/17114 for computing the total output size of the FST without redundant computation, and assigned it to you.
From offline discussions, the current plan is to accept the end-to-end performance regression in favor of the reduction in memory footprint. The next step is to introduce a multi-stage FST implementation (#17114) that separates the transduced buffer size estimation from the final DFA run, removing the performance downside of redundantly running the FSTs.
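For context, the two-call convention adopted here mirrors the one cub itself uses for its device-wide algorithms, where a first call with a null temporary-storage pointer only reports the required size. A minimal, self-contained example of that convention:

```cpp
// cub's standard two-phase call pattern: call once with d_temp_storage ==
// nullptr to query the required size, allocate, then call again to do the work.
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

int main()
{
  constexpr int num_items = 4;
  int const h_in[num_items] = {1, 2, 3, 4};
  int *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(h_in));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  // Call 1: with a null buffer, cub only writes the required temp-storage size.
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

  // Allocate exactly that much; call 2 performs the scan.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

  cudaFree(d_temp_storage);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```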
/merge