Gang Wu

Results 304 comments of Gang Wu

> @wgtmac thanks for the review! I will coordinate with #1014 and address the comments My PR has been merged. Feel free to rebase it and let me review once...

@JFinis Do you have a plan to revive this?

The new `IEEE754TotalOrder` looks elegant to me, though a single NaN value may still ruin the page index. Another challenge is how parquet-mr can efficiently implement this sort order.

Could we add `NOLINT` to those public/protected variables?

What about this? - Use `_` suffix for private member variable. - Do not use `_` suffix for protected and public member variable. - Avoid any `_` prefix and suffix...

We could leverage https://clang.llvm.org/extra/clang-tidy/checks/readability/identifier-naming.html for variable naming. Apache Arrow used cpplint approach which we cannot borrow: https://github.com/apache/arrow/blob/main/cpp/build-support/cpplint.py

Thanks for reporting the issue! @vuule The order of data streams are **NOT FIXED** meaning that: - In a direct-encoded string columns, `DATA stream` can be placed **BEFORE** or **AFTER**...

> Thank you @wgtmac , I will add these to #1465. Thanks @deshanxiao ! Could you also verify that the java implementation matches this behavior?

Yes, the order is fixed. This is implemented in the `recordPosition` call as below. In the `TreeWriterBase.java`, positions of present stream are recorded first. https://github.com/apache/orc/blob/792c3f820d0b7a64b27c9dc4c390443386e6baf0/java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java#L369-L377 And then in the `StringBaseTreeWriter.java`,...

> Thank you for the clarification @wgtmac ! What about other cases when a column has multiple streams? Is the order of index streams always the same as in the...