cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Try to read timezone file from `$CONDA_PREFIX` dir before trying the system dir

Open vuule opened this issue 1 year ago • 8 comments

Description

Closes https://github.com/rapidsai/cudf/issues/12926

Modified the logic when opening TZ files to try multiple paths before giving up. Without user-specified path, we try $CONDA_PREFIX/share/zoneinfo and then /usr/share/zoneinfo/.

Testing is limited to consistency with previous logic. We don't have tests for the $CONDA_PREFIX path.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

vuule avatar Mar 02 '24 02:03 vuule

I'd love to know if this has any impact on #7314. 🤔

bdice avatar Mar 04 '24 04:03 bdice

@vuule is this PR waiting on anything? Just noticed it from #12926.

For testing, one (somewhat drastic) option would be to run a separate test where you delete the system timezone file and then try running the tests. You'd want to isolate that test run so that you don't bork anything else that actually relies on that file existing, though, so it may not be worth the effort.

vyasr avatar May 17 '24 18:05 vyasr

@vuule is this PR waiting on anything? Just noticed it from #12926.

Once I saw the CI failing with this change I got really discouraged to mess with this. I don't know what damage I can do on random systems by changing the default directory from which we read the timezone info.

vuule avatar May 17 '24 18:05 vuule

This is... very weird. You seem to have gotten extraordinarily unlucky because all of the failures that I can see appear to be either unrelated issues that have since been fixed, or connectivity issues. The number of network issues is the especially weird part. Let's try again...

vyasr avatar May 20 '24 22:05 vyasr

This is... very weird. You seem to have gotten extraordinarily unlucky because all of the failures that I can see appear to be either unrelated issues that have since been fixed, or connectivity issues. The number of network issues is the especially weird part. Let's try again...

Unless I'm misremembering, there were also ORC test failures with the change. Let's see :)

vuule avatar May 20 '24 22:05 vuule

Ah OK I looked at like 4 different runs and didn't see anything that looked real, but there were multiple merges and reruns that I could have missed it. So we'll see this time :)

vyasr avatar May 20 '24 23:05 vyasr

There it is FAILED tests/test_orc.py::test_orc_reader_apache_negative_timestamp - AssertionError: DataFrame.iloc[:, 0] (column name="v") are different

vuule avatar May 21 '24 18:05 vuule

From reading the code it looks like this PR changes things to make the conda-installed files the highest priority. Perhaps what we want is to use the conda files as backup if there isn't a system installation? It looks like you chose this order because of me, so sorry about that! I was going off of how we normally deal with conda, but maybe for timezones this isn't safe.

@karthikeyann @bdice WDYT? I've spent a lot less time on the tz work to know if this is a safe thing to do. If it's not, then we should just close the associated issue and this PR.

vyasr avatar May 21 '24 18:05 vyasr

The most-correct answer is probably to respect $TZDIR if it is set, according to https://github.com/ogham/exa/issues/856.

If we respected $TZDIR, it would also give the conda package tzdata a way to say "use this timezone data source!" without us having to explicitly reference $CONDA_PREFIX. However, this issue for tzdata would need to be resolved first: https://github.com/conda-forge/tzdata-feedstock/issues/9

Would it be helpful to adjust this error message:

RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/orc/timezone.cpp:138: Failed to open the timezone file.

... to say Failed to open the timezone file. Is tzdata installed on the system?

Given the complexity of this topic and the relatively low number of issue reports we've seen, I'm inclined to say we should continue to rely on only system-installed timezone information. I would vote to close this PR and the corresponding issue. We could revisit this in this future if it is misaligned with how other conda packages handle things.

bdice avatar Jun 03 '24 23:06 bdice

@bdice should I modify the PR to try $TZDIR (if set) before the system dir?

vuule avatar Jun 03 '24 23:06 vuule

We could do that. If we do, I would want to make sure of two things:

  • Setting $TZDIR to $CONDA_PREFIX/share/zoneinfo works as expected and uses data from the tzinfo package
  • Adding $TZDIR support doesn't make libcudf have subtle/unexpected behavior changes compared to pandas or Spark -- i.e. we should have some sense of how other libraries that we work with obey or ignore $TZDIR

bdice avatar Jun 04 '24 00:06 bdice

Closing based on the above comment

Given the complexity of this topic and the relatively low number of issue reports we've seen, I'm inclined to say we should continue to rely on only system-installed timezone information. I would vote to close this PR and the corresponding issue. We could revisit this in this future if it is misaligned with how other conda packages handle things.

We can revisit if necessary in the future.

vyasr avatar Sep 26 '24 23:09 vyasr