
feat: replace v1 with v2

Open agoose77 opened this issue 2 years ago • 25 comments

This is just me playing around with removing v1. Should you want to do this yourself, no problem!

  • [x] Replace v1 tests with v2.
  • [x] Fix up setup.cfg linting rules.
  • [x] Move to malloc and free instead of Awkward allocators.
  • [x] Remove preliminary set of unused kernels. Any unused specialisations are not yet removed.
  • [x] Replace use of contents in Forth with NumPy arrays or import v2 objects
  • [x] Remove kernel dispatch, replace kernel::malloc with malloc or new
  • [x] Support datetime.datetime in from_iter
  • [x] Move array_deleter to include/awkward/util.h

Fixes #1701

agoose77 avatar Sep 09 '22 13:09 agoose77

Removing the v1 Python layer is fairly easy. Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets, and let the Python binding layer build the Array object from C++ via pybind Python execution. I haven't yet decided whether I like the idea of the ForthMachine returning pybind11's py::array outside of the Python layer, so I'm leaning towards removing the toNumpyArray interface and doing that at the bindings level.
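
For illustration, a minimal sketch of what that bindings-level construction could look like, assuming the machine hands back two flat buffers (the names and values below are made up, not the actual API):

import numpy as np
import awkward as ak

# Hypothetical flat buffers returned from C++: per-word offsets into the
# concatenated bytecode buffer, plus the bytecodes themselves.
offsets = np.array([0, 3, 7], dtype=np.int64)
bytecodes = np.array([1, 2, 3, 4, 5, 6, 7], dtype=np.int32)

# Rebuild the jagged structure with v2 layout classes at the Python level.
layout = ak.contents.ListOffsetArray(
    ak.index.Index64(offsets),
    ak.contents.NumpyArray(bytecodes),
)
jagged_bytecodes = ak.Array(layout)  # [[1, 2, 3], [4, 5, 6, 7]]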

agoose77 avatar Sep 10 '22 16:09 agoose77

Need to think about the C++ layer; the Forth machine returns a ContentPtr in a number of places. Some of these are trivial Numpy arrays, but the bytecodes getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets

Knowing that the split is coming up soon, we've been using the ForthMachine in ways that don't depend on the return values being ContentPtr and IndexPtr: we've been immediately casting them as NumPy arrays.

The bytecodes method/property is not used much. I think nothing would break if it were removed entirely, but it's probably better to replace it with just a 2-tuple of offsets and content, as you describe. It exists as a debugging tool (though the decompile one, which prints out source code reconstructed from the bytecodes, is more useful), and was constructed as a jagged array just because it occurred to me that one could do that. Unlike imperative bytecode, Forth bytecode does not have arbitrary jumps; it happens to be a structured language in the sense of not having GOTOs, a fact that was emphasized by Forth enthusiasts in the 80's, but nowadays nearly all languages are structured at the human-readable source code level. The only thing that stands out for Forth is that its bytecode can also be blocked out as subroutines, a jagged array.[^1]

Anyway, the point is that we've been using the ContentPtr/IndexPtr-dependent parts of ForthMachine in a minimal way so that they can be replaced by raw buffers (py::buffer_info in pybind11) without issues. Along the same lines, we will be removing the custom growable-buffer of ForthOutputBuffer to use Manasvi's new GrowableBuffer (and reap the benefits of non-contiguous filling), but that is an entirely separate step that can wait until the v1 pruning is entirely done. (ForthOutputBuffer doesn't depend on the old GrowableBuffer—that one can be safely deleted.)

[^1]: Although somebody might point out that this is now true of other bytecodes that I don't know about. Maybe JVM? I'm pretty sure Python bytecode has unconditional jumps.

jpivarski avatar Sep 10 '22 19:09 jpivarski

Phew.

Now v1-less Awkward compiles and passes most tests, having removed the array and type machinery.

I've removed the low-hanging fruit on the C++ side, leaving a much-reduced version of Index.cpp. My rationale is that older C++ standards don't have std::span yet, so either we start using pybind features inside of AwkwardForth, or we keep this small abstraction.

New bugs:

from_iter doesn't currently handle native Python datetime objects, so many tests fail. We should fix this.
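
For concreteness, this is the kind of input that currently fails (the eventual dtype/unit shown in the comment is an assumption on my part):

import datetime
import awkward as ak

# Native datetime objects should become datetime64 values, roughly matching
# what np.asarray does for a list of naive datetimes.
array = ak.from_iter([datetime.datetime(2022, 9, 13, 12, 0, 0)])
print(array.type)  # expected: something like "1 * datetime64[us]"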

agoose77 avatar Sep 13 '22 13:09 agoose77

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

jpivarski avatar Sep 13 '22 13:09 jpivarski

We should be able to remove Index from ForthMachine. I'll take a look at it later, but that's a separable problem from everything else.

I think it depends on whether we want to be passing around unbounded arrays. Actually, this is barely a point at the moment, because the Index impl doesn't do anything safety-wise (i.e. provide an iterator).

IIRC you never get an index directly; rather, you get the output and call toIndexXX on it, so we can just use the metadata from the output.

Otherwise, I still need to remove the unused kernels...

agoose77 avatar Sep 13 '22 15:09 agoose77

In the way AwkwardForth is used, the outputs of toNumpyArray and toIndex are immediately wrapped with np.asarray, so we only need one way to get a buffer out. We can even change the name (breaking Uproot), since the Uproot v5 and Awkward v2 releases will be coordinated. A name change won't be very disruptive.

Otherwise, I still need to remove the unused kernels...

Perhaps removing unused kernels should be a separate PR. We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

jpivarski avatar Sep 13 '22 15:09 jpivarski

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

jpivarski avatar Sep 13 '22 15:09 jpivarski

wip: remove unused kernels

It's okay. But I think we can remove more, and fishing for them can be a different PR.

Cool. I found these by looking directly with this Xonsh snippet:

import yaml
import re

kernel_spec = yaml.safe_load(p"/home/angus/Git/awkward/kernel-specification.yml".read_text())['kernels']
in_use_specs = set($(rg r"awkward_\w+" src/awkward/contents src/awkward/_reducers.py -oNI).splitlines())
name_to_specs = {g['name']: [s['name'] for s in g['specializations']] for g in kernel_spec}
spec_to_name = {v: k for k, s in name_to_specs.items() for v in s}

# For cleaning up the spec file
in_use_names = (name_to_specs.keys() & in_use_specs)

# For removing kernels: delete any cpu-kernel source that contains no in-use specialization
for f in [f for f in pg`src/cpu-kernels/awkward*.cpp` if not any(s in f.read_text() for s in in_use_specs)]:
    f.unlink()

agoose77 avatar Sep 13 '22 15:09 agoose77

the output of toNumpyArray and toIndex are used

Yeah, now that all the removed code is gone, it's easier to see that we don't need an object that holds both length and ptr (besides the output itself). The main "issue" is internally within the Forth machine, rather than the bindings. It should be possible to just access the (shared) pointer directly within the Forth routines, and wrap it in the bindings.

agoose77 avatar Sep 13 '22 15:09 agoose77

We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to arange or something.

Agreed, there are several that I know of (not sure if they're still carried through in this PR, haven't checked) ;)

agoose77 avatar Sep 13 '22 15:09 agoose77

OK, this now removes Index as well, so it's nearly all the way there. I've rebased my commits where possible to give a clean-ish history. I can't guarantee that every commit builds a wheel that passes the test suite (they certainly don't when I start moving files), but after the large-scale refactoring commits, most revisions then compile.

agoose77 avatar Sep 13 '22 16:09 agoose77

I haven't fully pulled out the kernel dispatch mechanism yet. I should note that I'm not intimately familiar with this code, but my current understanding is that our kernel handling has all been superseded by the nplike dispatch mechanism.

agoose77 avatar Sep 13 '22 16:09 agoose77

Any kernel-dispatch in C++ is old. We have an entirely new kernel-dispatch system in Python, using ctypes.

Anything you find in C++ about startup functions, particularly for setting up awkward-cuda-kernels, can be removed. That's all Python now as well.

Another thing that can go is the dlpack git submodule. That was all for supporting CUDA arrays in C++, which is now handled by CuPy in nplike.

jpivarski avatar Sep 13 '22 16:09 jpivarski

Another thing that can go is the dlpack git submodule

It's already gone!

agoose77 avatar Sep 13 '22 19:09 agoose77

OK, now we're at a point where there's not much else (anything?) that needs to be removed.

The main changes are now described in this PR's description.

agoose77 avatar Sep 13 '22 21:09 agoose77

I've added some rudimentary time handling. I originally tried employing std::chrono, but I clearly don't fully understand how C++ handles times compared to Python:

If I calculate the local epoch using:

+std::chrono::system_clock::time_point local_epoch() {
+    std::tm tm = {
+       /* .tm_sec  = */ 0,
+       /* .tm_min  = */ 0,
+       /* .tm_hour = */ 0,
+       /* .tm_mday = */ 1,
+       /* .tm_mon  = */ 1 - 1,
+       /* .tm_year = */ 1970 - 1900,
+    };
+    // Use local time zone DST
+    tm.tm_isdst = -1;
+    return std::chrono::system_clock::from_time_t(std::mktime(&tm));
+}

Then, when I use this in the following calculation, I get one-hour shifts for particular datetimes:

+    auto time = obj.cast<std::chrono::system_clock::time_point>();
+    int64_t time_since_epoch_us = std::chrono::duration_cast<std::chrono::microseconds>(
+        time - local_epoch()
+    ).count();
+    std::cout << "time since epoch     " << time_since_epoch_us << std::endl;
+    std::cout << "time since epoch cpp " << std::chrono::duration_cast<std::chrono::microseconds>(time.time_since_epoch()).count() << std::endl;
+    self.datetime(time_since_epoch_us, "datetime64[us]");

I am currently under the impression that this is because the timezones differ between Python and C++. I'm looking into this.

agoose77 avatar Sep 15 '22 22:09 agoose77

The above code example is wrong: the epoch should not be specified in local time (via std::tm). Instead, it's just std::time_t(0).

Now that I'm more familiar with mktime vs localtime, some thoughts on my timezone (BST):

  • NumPy's datetime objects are naive and don't attempt to determine any DST offsets
  • datetime.datetime.timestamp does try to use the local timezone to figure out DST offsets
  • (dt - epoch) does not try to do this, so behaves like NumPy
  • std::mktime (used by pybind11) converts from local clock into UTC. The tm_isdst flag is used to help mktime figure out how to transform the input. It's set to -1 by default, which means that mktime uses the local clock to figure out if the time period should be during DST, and corrects it if so:

        tm_isdst    expected DST?    offset (hours)    new tm_isdst
         1          yes               0                 1
         1          no               -1                 0
         0          yes              +1                 1
         0          no                0                 0
        -1          yes               0                 1
        -1          no                0                 0

    I'm least confident on -1 for tm_isdst. I believe that -1 just signals to mktime to determine from the local time whether the given time is DST, and reflect that guess in the field. It doesn't modify the time fields after deduction - it assumes the user entered the correct local time for the timezone. mktime is idempotent, as it only changes fields if tm_isdst is changed.
  • time_t literally returns the same as time_since_epoch()
  • mktime considers whether tm_isdst changes when building the result time_t (the sketch below pokes at this from Python).
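
As a quick way to poke at this from Python (time.mktime wraps the same C mktime), here's a small sketch; the printed values depend on the system's local timezone, and in a zone without DST all three agree:

import time

# A wall-clock time that falls inside BST if the local timezone is Europe/London.
# Tuple fields: (year, mon, mday, hour, min, sec, wday, yday, tm_isdst);
# wday and yday are ignored by mktime.
fields = (2022, 7, 1, 12, 0, 0, 0, 0)

for isdst in (1, 0, -1):
    # isdst=1: input is DST local time; isdst=0: standard local time;
    # isdst=-1: let mktime guess from the date.
    print(isdst, time.mktime(fields + (isdst,)))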

agoose77 avatar Sep 16 '22 00:09 agoose77

Actually, I didn't think about this: the DST boundary means that local times do not map onto distinct UTC values. On 2022/03/27 in my current timezone, the local times 01:00 and 02:00 both correspond to the same UTC time (01:00):

  • 1 AM GMT = 1 AM UTC
  • 2 AM BST = 1 AM UTC

So, it's not possible to losslessly recover the input (local) time from the UTC output (in std::time_t). Therefore, we either need to modify the timezone (set it to UTC), or do this in Python.

agoose77 avatar Sep 16 '22 12:09 agoose77

Since NumPy does not deal with timezones (its times are timezone-naive, which is to say that knowledge of the timezone is not contained within the array, but is managed by the user in some external metadata somewhere), we have to operate at the same level (also timezone-naive).

Python's datetime library can operate in both modes: https://stackoverflow.com/a/7065242/1623645

Probably the least-surprising way to turn timezone-aware datetimes (datetimes with the tzinfo set to non-None) into timezone-naive datetimes would be to remove the tzinfo, rather than converting to UTC, as in the StackOverflow example. The best we can do is to be consistent and document it.
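
To make the two options concrete, a plain-stdlib sketch (not the implementation, just the candidate behaviors):

from datetime import datetime, timedelta, timezone

aware = datetime(1976, 7, 2, 4, 0, 0, tzinfo=timezone(timedelta(hours=-4)))

# Option 1: drop the tzinfo and keep the wall-clock fields as they are.
naive_dropped = aware.replace(tzinfo=None)                        # 1976-07-02 04:00:00

# Option 2: convert to UTC first, then drop the tzinfo.
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)   # 1976-07-02 08:00:00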

This is not the worst thing to happen to people who deal with time-data...

jpivarski avatar Sep 16 '22 13:09 jpivarski

We're on the same page w.r.t whether we support timezones!

In this PR, the issue I'm running into concerns the pybind11 bindings for datetime.datetime objects. If we want to improve performance by not calling into the Python interpreter several times per datetime.datetime object, it's best to convert each one once into a native C++ type, e.g. std::time_t. However, the pybind11 converters use mktime to produce a std::chrono::system_clock::time_point object (which is convertible to time_t) according to the localtime settings (see https://github.com/pybind/pybind11/issues/1638). This means that the conversion result depends upon the user's system.

Actually, writing this response has given me an idea for a "hack". These problems all go away if the localtime is UTC, because there are no DST periods. However, setting the TZ env var requires us to add special code to handle Windows' _putenv vs Linux's putenv. Or, we could just call into Python and let Python handle this, once per from_iter call. I'd prefer to just fix this at the root, which would probably require pybind11 to ship a third-party header-only library (or we could do this and forgo pybind11's conversion).

agoose77 avatar Sep 16 '22 13:09 agoose77

Oh, I see: not specifying a timezone and converting a time-tuple into seconds since 1970 will implicitly pick up the platform-dependent locale settings. Well, we definitely don't want it to be platform- or locale-dependent, or to be changing OS-level variables to do the conversion. If there's a way to satisfy all constraints but from_iter runs a little slower, from_iter runs a little slower. It's expected that we'll be making a lot of Python calls, since we are converting from Python data, after all.

Actually, WWND (What Would NumPy Do)?

>>> import numpy as np
>>> from datetime import datetime
>>> from dateutil import tz
>>> d = datetime(1976, 7, 2, 4, 0, 0, tzinfo=tz.tzoffset("EDT", -4*60*60))
>>> d
datetime.datetime(1976, 7, 2, 4, 0, tzinfo=tzoffset('EDT', -14400))
>>> d.isoformat()
'1976-07-02T04:00:00-04:00'
>>> np.array([d])
array([datetime.datetime(1976, 7, 2, 4, 0, tzinfo=tzoffset('EDT', -14400))],
      dtype=object)
>>> np.array([d], "datetime64[s]")
<stdin>:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
array(['1976-07-02T08:00:00'], dtype='datetime64[s]')

Aha! It converts the timezone-aware datetime to UTC, which is platform-independent (it's not relying on the system's local timezone; the timezone is in the datetime object), but this is deprecated behavior and will raise an error in the future. See also here.

So we can just raise an error. (If it weren't for this NumPy deprecation, we could do that conversion as well.)

Thus, any datetime for which d.tzinfo is not None would stop processing with an error. It would be good to know what kind of error this would be, but I would assume ValueError.
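
The check could be as small as this sketch (the function name and message are placeholders, not actual from_iter code):

from datetime import datetime

def ensure_naive(d: datetime) -> datetime:
    # Reject timezone-aware datetimes instead of guessing a conversion,
    # mirroring NumPy's deprecation of parsing aware datetimes.
    if d.tzinfo is not None:
        raise ValueError("only timezone-naive datetimes are supported")
    return d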

jpivarski avatar Sep 16 '22 14:09 jpivarski

Oh, I see: not specifying a timezone and converting a time-tuple into seconds since 1970 will implicitly pick up the platform-dependent locales.

Yes, that's it: naive datetimes are treated as local times during conversion.

changing OS-level variables to do the conversion.

I think this would be fine if we did it in a scoped manner, e.g. RAII to set up / tear down, but it's still non-ideal.

So we can just raise an error.

Yes; separately, for the case of timezone-aware datetimes, we should probably raise an error.

agoose77 avatar Sep 16 '22 19:09 agoose77

Codecov Report

Merging #1690 (6e696b1) into main (e692946) will increase coverage by 1.01%. The diff coverage is n/a.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/__init__.py 96.87% <ø> (ø)
src/awkward/_broadcasting.py 93.38% <ø> (ø)
src/awkward/_connect/avro.py 87.17% <ø> (ø)
src/awkward/_connect/cling.py 24.90% <ø> (ø)
src/awkward/_connect/cuda/__init__.py 0.00% <ø> (ø)
src/awkward/_connect/jax/__init__.py 90.47% <ø> (ø)
src/awkward/_connect/jax/_reducers.py 76.92% <ø> (ø)
src/awkward/_connect/numba/arrayview.py 97.77% <ø> (ø)
src/awkward/_connect/numba/builder.py 81.60% <ø> (ø)
src/awkward/_connect/numba/layout.py 84.87% <ø> (ø)
... and 311 more

codecov[bot] avatar Sep 19 '22 19:09 codecov[bot]

I've fixed the remaining bug with date-time handling (I introduced this during refactoring). This means that the test suite now passes. I've done a very simple fix of the docs-building so we don't reference any v1 layouts anymore. I won't spend too much time on this because much of this work is/will be done in the docs branch.

In this PR, I'd mainly benefit from reviewers considering edge cases that sed et al. might have missed, because that's how most of this work was done.

agoose77 avatar Sep 19 '22 19:09 agoose77

I've now rebased this PR on top of #1708, given that I expect that PR will be merged into 1.10 (it fixes a significant bug).

I've taken the time to rewrite most of these commits to join together similar changes and to remove intermediate commits that were used to keep localbuild happy. It's possible that some of the commits won't build or pass the test suite; the early v1-to-v2 move commits certainly won't (by virtue of separating the mv from the sd/sed changes).

agoose77 avatar Sep 21 '22 11:09 agoose77

1.10.1 is out and I'm reviewing this now.

I'll start by pressing "update branch" so that it will test against main, though I'm pretty sure that the only change is VERSION_INFO or something like it. For formality's sake, since this is such a big PR.

jpivarski avatar Sep 22 '22 19:09 jpivarski

It should also be possible to remove

  EXPORT_SYMBOL void* awkward_malloc(int64_t bytelength);
  EXPORT_SYMBOL void awkward_free(void const *ptr);

from include/awkward/kernel-utils.h and wherever they're implemented. I think you've already converted over all of their uses to new and delete or standard malloc and free, right?

jpivarski avatar Sep 22 '22 19:09 jpivarski

I believe we'll be able to remove all instances of pyobject_deleter in favor of pybind11 features, and I don't think there's any need for array_deleter, now that we're not sharing C++-allocated arrays with Python anymore. Even if this could be done with a few extra steps, those steps might be tricky and we can put that off for a new PR.

jpivarski avatar Sep 22 '22 19:09 jpivarski

The low-level ArrayBuilder should not go into ak.contents, since this is for Content subclasses exclusively. (Similarly, the Index subclasses, Identifier, and low-level Record were also moved out of this directory. The idea is that we can quickly open all the files in this directory and get every Content, to apply some small change to all of them, typically.)

Could you elaborate on this, Jim? I don't recall merging layout into contents, although I may have done this in an intermediary commit that I couldn't squash for whatever reason.

agoose77 avatar Sep 22 '22 20:09 agoose77

and I don't think there's any need for array_deleter

I think we do still need this; it's used internally to free new[]-allocated arrays (before C++17): https://stackoverflow.com/questions/13061979/shared-ptr-to-an-array-should-it-be-used

agoose77 avatar Sep 22 '22 20:09 agoose77