feat: replace v1 with v2
This is just me playing around with removing v1. Should you want to do this yourself, no problem!
- [x] Replace v1 tests with v2.
- [x] Fix up setup.cfg linting rules.
- [x] Move to `malloc` and `free` instead of Awkward allocators.
- [x] Remove preliminary set of unused kernels. Any unused specialisations are not yet removed.
- [x] Replace use of contents in Forth with NumPy arrays or import v2 objects
- [x] Remove kernel dispatch, replace `kernel::malloc` with `malloc` or `new`
- [x] Support `datetime.datetime` in `from_iter`
- [x] Move `array_deleter` to `include/awkward/util.h`
Fixes #1701
Removing the v1 Python layer is fairly easy. Need to think about the C++ layer; the Forth machine returns a `ContentPtr` in a number of places. Some of these are trivial NumPy arrays, but the `bytecodes` getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets, and let the Python binding layer build the Array object from C++ via pybind Python execution. I haven't yet thought about whether I like the idea of the ForthMachine returning pybind11's `py::array` outside of the Python layer, so I think I'm leaning towards removing the `toNumpyArray` interface and doing that at the bindings level.
> Need to think about the C++ layer; the Forth machine returns a `ContentPtr` in a number of places. Some of these are trivial NumPy arrays, but the `bytecodes` getter function returns a jagged list. The obvious answer is to return the bytecodes and their offsets
Knowing that the split is coming up soon, we've been using the ForthMachine in ways that don't depend on the return values being `ContentPtr` and `IndexPtr`: we've been immediately casting them as NumPy arrays.
The `bytecodes` method/property is not used much. I think nothing would break if it were removed entirely, but it's probably better to replace it with just a 2-tuple of `offsets` and `content`, as you describe. It exists as a debugging tool (though the `decompile` one, which prints out source code reconstructed from the bytecodes, is more useful), and was constructed as a jagged array just because it occurred to me that one could do that. Unlike imperative bytecode, Forth bytecode does not have arbitrary jumps; it happens to be a structured language in the sense of not having GOTOs, a fact that was emphasized by Forth enthusiasts in the 80's, but nowadays nearly all languages are structured at the human-readable source code level. The only thing that stands out for Forth is that its bytecode can also be blocked out as subroutines, a jagged array.[^1]
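For concreteness, here is a minimal sketch of what that 2-tuple could look like at the bindings level, assuming (hypothetically) that the bytecodes and their offsets live in `std::vector<int64_t>` buffers; the function name is made up and this is not the actual ForthMachine API, just an illustration of returning plain NumPy arrays instead of a `ContentPtr`:

```cpp
#include <cstdint>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Build an (offsets, content) 2-tuple of NumPy arrays from plain C++ vectors.
// pybind11 copies the data into NumPy-owned buffers when no `base` object is given.
py::tuple bytecodes_as_tuple(const std::vector<int64_t>& offsets,
                             const std::vector<int64_t>& content) {
  py::array_t<int64_t> np_offsets(offsets.size(), offsets.data());
  py::array_t<int64_t> np_content(content.size(), content.data());
  return py::make_tuple(np_offsets, np_content);
}
```

On the Python side the jagged structure could then be rebuilt (e.g. as a ListOffsetArray) or simply inspected as two flat arrays.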
Anyway, the point is that we've been using the `ContentPtr`/`IndexPtr`-dependent parts of ForthMachine in a minimal way so that they can be replaced by raw buffers (`py::buffer_info` in pybind11) without issues. Along the same lines, we will be removing the custom growable-buffer of ForthOutputBuffer to use Manasvi's new GrowableBuffer (and reap the benefits of non-contiguous filling), but that is an entirely separate step that can wait until the v1 pruning is entirely done. (ForthOutputBuffer doesn't depend on the old GrowableBuffer; that one can be safely deleted.)
[^1]: Although somebody might point out that this is now true of other bytecodes that I don't know about. Maybe JVM? I'm pretty sure Python bytecode has unconditional jumps.
Phew.
Now v1-less Awkward compiles and passes most tests, having removed the array and type machinery.
I've removed the low-hanging fruit on the C++ side, leaving a much-reduced version of `Index.cpp`. My rationale is that older C++ standards don't have `std::span` yet, so either we start using pybind features inside of AwkwardForth, or we keep this small abstraction.
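For illustration, the "small abstraction" here amounts to a non-owning (pointer, length) view, i.e. a hand-rolled stand-in for `std::span`; the names below are hypothetical, not the actual Index interface:

```cpp
#include <cstdint>

// Minimal non-owning view over a typed buffer: what std::span<T> would provide on
// newer standards. Hypothetical sketch, not the real Index API.
template <typename T>
class BufferView {
public:
  BufferView(T* data, int64_t length) : data_(data), length_(length) {}
  T* data() const { return data_; }
  int64_t length() const { return length_; }
  T& operator[](int64_t i) const { return data_[i]; }

private:
  T* data_;
  int64_t length_;
};
```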
New bugs: `from_iter` doesn't currently handle native Python datetime objects, so many tests fail. We should fix this.
We should be able to remove `Index` from `ForthMachine`. I'll take a look at it later, but that's a separable problem from everything else.
> We should be able to remove `Index` from `ForthMachine`. I'll take a look at it later, but that's a separable problem from everything else.
I think it depends on whether we want to be passing around unbounded arrays. Actually, this is barely a point at the moment because the `Index` impl doesn't do anything safety-wise (i.e. provide an iterator).
IIRC you never get an index directly; rather, you get the output and call `toIndexXX` on it, so we can just use the metadata from the output.
Otherwise, I still need to remove the unused kernels...
In the way AwkwardForth is used, the outputs of `toNumpyArray` and `toIndex` are immediately wrapped with `np.asarray`, so we only need one way to get a buffer out. We can even change the name (breaking Uproot), since Uproot v5 and Awkward v2 releases will be coordinated. A name change won't be very disruptive.
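As a sketch of the "one way to get a buffer out" idea (hypothetical names, not the real ForthMachine interface): the bindings could hand back a zero-copy view that `np.asarray` then interprets on the Python side.

```cpp
#include <cstdint>

#include <pybind11/pybind11.h>

namespace py = pybind11;

// Expose an existing C++ buffer to Python without copying; lifetime management is
// left to the caller (the buffer must outlive the memoryview).
py::memoryview output_as_memoryview(const int64_t* ptr, py::ssize_t length) {
  return py::memoryview::from_buffer(
      ptr,                                             // data pointer (read-only overload)
      { length },                                      // shape
      { static_cast<py::ssize_t>(sizeof(int64_t)) });  // strides in bytes
}
```

`np.asarray(view)` then yields an int64 array without an extra copy.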
> Otherwise, I still need to remove the unused kernels...
Perhaps removing unused kernels should be a separate PR. We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to `arange` or something.
wip: remove unused kernels
It's okay. But I think we can remove more, and fishing for them can be a different PR.
> wip: remove unused kernels
>
> It's okay. But I think we can remove more, and fishing for them can be a different PR.
Cool. I found these ones by looking directly (with Xonsh):
import yaml
import re
kernel_spec = yaml.safe_load(p"/home/angus/Git/awkward/kernel-specification.yml".read_text())['kernels']
in_use_specs = set($(rg r"awkward_\w+" src/awkward/contents src/awkward/_reducers.py -oNI).splitlines())
name_to_specs = {g['name']: [s['name'] for s in g['specializations']] for g in kernel_spec}
spec_to_name = {v: k for k, s in name_to_specs.items() for v in s}
# For cleaning up the spec file
in_use_names = (name_to_specs.keys() & in_use_specs)
# For removing kernels: used_specs holds the specialisations of every in-use kernel name
used_specs = {s for n in in_use_names for s in name_to_specs[n]}
for f in [f for f in pg`src/cpu-kernels/awkward*.cpp` if not any(s in f.read_text() for s in used_specs)]:
    f.unlink()
> the output of `toNumpyArray` and `toIndex` are used
Yeah, now that all the removed code is gone, it's easier to see that we don't need an object that holds both length and ptr (besides the output itself). The main "issue" is internally within the Forth machine, rather than the bindings. It should be possible to just access the (shared) pointer directly within the Forth routines, and wrap it in the bindings.
> We might even be able to find more that could be removed and haven't yet. Some of them might be equivalent to `arange` or something.
Agreed, there are several that I know of (not sure if they're still carried through in this PR, haven't checked) ;)
OK, this now removes `Index` as well, so it's nearly all the way there. I've rebased my commits where possible to give a clean-ish history. I can't guarantee that every commit builds a wheel that passes the test suite (they certainly don't when I start moving files), but after the large-scale refactoring commits, most revisions then compile.
I haven't fully pulled out the kernel dispatch mechanism yet. I'll tentatively note that I'm not intimately familiar with this code, but my current understanding is that our kernel handling has all been superseded by the nplike dispatch mechanism.
Any kernel-dispatch in C++ is old. We have an entirely new kernel-dispatch system in Python, using ctypes.
Anything you find in C++ about `startup` functions, particularly for setting up awkward-cuda-kernels, can be removed. That's all Python now as well.
Another thing that can go is the dlpack git submodule. That was all for supporting CUDA arrays in C++, which is now handled by CuPy in nplike.
> Another thing that can go is the dlpack git submodule
It's already gone!
OK, now we're at a point where there's not much else (anything?) that needs to be removed.
The main changes are now described in this PR's description.
I've added some rudimentary time handling. I originally tried employing `std::chrono`, but I clearly don't fully understand how C++ is handling times vs Python:
If I calculate the local epoch using:
+std::chrono::system_clock::time_point local_epoch() {
+ std::tm tm = {
+ /* .tm_sec = */ 0,
+ /* .tm_min = */ 0,
+ /* .tm_hour = */ 0,
+ /* .tm_mday = */ 1,
+ /* .tm_mon = */ 1 - 1,
+ /* .tm_year = */ 1970 - 1900,
+ };
+ // Use local time zone DST
+ tm.tm_isdst = -1;
+ return std::chrono::system_clock::from_time_t(std::mktime(&tm));
+}
Then when I use this in these calculations, I get one-hour shifts for particular datetimes:
+ auto time = obj.cast<std::chrono::system_clock::time_point>();
+ int64_t time_since_epoch_us = std::chrono::duration_cast<std::chrono::microseconds>(
+ time - local_epoch()
+ ).count();
+ std::cout << "time since epoch " << time_since_epoch_us << std::endl;
+ std::cout << "time since epoch cpp " << std::chrono::duration_cast<std::chrono::microseconds>(time.time_since_epoch()).count() << std::endl;
+ self.datetime(time_since_epoch_us, "datetime64[us]");
I am currently under the impression that this is because the timezones differ between Python and C++. I'm looking into this.
The above code example is wrong - the epoch should not be specified in local time (via `std::tm`). Instead, it's just `std::time_t(0)`.
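For reference, a minimal sketch of the corrected conversion, under the assumption (true in practice, and guaranteed from C++20) that `std::chrono::system_clock` counts from the Unix epoch, so no locally-computed epoch and hence no `std::mktime`/DST adjustment is involved:

```cpp
#include <chrono>
#include <cstdint>
#include <ctime>

// Microseconds since the Unix epoch, straight from the time_point.
int64_t to_us_since_epoch(std::chrono::system_clock::time_point time) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             time.time_since_epoch()).count();
}

// Equivalent formulation that keeps the explicit epoch, but as std::time_t(0)
// rather than a local-time std::tm.
int64_t to_us_since_epoch_explicit(std::chrono::system_clock::time_point time) {
  auto epoch = std::chrono::system_clock::from_time_t(std::time_t(0));
  return std::chrono::duration_cast<std::chrono::microseconds>(time - epoch).count();
}
```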
Now that I'm more familiar with `mktime` vs `localtime`, some thoughts on my timezone (BST):

- NumPy's datetime objects are naive and don't attempt to determine any DST offsets
- `datetime.datetime.timestamp` does try to use the local timezone to figure out DST offsets
- `(dt - epoch)` does not try to do this, so behaves like NumPy
- `std::mktime` (used by pybind11) converts from the local clock into UTC. The `tm_isdst` flag is used to help `mktime` figure out how to transform the input. It's set to `-1` by default, which means that `mktime` uses the local clock to figure out if the time period should be during DST, and corrects it if so (see the sketch after this list):

  | tm_isdst | expected DST? | offset (hours) | new tm_isdst |
  |---|---|---|---|
  | 1 | yes | 0 | 1 |
  | 1 | no | -1 | 0 |
  | 0 | yes | 1 | 1 |
  | 0 | no | 0 | 0 |
  | -1 | yes | 0 | 1 |
  | -1 | no | 0 | 0 |

  `-1` for `tm_isdst`: I believe that `-1` just signals to `mktime` to determine from the local time whether the given time is DST, and reflect that guess in the field. It doesn't modify the time fields after deduction - it assumes the user entered the correct local time for the timezone. `mktime` is idempotent, as it only changes fields if `tm_isdst` is changed.
- `time_t` literally returns the same as `time_since_epoch()`
- `mktime` considers whether `tm_isdst` changes when building the result `time_t`.
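A small sketch of the `mktime`/`tm_isdst` behaviour referred to in the list above (the results depend on the machine's local timezone; the comments assume a UK locale, as in the discussion):

```cpp
#include <ctime>
#include <iostream>

int main() {
  std::tm t = {};
  t.tm_year = 2022 - 1900;  // 2022
  t.tm_mon = 7 - 1;         // July, i.e. inside the BST period
  t.tm_mday = 1;
  t.tm_hour = 12;
  t.tm_isdst = -1;  // "unknown": let mktime decide whether DST applies

  std::time_t since_epoch = std::mktime(&t);  // interprets t as *local* wall-clock time
  std::cout << "seconds since epoch: " << since_epoch << "\n";
  std::cout << "deduced tm_isdst:    " << t.tm_isdst << "\n";  // 1 on a BST summer date
}
```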
Actually, I didn't think about this - the timezone boundaries do not map onto distinct values: the local times 01:00 and 02:00 in my current timezone on 2022/03/27 both map onto the same UTC time (01:00):
- 1 AM BST = 1 AM UTC
- 2 AM BST = 1 AM UTC

so it's not possible to losslessly recover the input (local) time from the UTC output (in `std::time_t`). Therefore, we either need to modify the timezone (set it to UTC), or do this in Python.
Since NumPy does not deal with timezones (its times are timezone-naive, which is to say that knowledge of the timezone is not contained within the array, but is managed by the user in some external metadata somewhere), we have to operate at the same level (also timezone-naive).
Python's `datetime` library can operate in both modes: https://stackoverflow.com/a/7065242/1623645
Probably the least-surprising way to turn timezone-aware `datetime`s (`datetime`s with the `tzinfo` set to non-None) into timezone-naive `datetime`s would be to remove the `tzinfo`, rather than converting to UTC, as in the StackOverflow example. The best we can do is to be consistent and document it.
This is not the worst thing to happen to people who deal with time-data...
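If we did go the "drop the tzinfo" route, a hedged sketch of what that could look like at the binding level (the function name is hypothetical, and it calls back into Python once per object, so it is not a fast path):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Return a timezone-naive copy of a datetime.datetime by discarding its tzinfo,
// i.e. the C++ equivalent of obj.replace(tzinfo=None).
py::object drop_tzinfo(py::handle obj) {
  return obj.attr("replace")(py::arg("tzinfo") = py::none());
}
```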
We're on the same page w.r.t. whether we support timezones!
In this PR, the issue I'm running into concerns the pybind11 bindings for `datetime.datetime` objects. If we want to bump performance by not calling into the Python interpreter several times per `datetime.datetime` object, then it's best if we convert them once into a native C++ type, e.g. `std::time_t`. However, the pybind11 converters use `mktime` to produce a `std::chrono::system_clock::time_point` object (which is convertible to `time_t`) according to the localtime settings (see https://github.com/pybind/pybind11/issues/1638). This means that the conversion result depends upon the user's system.
Actually, writing this response has given me an idea for a "hack". These problems all go away if the localtime is UTC, because there are no DST periods. However, setting the `TZ` env var from C++ requires us to add special code to handle Windows `_putenv` vs Linux `putenv`. Or, we could just call into Python and let Python handle this, once per `from_iter` call. I'd prefer to just fix this at the root, which would probably require pybind11 to ship a third-party header-only library (or we could do this ourselves and forgo pybind11's conversion).
Oh, I see: not specifying a timezone and converting a time-tuple into seconds since 1970 will implicitly pick up the platform-dependent locales. Well, we definitely don't want it to be platform- or locale-dependent, or to be changing OS-level variables to do the conversion. If there's a way to satisfy all constraints but `from_iter` runs a little slower, then `from_iter` runs a little slower. It's expected that we'll be making a lot of Python calls, since we are converting from Python data, after all.
Actually, WWND (What Would NumPy Do)?
>>> import numpy as np
>>> from datetime import datetime
>>> from dateutil import tz
>>> d = datetime(1976, 7, 2, 4, 0, 0, tzinfo=tz.tzoffset("EDT", -4*60*60))
>>> d
datetime.datetime(1976, 7, 2, 4, 0, tzinfo=tzoffset('EDT', -14400))
>>> d.isoformat()
'1976-07-02T04:00:00-04:00'
>>> np.array([d])
array([datetime.datetime(1976, 7, 2, 4, 0, tzinfo=tzoffset('EDT', -14400))],
dtype=object)
>>> np.array([d], "datetime64[s]")
<stdin>:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
array(['1976-07-02T08:00:00'], dtype='datetime64[s]')
Aha! It converts the timezone-aware `datetime` to UTC, which is platform-independent (it's not relying on the system's local timezone; the timezone is in the `datetime` object), but this is deprecated behavior and will raise an error in the future. See also here.
So we can just raise an error. (If it weren't for this NumPy deprecation, we could do that conversion as well.)
Thus, any `datetime` for which `d.tzinfo is not None` would stop processing with an error. It would be good to know what kind of error this would be, but I would assume `ValueError`.
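A minimal sketch of that check at the binding level (the function name is hypothetical; `py::value_error` is pybind11's exception that translates to Python's `ValueError`):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Reject timezone-aware datetime.datetime objects, mirroring NumPy's deprecation.
void ensure_naive(py::handle obj) {
  if (!obj.attr("tzinfo").is_none()) {
    throw py::value_error(
        "only timezone-naive datetime.datetime objects are supported");
  }
}
```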
> Oh, I see: not specifying a timezone and converting a time-tuple into seconds since 1970 will implicitly pick up the platform-dependent locales.
Yes, that's it - naive datetimes are treated as local times during conversion.
> changing OS-level variables to do the conversion.

I think this would be fine if we are doing it in a contextual manner, e.g. RAII to set up/tear down, but it's still non-ideal.
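For what it's worth, a rough sketch of that RAII set-up/tear-down idea, POSIX-only (Windows would need `_putenv`/`_tzset`) and not thread-safe, which is part of why it is non-ideal:

```cpp
#include <cstdlib>
#include <ctime>
#include <string>

// Temporarily force the process-local timezone to UTC, restoring the previous TZ on
// destruction. POSIX-only sketch: setenv/unsetenv/tzset have no portable equivalents.
class ScopedUTC {
public:
  ScopedUTC() {
    const char* old = std::getenv("TZ");
    if (old != nullptr) {
      had_old_ = true;
      old_ = old;
    }
    setenv("TZ", "UTC", 1);
    tzset();
  }
  ~ScopedUTC() {
    if (had_old_) {
      setenv("TZ", old_.c_str(), 1);
    } else {
      unsetenv("TZ");
    }
    tzset();
  }

private:
  bool had_old_ = false;
  std::string old_;
};
```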
> So we can just raise an error.

Yes, separately for the case of timezone-specified datetimes, we should probably error.
Codecov Report
Merging #1690 (6e696b1) into main (e692946) will increase coverage by 1.01%. The diff coverage is n/a.
Additional details and impacted files
Impacted Files | Coverage Δ
---|---
src/awkward/__init__.py | 96.87% <ø> (ø)
src/awkward/_broadcasting.py | 93.38% <ø> (ø)
src/awkward/_connect/avro.py | 87.17% <ø> (ø)
src/awkward/_connect/cling.py | 24.90% <ø> (ø)
src/awkward/_connect/cuda/__init__.py | 0.00% <ø> (ø)
src/awkward/_connect/jax/__init__.py | 90.47% <ø> (ø)
src/awkward/_connect/jax/_reducers.py | 76.92% <ø> (ø)
src/awkward/_connect/numba/arrayview.py | 97.77% <ø> (ø)
src/awkward/_connect/numba/builder.py | 81.60% <ø> (ø)
src/awkward/_connect/numba/layout.py | 84.87% <ø> (ø)
... and 311 more
I've fixed the remaining bug with date-time handling (I introduced this during refactoring). This means that the test suite now passes. I've done a very simple fix of the docs-building so we don't reference any v1 layouts anymore. I won't spend too much time on this because much of this work is/will be done in the `docs` branch.
In this PR, I'd mainly benefit from reviewers considering edge cases for `sed` et al., because that's how most of this work was done.
I've now rebased this PR on top of #1708, given that I expect that PR will be merged into 1.10 (it fixes a significant bug).
I've taken the time to rewrite most of these commits to join together similar changes, and remove intermediate commits that were used to keep `localbuild` happy. It's possible that some of the commits won't build or pass the test suite; this is certainly the case for the early v1-to-v2 move commits (by virtue of separating `mv` from `sd` (`sed`)).
1.10.1 is out and I'm reviewing this now.
I'll start by pressing "update branch" so that it will test against `main`, though I'm pretty sure that the only change is VERSION_INFO or something like it. This is just for formality's sake, since this is such a big PR.
It should also be possible to remove
EXPORT_SYMBOL void* awkward_malloc(int64_t bytelength);
EXPORT_SYMBOL void awkward_free(void const *ptr);
from include/awkward/kernel-utils.h and wherever they're implemented. I think you've already converted over all of their uses to `new` and `delete` or standard `malloc` and `free`, right?
I believe we'll be able to remove all instances of `pyobject_deleter` in favor of pybind11 features, and I don't think there's any need for `array_deleter`, now that we're not sharing C++-allocated arrays with Python anymore. Even if this could be done with a few extra steps, those steps might be tricky and we can put that off for a new PR.
The low-level ArrayBuilder should not go into ak.contents, since this is for Content subclasses exclusively. (Similarly, the Index subclasses, Identifier, and low-level Record were also moved out of this directory. The idea is that we can quickly open all the files in this directory and get every Content, to apply some small change to all of them, typically.)
Could you elaborate on this, Jim? I don't recall merging layout into contents. Although, I may have done this in an intermediate commit that I couldn't squash for whatever reason.
> and I don't think there's any need for `array_deleter`

I think we do still need this; it's used internally to free `new`'d arrays (before C++17): https://stackoverflow.com/questions/13061979/shared-ptr-to-an-array-should-it-be-used
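For context, a minimal sketch of why something like this is still needed before C++17: a `std::shared_ptr<T>` constructed from `new T[n]` must be given an array-aware deleter explicitly, otherwise it calls `delete` instead of `delete[]`. The function name below is just for illustration.

```cpp
#include <cstdint>
#include <memory>

// Array-aware deleter for a shared_ptr<T> owning a new[]-allocated buffer (pre-C++17).
template <typename T>
struct array_deleter {
  void operator()(const T* ptr) const { delete[] ptr; }
};

std::shared_ptr<int64_t> make_buffer(int64_t length) {
  return std::shared_ptr<int64_t>(new int64_t[length], array_deleter<int64_t>());
  // From C++17 onward this could instead be: std::shared_ptr<int64_t[]>(new int64_t[length]);
}
```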