awkward icon indicating copy to clipboard operation
awkward copied to clipboard

feat!: printing a typetracer array should not touch it

Open jpivarski opened this issue 6 months ago • 9 comments

I think the rationale for having print(array) touch all of the data in array (whether it's high-level or a layout) is so that the delayed operation is exactly the same as an eager operation. In Dask, a DAG-builder has to touch everything in a print(array) operation so that all of the data will be available to the Dask worker to actually print that string.

However,

  • The Dask worker is by definition non-interactive. No one is looking at the print output.
  • A print operation on deeply nested data is probably not sensitive to all of the deeply nested data—it's limited to a line width or 80 characters (configurable), so only a few parts of it are going to get printed. The Awkward 1 VirtualArray was careful to only materialize that which appears in the printed string, for a point of comparison.
  • Even if the Dask worker saves the __str__ or __repr__ string to a variable and does some computations with it, and even if the relevant data are close enough to the beginning or end of the array to appear in that string, I don't think there is (or should be) a strong user expectation that this string will be exactly the same as a string you'd get from an eager array. The string representation of a Python object, intended for interactive feedback, should not be relied upon programmatically. For example, whenever anyone does
[float(x) for x in str(a).lstrip("[").rstrip("]").split(" ")]

to get the values of a NumPy array, they are doing a Bad Thing™.

I think there's a stronger user expectation that observing the string representation of a Python object does not change that object, and that inclusion in the set of "touched" buffers is an important enough change to worry about.

This PR does two things:

  1. It makes high-level and layout string representations (__str__ and __repr__) non-touching, and it only affects the string representation context (i.e. not changing every use of array_str, just those in __str__ and __repr__).
  2. It gives PlaceholderArray a string representation, instead of an error, when it is accessed in the string representation context (only).

There is now a difference between __str__ and __repr__ on eager arrays versus in delayed operations: on eager arrays, the string representations contain values, and in delayed operations, they contain values if some other operation touched that data, and ?? placeholders otherwise.

To demonstrate that high-level and layout string representations no longer touch:

>>> a = ak.Array([1, 2, 3])
>>> layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
>>> report.data_touched, report.shape_touched
([], [])
>>> b = ak.Array(layout)
>>> report.data_touched, report.shape_touched
([], [])
>>> b
<Array-typetracer [...] type='## * int64'>
>>> report.data_touched, report.shape_touched
([], ['node0'])

and

>>> layout, report = ak.typetracer.typetracer_with_report(a.layout.form_with_key())
>>> report.data_touched, report.shape_touched
([], [])
>>> layout
<NumpyArray dtype='int64' len='##'>[## ... ##]</NumpyArray>
>>> report.data_touched, report.shape_touched
([], ['node0'])

They touch the shape but not the data. I'm 90% sure that touching the shape does not invoke data-reading, only metadata, right @agoose77 and @martindurant?

Now to show that the Dask worker will not explode when trying to work with the data that now has PlaceholderArrays anywhere that hasn't been touched by some operation other than the printing:

>>> from awkward._nplikes.numpy import Numpy
>>> from awkward._nplikes.placeholder import PlaceholderArray
>>> z = ak.contents.NumpyArray(PlaceholderArray(Numpy.instance(), (3,), np.int64))
>>> z
<NumpyArray dtype='int64' len='3'>[## ... ##]</NumpyArray>
>>> ak.Array(z)
<Array [??, ??, ??] type='3 * int64'>

I have not added tests, and therefore, this PR is not ready to be merged!

@agoose77, which test files should I fill with tests of str(array) for arrays containing TypeTracerArrays and PlaceholderArrays to be sure that the TypeTracerArrays don't report as data-touching and that the PlaceholderArrays don't raise TypeErrors? I want to efficiently cover all of the cases. (I could make a new suite of examples to test, but I think there are probably examples in some test file already.)

jpivarski avatar Feb 12 '24 03:02 jpivarski

Maybe tests/test_2714_from_buffers_placeholders.py to test all of the PlaceholderArrays?

And tests/test_2660_expected_container_keys_from_form.py to test all of the layout types to ensure that str and repr are non-touching on both high-level and layout?

jpivarski avatar Feb 12 '24 03:02 jpivarski

@jpivarski computing the shape of a buffer requires reading iff. the buffer provider does not factorise the metadata from the array contents. This is true for parquet and partially true for ROOT TTree's; to compute shape A.shape we need to load an arbitrary array whose shape is known to yield A.shape, e.g. it's sibling B. If these buffers are interior buffers, e.g. a leaf NumpyArray's .data, then we also need to read the "shape-buffers" (e.g. offsets) of any parent nodes until we reach the root.

agoose77 avatar Feb 12 '24 08:02 agoose77

The shapes that we're talking about are regular dimensions, and I don't know of any formats that fail to factorize out regular dimensions, except for those formats that lack metadata entirely.

  • Regular dimensions in a ROOT file are stored in some streamers (TStreamerBasicType has an "array length" member, which indicates arrays of basic types, rather than scalars, if that length is not zero) and in title strings (when the text between [ and ] is an integer, [1-9][0-9]*). No TBaskets need to be read to get that shape.
  • Parquet doesn't natively have regular dimensions[^1], but Arrow arrays in Parquet indicate fixed-length structures in the metadata.
  • Even though Avro cleanly separates out its metadata, it doesn't have a concept of a "fixed-length list".
  • JSON has no metadata.
  • ...

I can't think of any situations in which needing to know the (fixed-length) shape of a buffer would require any non-metadata to be read.

It occurred to me last night that I should just grep the awkward and dask-awkward codebases for "shape_touched" and see what comes up.

[^1]: That I know of. As I understand it, Parquet lists are all variable-length, and only Arrow metadata provides the FixedLengthList structure on top of it. Similarly, Arrow has to add structure to even distinguish between an empty list and a missing list...

jpivarski avatar Feb 12 '24 14:02 jpivarski

@jpivarski are you talking about NumpyArray.shape or TypeTracerArray.shape (nplike arrays)?

Because (here I'm talking about nplike buffer lengths) any time a content is nested inside a list-type, the length of its metadata-buffers depends upon reading the parent list-offsets. We handle this process in the input layer e.g. dask-awkward / uproot. In simple terms, ?? * var * var * int64 requires two buffer reads to know how long the leaf data buffer is.

agoose77 avatar Feb 12 '24 14:02 agoose77

Parquet lists are all variable-length, and only Arrow metadata provides the FixedLengthList structure on top of it.

correct, parquet is always a column, and there's no regular ND array. The only fixed width type is byte-array, but I don't think anyone stores regular arrays in that.

Similarly, Arrow has to add structure to even distinguish between an empty list and a missing list

You mean "item"? An OPTIONAL field can have a REPEATED child in the parquet spec, so you could say it's fully general, just parquet-mr decided that you always need both these levels.

JSON has no metadata.

For completeness, I don't recall jsonschema having a list-length specifier, but it could in theory have created. It's a horrible data describing grammar.

martindurant avatar Feb 12 '24 14:02 martindurant

The place where it happens is here:

https://github.com/scikit-hep/awkward/blob/b7eb8b9a0d2fb801b2aebb89620cd3fdab1339d6/src/awkward/contents/numpyarray.py#L194

It's touching the TypeTracerArray's shape items that are not its length. These items need to be known at type-tracing time/Dask DAG-building time/on the head node, before anything is sent to the workers.

Because of this, asking for the Form of any array qualifies as touching its shape, but this is a bit of a technicality. It would be a little different if it were touching any length attributes, since lengths are not known at type-tracing time the way that a NumpyArray's inner_shape must be, ~~though even lengths are metadata; none of our file formats require data-reading to get that metadata.~~

I just realized why needing to know lengths requires some data-reading: the length of a Content node inside of a variable-length list is sometimes given by metadata (in Parquet) and sometimes not (in ROOT). That's a reason why length-touching matters, but this inner_shape touching does not matter.

Maybe this part of the report is just badly named: what we sometimes care about is length-touching, but this is reporting shape-touching.

jpivarski avatar Feb 12 '24 14:02 jpivarski

Similarly, Arrow has to add structure to even distinguish between an empty list and a missing list

You mean "item"? An OPTIONAL field can have a REPEATED child in the parquet spec, so you could say it's fully general, just parquet-mr decided that you always need both these levels.

That's what I was referring to (although it was a tangential point).

JSON has no metadata.

For completeness, I don't recall jsonschema having a list-length specifier, but it could in theory have created. It's a horrible data describing grammar.

I had forgotten about JSONSchema! Actually, JSONSchema lets you set the minimum and maximum allowed lengths of lists at a given level, and these two numbers can be the same, which specifies a fixed-length list. Now that I think of it, ak.from_json takes advantage of this fact.

jpivarski avatar Feb 12 '24 14:02 jpivarski

I just realized why needing to know lengths requires some data-reading: the length of a Content node inside of a variable-length list is sometimes given by metadata (in Parquet) and sometimes not (in ROOT). That's a reason why length-touching matters, but this inner_shape touching does not matter.

Yes, we were careful to distinguish len(shape) as a non-shape-touching operation over list(shape). We could have drawn the line further, but it's worth noting that these touching operations occur at the buffer level (which are nearly always 1D).

Additionally, although we might be able to compute the length of a deeply nested buffer from Parquet by a single read, we actually need to read the metadata buffers that from_buffers inspects during recursion (i.e. more than one). I say "need" because we currently provide that guarantee (reading a leaf length touches metadata buffers of all parent nodes to root).

agoose77 avatar Feb 12 '24 15:02 agoose77

Will circle back to this ASAP.

agoose77 avatar Feb 13 '24 12:02 agoose77

@agoose77 Let me know if this PR makes sense. If not, I'll close it.

jpivarski avatar Mar 04 '24 23:03 jpivarski

@jpivarski thanks for the nudge.

My read of this PR is that _is_getitem_at_placeholder provides a mechanism to test whether _getitem_at involves a single-index-into-placeholder operation, allowing us to avoid it.

I note that things like RegularArray._getitem_at would not be skipped according to this test, and would invoke a _getitem_range on their content, which may be a list with placeholder offsets. This operation would succeed providing that the placeholder has a known length, which should be the case if the same code-path was evaluated using typetracer in the graph building phase. So, that's fine!

I wonder whether we should have a better string value for placeholders than ??. AFAICT, the string representation of any subtree that involves reading a placeholder value is ??. But, consider a list of type `3 * var * int64

[
    [1, 2, 3],
    [4, 5],
    [6]
]

The known-data repr of list[0] is

[1, 2, 3]

so the repr of list is

[
    [1, 2, 3],
    [4, 5],
    [6]
]

If the list offsets are placeholders, then the repr of list[0] is ??, so the repr of list is

[
    ??,
    ??,
    ??
]

I think we might to do something smarter here. What do you think?

agoose77 avatar Mar 06 '24 10:03 agoose77

Would something better be

[
    [??],
    [??],
    [??]
]

or

[
    [??, ...],
    [??, ...],
    [??, ...]
]

That is, acknowledging that there are lists (whose lengths we don't know), and only using "??" for numeric values? Maybe, but

  • if I saw "??", I wouldn't assume that it's necessarily numeric
  • actually implementing the above, within a specified number-of-characters budget, would be much more complicated.

The type is known, and it's something that users can look at. Going to the trouble of producing the right nesting, within the right number of characters and handling ellipses correctly (i.e. not getting [..., ...] or anything!), based on a guess of how someone will interpret "??", seems premature. Maybe if there are a lot of complaints that people expect "??" to only mean numbers and the show representation seems to be at odds with the type.

jpivarski avatar Mar 06 '24 11:03 jpivarski

@jpivarski you know what, let's just merge this as-is. We can iterate if someone isn't happy with it! That saves us all a headache.

agoose77 avatar Mar 06 '24 11:03 agoose77

Yeah, we can consider it iterative. I think that the string repr is not a part of the public interface that must be guaranteed. Sometimes, people parse the repr programmatically, but I think that's always a mistake.

Okay, I'll merge it as-is.

jpivarski avatar Mar 06 '24 12:03 jpivarski