pandas2

lazy array attributes

Open jreback opened this issue 8 years ago • 4 comments

IIRC this is from the design docs, but I wanted to make an issue so we remember it. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable, then these are not affected by indexing checks.

  • immutability / read-only, xref https://github.com/pydata/pandas/pull/14359
  • unique
  • is_monotonic*
  • has_nulls
  • is_hashable - only on non-homogeneous dtypes; usually true on object dtypes (but NOT if the elements are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over the elements and call hash on each one). xref

e.g. imagine a pd.date_range(....., ...); then unique, monotonic, and has_nulls are trivial to compute at creation time. Since this is currently an Index in pandas, it is immutable by definition.

xref https://github.com/pydata/pandas/issues/12272, https://github.com/pydata/pandas/issues/14266
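The idea of lazily computed attributes that can also be seeded at creation time might be sketched as follows. This is only an illustration in plain Python: `LazyArray` and `int_range` are hypothetical names, not pandas classes, and the sketch assumes an immutable array of hashable, mutually comparable values.

```python
from functools import cached_property


class LazyArray:
    """Hypothetical immutable array wrapper; not a pandas class."""

    def __init__(self, values, **known):
        self._values = tuple(values)  # tuple storage keeps the sketch immutable
        # Attributes already known at creation time are written straight into
        # the instance dict, which is where cached_property looks first.
        for name, value in known.items():
            self.__dict__[name] = value

    @cached_property
    def is_unique(self):
        # Computed on first access only, then cached on the instance.
        return len(set(self._values)) == len(self._values)

    @cached_property
    def is_monotonic_increasing(self):
        return all(a <= b for a, b in zip(self._values, self._values[1:]))

    @cached_property
    def has_nulls(self):
        return any(v is None for v in self._values)


def int_range(start, stop):
    """Range-like constructor: every attribute is known without a scan."""
    return LazyArray(
        range(start, stop),
        is_unique=True,
        is_monotonic_increasing=True,
        has_nulls=False,
    )
```

With this layout, `int_range(0, 5).is_monotonic_increasing` never scans the data, while `LazyArray([3, 1, 2]).is_unique` pays for one pass on first access and is free afterwards.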

jreback, Sep 21 '16 10:09

API question - what does it look like to opt-in to one of these checks? As a specific example, I've used this "optimization" a few times to speed up merges on a monotonic column.

a.merge(b, on='sorted_col')

# takes advantage of monotonicity
a.set_index('sorted_col').join(b.set_index('sorted_col'))

What should that look like? It could be something like this, although maybe it should be even more hidden as an "advanced API" to avoid too many parameters on basic functions?

a.merge(b, on='sorted_col', check_monotonicity=True)

check_monotonicity= {'infer' | True | False}
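To make the three-valued option concrete, here is a plain-Python sketch of an inner join on key lists that dispatches on such a flag. Everything here is hypothetical: `inner_join_positions` is not a pandas function, and the meaning given to each value (`False` skips the check, `'infer'` checks and falls back to hashing, `True` checks and raises if the keys are unsorted) is an assumption, since the proposal above does not define it.

```python
def is_monotonic_increasing(keys):
    """Single O(n) pass over the keys."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


def _hash_join(left, right):
    """General path: build a hash table on the right keys."""
    positions = {}
    for j, key in enumerate(right):
        positions.setdefault(key, []).append(j)
    return [(i, j) for i, key in enumerate(left) for j in positions.get(key, [])]


def _sorted_join(left, right):
    """Merge join; only valid when both key lists are sorted."""
    pairs, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            key = left[i]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end] == key:
                i_end += 1
            while j_end < len(right) and right[j_end] == key:
                j_end += 1
            # Emit the cross product of the two runs of equal keys.
            pairs.extend((a, b) for a in range(i, i_end) for b in range(j, j_end))
            i, j = i_end, j_end
    return pairs


def inner_join_positions(left, right, check_monotonicity="infer"):
    """Return (left_pos, right_pos) pairs for matching keys."""
    if check_monotonicity is False:
        return _hash_join(left, right)  # assumed meaning: never check
    if is_monotonic_increasing(left) and is_monotonic_increasing(right):
        return _sorted_join(left, right)
    if check_monotonicity is True:
        # Assumed meaning: the caller insists on the sorted fast path.
        raise ValueError("keys are not monotonic increasing")
    return _hash_join(left, right)  # 'infer': fall back silently
```

Both paths return the same pairs for sorted input, so the flag only affects which algorithm runs, not the result.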

chris-b1, Sep 21 '16 17:09

Things like monotonicity are so cheap to check and provide such significant performance benefits when they are known, that I would support always checking when it may be advantageous.

These attributes can be cached and invalidated whenever the array is mutated (we'd have to have a "dirty" flag to indicate that any cached array statistics need to be recomputed).
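A minimal sketch of the "dirty" flag idea, assuming a mutable array whose only write path is `__setitem__`. `CachedStatsArray` is a hypothetical name used for illustration, not part of pandas or of any design document.

```python
class CachedStatsArray:
    """Hypothetical mutable array that caches statistics; not a pandas class."""

    def __init__(self, values):
        self._values = list(values)
        self._stats = {}      # cached statistics, keyed by attribute name
        self._dirty = False   # set whenever the data changes

    def __getitem__(self, index):
        return self._values[index]

    def __setitem__(self, index, value):
        self._values[index] = value
        self._dirty = True  # cached statistics may now be stale

    def _stat(self, name, compute):
        if self._dirty:
            # Drop everything once, then recompute lazily on demand.
            self._stats.clear()
            self._dirty = False
        if name not in self._stats:
            self._stats[name] = compute(self._values)
        return self._stats[name]

    @property
    def is_monotonic_increasing(self):
        return self._stat(
            "is_monotonic_increasing",
            lambda v: all(a <= b for a, b in zip(v, v[1:])),
        )

    @property
    def is_unique(self):
        return self._stat("is_unique", lambda v: len(set(v)) == len(v))
```

Writes stay cheap because they only flip the flag; the cost of recomputing is deferred to the next read of a statistic.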

wesm, Sep 21 '16 18:09

Regarding immutability: what should happen if a user creates a series from an immutable array, and then later sets the array to mutable and mutates it? I think a valid answer is "don't do that", but it should be explicitly defined. If that should be supported behavior, you could forward checks for immutability down to the underlying storage's check each time. The small indirection shouldn't be too expensive, but idk if you can cache that.
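The forwarding option could look roughly like this. `Storage` and `SeriesLike` are hypothetical stand-ins, and the `writeable` attribute is an assumed interface for the underlying buffer, not a description of any actual pandas storage class.

```python
class Storage:
    """Hypothetical underlying buffer with a toggleable write flag."""

    def __init__(self, values, writeable=True):
        self.values = list(values)
        self.writeable = writeable


class SeriesLike:
    """Hypothetical wrapper that never caches the immutability answer."""

    def __init__(self, storage):
        self._storage = storage

    @property
    def is_immutable(self):
        # Forwarded on every access, so a later flag change on the storage
        # is always observed (at the cost of one extra attribute lookup).
        return not self._storage.writeable

    def __setitem__(self, index, value):
        if self.is_immutable:
            raise ValueError("underlying storage is read-only")
        self._storage.values[index] = value
```

The trade-off is the one raised above: anything derived from immutability cannot be cached safely in the wrapper unless the storage also notifies it when the flag changes.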

llllllllll, Oct 11 '16 23:10

@llllllllll when you create a pandas.Series from a pandas.Array, you are actually obtaining a view on that array, so if the source array mutates itself, it triggers copy-on-write (because it observes that its use count is > 1). So this will be a non-issue.
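As a rough illustration of copy-on-write driven by a use count, here is a plain-Python sketch. `Buffer` and `Holder` are hypothetical names, and the explicit `use_count` field is an assumption made to keep the example self-contained; it does not describe how the proposed pandas.Array would actually track references.

```python
class Buffer:
    """Hypothetical shared buffer that counts how many holders reference it."""

    def __init__(self, values):
        self.values = list(values)
        self.use_count = 0


class Holder:
    """Stands in for both the array and the Series view in this sketch."""

    def __init__(self, buffer):
        self._buffer = buffer
        buffer.use_count += 1

    def view(self):
        # A view shares the same buffer and bumps its use count.
        return Holder(self._buffer)

    def __getitem__(self, index):
        return self._buffer.values[index]

    def __setitem__(self, index, value):
        if self._buffer.use_count > 1:
            # Someone else still sees this buffer: detach before writing.
            self._buffer.use_count -= 1
            self._buffer = Buffer(self._buffer.values)
            self._buffer.use_count = 1
        self._buffer.values[index] = value
```

For example, after `array = Holder(Buffer([1, 2, 3]))` and `series = array.view()`, assigning `array[0] = 99` copies the data first, so `series[0]` still reads the old value.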

wesm, Oct 12 '16 02:10