
feat(bigquery): implement CountDistinctStar


Description of changes

Implements CountDistinctStar for BigQuery.

BigQuery does not support count(distinct a, b, c) or count(distinct (a, b, c)) (i.e. using a tuple, as is done with DuckDB), because the expressions must be groupable.

Instead, convert the entire expression to a string: SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], ''))

This works with all the BigQuery data types (source). Using an array of JSON-encoded values generates a unique, deterministic string for each combination of column values (see JSON encoding).
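For a hypothetical three-column table t, the compiled query would look roughly like this (illustrative only, not necessarily the exact SQL the PR emits):

-- hypothetical table t with columns a, b, c
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b), TO_JSON_STRING(c)], ''))
FROM t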

I do not know what the impact on cost or runtime is here, but there aren't many other ways of achieving a count distinct over multiple columns of arbitrary types.

ssabdb commented Jun 28 '24 18:06

@tswast Out of curiosity are there any performance concerns here? Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

cpcloud commented Jun 29 '24 13:06

> @tswast Out of curiosity are there any performance concerns here? Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

There will certainly be an overhead, but it's difficult to say how bad it is. From the experimentation I decided to run, the string encoding certainly shows up significantly for the more complex timestamp data type.

TL;DR of the below: it seems to depend on the data type. Testing a distinct over a significant number of rows (120M, filtered down to ~30M by the predicate) still led to queries mostly completing in 2 seconds, but with significant variation in slot time consumed (which is a good proxy for resource requirements in BQ).

I've found a small, easy optimization to my current implementation that avoids initializing an array, which I'll update the PR with shortly (sketched below). Other optimizations would have to be made on a type-by-type basis and might get quite complex with arrays and structs because they can be nested, whilst there's no point re-encoding existing string or bytes objects. The beauty of TO_JSON_STRING is that it natively handles nested data types itself.
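Concretely, the change is to drop the intermediate array and concatenate the JSON-encoded columns directly, the same rewrite benchmarked under "Both columns" below:

-- before: build an array of JSON-encoded values, then join it into one string
ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], '')

-- after: concatenate the JSON-encoded values directly
CONCAT(TO_JSON_STRING(a), TO_JSON_STRING(b))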

Experimental Setup

Following this Medium article, with some minor changes to produce more diverse columns and generate 128M rows:

DECLARE i INT64 DEFAULT 0;

create or replace table `my_bq_input_dataset.generated_array` as
select 
  id,
  TIMESTAMP_TRUNC(timestamp_add(TIMESTAMP '2020-01-01T00:00:00', INTERVAL id SECOND), MINUTE) as time_id,
  case when mod(id,4) = 1 then 'TR'
  when mod(id,4) = 2 then 'DE'
  when mod(id,4) = 3 then 'GB'
  else 'US'
  end as country,
  current_timestamp() as _load_time
from unnest(generate_array(1,1000000)) as id;


REPEAT
  insert into my_bq_input_dataset.generated_array select * from my_bq_input_dataset.generated_array;
  SET i = i + 1;
  UNTIL i >= 7
END REPEAT;
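For reference, the REPEAT block doubles the 1M seed rows seven times, i.e. 1M × 2^7 = 128M rows; a quick sanity check (not part of the original article) is:

select count(*) from my_bq_input_dataset.generated_array;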

I tested encoding the int column (id), then the more complex timestamp column time_id, and then tested combining the two.

Int Column

TL;DR: there is no significant performance overhead to JSON-encoding an int.

First comparing the overhead of running count distinct on the int column:

select count(distinct(TO_JSON_STRING(id))) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 252MB shuffled

select count(distinct(id)) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 208MB shuffled

Timestamp column

TL;DR: string-encoding a timestamp column using TO_JSON_STRING is much slower than casting it to a string. Interestingly, turning it into an integer and then into a string is much faster, presumably because encoding the timestamp to ISO format is comparatively slow.

select count(distinct(time_id)) from my_bq_input_dataset.generated_array
where country = 'TR'

12 slot seconds, 29MB shuffled

select count(distinct(TO_JSON_STRING(time_id))) from my_bq_input_dataset.generated_array
where country = 'TR'

57 slot seconds, 88.83MB shuffled

select count(distinct(CAST(time_id AS string))) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 88MB shuffled

select count(distinct(TO_JSON_STRING(UNIX_MICROS(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

18 slot seconds, 69MB shuffled

Both columns

Current Implementation

select count(distinct(ARRAY_TO_STRING([TO_JSON_STRING(id), TO_JSON_STRING(time_id)], ''))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 52 sec slot time, 1.02GB shuffled

Optimization to submit: skip the array initialization

select count(distinct(CONCAT(TO_JSON_STRING(id), TO_JSON_STRING(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 28 sec slot time, 1.09GB shuffled

As expected, the small saving from directly casting the timestamp to a string rather than JSON-encoding it carries over here:

select count(distinct(CONCAT(CAST(id as STRING), CAST(time_id as STRING)))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 22 sec slot time, 936MB shuffled

Finally, the significant saving from first transforming the timestamp into an int via unix_micros also carries over to the two-column case. To implement this for all data types, I'd have to come up with a way of optimally string-encoding each data type and handle it in the PR.

select count(distinct(CONCAT(id, unix_micros(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

56 slot seconds.

Conclusion

The problem here remains that BQ does not support count distinct on more than one column, so a single type is required that can represent all the input columns. I think the above quantifies the significant overhead that comes from string-encoding complex data types like TIMESTAMP, but there are future workarounds if this ends up being a problem.

ssabdb commented Jun 30 '24 21:06

Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise.

I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.
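For illustration, something along these lines might do it (hypothetical table and columns; only the non-groupable ARRAY column tags is JSON-encoded, NULL handling elided):

-- hypothetical table my_dataset.events with groupable columns id, country and an ARRAY column tags
SELECT COUNT(DISTINCT CONCAT(CAST(id AS STRING), country, TO_JSON_STRING(tags)))
FROM my_dataset.events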

cpcloud commented Jun 30 '24 23:06

> Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise.
>
> I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.

Great. Will push an updated version today with the minor performance improvement.

Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need to distinct more than a single column and each column is of the same groupable type, you still need to combine them. I agree you could skip string encoding for strings themselves, and JSON encoding is a catch-all.

ssabdb commented Jul 01 '24 07:07

> Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise. I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.

> Great. Will push an updated version today with the minor performance improvement.

No need to do that in this PR. I'd like to hear from @tswast before merging, but I think we can address performance concerns (to the extent possible) in a follow-up (or never if we don't hear about them!).

> Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need to distinct more than a single column and each column is of the same groupable type, you still need to combine them. I agree you could skip string encoding for strings themselves, and JSON encoding is a catch-all.

Yep! On second thought I'm not sure you should do any additional work here until we hear from folks whose workflows are limited by the performance of this operation. The fact of the matter is that people are already working around the lack of support for this in BigQuery, or they are using approximate alternatives, which we already support, so this is at base an improvement.

cpcloud commented Jul 01 '24 12:07

Fine by me. I've removed the redundant array initialization in favour of a simple concat and left it at that, which itself saves a bit of time in the profiling above. Otherwise I think it's ready for review.

ssabdb commented Jul 01 '24 13:07

> Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

We use the same workaround in BigQuery DataFrames for some operations; indeed, as @ssabdb says, there are some types that aren't groupable, so there isn't a great way around those.

That said, TO_JSON_STRING is slower than some other conversion methods. I would recommend borrowing our implementation here, which uses more specific methods when available: https://github.com/googleapis/python-bigquery-dataframes/blob/6d947a2b2930cd34faf39e920711d0330b8a5651/bigframes/core/compile/default_ordering.py#L36-L50

Or maybe we update cast to support these workarounds for types that aren't directly castable to string normally?
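As an illustration of what such type-specific conversions could look like in SQL (hypothetical table and columns, and not the code at the link above): plain scalars can be cast directly, bytes can go through TO_HEX, timestamps through UNIX_MICROS (per the benchmarks above), with TO_JSON_STRING kept only for nested types:

SELECT COUNT(DISTINCT CONCAT(
  CAST(int_col AS STRING),              -- simple scalars: direct cast
  TO_HEX(bytes_col),                    -- bytes: hex rather than JSON
  CAST(UNIX_MICROS(ts_col) AS STRING),  -- timestamps: integer micros, then string
  TO_JSON_STRING(struct_col)            -- nested types: still need JSON encoding
))
FROM my_dataset.my_table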

tswast commented Jul 01 '24 22:07

The workaround sounds good to me. I think that can be done in a follow up! @ssabdb If you're feeling up for that, would be greatly appreciated!

cpcloud commented Jul 01 '24 22:07

I'll fix any xfailing tests here and then merge.

cpcloud commented Jul 01 '24 22:07

Ok, fixed up the xfails and implemented the count distinct star with filter case.

cpcloud commented Jul 01 '24 22:07

The tricky bit is that COUNT(DISTINCT ...) can't use the usual aggregation filter syntax, so you have to do the "NULL out the value unless the filter is true" thing.
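For the running example table, that works out to something like the following sketch (the NULL values are simply ignored by COUNT(DISTINCT ...)):

SELECT COUNT(DISTINCT IF(country = 'TR', CONCAT(TO_JSON_STRING(id), TO_JSON_STRING(time_id)), NULL))
FROM my_bq_input_dataset.generated_array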

cpcloud commented Jul 01 '24 22:07

BigQuery is passing:

…/ibis on  ssabdb/main:main is 📦 v9.1.0 via 🐍 v3.10.14 via ❄️   impure (ibis-3.10.14-env)
❯ pytest -m bigquery -n 8 --dist loadgroup -q
bringing up nodes...
xxsssssssssssssssssssss...sssssssssssssssssssssssssss.x...x.x....x..x.....x.......x.....xx...x...x...x.......xx..x.......xx.x..x...........x...x......x.x....x.x.............................x......x [  9%]
......x........x...............xxx.xxxxxxx....x....x...xxxx.x.x.xx........x.xx.xxx..x..xxxx..xx.x.x.x..xxxxxxxxxxx.x..xxx..x.x.xxx......x......x...x..........xx.x..x..x..x.........x..............x. [ 19%]
.......x...x....x.........x..............................x.x............x.x............x.........x..........x.xx............x...................s..s......s.......................................... [ 29%]
......................................x.............s.........................x..............xsx.s..x.......s.........x..x...x........sx.....x........x........x.x....x..x........................... [ 39%]
..xx......x.......xx.x.x...x..............x..x.............x.....x......xxx...x........xx.......xxx....x.x........x..........x................x.xxxxxx...xx...xxxxxxxxx.x..xxxxxxxxxxx.xx.xxxxxxxxxxx [ 49%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xx................x...x.....................xx..................x.....................x...................x.. [ 59%]
x........................................x........x........................................................x................................x..............x......x...x..x..xxx...................... [ 69%]
xx.......xx........x................x.......................x..xx..x..x.x......x....x..........x..x..x....xx..x................x..x......xx........xx.......x......xx...................x............ [ 79%]
.........................................x..x.........................................................................................................x.............................................. [ 89%]
....................................................................................................................................................x................................................ [ 99%]
.............                                                                                                                                                                                         [100%]
1592 passed, 56 skipped, 335 xfailed in 688.16s (0:11:28)

cpcloud commented Jul 01 '24 23:07

Thanks @ssabdb, keep 'em coming!

cpcloud commented Jul 01 '24 23:07