python-bigquery-pandas
ENH: option for efficient memory use by downcasting
We use pandas-gbq a lot for our daily analyses. It is known that memory consumption can be a pain, see e.g. https://www.dataquest.io/blog/pandas-big-data/
I have started to write a patch, which could be integrated into an enhancement for `read_gbq` (rough idea, details TBD):
- Provide a boolean `optimize_memory` option
- If `True`, the source table is inspected with a query to get the min, max, presence of nulls, and percentage of unique values for INTEGER and STRING columns, respectively
- When calling `to_dataframe`, this information is passed to the `dtypes` option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas `category` type below some threshold (less than 50% unique values), as sketched below
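A minimal sketch of the kind of downcasting decision this implies (the function name, arguments, and the 50% threshold are illustrative, not a final API):

```python
import numpy as np

def choose_dtype(bq_type, col_min=None, col_max=None, has_nulls=False,
                 unique_fraction=1.0, category_threshold=0.5):
    """Pick a pandas/numpy dtype from simple column statistics (illustrative only)."""
    if bq_type == "STRING":
        # low-cardinality string columns compress well as pandas 'category'
        return "category" if unique_fraction < category_threshold else "object"
    if bq_type == "INTEGER":
        if has_nulls or col_min is None or col_max is None:
            return "Int64"  # pandas nullable integer; plain numpy ints can't hold NULLs
        # smallest unsigned type first, then signed types for negative ranges
        for np_type in (np.uint8, np.uint16, np.uint32,
                        np.int8, np.int16, np.int32):
            info = np.iinfo(np_type)
            if info.min <= col_min and col_max <= info.max:
                return np_type
        return np.int64
    return None  # leave FLOAT, TIMESTAMP, etc. to the default conversion
```

The resulting `{column: dtype}` mapping would then be handed to `to_dataframe(dtypes=...)`.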
I already have a working monkey-patch, which is still a bit rough. If there is enough interest I'd happily make it more robust and submit a PR. Would be my first significant contribution to an open source project, so some help and feedback would be appreciated.
Curious to hear your views on this.
Thanks for the suggestion @dkapitan
Interesting ideas - what's the rationale for not doing that every time? I haven't used the new int dtype much; my understanding is that categoricals generally dominate numpy object strings - it's basically dictionary encoding.
Either way, I would definitely be keen to add these as options. Explicit > implicit, and I think that having the library behind the scenes send additional queries and self-optimize would be too much magic, at least to start. What are your thoughts?
@max-sixty Thanks for the prompt response, motivates me to get this going not just for ourselves. Agree that this feature could be used more generically, and that for the first iteration we should explicitly separate the steps (rather than opaque magic). So here's my user-story:
Enhancements
- `GbqConnector.optimize_schema(table_id, threshold_strings=0.5, default_float=np.float32)`:
  i) reflects on an existing table or query results to determine the current SQL types
  ii) does the calculation in standard SQL to determine the optimal pandas dtypes (see the query sketch after this list)
  iii) returns the dtypes as a dict, possibly together with the statistics from ii) for inspection
- Add a `read_gbq(dtypes=None)` option, for passing the result of iii) to `google.cloud.bigquery.job.QueryJob.to_dataframe(dtypes=dtypes)`
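For step ii), a single profiling query in standard SQL could gather the needed statistics. A rough sketch (the helper name, the exact statistics, and the example column names are assumptions, not the implemented code):

```python
def build_profile_sql(table_id, integer_columns, string_columns):
    """Build one standard SQL query that profiles INTEGER and STRING columns (sketch)."""
    selects = []
    for col in integer_columns:
        selects += [
            "MIN({c}) AS {c}_min".format(c=col),
            "MAX({c}) AS {c}_max".format(c=col),
            "COUNTIF({c} IS NULL) AS {c}_null_count".format(c=col),
        ]
    for col in string_columns:
        # fraction of distinct values decides whether 'category' is worthwhile
        selects.append(
            "APPROX_COUNT_DISTINCT({c}) / COUNT(*) AS {c}_unique_fraction".format(c=col)
        )
    return "SELECT {} FROM `{}`".format(",\n       ".join(selects), table_id)

# e.g. (column names are illustrative):
# build_profile_sql("bigquery-public-data.baseball.games_post_wide",
#                   ["attendance"], ["homeTeamName"])
```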
Design considerations and focus for the first version
- `optimize_schema` only on existing tables; reflection of dynamic query results requires more engineering (e.g. referencing the temporary table in the GBQ cache)
- Only optimize non-nested, non-repeated columns. Pandas can't handle these anyhow, so any nested and/or repeated columns will be read as a dict in an object column
- Logic for downcasting `INTEGER` and `STRING`. We need to make a trade-off in the null-handling of `INTEGER`, since only `pandas.Int64` is nullable and numpy `int`s aren't. This is actually quite significant: lots of coded/enumerated data often fits in `uint8` or `uint16`, but then we need to see how we handle nulls (use sentinel values?!); see the small illustration after this list
- FLOAT can be set to a default of `np.float32` without inspection. For almost all my use cases I don't need a larger mantissa or exponent.
- Users need to explicitly invoke `optimize_schema` and pass the result in `dtypes`. In future we may want to consider `to_dataframe(optimize=True)`
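The null-handling trade-off from the list above in a few lines (a small illustration, not part of the proposed code):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                  # float64: a plain numpy int column cannot hold the null
print(s.astype("Int64").dtype)  # Int64: pandas nullable integer (values plus validity mask)

# a numpy uint8 column would be far smaller per value, but only works without nulls,
# or with a sentinel value (here 255) standing in for NULL:
sentinel = s.fillna(255).astype(np.uint8)
print(sentinel.dtype)           # uint8
```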
Feel free to add/modify so we have a clear goal for the first iteration. Have started with the how-to for contributors, my fork is at https://github.com/dkapitan/pandas-gbq. Will let you know when I have done my first commit on the `optimize-memory-usage` branch.
Explicit > implicit, and I think that having the library behind the scenes send additional queries and self-optimize would be too much magic, at least to start.
I agree. Especially since it's not supported by the BigQuery team to query anonymous query results tables.
I propose we add a `dtypes` argument to pandas-gbq (passing through to that of google-cloud-bigquery) to support this use case. Maybe we make a public function that gets a memory-optimal schema via these queries, whose result can be passed as the `dtypes` argument? A rough usage sketch is below.
Oops, I should read more closely, I think I just proposed your user story. :-)
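For concreteness, a hedged sketch of that pass-through. The `dtypes` option on `QueryJob.to_dataframe` already exists in google-cloud-bigquery; the `read_gbq(dtypes=...)` call is the proposal and does not exist yet, and the project/table/column names are made up:

```python
import numpy as np
from google.cloud import bigquery

sql = "SELECT player_id, score FROM `my-project.stats.games`"  # hypothetical table

# today: per-column dtypes when converting a finished query job
client = bigquery.Client(project="my-project")
df = client.query(sql).to_dataframe(dtypes={"score": np.float32})

# proposed pandas-gbq equivalent (not implemented yet):
# import pandas_gbq
# df = pandas_gbq.read_gbq(sql, project_id="my-project", dtypes={"score": np.float32})
```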
Thanks for the clear plan @dkapitan (I've just returned from a conference so thanks in advance for your patience)
Can we clarify whether the data itself influences whether we'd want these - e.g. why not always use `pandas.Int64`, or categorical for strings?
If this is something that doesn't depend on the underlying data, we could skip the work behind `GbqConnector.optimize_schema`. We could add the `dtypes` option first, and then potentially promote it to the default after more experience.
The blog post recommends using smaller types where possible.
The way this is implemented in `google-cloud-bigquery`, I fear you'll still need quite a bit of temporary memory (which should get freed after dataframe construction), since it converts the BQ response JSON/protos into Python objects first. I'm working on improving this situation with the BQ Storage API via a custom Avro parser (https://github.com/googleapis/google-cloud-python/issues/7805), but even then we'll use the most natural dtype first (e.g. pandas int64 for BigQuery INTEGER columns). Using a smaller dtype will require a copy operation, which will slow the parsing down but save steady-state memory (while still requiring enough temporary memory to fit the column as int64). I guess the Avro parser could use the requested dtype to avoid this, but then you risk either out-of-bounds errors or silent overflow.
@tswast @max-sixty
So the good news is that I have got a working version, see this commit. Basic testing seems OK (haven't written all the tests yet), but when actually converting the query results to a dataframe I'm getting weird results. It seems that, depending on the size of the QueryJob.results, the conversion from STRING to category fails; it only works for small result sets. See attached .ipynb. I have tried to trace this bug in the source code, but can't find anything. Your help would be much appreciated.
Then, regarding your questions:
- Downcasting INTEGER to smaller numpy int types significantly reduces the memory footprint
- Indeed, during load time there will be a temporary peak in memory usage. It would be best if the type conversion were done directly from Avro to pandas, but you do risk errors as mentioned. I've actually come across that during testing: integer data got all messed up with negative numbers because I'd enforced an int type that was too small.
Finally, on a practical note: I don't have much experience using Git with branches (only used basic Git stuff). I'm afraid I've made a mess of my branch already (too many merges and rebases whilst trying to figure out how it works). Please let me know if this is OK (assuming it can be fixed during a PR). If not, please let me know and I will start with a clean fork.
Curious to hear your thoughts.
the conversion from STRING to category fails; it only works for small result sets.
Probably because the downcasting happens for each page, so if there's a page without all the possible categories, the concat will fail.
Please let me know if this is OK (assuming it can be fixed during a PR).
I can probably clean it up if you mail the PR and check the "allow edits from maintainers" box. Don't worry about it too much.
the conversion from STRING to category fails; it only works for small result sets.
Probably because the downcasting happens for each page, so if there's a page without all the possible categories, the concat will fail.
Could you file an issue to `google-cloud-python` with this for me? Maybe there's a way we can delay the cast. I hadn't actually considered the categorical type when I added the `dtype` option.
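To make the per-page failure mode concrete, a plain-pandas illustration (a sketch of the symptom, not the library's actual code path): two chunks categorized independently end up with different category sets, and `union_categoricals` is one way to delay/merge the cast.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# two "pages" converted to category independently get different category sets
page1 = pd.Series(["CHC", "NYM", "CHC"], dtype="category")
page2 = pd.Series(["LAD", "LAD"], dtype="category")

print(pd.concat([page1, page2]).dtype)       # object: the categorical dtype is lost
merged = union_categoricals([page1, page2])  # merge the category sets across pages
print(pd.Series(merged).dtype)               # category
```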
Great @dkapitan
My experience is that pandas doesn't deal that well with non-standard types (e.g. smaller floats are upcast on some operations). Combined with our current parsing, I'm not sure it'll save much memory, consistent with @tswast's thoughts above.
If there's any tradeoff in prioritization between @dkapitan's suggestions, I'd prioritize 1) nullable ints and 2) categoricals for strings; those would be immediately impactful.
Maybe there's a way we can delay the cast.
💯
Thanks for all the feedback @max-sixty @tswast.
I am not sure how to proceed from here. Below are some extra tests, thoughts and questions. Please let me know what you think is best to do next.
Persistence of downcasted types
My experience is that pandas doesn't deal that well with non-standard types (e.g. smaller floats are upcast on some operations).
Have done a simple test (see below), and for now I'd say we stick to the following:
- if we use `np.float32` throughout, upcasting by pandas should be reduced since all float columns have been coerced to `np.float32`. Of course there may be less frequently used methods in pandas which would still upcast to `float64`, but I expect this would reduce memory usage in most use cases
- for now I'd prefer to stick to the pandas nullable `Int64` for integer columns containing null values, at the cost of more memory. I expect users requiring more memory reduction would do further optimisation themselves, replacing nulls with sentinel values and downcasting from there. Integer columns with no nulls are downcast to the smallest size
# test upcasting by pandas
import numpy as np
import pandas as pd

df = (pd.DataFrame({'_float64': [1.1, 2.2, None, 4.4],
                    '_float32': [1.1, None, 3.3, 4.4],
                    '_float16': [1.1, 2.2, 3.3, 4.4],
                    '_int64': [1, None, 3, 4],
                    '_int32': [1, 2, -3, 4],
                    '_uint16': [1, 2, 3, 4],
                    })
      .astype({'_float64': 'float',
               '_float32': np.float32,
               '_float16': np.float16,
               '_int64': 'Int64',
               '_int32': np.int32,
               '_uint16': np.uint16,
               })
      )

df['_sum_floats_16'] = df._float16 + df._float16
df['_sum_floats_32'] = df._float32 + df._float32
df['_sum_floats_mixed'] = df._float16 + df._float32
df['_sum_ints_mixed'] = df._int32 + df._uint16

print(pd.concat([df.dtypes, df.memory_usage()], axis=1, sort=False)
        .rename(columns={0: 'dtype', 1: 'mem_use'}))
# output
dtype mem_use
_float64 float64 32
_float32 float32 16
_float16 float16 8
_int64 Int64 36
_int32 int32 16
_uint16 uint16 8
_sum_floats_16 float16 8
_sum_floats_32 float32 16
_sum_floats_mixed float32 16
_sum_ints_mixed int32 16
Index NaN 80
Peak memory consumption
I have used the memory-profiler module to plot memory usage over time. Executing `mprof run mem_test.py` yields the following graphs, showing that peak memory use is 3.5x the post-load level.
[graph: memory usage with optimized dataframe (`df_optimized`, 0.8 MB)]
[graph: memory usage with non-optimized dataframe (`df`, 4.6 MB)]
Although `df_optimized` is reduced to 17% of the original `df` size for this (small) dataset, the reduction is negligible relative to the total memory footprint of the script. The impact may be larger for large datasets, but then it defeats the whole purpose of this exercise, since the peak memory usage is likely to exceed the system's RAM (assuming something like 16 GB for a data science machine/laptop).
#!/usr/bin/env python
# coding: utf-8
# mem_test.py
from pandas_gbq.gbq import read_gbq, GbqConnector, optimize_dtypes
from pandas_gbq.schema import (
    select_columns_by_type,
    generate_sql,
    _determine_int_type,
    _determine_string_type,
)
import pandas as pd
import numpy as np

default_project_id = "mediquest-cloud"


def count_dtypes(df):
    return df.dtypes.apply(lambda x: x.name).value_counts()


def run_mem_test():
    for table in [
        "baseball.games_post_wide",
        # 'ml_datasets.credit_card_default',
    ]:
        table_reference = ".".join(["bigquery-public-data", table])
        optimized = optimize_dtypes(
            table_reference, default_project_id=default_project_id
        )
        # df_original = read_gbq(f'SELECT * FROM `{table_reference}`', use_bqstorage_api=True)
        # df_optimized = read_gbq(f'SELECT * FROM `{table_reference}`', use_bqstorage_api=True, dtypes=optimized['dtypes'])
        # print('Memory usage optimized dataframe: {:.1f}% of original dataframe'.
        #       format(100*df_optimized.memory_usage().sum()/df_original.memory_usage().sum()))

        # strange behaviour: string columns are converted to category when 'limit 4000' in query
        sql = "select * from `bigquery-public-data.baseball.games_post_wide` limit {}"
        print("Reading from GBQ: {}".format(table))
        df1 = read_gbq(sql.format(4000), dtypes=optimized["dtypes"])
        print(count_dtypes(df1))
        print(
            "Memory use optimized table {}: {}".format(
                table, df1.memory_usage().sum()
            )
        )


if __name__ == "__main__":
    run_mem_test()
STRING -> category conversion
I will await the solution for BigQuery GH8044 for the conversion of strings to categories.
Interesting: I have played around a bit more with `mprof` to see what happens during load time. Taking `bigquery-public-data.baseball.games_post_wide` as test data, I have compared memory use during loading (all with no dtype optimisation):
- via `read_gbq(use_bqstorage_api=False)`
- via `read_gbq(use_bqstorage_api=True)`
- via `fastavro`
- via `pd.read_csv(compression='gzip')`
- via `pd.read_parquet()`, using the gzip-imported dataframe exported to Parquet as the source file
Note that:
- `bqstorage_api` has a smaller overshoot than the regular `gbq.client`
- Parquet is the only intermediate format that doesn't overshoot memory consumption
- testing other Parquet files (exports from a database), type conversion is more robust, e.g. `int32` is preserved when exporting from SQL database -> Parquet -> pandas (a quick round-trip check is shown below)
So following Wes McKinney's long-term plan, using Parquet as columnar data middleware seems the way to go. That all depends on adding a Parquet export function to BigQuery, including efficient type casting from BigQuery's generic `FLOAT64` / `INT64` to smaller types.
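The dtype round-tripping mentioned in the last bullet can be checked in a couple of lines (a minimal sketch; assumes pyarrow is installed and the file path is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "runs": np.array([1, 2, 3], dtype=np.int32),
    "avg": np.array([0.301, 0.275, 0.312], dtype=np.float32),
})
df.to_parquet("roundtrip.parquet")                  # written via pyarrow
print(pd.read_parquet("roundtrip.parquet").dtypes)  # int32 and float32 come back intact
```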
Thanks for the detailed graphs! I bet we could decrease the overshoot in the case of downloading the whole table / query results with `google-cloud-bigquery` and `google-cloud-bigquery-storage` by `del`ing pages after they are processed. From the graph, it's pretty clear that the resources aren't being released until `to_dataframe()` exits.
@max-sixty @tswast
Just wanted to check whether it's useful to submit a PR with the current half-working version. I expect it may take some time before BigQuery GH8044 is solved, and I would like to prevent the branch from getting too far out of sync.
Please let me know what's best.
Definitely open to adding the `dtypes` argument sooner rather than later, since it has other applications, too.
I'm not too worried about the extra method for finding a minimal schema getting out of sync, since there are fewer chances of git conflicts in that case. We probably want to wait to add it until the issues in `google-cloud-bigquery` are resolved, otherwise we'll get reports that the peak memory is still too high.
Thanks for your hard work and patience @dkapitan . Those are superb graphs and I think exactly the right foundation for how to improve performance.
In the short term: agree with @tswast - definitely dtypes would be a great addition.
In the longer term: I think there are a couple of paths we can take:
- Try and optimize some of these approaches - e.g. ensure intermediate pages are dropped during the parse
- Move this parsing out of python
I think the latter is likely to be a fuller solution - it will be more performant and less likely to have edge cases. It somewhat depends on either a) a good Avro parser that can build numpy-like arrays, or b) alternative formats to Avro being available from the bq_storage API.
If there are easy changes to make from the former, that's a clear win, and those charts suggest there are some moderate improvements we can make...
@max-sixty @tswast Thanks for the feedback. Will hang on to the PR for now and await the results from the related issues. Please let me know if there's anything I can do in the meantime.
Thanks @dkapitan !
@tswast @max-sixty So what's the outlook for this issue given the new pyarrow functionality in the GBQ Storage API? Looks like you've done a great job, Tim - I was impressed reading “Announcing google-cloud-bigquery Version 1.17.0” by Tim Swast https://link.medium.com/2ZOZ5tfUh2
Given pyarrow, I am inclined to start from scratch and rewrite the downcasting feature specifically (only?) for the Arrow format (for now anyway). I expect the memory overshoots to be solved, and am hoping that downcasting is a lot more straightforward since there are no intermediate parsing steps in combining the data.
What do you think would be a good way to move forward with this issue?
In fact, looking at https://arrow.apache.org/docs/python/pandas.html#arrow-pandas-conversion, it all boils down to how the GBQ Storage API does the type conversion from GBQ to Arrow.
If Int64 is already downcast to the smallest int type, then this issue is solved.
If not, the proposed functionality in this issue could be implemented by post-processing the Arrow table, prior to calling `to_dataframe()`, using the same logic (largest numbers, cardinality of string columns) I have written previously.
My guess is that the downcasting isn't implemented in the GBQ --> Arrow conversion (haven't looked at the source code yet).
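If it isn't, a hedged sketch of what that post-processing could look like before `to_pandas()`/`to_dataframe()` (the downcast target and threshold are illustrative; a real version would reuse the largest-number and cardinality logic mentioned above, and the exact pyarrow calls may differ by version):

```python
import pyarrow as pa

def downcast_arrow_table(table: pa.Table, string_threshold: float = 0.5) -> pa.Table:
    """Downcast int64 columns and dictionary-encode low-cardinality strings (sketch)."""
    for i, field in enumerate(table.schema):
        column = table.column(i)
        if pa.types.is_int64(field.type):
            # illustrative: always cast to int32; a real version would inspect min/max first
            table = table.set_column(i, field.name, column.cast(pa.int32()))
        elif pa.types.is_string(field.type):
            unique_fraction = len(column.unique()) / max(len(column), 1)
            if unique_fraction < string_threshold:
                # dictionary-encoded columns become pandas 'category' in to_pandas()
                table = table.set_column(i, field.name, column.dictionary_encode())
    return table
```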
This seems relevant to this discussion, as to how to handle strings: https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/
This issue has not been active since 2019. I love the idea, but based on current workloads, etc., I don't see us picking this up anytime soon. Closing as will not fix.