python-bigquery-pandas

ENH: option for efficient memory use by downcasting

dkapitan opened this issue 5 years ago • 23 comments

We use pandas-gbq a lot for our daily analyses. It is known that memory consumption can be a pain, see e.g. https://www.dataquest.io/blog/pandas-big-data/

I have started to write a patch, which could be integrated into an enhancement for read_gbq (rough idea, details TBD):

  • Provide boolean optimize_memory option
  • If True, the source table is inspected with a query to get the min, max, and presence of nulls for INTEGER columns, and the fraction of unique values for STRING columns
  • When calling to_dataframe, this information is passed to the dtypes option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas category type below some uniqueness threshold (e.g. less than 50% unique values); a sketch follows below
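
A rough sketch of what that inspection query could look like (a hypothetical helper, details TBD; COUNTIF is BigQuery standard SQL, and integer division with / returns FLOAT64):

# hypothetical sketch: build one standard SQL query that collects the
# statistics needed to pick a dtype per column
def profiling_sql(table_id, int_cols, str_cols):
    int_stats = ", ".join(
        "MIN({c}) AS {c}_min, MAX({c}) AS {c}_max, "
        "COUNTIF({c} IS NULL) AS {c}_nulls".format(c=c)
        for c in int_cols
    )
    str_stats = ", ".join(
        "COUNT(DISTINCT {c}) / COUNT(*) AS {c}_unique_frac".format(c=c)
        for c in str_cols
    )
    stats = ", ".join(s for s in (int_stats, str_stats) if s)
    return "SELECT {} FROM `{}`".format(stats, table_id)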

I already have a working monkey-patch, which is still a bit rough. If there is enough interest I'd happily make it more robust and submit a PR. Would be my first significant contribution to an open source project, so some help and feedback would be appreciated.

Curious to hear your views on this.

dkapitan avatar May 15 '19 17:05 dkapitan

Thanks for the suggestion @dkapitan

Interesting ideas - what's the rationale for not doing this every time? I haven't used the new nullable int dtype much; my understanding is that categoricals generally dominate numpy object strings, since a categorical is basically dictionary encoding.
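
For illustration, the dictionary-encoding effect is easy to see:

import pandas as pd

s = pd.Series(["red", "green", "blue"] * 100000)
print(s.memory_usage(deep=True))                     # object strings: megabytes
print(s.astype("category").memory_usage(deep=True))  # small int codes plus 3 category values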

Either way, I would definitely be keen to add these as options. Explicit > implicit, and I think that having the library behind the scenes send additional queries and self-optimize would be too much magic, at least to start. What are your thoughts?

max-sixty avatar May 15 '19 22:05 max-sixty

@max-sixty Thanks for the prompt response; it motivates me to get this going not just for ourselves. Agreed that this feature could be used more generically, and that for the first iteration we should explicitly separate the steps (rather than opaque magic). So here's my user story:

Enhancements

  • GbqConnector.optimize_schema(table_id, threshold_strings=0.5, default_float=np.float32):

    1. reflects on an existing table or query results to determine the current SQL types
    2. does the calculation in standard SQL to determine optimal pandas dtypes
    3. returns the dtypes as a dict, possibly along with a table of the statistics from step 2 for inspection
  • Add a read_gbq(dtypes=None) option for passing the result of step 3 to google.cloud.bigquery.job.QueryJob.to_dataframe(dtypes=dtypes); see the sketch after this list
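
To make steps 2 and 3 concrete, a sketch of the statistics-to-dtypes mapping (the function name and the stats layout are placeholders, not a settled API):

import numpy as np

def dtypes_from_stats(stats, threshold_strings=0.5):
    """Map per-column statistics to a dtypes dict for to_dataframe()."""
    dtypes = {}
    for col, s in stats.items():
        if s["type"] == "INTEGER":
            if s["nulls"] > 0:
                dtypes[col] = "Int64"  # only the pandas extension type is nullable
            else:
                # smallest numpy (u)int type that fits the observed range
                candidates = ([np.uint8, np.uint16, np.uint32] if s["min"] >= 0
                              else [np.int8, np.int16, np.int32])
                for t in candidates:
                    info = np.iinfo(t)
                    if info.min <= s["min"] and s["max"] <= info.max:
                        dtypes[col] = t
                        break
                else:
                    dtypes[col] = np.uint64 if s["min"] >= 0 else np.int64
        elif s["type"] == "STRING" and s["unique_frac"] < threshold_strings:
            dtypes[col] = "category"
    return dtypes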

Design considerations and focus first version

  • optimize_schema only works on existing tables; reflection of dynamic query results requires more engineering (e.g. referencing the temporary table in the GBQ cache)
  • Only optimize non-nested, non-repeated columns. Pandas can't handle these anyhow, so any nested and/or repeated columns will be read as dicts in an object column
  • Logic for downcasting INTEGER and STRING. We need to make a trade-off in null handling for INTEGER, since only pandas Int64 is nullable; numpy ints aren't. This is actually quite significant: lots of coded/enumerated data fits in uint8 or uint16, but then we need to decide how to handle nulls (sentinel values?!); see the illustration after this list
  • FLOAT can be set to a default of np.float32 without inspection. For almost all my use cases I don't need the larger mantissa or exponent.
  • The user needs to explicitly invoke optimize_schema and pass its result as dtypes. In the future we may want to consider to_dataframe(optimize=True)
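
A minimal illustration of that null-handling constraint:

import numpy as np
import pandas as pd

print(pd.array([1, None, 3], dtype="Int64"))  # nullable pandas extension type: works
try:
    np.array([1, None, 3], dtype=np.int64)    # numpy ints have no null representation
except TypeError as exc:
    print("numpy:", exc)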

Feel free to add/modify so we have a clear goal for the first iteration. I have started with the how-to for contributors; my fork is at https://github.com/dkapitan/pandas-gbq. Will let you know when I have made my first commit on the optimize-memory-usage branch.

dkapitan avatar May 16 '19 09:05 dkapitan

Explicit > implicit, and I think that having the library behind the scenes send additional queries and self-optimize would be too much magic, at least to start.

I agree. Especially since querying the anonymous query-results tables isn't supported by the BigQuery team.

I propose we add a dtypes argument to pandas-gbq (passing through to that of google-cloud-bigquery) to support this use case. Maybe we also add a public function that derives a memory-optimal schema via these queries, whose result can be passed as the dtypes argument?

tswast avatar May 17 '19 16:05 tswast

Oops, I should read more closely, I think I just proposed your user story. :-)

tswast avatar May 17 '19 16:05 tswast

Thanks for the clear plan @dkapitan (I've just returned from a conference so thanks in advance for your patience)

Can we clarify whether the underlying data influences whether we'd want these - e.g. why not always use pandas.Int64, or categorical for strings?

If this is something that doesn't depend on the underlying data, we could skip the work behind GbqConnector.optimize_schema. We could add the dtypes option first, and then potentially promote it to the default after more experience.

max-sixty avatar May 20 '19 19:05 max-sixty

The blog post recommends using smaller types where possible.

The way this is implemented in google-cloud-bigquery, I fear you'll still need quite a bit of temporary memory (which should get freed after dataframe construction), since it converts the BQ response JSON/protos into Python objects first. I'm working on improving this situation with the BQ Storage API via a custom Avro parser (https://github.com/googleapis/google-cloud-python/issues/7805), but even then we'll use the most natural dtype first (e.g. pandas int64 for BigQuery INTEGER columns). Using a smaller dtype will require a copy operation, which will slow parsing down but save steady-state memory (while still requiring enough temporary memory to hold the column as int64). I guess the Avro parser could use the requested dtype to avoid this, but then you risk either out-of-bounds errors or silent overflow.
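
The silent-overflow half of that risk is a one-liner to demonstrate:

import numpy as np

# casting without a range check silently wraps out-of-range values
print(np.array([70000], dtype=np.int64).astype(np.int16))  # [4464]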

tswast avatar May 20 '19 19:05 tswast

@tswast @max-sixty

So the good news is that I have a working version, see this commit. Basic testing seems OK (I haven't written all the tests yet), but when actually converting the query results to a dataframe I'm getting weird results. It seems that, depending on the size of the QueryJob results, the conversion from STRING to category fails; it only works for small result sets. See the attached .ipynb. I have tried to trace this bug in the source code, but can't find anything. Your help would be much appreciated.

Then, regarding your questions:

  • Downcasting INTEGER to smaller numpy int types significantly reduced the memory footprint
  • Indeed, during load time there will be a temporary peak in memory usage. It would be best if the type conversion were done directly from Avro to pandas, but as you mention, you then risk errors. I actually came across that during testing: integer data got all messed up with negative numbers because I'd enforced an int type that was too small.

Finally, on a practical note: I don't have much experience using Git with branches (I've only used basic Git stuff). I'm afraid I've already made a mess of my branch (too many merges and rebases whilst trying to figure out how it works). Please let me know if this is OK (assuming it can be fixed during a PR). If not, I will start with a clean fork.

Curious to hear your thoughts.

pandas-gbq-bugtracing.ipynb.zip

dkapitan avatar May 20 '19 20:05 dkapitan

the conversion from STRING to category fails; it only works for small result sets.

Probably because the downcasting happens for each page, so if there's a page without all the possible categories, the concat will fail.
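
For example, with plain Series, the concatenated result silently degrades when the per-page categories differ:

import pandas as pd

page1 = pd.Series(["a", "a"]).astype("category")  # categories: ['a']
page2 = pd.Series(["b", "b"]).astype("category")  # categories: ['b']
print(pd.concat([page1, page2]).dtype)            # object: the categorical dtype is lost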

tswast avatar May 20 '19 20:05 tswast

Please let me know if this OK (assuming it can be fixed during a PR).

I can probably clean it up if you mail the PR and check the "allow edits from maintainers" box. Don't worry about it too much.

tswast avatar May 20 '19 20:05 tswast

the conversion from STRING to category fails; it only works for small result sets.

Probably because the downcasting happens for each page, so if there's a page without all the possible categories, the concat will fail.

Could you file an issue on google-cloud-python with this for me? Maybe there's a way we can delay the cast. I hadn't actually considered the categorical type when I added the dtype option.

tswast avatar May 20 '19 21:05 tswast

Great @dkapitan

My experience is that pandas doesn't deal that well with non-standard types (e.g. smaller floats are upcast by some operations). Combined with our current parsing, I'm not sure it'll save much memory, consistent with @tswast's thoughts above.

If there's any tradeoff in prioritization among @dkapitan's suggestions: 1) nullable ints and 2) categoricals for strings would be immediately impactful.

max-sixty avatar May 21 '19 18:05 max-sixty

Maybe there's a way we can delay the cast.

💯

max-sixty avatar May 21 '19 18:05 max-sixty

Thanks for all the feedback @max-sixty @tswast.

I am not sure how to proceed from here. Below are some extra tests, thoughts and questions. Please let me know what you think is best to do next.

Persistence of downcasted types

My experience is that pandas doesn't deal that well with non-standard types (e.g. smaller floats are upcast on some operations).

Have done a simple test (see below), and for now I'd say we stick to the following:

  • if we use np.float32 throughout, upcasting by pandas should be reduced, since all float columns have been coerced to np.float32. Of course there may be less frequently used pandas methods that still upcast to float64, but I expect this would reduce memory usage in most use cases
  • for now I'd prefer to stick to pandas' nullable Int64 for integer columns containing null values, at the cost of more memory. I expect users requiring more memory reduction would do further optimisation themselves, replacing nulls with sentinel values and downcasting from there. Integer columns with no nulls are downcast to the smallest size
# test upcasting by pandas
import numpy as np
import pandas as pd

df = (pd.DataFrame({'_float64': [1.1, 2.2, None, 4.4],
                    '_float32': [1.1, None, 3.3, 4.4],
                    '_float16': [1.1, 2.2, 3.3, 4.4],
                    '_int64': [1, None, 3, 4],
                    '_int32': [1, 2, -3, 4],
                    '_uint16': [1, 2, 3, 4],
                    })
      .astype({'_float64': 'float',
               '_float32': np.float32,
               '_float16': np.float16,
               '_int64': 'Int64',
               '_int32': np.int32,
               '_uint16': np.uint16,
               })
      )
df['_sum_floats_16'] = df._float16 + df._float16
df['_sum_floats_32'] = df._float32 + df._float32
df['_sum_floats_mixed'] = df._float16 + df._float32
df['_sum_ints_mixed'] = df._int32 + df._uint16
print(pd.concat([df.dtypes, df.memory_usage()], axis=1, sort=False)
      .rename(columns={0: 'dtype', 1: 'mem_use'}))

# output
                     dtype  mem_use
_float64           float64       32
_float32           float32       16
_float16           float16        8
_int64               Int64       36
_int32               int32       16
_uint16             uint16        8
_sum_floats_16     float16        8
_sum_floats_32     float32       16
_sum_floats_mixed  float32       16
_sum_ints_mixed      int32       16
Index                  NaN       80

Peak memory consumption

I have found the memory-profiler module useful for plotting memory usage over time. Running mprof run mem_test.py (followed by mprof plot) yields the following graphs, showing that peak memory use is 3.5× the post-load level.

[Graph: memory usage with optimized dataframe (df_optimized, 0.8 MB)]

[Graph: memory usage with non-optimized dataframe (df, 4.6 MB)]

Although df_optimized is reduced to 17% of the original df size for this (small) dataset, the reduction is negligible relative to the total memory footprint of the script. The impact may be larger for large datasets, but then it defeats the whole purpose of this exercise, since the peak memory usage is likely to exceed the system's RAM (assuming something like 16 GB RAM for a data science machine/laptop).

#!/usr/bin/env python
# coding: utf-8
# mem_test.py

from pandas_gbq.gbq import read_gbq, GbqConnector, optimize_dtypes
from pandas_gbq.schema import (
    select_columns_by_type,
    generate_sql,
    _determine_int_type,
    _determine_string_type,
)
import pandas as pd
import numpy as np


default_project_id = "mediquest-cloud"


def count_dtypes(df):
    return df.dtypes.apply(lambda x: x.name).value_counts()


def run_mem_test():
    for table in [
        "baseball.games_post_wide",
        #              'ml_datasets.credit_card_default',
    ]:
        table_reference = ".".join(["bigquery-public-data", table])
        optimized = optimize_dtypes(
            table_reference, default_project_id=default_project_id
        )
    #     df_original = read_gbq(f'SELECT * FROM `{table_reference}`', use_bqstorage_api=True)
    #     df_optimized = read_gbq(f'SELECT * FROM `{table_reference}`', use_bqstorage_api=True, dtypes=optimized['dtypes'])
    #     print('Memory usage optimized dataframe: {:.1f}% of original dataframe'.
    #           format(100*df_optimized.memory_usage().sum()/df_original.memory_usage().sum()))

    # strange behaviour: string columns are converted to category when 'limit 4000' in query
    sql = "select * from `bigquery-public-data.baseball.games_post_wide` limit {}"
    print("Reading from GBQ: {}".format(table))
    df1 = read_gbq(sql.format(4000), dtypes=optimized["dtypes"])
    print(count_dtypes(df1))
    print(
        "Memory use optimized table {}: {}".format(
            table, df1.memory_usage().sum()
        )
    )


if __name__ == "__main__":
    run_mem_test()

STRING -> category conversion

I will await the resolution of BigQuery GH8044 for the conversion of strings to categories.

dkapitan avatar May 22 '19 07:05 dkapitan

Interesting: I have played around a bit more with mprof to see what happens during load time. Taking bigquery-public-data.baseball.games_post_wide as test data, I compared memory use during loading (all with no dtype optimisation):

  • via read_gbq(use_bqstorage_api=False)
  • via read_gbq(use_bqstorage_api=True)
  • via fastavro
  • via pd.read_csv(compression='gzip')
  • via pd.read_parquet(), using the gzip-imported dataframe exported as parquet as the source file

[mprof graphs for each method omitted]

Note that:

  • bqstorage_api has smaller overshoot than regular gbq.client
  • parquet is the only intermediate format that doesn't overshoot memory consumption
  • testing other parquet files (exports from a database), type conversion is more robust, e.g. int32 is preserved when exporting from SQL database -> parquet -> pandas.

So, following Wes McKinney's long-term plan, using parquet as columnar data middleware seems the way to go. That all depends on adding a parquet export function to BigQuery, including efficient type casting from BigQuery's generic FLOAT64/INT64 to smaller types.

dkapitan avatar May 22 '19 09:05 dkapitan

Thanks for the detailed graphs! I bet we could decrease the overshoot when downloading whole tables / query results with google-cloud-bigquery and google-cloud-bigquery-storage by deleting pages after they are processed. From the graph, it's pretty clear that the resources aren't released until to_dataframe() exits.
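
Something along these lines, as a hypothetical sketch (page_to_frame stands in for the actual per-page conversion; this is not the real google-cloud-bigquery code):

import pandas as pd

def pages_to_dataframe(pages, dtypes):
    frames = []
    for page in pages:
        frames.append(page_to_frame(page, dtypes))  # hypothetical per-page conversion
        del page  # drop the raw page payload (rows/protos) as soon as it's converted
    return pd.concat(frames, ignore_index=True)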

tswast avatar May 22 '19 23:05 tswast

@max-sixty @tswast

Just wanted to check whether it's useful to submit a PR with the current half-working version. I expect it may take some time before BigQuery GH8044 is solved, and I would like to prevent the branch from drifting too far out of sync.

Please let me know what's best.

dkapitan avatar May 29 '19 08:05 dkapitan

Definitely open to adding the dtypes argument sooner rather than later, since it has other applications, too.

I'm not too worried about the extra method for finding a minimal schema getting out of sync, since there are fewer chances of git conflicts in that case. We probably want to wait to add it until the issues in google-cloud-bigquery are resolved; otherwise we'll get reports that peak memory is still too high.

tswast avatar May 29 '19 17:05 tswast

Thanks for your hard work and patience, @dkapitan. Those are superb graphs, and I think exactly the right foundation for how to improve performance.

In the short term: agree with @tswast - definitely dtypes would be a great addition.

In the longer term: I think there are a couple of paths we can take:

  • Try and optimize some of these approaches - e.g. ensure intermediate pages are dropped during the parse
  • Move this parsing out of python

I think the latter is likely to be a fuller solution - it will be more performant and less likely to have edge cases. It somewhat depends on either a) a good Avro parser that can build numpy-like arrays, or b) alternative formats to Avro being available from the bq_storage API.

If there are easy changes to make from the former, that's a clear win, and those charts suggest there are some moderate improvements we can make...

max-sixty avatar May 29 '19 18:05 max-sixty

@max-sixty @tswast Thanks for the feedback. Will hang on to the PR for now and await the results from the related issues. Please let me know if there's anything I can do in the meantime.

dkapitan avatar May 29 '19 21:05 dkapitan

Thanks @dkapitan !

max-sixty avatar May 29 '19 21:05 max-sixty

@tswast @max-sixty So what's the outlook for this issue, given the new pyarrow functionality in the GBQ Storage API? Looks like you've done a great job, Tim; I was impressed reading “Announcing google-cloud-bigquery Version 1.17.0” by Tim Swast https://link.medium.com/2ZOZ5tfUh2

Given pyarrow, I am inclined to start from scratch and rewrite the downcasting feature specifically (only?) for the Arrow format, for now anyway. I expect the memory overshoots to be solved, and I am hoping that downcasting is a lot more straightforward since there are no intermediate parsing steps when combining the data.

What do you think would be a good way to move forward with this issue?

dkapitan avatar Dec 09 '19 21:12 dkapitan

In fact, looking at https://arrow.apache.org/docs/python/pandas.html#arrow-pandas-conversion, it all boils down to how the GBQ Storage API does the type conversion from GBQ to Arrow.

If INT64 is already downcast to the smallest int type there, then this issue is solved.

If not, the functionality proposed in this issue could be implemented by post-processing the Arrow table prior to calling to_dataframe(), using the same logic (largest numbers, cardinality of string columns) I have written previously; see the sketch below.
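
A sketch, assuming a recent pyarrow (with pyarrow.compute.min_max) and handling only non-nested INT64 columns for brevity:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def downcast_ints(table):
    """Cast INT64 columns to the smallest integer type that fits their values."""
    for i, field in enumerate(table.schema):
        if field.type != pa.int64():
            continue
        bounds = pc.min_max(table.column(i)).as_py()  # {'min': ..., 'max': ...}
        if bounds["min"] is None:  # all-null column: nothing to infer
            continue
        for target in (pa.int8(), pa.int16(), pa.int32()):
            info = np.iinfo(target.to_pandas_dtype())
            if info.min <= bounds["min"] and bounds["max"] <= info.max:
                table = table.set_column(
                    i, pa.field(field.name, target), table.column(i).cast(target)
                )
                break
    return table

# e.g. df = downcast_ints(query_job.to_arrow()).to_pandas()

Note that to_pandas() would still turn a downcast column containing nulls back into float64 by default, so the nullable-Int64 question from earlier in this thread remains.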

My guess is that downcasting isn't implemented in the GBQ -> Arrow conversion (I haven't looked at the source code yet).

dkapitan avatar Dec 09 '19 21:12 dkapitan

This seems relevant to this discussion, as to how to handle strings: https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/

dkapitan avatar Dec 16 '19 15:12 dkapitan

This issue has not been active since 2019. I love the idea, but based on current workloads, etc., I don't see us picking this up anytime soon. Closing as will not fix.

chalmerlowe avatar Jan 17 '24 21:01 chalmerlowe