python-bigquery-pandas
ENH: load timestamp data with timezones other than UTC
dataframe.to_csv(
    csv_buffer,
    index=False,
    header=False,
    encoding='utf-8',
    date_format='%Y-%m-%d %H:%M')
No harm in adding %z here for timezone.
My current workaround is df[col] = df[col].dt.tz_localize('Australia/Sydney').dt.tz_convert('UTC') (note the second .dt accessor; Series.tz_convert operates on the index, not the values).
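That workaround can be sketched as follows (the column name, zone, and timestamps are illustrative, not from the library):

```python
import pandas as pd

# Naive timestamps that actually represent Sydney local time (AEST, UTC+10 in June).
df = pd.DataFrame({"ts": pd.to_datetime(["2012-06-20 13:00:00"])})

# Attach the real zone, then convert to UTC before handing the frame to to_gbq.
df["ts"] = df["ts"].dt.tz_localize("Australia/Sydney").dt.tz_convert("UTC")

print(df["ts"].iloc[0])  # 2012-06-20 03:00:00+00:00
```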
Will you accept a pull request for this one?
Hi @jwBoral - thanks for the issue report.
Can you offer some more details? Is this loading into python? Into BQ? Do you have a repro example?
This is from the load.py script, loading into BQ, in the function encode_chunk:
def encode_chunk(dataframe):
    """Return a file-like object of CSV-encoded rows.

    Args:
        dataframe (pandas.DataFrame): A chunk of a dataframe to encode
    """
    csv_buffer = six.StringIO()
    dataframe.to_csv(
        csv_buffer,
        index=False,
        header=False,
        encoding='utf-8',
        date_format='%Y-%m-%d %H:%M:%S.%f')
I suggest this - '%Y-%m-%d %H:%M:%S.%f%z'
It's dumping the dataframe to a CSV to upload to BQ. Even if you set the correct timezone information in the dataframe, the offset is not captured in this CSV dump, so it is not captured in the timestamp column in BQ during the load. (BQ then defaults the value to UTC, which isn't correct if the timezone information in the dataframe was something other than UTC.)
The df datetime is: 2012-03-20 13:00:00+10:00
The dump to CSV is: 2012-03-20 13:00:00 (WITHOUT TZ OFFSET)
The BQ timestamp representation is: 2012-03-20 13:00:00+00:00 (defaults back to UTC)
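Since to_csv applies date_format via strftime, the effect of adding %z can be shown directly on a single Timestamp (the timestamp here is illustrative):

```python
import pandas as pd

ts = pd.Timestamp("2012-06-20 13:00:00", tz="Australia/Sydney")  # AEST, UTC+10

# The current format string drops the offset entirely:
print(ts.strftime("%Y-%m-%d %H:%M:%S.%f"))    # 2012-06-20 13:00:00.000000

# Adding %z keeps it:
print(ts.strftime("%Y-%m-%d %H:%M:%S.%f%z"))  # 2012-06-20 13:00:00.000000+1000
```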
Great! Thanks for the update
If BQ supports this from CSV, this would be a very welcome addition. (If you're not sure, it's worth confirming that first.)
I think we'll have the best luck if we convert to UTC before serializing to CSV. I've found the time zone parsing to be iffy when loading to BigQuery.
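A sketch of that approach, converting any tz-aware columns to UTC before the CSV dump (the helper name and data are illustrative, not the library's actual code):

```python
import pandas as pd

def utc_normalize(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with all tz-aware datetime columns converted to UTC."""
    dataframe = dataframe.copy()
    for col in dataframe.columns:
        if isinstance(dataframe[col].dtype, pd.DatetimeTZDtype):
            dataframe[col] = dataframe[col].dt.tz_convert("UTC")
    return dataframe

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2012-06-20 13:00:00"]).tz_localize("Australia/Sydney")}
)
print(utc_normalize(df)["ts"].iloc[0])  # 2012-06-20 03:00:00+00:00
```

Serializing everything as UTC sidesteps BigQuery's offset parsing entirely, since a bare timestamp is interpreted as UTC on load.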
We should be careful about types here. Are we sure we want timestamp for dates? Seems to me as though we should drop the hours and minutes to load these as a date column.
I agree this library should support uploading date and datetime fields, but it has chosen to stick with timestamp. That's a separate issue.
The fact that this library has chosen timestamp for the time being means it should support timezones. On the other hand, I don't have experience uploading a full, properly formatted timestamp string into BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types.
Ideally I would like to have date and datetime fields.
I would love to support dates over datetimes - both as a feature and because most of our data is in dates. At the moment we upload as datetimes and then run a separate query to cast to dates.
But pandas doesn't have a date type, unless we support PeriodIndexes. We could try to guess whether the timestamps are meant to represent dates, i.e. all at midnight with daily frequency. But I think that would be too much magic
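For reference, the midnight heuristic described above could be sketched like this (the function name is illustrative):

```python
import pandas as pd

def looks_like_dates(series: pd.Series) -> bool:
    """Heuristic: every timestamp falls exactly on midnight."""
    return bool((series.dt.normalize() == series).all())

dates = pd.Series(pd.to_datetime(["2012-03-20", "2012-03-21"]))
times = pd.Series(pd.to_datetime(["2012-03-20 13:00", "2012-03-21 09:30"]))

print(looks_like_dates(dates))  # True
print(looks_like_dates(times))  # False
```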
> But I think that would be too much magic
In case it's useful: the code behind pandas.DataFrame.to_sql might serve as a reference approach. It uses pandas._libs.lib.infer_dtype, which will return "date" for columns of date objects.
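For example, via the public alias pd.api.types.infer_dtype (which wraps the same private function):

```python
import datetime
import pandas as pd

# A column of datetime.date objects is inferred as "date",
# while true timestamps are inferred as "datetime64".
dates = pd.Series([datetime.date(2012, 3, 20), datetime.date(2012, 3, 21)])
stamps = pd.to_datetime(pd.Series(["2012-03-20 13:00"]))

print(pd.api.types.infer_dtype(dates))   # date
print(pd.api.types.infer_dtype(stamps))  # datetime64
```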
Possibly fixed by serializing to Parquet instead? https://github.com/googleapis/python-bigquery-pandas/pull/413
Needs an integration test to verify.
Re: a date dtype, we've made some progress supporting this via an extension dtype package: https://github.com/googleapis/python-db-dtypes-pandas. Core pandas folks are open to adding these to the core library, but it requires some cleanup first.
This issue was created close to six years ago. There are some tools in the ecosystem (as noted above) that handle various date/time-related datatypes and potentially make this item OBE (overcome by events).
Gonna close this as will not fix.