[bigquery-io] Performance issue when number of columns is relatively large
@vlasenkoalexey
Based on some of our internal benchmarks within Twitter, we saw that BigQuery reader performance is relatively slow compared to TFExample. To give a sense:
In that benchmark we have 180 features (BigQuery columns; internal data, so I can't share a table link here); they are all primitive types (BOOLEAN, FLOAT, INTEGER) without repeated fields.
The unit is "examples per second" on a 32-core machine with the TF2.2 tf2-2-2-cpu image (https://cloud.google.com/ai-platform/deep-learning-vm/docs/images).
| Batch size | BigQuery Reader | Gzipped TFExample decoding |
|---|---|---|
| 32 | 1,279 | 6,317 |
| 256 | 10,223 | 26,646 |
| 1024 | 16,092 | 31,724 |
I understand there are many factors involved, such as batch_size and requested_streams for BigQuery, as well as reader_num_threads and parser_num_threads for TFExample (we are using https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_batched_features_dataset for TFExample decoding in this benchmark).
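For reference, the TFExample side of the comparison has roughly the following shape. This is only a minimal sketch, not the exact internal benchmark: the feature spec, file pattern, and thread counts below are placeholders.

```python
import tensorflow as tf

# Placeholder feature spec standing in for the ~180 internal features.
feature_spec = {"f%d" % i: tf.io.FixedLenFeature([], tf.int64) for i in range(180)}

dataset = tf.data.experimental.make_batched_features_dataset(
    file_pattern="gs://my-bucket/examples-*.tfrecord.gz",  # placeholder path
    batch_size=1024,
    features=feature_spec,
    # Gzipped TFRecords, matching the "Gzipped TFExample" column above.
    reader=lambda filenames: tf.data.TFRecordDataset(filenames, compression_type="GZIP"),
    reader_num_threads=16,  # placeholder values
    parser_num_threads=16,
    shuffle=False)
```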
I could potentially make a shareable and repeatable benchmark using a BigQuery public dataset, but it would take some amount of work (converting it to TFExample, writing the benchmark code, etc.). So before doing that, I wanted to know whether you are aware of this performance issue and, if so, whether there is a plan on your side to address it.
At this point, my gut feeling is that it is very likely caused by the BigQuery Storage API itself rather than the Avro -> Tensor decoding part; do you know whether that is the case?
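One way to test that gut feeling independently of tensorflow-io would be to time a raw BigQuery Storage API read with the plain Python client. The sketch below follows the client library quickstart and is only an assumption about how to isolate the API throughput (package versions and the exact rows() signature may differ; the project name is a placeholder, and the table is the public wikipedia sample used later in this thread):

```python
import time
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

table = "projects/bigquery-public-data/datasets/samples/tables/wikipedia"
requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["id", "num_characters", "timestamp", "wp_namespace", "revision_id"],
        row_restriction="num_characters > 1000"))

session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=requested_session,
    max_stream_count=1)

# Time how fast raw rows come off a single stream, with no TFIO decoding involved.
reader = client.read_rows(session.streams[0].name)
start, n = time.time(), 0
for _row in reader.rows(session):
    n += 1
    if n >= 100000:
        break
print("%.0f raw rows/s from the Storage API" % (n / (time.time() - start)))
```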
I also tried switching to Arrow instead of Avro as the data format, but the Arrow integration has several limitations:
- The TFIO BigQuery Arrow format API currently can't support repeated fields.
- The TFIO BigQuery Arrow format API can't support the string type. For example, if I specified "language" (which is STRING) in the selected fields, I got an error.
- The TFIO BigQuery Arrow format API can't support the bool type. For example, if I specified "is_redirect" (which is BOOLEAN) in the selected fields, I got an error.
With a limited number of features being streamed, there is no big difference in terms of examples per second so far (it might be because the feature types here are just integers and there are only a few features, so the difference between Avro and Arrow is not that pronounced).
Here is the BigQuery Benchmark code:
```python
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for BigQuery Ops."""
import concurrent.futures
from io import BytesIO
import os
import json
import time

import fastavro
import numpy as np
import grpc  # pylint: disable=wrong-import-order
import tensorflow as tf  # pylint: disable=wrong-import-order
from tensorflow import test  # pylint: disable=wrong-import-order
from tensorflow.python.framework import dtypes  # pylint: disable=wrong-import-order
from tensorflow.python.framework import errors  # pylint: disable=wrong-import-order
from tensorflow.python.framework import ops  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import (
    BigQueryTestClient,
    BigQueryClient,
)  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import BigQueryReadSession

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'my credentials'
GCP_PROJECT_ID = 'my-project'
DATASET_GCP_PROJECT_ID = "bigquery-public-data"
DATASET_ID = "samples"
TABLE_ID = "wikipedia"


def get_dataset_examples_per_second(dataset, num_iterations=5, num_batches=1000, batch_size=32):
  examples_per_second = []
  for _ in range(num_iterations):
    n = 0
    start_t = time.time()
    for data_batch in dataset.take(num_batches):
      n += batch_size
    delta_t = time.time() - start_t
    examples_per_second.append((num_batches * batch_size) / delta_t)
    print('Processed %d entries in %f seconds. [%.2f] examples/s' % (
        n, delta_t, examples_per_second[-1]))
  print('Average [%.2f] examples/s using TensorFlow v[%s]' % (
      sum(examples_per_second) * 1.0 / len(examples_per_second),
      tf.__version__))


def main():
  ops.enable_eager_execution()
  client = BigQueryClient()
  selected_fields = ["id",
                     "num_characters",
                     "timestamp",
                     "wp_namespace",
                     #"is_redirect",
                     "revision_id"]
  output_types = [dtypes.int64,
                  dtypes.int64,
                  dtypes.int64,
                  dtypes.int64,
                  #dtypes.bool,
                  dtypes.int64]
  read_session = client.read_session(
      "projects/" + GCP_PROJECT_ID,
      DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
      selected_fields,
      output_types,
      requested_streams=2,  #adjust this when benchmark
      row_restriction="num_characters > 1000",
      data_format=BigQueryClient.DataFormat.ARROW)
  dataset = read_session.parallel_read_rows()
  get_dataset_examples_per_second(dataset)


if __name__ == '__main__':
  main()
```
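One caveat about the snippet above: parallel_read_rows() yields individual rows, while get_dataset_examples_per_second counts batch_size examples per dataset element, so for batched throughput numbers an explicit batch step is needed. A minimal sketch of that variant (the batch_size value here is a placeholder):

```python
# Hypothetical variant of the last two lines of main(): batch the rows so each
# dataset element really is a batch of `batch_size` examples.
batch_size = 1024
dataset = read_session.parallel_read_rows().batch(batch_size)
get_dataset_examples_per_second(dataset, batch_size=batch_size)
```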
@RuhuaJiang Does the performance disparity happen only when the number of feature columns is very large (e.g., 180 as mentioned), or does it happen even when the number of feature columns is small (like 1-5)?
Thanks for reporting this issue. I did BQ reader benchmarks and tried to optimize the reader before the first release, using the Wiki dataset with a few columns, and compared it to GCS. According to my tests, BQ performance was about the same as GCS; BQ tended to be faster on powerful machines with multiple CPUs and slower on low-power VMs.
| Machine | Prefetch | Num streams | Sloppy | BQ (examples/s) | GCS (examples/s) |
|---|---|---|---|---|---|
| Local TF2.0 | N | 10 | False | 135014 | 76682 |
| Local TF2.0 | N | 1 | False | 51449 | 120388 |
| Local TF2.0 | N | 10 | True | 196110 | 79824 |
| Local TF2.0 | Y | 10 | False | 192840 | 79719 |
| Local TF2.0 | Y | 1 | False | 60328 | 140254 |
| Local TF2.0 | Y | 10 | True | 209776 | 104283 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | False | 32954 | 78432 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 1 | False | 34477 | 84363 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | True | 33629 | 73500 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | False | 24904 | 76749 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 1 | False | 31163 | 86520 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | True | 24660 | 72929 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | False | 56173 | 63212 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 1 | False | 32932 | 84269 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | True | 64520 | 62389 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | False | 69508 | 68000 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 1 | False | 31703 | 101243 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | True | 78304 | 79070 |
Arrow was a bit faster, though I haven't done a deep analysis yet.
| Machine | Num Streams | Sloppy | Avro (examples/s) | Arrow (examples/s) |
|---|---|---|---|---|
| Local TF2.1 | 1 | False | 125176 | 185062 |
| Local TF2.1 | 10 | False | 88777 | 100657 |
| Local TF2.1 | 1 | True | 130562 | 186283 |
| Local TF2.1 | 10 | True | 87581 | 104446 |
Here is the benchmark I used: https://github.com/vlasenkoalexey/bigquery_perftest. Can you give it a shot?
I plan to spend some time debugging it once I'm done with my current project. One change that should help with throughput is creating multiple gRPC streams.
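Until then, the knobs already exposed on the reader side are requested_streams on read_session and the parallelism of parallel_read_rows. The sketch below is a hedged illustration only: it assumes parallel_read_rows accepts cycle_length and sloppy arguments, and the concrete values are placeholders rather than recommendations.

```python
# Placeholder values; assumes parallel_read_rows(cycle_length=..., sloppy=...) is available.
streams = 32  # e.g. roughly one stream per vCPU on the 32-core benchmark machine

read_session = client.read_session(
    "projects/" + GCP_PROJECT_ID,
    DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
    selected_fields,
    output_types,
    requested_streams=streams,
    row_restriction="num_characters > 1000",
    data_format=BigQueryClient.DataFormat.AVRO)

# Read all streams in parallel; sloppy=True allows out-of-order interleaving.
dataset = read_session.parallel_read_rows(cycle_length=streams, sloppy=True)
```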
@yongtang Good question. I had another, smaller dataset with 22 features (22 BigQuery columns vs. a TFExample with 22 features); the performance of BigQuery is slower there too. I haven't tried even smaller numbers like 1-5 though.
| Batch size | BigQuery | Gzipped TFExample |
|---|---|---|
| 32 | 22,500 | 13,000 |
| 256 | 22,700 | 47,800 |
| 1024 | 22,700 | 52,300 |
@vlasenkoalexey Thanks for the info, very informative. Let me try https://github.com/vlasenkoalexey/bigquery_perftest as well (with both small and larger numbers of columns).
I got a repro, will see what I can do to make it better.
@vlasenkoalexey @RuhuaJiang any updates on this issue?
Sorry, I forgot to provide an update here. After profiling the reader on a benchmark with 100+ columns (see https://github.com/vlasenkoalexey/bigquery_perftest/blob/master/bq_perftest_mult_columns.py) I realized that the bottleneck is the batch step:
```python
streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
    read_rows,
    cycle_length=streams_count64,
    num_parallel_calls=streams_count64,
    deterministic=not sloppy)
dataset = dataset.batch(batch_size)
```
If you move batching onto the same thread as the read, performance is going to be much better.
Here is the updated sample:
```python
def _read_rows(stream):
  dataset = read_session.read_rows(stream)
  dataset = dataset.batch(batch_size)
  return dataset


streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
    _read_rows,
    cycle_length=streams_count64,
    num_parallel_calls=streams_count64,
    deterministic=not sloppy)
```
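Batching inside _read_rows means each interleaved stream assembles its own batches, instead of a single downstream .batch() call becoming the bottleneck when rows have many columns. A possible follow-up, not part of the original sample, is to add a prefetch after the interleave so reading and batching overlap with downstream consumption:

```python
# Sketch: AUTOTUNE lets tf.data pick the prefetch buffer size.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```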
Originally I planned to update the API to use this approach all the time, but realized that it is not always desirable. I'll update the BigQuery README page to make this clear and close this bug.
Can I reopen this issue? I found that even when the number of columns is not large, the performance of the BigQuery reader is significantly lower than the GCS reader. The version used here is tensorflow_io==0.17.0; could tensorflow_io > 0.17.0, for instance 0.22.0, deliver something different? I see the performance chart, but is it a general rule of thumb that a GCS bucket always has better performance than BigQuery, and if so, why?
Did you have a chance to try the approach suggested in https://github.com/tensorflow/io/issues/1066#issuecomment-757074730? And please also confirm that your data is stored in a location close to where you are reading it from. It is known that BQ is slightly slower than GCS, but not by that much.