[bigquery-io] Performance issue when number of columns is relatively large
@vlasenkoalexey
Based on some of our internal benchmarks within Twitter, we saw that BigQuery reader performance is relatively slow compared to TFExample. To give a sense:
In that benchmark we have 180 features (BigQuery columns; internal data, so I can't share a table link here); they are all primitive types (BOOLEAN, FLOAT, INTEGER) without repeated fields.
The unit is "examples per second" on a 32-core machine with the TF2.2 tf2-2-2-cpu image (https://cloud.google.com/ai-platform/deep-learning-vm/docs/images).
| Batch size | BigQuery Reader | Gzipped TFExample decoding |
|---|---|---|
| 32 | 1,279 | 6,317 |
| 256 | 10,223 | 26,646 |
| 1024 | 16,092 | 31,724 |
I understand there are many factors involved, such as batch_size and requested_streams for BigQuery, as well as reader_num_threads and parser_num_threads for TFExample (we are using https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_batched_features_dataset for TFExample decoding in this benchmark).
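For reference, the TFExample side of the comparison has roughly the following shape. This is only a minimal sketch, not the exact internal benchmark: the feature spec, file pattern, and thread counts below are placeholders.

```python
import tensorflow as tf

# Placeholder feature spec standing in for the ~180 internal features.
feature_spec = {"f%d" % i: tf.io.FixedLenFeature([], tf.int64) for i in range(180)}

dataset = tf.data.experimental.make_batched_features_dataset(
    file_pattern="gs://my-bucket/examples-*.tfrecord.gz",  # placeholder path
    batch_size=1024,
    features=feature_spec,
    # Gzipped TFRecords, matching the "Gzipped TFExample" column above.
    reader=lambda filenames: tf.data.TFRecordDataset(filenames, compression_type="GZIP"),
    reader_num_threads=16,  # placeholder values
    parser_num_threads=16,
    shuffle=False)
```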
I could potentially make a shareable and repeatable benchmark using a BigQuery public dataset, but it would take some amount of work (converting it to TFExample, writing the benchmark code, etc.). So before doing that, I wanted to know whether you are aware of this performance issue and, if so, whether there is a plan on your side to address it.
At this point, my gut feeling is that it is very likely caused by the BigQuery Storage API itself rather than the Avro -> Tensor decoding part; do you know whether that is the case?
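One way to test that gut feeling independently of tensorflow-io would be to time a raw BigQuery Storage API read with the plain Python client. The sketch below follows the client library quickstart and is only an assumption about how to isolate the API throughput (package versions and the exact rows() signature may differ; the project name is a placeholder, and the table is the public wikipedia sample used later in this thread):

```python
import time
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

table = "projects/bigquery-public-data/datasets/samples/tables/wikipedia"
requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["id", "num_characters", "timestamp", "wp_namespace", "revision_id"],
        row_restriction="num_characters > 1000"))

session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=requested_session,
    max_stream_count=1)

# Time how fast raw rows come off a single stream, with no TFIO decoding involved.
reader = client.read_rows(session.streams[0].name)
start, n = time.time(), 0
for _row in reader.rows(session):
    n += 1
    if n >= 100000:
        break
print("%.0f raw rows/s from the Storage API" % (n / (time.time() - start)))
```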
I also tried switching to Arrow instead of Avro as the data format, but the Arrow integration has several limitations:
- The TFIO BigQuery Arrow format API currently can't support repeated fields.
- The TFIO BigQuery Arrow format API can't support the string type. For example, if I specified "language" (which is STRING) in the selected fields, I got an error.
- The TFIO BigQuery Arrow format API can't support the bool type. For example, if I specified "is_redirect" (which is BOOLEAN) in the selected fields, I got an error.
With a limited number of features being streamed, there is no big difference in terms of examples per second so far (it might be because the feature types here are just integers and there are only a few features, so the difference between Avro and Arrow is not that pronounced).
Here is the BigQuery Benchmark code:
```python
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for BigQuery Ops."""
import concurrent.futures
from io import BytesIO
import os
import json
import time

import fastavro
import numpy as np
import grpc  # pylint: disable=wrong-import-order
import tensorflow as tf  # pylint: disable=wrong-import-order
from tensorflow import test  # pylint: disable=wrong-import-order
from tensorflow.python.framework import dtypes  # pylint: disable=wrong-import-order
from tensorflow.python.framework import errors  # pylint: disable=wrong-import-order
from tensorflow.python.framework import ops  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import (
    BigQueryTestClient,
    BigQueryClient,
)  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import BigQueryReadSession

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'my credentials'
GCP_PROJECT_ID = 'my-project'
DATASET_GCP_PROJECT_ID = "bigquery-public-data"
DATASET_ID = "samples"
TABLE_ID = "wikipedia"


def get_dataset_examples_per_second(dataset, num_iterations=5, num_batches=1000, batch_size=32):
  examples_per_second = []
  for _ in range(num_iterations):
    n = 0
    start_t = time.time()
    for data_batch in dataset.take(num_batches):
      n += batch_size
    delta_t = time.time() - start_t
    examples_per_second.append((num_batches * batch_size) / delta_t)
    print('Processed %d entries in %f seconds. [%.2f] examples/s' % (
        n, delta_t, examples_per_second[-1]))
  print('Average [%.2f] examples/s using TensorFlow v[%s]' % (
      sum(examples_per_second) * 1.0 / len(examples_per_second),
      tf.__version__))


def main():
  ops.enable_eager_execution()
  client = BigQueryClient()
  selected_fields = ["id",
                     "num_characters",
                     "timestamp",
                     "wp_namespace",
                     #"is_redirect",
                     "revision_id"]
  output_types = [dtypes.int64,
                  dtypes.int64,
                  dtypes.int64,
                  dtypes.int64,
                  #dtypes.bool,
                  dtypes.int64]
  read_session = client.read_session(
      "projects/" + GCP_PROJECT_ID,
      DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
      selected_fields,
      output_types,
      requested_streams=2,  #adjust this when benchmark
      row_restriction="num_characters > 1000",
      data_format=BigQueryClient.DataFormat.ARROW)
  dataset = read_session.parallel_read_rows()
  get_dataset_examples_per_second(dataset)


if __name__ == '__main__':
  main()
```
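One caveat about the snippet above: parallel_read_rows() yields individual rows, while get_dataset_examples_per_second counts batch_size examples per dataset element, so for batched throughput numbers an explicit batch step is needed. A minimal sketch of that variant (the batch_size value here is a placeholder):

```python
# Hypothetical variant of the last two lines of main(): batch the rows so each
# dataset element really is a batch of `batch_size` examples.
batch_size = 1024
dataset = read_session.parallel_read_rows().batch(batch_size)
get_dataset_examples_per_second(dataset, batch_size=batch_size)
```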
@RuhuaJiang Does the performance disparity happen only when the number of feature columns is very large (e.g., 180 as mentioned), or does it happen even when the number of feature columns is small (like 1-5)?
Thanks for reporting this issue. I did BQ reader benchmarks and tried to optimize the reader before the first release, using the Wiki dataset with a few columns, and compared it to GCS. According to my tests, BQ performance was about the same as GCS; BQ tended to be faster on powerful machines with multiple CPUs and slower on low-power VMs.
| Machine | Prefetch | Num streams | Sloppy | BQ (examples/s) | GCS (examples/s) |
|---|---|---|---|---|---|
| Local TF2.0 | N | 10 | False | 135014 | 76682 |
| Local TF2.0 | N | 1 | False | 51449 | 120388 |
| Local TF2.0 | N | 10 | True | 196110 | 79824 |
| Local TF2.0 | Y | 10 | False | 192840 | 79719 |
| Local TF2.0 | Y | 1 | False | 60328 | 140254 |
| Local TF2.0 | Y | 10 | True | 209776 | 104283 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | False | 32954 | 78432 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 1 | False | 34477 | 84363 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | True | 33629 | 73500 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | False | 24904 | 76749 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 1 | False | 31163 | 86520 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | True | 24660 | 72929 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | False | 56173 | 63212 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 1 | False | 32932 | 84269 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | True | 64520 | 62389 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | False | 69508 | 68000 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 1 | False | 31703 | 101243 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | True | 78304 | 79070 |
Arrow was a bit faster, though I haven't done a deep analysis yet.
| Machine | Num Streams | Sloppy | Avro (examples/s) | Arrow (examples/s) |
|---|---|---|---|---|
| Local TF2.1 | 1 | False | 125176 | 185062 |
| Local TF2.1 | 10 | False | 88777 | 100657 |
| Local TF2.1 | 1 | True | 130562 | 186283 |
| Local TF2.1 | 10 | True | 87581 | 104446 |
Here is the benchmark I used: https://github.com/vlasenkoalexey/bigquery_perftest. Can you give it a shot?
I plan to spend some time debugging it once I'm done with my current project. One change that should help with throughput is creating multiple gRPC streams.
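Until then, the knobs already exposed on the reader side are requested_streams on read_session and the parallelism of parallel_read_rows. The sketch below is a hedged illustration only: it assumes parallel_read_rows accepts cycle_length and sloppy arguments, and the concrete values are placeholders rather than recommendations.

```python
# Placeholder values; assumes parallel_read_rows(cycle_length=..., sloppy=...) is available.
streams = 32  # e.g. roughly one stream per vCPU on the 32-core benchmark machine

read_session = client.read_session(
    "projects/" + GCP_PROJECT_ID,
    DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
    selected_fields,
    output_types,
    requested_streams=streams,
    row_restriction="num_characters > 1000",
    data_format=BigQueryClient.DataFormat.AVRO)

# Read all streams in parallel; sloppy=True allows out-of-order interleaving.
dataset = read_session.parallel_read_rows(cycle_length=streams, sloppy=True)
```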
@yongtang Good question. I had another, smaller dataset with 22 features (22 BigQuery columns vs. a TFExample with 22 features); the performance of BigQuery is slower there too. I haven't tried even smaller numbers like 1-5 though.
| Batch size | BigQuery | Gzipped TFExample |
|---|---|---|
| 32 | 22,500 | 13,000 |
| 256 | 22,700 | 47,800 |
| 1024 | 22,700 | 52,300 |
@vlasenkoalexey Thanks for the info, very informative. Let me try https://github.com/vlasenkoalexey/bigquery_perftest as well (with both small and larger numbers of columns).
I got a repro, will see what I can do to make it better.
@vlasenkoalexey @RuhuaJiang any updates on this issue?
Sorry, I forgot to provide an update here. After profiling the reader on a benchmark with 100+ columns (see https://github.com/vlasenkoalexey/bigquery_perftest/blob/master/bq_perftest_mult_columns.py) I realized that the bottleneck is the batch step:
```python
streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
    read_rows,
    cycle_length=streams_count64,
    num_parallel_calls=streams_count64,
    deterministic=not sloppy)
dataset = dataset.batch(batch_size)
```
If you move batching onto the same thread as the read, performance is going to be much better.
Here is the updated sample:
```python
def _read_rows(stream):
  dataset = read_session.read_rows(stream)
  dataset = dataset.batch(batch_size)
  return dataset


streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
    _read_rows,
    cycle_length=streams_count64,
    num_parallel_calls=streams_count64,
    deterministic=not sloppy)
```
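Batching inside _read_rows means each interleaved stream assembles its own batches, instead of a single downstream .batch() call becoming the bottleneck when rows have many columns. A possible follow-up, not part of the original sample, is to add a prefetch after the interleave so reading and batching overlap with downstream consumption:

```python
# Sketch: AUTOTUNE lets tf.data pick the prefetch buffer size.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```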
Originally I planned to update the API to use this approach all the time, but realized that it is not always desirable. I'll update the BigQuery README page to make this clear and close this bug.
Can I reopen this issue? I found that even when the number of columns is not large, the performance of the BigQuery reader is significantly lower than the GCS reader. The version used here is tensorflow_io==0.17.0; could tensorflow_io > 0.17.0, for instance 0.22.0, deliver something different? I see the performance chart, but is it a general rule of thumb that a GCS bucket always has better performance than BigQuery, and if so, why?
Did you have a chance to try the approach suggested in https://github.com/tensorflow/io/issues/1066#issuecomment-757074730? And please also confirm that your data is stored in a location close to where you are reading it from. It is known that BQ is slightly slower than GCS, but not by that much.