
Extremely slow performance using remote MLMD CloudSQL instance

Open htahir1 opened this issue 3 years ago • 3 comments

Hello. I initially created an issue in the ML Metadata repository (https://github.com/google/ml-metadata/issues/157) but have since realized that this is more of an issue with the launcher.Launch logic in TFX. I believe the querying that happens in the driver code while preparing each component before each run is very unoptimized for certain use cases.

Our team has been trying to use ml-metadata to run pipelines using a Cloud SQL backend. However, we have run into performance issues. Let's say my metadata connection config looks like this:

connection_config {
  mysql {
    host: '34.79.128.231'
    port: 3306
    database: 'my_database'
    user: 'root'
    password: '***'
    ssl_options {
      key: 'client-key.pem'
      cert: 'client-cert.pem'
      ca: 'server-ca.pem'
      capath: '/'
      verify_server_cert: false
    }
    skip_db_creation: false
  }
}
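
For reference, a minimal sketch of how a config like this can be built in Python and handed to a pipeline. The host, credentials, pipeline name and root are placeholders, and the exact Pipeline/LocalDagRunner imports may differ slightly between TFX versions:

from ml_metadata.proto import metadata_store_pb2
from tfx.orchestration import pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner

# Build the same MySQL connection config programmatically (values are placeholders).
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.mysql.host = '34.79.128.231'
connection_config.mysql.port = 3306
connection_config.mysql.database = 'my_database'
connection_config.mysql.user = 'root'
connection_config.mysql.password = '***'
connection_config.mysql.ssl_options.key = 'client-key.pem'
connection_config.mysql.ssl_options.cert = 'client-cert.pem'
connection_config.mysql.ssl_options.ca = 'server-ca.pem'
connection_config.mysql.ssl_options.verify_server_cert = False

components = [...]  # the pipeline's actual components (placeholder)

# Hand the config to the pipeline; every component run then talks to Cloud SQL.
p = pipeline.Pipeline(
    pipeline_name='my_pipeline',          # placeholder
    pipeline_root='gs://my-bucket/root',  # placeholder
    components=components,
    metadata_connection_config=connection_config)

LocalDagRunner().run(p)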

When I run a TFX pipeline with the above, there is a wait of nearly 60 seconds between components. This is likely caused by the number of queries issued in the portable launcher code. I did some rough profiling of the code, and it takes far too long to run through the seven steps that fetch previous executions and resolve the caching logic.
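
The profiling was nothing more sophisticated than timing the individual MLMD calls. A sketch of that kind of instrumentation is below; the instrument helper is illustrative, not the exact code used:

import functools
import inspect
import time

from ml_metadata.metadata_store import metadata_store

def instrument(store: metadata_store.MetadataStore) -> metadata_store.MetadataStore:
  """Logs the wall-clock duration of every public MetadataStore method call."""
  for name in dir(store):
    if name.startswith('_'):
      continue
    method = getattr(store, name)
    if not inspect.ismethod(method):
      continue

    def make_wrapper(fn, fn_name):
      @functools.wraps(fn)
      def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
          return fn(*args, **kwargs)
        finally:
          print(f'MLMD {fn_name}: {time.perf_counter() - start:.3f}s')
      return wrapper

    # Shadow the bound method on the instance with the timed version.
    setattr(store, name, make_wrapper(method, name))
  return store

Summing the per-call durations for each component is how the delay was attributed to the launcher steps rather than to component execution itself.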

We have also tried going through the gRPC server rather than connecting to the database directly, but with the same result. On the other hand, when using the internal Kubeflow MLMD, or Kubeflow (Vertex) with Cloud SQL in the same VPC, performance is fast as expected. The problem only appears when running the pipeline locally and connecting to a public IP such as 34.79.128.231.
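
For completeness, the gRPC attempt looked roughly like this, pointing MLMD at a metadata gRPC server instead of the database (host and port are placeholders):

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect through the MLMD gRPC service rather than MySQL directly.
client_config = metadata_store_pb2.MetadataStoreClientConfig()
client_config.host = 'metadata-grpc.example.com'  # placeholder
client_config.port = 8080                          # placeholder

store = metadata_store.MetadataStore(client_config)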

Is there a way to solve this problem? We would like to use TFX independently of Kubeflow or Vertex.


If the bug is related to a specific library below, please raise an issue in the respective repo directly:

  • TensorFlow Data Validation Repo
  • TensorFlow Model Analysis Repo
  • TensorFlow Transform Repo
  • TensorFlow Serving Repo

System information

  • Have I specified the code to reproduce the issue (Yes, No):
  • Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc):
  • TensorFlow version: ...
  • TFX Version: 1.8.0
  • Python version: 3.8.10
  • Python dependencies (from pip freeze output): ...

Describe the current behavior
Slow performance for TFX pipelines (up to a 60-second wait between simple components).

Describe the expected behavior
Fast execution.

Standalone code to reproduce the issue
See above.

Providing a bare minimum test case or step(s) to reproduce the problem will greatly help us to debug the issue. If possible, please share a link to Colab/Jupyter/any notebook.

Name of your Organization (Optional)

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

htahir1 · Jun 01 '22 17:06

This seems like an interesting report. @1025KB could you take a look?

jiyongjung0 · Jun 07 '22 00:06

For the 60s, can we confirm whether that time is actually spent in MLMD? There are other costs, e.g. container setup, if running on Cloud.

Which DagRunner are you using?

1025KB · Jun 07 '22 00:06

@1025KB We are just using a custom DagRunner, which is essentially like a LocalDagRunner.

I investigated a bit more over the last week and have reduced the time to run a simple pipeline to about 30 seconds. It still takes a long time, and I can tell you why. In tfx/orchestration/portable/launcher.py:

# This takes about 8 seconds
context_lib.prepare_contexts(
    metadata_handler=m, node_contexts=self._pipeline_node.contexts)

# This takes 5-6 seconds
resolved_inputs = inputs_utils.resolve_input_artifacts_v2(
    pipeline_node=self._pipeline_node,
    metadata_handler=m)

# This takes 7-8 seconds
_publish_successful_execution(...)

Maybe it's just too many database calls over the network?
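
A quick way to sanity-check that hypothesis is to time a trivial MLMD read against the public IP: if a single round trip costs on the order of 100 ms, a few dozen launcher queries per component already add up to tens of seconds. A sketch, reusing the connection config shown earlier:

import statistics
import time

from ml_metadata.metadata_store import metadata_store

# connection_config: the same MySQL ConnectionConfig shown above.
store = metadata_store.MetadataStore(connection_config)

# Time a small read repeatedly; the result is dominated by network latency.
samples = []
for _ in range(20):
  start = time.perf_counter()
  store.get_artifact_types()
  samples.append(time.perf_counter() - start)

print(f'median MLMD round trip: {statistics.median(samples) * 1000:.1f} ms')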

htahir1 · Jun 07 '22 08:06