Extremely slow performance using remote MLMD CloudSQL instance
Hello. I initially created an issue in the ML Metadata repository (https://github.com/google/ml-metadata/issues/157) but have since realized that this is more of an issue with the launcher.Launch logic in TFX. I believe the querying that happens in the driver code to prepare each component before each run is poorly optimized for certain use cases.
Our team has been trying to use ml-metadata to run pipelines using a Cloud SQL backend. However, we have run into performance issues. Let's say my metadata connection config looks like this:
connection_config {
  mysql {
    host: '34.79.128.231'
    port: 3306
    database: 'my_database'
    user: 'root'
    password: '***'
    ssl_options {
      key: 'client-key.pem'
      cert: 'client-cert.pem'
      ca: 'server-ca.pem'
      capath: '/'
      verify_server_cert: false
    }
    skip_db_creation: false
  }
}
When I run a TFX pipeline with the above, there is nearly a 60-second wait between components. This is likely caused by the number of queries issued in the portable launcher code. I roughly profiled the code, and it takes far too long to run through the 7 steps that fetch previous executions and resolve the caching logic.
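For reference, the rough profiling mentioned above can be done with a simple wall-clock timer wrapped around each launcher step; a minimal sketch (the label and commented call site are illustrative, not the actual launcher internals):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Hypothetical usage around a launcher step:
# with timed("prepare_contexts"):
#     context_lib.prepare_contexts(...)
```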
We have also tried going through the gRPC server rather than connecting to the DB directly, but we see the same result. On the other hand, when using the internal Kubeflow MLMD, or when using Kubeflow (Vertex) with Cloud SQL directly in the same VPC, the performance is as fast as expected. The problem only appears when you run the pipeline locally and connect to a public IP like 34.79.128.231.
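For completeness, the gRPC attempt used an MLMD client config along these lines (a MetadataStoreClientConfig text proto pointing at a metadata_store_server; the host and port values here are placeholders, not our actual deployment):

```
host: '34.79.128.231'
port: 8080
```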
Is there a way to solve this problem? We would like to use TFX independent of Kubeflow or Vertex.
System information
- Have I specified the code to reproduce the issue (Yes, No):
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc):
- TensorFlow version: ...
- TFX Version: 1.8.0
- Python version: 3.8.10
- Python dependencies (from pip freeze output): ...
Describe the current behavior: Slow performance for TFX pipelines (up to 60 seconds of wait time between simple components).
Describe the expected behavior: Fast execution.
Standalone code to reproduce the issue: See above.
This seems like an interesting report. @1025KB could you take a look?
For the 60s, can we confirm whether that time is actually spent in MLMD? There are other costs, e.g. container setup, when running on Cloud.
Which DagRunner are you using?
@1025KB Just a custom DagRunner, which is essentially like a LocalDagRunner.
I investigated a bit more over the last week and have reduced the time to 30 seconds for a simple pipeline. Still, it takes a long time, and I can tell you why. In tfx/orchestration/portable/launcher.py:
# This takes about 8 seconds
context_lib.prepare_contexts(
    metadata_handler=m, node_contexts=self._pipeline_node.contexts)

# This takes 5-6 seconds
resolved_inputs = inputs_utils.resolve_input_artifacts_v2(
    pipeline_node=self._pipeline_node, metadata_handler=m)

# This takes 7-8 seconds
_publish_successful_execution(...)
Maybe it's just too many database calls over the network?
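A back-of-envelope check supports that theory: against a public-IP Cloud SQL instance, each round trip can easily cost on the order of 100 ms, so a few dozen sequential queries per component adds up quickly. The numbers below are assumptions for illustration, not measured values:

```python
# Assumed numbers for illustration only.
rtt_seconds = 0.1           # ~100 ms round trip to a public Cloud SQL IP
queries_per_component = 80  # hypothetical count of sequential MLMD queries

overhead = rtt_seconds * queries_per_component
print(f"Estimated per-component overhead: {overhead:.1f}s")  # 8.0s
```

This matches the magnitude of the per-step timings above, whereas inside the same VPC the round trip drops to ~1 ms and the same query count becomes negligible.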