BigQueryExampleGen fails on Kubeflow Pipelines when using long queries
System information
- Have I specified the code to reproduce the issue (Yes, No): Yes
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): MacOS
- TensorFlow version: tensorflow==2.4.2
- TFX Version: 0.26.3, but it might also affect the master branch
- Python version: Python 3.7.7
- Python dependencies (from pip freeze output): Irrelevant
Describe the current behavior
BigQueryExampleGen fails on Kubeflow Pipelines with the error standard_init_linux.go:211: exec user process caused "argument list too long"
due to the length of the tfx_ir / serialized_component arg when using long queries. This is a blocker for us.
Describe the expected behavior
BigQueryExampleGen should not fail on Kubeflow Pipelines, even when using long queries.
Standalone code to reproduce the issue
Run BigQueryExampleGen with a query that selects ~500 features to cause the tfx_ir / serialized_component string to exceed ~131k characters.
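A rough way to check whether a given string would hit the limit, as a sketch only: on Linux a single exec argument is capped at MAX_ARG_STRLEN, which is 32 pages (131072 bytes with 4 KiB pages). The table name, column names, and helper functions below are hypothetical placeholders, and the serialized TFX IR will be larger than the raw query it embeds.

```python
# Linux caps each individual argv string at 32 pages (assuming 4 KiB pages).
# The limit counts the terminating NUL byte as well.
MAX_ARG_STRLEN = 32 * 4096  # 131072 bytes


def build_wide_query(num_features: int,
                     table: str = "my_project.my_dataset.my_table") -> str:
    """Builds a SELECT with many columns (placeholder names, not a real schema)."""
    columns = ",\n  ".join(
        f"CAST(some_long_feature_column_name_{i:04d} AS FLOAT64) AS feature_{i:04d}"
        for i in range(num_features)
    )
    return f"SELECT\n  {columns}\nFROM `{table}`"


def exceeds_arg_limit(serialized: str) -> bool:
    """True if the string (plus its NUL terminator) cannot fit in one argv entry."""
    return len(serialized.encode("utf-8")) >= MAX_ARG_STRLEN


if __name__ == "__main__":
    query = build_wide_query(500)
    print(len(query), exceeds_arg_limit(query))
```

Applying `exceeds_arg_limit` to the actual tfx_ir / serialized_component value, rather than to the query alone, is what would confirm the failure mode described above.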
Name of your Organization (Optional)
Twitter
Other info / logs
This (TFX IR exceeding the flag size limit) is a known issue and there's a TODO to fix it in the TFX component to KFP operator conversion logic. The suggested fix is writing the IR to the pipeline_root and letting container_entrypoint.py read it back. There seems to be a PR that could have resolved this, but it was marked stale and automatically closed: https://github.com/tensorflow/tfx/pull/3842. Relevant changes from that PR are in https://github.com/tensorflow/tfx/pull/4298. A PR to remove extra node information in the generated IR (https://github.com/tensorflow/tfx/pull/3992) merged after this was proposed, and I'm not sure if those changes would still be necessary when persisting the IR to a file instead of using a string.
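The idea behind the suggested fix can be sketched roughly as follows. This is not the code from either PR: the function names and file layout are hypothetical, and it uses the local filesystem where a real pipeline_root would usually be a GCS path.

```python
import os


def write_ir_to_file(serialized_ir: str, pipeline_root: str, node_id: str) -> str:
    """Persists the serialized IR and returns the path to pass as a short flag.

    Hypothetical helper: file naming and location are assumptions.
    """
    os.makedirs(pipeline_root, exist_ok=True)
    path = os.path.join(pipeline_root, f"{node_id}_tfx_ir.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(serialized_ir)
    return path


def read_ir_from_file(path: str) -> str:
    """What the container entrypoint would do instead of reading a long flag."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

With this shape, the container argument carries only the file path, so its length no longer grows with the size of the query.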
We faced the same issue and ended up generating a new custom component that would accept a GCS path to a text file instead of passing the raw SQL string. That way only the path to the file is encoded in the "input_config" in TFX IR. We modified the Executor to then read the text file and make use of the SQL string in the same way as the current BaseExampleGenExecutor. The fix that @codesue linked looks like a better, more generic, solution to this problem though.
@rcrowe-google any chance to tag this to Twitter ?
For long queries, there are multiple limitations, e.g., the MLMD exec properties field size, the TFX IR size, etc. Consider creating a custom component that, instead of passing in the query, passes the path of a file containing the query. You could name the file after the query's fingerprint or something similar to make sure caching still works properly.
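A minimal sketch of the fingerprint-named file idea, assuming a SHA-256 content hash is an acceptable fingerprint. The helper names are hypothetical, and a local directory stands in for a GCS location.

```python
import hashlib
import os


def write_query_file(query: str, directory: str) -> str:
    """Writes the query to a file named after its content hash.

    The same query always maps to the same path, so a component that receives
    only the path still sees a changed input whenever the query changes.
    """
    fingerprint = hashlib.sha256(query.encode("utf-8")).hexdigest()
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"query_{fingerprint}.sql")
    with open(path, "w", encoding="utf-8") as f:
        f.write(query)
    return path


def read_query_file(path: str) -> str:
    """What a custom executor would call to recover the SQL string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Because the path changes whenever the query text changes, caching keyed on the component's inputs should still distinguish different queries.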
Would a stored procedure be a workable solution for long queries? It would allow for the complexity of the query to be saved and invoked with a shorter query.
We have been using views as an alternative, yes. But it makes versioning difficult, as one needs to update those separately.
Regarding uploading: it works. However, the issue is that we need to fork and retrieve the file on the driver, as we still need to resolve fields in the SQL query. So it's problematic.
Following up on this thread to post our updated solution given recent activity.
We looked at the various options discussed here, such as views, but decided against those since, as @casassg rightly points out, versioning is very important and would be lost using this mechanism. Another thing we considered was compressing/decompressing the query string to reduce its length, but this hurts the readability of pipeline inputs and would introduce an undesired dependency on other modules to compress/decompress.
Similar to what @1025KB is suggesting, we created a custom QueryGen component that creates the query programmatically and stores it as an output artifact text file. Then we modified the out-of-the-box BigQueryExampleGen component to read a Channel of type Query. We also modified the executor code to read the specified QueryGen artifact text files and inject the query strings in the necessary places.
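For reference, the compression option mentioned above might look roughly like this. It is a sketch using zlib and base64 from the Python standard library, not what was actually evaluated, and the helper names are hypothetical.

```python
import base64
import zlib


def compress_query(query: str) -> str:
    """Compresses the query and encodes it as ASCII so it can be passed as a string."""
    raw = zlib.compress(query.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decompress_query(blob: str) -> str:
    """Inverse of compress_query; would have to run wherever the query is consumed."""
    raw = base64.urlsafe_b64decode(blob.encode("ascii"))
    return zlib.decompress(raw).decode("utf-8")
```

The encoded value is shorter for repetitive SQL, but it is an opaque blob wherever pipeline inputs are displayed, which is the readability cost described above.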
@iain-stitt-by that sounds like a useful approach that others might benefit from. Would you consider contributing it to TFX-Addons?
Sure @rcrowe-google , we can look into adding some of our custom components to the TFX-Addons project