postgresml
Cannot use `pgml.activate_venv()` to set environment for parallel workers.
Currently, pgml provides a UDF called `pgml.activate_venv()` ^1. However, when a query uses parallel workers, the venv activated this way is not applied to those workers. This is not easy to reproduce, but parallel queries are not rare in PostgreSQL. We could probably remove this UDF, since the GUC parameter `pgml.venv` ^2 already controls the venv path.
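For comparison, a minimal sketch of the GUC-based setup. It assumes that `pgml.venv` is read whenever a backend process, parallel workers included, initializes its Python interpreter; that behavior is an assumption here, not something verified in this report. The path is the one used in the reproduction steps.

```
# postgresql.conf (sketch; path taken from the reproduction steps)
# Assumption: every backend, including parallel workers, picks this up
# when it initializes Python, so no per-session UDF call is needed.
pgml.venv = '/tmp/virtualenv'
```

Unlike a call to `pgml.activate_venv()`, which only affects the process that executes it, a GUC value is visible to parallel workers, because PostgreSQL passes the leader's GUC state on to them.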
Steps to reproduce:
- The package `xgboost` exists in my venv environment.
- Remove the `IMMUTABLE` qualifier from `pgml.validate_python_dependencies`, so that this UDF can be executed on parallel workers multiple times.

      diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs
      index ad952e48..440df23d 100644
      --- a/pgml-extension/src/api.rs
      +++ b/pgml-extension/src/api.rs
      @@ -27,7 +27,7 @@ pub fn activate_venv(venv: &str) -> bool {
       }

       #[cfg(feature = "python")]
      -#[pg_extern(immutable, parallel_safe)]
      +#[pg_extern(parallel_safe)]
       pub fn validate_python_dependencies() -> bool {
           unwrap_or_error!(crate::bindings::python::validate_dependencies())
       }

- Construct a query that involves parallel workers.

      CREATE TABLE t1(i int);
      INSERT INTO t1 VALUES(generate_series(1,500000));
      INSERT INTO t1 VALUES(generate_series(1,500000));
      INSERT INTO t1 VALUES(generate_series(1,500000));
      INSERT INTO t1 VALUES(generate_series(1,500000));

      pgml=# select pgml.activate_venv('/tmp/virtualenv');
       activate_venv
      ---------------
       t
      (1 row)

      pgml=# explain (analyze) select count(pgml.validate_python_dependencies()) from t1;
      INFO: Python version: 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801]
      INFO: Scikit-learn 1.3.0, XGBoost 2.0.1, LightGBM 4.1.0, NumPy 1.26.1
      INFO: Python version: 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801]
      ERROR: The xgboost package is missing. Install it with `sudo pip3 install xgboost`
      ModuleNotFoundError: No module named 'xgboost'
      CONTEXT: parallel worker
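
One way to check that the parallel workers are what triggers the error is to rerun the same statement with parallel query turned off for the session. This is only a sketch: it assumes the failure comes solely from worker processes that never ran `pgml.activate_venv()`, and it uses the standard PostgreSQL setting `max_parallel_workers_per_gather`. No output is shown because this variant was not part of the original report.

```sql
-- Sketch: force the query to run entirely in the leader backend.
-- Assumes pgml.activate_venv('/tmp/virtualenv') was already called
-- in this session, as in the steps above.
SET max_parallel_workers_per_gather = 0;

EXPLAIN (ANALYZE)
SELECT count(pgml.validate_python_dependencies()) FROM t1;

-- Restore the default so later queries can use parallel workers again.
RESET max_parallel_workers_per_gather;
```

If the statement succeeds with parallelism disabled and fails with it enabled, that points at the workers not inheriting the leader's activated venv.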