
Cannot use `pgml.activate_venv()` to set environment for parallel workers.

Open higuoxing opened this issue 8 months ago • 8 comments

Currently, pgml provides a UDF called `pgml.activate_venv()`^1. However, when a query uses parallel workers, the venv is not activated in those workers, presumably because the call only takes effect in the backend process that ran it. This is not easy to reproduce, but parallel queries are not rare in PostgreSQL. We could probably remove this UDF, since we already have the GUC parameter `pgml.venv`^2 to control the venv path.
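
As a rough sketch of the GUC-based alternative, the venv path could be set as an ordinary setting instead of through the UDF. `SET`, `ALTER SYSTEM`, and `pg_reload_conf()` are standard PostgreSQL. That `pgml.venv` can be changed at these levels, and that parallel workers honor it, are assumptions this issue does not confirm. The path is the one used in the reproduction.

```sql
-- Session-level: assumes pgml.venv is settable by the current user.
SET pgml.venv = '/tmp/virtualenv';

-- Or cluster-wide (requires superuser), then reload the configuration.
ALTER SYSTEM SET pgml.venv = '/tmp/virtualenv';
SELECT pg_reload_conf();

-- Check the effective value.
SHOW pgml.venv;
```

PostgreSQL passes the leader's GUC state to parallel workers when they start, which is why a setting is a more natural fit here than a function call.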

Steps to reproduce:

  1. The xgboost package is installed in my venv.

  2. Remove the IMMUTABLE qualifier from pgml.validate_python_dependencies, so that this UDF can be executed on parallel workers multiple times.

    diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs
    index ad952e48..440df23d 100644
    --- a/pgml-extension/src/api.rs
    +++ b/pgml-extension/src/api.rs
    @@ -27,7 +27,7 @@ pub fn activate_venv(venv: &str) -> bool {
     }
    
     #[cfg(feature = "python")]
    -#[pg_extern(immutable, parallel_safe)]
    +#[pg_extern(parallel_safe)]
     pub fn validate_python_dependencies() -> bool {
         unwrap_or_error!(crate::bindings::python::validate_dependencies())
     }
    
  3. Construct a query that involves parallel workers.

    CREATE TABLE t1(i int);
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    
    pgml=# select pgml.activate_venv('/tmp/virtualenv');
    activate_venv
    ---------------
     t
    (1 row)
    
    pgml=# explain (analyze) select count(pgml.validate_python_dependencies()) 
    from t1;
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    INFO:  Scikit-learn 1.3.0, XGBoost 2.0.1, LightGBM 4.1.0, NumPy 1.26.1
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    ERROR:  The xgboost package is missing. Install it with `sudo pip3 install xgboost`
    ModuleNotFoundError: No module named 'xgboost'
    CONTEXT:  parallel worker
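
Because the planner only picks a parallel plan when it looks cheaper, the failure above may not show up on every run. The following is a sketch of standard planner settings that can make it easier to trigger, or to avoid while the bug exists. These are stock PostgreSQL GUCs, not part of pgml, and their defaults and exact effect vary by version.

```sql
-- Encourage a parallel plan for the reproduction query.
SET max_parallel_workers_per_gather = 2;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;

-- Workaround: disable parallel query for the session, so everything runs
-- in the backend where pgml.activate_venv() was called.
SET max_parallel_workers_per_gather = 0;
```

With parallelism disabled, `explain (analyze)` should show no Gather node, and the dependency check runs only in the session's own backend.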
    

higuoxing, Nov 07 '23 06:11