haystack icon indicating copy to clipboard operation
haystack copied to clipboard

Pipeline Env Var not converted to int properly (incl solution)

Open LisaGLH opened this issue 1 year ago • 1 comments

Describe the bug When defining haystack-pipeline.yml components via docker compose's environment variables (using the naming convention COMPONENT_PARAMS_NAME), integers are not parsed correctly.

Docker Compose:

services:

  haystack-api:
    build:
      context: .
    image: "deepset/haystack:gpu"
    volumes:
      - ~/haystack-pipelines:/opt/pipelines
    expose:
      - 8000
    restart: no
    environment:
      - PIPELINE_YAML_PATH=/opt/pipelines/pipeline-managed.yml

      # Haystack pipeline configs
      - PREPROCESSOR_PARAMS_SPLIT_BY=${PREPROCESSOR_PARAMS_SPLIT_BY:-sentence} # split by sentence by default
      - PREPROCESSOR_PARAMS_SPLIT_LENGTH=5
      - DOCUMENTSTORE_PARAMS_INDEX

Error message split_length (defined via PREPROCESSOR_PARAMS_SPLIT_LENGTH) is not correctly converted to integer. As it is coming as an env variable, it comes as a string unfortunately, throwing the following error:

File "/opt/venv/lib/python3.10/site-packages/haystack/nodes/preprocessor/preprocessor.py", line 579, in _concatenate_units
segments = windowed(elements, n=split_length, step=split_length - split_overlap)
TypeError: unsupported operand type(s) for -: 'str' and 'int'
...
Exception: Exception while running node 'Preprocessor': unsupported operand type(s) for -: 'str' and 'int'

Expected behavior

I believe this can be fixed by adding

if value.is_digit():
     component_definition["params"][param_name] = int(value)

in haystack/pipelines/config.py, line 84.

original code:

        if overwrite_with_env_variables:
            for key, value in os.environ.items():
                env_prefix = f"{name}_params_".upper()
                if key.startswith(env_prefix):
                    param_name = key.replace(env_prefix, "").lower()
                    if "params" not in component_definition:
                        component_definition["params"] = {}
                    component_definition["params"][param_name] = value
                    # 
                    # new code here
                    # 
                    logger.info(
                        "Param '%s' of component '%s' overwritten with environment variable '%s' value '%s'.",
                        param_name,
                        name,
                        key,
                        "***",
                    )

Additional context Defining this in the haystack-pipeline.yaml yields the correct results as yamls are able to parse int.

  - name: Preprocessor
    type: PreProcessor
    params:
      split_length: 5
      split_overlap: 1

To Reproduce

FAQ Check

System:

  • OS: Ubuntu
  • GPU/CPU: GPU
  • Haystack version (commit or version number): v1.16.1 from Docker hub
  • DocumentStore: Elastic Search
  • Reader: FARMReader (deepset/roberta-base-squad2)
  • Retriever: EmbeddingRetriever (text-embedding-ada-002 Azure)
  • Prompt: PromptNode (gpt-35-turbo Azure)

LisaGLH avatar Jul 05 '23 14:07 LisaGLH

Hi @LisaGLH Thank you for reporting this and for suggesting how to fix it. Would you maybe like to open a pull request with your fix? That would be great! 🙂

julian-risch avatar Aug 30 '23 08:08 julian-risch