HttpResponseError in prompt_pipeline.py and prompt_eval.py scripts: Value cannot be null. (Parameter 'bytes')

Open Stefano-Salvatori opened this issue 9 months ago • 1 comments

We are experiencing an intermittent issue when running the prompt_pipeline or prompt_eval scripts in the pipelines. The error encountered is:

azure.core.exceptions.HttpResponseError: (UserError) Value cannot be null. (Parameter 'bytes') Code: UserError Message: Value cannot be null. (Parameter 'bytes')

This issue seems to occur randomly. Most of the time, it blocks the execution of the pipelines, but occasionally, the scripts run without any errors. The problem started occurring on Monday, March 24th, 2025.

Complete Traceback:

File "/home/azureuser/myagent/_work/5/s/llmops/common/prompt_pipeline.py", line 356, in prepare_and_execute
    run = pf.run(
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/azure/_pf_client.py", line 305, in run
    return self.runs.create_or_update(run=run, **kwargs)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/_sdk/_telemetry/activity.py", line 265, in wrapper
    return f(self, *args, **kwargs)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/azure/operations/_run_operations.py", line 187, in create_or_update
    self.stream(run=run.name)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/_sdk/_telemetry/activity.py", line 265, in wrapper
    return f(self, *args, **kwargs)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/azure/operations/_run_operations.py", line 641, in stream
    available_logs = self._get_log(flow_run_id=run.name)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/azure/operations/_run_operations.py", line 543, in _get_log
    return self._service_caller.caller.bulk_runs.get_flow_run_log_content(
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/azure/core/tracing/decorator.py", line 116, in wrapper_use_tracer
    return func(*args, **kwargs)
File "/home/azureuser/myagent/_work/_tool/Python/3.9.19/x64/lib/python3.9/site-packages/promptflow/azure/_restclient/flow/operations/_bulk_runs_operations.py", line 973, in get_flow_run_log_content
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (UserError) Value cannot be null. (Parameter 'bytes') Code: UserError Message: Value cannot be null. (Parameter 'bytes')

Additional Information:

The issue started on March 24th, 2025. The environment uses Python 3.9.19, and we updated all promptflow packages to 1.17.2.

Request: We need assistance in identifying the root cause of this error and a potential fix or workaround to ensure the scripts run consistently without interruption.

Apr 02 '25 09:04 Stefano-Salvatori

We found that the problem is related to streaming mode (stream is set to True in the scripts). With this parameter, the code calls an API (/logContent) that is not yet available when the call is made, so it receives a 400 error.

The workaround to fix the issue is to set stream=False in both scripts and add a polling mechanism to check the status of the job. We implemented this method and placed it after the runs are executed in the scripts:

import logging
import time

from promptflow.azure import PFClient
from promptflow.entities import Run

logger = logging.getLogger(__name__)


def is_job_completed(job: Run) -> bool:
    """
    Check if the job is completed.

    Returns:
        bool: True if the job is completed, False otherwise.
    """
    return job.status in ("Completed", "Finished")


def poll_job_status(
    pf_client: PFClient, job: Run, max_retries: int = 200, polling_interval: int = 15
) -> Run:
    """
    Poll the job status until it is completed or max retries are reached.

    Args:
        pf_client: PFClient instance
        job: Run instance
        max_retries: maximum number of retries
        polling_interval: time in seconds to wait between polls

    Returns:
        Run: the last job state retrieved
    """
    number_of_retries = 0
    while not is_job_completed(job) and number_of_retries < max_retries:
        logger.info(f"Job status: {job.status}")
        # Wait first, then refresh, so the loop condition sees a fresh status.
        time.sleep(polling_interval)
        job = pf_client.runs.get(job.name)
        number_of_retries += 1
    if is_job_completed(job):
        logger.info("Job completed")
    else:
        logger.info(f"Max retries ({max_retries}) exceeded. Job not completed")
    return job

Now the pipelines are OK
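
For anyone adapting this, one limitation of the helper above is that it only recognises success states, so a run that fails would be polled until max_retries is exhausted. Below is a minimal sketch of the same polling idea that also stops on failure. It is not part of our scripts: poll_until_terminal, get_status, and the sleep parameter are hypothetical names, and the "Failed" and "Canceled" status strings are assumptions that should be checked against what the service actually returns. It runs against a fake status source, so it can be tried without a workspace.

```python
import time
from typing import Callable

# "Completed" and "Finished" come from the helper above; "Failed" and
# "Canceled" are ASSUMED names for unsuccessful terminal states.
SUCCESS_STATES = {"Completed", "Finished"}
FAILURE_STATES = {"Failed", "Canceled"}
TERMINAL_STATES = SUCCESS_STATES | FAILURE_STATES


def poll_until_terminal(
    get_status: Callable[[], str],
    max_retries: int = 200,
    polling_interval: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state or retries run out.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = get_status()
    retries = 0
    while status not in TERMINAL_STATES and retries < max_retries:
        sleep(polling_interval)
        status = get_status()
        retries += 1
    return status


# Dry run with a fake status source (hypothetical sequence of statuses).
fake_statuses = iter(["NotStarted", "Running", "Running", "Failed"])
result = poll_until_terminal(lambda: next(fake_statuses), polling_interval=0)
print(result)
```

With a real client, get_status could be something like lambda: pf_client.runs.get(job.name).status, mirroring the call used in poll_job_status. The caller can then decide whether a status outside SUCCESS_STATES should fail the pipeline step.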

Apr 07 '25 08:04 Stefano-Salvatori