[BUG] fugue_sql intermittently throwing segmentation fault errors

Open jstammers opened this issue 1 year ago • 3 comments

Minimal Code To Reproduce

Describe the bug: I have a set of unit tests that check the functionality of code that uses the fugue_sql API with a DuckDB backend. When I run these tests locally, they all pass without any issue. However, when I run them as part of a GitHub Actions workflow, I frequently encounter a segmentation fault at the following location:

Current thread 0x00007f4e61554740 (most recent call first):
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 101 in as_arrow
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 110 in as_local_bounded
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/dataframe/dataframe.py", line 90 in as_local
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/execution_engine.py", line 521 in convert_yield_dataframe
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 147 in set_result
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 293 in execute
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 683 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 171 in run_single
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 155 in run_tasks
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 129 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 270 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_workflow_context.py", line 54 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/workflow.py", line 1584 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/sql/api.py", line 107 in fugue_sql
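
For readers unfamiliar with this output: the "Current thread ... (most recent call first)" layout is the format written by Python's standard-library faulthandler module, which can dump Python-level frames when the interpreter receives a fatal signal. The sketch below is only an illustration of that format, not part of the failing project; the helper name `capture_stack_dump` is hypothetical. It asks faulthandler for a dump on demand instead of waiting for a crash.

```python
import faulthandler
import tempfile


def capture_stack_dump() -> str:
    """Return the stack dump faulthandler writes for the running threads."""
    # faulthandler writes straight to a file descriptor, so a real file
    # (not an in-memory buffer) is required.
    with tempfile.TemporaryFile(mode="w+b") as fh:
        faulthandler.dump_traceback(file=fh, all_threads=True)
        fh.seek(0)
        return fh.read().decode()


dump = capture_stack_dump()
print(dump)
```

To get the same kind of dump when a real crash happens outside a test runner, the interpreter can be started with `python -X faulthandler`, or `faulthandler.enable()` can be called early in the program.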

The function that fails has the following form

import pandas as pd
from fugue import api as fa

def filter_df(
    df: pd.DataFrame,
    outlets: pd.DataFrame,
    adjustments: pd.DataFrame,
):
    query = """keys = SELECT DateId, ProductId, LocationId, AdjustmentFactor, AdjustmentType, id
    FROM adjustments INNER JOIN outlets USING (LocationId)
    fdt = SELECT * FROM keys INNER JOIN df USING (DateId, ProductId, LocationId)"""
    result = fa.fugue_sql(
        query,
        df=df,
        outlets=outlets,
        adjustments=adjustments,
        engine='duckdb',
        as_fugue=True,
    )
    return result.as_pandas()

And I have multiple unit tests that call this function. It's difficult to fully isolate the problem as I can't fully reproduce it locally.

In this instance, I have been able to refactor my function to use the fugue api, but it would be good to be able to use the fugue_sql API for more complex queries where the SQL syntax is more suitable.

from fugue import api as fa

df = fa.join(...)
df = fa.filter(...)

Expected behavior: I would expect these unit tests to run successfully.

Environment (please complete the following information):

  • Backend: pandas (duckdb)
  • Backend version: 0.8.2
  • Python version: 3.10
  • OS: linux

jstammers, Apr 11 '23 18:04

@jstammers thanks for reporting. What DuckDB version are you using?

I remember that with earlier DuckDB versions (<3) I often saw segmentation faults, but I have never seen this happen with later versions.

goodwanghan, Apr 12 '23 22:04

One problem I have seen in unit tests that use DuckDB is weird behavior caused by a DuckDB connection not being properly closed at a certain step, which then causes issues in the steps that follow.
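
As a rough illustration of the "always close the connection" idea in plain Python (this is a sketch under assumptions, not how Fugue manages its DuckDB connections internally, and it has not been confirmed to resolve this issue): the helper name `run_query_and_close` and its `connect` parameter are hypothetical. The parameter exists only so the pattern can be tried with any DB-API style connection factory; by default it uses `duckdb.connect`.

```python
from contextlib import closing


def run_query_and_close(query, connect=None):
    """Run one query on a fresh in-memory connection and always close it."""
    if connect is None:
        # Assumption: the duckdb package is installed. It is imported lazily
        # so the helper can also be used with another connection factory.
        import duckdb

        connect = duckdb.connect
    # closing() calls con.close() on exit, even if the query raises.
    with closing(connect(":memory:")) as con:
        return con.execute(query).fetchall()
```

In a test suite the same pattern could live in a fixture, so that each test gets its own connection and no open connection leaks into the next test.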

goodwanghan, Apr 12 '23 23:04

Hi @goodwanghan, thanks for looking into this. I'm currently using 0.7.1, which I believe is the latest version. It wouldn't surprise me if this is related to a previous DuckDB connection not being properly closed, but for now I will stick with the fugue API.

jstammers, Apr 17 '23 08:04