NIF Panic when reading parquet files from S3

Open rohfosho opened this issue 11 months ago • 14 comments

Code

Explorer.DataFrame.from_parquet("s3://path/to/file.parquet", config: %{FSS.S3.config_from_system_env() | region: "us-west-2"})

Expected

A working DataFrame

Actual

** (ErlangError) Erlang error: :nif_panicked
   (explorer 0.10.1) Explorer.PolarsBackend.Native.lf_compute(%Explorer.PolarsBackend.LazyFrame{resource: #Reference<0.1199058728.3890348049.170513>})
   (explorer 0.10.1) lib/explorer/polars_backend/data_frame.ex:286: Explorer.PolarsBackend.DataFrame.from_parquet/4
   iex:1: (file)

Note: When I try to load the parquet file lazily, I get a more detailed stacktrace:

Explorer.DataFrame.from_parquet("s3://path/to/file.parquet", config: %{FSS.S3.config_from_system_env() | region: "us-west-2"}, lazy: true)
#Inspect.Error<
  got ErlangError with message:

      """
      Erlang error: :nif_panicked
      """

  while inspecting:

      %{
        data: %Explorer.PolarsBackend.LazyFrame{
          resource: #Reference<0.3078565455.1145176087.107417>
        },
        remote: nil,
        names: ["id", "point", "rarity", "type"],
        __struct__: Explorer.DataFrame,
        groups: [],
        dtypes: %{
          "id" => :string,
          "point" => :string,
          "rarity" => :string,
          "type" => :string
        }
      }

  Stacktrace:

    (explorer 0.10.1) Explorer.PolarsBackend.Native.lf_fetch(%Explorer.PolarsBackend.LazyFrame{resource: #Reference<0.3078565455.1145176087.107417>}, 50)
    (explorer 0.10.1) lib/explorer/polars_backend/lazy_frame.ex:74: Explorer.PolarsBackend.LazyFrame.inspect/2
    (explorer 0.10.1) lib/explorer/data_frame.ex:6379: Inspect.Explorer.DataFrame.inspect/2
    (elixir 1.16.1) lib/inspect/algebra.ex:347: Inspect.Algebra.to_doc/2
    (elixir 1.16.1) lib/kernel.ex:2351: Kernel.inspect/2
    (iex 1.16.1) lib/iex/evaluator.ex:376: IEx.Evaluator.io_inspect/1
    (iex 1.16.1) lib/iex/evaluator.ex:335: IEx.Evaluator.eval_and_inspect/3
    (iex 1.16.1) lib/iex/evaluator.ex:306: IEx.Evaluator.eval_and_inspect_parsed/3

>

Context

This only happens when I deploy to staging or prod (using Docker with the elixir-1.18.1 base image and Mix releases). It works perfectly when I'm developing locally (on macOS).

rohfosho avatar Jan 14 '25 05:01 rohfosho

Can you execute any other operation? If nothing works, then the cause is most likely incompatible gcc/musl versions. You can check the README information on precompilation: https://github.com/elixir-explorer/explorer?tab=readme-ov-file#precompilation

josevalim avatar Jan 14 '25 07:01 josevalim

I tested two different operations, with one succeeding and one resulting in the same NIF panic:

First test was the example used in #1011

Mix.install([{:explorer, "~> 0.10.0"}])

name_dtype = {"names",
{:list,
 {:struct,
  [
    {"language", :string},
    {"name", :string},
    {"transliteration", :category},
    {"type", :category}
  ]}}}

[
  %{names: []},
  %{names: [%{name: "CABK", type: "acronym", language: nil, transliteration: "none"}]}
]
|> Explorer.DataFrame.new(dtypes: [name_dtype])
|> dbg

Which resulted in a NIF panic:

[iex:6: (file)]
[
  %{names: []},
  %{names: [%{name: "CABK", type: "acronym", language: nil, transliteration: "none"}]}
] #=> [
  %{names: []},
  %{
    names: [
      %{name: "CABK", type: "acronym", language: nil, transliteration: "none"}
    ]
  }
]
|> Explorer.DataFrame.new(dtypes: [name_dtype]) #=> #Inspect.Error<
  got ErlangError with message:

      """
      Erlang error: :nif_panicked
      """

  while inspecting:

      %{
        data: %Explorer.PolarsBackend.DataFrame{
          resource: #Reference<0.3723385053.1587150849.9786>
        },
        remote: nil,
        names: ["names"],
        __struct__: Explorer.DataFrame,
        groups: [],
        dtypes: %{
          "names" => {:list,
           {:struct,
            [
              {"language", :string},
              {"name", :string},
              {"transliteration", :category},
              {"type", :category}
            ]}}
        }
      }

  Stacktrace:

    (explorer 0.10.1) Explorer.PolarsBackend.Native.s_to_list(#Explorer.PolarsBackend.Series<
  #Reference<0.3723385053.1586364425.234579>
>)
    (explorer 0.10.1) lib/explorer/polars_backend/shared.ex:24: Explorer.PolarsBackend.Shared.apply_series/3
    (explorer 0.10.1) lib/explorer/backend/data_frame.ex:324: anonymous fn/3 in Explorer.Backend.DataFrame.build_cols_algebra/3
    (elixir 1.18.1) lib/enum.ex:1714: Enum."-map/2-lists^map/1-1-"/2
    (explorer 0.10.1) lib/explorer/backend/data_frame.ex:283: Explorer.Backend.DataFrame.inspect/5
    (explorer 0.10.1) lib/explorer/data_frame.ex:6379: Inspect.Explorer.DataFrame.inspect/2
    (elixir 1.18.1) lib/inspect/algebra.ex:348: Inspect.Algebra.to_doc/2
    (elixir 1.18.1) lib/kernel.ex:2376: Kernel.inspect/2

>

The second operation I tried was creating a simple dataframe and that succeeded:

df = Explorer.DataFrame.new(%{
  "id" => ["a", "b", "c"],
  "type" => ["x", "y", "z"]
})

Output:

#Explorer.DataFrame<
  Polars[3 x 2]
  id string ["a", "b", "c"]
  type string ["x", "y", "z"]
>

I'm deploying using Mix releases and Docker with the base image being elixir-1.18.1

I also verified that during the build process I'm correctly downloading the precompiled NIF:

[debug] Downloading NIF from https://github.com/elixir-nx/explorer/releases/download/v0.10.1/libexplorer-v0.10.1-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz

rohfosho avatar Jan 14 '25 17:01 rohfosho

@rohfosho Thank you for the additional info. Can you possibly share a dataframe which exhibits the panic you originally saw? #1011 is still an open issue, so it panicking is expected.

billylanchantin avatar Jan 14 '25 17:01 billylanchantin

@billylanchantin No problem! My use case is that I'm currently trying to read a parquet file straight from s3 that has 4 columns

%{
  "id" => :string,
  "point" => :string,
  "rarity" => :string,
  "type" => :string
}

I see that Explorer is able to pull it down and see the different columns but doesn't make it past that. I can send over a sample file if that's helpful!

rohfosho avatar Jan 14 '25 17:01 rohfosho

Yeah that'd be great, thanks!

billylanchantin avatar Jan 14 '25 17:01 billylanchantin

@billylanchantin GitHub won't let me upload parquet files here. Is it cool if I DM you on the Elixir Slack?

rohfosho avatar Jan 14 '25 17:01 rohfosho

Just sent it via Slack. Let me know if you prefer something else and I can upload to google drive!

rohfosho avatar Jan 14 '25 17:01 rohfosho

OK, I got the file and some more info from Slack.

  • from_parquet: works on their local machine but not in prod
  • load_parquet: works on their local machine and in prod

So they're technically unblocked right now since they can use load_parquet instead. But the bug is still there.

As a sanity check, I ran our setup-localstack.sh and uploaded the file to a local amazon-ec2-metadata-mock container (like we do with our wine dataset). I ran a modified version of our S3 test:

@tag :cloud_integration
test "reads rohfosho's parquet file from S3" do
  config = %FSS.S3.Config{
    access_key_id: "test",
    secret_access_key: "test",
    endpoint: "http://localhost:4566",
    region: "us-east-1"
  }

  assert {:ok, df} =
           DF.from_parquet("s3://test-bucket/rohfosho.parquet",
             config: config
           )

  df |> DF.print()
end

which passed. So it must be something more specific to their setup; I don't know what yet. @josevalim any ideas?

billylanchantin avatar Jan 14 '25 19:01 billylanchantin
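
The load_parquet workaround mentioned above could look roughly like the sketch below. It is only an illustration: the thread does not say how the file bytes were fetched from S3 in production, so the binary here is produced locally with dump_parquet!/1 as a stand-in for the downloaded object, and the column values are made up.

```elixir
Mix.install([{:explorer, "~> 0.10.0"}])

alias Explorer.DataFrame, as: DF

# Stand-in for the bytes of the object downloaded from S3. Which client
# was used for the download is not stated in the thread, so this part is
# an assumption made to keep the sketch self-contained.
parquet_bytes =
  DF.new(%{"id" => ["a", "b", "c"], "type" => ["x", "y", "z"]})
  |> DF.dump_parquet!()

# load_parquet/2 parses an in-memory binary, so the file is not read
# through the S3 path that from_parquet/2 takes in the original report.
{:ok, df} = DF.load_parquet(parquet_bytes)

DF.print(df)
```

The trade-off is that the whole file has to be held in memory as a binary before it is parsed, which may matter for large objects.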

Hey, I suspect that this may be some library missing inside the container. Can you print the result of the following command?

ldd -v /path/to/the/extracted/lib.so

The path is printed right after you install Explorer. There is a small bug, though: the printed path ends with tar.gz when it should not, so just omit that suffix and it will work fine.

philss avatar Jan 14 '25 22:01 philss
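
If the printed path is no longer at hand, the library can probably also be located from a running node. The sketch below is an assumption-laden convenience, not something suggested in the thread: it assumes a Linux host with ldd on the PATH, that Explorer is already loaded (for example in a release remote console or iex -S mix), and that the precompiled .so sits under the application's priv/native directory.

```elixir
# Resolve Explorer's priv directory; :code.priv_dir/1 returns a charlist
# or {:error, :bad_name} when the application is not available.
so_files =
  case :code.priv_dir(:explorer) do
    {:error, :bad_name} ->
      []

    dir ->
      # Assumed layout: priv/native/libexplorer-*.so
      Path.wildcard(Path.join([to_string(dir), "native", "*.so"]))
  end

# Print the ldd report for every shared object that was found.
for path <- so_files do
  IO.puts("== #{path}")
  {output, _exit_status} = System.cmd("ldd", ["-v", path], stderr_to_stdout: true)
  IO.puts(output)
end
```

An empty result means the library is not where this sketch expects it, in which case the path printed at install time is the one to use.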

@philss sure! here you go

ldd -v _build/prod/rel/oracle/lib/explorer-0.10.1/priv/native/libexplorer-v0.10.1-nif-2.15-x86_64-unknown-linux-gnu.so
        linux-vdso.so.1 (0x00007ffd65942000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007baebffe2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007baebffdd000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007baebfefe000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007baebfef9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007baebfd18000)
        /lib64/ld-linux-x86-64.so.2 (0x00007baec3f42000)

        Version information:
        _build/prod/rel/oracle/lib/explorer-0.10.1/priv/native/libexplorer-v0.10.1-nif-2.15-x86_64-unknown-linux-gnu.so:
                libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.3) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_4.2.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libpthread.so.0 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libpthread.so.0
                libpthread.so.0 (GLIBC_2.12) => /lib/x86_64-linux-gnu/libpthread.so.0
                libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
                libm.so.6 (GLIBC_2.27) => /lib/x86_64-linux-gnu/libm.so.6
                libm.so.6 (GLIBC_2.29) => /lib/x86_64-linux-gnu/libm.so.6
                libdl.so.2 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libdl.so.2
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.7) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.9) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.17) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.18) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.25) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.28) => /lib/x86_64-linux-gnu/libc.so.6
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
        /lib/x86_64-linux-gnu/libgcc_s.so.1:
                libc.so.6 (GLIBC_2.35) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.34) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libpthread.so.0:
                libc.so.6 (GLIBC_ABI_DT_RELR) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libm.so.6:
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
                libc.so.6 (GLIBC_ABI_DT_RELR) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libdl.so.2:
                libc.so.6 (GLIBC_ABI_DT_RELR) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libc.so.6:
                ld-linux-x86-64.so.2 (GLIBC_2.35) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_2.2.5) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2

rohfosho avatar Jan 15 '25 16:01 rohfosho

@rohfosho thank you for the info! Based on what you sent, I think there's nothing wrong there. Would you mind sharing the Dockerfile, or the full base image tag that you are using? It may make the problem easier to reproduce.

philss avatar Jan 17 '25 12:01 philss

@philss for sure, here you go:

# syntax = docker/dockerfile:1.2

# Use the official Elixir image as the base image
FROM elixir:1.18.1 AS builder

# Set the working directory inside the container
WORKDIR /app

# Install required system dependencies
RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    build-essential \
    git \
    nodejs \
    npm \
    postgresql-client \
    python3

RUN mix local.hex --force && \
    mix local.rebar --force

# Copy the mix files first for better docker build caching
COPY mix.exs ./
COPY mix.lock ./

RUN mix deps.get

# Compile the dependencies, set MIX_ENV beforehand
ARG MIX_ENV=prod
ENV MIX_ENV=${MIX_ENV}
RUN mix deps.compile

# Now copy the whole project to avoid rebuilding the deps when the source code changes
COPY . . 

# Set permissions for the release script
RUN chmod +x release.sh

# Source Code Compilation Stage
FROM builder AS compiler

# Set the working directory
WORKDIR /app

# Execute release script
RUN --mount=type=secret,id=_env,dst=/etc/secrets/.env ./release.sh

# New stage for the runtime image to reduce the final size
FROM elixir:1.18.1

# Set the working directory
WORKDIR /app

# Install minimal dependencies
RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
    nodejs \
    npm \
    postgresql-client \
    python3

COPY --from=compiler /app/ .

# Expose the port the application will run on
EXPOSE 4000

# Define the entrypoint for the application
ENTRYPOINT ["/app/_build/${MIX_ENV}/rel/my_app/bin/my_app"]

# Start the Phoenix application
CMD ["start"]

rohfosho avatar Jan 17 '25 19:01 rohfosho

@rohfosho sorry for the delay. I built a container image and ran the code, but I couldn't reproduce the problem. I'm running in a Linux environment (Fedora 41 - w/ Podman). If you don't mind, can you run your code with the EXPLORER_USE_LEGACY_ARTIFACTS env var configured to "true"? This might be something related to legacy CPUs.

Another option would be to compile from source, from our main branch, and see if the problem persists. We updated Polars recently, so it may be working there.

philss avatar Jan 31 '25 02:01 philss

When reading from a parquet file, passing rechunk: true fixed this error for me.

bcxbb avatar May 06 '25 19:05 bcxbb
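
For reference, passing that option might look like the sketch below. The comment above does not say whether the file was local or on S3, so this example uses a temporary local file with made-up data; whether :rechunk has any effect on the S3 read from the original report is not confirmed anywhere in the thread.

```elixir
Mix.install([{:explorer, "~> 0.10.0"}])

alias Explorer.DataFrame, as: DF

# Hypothetical scratch file, used only so the example is self-contained.
path = Path.join(System.tmp_dir!(), "rechunk_example.parquet")

DF.new(%{"id" => ["a", "b", "c"], "type" => ["x", "y", "z"]})
|> DF.to_parquet!(path)

# :rechunk asks the backend to make each column contiguous in memory
# after reading, by merging its chunks into a single array.
df = DF.from_parquet!(path, rechunk: true)

DF.print(df)
```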