
File written by `polars.DataFrame.write_ipc` read incorrectly

Open ForceBru opened this issue 11 months ago • 9 comments

Python code that writes the file:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["polars<=1.21.0"]
# ///
import polars as pl

pl.DataFrame({'text': "this is some text".split()}).write_ipc("data.arrow")

Polars can read this file:

>>> import polars as pl
>>> pl.read_ipc("data.arrow")
shape: (4, 1)
┌──────┐
│ text │
│ ---  │
│ str  │
╞══════╡
│ this │
│ is   │
│ some │
│ text │
└──────┘
>>>

Arrow.jl reads garbage:

julia> import Pkg; Pkg.status()
Status `~/tmp/Project.toml`
  [69666777] Arrow v2.8.0
  [a93c6f00] DataFrames v1.7.0

julia> using DataFrames; import Arrow

julia> DataFrame(Arrow.Table("./data.arrow"))
4×1 DataFrame
 Row │ text     
     │ String?  
─────┼──────────
   1 │ W1\0\0
   2 │ \xf2\xff
   3 │ \v\0\b\0
   4 │ \b\0\b\0

julia> 

Issue: this is not at all what Polars wrote to the file


Other data types are read properly:

> cat arrow_bug.py
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["polars<=1.21.0"]
# ///
from datetime import date
import polars as pl

pl.DataFrame({
    'text': "this is some text".split(),
    'date': [date(2025,1,i+1) for i in range(4)],
    'float': [float(i) for i in range(4)],
    'int': list(range(4))
}).write_ipc("dates.arrow")
> ./arrow_bug.py
> julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.3 (2025-01-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using DataFrames; import Arrow

julia> DataFrame(Arrow.Table("dates.arrow"))
4×4 DataFrame
 Row │ text      date        float     int    
     │ String?   Date?       Float64?  Int64? 
─────┼────────────────────────────────────────
   1 │ W1\0\0    2025-01-01       0.0       0
   2 │ \xf2\xff  2025-01-02       1.0       1
   3 │ \v\0\b\0  2025-01-03       2.0       2
   4 │ \b\0\b\0  2025-01-04       3.0       3

julia> 

ForceBru avatar Feb 09 '25 14:02 ForceBru

not at a computer but is _ipc the correct thing to write out?

Moelf avatar Feb 09 '25 14:02 Moelf

is _ipc the correct thing to write out?

Not sure, it's just what I've been using in Python. Should I be using a different write_ method to write Arrow files from Polars?

I tried write_ipc_stream, but Arrow.jl can't read the String column anyway:

> cat arrow_bug.py
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["polars<=1.21.0"]
# ///
from datetime import date
import polars as pl

df = pl.DataFrame({
    'text': "this is some text".split(),
    'date': [date(2025,1,i+1) for i in range(4)],
    'float': [float(i) for i in range(4)],
    'int': list(range(4))
})
df.write_ipc("dates.arrow")
df.write_ipc_stream("dates_stream.arrow")
> ./arrow_bug.py
> julia --project
julia> using DataFrames; import Arrow

julia> DataFrame(Arrow.Table("dates.arrow"))
4×4 DataFrame
 Row │ text      date        float     int    
     │ String?   Date?       Float64?  Int64? 
─────┼────────────────────────────────────────
   1 │ W1\0\0    2025-01-01       0.0       0
   2 │ \xf2\xff  2025-01-02       1.0       1
   3 │ \v\0\b\0  2025-01-03       2.0       2
   4 │ \b\0\b\0  2025-01-04       3.0       3

julia> DataFrame(Arrow.Table("dates_stream.arrow"))
4×4 DataFrame
 Row │ text              date        float     int    
     │ String?           Date?       Float64?  Int64? 
─────┼────────────────────────────────────────────────
   1 │ @\x01\0\0         2025-01-01       0.0       0
   2 │ \x04\0            2025-01-02       1.0       1
   3 │ \xf8\xff\xff\xff  2025-01-03       2.0       2
   4 │ \x04\0\0\0        2025-01-04       3.0       3

julia> 

Since the method's name is "write IPC stream", I also tried reading it with Julia's Arrow.Stream, but got this error:

julia> DataFrame(Arrow.Stream("dates_stream.arrow"))
ERROR: MethodError: Cannot `convert` an object of type Arrow.View{Union{Missing, String}} to an object of type String
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  convert(::Type{String}, ::StringManipulation.Decoration)
   @ StringManipulation ~/.julia/packages/StringManipulation/bMZ2A/src/decorations.jl:365
  convert(::Type{String}, ::Base.JuliaSyntax.Kind)
   @ Base /cache/build/builder-demeter6-3/julialang/julia-release-1-dot-11/base/JuliaSyntax/src/kinds.jl:975
  convert(::Type{String}, ::String)
   @ Base essentials.jl:461
  ...

Stacktrace:
  [1] convert(::Type{Union{Missing, String}}, x::Arrow.View{Union{Missing, String}})
    @ Base ./missing.jl:70
  [2] push!(a::Vector{Union{Missing, String}}, item::Arrow.View{Union{Missing, String}})
    @ Base ./array.jl:1260
  [3] add!
    @ ~/.julia/packages/Tables/8p03y/src/fallbacks.jl:140 [inlined]
  [4] eachcolumns
    @ ~/.julia/packages/Tables/8p03y/src/utils.jl:111 [inlined]
  [5] buildcolumns(schema::Tables.Schema{…}, rowitr::Tables.IteratorWrapper{…})
    @ Tables ~/.julia/packages/Tables/8p03y/src/fallbacks.jl:147
  [6] _columns
    @ ~/.julia/packages/Tables/8p03y/src/fallbacks.jl:274 [inlined]
  [7] columns
    @ ~/.julia/packages/Tables/8p03y/src/fallbacks.jl:258 [inlined]
  [8] DataFrame(x::Arrow.Stream; copycols::Nothing)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/other/tables.jl:57
  [9] DataFrame(x::Arrow.Stream)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/other/tables.jl:48
 [10] top-level scope
    @ REPL[4]:1
Some type information was truncated. Use `show(err)` to see complete types.

ForceBru avatar Feb 09 '25 14:02 ForceBru

I guess another check is to see if pyarrow can read it

Moelf avatar Feb 09 '25 16:02 Moelf

Yes, pyarrow can read files written by df.write_ipc and df.write_ipc_stream:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = ["polars==1.21.0", "pyarrow==19.0.0"]
# ///
from datetime import date
import polars as pl, pyarrow as pa

df = pl.DataFrame({
    'text': "this is some text".split(),
    'date': [date(2025,1,i+1) for i in range(4)],
    'float': [i * 0.7 for i in range(4)],
    'int': list(range(4))
})
print("!!!Writing df...")
df.write_ipc("dates.arrow")
df.write_ipc_stream("dates_stream.arrow")

print("\n!!!Reading IPC...")
with pa.OSFile("dates.arrow", 'rb') as src:
    data = pa.ipc.open_file(src).read_all()
    print(data)

print("\n!!!Reading IPC stream...")
with pa.OSFile("dates_stream.arrow", 'rb') as src:
    data = pa.ipc.open_stream(src).read_all()
    print(data)

Output:

> chmod +x code.py && ./code.py
!!!Writing df...

!!!Reading IPC...
pyarrow.Table
text: string_view
date: date32[day]
float: double
int: int64
----
text: [["this","is","some","text"]]
date: [[2025-01-01,2025-01-02,2025-01-03,2025-01-04]]
float: [[0,0.7,1.4,2.0999999999999996]]
int: [[0,1,2,3]]

!!!Reading IPC stream...
pyarrow.Table
text: string_view
date: date32[day]
float: double
int: int64
----
text: [["this","is","some","text"]]
date: [[2025-01-01,2025-01-02,2025-01-03,2025-01-04]]
float: [[0,0.7,1.4,2.0999999999999996]]
int: [[0,1,2,3]]

ForceBru avatar Feb 09 '25 23:02 ForceBru

More examples where Arrow.jl can't read the file:

> python
Python 3.12.7 (main, Jan 17 2025, 16:55:27) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.DataFrame({'text': ['this is some text'] * 10, 'more': ['hello']*10}).write_ipc("long.arrow")
>>> 
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
10×2 DataFrame
 Row │ text               more                 
     │ String?            String?              
─────┼─────────────────────────────────────────
   1 │ this is some text  W1\0\0\xff
   2 │ this is some text  \xf2\xff\xff\xff\x14
   3 │ this is some text  \v\0\b\0\n
   4 │ this is some text  \b\0\b\0\0
   5 │ this is some text  \x04\0\0\0\xec
   6 │ this is some text  \x18\0\0\0\x01
   7 │ this is some text  \x11\0\b\0\0
   8 │ this is some text  \x04\0\x04\0\x04
   9 │ this is some text  \xec\xff\xff\xff,
  10 │ this is some text  \x01\x18\0\0\x10
> 

A dataframe like pl.DataFrame({'ints': [0] * 10, 'ye': [5]*10, 'more': ['h'*L]*10}).write_ipc("long.arrow") is read incorrectly for 1<=L<=12 (checked manually), but is suddenly read fine for L==13:

> python
>>> import polars as pl; pl.DataFrame({'ints': [0] * 10, 'ye': [5]*10, 'more': ['h'*12]*10}).write_ipc("long.arrow")
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
10×3 DataFrame
 Row │ ints    ye      more                              
     │ Int64?  Int64?  String?                           
─────┼───────────────────────────────────────────────────
   1 │      0       5  W1\0\0\xff\xff\xff\xff\b\x01\0\0
   2 │      0       5  \xf2\xff\xff\xff\x14\0\0\0\x04\0…
   3 │      0       5  \v\0\b\0\n\0\x04\0\xf8\xff\xff\x…
   4 │      0       5  \b\0\b\0\0\0\x04\0\x03\0\0\0
   5 │      0       5  D\0\0\0\x04\0\0\0\xec\xff\xff\xff
   6 │      0       5   \0\0\0\x18\0\0\0\x01\x18\0\0
   7 │      0       5  \x04\0\x10\0\x11\0\b\0\0\0\f\0
   8 │      0       5  \xfc\xff\xff\xff\x04\0\x04\0\x04…
   9 │      0       5  \0\0\0\0\xec\xff\xff\xff8\0\0\0
  10 │      0       5  \x18\0\0\0\x01\x02\0\0\x10\0\x12…
> python
>>> import polars as pl; pl.DataFrame({'ints': [0] * 10, 'ye': [5]*10, 'more': ['h'*13]*10}).write_ipc("long.arrow")
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
10×3 DataFrame
 Row │ ints    ye      more          
     │ Int64?  Int64?  String?       
─────┼───────────────────────────────
   1 │      0       5  hhhhhhhhhhhhh
   2 │      0       5  hhhhhhhhhhhhh
   3 │      0       5  hhhhhhhhhhhhh
   4 │      0       5  hhhhhhhhhhhhh
   5 │      0       5  hhhhhhhhhhhhh
   6 │      0       5  hhhhhhhhhhhhh
   7 │      0       5  hhhhhhhhhhhhh
   8 │      0       5  hhhhhhhhhhhhh
   9 │      0       5  hhhhhhhhhhhhh
  10 │      0       5  hhhhhhhhhhhhh
> 

When strings are of different lengths, short ones are messed up:

> python
>>> from random import randint; col=[randint(1,50) for _ in range(10)]; print(col); import polars as pl; pl.DataFrame({'ints': [0] * 10, 'ye': [5]*10, 'more': ['h'*i for i in col]}).write_ipc("long.arrow")
[38, 5, 48, 32, 12, 3, 26, 23, 33, 37]
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
10×3 DataFrame
 Row │ ints    ye      more                              
     │ Int64?  Int64?  String?                           
─────┼───────────────────────────────────────────────────
   1 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh…
   2 │      0       5  \xf2\xff\xff\xff\x14
   3 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh…
   4 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
   5 │      0       5  D\0\0\0\x04\0\0\0\xec\xff\xff\xff
   6 │      0       5   \0\0
   7 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhh
   8 │      0       5  hhhhhhhhhhhhhhhhhhhhhhh
   9 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
  10 │      0       5  hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh…

I tried "weird" non-ASCII scripts like Devanagari, but couldn't trigger the bug.

ForceBru avatar Feb 10 '25 00:02 ForceBru

Here's a BoundsError: attempt to access 0-element Vector{Vector{UInt8}} at index [1]:

> python
>>> from random import randint; col=[randint(1,500) for _ in range(100)]; print(col); import polars as pl; pl.DataFrame({'more': ['नमस्ते'*i for i in col],'text':['k'*i for i in col]}).write_ipc("long.arrow")
[232, 143, 235, 324, 105, 114, 47, 455, 111, 132, 125, 327, 249, 355, 317, 156, 312, 481, 107, 404, 493, 343, 41, 430, 1, 13, 107, 125, 114, 172, 443, 307, 328, 331, 318, 292, 327, 175, 41, 483, 147, 340, 309, 346, 414, 333, 103, 147, 143, 335, 132, 88, 409, 473, 45, 108, 112, 282, 150, 334, 261, 428, 316, 385, 157, 458, 348, 207, 444, 140, 425, 69, 500, 222, 472, 35, 170, 431, 11, 125, 484, 346, 187, 441, 108, 237, 18, 466, 128, 467, 466, 391, 310, 318, 171, 331, 450, 90, 194, 465]
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
ERROR: BoundsError: attempt to access 0-element Vector{Vector{UInt8}} at index [1]
Stacktrace:
  [1] throw_boundserror(A::Vector{Vector{UInt8}}, I::Tuple{Int64})
    @ Base ./essentials.jl:14
  [2] getindex
    @ ./essentials.jl:916 [inlined]
  [3] getindex(l::Arrow.View{Union{Missing, String}}, i::Int64)
    @ Arrow ~/.julia/packages/Arrow/3GbnS/src/arraytypes/views.jl:61
  [4] getindex
    @ ~/.julia/packages/DataFrames/kcA9R/src/dataframe/dataframe.jl:517 [inlined]
  [5] _pretty_tables_highlighter_func(data::DataFrame, i::Int64, j::Int64)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/prettytables.jl:13
  [6] _text_process_data_cell(ptable::PrettyTables.ProcessedTable, cell_data::PrettyTables.UndefinedCell, cell_str::String, i::Int64, j::Int64, l::Int64, column_width::Int64, crayon::Crayons.Crayon, alignment::Symbol, highlighters::Ref{Any})
    @ PrettyTables ~/.julia/packages/PrettyTables/oVZqx/src/backends/text/print_cell.jl:108
  [7] _text_print_table!(display::PrettyTables.Display, ptable::PrettyTables.ProcessedTable, table_str::Matrix{Vector{String}}, actual_columns_width::Vector{Int64}, continuation_row_line::Int64, num_lines_in_row::Vector{Int64}, num_lines_around_table::Int64, body_hlines::Vector{Int64}, body_hlines_format::NTuple{4, Char}, continuation_row_alignment::Symbol, ellipsis_line_skip::Int64, highlighters::Ref{Any}, hlines::Vector{Int64}, tf::PrettyTables.TextFormat, text_crayons::PrettyTables.TextCrayons{Crayons.Crayon, Crayons.Crayon}, vlines::Vector{Int64})
    @ PrettyTables ~/.julia/packages/PrettyTables/oVZqx/src/backends/text/print_table.jl:237
  [8] _print_table_with_text_back_end(pinfo::PrettyTables.PrintInfo; alignment_anchor_fallback::Symbol, alignment_anchor_fallback_override::Dict{Int64, Symbol}, alignment_anchor_regex::Dict{Int64, Vector{Regex}}, autowrap::Bool, body_hlines::Vector{Int64}, body_hlines_format::Nothing, continuation_row_alignment::Symbol, crop::Symbol, crop_subheader::Bool, columns_width::Int64, display_size::Tuple{Int64, Int64}, equal_columns_width::Bool, ellipsis_line_skip::Int64, highlighters::Tuple{PrettyTables.Highlighter}, hlines::Vector{Symbol}, linebreaks::Bool, maximum_columns_width::Vector{Int64}, minimum_columns_width::Int64, newline_at_end::Bool, overwrite::Bool, reserved_display_lines::Int64, show_omitted_cell_summary::Bool, sortkeys::Bool, tf::PrettyTables.TextFormat, title_autowrap::Bool, title_same_width_as_table::Bool, vcrop_mode::Symbol, vlines::Vector{Int64}, border_crayon::Crayons.Crayon, header_crayon::Crayons.Crayon, omitted_cell_summary_crayon::Crayons.Crayon, row_label_crayon::Crayons.Crayon, row_label_header_crayon::Crayons.Crayon, row_number_header_crayon::Crayons.Crayon, subheader_crayon::Crayons.Crayon, text_crayon::Crayons.Crayon, title_crayon::Crayons.Crayon)
    @ PrettyTables ~/.julia/packages/PrettyTables/oVZqx/src/backends/text/text_backend.jl:371
  [9] _print_table(io::IO, data::Any; alignment::Vector{Symbol}, backend::Val{:auto}, cell_alignment::Nothing, cell_first_line_only::Bool, compact_printing::Bool, formatters::Tuple{typeof(DataFrames._pretty_tables_general_formatter)}, header::Tuple{Vector{String}, Vector{String}}, header_alignment::Symbol, header_cell_alignment::Nothing, limit_printing::Bool, max_num_of_columns::Int64, max_num_of_rows::Int64, renderer::Symbol, row_labels::Nothing, row_label_alignment::Symbol, row_label_column_title::String, row_number_alignment::Symbol, row_number_column_title::String, show_header::Bool, show_row_number::Bool, show_subheader::Bool, title::String, title_alignment::Symbol, kwargs::@Kwargs{alignment_anchor_fallback::Symbol, alignment_anchor_regex::Dict{Int64, Vector{Regex}}, crop::Symbol, ellipsis_line_skip::Int64, hlines::Vector{Symbol}, highlighters::Tuple{PrettyTables.Highlighter}, maximum_columns_width::Vector{Int64}, newline_at_end::Bool, reserved_display_lines::Int64, row_label_crayon::Crayons.Crayon, vcrop_mode::Symbol, vlines::Vector{Int64}})
    @ PrettyTables ~/.julia/packages/PrettyTables/oVZqx/src/print.jl:1059
 [10] _print_table
    @ ~/.julia/packages/PrettyTables/oVZqx/src/print.jl:934 [inlined]
 [11] #pretty_table#62
    @ ~/.julia/packages/PrettyTables/oVZqx/src/print.jl:825 [inlined]
 [12] pretty_table
    @ ~/.julia/packages/PrettyTables/oVZqx/src/print.jl:794 [inlined]
 [13] _show(io::Base.TTY, df::DataFrame; allrows::Bool, allcols::Bool, rowlabel::Symbol, summary::Bool, eltypes::Bool, rowid::Nothing, truncate::Int64, kwargs::@Kwargs{})
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/show.jl:253
 [14] _show
    @ ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/show.jl:147 [inlined]
 [15] #show#871
    @ ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/show.jl:352 [inlined]
 [16] show
    @ ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/show.jl:339 [inlined]
 [17] show(io::Base.TTY, mime::MIME{Symbol("text/plain")}, df::DataFrame)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/io.jl:150
 [18] display(d::TextDisplay, M::MIME{Symbol("text/plain")}, x::Any)
    @ Base.Multimedia ./multimedia.jl:254
 [19] display
    @ ./multimedia.jl:255 [inlined]
 [20] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [21] |>(x::DataFrame, f::typeof(display))
    @ Base ./operators.jl:926
 [22] top-level scope
    @ none:1

Also, sometimes data from the first column appears in the second column, but only for dataframes with more than about 30 rows:

> python
>>> col=[7 for _ in range(40)]; import polars as pl; pl.DataFrame({'more': ['नमस्ते'*i for i in col],'text':['k'*i for i in col]}).write_ipc("long.arrow")
>>> 
> julia --project -e "using DataFrames; import Arrow; Arrow.Table(\"long.arrow\") |> DataFrame |> display"
40×2 DataFrame
 Row │ more                          text                     
     │ String?                       String?                  
─────┼────────────────────────────────────────────────────────
   1 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  W1\0\0\xff\xff\xff
   2 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \xf2\xff\xff\xff\x14\0\0
   3 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \v\0\b\0\n\0\x04
   4 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \b\0\b\0\0\0\x04
   5 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x04\0\0\0\xec\xff\xff
   6 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x18\0\0\0\x01\x18\0
   7 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x11\0\b\0\0\0\f
   8 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x04\0\x04\0\x04\0\0
   9 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \xec\xff\xff\xff,\0\0
  10 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x01\x18\0\0\x10\0\x12
  11 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\f\0\0\0\0
  12 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x04\0\0\0mor # trying to spell "more", name of 1st column?
  13 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \xe8\0\0\0\x04\0\0
  14 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\x14\0\0
  15 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x10\0\x12\0\f\0\x04
  16 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\x90\0\0
  17 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\x0e
  18 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\x14\0\x02\0\0
  19 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\0
  20 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\0
  21 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\0
  22 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \x80\x02\0\0\0\0\0
  23 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  @\x16\0\0\0\0\0
  24 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  @\x16\0\0\0\0\0
  25 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\x02\0\0
  26 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\0
  27 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  \0\0\0\0\0\0\0
  28 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0 # न shouldn't be here
  29 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  30 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  31 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  32 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  33 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  34 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  35 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  36 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  37 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  38 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  39 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0
  40 │ नमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्तेनमस्ते  न\xe0\0\0\0

Pyarrow reads all of these correctly.

ForceBru avatar Feb 10 '25 00:02 ForceBru

text: string_view

It seems that arrow-julia doesn't support string view yet.

kou avatar Feb 10 '25 01:02 kou

Is string_view different than the new Utf8View that we support (added here: https://github.com/apache/arrow-julia/pull/512/files#diff-bdc4e5cd6aa22fdc5e659e805b70c4763308be9f41128c42db5eeb3c13ed8631)?

quinnj avatar Feb 10 '25 02:02 quinnj

Oh, sorry. They are the same type.

kou avatar Feb 10 '25 02:02 kou
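
A note on the 12-versus-13 cutoff reported earlier in the thread: it is consistent with the Utf8View (string_view) layout in the Arrow columnar format, where each value is a 16-byte view and strings of at most 12 bytes are stored inline in the view itself, while longer strings are referenced by a buffer index and offset. The sketch below only illustrates that layout in plain Python. The helper names `encode_view` and `decode_view` are hypothetical and are not part of Arrow.jl, Polars, or pyarrow, and nothing here claims to show where Arrow.jl actually goes wrong.

```python
import struct

# Assumption taken from the Arrow columnar format spec for BinaryView/Utf8View:
# values of at most 12 bytes are stored inline in the 16-byte view.
INLINE_MAX = 12


def encode_view(data: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Build one 16-byte view for `data` (hypothetical helper, illustration only)."""
    length = len(data)
    if length <= INLINE_MAX:
        # Short value: int32 length, then the bytes themselves, zero-padded to 12.
        return struct.pack("<i", length) + data.ljust(INLINE_MAX, b"\0")
    # Long value: int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    return struct.pack("<i4sii", length, data[:4], buffer_index, offset)


def decode_view(view: bytes, buffers: list[bytes]) -> bytes:
    """Read one 16-byte view back into its bytes (hypothetical helper)."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        # Inline case: the payload lives inside the view, not in a data buffer.
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return buffers[buffer_index][offset:offset + length]


short = b"h" * 12   # fits inline
long_ = b"h" * 13   # must live in a separate data buffer
data_buffers = [long_]

short_view = encode_view(short)
long_view = encode_view(long_, buffer_index=0, offset=0)

print(decode_view(short_view, []))           # no data buffer needed
print(decode_view(long_view, data_buffers))  # looked up in data_buffers[0]
```

A reader that always takes the buffer-and-offset path would return the right bytes for long values and unrelated bytes for short ones, which resembles the pattern in the tables above. That is a guess from the symptoms, not a confirmed diagnosis.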