Serde.jl icon indicating copy to clipboard operation
Serde.jl copied to clipboard

Preserve headers order by default in CSV serialisation

Open NeroBlackstone opened this issue 1 year ago • 20 comments

Pull request checklist

  • [x] Did you bump the project version?
  • [x] Did you add a description to the Pull Request?
  • [x] Did you add new tests?
  • [x] Did you add reviewers?

Related issue

  • Preserve headers order by default in CSV serialisation
  • Serde.to_flatten now returns OrderedDict

NeroBlackstone avatar May 03 '24 18:05 NeroBlackstone

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 71.78%. Comparing base (85e37b4) to head (1586f26).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #45   +/-   ##
=======================================
  Coverage   71.78%   71.78%           
=======================================
  Files          24       24           
  Lines        1056     1056           
=======================================
  Hits          758      758           
  Misses        298      298           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar May 03 '24 18:05 codecov-commenter

Hi, @NeroBlackstone! Seems good to me. @artememelin any suggestions?

dmitrii-doronin avatar May 03 '24 20:05 dmitrii-doronin

I checked method to_csv with a benchmark and saw that this solution requires at least 2 times more memory allocations and runtime

using Serde
using BenchmarkTools

vals = Vector{Dict{String,Int64}}(undef, 100000)
for i in eachindex(vals)
    dict = Dict{String, Int64}()
    for c in 'a':'z'
        dict[string(c)] = i
    end
    vals[i] = dict
end

@benchmark Serde.to_csv(vals)

master

# BenchmarkTools.Trial: 5 samples with 1 evaluation.
#  Range (min … max):  1.081 s …   1.176 s  ┊ GC (min … max): 12.11% … 17.56%
#  Time  (median):     1.112 s              ┊ GC (median):    14.70%
#  Time  (mean ± σ):   1.129 s ± 41.683 ms  ┊ GC (mean ± σ):  15.29% ±  2.49%

#   █              █  █                                 █   █  
#   █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
#   1.08 s         Histogram: frequency by time        1.18 s <

#  Memory estimate: 742.09 MiB, allocs estimate: 14786791.

PR/45

# BenchmarkTools.Trial: 3 samples with 1 evaluation.
#  Range (min … max):  2.161 s …   2.214 s  ┊ GC (min … max): 13.52% … 16.22%
#  Time  (median):     2.191 s              ┊ GC (median):    15.40%
#  Time  (mean ± σ):   2.189 s ± 26.668 ms  ┊ GC (mean ± σ):  15.06% ±  1.38%

#   █                              █                        █  
#   █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
#   2.16 s         Histogram: frequency by time        2.21 s <

#  Memory estimate: 1.39 GiB, allocs estimate: 33973506.

Is it possible to do something about this?

artememelin avatar May 06 '24 10:05 artememelin

I think we should try to use vectors instead of dictionaries, where possible

Therefore, it seems to me that it is worth adding a new parameter for method to_flatten, which will determine the type of the returned object (something similar to this) Then method to_flatten will return Dict by default, but inside method to_csv, you can specify Vector as the return type of to_flatten I think this should improve performance

artememelin avatar May 06 '24 11:05 artememelin

That's a tricky question. I don't think that this function was originally designed with unions in mind. We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case. I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?

dmitrii-doronin avatar May 07 '24 16:05 dmitrii-doronin

Yeah, I agree with you. I think we should disallow unions other than Maybe here.

dmitrii-doronin avatar May 08 '24 07:05 dmitrii-doronin

That seems very reasonable to me. Do you need any assistance with the implementation?

dmitrii-doronin avatar May 08 '24 18:05 dmitrii-doronin

I see that a lot of work has been done It's great that performance has increased! Soon I will take a closer look at all the changes and return with feedback

artememelin avatar May 13 '24 16:05 artememelin

I had to format a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: https://github.com/NeroBlackstone/Serde.jl/pull/1 Can you see if everything is ok and nothing is broken?

Also, I noticed that now we have two implementations of the to_csv method that went in slightly different ways. Is it possible to somehow try to unify them into one?

artememelin avatar May 13 '24 21:05 artememelin

Okay, for now let there be two implementations @dmitrii-doronin Any other ideas or suggestions?

artememelin avatar May 15 '24 09:05 artememelin

Hi, @NeroBlackstone! I will look through the PRs today. Hopefully, we'll merge everything.

dmitrii-doronin avatar May 23 '24 08:05 dmitrii-doronin