Serde.jl
Preserve header order by default in CSV serialisation
Pull request checklist
- [x] Did you bump the project version?
- [x] Did you add a description to the Pull Request?
- [x] Did you add new tests?
- [x] Did you add reviewers?
Related issue
- Preserve header order by default in CSV serialisation
- Serde.to_flatten now returns OrderedDict
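For context, an OrderedDict (from OrderedCollections.jl) iterates its keys in insertion order, while a plain Dict does not guarantee any order; this illustrative snippet (not taken from the PR) shows why returning OrderedDict keeps the CSV headers stable:

```julia
using OrderedCollections

# A plain Dict may reorder keys; an OrderedDict keeps insertion order,
# so a flattened row produces headers in a predictable order.
od = OrderedDict("b" => 1, "a" => 2, "c" => 3)
println(collect(keys(od)))  # prints ["b", "a", "c"]
```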
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.78%. Comparing base (85e37b4) to head (1586f26).
Additional details and impacted files
@@ Coverage Diff @@
##           master      #45   +/-   ##
=======================================
  Coverage   71.78%   71.78%
=======================================
  Files          24       24
  Lines        1056     1056
=======================================
  Hits          758      758
  Misses        298      298
=======================================
:umbrella: View full report in Codecov by Sentry.
Hi, @NeroBlackstone! Seems good to me. @artememelin any suggestions?
I benchmarked the to_csv method and saw that this solution requires at least twice the memory allocations and runtime:
using Serde
using BenchmarkTools

# Build 100,000 rows, each a Dict with 26 keys ("a" through "z")
vals = Vector{Dict{String,Int64}}(undef, 100000)
for i in eachindex(vals)
    dict = Dict{String,Int64}()
    for c in 'a':'z'
        dict[string(c)] = i
    end
    vals[i] = dict
end

@benchmark Serde.to_csv(vals)
master
# BenchmarkTools.Trial: 5 samples with 1 evaluation.
# Range (min … max): 1.081 s … 1.176 s ┊ GC (min … max): 12.11% … 17.56%
# Time (median): 1.112 s ┊ GC (median): 14.70%
# Time (mean ± σ): 1.129 s ± 41.683 ms ┊ GC (mean ± σ): 15.29% ± 2.49%
# █ █ █ █ █
# █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
# 1.08 s Histogram: frequency by time 1.18 s <
# Memory estimate: 742.09 MiB, allocs estimate: 14786791.
PR/45
# BenchmarkTools.Trial: 3 samples with 1 evaluation.
# Range (min … max): 2.161 s … 2.214 s ┊ GC (min … max): 13.52% … 16.22%
# Time (median): 2.191 s ┊ GC (median): 15.40%
# Time (mean ± σ): 2.189 s ± 26.668 ms ┊ GC (mean ± σ): 15.06% ± 1.38%
# █ █ █
# █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
# 2.16 s Histogram: frequency by time 2.21 s <
# Memory estimate: 1.39 GiB, allocs estimate: 33973506.
Is it possible to do something about this?
I think we should try to use vectors instead of dictionaries where possible.
Therefore, it seems to me that it is worth adding a new parameter to the to_flatten method that determines the type of the returned object (something similar to this).
Then to_flatten will return a Dict by default, but inside the to_csv method you can specify Vector as the return type of to_flatten.
I think this should improve performance.
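The suggestion above could be sketched with dispatch on the requested return type. This is only a hypothetical illustration of the idea (my_to_flatten and flatten_pairs are made-up names, not Serde's actual implementation):

```julia
# Hypothetical sketch of dispatching on a requested return type;
# NOT Serde's actual to_flatten implementation.
flatten_pairs(d::AbstractDict) = [string(k) => v for (k, v) in d]

# Public default: return a Dict, preserving current behavior.
my_to_flatten(d::AbstractDict) = my_to_flatten(Dict, d)
my_to_flatten(::Type{Dict}, d::AbstractDict) = Dict(flatten_pairs(d))

# Internal fast path for to_csv: a Vector of pairs avoids the
# hashing work of building a Dict for every row.
my_to_flatten(::Type{Vector}, d::AbstractDict) = flatten_pairs(d)

d = Dict("a" => 1, "b" => 2)
my_to_flatten(d)          # a Dict, as before
my_to_flatten(Vector, d)  # a Vector{Pair{String,Int64}}
```

The point of the Vector fast path is that to_csv only iterates the flattened pairs once, so it never needs the key lookups a Dict is built for.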
That's a tricky question. I don't think that this function was originally designed with unions in mind. We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case. I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?
Yeah, I agree with you. I think we should disallow unions other than Maybe here.
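One way such a restriction could be expressed, assuming Maybe{T} means Union{Nothing,T} (a sketch of the idea only, not Serde's actual code):

```julia
# Sketch: accept plain types and Maybe{T} = Union{Nothing,T},
# reject all other unions. Assumption, not Serde's actual code.
is_maybe(::Type) = false
is_maybe(::Type{Union{Nothing,T}}) where {T} = true

function check_field_type(T::Type)
    if T isa Union && !is_maybe(T)
        throw(ArgumentError("unions other than Maybe{T} are not supported: $T"))
    end
    return T
end

check_field_type(Union{Nothing,Int})  # accepted
check_field_type(Int)                 # accepted
# check_field_type(Union{Int,String})  # would throw ArgumentError
```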
That seems very reasonable to me. Do you need any assistance with the implementation?
I see that a lot of work has been done. It's great that performance has increased! Soon I will take a closer look at all the changes and return with feedback.
I had to format a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: https://github.com/NeroBlackstone/Serde.jl/pull/1. Can you check that everything is OK and nothing is broken?
Also, I noticed that we now have two implementations of the to_csv method that took slightly different approaches.
Is it possible to unify them into one?
Okay, for now let there be two implementations. @dmitrii-doronin, any other ideas or suggestions?
Hi, @NeroBlackstone! I will look through the PRs today. Hopefully, we'll merge everything.