DataToolkit.jl
Tutorial for adding a new data loader
As discussed in tecosaur/DataToolkitCommon.jl#10, here is a short docs writeup of the process of creating the Arrow loader as an example of how to add a loader to the package. Let me know what you think and of course feel free to adapt, extend or rephrase! (I would have made a PR but don't understand how the docs work).
Tutorial: Adding a new loader/writer
In case your favourite data format is not supported yet by DataToolkit, fret not! It is relatively straightforward to add a new loader to the package, and PRs adding new loaders are welcome. The following briefly outlines the process, using the loader/writer for the Arrow format as an example.
Step 1: Adding the main loader / writer functions
Each loader has its own file in the src/transformers/saveload directory. So as a first step, we add a new file arrow.jl there and make sure to include("transformers/saveload/arrow.jl") in src/DataToolkitCommon.jl, next to the other loader files.
We're now ready to add methods to the main package functions responsible for loading and saving data, which are aptly called load and save. Starting with the loader, we add a method to load which dispatches on DataLoader{:arrow}, takes an IO and allows the specification of a sink type to read data into, e.g. a DataFrame. The final function looks as follows:
function load(loader::DataLoader{:arrow}, io::IO, sink::Type)
    @import Arrow
    convert = @getparam loader."convert"::Bool true
    result = Arrow.Table(io; convert) |>
        if sink == Any || sink == Arrow.Table
            identity
        elseif QualifiedType(sink) == QualifiedType(:DataFrames, :DataFrame)
            sink
        end
    result
end
This function includes four things:
1. An @import statement for the Arrow package, which we use for reading a .arrow file.
2. Use of the @getparam macro to obtain arguments to the wrapped loader function (Arrow.Table, in our case) from the Data.toml file and to set their defaults. Here, we just need to specify the single convert argument, but in principle, there can be many.
3. Reading the data from io, most likely using a package and including the arguments obtained in step 2 (here: Arrow.Table(io; convert)).
4. Conversion to the specified sink type. Note the use of QualifiedType, which needs to be specified separately.
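The sink-selection step of the load method can be tried in isolation. The sketch below is only an illustration of the piping pattern: NativeTable and OtherTable are hypothetical stand-ins for Arrow.Table and DataFrame, and types are compared directly rather than through QualifiedType. It also adds an explicit error branch, which the load method above does not have; without one, an unmatched sink makes the if block evaluate to nothing, and piping into nothing throws a MethodError.

```julia
# Hypothetical stand-ins for the natively returned type and an alternative sink.
struct NativeTable
    rows::Vector{Int}
end

struct OtherTable
    rows::Vector{Int}
end

# The sink must be constructible from the native result,
# in the same way the load method calls `sink` on the table it read.
OtherTable(t::NativeTable) = OtherTable(t.rows)

function toy_load(sink::Type)
    native = NativeTable([1, 2, 3])  # stands in for reading from `io`
    native |>
        if sink == Any || sink == NativeTable
            identity                 # keep the native result as-is
        elseif sink == OtherTable
            sink                     # call the sink type as a converter
        else
            # Added for this sketch: fail with a readable message.
            throw(ArgumentError("unsupported sink type: $sink"))
        end
end

as_native = toy_load(Any)
as_other = toy_load(OtherTable)
```

Whether an explicit error branch is wanted in a real loader depends on how load is called by the package, which is not covered here.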
The sink types supported by the loader, which are resolved in step 4, are declared by adding a method for the supportedtypes function. Here, we specify two possible return types: Arrow.Table, which is returned natively by the Arrow.jl package, and DataFrame from the DataFrames.jl package:
supportedtypes(::Type{DataLoader{:arrow}}) =
    [QualifiedType(:DataFrames, :DataFrame),
     QualifiedType(:Arrow, :Table)]
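One plausible reason for describing DataFrame by name is that the DataFrames package may not be loaded at the point where supportedtypes is called; this is an assumption about the design, not something stated above. The toy type below (ToyQualifiedType, hypothetical and far simpler than the real QualifiedType) sketches the idea of comparing a loaded type against a description made of a module name and a type name.

```julia
# Hypothetical, simplified stand-in for QualifiedType: a type described by
# the name of its defining module plus its own name.
struct ToyQualifiedType
    pkg::Symbol
    name::Symbol
end

# Build the description from a type that is already loaded.
ToyQualifiedType(T::Type) = ToyQualifiedType(nameof(parentmodule(T)), nameof(T))

# A description written by hand, without referring to the type itself.
wanted = ToyQualifiedType(:Base, :Dict)

# Compare a concrete, loaded type against the hand-written description.
matches = ToyQualifiedType(Dict{String, Int}) == wanted
mismatch = ToyQualifiedType(Set{Int}) == wanted
```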
The writer follows a similar overall structure: @import the necessary packages, obtain the writer arguments using @getparam, and then write the data in tbl to io. Here's the save method for the arrow writer:
function save(writer::DataWriter{:arrow}, io::IO, tbl)
    @import Arrow
    compress         = @getparam writer."compress"::Union{Symbol, Nothing} nothing
    alignment        = @getparam writer."alignment"::Int 8
    dictencode       = @getparam writer."dictencode"::Bool false
    dictencodenested = @getparam writer."dictencodenested"::Bool false
    denseunions      = @getparam writer."denseunions"::Bool true
    largelists       = @getparam writer."largelists"::Bool false
    maxdepth         = @getparam writer."maxdepth"::Int 6
    ntasks           = @getparam writer."ntasks"::Int Int(typemax(Int32))
    Arrow.write(
        io, tbl;
        compress, alignment,
        dictencode, dictencodenested,
        denseunions, largelists,
        maxdepth, ntasks)
end
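To see how the @getparam lines relate to a dataset's configuration, here is a rough stand-alone sketch. It assumes that writer parameters sit as plain keys next to driver in a [[<dataset>.writer]] table of Data.toml. The dataset name mytable and the helper toy_getparam are hypothetical, and the helper only imitates the default-plus-type-check behaviour that the macro's syntax suggests; it does not reproduce the macro.

```julia
using TOML

# Hypothetical Data.toml excerpt; the dataset name and layout are assumptions.
data_toml = """
[[mytable]]

    [[mytable.writer]]
    driver = "arrow"
    alignment = 16
    dictencode = true
"""

# Parameters of the first writer of the first "mytable" entry.
writer_params = TOML.parse(data_toml)["mytable"][1]["writer"][1]

# Toy imitation of `@getparam writer."key"::T default`:
# use the configured value if present, otherwise the default,
# and insist that the value has the declared type.
function toy_getparam(params::AbstractDict, key::String, ::Type{T}, default::T) where {T}
    value = get(params, key, default)
    value isa T || throw(ArgumentError("parameter \"$key\" should be of type $T"))
    return value
end

alignment  = toy_getparam(writer_params, "alignment", Int, 8)      # set in the file
dictencode = toy_getparam(writer_params, "dictencode", Bool, false) # set in the file
maxdepth   = toy_getparam(writer_params, "maxdepth", Int, 6)        # falls back to the default
```

Only keys that differ from the defaults would need to appear in the file under this assumption; everything else falls back to the value given in the save method.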
We also need to add a method to the create function for our loader with a regex to recognize files of our data format:
create(::Type{DataLoader{:arrow}}, source::String) =
    !isnothing(match(r"\.arrow$"i, source))
...and a method to createpriority specifying... TODO: what exactly?
createpriority(::Type{DataLoader{:arrow}}) = 10
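The regex used in the create method can be checked on its own before wiring it into the package. The helper name is_arrow_source below is hypothetical; the pattern is the one from the create method above: case-insensitive and anchored to the end of the source string.

```julia
# Same test as in the `create` method, wrapped in a stand-alone helper.
is_arrow_source(source::String) = !isnothing(match(r"\.arrow$"i, source))

lowercase_path = is_arrow_source("data/table.arrow")      # plain file path
uppercase_path = is_arrow_source("TABLE.ARROW")           # the `i` flag ignores case
compressed     = is_arrow_source("table.arrow.gz")        # `$` requires .arrow at the very end
other_format   = is_arrow_source("table.csv")
```

Sources with a trailing extension such as .gz are not recognised by this pattern, so a loader meant to handle them would need a different regex.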
Finally, we add a docstring specifying how to use our loader/writer:
const ARROW_DOC = md"""
[...]
"""
That's the full content of the new arrow.jl file!
Step 2: Adding the new loader to package initialization
To make things work, we now just need to add two more things to the __init__() function in src/DataToolkitCommon.jl:
- A line specifying the necessary packages used by our loader with their respective UUIDs (which you can obtain from their respective Project.toml). In our case that is @addpkg Arrow "69666777-d1a9-59fb-9406-91d4454c9d45"
- A line adding our docstring to the package documentation. In our case, we just add (:loader, :arrow) => ARROW_DOC, to the list of docstrings in the append!(DataToolkitBase.TRANSFORMER_DOCUMENTATION, ...) call further below.
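If you prefer not to copy the UUID by hand, it can be read from a package's Project.toml with Julia's TOML standard library. This is a small sketch: the string below stands in for the relevant part of Arrow.jl's Project.toml, and the helper addpkg_line is hypothetical and only formats the line to paste into __init__().

```julia
using TOML

# Stand-in for the contents of a package's Project.toml (only the relevant keys).
project_toml = """
name = "Arrow"
uuid = "69666777-d1a9-59fb-9406-91d4454c9d45"
"""

# Format the @addpkg line from a parsed Project.toml.
addpkg_line(project::AbstractDict) =
    string("@addpkg ", project["name"], " \"", project["uuid"], "\"")

line = addpkg_line(TOML.parse(project_toml))
println(line)
```

For a file on disk, TOML.parsefile(path) returns the same kind of dictionary.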
It's taken a while, but in the recent docs work I've taken a look at this and hopefully managed to incorporate the key bits into https://tecosaur.github.io/DataToolkit.jl/core/transformers/ and https://tecosaur.github.io/DataToolkit.jl/common/contributing/.
Let me know if you've got any feedback 🙂
These look great, I think this can be closed now! 🎉 The only thing I was thinking is maybe you could add a link to one of the easier implementations under /contributing so people have a reference to look at.
Glad to hear!