daru
daru copied to clipboard
Consistency between reading and writing methods.
I think method name 'write_csv' is little confusing.
If you know 'from_csv' method, you will assume 'to_csv' method, not 'write_csv'.
I think we need method alias 'to_csv'. In addition, 'read_csv' alias might be good.
The to_csv
indicates a Ruby class change, like to_s
or to_i
. You expect it return a new object, not write or read a file, hence the naming. to_csv
will ideally return a CSV::Table
object or something like that (if we ever make that method).
This was also a bit confusing to me as well as a new Daru user. The write_csv
api wasn't discoverable after using from_csv
as I was expecting to_csv
.
While @v0dro makes a fair point about to_x
being something that should return a new object, could we consider the suggestion of the read_csv
alias in the original note? This would enable read_csv
<-> write_csv
symmetry and hopefully make write_csv
more discoverable. Perhaps we could also have examples use read_csv
or maybe note read_csv
as an option to from_csv
. Thoughts?
This, actually, could make sense.
Updated issue name.
@athityakumar I believe that could be part of your work for daru-io. In fact, read
/write
seems pretty consistent pair of methods (for reading and writing files)... and for reading and writing variables (e.g. dataframe.method_name # => string with json
), from
/to
could be reasonable. It looks like reasonable API, no?..
Daru::DataFrame.from_json('{key: value}')
Daru::DataFrame.read_json('foo.json')
dataframe.to_json # => {key: value}
dataframe.write_json('foo.json')
Maybe this explicitness could be better, than trying to guess from parameter, whether it path/url, or just a JSON string?.. Just thinking.
@zverok - Here are some pros and cons that I have in my mind regarding this. Feel free to discuss. 😄
Pros
- Having the
to_
andwrite_
methods distinct actually makes sense for file formats, especially when integrating with a web framework. This is a REALLY HUGE pro. 👍 - Simplifies initialization of daru-io modules, as arguments are now explicitly in a certain format. daru-io developers' life is made simpler (slightly).
Cons
-
While the
from_
andread_
methods do simplify the initialization part of daru-io modules, does it also make it slightly difficult / redundant / confusing to the user? That is, does imposing explicit arguments like,from_json
- takes string or Array or Hash as inputread_json
- takes local / remote json file as inputmake it better for the user or does it seem confusing? IMHO, I wouldn't prefer this change just because it makes the initialization part simpler - if it's going to be a burden on the users.
In this regard, we can probably just simply alias
read_
tofrom_
for having symmetry between the methods.
I like @zverok 's idea. @athityakumar you should go ahead with that. Just make sure you document this properly.
TBH, now when I have time to think, I am not sure about the idea. But now my mind travels as follows:
-
from_something
could (should be treated like the first argument is an instance of something. Examples: instance of a JSON string (from_json
, withfrom_json_string
implied but shortened), instance of an SQL DB connection (from_db
orfrom_sql
orfrom_ar
, depending on exact case). We can agree at least on this one. - some of the formats (definitely not all of them) could with equal probability have source in a string, or in a file which should be read into a string first. Probable examples: CSV, YAML, less probable: JSON, improbable (doesn't make sense): ActiveRecord connection;
-
some of others could most probably be only in files, like XLSX or SQLite DB (OK, there is small probability you've received XLSX through the network or DB blob, but it can be easily masked with Ruby's
IO
interface); - mostly the same could be said about writing:
- "save this df to CSV" could easily mean "to CSV string" or "to CSV file";
- "save this df to JSON" in most cases probably means "JSON string";
- "...to AR connection" definitely means "to this special AR connection object"
- "...to XLSX" is most definitely means "to file by this path"
To add more mind-blowing complications :trollface:, we can also think of:
-
write_csv
is not necessary justto_csv => File.write
(for performance/memory reasons it can write file in chunks); - writing to XLSX potentially can mean something like "write this dataframe on the fifth sheet of existing file"....
So.... On the constructive part, it probably means:
- that we can stick, at least for now, at idea of 4 methods (from/read + to/write)
- ...and say that any adapter could implement any number of them
- ...and probably provide easy default implementations of read/write (read as File.read + from(...)), but not exposed by default, something meta-programming-y:
def from_yaml(str)
...
end
define_read_method # defines easy implementaiton of read_yaml(path)
WDYT?..
I think the present functionality of from_
should stay. It has become quite widespread due to talks/blogs/notebooks and will confuse users. The from_*
methods can be extended to say "give me any object from which I can read data". This object can be the filename or the DB adapter object depending on the importer.
Lets stick to providing the 4 methods for each adapter. If one of them does not exist, simply throw an error. I don't see any reason to "generate" importers/exporters. In how many cases will these generated importers actually work? Moreover, they can create an illusion to users that something useful exists for a particular importer when its actually nothing useful at all.
@v0dro @zverok - Now that a sample Rails example showcasing daru-io & daru-view is under development here, I think that to_format
should simply return a string rather than returning a Format
object. For example, have a look at this PR - for downloading a Daru::DataFrame
into filename.format
file in Rails, it'd be more convenient if to_json
returns a string that'd be written into a file rather than a nested Array / Hash.
I just wanted to clarify this before re-writing the to and write methods.
How about to_format.to_s
returning the String
?
How about to_format.to_s returning the String?
Hmm, then - all the respective classes would have to be monkey-patched with to_s
of our convenience (and overriding the existing to_s
if it exists). For example, to_csv
returns a CSV
object whose to_s
might be different from our need. Wouldn't this be unnecessarily complex? Or, how about just keeping different to_s
, to_format
and write_format
for each Exporter?
to_s
=> String that can be used to write into downloadable files in Rails app
to_format
=> The gem object (like CSV
/ Spreadsheet::Workbook
)
write_format
=> Writes into given file
Yes. The to_s
, to_format
and write_format
approach is good. The to_s
method will return the same string that will be written to file by write_format
right?
@v0dro - Yes, to_s
will directly return a string that can be written into the write_format
file.
Um, sorry for not thinking about this one minor issue. Should this to_s
be available only from Daru::IO
and not be monkey-patched into Daru::DataFrame
? If it's to be monkey-patched, maybe something like df.to_format_s(args)
/ df.to_format_string(args)
will do? If no monkey-patching is required, then something like Daru::IO::Exporters::JSON.new(df, args).to_s
will do.
@v0dro @zverok - Please share your preferences. 😄
Just monkey patch to_format_string
and document it nicely.