daru icon indicating copy to clipboard operation
daru copied to clipboard

Consistency between reading and writing methods.

Open genya0407 opened this issue 7 years ago • 15 comments

I think method name 'write_csv' is little confusing.

If you know 'from_csv' method, you will assume 'to_csv' method, not 'write_csv'.

I think we need method alias 'to_csv'. In addition, 'read_csv' alias might be good.

genya0407 avatar Nov 06 '16 06:11 genya0407

The to_csv indicates a Ruby class change, like to_s or to_i. You expect it return a new object, not write or read a file, hence the naming. to_csv will ideally return a CSV::Table object or something like that (if we ever make that method).

v0dro avatar Nov 11 '16 17:11 v0dro

This was also a bit confusing to me as well as a new Daru user. The write_csv api wasn't discoverable after using from_csv as I was expecting to_csv.

While @v0dro makes a fair point about to_x being something that should return a new object, could we consider the suggestion of the read_csv alias in the original note? This would enable read_csv <-> write_csv symmetry and hopefully make write_csv more discoverable. Perhaps we could also have examples use read_csv or maybe note read_csv as an option to from_csv. Thoughts?

parthm avatar Jun 20 '17 17:06 parthm

This, actually, could make sense.

zverok avatar Jun 20 '17 19:06 zverok

Updated issue name.

@athityakumar I believe that could be part of your work for daru-io. In fact, read/write seems pretty consistent pair of methods (for reading and writing files)... and for reading and writing variables (e.g. dataframe.method_name # => string with json), from/to could be reasonable. It looks like reasonable API, no?..

Daru::DataFrame.from_json('{key: value}')
Daru::DataFrame.read_json('foo.json')
dataframe.to_json # => {key: value}
dataframe.write_json('foo.json')

Maybe this explicitness could be better, than trying to guess from parameter, whether it path/url, or just a JSON string?.. Just thinking.

zverok avatar Jun 20 '17 19:06 zverok

@zverok - Here are some pros and cons that I have in my mind regarding this. Feel free to discuss. 😄

Pros
  • Having the to_ and write_ methods distinct actually makes sense for file formats, especially when integrating with a web framework. This is a REALLY HUGE pro. 👍
  • Simplifies initialization of daru-io modules, as arguments are now explicitly in a certain format. daru-io developers' life is made simpler (slightly).
Cons
  • While the from_ and read_ methods do simplify the initialization part of daru-io modules, does it also make it slightly difficult / redundant / confusing to the user? That is, does imposing explicit arguments like,

    from_json - takes string or Array or Hash as input read_json - takes local / remote json file as input

    make it better for the user or does it seem confusing? IMHO, I wouldn't prefer this change just because it makes the initialization part simpler - if it's going to be a burden on the users.

    In this regard, we can probably just simply alias read_ to from_ for having symmetry between the methods.

athityakumar avatar Jun 23 '17 16:06 athityakumar

I like @zverok 's idea. @athityakumar you should go ahead with that. Just make sure you document this properly.

v0dro avatar Jul 03 '17 16:07 v0dro

TBH, now when I have time to think, I am not sure about the idea. But now my mind travels as follows:

  • from_something could (should be treated like the first argument is an instance of something. Examples: instance of a JSON string (from_json, with from_json_string implied but shortened), instance of an SQL DB connection (from_db or from_sql or from_ar, depending on exact case). We can agree at least on this one.
  • some of the formats (definitely not all of them) could with equal probability have source in a string, or in a file which should be read into a string first. Probable examples: CSV, YAML, less probable: JSON, improbable (doesn't make sense): ActiveRecord connection;
  • some of others could most probably be only in files, like XLSX or SQLite DB (OK, there is small probability you've received XLSX through the network or DB blob, but it can be easily masked with Ruby's IO interface);
  • mostly the same could be said about writing:
    • "save this df to CSV" could easily mean "to CSV string" or "to CSV file";
    • "save this df to JSON" in most cases probably means "JSON string";
    • "...to AR connection" definitely means "to this special AR connection object"
    • "...to XLSX" is most definitely means "to file by this path"

To add more mind-blowing complications :trollface:, we can also think of:

  • write_csv is not necessary just to_csv => File.write (for performance/memory reasons it can write file in chunks);
  • writing to XLSX potentially can mean something like "write this dataframe on the fifth sheet of existing file"....

So.... On the constructive part, it probably means:

  • that we can stick, at least for now, at idea of 4 methods (from/read + to/write)
  • ...and say that any adapter could implement any number of them
  • ...and probably provide easy default implementations of read/write (read as File.read + from(...)), but not exposed by default, something meta-programming-y:
def from_yaml(str)
...
end

define_read_method # defines easy implementaiton of read_yaml(path)

WDYT?..

zverok avatar Jul 05 '17 10:07 zverok

I think the present functionality of from_ should stay. It has become quite widespread due to talks/blogs/notebooks and will confuse users. The from_* methods can be extended to say "give me any object from which I can read data". This object can be the filename or the DB adapter object depending on the importer.

Lets stick to providing the 4 methods for each adapter. If one of them does not exist, simply throw an error. I don't see any reason to "generate" importers/exporters. In how many cases will these generated importers actually work? Moreover, they can create an illusion to users that something useful exists for a particular importer when its actually nothing useful at all.

v0dro avatar Jul 06 '17 15:07 v0dro

@v0dro @zverok - Now that a sample Rails example showcasing daru-io & daru-view is under development here, I think that to_format should simply return a string rather than returning a Format object. For example, have a look at this PR - for downloading a Daru::DataFrame into filename.format file in Rails, it'd be more convenient if to_json returns a string that'd be written into a file rather than a nested Array / Hash.

I just wanted to clarify this before re-writing the to and write methods.

athityakumar avatar Aug 09 '17 00:08 athityakumar

How about to_format.to_s returning the String?

v0dro avatar Aug 11 '17 08:08 v0dro

How about to_format.to_s returning the String?

Hmm, then - all the respective classes would have to be monkey-patched with to_s of our convenience (and overriding the existing to_s if it exists). For example, to_csv returns a CSV object whose to_s might be different from our need. Wouldn't this be unnecessarily complex? Or, how about just keeping different to_s , to_format and write_format for each Exporter?

to_s => String that can be used to write into downloadable files in Rails app to_format => The gem object (like CSV / Spreadsheet::Workbook) write_format => Writes into given file

athityakumar avatar Aug 11 '17 20:08 athityakumar

Yes. The to_s, to_format and write_format approach is good. The to_s method will return the same string that will be written to file by write_format right?

v0dro avatar Aug 12 '17 11:08 v0dro

@v0dro - Yes, to_s will directly return a string that can be written into the write_format file.

athityakumar avatar Aug 12 '17 12:08 athityakumar

Um, sorry for not thinking about this one minor issue. Should this to_s be available only from Daru::IO and not be monkey-patched into Daru::DataFrame? If it's to be monkey-patched, maybe something like df.to_format_s(args) / df.to_format_string(args) will do? If no monkey-patching is required, then something like Daru::IO::Exporters::JSON.new(df, args).to_s will do.

@v0dro @zverok - Please share your preferences. 😄

athityakumar avatar Aug 12 '17 20:08 athityakumar

Just monkey patch to_format_string and document it nicely.

v0dro avatar Aug 15 '17 10:08 v0dro