software paper

Open ericmjl opened this issue 7 years ago • 24 comments

@zbarry @szuckerman, I would like to invite you to participate in the pyjanitor software manuscript that I am writing.

I am writing it in a branch off master: https://github.com/ericmjl/pyjanitor/blob/whitepaper/paper/manuscript.md

At the moment, I am seeking out input on:

  1. Current known limitations of pyjanitor.
  2. Possible extensions.
  3. Readability.

If you would like to participate, please put in a PR against the whitepaper branch and add your name!

ericmjl avatar Nov 28 '18 15:11 ericmjl

Thanks Eric; this is quite the honor.

zbarry avatar Nov 28 '18 15:11 zbarry

Great! I'd be glad to take a look.

szuckerman avatar Nov 28 '18 17:11 szuckerman

Tidy data manuscript citation to possibly use: https://www.jstatsoft.org/article/view/v059i10

zbarry avatar Nov 28 '18 18:11 zbarry

I'm just stream-of-consciousness'ing things here that I might add in myself later.

Cite the SO study

zbarry avatar Nov 28 '18 18:11 zbarry

I think the wishlist section code example you've given argues very clearly why pyjanitor is so game-changing.

zbarry avatar Nov 28 '18 18:11 zbarry

More significant has been the contributions from data scientists seeking a cleaner API for cleaning data

The connotation here is probably not what is intended.

zbarry avatar Nov 28 '18 19:11 zbarry

Diagrams showing how a DataFrame is progressively mutated over a chain of methods might be interesting.

zbarry avatar Nov 28 '18 19:11 zbarry

Provide a note in the architecture section that chaining does not copy data on each call unless copying is the point of the method. That's an important point for people concerned about performance.
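A minimal sketch of the distinction such a note would draw. It uses a hypothetical `ToyFrame` class as a stand-in for a DataFrame; none of these names come from pyjanitor or pandas, and it makes no claim about how any particular pyjanitor method behaves.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a dict of column name -> list."""

    def __init__(self, columns):
        self.columns = columns

    def rename_column(self, old, new):
        # Mutates in place and returns self, so every link in the chain
        # works on the same underlying object.
        self.columns[new] = self.columns.pop(old)
        return self

    def copy(self):
        # Copying is the whole point of this method, so it allocates new lists.
        return ToyFrame({name: list(values) for name, values in self.columns.items()})


original = ToyFrame({"A": [1, 2], "B": [3, 4]})
chained = original.rename_column("A", "a").rename_column("B", "b")
snapshot = chained.copy()

print(chained is original)   # the chain never left the original object
print(snapshot is original)  # the explicit copy is a separate object
```

In this sketch only the explicit `copy()` call pays for new storage, which is the property a performance-minded reader would want spelled out.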

zbarry avatar Nov 28 '18 19:11 zbarry

I'm just stream-of-consciousness'ing things here that I might add in myself later.

Cite the SO study

Already done.

ericmjl avatar Nov 28 '18 20:11 ericmjl

More significant has been the contributions from data scientists seeking a cleaner API for cleaning data

The connotation here is probably not what is intended.

Wait what? Not sure what you mean by that.

ericmjl avatar Nov 28 '18 20:11 ericmjl

Diagrams showing how a DataFrame is progressively mutated over a chain of methods might be interesting.

I think that belongs in the docs, which could be much improved. There is already one example available, copied directly from the janitor repository.

ericmjl avatar Nov 28 '18 20:11 ericmjl

I think that belongs in the docs, which could be much improved. There is already one example available, copied directly from the janitor repository.

I was thinking in more of an overview figure sense so people can get a clear visual indication of how you can easily track the chain and its effects on the DataFrame, though it's not super important. It's more for improving visual appeal of the paper than anything else.

Newcomer contributors to open source have made their maiden contributions to pyjanitor, and experienced software engineers have also chipped in. More significant has been the contributions from data scientists seeking a cleaner API for cleaning data.

Wasn't very clear what I meant, haha. Idk, when I was reading this in a meeting, I kind of snickered because for whatever reason, to me at the time, it read like "yeah, sure, these people chipped in, but these other guys are more important" which, while that may not be an incorrect statement, could possibly be said in a different way. Not sure if others would read it like that.

zbarry avatar Nov 29 '18 14:11 zbarry

Wasn't very clear what I meant, haha. Idk, when I was reading this in a meeting, I kind of snickered because for whatever reason, to me at the time, it read like "yeah, sure, these people chipped in, but these other guys are more important" which, while that may not be an incorrect statement, could possibly be said in a different way. Not sure if others would read it like that.

Got it. Yes, could use rephrasing. Feel free to PR a change.

ericmjl avatar Nov 29 '18 15:11 ericmjl

Just curious - what would be the eventual format the paper would be written in? E.g., Word, LaTeX, etc.? If it's the latter, there are good templates out there for structuring manuscripts, of course. Happy to do the typesetting if we go for it.

zbarry avatar Dec 09 '18 16:12 zbarry

@zbarry that'll depend, but for an arXiv deposition, I'd probably use some tooling I already have to convert the markdown text into LaTeX, which involves Pandoc in the loop. It's something I did for my thesis, where building a PDF of the paper can be a continuous integration build step as well. It'll be like how readthedocs builds docs, except for papers!

I think you're more well-versed in LaTeX typesetting than I would be, so if you'd like to build the template, please go for it!
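As a rough sketch of what a markdown-to-LaTeX build step could look like: the output name `manuscript.tex` and both helper functions are assumptions for illustration, not the tooling described above. The Pandoc options used (`--from`, `--to`, `--standalone`, `--output`) are standard ones.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the Pandoc argument list for a markdown -> LaTeX conversion."""
    return [
        "pandoc", source,
        "--from", "markdown",
        "--to", "latex",
        "--standalone",  # emit a complete document, not a fragment
        "--output", target,
    ]


def build(source="manuscript.md", target="manuscript.tex"):
    """Run the conversion if Pandoc is installed; return whether it ran."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(source, target), check=True)
    return True


print(" ".join(pandoc_command("manuscript.md", "manuscript.tex")))
```

A CI job could call `build()` on every push and publish the result, much like a docs build.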

ericmjl avatar Dec 09 '18 17:12 ericmjl

Whoa, that's cool lol. I'd expect nothing less. Sounds good.

zbarry avatar Dec 09 '18 18:12 zbarry

Not sure how comprehensive you're trying to be with the "Comparison to other tools" section, but another couple of tidyverse-influenced libraries for Python are plydata (https://github.com/has2k1/plydata) and kadro (https://github.com/koaning/kadro). I think both are some of the most interesting prior art in the Python space.

Also, one that is not tied to the tidyverse but has similar objectives, with a verb-based method-chaining approach to data preparation, is pdpipe (https://github.com/shaypal5/pdpipe).

jcmkk3 avatar Dec 28 '18 21:12 jcmkk3

@jcmkk3 thanks for the feedback! Yes, I will have to update the "comparison to other tools" as well.

ericmjl avatar Jan 12 '19 18:01 ericmjl

Going to start a GitHub projects board for this.

ericmjl avatar Jan 12 '19 18:01 ericmjl

Inviting @eli-s-goldberg and @zsailer onto the thread.

@eli-s-goldberg I have finally gotten to reviewing your proposed changes, and I definitely like and value the feedback. I am making changes on the basis of this. I'd also like to invite you to contribute an end-user testimony of some kind to the paper - if you're inclined! Totally understand if you'd like to decline, given your load at the moment.

@zsailer I am also inviting you onto the thread because pandas-flavor has been very enabling for this project. At the moment, I am wondering if you are open to contributing comments on whether additional description about pandas-flavor would help educate an end-user about pyjanitor's architecture? Again, also only if you're inclined to do so!

I hope to keep things lightweight for both of you, since I'm sure both of you are super busy with your respective things. If you don't want to go through the hassle of a PR, I'm happy to accept your contributions via the issue tracker (i.e. just copy/paste the text you'd like added or modified).

ericmjl avatar Jan 24 '19 01:01 ericmjl

@ericmjl thanks for the invite!

I'd be happy to contribute. I'll take a closer look at the "architecture" section in the next couple days and leave comments.

After a quick read, the paper is looking great!

Zsailer avatar Jan 25 '19 04:01 Zsailer

After thinking about it for a bit, I don't think you need to add anything more about pandas-flavor 👍
The current level of detail is appropriate for your readers. Anything more gets into the "weeds" of Pandas.

That said, I've added a TL;DR section to the pandas-flavor README that you could always reference if you'd like (no pressure from me 😃). It provides a simple explanation of how method registration works in the register_dataframe_method decorator.
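For readers who want the general idea without opening the README, here is a minimal sketch of decorator-based method registration. The `ToyFrame` class and `register_method` decorator are hypothetical stand-ins, and the real `register_dataframe_method` in pandas-flavor may work differently internally.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a list of rows."""

    def __init__(self, rows):
        self.rows = rows


def register_method(cls):
    """Hypothetical decorator factory: attach the decorated function to `cls`."""
    def decorator(func):
        # After this, instances of `cls` can call func as a bound method.
        setattr(cls, func.__name__, func)
        return func
    return decorator


@register_method(ToyFrame)
def drop_empty_rows(frame):
    # Keep only rows that contain at least one non-missing value.
    frame.rows = [row for row in frame.rows if any(v is not None for v in row)]
    return frame


frame = ToyFrame([[1, None], [None, None], [2, 3]])
result = frame.drop_empty_rows()
print(result.rows)
```

The function is written as a plain function taking the frame as its first argument, and registration is what makes it available in a method chain.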

Zsailer avatar Jan 29 '19 19:01 Zsailer

@Zsailer thanks for the feedback! I'm more than happy to include you on the paper regardless, because pandas-flavor was very enabling for pyjanitor to become a reality. Would you still like to be included? Please let me know!

ericmjl avatar Jan 30 '19 01:01 ericmjl

I'd be honored! 😃

Zsailer avatar Jan 30 '19 03:01 Zsailer