pyjanitor
software paper
@zbarry @szuckerman, I would like to invite you to participate in the pyjanitor software manuscript that I am writing.
I am writing it in a branch off master: https://github.com/ericmjl/pyjanitor/blob/whitepaper/paper/manuscript.md
At the moment, I am seeking out input on:
- Current known limitations of pyjanitor.
- Possible extensions.
- Readability.
If you would like to participate, please put in a PR against the whitepaper branch and add your name!
Thanks Eric; this is quite the honor.
Great! I'd be glad to take a look.
Tidy data manuscript citation to possibly use: https://www.jstatsoft.org/article/view/v059i10
I'm just stream-of-consciousness'ing things here that I might add in myself later.
Cite the SO study
I think the wishlist section code example you've given argues very clearly why pyjanitor is so game-changing.
More significant has been the contributions from data scientists seeking a cleaner API for cleaning data
The connotation here is probably not what is intended.
Diagrams showing how a DataFrame is progressively mutated over a chain of methods might be interesting.
Provide note in architecture section that the chaining is not copying data on each call unless that would be the point of the method. That's an important point for people concerned about performance.
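To make the no-copy point concrete, here is a minimal sketch of a chain where each method mutates the object in place and returns it. The `Frame` class and its methods are hypothetical stand-ins written for illustration. They are not pyjanitor or pandas code, and real pandas operations differ in when they copy.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame: maps column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    def rename_column(self, old, new):
        # Mutate in place and return self, so the chain continues without a copy.
        self.columns[new] = self.columns.pop(old)
        return self

    def drop_column(self, name):
        # Same pattern: modify the existing object, then hand it back.
        del self.columns[name]
        return self


frame = Frame({"A": [1, 2], "B": [3, 4]})
result = frame.rename_column("A", "a").drop_column("B")

# The chain returns the very same object it started with.
same_object = result is frame
```

A method whose purpose is to produce a new object would instead build and return a fresh `Frame`, which is the one case where a copy is expected.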
Cite the SO study
Already done.
More significant has been the contributions from data scientists seeking a cleaner API for cleaning data
The connotation here is probably not what is intended.
Wait what? Not sure what you mean by that.
Diagrams showing how a DataFrame is progressively mutated over a chain of methods might be interesting.
I think that belongs in the docs, which could be much improved. There is already one example available, copied directly from the janitor repository.
I was thinking in more of an overview figure sense so people can get a clear visual indication of how you can easily track the chain and its effects on the DataFrame, though it's not super important. It's more for improving visual appeal of the paper than anything else.
Newcomer contributors to open source have made their maiden contributions to pyjanitor, and experienced software engineers have also chipped in. More significant has been the contributions from data scientists seeking a cleaner API for cleaning data.
Wasn't very clear what I meant, haha. Idk, when I was reading this in a meeting, I kind of snickered because for whatever reason, to me at the time, it read like "yeah, sure, these people chipped in, but these other guys are more important" which, while that may not be an incorrect statement, could possibly be said in a different way. Not sure if others would read it like that.
Got it. Yes, could use rephrasing. Feel free to PR a change.
Just curious - what would be the eventual format the paper would be written in? E.g., Word, LaTeX, etc.? If it's the latter, there are good templates out there for structuring manuscripts, of course. Happy to do the typesetting if we go for it.
@zbarry that'll depend, but for an arXiv deposition, I'd probably use some tooling I already have to convert the markdown text into LaTeX, with Pandoc in the loop. It's something I did for my thesis, where building a PDF of the paper can be a continuous integration build step as well. It'll be like how readthedocs builds docs, except for papers!
I think you're better versed in LaTeX typesetting than I am, so if you'd like to build the template, please go for it!
Whoa, that's cool lol. I'd expect nothing less. Sounds good.
Not sure how comprehensive you're trying to be with the "Comparison to other tools" section, but another couple of tidyverse-influenced libraries for Python are plydata (https://github.com/has2k1/plydata) and kadro (https://github.com/koaning/kadro). I think both are among the most interesting prior art in the Python space.
Also, one that is not tied to the tidyverse but has similar objectives, with a verb-based method-chaining approach to data preparation, is pdpipe (https://github.com/shaypal5/pdpipe).
@jcmkk3 thanks for the feedback! Yes, I will have to update the "comparison to other tools" as well.
Going to start a GitHub projects board for this.
Inviting @eli-s-goldberg and @zsailer onto the thread.
@eli-s-goldberg I have finally gotten to reviewing your proposed changes, and I definitely like and value the feedback. I am making changes on the basis of this. I'd also like to invite you to contribute an end-user testimony of some kind to the paper - if you're inclined! Totally understand if you'd like to decline, given your load at the moment.
@zsailer I am also inviting you onto the thread because pandas-flavor has been very enabling for this project. At the moment, I am wondering if you are open to contributing comments on whether additional description about pandas-flavor would help educate an end-user about pyjanitor's architecture? Again, also only if you're inclined to do so!
I hope to keep things lightweight for both of you, since I'm sure you're both super busy with your respective things. If you don't want to go through the hassle of a PR, I'm happy to accept your contributions via the issue tracker (i.e. just copy/paste the text you'd like added or modified).
@ericmjl thanks for the invite!
I'd be happy to contribute. I'll take a closer look at the "architecture" section in the next couple days and leave comments.
After a quick read, the paper is looking great!
After thinking about it for a bit, I don't think you need to add anything more about pandas-flavor 👍
The current level of detail is appropriate for your readers. Anything more gets into the "weeds" of Pandas.
That said, I've added a TL;DR section to the pandas-flavor README that you could always reference if you'd like (no pressure from me 😃). It provides a simple explanation of how method registration works in the register_dataframe_method decorator.
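For readers of the thread who want the general idea without opening the README, here is a toy sketch of decorator-based method registration. It is not the pandas-flavor implementation: `Table` and `register_table_method` are hypothetical names made up for illustration, and the real `register_dataframe_method` targets pandas DataFrames and may work differently internally.

```python
class Table:
    """Hypothetical stand-in for pandas.DataFrame."""

    def __init__(self, rows):
        self.rows = rows


def register_table_method(func):
    """Attach `func` to Table under its own name so it can be called as a method."""
    setattr(Table, func.__name__, func)
    # Return the function unchanged so it is still usable as a plain function.
    return func


@register_table_method
def row_count(table):
    # The first argument receives the Table instance, like `self` in a method.
    return len(table.rows)


table = Table([{"x": 1}, {"x": 2}])
count = table.row_count()
```

The appeal of this pattern is that a library can add methods to an existing class by writing ordinary functions, without subclassing it.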
@Zsailer thanks for the feedback! I'm more than happy to include you on the paper regardless, because pandas-flavor was very enabling for pyjanitor to become a reality. Would you still like to be included? Please let me know!
I'd be honored! 😃