Creating a visual schematic diagram for data wrangling workflow in `{datawizard}`

Open IndrajeetPatil opened this issue 2 years ago • 29 comments

IMHO, the current README is quite dull and long-winded, and doesn't provide much insight into how this package can be useful for the users.

What we need is for it to feature a visual schematic like the ones in our other popular high-level packages:

image

image

Of course, paging our in-house visualization wizard @DominiqueMakowski! 🪄

Needless to say, this is low-priority, and if you think it necessary, we can definitely wait for the package to mature further.

IndrajeetPatil avatar Mar 06 '22 14:03 IndrajeetPatil

What would it contain? I can start a draft in PowerPoint.

DominiqueMakowski avatar Mar 06 '22 23:03 DominiqueMakowski

@DominiqueMakowski How about something like this? (cc @bwiernik, @strengejacke, @mattansb, @etiennebacher)

unnamed

Of course, there is a lot of room for improvement here. Specifically,

  • I am not sure how to visually depict messy versus clean/tidy data. The only thing I could come up with was a hamper full of dirty clothes versus clean, folded clothes. Maybe others have better ideas.
  • This includes no functions about "Data Properties". Is it important to include them?
  • The list of functions I've included in the two columns is incomplete. Not sure how comprehensive we want to be here.

IndrajeetPatil avatar Jul 03 '22 06:07 IndrajeetPatil

I can give it a go next week (do ping me then if you remember :)

> The list of functions I've included in the two columns is incomplete.

It's okay not to be comprehensive; otherwise we will be obsolete as soon as we add a new function. Better, perhaps, to create something like a wordcloud.

DominiqueMakowski avatar Jul 03 '22 06:07 DominiqueMakowski

Yeah, I agree. That's why I had put the ... in those columns. I don't think we need to be comprehensive, but we should definitely include the most important ones (filter, select, join, etc.).

IndrajeetPatil avatar Jul 03 '22 07:07 IndrajeetPatil

I think a separate viz of data cleaning versus data summary functions would be good

bwiernik avatar Jul 03 '22 08:07 bwiernik

@DominiqueMakowski It would be nice to have something like this in the JOSS paper.

IndrajeetPatil avatar Jul 09 '22 10:07 IndrajeetPatil

Will do within the next couple of days

DominiqueMakowski avatar Jul 10 '22 03:07 DominiqueMakowski

Would be nice to generate a wordcloud of the functions tho

DominiqueMakowski avatar Jul 10 '22 03:07 DominiqueMakowski

Wordlist for wordclouds (https://www.wordclouds.com/):

  • Preparation:

data_filter() data_select() data_to_long() data_to_wide() data_rotate() data_rename() data_relocate() data_join()

  • Transformation:

standardize() normalize() center() degroup() winsorize() data_cut() data_recode() data_shift()

DominiqueMakowski avatar Jul 12 '22 12:07 DominiqueMakowski

I want to wait for #57 and #197 to be resolved before we can include the following functions in the wordcloud:

data_cut() data_recode() data_shift()

We should avoid including in a publication any function names that we are not sure will survive for long.

IndrajeetPatil avatar Jul 12 '22 12:07 IndrajeetPatil

you're right, I'll come up with a diagram prototype nonetheless and then we can fine-tune the wordcloud

DominiqueMakowski avatar Jul 12 '22 12:07 DominiqueMakowski

We can focus on the dirty clothes metaphor but it lacks some text at the bottom? (feel free to directly edit the powerpoint on the diagram branch!)

image

DominiqueMakowski avatar Jul 12 '22 12:07 DominiqueMakowski

Thanks, Dom! This looks like a great start.

I think one way this can be improved is by making it visually less busy and more minimal. Additionally, we need to mention only a few (key and most useful) functions and just have ... (which will cover all the other existing or future functions).

I don't like the star shape in the "Transformations" section.

Maybe this can be an ironing table with a shirt on it? As in, imperfections in prepared data are ironed out using statistical transformations before the data is ready to be fed into a statistical model.

Instead of "No dependencies", I'd write "Lightweight", since we do import `{insight}`.

IndrajeetPatil avatar Jul 23 '22 12:07 IndrajeetPatil

I also want to hear what @etiennebacher, @strengejacke, @bwiernik, @mattansb think about the current status of the illustration and how it can be further improved.

IndrajeetPatil avatar Jul 26 '22 13:07 IndrajeetPatil

I agree with Indra's comments and don't have much more to add there. I like the ironing metaphor (maybe with the function names in a cloud of steam?). I also agree that making the function names less busy, so that they stand out more, would be good.

bwiernik avatar Jul 26 '22 13:07 bwiernik

Looks good. I would maybe change the background color of the washing machine to a lighter blue? And for the transformations, use the non-`data_*` variant names.

mattansb avatar Jul 26 '22 13:07 mattansb

Looks good to me too, but it's a bit hard to read most function names in steps 2 and 3. Maybe you can remove the very small ones to increase the size of the others?

etiennebacher avatar Jul 26 '22 14:07 etiennebacher

Thank you all for the great suggestions!

WDYT, @DominiqueMakowski? Will this be possible? Don't know how complicated it will be to design.

IndrajeetPatil avatar Jul 31 '22 15:07 IndrajeetPatil

Before we finalize this, we should definitely decide on the new function names. Mostly, I'm not quite satisfied with change_code(). "Recode" is a verb, and everyone would expect such a function to recode old values into new ones. In change_code(), the verb is "change", and what can we expect if we change a code? What we actually change when recoding are values (or factor levels, but "values" is maybe more generic). What do you think about change_values()? Or maybe recode_values() or recode_variables()?

strengejacke avatar Jul 31 '22 15:07 strengejacke

The "code" being the mapping of quantities to values/labels. So the function is changing the coding scheme used.

Maybe change_coding()? change_values() would be my second choice

bwiernik avatar Jul 31 '22 16:07 bwiernik

@DominiqueMakowski Let us know if these suggestions make sense.

IndrajeetPatil avatar Aug 24 '22 13:08 IndrajeetPatil

bump

IndrajeetPatil avatar Sep 15 '22 11:09 IndrajeetPatil

hello-mcfly

strengejacke avatar Sep 15 '22 12:09 strengejacke

bump

IndrajeetPatil avatar Sep 21 '22 10:09 IndrajeetPatil

Is that the correct list?

Preparation: data_filter() data_select() data_to_long() data_to_wide() data_rotate() data_rename() data_relocate() data_join()

Transformation: standardize() normalize() center() degroup() winsorize() categorize() change_code() slide()

DominiqueMakowski avatar Sep 21 '22 10:09 DominiqueMakowski

thanks for the bumps 🙊

DominiqueMakowski avatar Sep 21 '22 10:09 DominiqueMakowski

These need to change to their new names:

  • data_cut() -> categorize()
  • data_recode() -> recode_values() ~~change_code()~~
  • data_shift() -> slide()

Btw, feel free to not include all of them. Whatever looks better with the chosen graphic design.

IndrajeetPatil avatar Sep 21 '22 10:09 IndrajeetPatil

recode_values() not change_code()

bwiernik avatar Sep 21 '22 18:09 bwiernik

bump

IndrajeetPatil avatar Oct 11 '22 14:10 IndrajeetPatil