janitor remove_empty only to last rows of last columns of data.frame

tkx for the amazing package.

In my application, I extract several tables that are in the same spreadsheet (for hundreds of spreadsheets with a similar format, but some tables have more rows than others, as users add rows to tables as needed). I extract the starting points of each table by regular expressions on the in-sheet column names of each table and the end from the start of the subsequent table -1.

After extracting, several of the resulting tables (data.frames) have empty rows and at the end (or empty columns on the right). Those are conceptually different than empty rows and columns in the middle of the table.

So I would like to be able to remove just the empty cells in the bottom and right margins of the table. Perhaps the syntax could be something like:

remove_empty(remove_bottom=TRUE)

Edit 1: This is the solution I came up with to remove rows.

## Aux function to remove NA rows below table
remove_empty_row_last <- function(dt){
  dt[ ,row_empty:=rowSums(is.na(dt))==ncol(dt)] 
  while (dt[.N,row_empty]==TRUE) {
    dt <- dt[1:(.N-1)]
    
  }
  dt %>% return()
}

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)

It requires a data.table but you could adapt to dplyr based. The function should also not leave the row_empty column in its output.

EDIT2: I cross-posted on this SO quesiton

Jan 12 '21 16:01 lucasmation

Thanks for the suggestion. My first impression is that this is rather special-purpose. (That is to day, I have not had a need for this in my work flow.)

Can you please describe the use case in a bit more detail-- specifically, why is an all NA row in the middle good to include while an all NA row at the bottom is bad?
And, how do you know that the first row that would be removed in your data is not the same as a middle row? (For the second question, how do you know that row 6 should be removed while row 2 should be included in your dataset?)

If @sfirke finds it to be a good feature, I would suggest the implementation to expand the use of the which argument so that it would be possible to select any of c("rows", "cols", "rows top", "rows middle", "rows bottom", "cols left", "cols middle", "cols right") where "rows" implies all of the rows and "cols" implies all of the columns (both overriding the more specific choices with a warning, if both are given).

Jan 18 '21 17:01 billdenney

Ok. Maybe this is too special-purpose. Let me describe the case.

Let's say a spreadsheet (tab) has several tables, A, B, C. It is easy to identify the starting line of each table because each table starts with the column names. However, there is no consistent marker of the end of the tables. So I can define the end of table A as the row number for the start of table B minus 1. This will work, but table A will be imported with a bunch of empty rows at the end.

These empty rows at the end are conceptually different because I know they do not contain information. Empty rows at the middle may indicate some parsing error, typing error, etc. So, in my application, I want to differentiate between them.

Feel fee to disregard this suggestion and close the issue.

Jan 18 '21 18:01 lucasmation

I would have a related, but slightly different suggestion... my use case is an extract from a database where the first few variables are identifiers of some sort and the rest are actual data. suppose that some variables are conditional on something, in which case you would end up with something like

id	v1	v2	v3
a	b	c	d
b
c

I need to keep the id column for merging with other data sets, so throwing that away isn't possible. But if remove_empty could be conditioned to use regex (via tidyselect perhaps or anything else) then I could do something like remove_empty(d, "rows", vars = starts_with("v")) to remove rows like b and c.

Admittedly, I'm not sure how relevant it would be to removing columns (maybe there are particular columns that should be retained regardless of them always being empty?)

Jul 22 '22 11:07 aghaynes

I have this same need, there are some fully NULL columns that I want to retain and some I want to drop (two different types of columns - one is student responses, which could possibly be fully null, while the other is a column that we don't want if it is fully null). Adding something along the lines of vars = starts_with("v") would be useful!

Feb 10 '23 18:02 ljanda-amplify

This is related to #530, and that is where future discussion is planned to occur.

Thanks to all contributors to this issue. These issues seem difficult to generalize, overall, except for the specific column definition (so, switching to #530 for that). Each of these seems useful, but the context for the use would require code that seems specialized for the use case. And, with that specialization, it seems that the code would add to the code base and API complexity.

If you have ideas for generalized use cases, please open a new, specific issue for those.

Feb 10 '23 21:02 billdenney

janitor janitor copied to clipboard

remove_empty only to last rows of last columns of data.frame

janitor
janitor copied to clipboard