rtables
rtables copied to clipboard
Consider allowing `N=xx` denominators to be based on unique subjects and instead of only nrows
Originally posted by @anajens in https://github.com/insightsengineering/sme-tasks/issues/184#issuecomment-1714011239
Can we move this to tern? I think {rtables} is supposed to be a general framework and this is too specific to pharma context.
I am just thinking that there is a place in alt_counts_df where we may want to specify a unique(alt_counts_df[column_name])
but it is still possible to do all of this by specifying counts in rtables or doing some processing in afun. I put it here just to discuss it, it is fine to move it to {tern}
Typically you do this with your selection of alt_count_df itself. alt_count_df is primarily used to control what shows up in the N=
part; I don't know about a case where you need to use something else for alt_count_df and still wanted that behavior, do you have more details?
This issue came up as there was a bug in the alt_count_df
where 1 subject was duplicated. It was legacy data and you would think this wouldn't happen in a real project, but it did.
hi @anajens , can we just add an assertion to check "alt_count_df" that the number of unique subject id is equal to the total number of rows then? I feel this way, it can give more informative error messages, and our assumption is protected
Once the column subset expressions PR is in, this would be doable in a split function and could be used by default in tern/chevron without needing a mandatory change in the foundational rtables behavior.
Another possibility would scripts/snippets that QC data assumptions, which would be useful in a wider context
On further reflection, a split function wouldn't really help here, but I'm still very skeptical of relaxing the "same thing happens to the alt_count_df and df" aspect as its pretty core to how rtables works currently.
If this were going to be something that needs handling in rtables, it seems like it would come in the form of declaring a "count function", which will be called on the relevant subset of alt_count_df. The default being simply NROW
.
i think what rtables has now is sufficient, it really should be the data preprocessing to clean up the data, I think an assertion to the assumption should be sufficient. what do you think?
Yeah I think some kind of assertion check in tern might be sufficient. It's potentially dangerous to leave as is since there is no error or warning given. This case shouldn't happen on a real project where people are familiar with the data but it still seems like a good sanity check to add.
I will need to confirm but I think there's a J&J template which calls for alt_count_df to have non-uniqueness; I don't know that baking in an assertion is a good idea for a general framework.