rtables Consider allowing `N=xx` denominators to be based on unique subjects and instead of only nrows

Originally posted by @anajens in https://github.com/insightsengineering/sme-tasks/issues/184#issuecomment-1714011239

Sep 11 '23 16:09 Melkiades

Can we move this to tern? I think {rtables} is supposed to be a general framework and this is too specific to pharma context.

Sep 11 '23 16:09 anajens

I am just thinking that there is a place in alt_counts_df where we may want to specify a unique(alt_counts_df[column_name]) but it is still possible to do all of this by specifying counts in rtables or doing some processing in afun. I put it here just to discuss it, it is fine to move it to {tern}

Sep 11 '23 17:09 Melkiades

Typically you do this with your selection of alt_count_df itself. alt_count_df is primarily used to control what shows up in the N= part; I don't know about a case where you need to use something else for alt_count_df and still wanted that behavior, do you have more details?

Mar 19 '24 18:03 gmbecker

This issue came up as there was a bug in the alt_count_df where 1 subject was duplicated. It was legacy data and you would think this wouldn't happen in a real project, but it did.

Mar 20 '24 19:03 anajens

hi @anajens , can we just add an assertion to check "alt_count_df" that the number of unique subject id is equal to the total number of rows then? I feel this way, it can give more informative error messages, and our assumption is protected

Mar 20 '24 23:03 shajoezhu

Once the column subset expressions PR is in, this would be doable in a split function and could be used by default in tern/chevron without needing a mandatory change in the foundational rtables behavior.

Another possibility would scripts/snippets that QC data assumptions, which would be useful in a wider context

Mar 21 '24 01:03 gmbecker

On further reflection, a split function wouldn't really help here, but I'm still very skeptical of relaxing the "same thing happens to the alt_count_df and df" aspect as its pretty core to how rtables works currently.

Mar 21 '24 02:03 gmbecker

If this were going to be something that needs handling in rtables, it seems like it would come in the form of declaring a "count function", which will be called on the relevant subset of alt_count_df. The default being simply NROW.

Mar 21 '24 23:03 gmbecker

i think what rtables has now is sufficient, it really should be the data preprocessing to clean up the data, I think an assertion to the assumption should be sufficient. what do you think?

Mar 22 '24 02:03 shajoezhu

Yeah I think some kind of assertion check in tern might be sufficient. It's potentially dangerous to leave as is since there is no error or warning given. This case shouldn't happen on a real project where people are familiar with the data but it still seems like a good sanity check to add.

Mar 22 '24 20:03 anajens

I will need to confirm but I think there's a J&J template which calls for alt_count_df to have non-uniqueness; I don't know that baking in an assertion is a good idea for a general framework.

Mar 22 '24 21:03 gmbecker

rtables rtables copied to clipboard

Consider allowing `N=xx` denominators to be based on unique subjects and instead of only nrows

rtables
rtables copied to clipboard