partition icon indicating copy to clipboard operation
partition copied to clipboard

Improve documentation for `part_pc1()` and allow `reduce_first_component()` to operate independently

Open jtian123 opened this issue 1 year ago • 5 comments

I am currently working with the part_icc() partitioner, which employs the direct-measure-reduce approach as follows:

Direct: direct_distance(), Minimum Distance Measure: measure_icc(), Intraclass Correlation Reduce: reduce_scaled_mean(), Scaled Row Means My intention was to retain the "direct" and "measure" stages unchanged while switching the "reduce" stage to reduce_first_component(). However, I encountered an issue where reduce_first_component() appears to be almost an "empty" function, with the primary functionality being handled within measure_variance_explained().

This is contrasted with the structure of another partitioner, part_pc1(), which also uses the direct-measure-reduce approach:

Direct: direct_distance(), Minimum Distance Measure: measure_variance_explained(), Variance Explained (PCA) Reduce: reduce_first_component(), First Principal Component It seems that in the part_pc1() partitioner, measure_variance_explained() is effectively performing the reduction task, making reduce_first_component() redundant and potentially misleading for users who intend to utilize it as a standalone reducer.

I propose separating the reduction functionality from measure_variance_explained() and enhancing reduce_first_component() to function as a true reducer. This modification would clarify the roles of these functions and facilitate easier customization for users.

Could this enhancement be considered for implementation to improve clarity and functionality?

jtian123 avatar Jul 29 '24 07:07 jtian123

Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.

I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.

malcolmbarrett avatar Aug 07 '24 14:08 malcolmbarrett

To estimate variance explained, we would need to compute PCA, because the variance explained relates to the reduced feature.

Josh


From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 7:19 AM To: USCbiostats/partition @.> Cc: Subscribed @.***> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)

Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.

I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273592506__;Iw!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XTMD0GPJQ$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AFQJR4DVE2MCBWDQMM3TUZLZQIUGPAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTGU4TENJQGY__;!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XQC9wTWeg$. You are receiving this because you are subscribed to this thread.Message ID: @.***>

millstei avatar Aug 07 '24 16:08 millstei

Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.

Because the two can't be separated, when used together, measure_variance_explained() stores the first component for (see these lines) for reduce_first_component() to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.

Your example arguably shows a bug in reduce_first_component(). Just as measure_variance_explained() only stores the first component if that's the reducer, reduce_first_component() should compute the PCA if it hasn't already been computed by measure_variance_explained().

So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained() and reduce_first_component() and 2) allowing reduce_first_component() to operate independently of measure_variance_explained() by computing the PCA itself when needed (and only when needed)

malcolmbarrett avatar Aug 07 '24 17:08 malcolmbarrett

Yes, the two important areas of improvement you summarized are exactly what I want to say in the issue report.

Thanks, Jinyao

Get Outlook for iOShttps://aka.ms/o0ukef


From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 10:09:24 AM To: USCbiostats/partition @.> Cc: Jinyao Tian @.>; Author @.> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)

Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.

Because the two can't be separated, when used together, measure_variance_explained() stores the first component for (see these lineshttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/blob/5f7366a9e73ec76ec775276b0dffe4361ed88f68/R/metrics.R*L168-L176__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu8Ox-np$) for reduce_first_component() to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.

Your example arguably shows a bug in reduce_first_component(). Just as measure_variance_explained() only stores the first component if that's the reducer, reduce_first_component() should compute the PCA if it hasn't already been computed by measure_variance_explained().

So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained() and reduce_first_component() and 2) allowing reduce_first_component() to operate independently of measure_variance_explained() but computing the PCA itself when needed (and only when needed)

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273934736__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu1q6lmd$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AS25FF6KWXMUI752BEJWXBTZQJIEJAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTHEZTINZTGY__;!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_utJMl013$. You are receiving this because you authored the thread.Message ID: @.***>

jtian123 avatar Aug 07 '24 17:08 jtian123

Thanks for confirming and for both of your input. I'll try to address this soon now that it's clearer in my mind

malcolmbarrett avatar Aug 07 '24 17:08 malcolmbarrett

#41 should address this. I'll try to get this on CRAN this week

library(partition)
set.seed(1234)
df <- simulate_block_data(5, lower_corr = .5, upper_corr = .65, n = 100)

part_custom <- replace_partitioner(
  part_icc,
  reduce = reduce_first_component
)

partition(df, threshold = .6, partitioner = part_custom)
#> Partitioner:
#>    Director: Minimum Distance (Pearson) 
#>    Metric: Intraclass Correlation 
#>    Reducer: First Principal Component
#> 
#> Reduced Variables:
#> 1 reduced variables created from 3 observed variables
#> 
#> Mappings:
#> reduced_var_1 = {block1_x1, block1_x2, block1_x4}
#> 
#> Minimum information:
#> 0.6

Created on 2024-10-09 with reprex v2.1.1

malcolmbarrett avatar Oct 09 '24 15:10 malcolmbarrett

On its way to CRAN

malcolmbarrett avatar Oct 09 '24 17:10 malcolmbarrett