
extend subset parameter to all biplot layers

Open corybrunson opened this issue 4 years ago • 14 comments

The experimental subset parameter of GeomIsolines$setup_data() should be extended to all plot layers.

To consistently distinguish between ggplot() and ggbiplot(), it would be good, if possible, to enable this parameter only for *_rows_*() and *_cols_*() layers. (Should it be a parameter of stat_rows(), stat_cols(), and the other new stat layers rather than of Geom*$setup_data()?)

An important decision is what inputs subset should be able to handle. Among the possibilities:

  • a positive (negative) integer vector indicating the rows to include (exclude) – understood by [.data.frame() and dplyr::slice() but not by subset(); almost definitely worth including since it will come most naturally to users
  • a logical vector with one element per row – understood by [.data.frame() and subset() but not by dplyr::slice(); should not cause confusion, but users could instead be required to convert it with which() rather than having it handled separately
  • a character vector of row names – understood by [.data.frame() but not by subset() or dplyr::slice(); could cause confusion because print.tbl_ord() uses tibble-like printing, which does not display row names as print.data.frame() does, but might be important for large data workflows
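
For reference, here is how the three candidate input types behave with base R's `[.data.frame`. This is a minimal sketch on a toy data frame (`df` and its row names are made up for illustration), not ordr code:

```r
# toy data frame with explicit row names (hypothetical data)
df <- data.frame(
  x = 1:4,
  y = c(2.5, 1, 0, 3),
  row.names = c("a", "b", "c", "d")
)

# positive integers include rows; negative integers exclude them
by_pos <- df[c(1, 3), ]
by_neg <- df[-2, ]

# a logical vector with one element per row
keep <- df$x > 2
by_lgl <- df[keep, ]

# which() turns the logical vector into positions, so the logical case
# could be reduced to the integer case
by_which <- df[which(keep), ]

# a character vector of row names
by_name <- df[c("a", "d"), ]
```

Because `which()` reduces the logical case to the integer case, supporting integers and characters would already cover all three styles.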

corybrunson avatar Aug 27 '21 19:08 corybrunson

@jtr13, since you've used the package (and i'm so glad you've found it useful!), i'd be glad for your opinion. From the bulleted list above, what data would you expect to have to give ggbiplot() to have only a subset of variable axes, or just one variable axis, plotted?

corybrunson avatar Aug 27 '21 19:08 corybrunson

Happy to give an opinion... I assume by row names you mean the column names of the original data frame. I (always) prefer character vectors of names but not critical if there are other considerations.

jtr13 avatar Aug 28 '21 11:08 jtr13

Correct, when the data are provided as a data frame or matrix. Thank you!

corybrunson avatar Aug 28 '21 12:08 corybrunson

A draft solution is in the subset branch. The parameter is understood by the matrix stats (StatRows and StatCols) as well as by their corresponding extensions of StatScale, and can therefore be used with any geom layer that pairs with these stat layers (as the projection geom will).

It is not understood by the other stat layers because, as implemented, the subsetting would affect the results of their calculations. The logistic PCA examples illustrate its use with numerical input, though logical and character inputs are also accepted.
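
To illustrate the concern, here is a toy sketch in base R. It is not ordr's implementation: `identity_stat()` and `center_stat()` are hypothetical stand-ins for a stat that passes rows through unchanged and a stat that summarizes its input, and the coordinates are made up.

```r
# hypothetical coordinates for four rows
coords <- data.frame(x = c(0, 2, 4, 10), y = c(1, 1, 3, 7))
keep <- c(1, 2)

# a pass-through stat: subsetting before or after gives the same rows
identity_stat <- function(d) d
same <- identical(identity_stat(coords[keep, ]), identity_stat(coords)[keep, ])

# a summarizing stat: the column means of whatever data it receives
center_stat <- function(d) colMeans(d)
center_all <- center_stat(coords)             # means over all four rows
center_subset <- center_stat(coords[keep, ])  # means over the subset only
```

For the pass-through stat, the order of subsetting and computing does not matter. For the summarizing stat, subsetting first changes the result, which is presumably the kind of interference described above.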

More examples, and unit tests, are needed, but this looks like a workable solution!

corybrunson avatar Aug 28 '21 19:08 corybrunson

Awesome! I'll give it a try and see if I can come up with some examples.

jtr13 avatar Aug 28 '21 20:08 jtr13

Contributions are never obligatory but always welcome. Especially bug reports.

corybrunson avatar Aug 28 '21 22:08 corybrunson

Good to know!

jtr13 avatar Aug 29 '21 00:08 jtr13

Works for me with integers but not characters... I'm not sure what the right syntax is. In the finches example, geom_rows_vector(alpha = .5, color = "darkred", subset = 3) works.

With geom_rows_vector(alpha = .5, color = "darkred", subset = "Isabella") I get

Warning in f(...) :
  Rows have no defined `.name`, so `subset` will be ignored.

And with geom_rows_vector(alpha = .5, color = "darkred", subset = Isabella) I get

Error in layer(data = data, mapping = mapping, stat = rows_stat(stat),  : 
  object 'Isabella' not found

As an aside, imho there is an overwhelming number of geoms in the package. I'd prefer, for example, that geom_rows_axis() automatically draw tick marks and tick mark labels, with parameters to turn them off if desired. Of course, just a suggestion to take or leave!

jtr13 avatar Aug 29 '21 14:08 jtr13

@jtr13 try using augment_ord() before fortifying / plotting the data. In order for row or column names to work, the row or column data frame passed to ggplot() has to have a .name field, which will be retrieved by augment_ord() if it is available from the model object. It's clunky, but i could not come up with a better way that didn't require a package-wide overhaul.

Though i should make the warning message clearer.
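
A rough sketch of the lookup described here, in base R. The helper `resolve_subset()` and the toy data are hypothetical and only illustrate the idea that character input needs a `.name` column while integer and logical input do not; the actual code in the subset branch may differ.

```r
# Hypothetical helper: convert a `subset` argument to row positions.
# `data` stands in for the fortified row or column data frame.
resolve_subset <- function(data, subset) {
  all_rows <- seq_len(nrow(data))
  if (is.null(subset)) return(all_rows)
  if (is.character(subset)) {
    # names can only be matched if a `.name` column is present
    if (!".name" %in% names(data)) {
      warning("No `.name` field found, so `subset` will be ignored.")
      return(all_rows)
    }
    return(which(data$.name %in% subset))
  }
  if (is.logical(subset)) return(which(subset))
  # numeric: let R's own indexing handle positive and negative positions
  all_rows[subset]
}

# made-up data with a `.name` column, as if names had been retrieved
named <- data.frame(
  .name = c("Isabella", "Fernandina", "Santiago"),
  PC1 = c(0.3, -1.2, 0.8)
)

by_name <- resolve_subset(named, "Isabella")  # position of the named row
by_neg <- resolve_subset(named, -1)           # every row except the first
```

Without the `.name` column, the character case falls back to all rows with a warning, which matches the behavior reported above.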

corybrunson avatar Aug 29 '21 18:08 corybrunson

Never mind! I can reproduce the error after augment_ord(). I'll see if i can hack something.

corybrunson avatar Aug 29 '21 18:08 corybrunson

Commit a505a861b7bc9578d0f1620fd33bc29a66b64115 in the subset branch automatically maps the (custom) .name_subset aesthetic to the .name field, if it exists, in the fortified 'tbl_ord' object. The previous code that looked for .name now looks for .name_subset, which will be there if augment_ord() has been run first (and if names are found in the model object).

Please let me know again whether it works!

Note: This is not an ideal solution. A better one would be to have a "pointer" (not in the low-level sense) to the original model object that has been cloaked in the 'tbl_ord' class. The ability to access this object from within 'tbl_ord' methods and within the ggplot2 build process would potentially solve many other problems. I'm leaving this issue open until a new one is created for this goal.
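
One way to picture the "pointer" idea, sketched in base R: keep the original model object as an attribute of the wrapped object so that later methods can reach it. The class name `wrapped_ord` and the helpers `wrap_model()` and `get_model()` are hypothetical, and this is a conceptual illustration only, not ordr's implementation.

```r
# wrap a model while retaining the original object as an attribute
wrap_model <- function(model) {
  structure(
    list(rows = model$x, cols = model$rotation),
    model = model,
    class = "wrapped_ord"
  )
}

# accessor for the retained model
get_model <- function(x) attr(x, "model")

# example with a base R PCA on a built-in data set
pca <- prcomp(USArrests, scale. = TRUE)
wrapped <- wrap_model(pca)

# row names stay recoverable from the original model at any later stage
first_names <- rownames(get_model(wrapped)$x)[1:3]
```

With access like this, a name-based `subset` could be resolved against the model itself rather than against a column that must be added beforehand.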

corybrunson avatar Aug 29 '21 20:08 corybrunson

Sorry, my fault for not providing a full reprex showing that I was using augment_ord(). It's working great now!

    # site-species data frame of Sanderson Galapagos finches data
    library(ordr)
    #> Loading required package: ggplot2
    library(magrittr)
    data(finches, package = "cooccur")
    finches %>% t() %>%
        logisticPCA_ord() %>%
        as_tbl_ord() -> finches_lpca
    finches_lpca %>%
        augment_ord() %>%
        ggbiplot(aes(label = .name), sec.axes = "cols", scale.factor = 50) +
        geom_rows_text_radiate(subset = "Isabella") +
        geom_rows_axis(subset = "Isabella", color = "royalblue3", lwd = .75) +
        geom_rows_axis_text(size = 3, subset = "Isabella", color = "royalblue3",
                            label_dodge = 2) +
        geom_rows_axis_ticks(subset = "Isabella", color = "royalblue3") +
        geom_rows_vector(color = "darkred") +
        geom_cols_point(alpha = .5, color = "royalblue3") +
        ggtitle(
            "Logistic PCA of the Galapagos island finches",
            "Islands (finches) scaled to the primary (secondary) axes"
        ) +
        expand_limits(x = c(-30, 25))

Created on 2021-08-29 by the reprex package (v2.0.1)

Now to be super picky, it doesn't seem that there's a label_dodge parameter for geom_rows_text_radiate().

jtr13 avatar Aug 30 '21 03:08 jtr13

Unfortunately this solution (for character vectors) interferes with the internal calculation of group, so for urgency i dropped it in 0e41f530ffad93bcbee9eee9441571d85329522b (@jtr13 please take note, with an apology). A "pointer" would remedy it.

corybrunson avatar Sep 21 '21 13:09 corybrunson

No apologies... thanks for the heads-up!

jtr13 avatar Sep 22 '21 01:09 jtr13