
447 coda transformations add selection of columns/attributes of the input data

Open ghost opened this issue 1 year ago • 8 comments

Fixes #447

I did not modify Single ILR as subcomposition parameters are already required.

As for PLR: I edited it so that if the user selects some columns for the transformation, the numerator is placed in the first column and the selected columns to its right, so that the algorithm works. I'm not certain whether this affects the results. Could @jtlait or @em-t verify whether this is okay?
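A minimal sketch of the reordering idea described above (the helper name `reorder_for_plr` is hypothetical, not part of eis_toolkit): the numerator column is moved to the front and the remaining selected columns follow it, since PLR pivots on the first part.

```python
import pandas as pd

def reorder_for_plr(df: pd.DataFrame, numerator: str, selected: list) -> pd.DataFrame:
    # Hypothetical helper: place the numerator column first, followed by
    # the other selected columns, so the PLR pivot is well-defined.
    rest = [col for col in selected if col != numerator]
    return df[[numerator] + rest]

df = pd.DataFrame({"a": [1.0], "b": [2.0], "c": [3.0], "d": [4.0]})
reordered = reorder_for_plr(df, numerator="c", selected=["a", "c", "d"])
print(list(reordered.columns))  # ['c', 'a', 'd']
```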

ghost avatar Nov 14 '24 12:11 ghost

I am not too familiar with PLR, so maybe @em-t or @chudasama-bijal can comment on this. One other thing regarding column selection: check_in_simplex_sample_space cannot be run first when columns are selected. If the input dataframe contains additional variables besides the geochemical variables, and only the geochemical variables are chosen, then check_in_simplex_sample_space should be run only on the selected geochemical data.
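To illustrate the point, here is a sketch of running the simplex check only on the selected columns. The function `check_selected_in_simplex` is a hypothetical stand-in for `check_in_simplex_sample_space`, not the actual eis_toolkit implementation.

```python
import pandas as pd

def check_selected_in_simplex(df: pd.DataFrame, columns: list, tolerance: float = 1e-9) -> None:
    # Hypothetical stand-in for check_in_simplex_sample_space, applied
    # only to the selected geochemical columns rather than the whole frame.
    sub = df[columns]
    if (sub <= 0).any().any():
        raise ValueError("All values must be positive.")
    sums = sub.sum(axis=1)
    if not ((sums - sums.iloc[0]).abs() < tolerance).all():
        raise ValueError("Rows of the selected columns do not sum to a constant.")

df = pd.DataFrame({
    "site_id": [1, 2],   # non-geochemical column, excluded from the check
    "Cu": [0.2, 0.5],
    "Ni": [0.8, 0.5],
})
check_selected_in_simplex(df, ["Cu", "Ni"])  # passes: both rows sum to 1.0
```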

jtlait avatar Nov 18 '24 08:11 jtlait

I started to look at this today, but didn't quite have time for a proper review. I'll return to this tomorrow!

em-t avatar Nov 18 '24 15:11 em-t

I will take a look at the coda related issues next week.

chudasama-bijal avatar Nov 20 '24 08:11 chudasama-bijal

Did you want to check this @chudasama-bijal ?

nmaarnio avatar Nov 29 '24 08:11 nmaarnio

@em-t @jtlait I tried to fix the notebook (testing_logratio_transformations.ipynb) but I believe we need compositional example data. The IOCG_CLB_Till_Geochem_reg_511p.shp dataset used in the "Testing with example data" section of the notebook is not compositional (at least the rows don't sum to 1 or 100). I'm not sure if we have suitable data for this: do you guys have an example dataset that could be used in the notebook? And can you fix it?

ghost avatar Nov 29 '24 08:11 ghost

I couldn't find the .shp version of the file, but I did look at IOCG_CLB_Till_Geochem_reg_511p.gpkg. In the .gpkg file, the analyzed elements (the ppm columns) do not sum to 1, 100, or any other constant shared between the rows. Perhaps this data is therefore a subcomposition, i.e., it is missing some components. In that case, a better dataset might be needed for the notebook.

However, this dataset looks like a real-world example, as it is not curated to perfection, and thus it would be good to think about how to deal with such cases. Do we want users to be able to apply log transforms to these non-closed datasets, or do we require them to fix their data to the expected format beforehand?

If the constant-sum check is one of the requirements, and we want the user to be able to handle data like IOCG_CLB_Till_Geochem_reg_511p.gpkg without preprocessing it, then I think additional functionality is needed. There could be, for example, an argument data_is_closed = True for each of the log transformations, and if the user sets it to False, the closure operation _closure from aitchison_geometry.py would be run on the selected columns at the beginning of the function. What do you think @chudasama-bijal @em-t ?

jtlait avatar Nov 29 '24 11:11 jtlait

I think we did discuss this. In practice, the usual (geochem) data will be concentration data like the IOCG_CLB_Till_Geochem_reg_511p file, and it will often not satisfy the constant-sum condition.

So, giving the user the option of whether to perform closure on their data seems the way to go. This could also be demonstrated in the notebook itself.

Regarding the PLR transformation, the ordering of the columns is important. Hence, at least in the plugin, I would recommend emphasizing that the user should select the columns in the order in which they want the PLR to be performed. Will this require any changes to the toolkit function itself in terms of input parameters, or can it be handled directly in the plugin? Please consider this while modifying these functions.
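To make the order dependence concrete, here is a minimal sketch of pivot log-ratio coordinates for one composition (standard pivot-coordinate formula; `plr_sketch` is an illustrative function, not the eis_toolkit implementation). Reordering the parts changes the resulting coordinates, which is why the user's selection order matters.

```python
import numpy as np

def plr_sketch(x):
    # Pivot log-ratio coordinates z_i = sqrt((D-i)/(D-i+1)) *
    # ln(x_i / gmean(x_{i+1}, ..., x_D)) for i = 1..D-1 (1-indexed).
    x = np.asarray(x, dtype=float)
    D = x.size
    out = []
    for i in range(D - 1):
        rest = x[i + 1:]
        gmean = np.exp(np.log(rest).mean())       # geometric mean of the tail
        coef = np.sqrt((D - i - 1) / (D - i))     # normalizing coefficient
        out.append(coef * np.log(x[i] / gmean))
    return np.array(out)

comp = np.array([0.2, 0.3, 0.5])
print(plr_sketch(comp))               # order [0.2, 0.3, 0.5]
print(plr_sketch(comp[[2, 0, 1]]))    # different order gives different values
```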

Finally, if there are any questions about the mathematics of the transformations that affect the implementation, this doesn't seem like the most effective channel to discuss them.

chudasama-bijal avatar Dec 03 '24 19:12 chudasama-bijal

Hi @em-t, @jtlait, I've added a scale parameter that lets users perform closure on the selected columns in the CoDA functions, and I've fixed the notebook (sorry for the commit spam, by the way). The tools and the corresponding CLI functions should work just fine, but let me know if you spot anything!
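One plausible reading of the scale parameter's semantics, sketched below (the function `apply_scale_sketch` is illustrative, not the actual eis_toolkit code): when `scale` is given, the selected columns are closed so each row sums to that value (e.g. 1.0 for proportions, 100.0 for percentages, 1e6 for ppm); when it is None, the data is assumed to be closed already.

```python
import pandas as pd

def apply_scale_sketch(df: pd.DataFrame, columns: list, scale=None) -> pd.DataFrame:
    # Hypothetical sketch: close the selected columns to the given row sum.
    if scale is None:
        return df  # assume data is already closed
    out = df.copy()
    sums = out[columns].sum(axis=1)
    out[columns] = out[columns].div(sums, axis=0) * scale
    return out

df = pd.DataFrame({"Cu": [120.0], "Ni": [300.0], "Zn": [80.0]})
closed = apply_scale_sketch(df, ["Cu", "Ni", "Zn"], scale=100.0)
print(closed.sum(axis=1))  # each row now sums to 100
```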

As for @em-t's earlier comments about also constraining the scale parameter in the inverse transformation functions, and about keeping the original columns of the input dataframe alongside the transformed columns, I don't have an answer, as I'm not familiar with the actual use cases either.

ghost avatar Dec 19 '24 10:12 ghost