Refinements to automated substitutions

Open ehwenk opened this issue 1 year ago • 3 comments

In some circumstances the automated substitutions code (process.R, line 971) currently requires long lists of substitutions, and it could perhaps be refined.

Because the code only matches entire strings, a value that combines several categorical terms needs its own substitution whenever one of those terms has to change. For instance, changing procumbent to prostrate would take only 6 replacements with some variant of str_replace, but as things stand you'd have to add 97 different substitutions.

From growth_form branch:

> austraits$traits %>%
+   filter(trait_name == "stem_growth_habit") %>% filter(value == "procumbent") %>% distinct(dataset_id,value)
# A tibble: 6 × 2
  dataset_id         value     
  <chr>              <chr>     
1 Flora_Florabase    procumbent
2 Flora_NT           procumbent
3 Flora_of_Australia procumbent
4 Flora_PlantNet     procumbent
5 Flora_SA           procumbent
6 Flora_VicFlora     procumbent
> austraits$traits %>%
+   filter(trait_name == "stem_growth_habit") %>% filter(str_detect(value, "procumbent")) %>% distinct(dataset_id,value)
# A tibble: 97 × 2
   dataset_id      value                             
   <chr>           <chr>                             
 1 Flora_Florabase procumbent scrambling             
 2 Flora_Florabase procumbent spreading              
 3 Flora_Florabase compact erect procumbent sprawling
 4 Flora_Florabase bushy erect procumbent            
 5 Flora_Florabase bushy procumbent spreading        
 6 Flora_Florabase erect procumbent spreading        
 7 Flora_Florabase erect procumbent                  
 8 Flora_Florabase procumbent prostrate              
 9 Flora_Florabase procumbent                        
10 Flora_Florabase decumbent procumbent prostrate    
# … with 87 more rows
# ℹ Use `print(n = ...)` to see more rows

This gets even harder to fix when the terms are entered in the data.csv file in non-alphabetical order. The output is alphabetical, so a substitution written from the output does not match the input, and it is tedious to look up each value in the data.csv file to figure out why the substitution isn't "working".
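A minimal sketch of one way to make the comparison independent of word order: sort the terms on both sides before matching. `normalise_terms()` is a hypothetical helper, not something in austraits.build, and it assumes terms are separated by whitespace.

```r
# Hypothetical helper: put the whitespace-separated terms of each value
# into a fixed (C-locale alphabetical) order.
normalise_terms <- function(x) {
  vapply(strsplit(trimws(x), "\\s+"), function(terms) {
    if (anyNA(terms)) return(NA_character_)  # keep missing values missing
    paste(sort(terms, method = "radix"), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

data_value <- "spreading procumbent erect"  # order as entered in data.csv
find_value <- "erect procumbent spreading"  # order as shown in the output

data_value == find_value                                    # FALSE
normalise_terms(data_value) == normalise_terms(find_value)  # TRUE
```

With both sides normalised, an exact-match substitution copied from the alphabetical output would still find the unsorted input.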

Could the code be rewritten to replace every instance of a term, rather than requiring an exact match on the whole string?
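As a rough illustration of what term-level replacement could look like (an assumption about the approach, not the current process.R code), the sketch below splits each value into terms, swaps the matching term, and rebuilds the string. `replace_term()` is a hypothetical name. Base R is used so it runs without extra packages.

```r
# Hypothetical helper (not part of austraits.build): replace one term inside
# whitespace-separated categorical values, matching whole terms only.
replace_term <- function(value, find, replace) {
  vapply(strsplit(trimws(value), "\\s+"), function(terms) {
    if (anyNA(terms)) return(NA_character_)  # keep missing values missing
    terms[terms == find] <- replace          # exact match on each term
    # drop duplicates created by the replacement, then re-alphabetize
    paste(sort(unique(terms), method = "radix"), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

values <- c("procumbent", "erect procumbent spreading",
            "procumbent prostrate", "erect", NA)
replace_term(values, "procumbent", "prostrate")
```

Comparing whole terms rather than using a regular expression avoids accidental matches inside longer words. Removing duplicates means a value like "procumbent prostrate" collapses to a single "prostrate" instead of repeating it.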

(I also occasionally struggle with capital letters in the input causing substitutions to fail, but this shouldn't be a problem, should it?)
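If case is what makes a match fail, one possible workaround is to lower-case and trim both sides before comparing. This is only a sketch: `matches_ignoring_case()` is a hypothetical helper, and whether the existing pipeline already normalises case is not something this example establishes.

```r
# Hypothetical helper: compare values to a substitution's `find` string,
# ignoring letter case and surrounding whitespace.
matches_ignoring_case <- function(value, find) {
  tolower(trimws(value)) == tolower(trimws(find))
}

data_values <- c("Procumbent", "procumbent", "PROCUMBENT ", "erect")
matches_ignoring_case(data_values, "procumbent")
```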

ehwenk, Aug 21 '22 23:08