
termExtraction function: Misleading documentation / bug on "remove.terms" argument

Open kdmaclean opened this issue 1 year ago • 1 comments

Hi,

Not sure if this is intended behaviour or not. If it IS intended, I think the documentation is misleading.

The termExtraction function has a "remove.terms" argument with the following description:

#' @param remove.terms is a character vector. It contains a list of additional terms to delete from the documents before term extraction. The default is \code{remove.terms = NULL}.

However, that is not what happens. The terms are removed after term extraction, from the list of extracted terms. The distinction matters for bi-grams: if I want to remove the word "learning", the bi-gram "management learning" still exists, because remove.terms is applied to the extracted list afterwards rather than to the documents beforehand, which would have prevented "management learning" from being formed in the first place.

The relevant part of the code is below in the extractNgrams function.

ngrams <- ngrams %>% 
   unite(ngram, paste("word",1:nword,sep=""), sep = " ") %>%
   dplyr::filter(!ngram %in% custom_stopngrams) %>%
   mutate(ngram = toupper(ngram))
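
The difference between the two orderings can be reproduced with a toy example. This is a minimal base R sketch, not the bibliometrix implementation: `make_bigrams()` is a hypothetical helper and the word vector is made up for illustration.

```r
# Toy illustration in base R; this is NOT the bibliometrix code.
# make_bigrams() is a hypothetical helper that pairs adjacent words.
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

words <- c("management", "learning", "improves", "outcomes")
remove_terms <- "learning"

# Filter AFTER extraction (as in the excerpt above): only exact matches
# against the n-gram list are dropped, so every bi-gram survives.
after <- make_bigrams(words)
after <- after[!after %in% remove_terms]

# Filter BEFORE extraction (as the documentation describes): the word is
# dropped first, so no bi-gram can contain it.
before <- make_bigrams(words[!words %in% remove_terms])

print(after)   # "management learning" "learning improves" "improves outcomes"
print(before)  # "management improves" "improves outcomes"
```

Note that filtering before extraction also joins words that were never adjacent in the text ("management improves"), which is a side effect to weigh when choosing between the two orderings.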

kdmaclean · Jul 19 '24 01:07

Hi, thanks for your remarks. The terms are removed after term extraction, and this is intended behaviour. This way, it is possible to create n-grams containing a certain word and then decide to remove only some of them, or to remove only the single word and not the n-grams that use it. You're right that the documentation is misleading, and we will correct it. Thanks, Massimo
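
For readers who do want every n-gram containing a word removed, the filtering can be done on the extracted list itself. The following is a hedged base R sketch on a made-up vector of bi-grams; it does not call bibliometrix, and the variable names are assumptions for illustration only.

```r
# Hypothetical list of extracted bi-grams (made-up data).
bigrams <- c("management learning", "machine learning",
             "learning curve", "supply chain")

# Remove only selected n-grams by exact match, which mirrors the
# post-extraction behaviour described above.
keep_some <- bigrams[!bigrams %in% c("management learning")]

# Remove every n-gram that contains the word, using a whole-word regex.
drop_all <- bigrams[!grepl("\\blearning\\b", bigrams, perl = TRUE)]

print(keep_some)  # "machine learning" "learning curve" "supply chain"
print(drop_all)   # "supply chain"
```

Exact matching leaves the choice of which n-grams to drop with the user, while the pattern match approximates the "remove before extraction" result without creating new adjacent pairs.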

massimoaria · Sep 10 '24 05:09