Error when the ID column is not sequential and contains ID values larger than the corpus length.
Hi Dirk,
Thanks a lot for your great package! It is a great help with a project I am currently working on.
I noticed that I get an error when the doc_id column is not sequential. The example below should reproduce it.
require(text2sdg)
require(corpustools)

# Four documents whose doc_id values are non-sequential (1, 3, 7, 10)
d <- data.frame(text = c('Text one first sentence.',
                         'Climate change is bad. Do something about extreme poverty',
                         'Do something about extreme poverty', 'One '),
                doc_id = c(1, 3, 7, 10),
                date = c('2010-01-01', '2010-01-01', '2012-01-01', '2012-01-01'),
                source = c('A', 'B', 'B', 'C'))

tc <- create_tcorpus(d)
sdgs <- detect_sdg(tc)
Running systems
Obtaining text lengths
Building features
Running ensemble
Error: Missing data in columns: n_words.
In addition, if you fix only ensemble.R, you will see that text 3 (ID 7) is still missing from the result. I looked at the code and found that this is caused by how the ID columns are created internally in the ensemble.R and systems.R files.
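To illustrate the kind of problem I mean, here is a minimal base R sketch. It is not the actual text2sdg code: the names `lengths_df`, `hit_ids`, `by_position` and `by_id` are made up for this example, and I am assuming the failure comes from treating an ID as a row position.

```r
# Hypothetical illustration only, not the text2sdg internals.
# Word counts per document, keyed by the non-sequential IDs from the example.
lengths_df <- data.frame(doc_id  = c(1, 3, 7, 10),
                         n_words = c(4, 9, 5, 1))

# Suppose two documents (IDs 3 and 7) were flagged and need their lengths.
hit_ids <- c(3, 7)

# Fragile: uses the ID as a row position. Row 3 belongs to ID 7, and row 7
# does not exist, so the second lookup returns NA.
by_position <- lengths_df$n_words[hit_ids]

# More robust: find each ID's row with match(), so any ID values work.
by_id <- lengths_df$n_words[match(hit_ids, lengths_df$doc_id)]

print(by_position)  # 5 NA
print(by_id)        # 9 5
```

With positional indexing, one document gets the wrong length and the other gets NA, which would be consistent with both the "Missing data" error and a document dropping out of the result.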
The fix is really minor, so I have implemented it and will create a pull request.
Hi @grlju,
Thanks a lot for bringing this to our attention and proposing a solution! I will merge your pull request into a new branch, test it, and then merge it into main.
Happy to hear that you find the package useful!
Best, Dominik