Error when id column is not sequential and contains id values larger than corpus length.

Open • grlju opened this issue 1 year ago • 1 comment

Hi Dirk,

Thanks a lot for your great package! It is a great help with a project I am currently working on.

I noticed that I get an error when the doc_id column is not sequential. The example below should reproduce it.

library(text2sdg)
library(corpustools)

# Four documents whose doc_id values are non-sequential and exceed the corpus length (4)
d <- data.frame(text = c('Text one first sentence.',
                        'Climate change is bad. Do something about extreme poverty',
                        'Do something about extreme poverty', 'One '),
                doc_id = c(1, 3, 7, 10),
                date = c('2010-01-01','2010-01-01','2012-01-01', '2012-01-01'),
                source = c('A','B','B', 'C'))
# Build the tCorpus and run the SDG detection
tc <- create_tcorpus(d)
sdgs <- detect_sdg(tc)
Running systems
Obtaining text lengths
Building features
Running ensemble
Error: Missing data in columns: n_words.

In addition, if only ensemble.R is fixed, text 3 (doc_id 7) is still missing from the result. I looked at the code and found that this is caused by how the ID columns are created internally in the ensemble.R and systems.R files.
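
A simplified sketch of how this kind of failure can arise, using the ids from the example above. This is not the actual code in ensemble.R or systems.R; the names `n_words`, `by_position` and `by_match` and the word counts are illustrative assumptions. It only shows why treating an id as a row position breaks once ids are non-sequential or larger than the corpus length.

```r
# Hypothetical illustration, not the text2sdg internals.
doc_id  <- c(1, 3, 7, 10)
n_words <- c(4, 9, 5, 1)   # assumed word count per document, in corpus order

# Fragile: uses the id itself as a row position.
# Ids 7 and 10 exceed the corpus length (4), so the lookup yields NA,
# and id 3 silently picks up the count of the third document instead
# of the second.
by_position <- n_words[doc_id]

# More robust: translate each id to its position in the id vector first.
by_match <- n_words[match(doc_id, doc_id)]

print(by_position)
print(by_match)
```

With the positional lookup, two of the four documents end up with a missing word count and one gets the wrong one, which is consistent with the "Missing data in columns: n_words" error above. Looking ids up with `match()` returns one count per document regardless of the id values.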

The fix is really minor, so I have implemented it and will create a pull request.

Jan 19 '24 03:01 grlju

Hi @grlju,

Thanks a lot for bringing this to our attention and proposing a solution! I will merge your pull request into a new branch, test it, and then merge it into main.

Happy to hear that you find the package useful!

Best, Dominik

Jan 22 '24 13:01 psychobas