Error when id column is not sequential and contains id values larger than corpus length.

Open • grlju opened this issue 1 year ago • 1 comment

Hi Dirk,

Thanks a lot for your great package! It is a great help with a project I am currently working on.

I noticed that I get an error when the doc_id column is not sequential. The example below should be able to reproduce it.

library(text2sdg)
library(corpustools)

d <- data.frame(text = c('Text one first sentence.',
                        'Climate change is bad. Do something about extreme poverty', 
                        'Do something about extreme poverty', 'One '),
                doc_id = c(1, 3, 7, 10),
                date = c('2010-01-01','2010-01-01','2012-01-01', '2012-01-01'),
                source = c('A','B','B', 'C'))
tc <- create_tcorpus(d)
sdgs <- detect_sdg(tc)
Running systems
Obtaining text lengths
Building features
Running ensemble
Error: Missing data in columns: n_words.

In addition, if you fix only ensemble.R, text 3 (ID 7) is still missing from the result. I looked at the code and found that both problems come from how the ID columns are created internally in the ensemble.R and systems.R files.
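
To illustrate the kind of mismatch this can cause, here is a minimal base-R sketch. It is not the package's internal code: the variable names (`n_words`, `hits`, `by_position`, `by_id`) and the word counts are made up for the example, and it only assumes that document ids are somewhere treated as row positions rather than looked up.

```r
# Hypothetical illustration only; not taken from ensemble.R or systems.R.
doc_id  <- c(1, 3, 7, 10)   # non-sequential ids, as in the example above
n_words <- c(4, 9, 5, 1)    # assumed word counts, in corpus order

# Suppose hits were reported for the documents with ids 3 and 7.
hits <- data.frame(document = c(3, 7))

# Fragile: use the id as a position in a length-4 vector.
# id 3 picks up the third document's count, id 7 is out of range and gives NA.
by_position <- n_words[hits$document]

# Robust: look up each id explicitly with match().
by_id <- n_words[match(hits$document, doc_id)]

print(by_position)
print(by_id)
```

With positional indexing, an id larger than the corpus length yields a missing value, and a smaller id silently picks up the wrong document. Looking ids up with `match()` avoids both.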

The fix is really minor, so I have implemented it and will create a pull request.

grlju, Jan 19 '24 03:01

Hi @grlju,

Thanks a lot for bringing this to our attention and proposing a solution! I will merge your pull request into a new branch, test it, and then merge it into main.

Happy to hear that you find the package useful!

Best, Dominik

psychobas, Jan 22 '24 13:01