Sourcing doc_id does not work for 1-row tabular files
Let's create example files:
csv1 <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("Lorem ipsum", "dolor sit amet"),
  docvar1 = c("A", "B"),
  docvar2 = c("C", "D"),
  stringsAsFactors = FALSE
)
csv2 <- csv1[1, ]
write.csv(csv1, file = "/tmp/csv1.csv", row.names = FALSE)
write.csv(csv2, file = "/tmp/csv2.csv", row.names = FALSE)
For csv1.csv (two rows), doc_id and text are sourced correctly:
> readtext::readtext("/tmp/csv1.csv", docid_field = "doc_id", text_field = "text")
readtext object consisting of 2 documents and 2 docvars.
# Description: df[,4] [2 × 4]
  doc_id text                 docvar1 docvar2
  <chr>  <chr>                <chr>   <chr>
1 doc1   "\"Lorem ipsu\"..."  A       C
2 doc2   "\"dolor sit \"..."  B       D
For csv2.csv (one row), doc_id is taken from the filename instead of the doc_id column:
> readtext::readtext("/tmp/csv2.csv", docid_field = "doc_id", text_field = "text")
readtext object consisting of 1 document and 2 docvars.
# Description: df[,4] [1 × 4]
  doc_id   text                 docvar1 docvar2
  <chr>    <chr>                <chr>   <chr>
1 csv2.csv "\"Lorem ipsu\"..."  A       C
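As a sanity check that the 1-row file itself is well-formed, here is a minimal sketch that reads it back with base R only. It assumes nothing about readtext internals; the temporary path and the `check` variable are just for illustration.

```r
# Re-create the 1-row file in a temporary location (illustrative path).
csv2 <- data.frame(
  doc_id = "doc1",
  text = "Lorem ipsum",
  docvar1 = "A",
  docvar2 = "C",
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv")
write.csv(csv2, file = path, row.names = FALSE)

# Read it back with base R, bypassing readtext entirely.
check <- read.csv(path, stringsAsFactors = FALSE)

# The doc_id column survives the round trip with its original value.
stopifnot(nrow(check) == 1, identical(check$doc_id, "doc1"))
print(check$doc_id)
```

If this passes, the doc_id column is present in the file on disk, which suggests the filename fallback happens while readtext processes the 1-row table rather than when the CSV is written.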