readtext
importing docvars from docx files
I am undertaking a media content analysis and have downloaded DOCX files from nexus.com, which I am trying to import into R. I have unzipped the download and imported the files using readtext(), but I can't seem to import the metadata associated with them. As it stands, I only have the doc_id (title) and the text. Do you know how I could extract docvars from the files, such as the publisher or the date published?
Hi @louislegum, you could look into @JBGruber's LexisNexisTools R package. It includes functions to import files downloaded from LexisNexis. The package also allows you to extract the relevant metadata of each article.
Hi @stefan-mueller, thanks for the quick reply! I'm afraid I have been trying to use the LexisNexisTools package but am unable to import my sources using it. I am currently trying to find a solution to that issue as well.
I think you contacted me by email. I had a quick look at your example files and don't know yet what the problem is. They look fine at first glance, but somehow none of the relevant keywords are found.
Anyway, I don't think this is a readtext issue, as the behaviour you describe is how the package is intended to work. LexisNexis files have a weird way of storing the metadata, and this is not something readtext is designed to handle as far as I know.
Hi @JBGruber, thank you for looking at the files I sent! Does this mean you are still not sure about the reason behind the error message when I try to use the LNT package?
Ah, I see. Do you know of any other news article databases that store metadata in ways that are more compatible with readtext, by any chance?
@louislegum I think the above discussion has identified the issue as being some non-standard metadata issue, but I'd be happy to take a look nonetheless. Can you send me an example file and what you want to have extracted and how? Feel free to post the link here if it's not something you need to keep private.
That would be great! I have attached an example docx file that I will be using in my analysis. Overall, I am aiming to create a corpus of texts and use the metadata mentioned below to analyse them. Ideally, the metadata I want to extract and create docvars from are the date of publishing (which is always prefixed with "Load-Date") and the publisher, which is on the line directly below the title.
example docx: 15 species that should be brought back to rewild Britain;From wolves to grey whales and lynxes, plan.DOCX
Thanks for the help!
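For reference, the extraction described above (publisher on the line below the title, date on the line prefixed with "Load-Date") could be sketched in base R roughly as follows. This is only a sketch under assumptions: `extract_nexis_meta()` is a hypothetical helper, not part of readtext; it assumes the imported text keeps its line breaks, that the title is the first non-empty line, and that the prefix is written as "Load-Date:". The sample text is made up for illustration.

```r
# Hypothetical helper (not part of readtext or LexisNexisTools).
# Assumptions: the first non-empty line is the title, the next non-empty
# line is the publisher, and the date line starts with "Load-Date".
extract_nexis_meta <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  lines <- lines[nzchar(lines)]

  # Publisher: the line directly below the title
  publisher <- if (length(lines) >= 2) lines[2] else NA_character_

  # Date: first line prefixed with "Load-Date", with the prefix stripped
  date_line <- grep("^Load-Date", lines, ignore.case = TRUE, value = TRUE)
  load_date <- if (length(date_line) > 0) {
    sub("^Load-Date:?[[:space:]]*", "", date_line[1], ignore.case = TRUE)
  } else {
    NA_character_
  }

  data.frame(publisher = publisher, load_date = load_date,
             stringsAsFactors = FALSE)
}

# Made-up sample standing in for one imported document
sample_text <- paste(
  "15 species that should be brought back to rewild Britain",
  "Example Newspaper",
  "",
  "Body of the article goes here.",
  "",
  "Load-Date: July 1, 2020",
  sep = "\n"
)

# A data frame shaped like readtext output (doc_id and text columns)
docs <- data.frame(doc_id = "example.DOCX", text = sample_text,
                   stringsAsFactors = FALSE)

# Bind the extracted fields on as extra columns, one row per document
docs <- cbind(docs, do.call(rbind, lapply(docs$text, extract_nexis_meta)))
```

The date is left as a character string here because parsing month names with as.Date() depends on the locale; whether the line breaks survive the DOCX import would need checking on the real files.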
I had another look at the files today and found that the problem is linked to this issue: https://github.com/JBGruber/LexisNexisTools/issues/14
So setting remove_cover = FALSE solves the problem.
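A minimal sketch of that fix, assuming the files are read with `lnt_read()` from LexisNexisTools. The wrapper name `read_nexis_docx()` and the file path are placeholders for illustration, not anything from the package or from the files discussed here.

```r
# Hypothetical wrapper around LexisNexisTools::lnt_read().
# The only point of interest is remove_cover = FALSE, as suggested above.
read_nexis_docx <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (!requireNamespace("LexisNexisTools", quietly = TRUE)) {
    stop("Package 'LexisNexisTools' is needed for this function.")
  }
  LexisNexisTools::lnt_read(path, remove_cover = FALSE)
}

# Usage (not run; "example.DOCX" is a placeholder path):
# lnt <- read_nexis_docx("example.DOCX")
# lnt@meta  # metadata table of the returned object, if the import succeeds
```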
Ah amazing, that seems to have worked! Thank you very much for all your help, @JBGruber and @kbenoit!