Word count

Open m-Py opened this issue 8 years ago • 22 comments

Hi,

is it currently possible to count the number of words in the manuscript (excluding abstract and references) automatically? The example manuscript states 'too lazy to count', which I take as a hint that this is not possible. Are there any other suggestions for how to do it, other than counting manually?

Best regards, Martin

m-Py avatar Apr 12 '16 08:04 m-Py

Hi Martin,

the answer to this question depends a little on the target document type you are working with. For MS Word documents I'd recommend simply generating the document and then entering the word count by hand using Word's word count feature. If you are going through LaTeX to produce a PDF, I suggest you try TeXcount, which was probably installed along with your TeX distribution. It's a command line tool (but there is also a web interface, see the link) and has served me well. Let me know if this works for you.

Best regards, Frederik

crsh avatar Apr 12 '16 14:04 crsh

Thinking a little more about this, I forgot to mention that, because of the way pandoc manages references, TeXcount will by default count your references against the word limit. So you may need to remove the reference section manually from the TeX source file before calling TeXcount.
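The manual step of cutting the reference section could also be scripted. The following is only a minimal base-R sketch: the function name `drop_references` is made up for illustration, and it assumes that the reference section starts at a line containing `\section{References}`, which may not match the TeX file pandoc actually writes, so check the source and adjust the marker.

```r
# Remove everything from the reference section onwards, but keep \end{document}.
# Assumption: the reference section begins at a line containing `marker`.
drop_references <- function(lines, marker = "\\section{References}") {
  hit <- which(grepl(marker, lines, fixed = TRUE))
  if (length(hit) == 0) return(lines)  # no marker found: leave the source untouched

  # Lines from the marker to the end of the file; only \end{document} is kept from these.
  tail_lines <- lines[hit[1]:length(lines)]
  end_doc <- tail_lines[grepl("\\end{document}", tail_lines, fixed = TRUE)]

  c(lines[seq_len(hit[1] - 1)], end_doc)
}

# Small made-up example standing in for a real TeX source file.
tex <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some text.",
  "\\section{References}",
  "Doe, J. (2016). A title.",
  "\\end{document}"
)
body <- drop_references(tex)

# Write the trimmed source to a temporary file that can then be passed to TeXcount.
tmp <- tempfile(fileext = ".tex")
writeLines(body, tmp)
```

For a real manuscript, `tex` would come from `readLines()` on the generated TeX file, and the trimmed file can be given to the `texcount` command line tool or pasted into its web interface.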

crsh avatar Apr 12 '16 14:04 crsh

Hi Frederik,

thanks a lot for your response! I will look into TeXcount. I am converting to PDF, but I'll probably need to use MS Word as well as soon as I need to collaborate with others. Removing the reference section for the word count is not a problem.

Best regards,
Martin

m-Py avatar Apr 13 '16 10:04 m-Py

There's a new kid on the block: Try wordcountaddin. This is an R solution that works on the RMarkdown file directly. I haven't played with it and I'm not sure how reliable it is; e.g., I don't know if it counts text in tables etc. If you give it a try, I'd be interested in your experience.

crsh avatar May 10 '16 10:05 crsh

Ah, that seems nice. It says it is an RStudio addin, but you can also use an internal function to process any R strings without using RStudio. In a first test this worked, for example r chunks were not counted. It should be able to read whole Rmd text files, too. Thanks for that catch! I will report as soon as I have done more testing with Rmd files.

m-Py avatar May 11 '16 15:05 m-Py

I played around a little and was able to write a function that reads an Rmd file and then uses the word count function of wordcountaddin to count all words in this file. Indeed, the R chunks, inline R code, and YAML headers are ignored. Inline LaTeX commands are not ignored, but maybe this will be implemented at some point.

The two word count methods that are offered (koRpus & stringi) differ rather strongly in my case, but koRpus yields reasonable results. The koRpus estimate is close to what I get when I import the text into Libre Office and just remove all r chunks and the yaml header (stringi is actually very far off).

I will use this function from now on :-)
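To give a rough idea of what ignoring the YAML header, R chunks, and inline R involves, here is a minimal base-R sketch. It is not the function described above and does not call wordcountaddin; the name `count_rmd_words` and the regular expressions are illustrative assumptions, and the counting rule (whitespace-separated tokens containing a letter or digit) is deliberately naive.

```r
# Rough word count for the lines of an R Markdown file, using base R only.
# Assumptions: the YAML header is delimited by '---' lines at the top of the file,
# and code chunks are fenced with three backticks.
count_rmd_words <- function(lines) {
  # Drop the YAML header (from the first '---' to the second '---').
  yaml <- which(trimws(lines) == "---")
  if (length(yaml) >= 2 && yaml[1] == 1) {
    lines <- lines[-seq(yaml[1], yaml[2])]
  }

  # Drop fenced chunks: each fence line toggles an "inside chunk" state.
  is_fence <- grepl("^[[:space:]]*```", lines)
  inside <- cumsum(is_fence) %% 2 == 1
  lines <- lines[!inside & !is_fence]

  text <- paste(lines, collapse = " ")

  # Drop inline R code such as `r mean(x)`.
  text <- gsub("`r [^`]*`", " ", text)

  # Count whitespace-separated tokens that contain at least one letter or digit.
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  sum(grepl("[[:alnum:]]", tokens))
}

# Made-up example input; a real call would use readLines() on the Rmd file.
example_rmd <- c(
  "---",
  "title: Test",
  "---",
  "",
  "Some text with `r 1 + 1` words.",
  "```{r}",
  "x <- 1",
  "```",
  "More text."
)
n_words <- count_rmd_words(example_rmd)
```

Like the addin at this point, the sketch does not strip inline LaTeX commands, so those would still be counted as words.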

m-Py avatar May 12 '16 09:05 m-Py

That's good to know. Thanks for providing the feedback. I had noticed rather large discrepancies between the word count methods, too, but I hadn't yet cross-validated them. Thanks for that. I'd be interested in the function you wrote because I've been wanting to automate word counting in papaja. Would you be willing to share your function, e.g., in a gist?

crsh avatar May 12 '16 10:05 crsh

Sure, gladly, here is the gist:

https://gist.github.com/m-Py/faf679a0a0be3dbafa2b43b390519923

I crossvalidated the function with the RStudio addin - the results are the same (at least they were for me, you might double check that ;) ).

m-Py avatar May 12 '16 12:05 m-Py

@m-Py that's good to know about the accuracy of the two methods (I'm the author of the wordcountaddin), thanks for sharing your test results. I might drop the stringi method from the addin.

The addin will count text that is present in markdown tables in the Rmd file before the file is knit, but excluding that is on my list of things to do. It won't count tables generated by R code that only appear in the rendered document.

benmarwick avatar May 21 '16 12:05 benmarwick

@benmarwick glad I could help. The difference between the two estimates was really rather large, koRpus yielded ~ 10,000 words and stringi only estimated ~ 6,000 words. Thank you for creating this nice package, it is really of great use to me.

If you intend to include a function in your package that counts the words without necessarily using RStudio, feel free to just use the code in my gist above. I tried some code to process the text of an Rmd file so that it is apparently formatted as a text selection in RStudio, so this should work.

m-Py avatar May 24 '16 07:05 m-Py

would be supercool if this could be integrated into the wordcount header in a .rmd manuscript!

ebergelson avatar Dec 20 '17 23:12 ebergelson

Note to self: I just found a nice example of a Lua filter that counts words, which may just do the trick with some adaptations.

crsh avatar Jan 20 '19 21:01 crsh

Also, a new function in wordcountaddin that might be useful here: wordcountaddin::word_count("my_file.Rmd")

This returns a single integer, so it might be handy for using in headers, etc.

benmarwick avatar Jan 20 '19 23:01 benmarwick

Thanks for the pointer, I'll also try that and report on how they compare.

crsh avatar Jan 21 '19 09:01 crsh

Hi @crsh, thank you so much for building papaja; it has amazing defaults. I've found a simple workaround: putting an inline R expression in the wordcount field:

"`r wordcountaddin::word_count('estimating_richness_sdm.Rmd')`"

minimal_reprex available here: https://gist.github.com/Rekyt/9ebda737eb7d818fdfe7981b79549a7f

Rekyt avatar Feb 18 '19 17:02 Rekyt

I just pushed a commit that adds a first draft of the Lua-filter that counts words on the intermediate AST after citations have been rendered by pandoc-citeproc (devtools::install_github("crsh/papaja@devel"); it's based on two other Lua-filters). The filter reports the word count in the console or the R Markdown tab in RStudio.

I have compared the output for the example document in this repository to several other common approaches. This document is probably a tough one, because it contains code, verbatim output, URLs in references etc.


Lua-filter

1749 words in text body
322 words in reference section

The word count for the text body does not include tables or images (or their captions) or the reference section.

wordcountaddin

> wordcountaddin::word_count('example/example.Rmd')
[1] 1407

The substantial deviation here is probably largely due to the not-yet-rendered citations, of which there are several in this document.

texcount

I pasted the LaTeX code into the texcount web interface. It reported the following counts for the text body:

Words in text: 944
Words in headers: 31
Words outside text (captions, etc.): 58
Number of headers: 8
Number of floats/tables/figures: 4
Number of math inlines: 16

and

Words in text: 400
Words in headers: 1
Words outside text (captions, etc.): 0
Number of headers: 0
Number of floats/tables/figures: 0
Number of math inlines: 0

for the reference section. The output also noted several errors related to the code and verbatim output. I think those errors may have caused texcount to ignore some bits and are probably the reason for the low word count of the text body.

wordcounter.net

Copy-pasting the text from the word document (without tables and figures) yielded the following counts:

1713 for the text body
324 for the reference section

Pages

Similarly, the Pages count (again without tables and figures) yielded

1728 for the text body
429 for the reference section


Overall I'm fairly happy with the performance of the Lua-filter. Word counting is a tricky business and none of the above methods agree. The wordcountaddin and texcount (appear to) have technical limitations with this document; wordcounter.net and Pages are in the same ballpark as the Lua-filter. I'm sure the filter can be improved (and I'll gladly take any suggestion) but I think in its current form it is a decent solution.
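To make the comparison easier to read, the text-body counts reported above can be expressed as deviations from the Lua-filter count. This is just arithmetic on the numbers in this comment; the texcount figure is its "Words in text" value and leaves out the 31 words it counted in headers.

```r
# Text-body counts for the example document, as reported above.
counts <- c(
  "Lua-filter"      = 1749,
  "wordcountaddin"  = 1407,
  "texcount"        = 944,   # "Words in text" only, without headers
  "wordcounter.net" = 1713,
  "Pages"           = 1728
)

# Deviation from the Lua-filter count in percent, rounded to one decimal place.
reference <- counts[["Lua-filter"]]
deviation <- round(100 * (counts - reference) / reference, 1)
print(deviation)
```

By this measure wordcountaddin comes out about 20% lower and texcount about 46% lower than the Lua-filter, while wordcounter.net and Pages are both within roughly 2%.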

crsh avatar Apr 26 '19 21:04 crsh

Hi Frederik, I'm sorry but I couldn't really figure out how to actually run/implement the Lua-filter. Could you maybe give a brief example? And do I understand correctly that it is not possible to include the count directly in the YAML header? But might it be possible to run it in a code chunk, save the result in the cache, and load it that way in the YAML header? (Hence, something like r knitr::load_cache(label = "count-words", object = "n_words")?) Thanks so much!

tdienlin avatar Jul 04 '19 09:07 tdienlin

It's currently not possible to automatically include it, but I plan to look into ways to do this. The filter cannot be called in a code chunk because it is executed after all R code has been run and pandoc-citeproc has been applied.

If you are using the current development version of papaja (devtools::install_github("crsh/papaja@devel")), the filter should be automatically applied. The word count filter reports the word counts in the console or the R Markdown tab in RStudio, respectively, e.g.,

285 words in text body
23 words in reference section

crsh avatar Jul 04 '19 10:07 crsh

Ah, I see, now I understand. Works perfectly. Thanks for the quick reply!

tdienlin avatar Jul 04 '19 11:07 tdienlin

This is really awesome, @crsh!

Would you mind adding this word_count functionality to revision_letter output as well?

I know we can manually add some pandoc_args to the YAML for that; however, it would be better if it were provided by default, as in apa6_pdf.

jooyoungseo avatar Nov 09 '19 00:11 jooyoungseo

Sounds like a reasonable request. I'm a little swamped at the moment. If you'd like to try tackling this, I'd be more than happy to review a PR.

crsh avatar Nov 13 '19 16:11 crsh

Is this wordcount "problem" solved? I am using the template, but the wordcount doesn't work (with all default settings in the YAML):

keywords          : "Public policy, Crime, Paraguay, Bayesian statistics"
wordcount         : "X"
bibliography      : "bibliography.bib"
floatsintext      : no
linenumbers       : yes
draft             : no
mask              : no
figurelist        : no
tablelist         : no
footnotelist      : no
classoption       : "man"
output            : papaja::apa6_word
editor_options:
  markdown:
    wrap: 72

schneiderpy avatar Oct 30 '23 22:10 schneiderpy