quarto-wordcount
word count with books
Thanks for making this.
If I want to use it with a book, how would that happen? It looks like the yml that you have goes into the individual documents (I think).
Is there a way to put the options into _quarto.yml and get either a per-chapter count or one for the entire book?
I wrote some code to compute and aggregate the counts per markdown file:
# Computes the word count of each md file and appends results to tibble
# Assumes that the lua filters from https://github.com/andrewheiss/quarto-wordcount
# are in a path that is accessible.
get_word_count <- function() {
  require(purrr)
  md_files <- list.files(pattern = "\\.md$")
  md_stubs <- gsub("\\.md$", "", md_files)
  res <-
    map(md_stubs, file_word_count) %>%
    map2_dfr(md_stubs, parse_output)
  res
}
# ------------------------------------------------------------------------------
# helpers
# Runs a system command to get the outputs
#' @param x the file name stub
#' @param path The location of the lua filters
file_word_count <- function(x, path = "word_counts") {
  require(glue)
  file_path <- tempfile()
  file_path <- paste0(file_path, ".html")
  cmd_text <- glue::glue("pandoc {x}.md --output {file_path} --lua-filter {path}/wordcount.lua --citeproc")
  system(cmd_text, intern = TRUE)
}
# operates on all results
#' @param x the results of the filter
#' @param file_name the file name to add to the results
parse_output <- function(x, file_name) {
  require(purrr)
  has_numbers <- grepl("[[:digit:]]", x)
  x <- x[has_numbers]
  res <- purrr::map_dfr(x, extract_count)
  res$file <- file_name
  res
}
# converts filter results to tibble
#' @param x one line of filter output (a count followed by its description)
extract_count <- function(x) {
  require(tibble)
  split_up <- strsplit(x, " ")[[1]]
  count <- as.integer(split_up[1])
  desc <- paste(split_up[-1], collapse = " ")
  tibble::tibble(count = count, type = desc)
}
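To go from those per-file rows to a per-chapter table or a single book-wide figure, something like the following might work. It is only a sketch in base R: the `counts` data frame is a stand-in for whatever `get_word_count()` returns, the "intro" numbers are invented for illustration, and the `type` labels are assumptions that depend on how the filter words its output.

```r
# Stand-in for the tibble returned by get_word_count(): one row per
# file and count type. The "intro" rows are invented for illustration.
counts <- data.frame(
  count = c(14L, 14L, 120L, 118L),
  type  = c("total words", "words in text body",
            "total words", "words in text body"),
  file  = c("index", "index", "intro", "intro"),
  stringsAsFactors = FALSE
)

# Per-chapter view: files as rows, count types as columns
per_chapter <- tapply(counts$count, list(counts$file, counts$type), sum)

# Book-wide view: sum each count type across all files
book_totals <- aggregate(count ~ type, data = counts, FUN = sum)
```

`per_chapter` is a matrix indexed by file and count type, and `book_totals` is a data frame with one row per count type.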
Sorry for the delay here! I ended up making a bunch of other major changes to the filter and I think I can figure out a solution here now
The filter works on individual documents, since it converts each document to a pandoc AST and finds the word count from that. With things like Quarto books and websites, Quarto renders each of the files separately (and would get a separate word count for each) and then does whatever magic it uses to stitch them all together into a single document. In the case of HTML output, I'm 99% sure that the separate documents are never combined into a single AST, since both books and websites are, um, websites
With PDF and Word output, though, there might be one unified AST prior to converting to the final output—I'll need to check that
In any case, even if PDF/Word have a single combined AST to work with, it still might be easier to do something like the purrr::map(file_word_count) approach you did, since HTML doesn't use one single file. Perhaps some function that captures the output from each file, or that builds a tidy CSV as it renders and then can read from that CSV, or something along those lines?
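A rough sketch of the "tidy CSV" idea, in base R. This is not something the filter does today: `append_counts()` is a hypothetical helper, the column names are assumptions, and the "intro" count is invented for illustration. Each rendered file would append its rows to a shared CSV, which is read back once at the end.

```r
# Hypothetical helper: append one file's counts to a shared CSV,
# writing the header only when the CSV does not exist yet.
append_counts <- function(counts, csv_path) {
  new_file <- !file.exists(csv_path)
  write.table(
    counts, csv_path,
    sep = ",", row.names = FALSE,
    col.names = new_file, append = !new_file
  )
  invisible(csv_path)
}

csv_path <- tempfile(fileext = ".csv")

# One call per rendered file (the "intro" count is invented)
append_counts(data.frame(file = "index", type = "total words", count = 14L), csv_path)
append_counts(data.frame(file = "intro", type = "total words", count = 120L), csv_path)

# After rendering, read everything back and aggregate
all_counts <- read.csv(csv_path, stringsAsFactors = FALSE)
book_total <- sum(all_counts$count[all_counts$type == "total words"])
```

Writing the header only on the first call keeps the file readable with a plain `read.csv()`, whatever order the chapters render in.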
k cool, so in exploring this more, it looks like neither HTML nor PDF output uses a full single combined AST. They all render everything to individual markdown files and then (1) for HTML books, each file gets converted to individual HTML files and (2) for PDF books, Quarto/pandoc somehow merges them into one .tex file and then passes that through LaTeX. I'm assuming Word, typst, markdown, and others do something similar.
When the word counting filter is included in _quarto.yml (like here from the default book template you get from RStudio's new project dialog):
project:
  type: book
book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd
bibliography: references.bib
format:
  html:
    theme: cosmo
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
  pdf:
    documentclass: scrreprt
…it counts each of the individual .qmd files separately
I think it should be fairly straightforward to make Lua output those counts to a (temporary?) file and then aggregate them at the end
I think it should be fairly straightforward to make Lua output those counts…
lol nope.
With this content in index.qmd:
# Preface {.unnumbered}
This is a Quarto book.
To learn more about Quarto books visit <https://quarto.org/docs/books>.
…if the word count filter is run on its own just on index.qmd, there are 14 words:
Overall totals:
-----------------------------
- 14 total words
- 14 words in body and notes
Section totals:
-----------------------------
- 14 words in text body
When it's run as a whole book, though, index.qmd suddenly has 32 words.
Overall totals:
-----------------------------
- 32 total words
- 32 words in body and notes
Section totals:
-----------------------------
- 32 words in text body
I don't know where they're coming from either.
If I keep the intermediate md files:
project:
  type: book
book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd
bibliography: references.bib
format:
  html:
    theme: cosmo
    keep-md: true
    citeproc: false
    count-code-blocks: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
…the resulting intermediate index.html.md still only has 14 words in it:
# Preface {.unnumbered}
This is a Quarto book.
To learn more about Quarto books visit <https://quarto.org/docs/books>.
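As a rough cross-check of that 14, here is a naive whitespace-based count in base R. It is not the filter's algorithm (the filter walks the pandoc AST); `naive_word_count()` is a hypothetical helper that strips attribute blocks like `{.unnumbered}` and counts tokens containing a letter or digit.

```r
# Hypothetical helper: a crude approximation of a word count,
# NOT the logic used by wordcount.lua.
naive_word_count <- function(lines) {
  # Drop pandoc attribute blocks such as {.unnumbered}
  lines <- gsub("\\{[^}]*\\}", "", lines)
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  # Keep only tokens that contain at least one letter or digit
  sum(grepl("[[:alnum:]]", tokens))
}

# The contents of index.qmd as shown above
index_md <- c(
  "# Preface {.unnumbered}",
  "",
  "This is a Quarto book.",
  "",
  "To learn more about Quarto books visit <https://quarto.org/docs/books>."
)

n_words <- naive_word_count(index_md)
```

For this file the naive count also comes out to 14, which matches the standalone filter run rather than the 32 reported during the book render.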
Quarto's collection of book-related Lua filters is doing something extra behind the scenes that I can't track down.
Hi, has anyone found a solution for this in the meantime? :-)