quarto-wordcount word count with books

Thanks for making this.

If I want to use it with a book, how would that happen? It looks like the yml that you have goes into the individual documents (I think).

Is there a way to put the options into _quarto.yml and get either a per-chapter count or one for the entire book?

Apr 17 '23 16:04 topepo

I wrote some code to compute and aggregate the counts per markdown file:

# Computes the word count of each md file and appends results to tibble
# Assumes that the lua filters from https://github.com/andrewheiss/quarto-wordcount
# are in a path that is accessible.

get_word_count <- function() {
  require(purrr)
  md_files <- list.files(pattern = "\\.md$")
  md_stubs <- gsub("\\.md$", "", md_files)

  res <-
    map(md_stubs, file_word_count) %>%
    map2_dfr(md_stubs, parse_output)
  res
}

# ------------------------------------------------------------------------------
# helpers

# Runs a system command to get the outputs
#' @param x the file name stub
#' @param path The location of the lua filters
file_word_count <- function(x, path = "word_counts") {
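  # Render to a throwaway HTML file; intern = TRUE captures what the filter prints
  file_path <- tempfile(fileext = ".html")
  # shQuote() keeps file names and paths with spaces from breaking the command
  cmd_text <- glue::glue(
    "pandoc {shQuote(paste0(x, '.md'))} --output {shQuote(file_path)} ",
    "--lua-filter {shQuote(file.path(path, 'wordcount.lua'))} --citeproc"
  )
  system(cmd_text, intern = TRUE)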
}

# operates on all results
#' @param x the results of the filter
#' @param file_name the file name to add to the results
parse_output <- function(x, file_name) {
  require(purrr)
  has_numbers <- grepl("[[:digit:]]", x)
  x <- x[has_numbers]
  res <- purrr::map_dfr(x, extract_count)
  res$file <- file_name
  res
}

# converts filter results to tibble
#' @param x a single line of filter output that starts with the count
extract_count <- function(x) {
  require(tibble)
  split_up <- strsplit(x, " ")[[1]]
  count <- as.integer(split_up[1])
  desc <- paste(split_up[-1], collapse = " ")
  tibble::tibble(count = count, type = desc)
}

Apr 18 '23 15:04 topepo

Sorry for the delay here! I ended up making a bunch of other major changes to the filter and I think I can figure out a solution here now

The filter works on individual documents, since it converts each document to a pandoc AST and finds the word count from that. With things like Quarto books and websites, Quarto renders each of the files separately (and would get a separate word count for each) and then does whatever magic it uses to stitch them all together into a single document. In the case of HTML output, I'm 99% sure that the separate documents are never combined into a single AST, since both books and websites are, um, websites

With PDF and Word output, though, there might be one unified AST prior to converting to the final output—I'll need to check that

In any case, even if PDF/Word have a single combined AST to work with, it still might be easier to do something like the purrr::map(file_word_count) approach you did, since HTML doesn't use one single file. Perhaps some function that captures the output from each file, or that builds a tidy CSV as it renders and then can read from that CSV, or something along those lines?
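
A minimal sketch of the CSV idea, assuming each chapter's counts have already been written to its own CSV with the count, type, and file columns that get_word_count() above returns. The numbers, the "total words" label, and the summarise_counts() helper are all made up for illustration; none of this is part of the filter:

```r
# Hypothetical per-chapter CSVs, one per rendered file.
# The counts and type labels are invented for illustration only.
csv_dir <- tempfile("wordcounts_")
dir.create(csv_dir)

write.csv(
  data.frame(count = c(14L, 14L),
             type = c("total words", "words in text body"),
             file = "index"),
  file.path(csv_dir, "index.csv"), row.names = FALSE
)
write.csv(
  data.frame(count = c(120L, 110L),
             type = c("total words", "words in text body"),
             file = "intro"),
  file.path(csv_dir, "intro.csv"), row.names = FALSE
)

# Hypothetical helper: read every CSV in `dir`, keep the "total words" rows,
# and return both the per-chapter counts and the book-wide sum.
summarise_counts <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  all_counts <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
  totals <- all_counts[all_counts$type == "total words", c("file", "count")]
  list(per_chapter = totals, book_total = sum(totals$count))
}

res <- summarise_counts(csv_dir)
res$per_chapter
res$book_total
```

Keeping one file per chapter means a re-render of a single chapter only has to overwrite its own CSV, and the aggregation step stays independent of the output format.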

Jun 03 '24 20:06 andrewheiss

k cool, so in exploring this more, it looks like neither HTML nor PDF output uses a full single combined AST. They all render everything to individual markdown files and then (1) for HTML books, each file gets converted to individual HTML files and (2) for PDF books, Quarto/pandoc somehow merges them into one .tex file and then passes that through LaTeX. I'm assuming Word, typst, markdown, and others do something similar.

When the word counting filter is included in _quarto.yml (like here from the default book template you get from RStudio's new project dialog):

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    citeproc: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua
  pdf:
    documentclass: scrreprt

…it counts each of the individual .qmd files separately

I think it should be fairly straightforward to make Lua output those counts to a (temporary?) file and then aggregate them at the end

Jun 04 '24 01:06 andrewheiss

"I think it should be fairly straightforward to make Lua output those counts…"

lol nope.

With this content in index.qmd:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

…if the word count filter is run on its own just on index.qmd, there are 14 words:

Overall totals:
-----------------------------
- 14 total words
- 14 words in body and notes

Section totals:
-----------------------------
- 14 words in text body

When it's run as a whole book, though, index.qmd suddenly has 32 words.

Overall totals:
-----------------------------
- 32 total words
- 32 words in body and notes

Section totals:
-----------------------------
- 32 words in text body

I don't know where they're coming from either.

If I keep the intermediate md files:

project:
  type: book

book:
  title: "blah-book"
  author: "Norah Jones"
  date: "6/3/2024"
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
    keep-md: true
    citeproc: false
    count-code-blocks: false
    filters:
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/citeproc.lua
      - at: pre-quarto
        path: _extensions/andrewheiss/wordcount/wordcount.lua

…the resulting intermediate index.html.md still only has 14 words in it:

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

Quarto's collection of book-related Lua filters is doing something extra behind the scenes that I can't track down.

Jun 04 '24 01:06 andrewheiss

Hi, has anyone found a solution for this in the meantime? :-)

Feb 10 '25 18:02 sarahwarchhold
