
Suggestion: Adding nonbreakablespace lua filter

Open Delanii opened this issue 4 years ago • 6 comments

Hello Mr. Tarleb,

with your help, I have finished writing and testing a filter that inserts a non-breakable space before or after specific strings. If I prepared an informative README and added a makefile and data to run the tests, would you be interested in adding this filter to this repository?

I tried to follow the Lua code style recommendations and also added comments that should make clear what I am doing (or trying to do).

The final code of the filter follows:

-- Indexed table of one-letter prefixes, after which '\160' should be inserted.
-- Verbose, but can be changed per user requirements.

local prefixes = {

-- Some languages (Czech among them) require a non-breakable space *before* a long dash.

local dashes = {

-- Function responsible for searching for one-letter prefixes, after which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the prefix
--   wasn't found).

function findOneLetterPrefix(myString)
  for index, prefix in ipairs(prefixes) do
    if myString == prefix then
      return true
    end
  end
  return false
end

-- Function responsible for searching for dashes, before which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `dash` in the `dashes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the dash
--   wasn't found).

function findDashes(myDash)
  for index, dash in ipairs(dashes) do
    if myDash == dash then
      return true
    end
  end
  return false
end

-- Core filter function:
--
-- * It iterates over all inline elements in a block
-- * If it finds a Space element, it uses the previously defined functions to
--   find `prefixes` or `dashes`
-- * It replaces the Space element with `Str '\u{a0}'`, a non-breakable space
-- * It returns the modified list of inlines

function Inlines (inlines)
  for i = 1, #inlines do
    -- A neighbour may be missing at either end of the list, so guard against nil
    local prev, nxt = inlines[i - 1], inlines[i + 1]

    if inlines[i].t == 'Space' then
      -- Check for one-letter prefixes in Str before Space
      if prev and prev.t == 'Str' and findOneLetterPrefix(prev.c) then
        -- inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
        inlines[i] = pandoc.Str '\u{a0}'
      end

      -- Check for dashes in Str after Space
      if nxt and nxt.t == 'Str' and findDashes(nxt.c) then
        inlines[i] = pandoc.Str '\u{a0}'
      end

      -- Check for not fully parsed Str elements - those might be products of
      -- other filters that were executed before this one
      if nxt and nxt.t == 'Str'
          and string.match(nxt.c, '%.*%s*[„]?%d+[“]?%s*%.*') then
        inlines[i] = pandoc.Str '\u{a0}'
      end
    end

    -- Check for a Str containing the sequence " prefix ", which might occur when
    -- a preceding filter creates it in one Str element. Also check whether a
    -- quotation mark introduced by the "quotation.lua" filter is present.
    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        -- Anchored captures keep the text on both sides of the match
        local front, detection, _, back = string.match(
          inlines[i].c, '^(.-)(%s+[„]?' .. prefix .. '[“]?)(%s+)(.*)$')
        if front then
          inlines[i].c = front .. detection .. '\u{a0}' .. back
        end
      end
    end
  end
  return inlines
end

Looking forward to your reply.

Regards, Tomas

Delanii avatar Oct 02 '20 09:10 Delanii

Thank you Tomas, I appreciate the offer! Could you tell me a bit more about the use cases of this filter? If I understand correctly, this is for text written in Czech. I'd like to understand why it is needed, and whether it supports a common typographical convention. If it solves a common problem for Czech writers, then I believe it should fit in.

In case this is a less general filter, an alternative would be to host it in your own repository and tag the repo with the pandoc-filter topic to make it discoverable. In that case it should also be mentioned in the pandoc wiki under

In either case, I'll be happy to help and provide more feedback.

tarleb avatar Oct 02 '20 20:10 tarleb

Indeed, you are correct. This filter tries to solve a common typography requirement: one-letter words (in Czech these are what I call prefixes (prepositions, in linguistic terms?) and conjunctions; again, I might be missing the correct terms) must never appear at the end of a line. It also tries to add a non-breakable space before every en dash and before every number (to prevent a number being separated from its meaning, like "chapter 9" being broken across two lines). It should be noted (I do so in the README) that this puts some strain on line-breaking patterns, so hyphenation should be allowed where possible.

The functions with regexes inside try to find the aforementioned patterns in strings that for some reason are not parsed into Str and Space elements. I have tested this for the case where an earlier filter does macro expansion or string replacement.
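A minimal sketch of these three rules, assuming plain Lua tables stand in for pandoc's Str and Space elements. The `glue` function, the token shape, the shortened prefix list and the en dash entry are illustrative assumptions, not the filter's actual code; the `\u{...}` escapes need Lua 5.3 or later.

```lua
local NBSP = '\u{a0}'

-- Assumed sample data: a few Czech one-letter words and the en dash.
local prefixes = { a = true, i = true, k = true, o = true,
                   s = true, u = true, v = true, z = true }
local dashes = { ['\u{2013}'] = true }

-- Replace a Space token with a non-breakable space when it follows a
-- one-letter word or precedes a dash or a number.
local function glue(tokens)
  for i, tok in ipairs(tokens) do
    if tok.t == 'Space' then
      local prev, nxt = tokens[i - 1], tokens[i + 1]
      local after_prefix = prev and prev.t == 'Str' and prefixes[prev.text]
      local before_dash_or_number = nxt and nxt.t == 'Str'
        and (dashes[nxt.text] or nxt.text:match('^%d'))
      if after_prefix or before_dash_or_number then
        tokens[i] = { t = 'Str', text = NBSP }
      end
    end
  end
  return tokens
end

local result = glue {
  { t = 'Str', text = 'v' }, { t = 'Space' },
  { t = 'Str', text = 'kapitole' }, { t = 'Space' },
  { t = 'Str', text = '9' },
}
print(result[2].t, result[4].t)  -- Str   Str
```

Both spaces are replaced here: the first follows the one-letter word "v", the second precedes a number.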

Also, I am trying to detect strings that contain different quotation marks. I found a simple filter proposed by jgm that changes the quotation marks inserted by pandoc to chosen UTF symbols, which sadly produces strings like Str „text“, so that in a case like

Str "„a" Space Str "quoted" Space Str "string“"

my filter would not detect the "a" with the opening quotation mark. With those regexes it should.
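One possible way to handle the quoted case, sketched under assumptions: strip a single leading „ and trailing “ before looking the word up. The helper names `strip_quotes` and `is_quoted_prefix` and the shortened prefix table are hypothetical. The marks are matched as literal multi-byte sequences because a Lua character class such as `[„]` works on single bytes, not on whole UTF-8 characters.

```lua
-- Assumed sample set of one-letter words.
local prefixes = { a = true, i = true, k = true, o = true,
                   s = true, u = true, v = true, z = true }

-- Remove one opening „ at the start and one closing “ at the end, if present.
local function strip_quotes(text)
  local bare = text:gsub('^„', ''):gsub('“$', '')
  return bare
end

local function is_quoted_prefix(text)
  return prefixes[strip_quotes(text)] == true
end

print(is_quoted_prefix('„a'))      -- true
print(is_quoted_prefix('„text“'))  -- false
```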

Well, I am not using the official quotations.lua which I maybe should.

The filter is far from perfect, doesn't cover every typographical aspect, and might require user intervention depending on the language's requirements, but I dare say it is a good start.

I have tested it with the docx and odt formats, which are my main targets when converting from TeX. In LuaTeX and ConTeXt I use Lua callbacks (post-linebreak-filter), so I have not tested the .tex format, but I expect the Str "\u{a0}" to insert ~ in the .tex source.

Some references in this topic (on

Using non-breaking space Another typography

These issues also led to the creation of the vlna TeX preprocessor (specifically for Czech), the lua-vlna package on CTAN, a ConTeXt alternative, and others ...

So the use case would be general writing with a level of typography in mind that requires conformity with this rule. In Czech this is widely known but sometimes neglected (due to docx authoring, which tries to manage it automatically, but not really ...).

Sure, posting it in my own repository is fine too, but I dare say that having a filter accepted here is a kind of quality assurance, which I would like to achieve (and I will follow any requirements or recommendations).

Final note: It seems the code formatting broke a little; I am using Notepad, which automatically inserts tabs instead of spaces. If necessary, I will try to fix that.

Delanii avatar Oct 05 '20 09:10 Delanii

Thanks for the resources, this helped. I agree that the filter is an excellent fit for this repo, and I'll be glad to merge it. Would you like to open a PR?

There are some remaining questions and possible modifications. I apologize beforehand for being a rather critical reviewer. The strictness is mostly motivated by the fact that I must be able to maintain any filter in case the original author becomes unavailable and we have to include fixes or updates for newer pandoc versions. We also try to use a consistent style for the filters.

  • From what I gather, there are some single letter words which could be placed at the end of a line, e.g. í or š. Is that correct? Some answers in the linked Q/A appear to place nbsp even after those letters, while most don't. I assume you excluded those letters from prefixes on purpose?
  • The use of .c to access an element's contents is not officially supported and might break in future versions. Better to use .text when accessing Str contents.
  • A common Lua idiom to check whether a string is in a set of strings is to define the set as a table with strings as keys and booleans as values: local prefixes = {['a'] = true, ['z'] = true}; this allows us to check set membership by running prefixes[word].
  • The style guide linked above recommends snake_case instead of camelCase for most names. We are not super strict about it, but it would be nice to become more consistent across the codebase.
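The set idiom from the third bullet can be sketched as follows. The `to_set` and `is_prefix` names and the shortened word list are illustrative assumptions; the point is only that membership becomes a single table lookup instead of a loop.

```lua
-- Hypothetical helper: turn a list of words into a set (word -> true).
local function to_set(list)
  local set = {}
  for _, word in ipairs(list) do
    set[word] = true
  end
  return set
end

local prefixes = to_set { 'a', 'i', 'k', 'o', 's', 'u', 'v', 'z' }

-- A missing key yields nil, which is falsy, so no explicit loop is needed.
local function is_prefix(word)
  return prefixes[word] == true
end

print(is_prefix('v'))     -- true
print(is_prefix('text'))  -- false
```

Keeping the user-facing list as a plain sequence and converting it once keeps the configuration easy to edit while the lookup stays cheap.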


tarleb avatar Oct 07 '20 16:10 tarleb

I will definitely open a pull request then. I have to say, it will be my first time doing that, so please bear with me ... :) I will prepare a suitable README, test and makefile. I understand your requirements and also value them, because as a beginner I find it easier to follow guidelines or rules.

About the first bullet: I excluded them simply because there are no such one-letter words in the Czech language. The filter could be written to prohibit any one-letter word at the end of a line. But I know of people who want to go beyond this rule and even prevent two-letter prefixes from being "orphaned" at the end of a line. For people like that, I would like to offer an easy way to tweak the filter's behavior.

Second: Oh, OK, I must have seen that somewhere. I will fix that.

Third: So after changing the prefixes table as you suggest, in the for loop of function findOneLetterPrefix (to be renamed) I should replace

for index, prefix in ipairs(prefixes) do

with

for word in prefixes[word] do

Did I get that right? As a Lua newbie, I have never seen that.

Fourth: OK, I must have missed that. I will change it, although I very much prefer camelCase over snake_case; it kind of put me off playing with Rust, whose compiler is very strict even about function naming.

I will get the modifications done in a few days' time; I currently have the usual autumn cold, so I will get to it when I am at full strength again.

Delanii avatar Oct 07 '20 18:10 Delanii

if prefixes[word] then
  -- do what you need to do when word is a prefix
end

-- Better --help|less than helpless


bpj avatar Oct 07 '20 19:10 bpj

I have found out that the filter does not work for the html and latex formats: in those cases it doesn't insert anything (I was hoping the Unicode sequence would be converted to &nbsp; or ~).

I will try to fix that.

EDIT: I am still struggling with the suggestion about membership checking. Even with @bpj's clarification I am unable to make it work. I have settled on the following nonbreakablespace.lua filter:


-- Indexed table of one-letter prefixes, after which '\160' should be inserted.
-- Verbose, but can be changed per user requirements.

local prefixes = {

-- Some languages (Czech among them) require a non-breakable space *before* a long dash.

local dashes = {

-- Table of replacement elements

local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

-- Function responsible for searching for one-letter prefixes, after which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the prefix
--   wasn't found).

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if my_string == prefix then
      return true
    end
  end
  return false
end

-- Function responsible for searching for dashes, before which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `dash` in the `dashes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the dash
--   wasn't found).

function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
    end
  end
  return false
end

-- Function to determine the Space element replacement for a non-breakable
-- space according to the output format

function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex', nonbreakablespaces.context)
  else
    -- fallback to inserting the non-breakable space Unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

-- Core filter function:
--
-- * It iterates over all inline elements in a block
-- * If it finds a Space element, it uses the previously defined functions to
--   find `prefixes` or `dashes`
-- * It replaces the Space element with the format-specific non-breakable space
-- * It returns the modified list of inlines

function Inlines (inlines)

  -- variable holding the replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    -- A neighbour may be missing at either end of the list, so guard against nil
    local prev, nxt = inlines[i - 1], inlines[i + 1]

    if inlines[i].t == 'Space' then

      -- Check for one-letter prefixes in Str before Space
      if prev and prev.t == 'Str' and find_one_letter_prefix(prev.text) then
        -- inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
        inlines[i] = insert
      end

      -- Check for dashes in Str after Space
      if nxt and nxt.t == 'Str' and find_dashes(nxt.text) then
        inlines[i] = insert
      end

      -- Check for not fully parsed Str elements - those might be products of
      -- other filters that were executed before this one
      if nxt and nxt.t == 'Str'
          and string.match(nxt.text, '%.*%s*[„]?%d+[“]?%s*%.*') then
        inlines[i] = insert
      end
    end

    -- Check for a Str containing the sequence " prefix ", which might occur when
    -- a preceding filter creates it in one Str element. Also check whether a
    -- quotation mark introduced by the "quotation.lua" filter is present.
    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        -- Anchored captures keep the text on both sides of the match
        local front, detection, _, back = string.match(
          inlines[i].text, '^(.-)(%s+[„]?' .. prefix .. '[“]?)(%s+)(.*)$')
        if front then
          -- `insert` is an element, not a string, so splice in the character
          inlines[i].text = front .. detection .. '\u{a0}' .. back
        end
      end
    end
  end

  return inlines
end

If I try the following changes:

local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[prefix] then
      return true
    end
  end
  return false
end

making the whole code:

-- Indexed table of one-letter prefixes, after which '\160' should be inserted.
-- Verbose, but can be changed per user requirements.

local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

-- Some languages (Czech among them) require a non-breakable space *before* a long dash.

local dashes = {

-- Table of replacement elements

local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

-- Function responsible for searching for one-letter prefixes, after which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the prefix
--   wasn't found).

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[my_string] then
      return true
    end
  end
  return false
end

-- Function responsible for searching for dashes, before which a
-- non-breakable space is inserted. The function short-circuits, that is:
--
-- * If it finds a match with `dash` in the `dashes` table, it returns `true`.
-- * Otherwise, after the iteration is finished, it returns `false` (the dash
--   wasn't found).

function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
    end
  end
  return false
end

-- Function to determine the Space element replacement for a non-breakable
-- space according to the output format

function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex', nonbreakablespaces.context)
  else
    -- fallback to inserting the non-breakable space Unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

-- Core filter function:
--
-- * It iterates over all inline elements in a block
-- * If it finds a Space element, it uses the previously defined functions to
--   find `prefixes` or `dashes`
-- * It replaces the Space element with the format-specific non-breakable space
-- * It returns the modified list of inlines

function Inlines (inlines)

  -- variable holding the replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    -- A neighbour may be missing at either end of the list, so guard against nil
    local prev, nxt = inlines[i - 1], inlines[i + 1]

    if inlines[i].t == 'Space' then

      -- Check for one-letter prefixes in Str before Space
      if prev and prev.t == 'Str' and find_one_letter_prefix(prev.text) then
        -- inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
        inlines[i] = insert
      end

      -- Check for dashes in Str after Space
      if nxt and nxt.t == 'Str' and find_dashes(nxt.text) then
        inlines[i] = insert
      end

      -- Check for not fully parsed Str elements - those might be products of
      -- other filters that were executed before this one
      if nxt and nxt.t == 'Str'
          and string.match(nxt.text, '%.*%s*[„]?%d+[“]?%s*%.*') then
        inlines[i] = insert
      end
    end

    -- Check for a Str containing the sequence " prefix ", which might occur when
    -- a preceding filter creates it in one Str element. Also check whether a
    -- quotation mark introduced by the "quotation.lua" filter is present.
    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        -- Anchored captures keep the text on both sides of the match
        local front, detection, _, back = string.match(
          inlines[i].text, '^(.-)(%s+[„]?' .. prefix .. '[“]?)(%s+)(.*)$')
        if front then
          -- `insert` is an element, not a string, so splice in the character
          inlines[i].text = front .. detection .. '\u{a0}' .. back
        end
      end
    end
  end

  return inlines
end

It doesn't work: no Space replacement is done for the prefixes. I have tested every variation I could think of, almost blindly, because I just don't see how this concept (idiom) works.

Could you help me meet this requirement with a simple example? I have tried to find something on SO or in "Programming in Lua," but I wasn't successful.
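For reference, a minimal sketch of what seems to go wrong, assuming a keyed `prefixes` table like the one above: `ipairs` only visits the integer keys 1, 2, 3, ..., so on a table that has only string keys the loop body never runs and the function always falls through to `return false`. The `count_ipairs` and `count_pairs` helpers are hypothetical and exist only to show the difference.

```lua
local prefixes = { ['a'] = true, ['v'] = true, ['z'] = true }

-- Count how many entries each iterator visits.
local function count_ipairs(t)
  local n = 0
  for _ in ipairs(t) do n = n + 1 end
  return n
end

local function count_pairs(t)
  local n = 0
  for _ in pairs(t) do n = n + 1 end
  return n
end

-- With a set, the lookup replaces the loop entirely.
local function find_one_letter_prefix(my_string)
  return prefixes[my_string] == true
end

print(count_ipairs(prefixes))       -- 0
print(count_pairs(prefixes))        -- 3
print(find_one_letter_prefix('v'))  -- true
```

Where the prefixes themselves are needed, as in the pattern loop, `for prefix in pairs(prefixes) do` iterates over the keys (in no particular order).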

On the other hand, I already have all the required files prepared: filter file, test file, correct test result and makefile. I created the makefile based on the pagebreak makefile. So except for this unfulfilled requirement, I can start the PR anytime.

Delanii avatar Oct 09 '20 14:10 Delanii