
Insertion method for em spaces

Open ag-eitilt opened this issue 2 months ago • 16 comments

I'd be using five if you provided an entry method for either U+2002 or U+2003; as it is I just fall back on double-spacing the ends of sentences

Feel free to add a separate feature request for that. I can certainly add it, but I don't think there are any sensible keyboard shortcuts left. However, optionally auto-replacing double space with em space could be an option. Possibly a setting "Replace double space with" and the options "Ignore", "En Space" and "Em Space"?
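As a rough illustration of the proposed setting, here is a minimal Python sketch. The names `DoubleSpaceMode` and `replace_double_space` are hypothetical and are not part of novelWriter; the sketch also works on a whole string rather than hooking into as-you-type replacement, which is an assumption made to keep it short.

```python
import re
from enum import Enum


class DoubleSpaceMode(Enum):
    """Hypothetical values for a "Replace double space with" setting."""
    IGNORE = None
    EN_SPACE = "\u2002"
    EM_SPACE = "\u2003"


def replace_double_space(text: str, mode: DoubleSpaceMode) -> str:
    """Replace each run of exactly two spaces with the configured wide space."""
    if mode is DoubleSpaceMode.IGNORE:
        return text
    # Lookarounds keep runs of three or more spaces untouched.
    return re.sub(r"(?<! ) {2}(?! )", mode.value, text)


if __name__ == "__main__":
    sample = "First sentence.  Second sentence."
    print(repr(replace_double_space(sample, DoubleSpaceMode.EM_SPACE)))
```

Leaving longer runs of spaces alone is a design choice of this sketch, not something decided in the discussion.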

Honestly wasn't even thinking about that when I mentioned it in #2534. The only time I use the Unicode character is if I'm writing HTML or (less consistently) Markdown in Vim, so two space characters is very common for me to look at.

I definitely agree that you've got a pretty full shortcut space for the spaces, so I think responding to <Space><Space> in the same manner as the dashes would probably be best; it's certainly easiest on the memory. It's not a method you can realistically use in Markdown unless you are very strict with your use of tab characters for indentation, but since novelWriter doesn't use Markdown nor source-file indents, it is easily viable here.

As for which character to replace them with, I'm not entirely sure we need to present multiple options rather than just a checkbox, but I also have a feeling the few people who come to novelWriter already using wide Unicode spaces have very settled ideas of which character they prefer, so if it's not too much additional trouble in the code, the dropdown selection is probably the best way to make everyone happy.

ag-eitilt avatar Oct 02 '25 04:10 ag-eitilt

@ag-eitilt : I don't want to say anything against advanced input options, but isn't this topic more about typesetting than manuscript writing? I take this issue seriously, but I prefer to use a postprocessor or appropriate macros in LibreOffice/OpenOffice (1). This helps me to maintain my writing flow. What is your opinion? Are your rules for using different spaces suitable for automation?

(1) For my own use, I have written OpenOffice extensions as well as a converter for the DTP software QuarkXpress. This also performs other typographical tasks, such as reducing the size of consecutive capital letters in acronyms by half a point, or using medieval numerals.

peter88213 avatar Oct 02 '25 17:10 peter88213

You may have a point here that this is a typesetting matter, and not something to consider while writing. My intention was mainly to support wider spaces between sentences, but avoid any automatic detection of these and leave it to the writer to determine when they should be used.

When writing in a monospace font it is still quite common, at least in the US it seems, to double-space. I find the practice highly distracting when reading myself, but that's beside the point. It is also common enough to generate the manuscript in a monospace font. However, I'm not sure either of these cases is suitable for em spaces instead of simply having two regular spaces.

I tried to do a search to see if there are any typesetting use cases for en and em space in modern text at all. I couldn't really find anything. The thin space has a use case in French, and figure space has a clear use case in numbers, but both these are already supported.

vkbo avatar Oct 02 '25 17:10 vkbo

Just for the record: There is an extension for OpenOffice that is perfect for such use cases: Pepito Cleaner (an undocumented fix for LibreOffice is here). It uses a set of regular expressions for the typographical post-processing of text documents. You can edit these regular expressions and add your own. The original appears to be designed for Italian and American English. I have developed a separate configuration for German and provided an additional extension that installs this "language package".

peter88213 avatar Oct 03 '25 11:10 peter88213

I don't want to say anything against advanced input options, but isn't this topic more about typesetting than manuscript writing?

To one degree, probably, but as Veronica said there's a long history of hitting the space bar twice after ending a sentence, and with Unicode being a thing it feels more idiomatic to represent that as a single character rather than attaching special meaning to doubled occurrences of a separate character. Even so, I still don't see this as a particularly important request, and it's mainly just by observation that there are already a lot of Unicode spaces being provided by novelWriter shortcuts, so this would round out the set. The alternative of just using the two spaces unmodified is perfectly reasonable.

By another degree, though, it can be thought of as a manuscript-level concern -- so far as I can figure, the only reasons people consider the space between words and the space between sentences the same are because we now input them with the same key, and because they're both blank gaps that happen to often be rendered the same width. As a highly-contrived example: "Go see that Dr. Smith is here." It can tenuously be read with "that" being demonstrative and "Smith" being a third person, or with "that" meaning "whether" and "Dr. Smith" being a single name. (The better linguistic solution would probably have been to have separate abbreviation and sentence periods, but that's not the level of analysis that Unicode is built on.)


I prefer to use a postprocessor or appropriate macros in LibreOffice/OpenOffice (1). This helps me to maintain my writing flow.

Something which I forgot to copy over from my original post:

There's definitely an argument that that's overkill during the manuscript, but it does allow me to write scripts which specifically target a specific division -- for example "split the document into individual sentences" -- which I can then use to analyze the style of the piece while editing.

It's impossible to be fully accurate in programmatically distinguishing sentences when everything is just using the same single space, without either a lot of heuristics (might be possible in LibreOffice, very heavy-weight in Vim) or bringing in an LLM (which has its own problems with "fully accurate") -- throwing enough regexes at it like that extension does might get close, but there's always going to be ambiguity. That's not an argument for Unicode in particular, just for some spacing distinction being helpful.

Most of the other spaces I use do wind up being automatable, but I've also got them set up on two-key macros so they're not actually disruptive to insert (em space is on <Mod>+<Enter>), and novelWriter's <Ctrl>+K sequences are a similar low overhead; it winds up feeling like less effort to put a thin space in manually than to check all the exceptions in the script (what if I transcribe a character texting "???").


I tried to do a search to see if there are any typesetting use cases for en and em space in modern text at all. I couldn't really find anything.

This is definitely the biggest problem facing the feature request. There really isn't any modern typesetting which uses broader sentence spacing to begin with, and even if there were I wouldn't be surprised if it were marked up using XML or some other explicit tags rather than relying on Unicode. The en/em spaces do stem from typesetting concerns, and that shows in how they aren't expanded for justified lines. I only started using the Unicode characters because I was writing blog posts (the site's since died) and highly cross-referenced story notes/hyperfiction in HTML, which "helpfully" collapses doubled spaces into a single width.[^html]

[^html]: To be fair, there is a legitimate technical reason for HTML doing so. I just get a bit salty with the number of guides that try to sell sentence-space "fixing" as a user-experience feature, when it's just an accident of other parts of the system.

So long as the editor continues to render two spaces as broader than a single one, issues with monospaced fonts aside, I'm perfectly happy to stick with that.

ag-eitilt avatar Oct 03 '25 17:10 ag-eitilt

Okay, understood. So the use case is to insert sentence delimiters for text analysis. Then it's clear to me that double spaces are a potential problem because, as you mentioned, they disappear in HTML, which also applies to (regular) Markdown, which was originally designed for HTML generation.

So, at the very least, you would need to be certain that novelWriter does not automatically “clean up” double spaces, either when loading and saving its own format or when exporting to ODT format. Incidentally, it should be noted that, as an XML-based file format, ODT doesn't support multiple spaces either. So if OpenOffice needs to store consecutive spaces, it uses special "spacer" tags internally: <text:s/>. Off the top of my head, I can't say whether the novelWriter ODT exporter uses them.

This naturally raises the question of where the interface of your self-created text analysis lies for you.

peter88213 avatar Oct 03 '25 18:10 peter88213

So, at the very least, you would need to be certain that novelWriter does not automatically “clean up” double spaces, either when loading and saving its own format or when exporting to ODT format. Incidentally, it should be noted that, as an XML-based file format, ODT doesn't support multiple spaces either. So if OpenOffice needs to store consecutive spaces (which is, by the way, quite difficult to enter), it uses special "spacer" tags internally: <text:s/>. Off the top of my head, I can't say whether the novelWriter ODT exporter uses them.

I can clarify that. No spaces are stripped inside of text paragraphs. In some contexts leading or trailing spaces are removed though. Like in the context of markup. novelWriter also generates <text:s/> tags for the second consecutive space in ODT files, and with a count value for more than two repeated spaces. Same as LibreOffice does. So multiple spaces are supported.

https://github.com/vkbo/novelWriter/blob/98cdd4eedf4e922dc233b50ae39b54ada42de13d/novelwriter/formats/toodt.py#L1505-L1520
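The linked source is the authoritative implementation. Purely as an illustration of the encoding described above, a minimal Python sketch might look like the following. The function name `encode_spaces` is hypothetical, the output is a plain string fragment rather than a real XML tree, and handling of leading spaces in a paragraph is left out.

```python
import re
from xml.sax.saxutils import escape


def encode_spaces(text: str) -> str:
    """Encode runs of spaces the way the comment above describes.

    The first space of a run stays literal; the rest become a single
    <text:s/> element, with a text:c count when more than one is needed.
    """
    def _spacer(match: re.Match) -> str:
        extra = len(match.group(0)) - 1
        if extra == 1:
            return " <text:s/>"
        return f' <text:s text:c="{extra}"/>'

    # Escape XML special characters first so the inserted tags stay intact.
    return re.sub(r" {2,}", _spacer, escape(text))


if __name__ == "__main__":
    print(encode_spaces("One.  Two."))
    print(encode_spaces("One.    Two."))
```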

vkbo avatar Oct 03 '25 19:10 vkbo

It's impossible to be fully accurate in programmatically distinguishing sentences when everything is just using the same single space, without either a lot of heuristics (might be possible in LibreOffice, very heavy-weight in Vim) or bringing in an LLM (which has its own problems with "fully accurate") -- throwing enough regexes at it like that extension does might get close, but there's always going to be ambiguity. That's not an argument for Unicode in particular, just for some spacing distinction being helpful.

This is precisely why I don't want to try to auto-detect these things. Naively detecting .!? is clearly not enough, as . can also be used in the context of an abbreviation, and : can sometimes be followed by a full sentence. Then there are all the punctuation rules related to quote marks. Letting the writer indicate the sentence separator offloads all that.
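To make the difference concrete, here is a small Python sketch comparing a naive punctuation-based split with a split on a writer-supplied separator. Both function names are hypothetical, and the em space (U+2003) is assumed as the separator only for the sake of the example.

```python
import re

EM_SPACE = "\u2003"  # assumed writer-inserted sentence separator


def split_naive(text: str) -> list[str]:
    """Split after any of .!? followed by whitespace (abbreviations break this)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_explicit(text: str) -> list[str]:
    """Split only where the writer placed an explicit em space."""
    return [s for s in text.split(EM_SPACE) if s]


if __name__ == "__main__":
    sample = "Go see that Dr. Smith is here." + EM_SPACE + "Then leave."
    print(split_naive(sample))     # breaks after "Dr."
    print(split_explicit(sample))  # keeps "Dr. Smith" together
```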

That said, there doesn't seem to be a standard here that can be used as an argument for having this feature other than the typewriter double-space. I'm not sure if there are any manuscript guidelines that demand it. But that would be a definite use case within the scope of novelWriter.

Allowing double-spaces to be replaced by a Unicode space is consistent with how dashes, ellipsis, and quotes are handled, so I am fine in principle with adding it as a non-default option. However, auto-replace comes with a small computational cost even when disabled, so if there is no standard to connect it to, and it's just a matter of personal preference for a small group of users, I'm a little reluctant to add it.

vkbo avatar Oct 03 '25 19:10 vkbo

I can clarify that. [...]

Then everything is clear. Perhaps this could be mentioned in the specification?

In any case, it is possible to identify the sentences with the help of LibreOffice macros. The built-in macro language StarBasic has convenient functions for text cursor control, including jumping to the beginning and end of a sentence. Ambiguities, such as in the example "Go see that Dr. Smith is here.", could be handled via the additional criterion of double spaces.

peter88213 avatar Oct 03 '25 19:10 peter88213

Then everything is clear. Perhaps this could be mentioned in the specification?

Which specification are you thinking of? There is one doc that covers the project XML file format, but that's all. I considered preserving multiple spaces in the manuscript text as default and expected behaviour, so I didn't consider it to be something that needed mentioning. You and I are only aware of this issue because we have the (mis)fortune of knowing the Open Document XML spec. 😅

vkbo avatar Oct 03 '25 19:10 vkbo

Allowing double-spaces to be replaced by a Unicode space is consistent with how dashes, ellipsis, and quotes are handled, so I am fine in principle with adding it as a non-default option.

That immediately came to mind as well. However, my many years of experience as a proofreader tell me that double spaces are among the most common typos. That is why they are blocked by default in some word processors such as LibreOffice Writer.

peter88213 avatar Oct 03 '25 19:10 peter88213

Which specification are you thinking of?

You have such a nice and comprehensive manual, don't you?

peter88213 avatar Oct 03 '25 20:10 peter88213

That immediately came to mind as well. However, my many years of experience as a proofreader tell me that double spaces are among the most common typos. That is why they are blocked by default in some word processors such as LibreOffice Writer.

I opted to add an (optional) error underline on repeated space instead, because personally I want to avoid them. It may be a good idea to add an option to filter them out in manuscript builds actually.

You have such a nice and comprehensive manual, don't you?

Ah, sure. But that's a level of detail I think is too much for a user guide. It is more appropriate for a format spec, which I only have one of as mentioned.

vkbo avatar Oct 03 '25 20:10 vkbo

But that's a level of detail I think is too much for a user guide.

Why not just tell the average users that they can string together as many spaces as they want without losing anything? Then they can happily indent paragraphs, format poems, or construct tables ;-)

peter88213 avatar Oct 03 '25 20:10 peter88213

This naturally raises the question of where the interface of your self-created text analysis lies for you.

Unix tooling, scripts, and a couple tools being fed via them. As a very simple example, the aforementioned sentence splitting is a sed replacement to put all of them on their own lines, and I can use that rendering to be sure I'm not getting too monotonous in my structure/length. Or I feed documents through pandoc to get plain text with no formatting to copy-paste to ProWritingAid for their heuristic reports. The only times I use LibreOffice are when I'm preparing something for someone who doesn't accept a more textual format, or when I want to do something in Calc.
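The sed command itself isn't shown in the thread. As a rough Python equivalent of the idea, the sketch below puts each sentence on its own line and reports word counts per sentence. The function names are hypothetical, and treating both an em space and a doubled regular space as the sentence gap is an assumption.

```python
import re

# Assumed sentence gap: an em space (U+2003) or two regular spaces.
SENTENCE_GAP = re.compile("\u2003| {2}")


def sentences_per_line(text: str) -> str:
    """Replace every sentence gap with a newline."""
    return SENTENCE_GAP.sub("\n", text)


def word_counts(text: str) -> list[int]:
    """Return the number of words in each sentence, in order."""
    lines = sentences_per_line(text).splitlines()
    return [len(line.split()) for line in lines if line.strip()]


if __name__ == "__main__":
    sample = "It rained.  Nobody minded much.\u2003Dr. Smith stayed in."
    print(sentences_per_line(sample))
    print(word_counts(sample))
```

A list of counts like this is enough to spot long runs of similar sentence lengths while editing.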

Allowing double-spaces to be replaced by a Unicode space is consistent with how dashes, ellipsis, and quotes are handled, so I am fine in principle with adding it as a non-default option. However, auto-replace comes with a small computational cost even when disabled, so if there is no standard to connect it to, and it's just a matter of personal preference for a small group of users, I'm a little reluctant to add it.

Makes sense, and since I have little need for it (it would just cut out one case in any post-processing script I run these through, when I'd already need one to normalize metadata and a few other things before handing it over to pandoc), I'm not going to be the one to push you beyond that reluctance. So far as I'm concerned, this ticket can be downgraded to discussion/decision tracking, in case another double-space-using Unicode snob like me comes along in the future.

However, my many years of experience as a proofreader tell me that double spaces are among the most common typos.

This is why people should be using explicit em spaces more, entered via a separate key combo requiring deliberate opt-in. 😉 All <Space><Space> pairs can get flagged as typos (or just collapsed automatically), all explicit wide spacing isn't going to get caught by that search, and the (far fewer) remaining questionable [.!?]['")]?[ ] regex patterns can be inspected to be sure they're accurately not ending a sentence. Though, you're right, it is another strike against the proposed auto-replace entry method.
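A minimal Python sketch of that audit, using the regex quoted above verbatim, could look like this. The function names are hypothetical. Note that a doubled space after punctuation would show up in both lists, so in practice the double spaces would be dealt with first.

```python
import re

# Pattern quoted from the comment: sentence punctuation, an optional
# closing quote or bracket, then one regular space (not an em space).
SUSPECT = re.compile(r"""[.!?]['")]?[ ]""")
DOUBLE_SPACE = re.compile(r" {2,}")


def find_double_spaces(text: str) -> list[int]:
    """Return the offsets of runs of two or more regular spaces (likely typos)."""
    return [m.start() for m in DOUBLE_SPACE.finditer(text)]


def find_suspects(text: str) -> list[str]:
    """Return punctuation-plus-space matches that need a manual look."""
    return [m.group(0) for m in SUSPECT.finditer(text)]


if __name__ == "__main__":
    sample = 'Go see that Dr. Smith is here.\u2003He said "Wait!" twice.'
    print(find_double_spaces(sample))
    print(find_suspects(sample))
```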

Why not just tell the average users that they can string together as many spaces as they want without losing anything? Then they can happily indent paragraphs, format poems, or construct tables ;-)

@vkbo will obviously have the final say, but most of the time those are things that shouldn't be manuscript-level concerns,[^spacing] and mentioning that they're preserved will just lead users down the wrong path (cf. https://github.com/vkbo/novelWriter/issues/2143#issuecomment-2563627878). At a guess it would probably lead to more frustration when people try to use spaces to do things at this most basic level that should be handled at a more complex one.

[^spacing]: From my perspective, this is ultimately novelWriter, and by the time you're getting into things like whitespace-sensitive poetry or table layouts, you're better off building them in some other software for insertion after export. Never tried the poems, but Markdown is just barely capable of representing tables, and even then only with rather fiddly extensions to the syntax. I almost always fall back to raw HTML whenever they come up because even if I get the first draft written up, there's a very good chance I'll wind up wanting to span columns or rows, or use a column as header cells rather than the top row, or any number of other common things which break Markdown tables -- and that's before getting into any formatting concerns. Plain text works well for tables, HTML works well, Markdown doesn't, and novelWriter occupies an even more restricted location in that stack.

ag-eitilt avatar Oct 05 '25 05:10 ag-eitilt

So far as I'm concerned, this ticket can be downgraded to discussion/decision tracking, in case another double-space-using Unicode snob like me comes along in the future.

That's a good idea. This topic would actually fit well in the discussion forum. That's probably why it's suggested to discuss topics first before making feature requests. If I'm not mistaken, a separate analysis program for novelWriter is also in the pipeline. And here we already have a few ideas for it floating around.

This is why people should be using explicit em spaces more, entered via a separate key combo requiring deliberate opt-in.

If I understand correctly, however, this would require adapting one's typing method to the analysis tool. I mean, ending every sentence with a special keyboard shortcut takes some getting used to, doesn't it? One could consider what the alternatives are.

  • If there are only a few exceptions, such as your “Dr. Smith” example, it might be more economical to use a substitute character for the abbreviation point, which can be replaced later. For example, I have gotten into the habit of using the # character for apostrophes, which is located on the same key on my keyboard as the minute sign. This way, I don't get confused with single quotation marks, and I can then have a macro fix it.
  • One could simply accept the inaccuracies caused by special cases.
  • Or explicitly identify the most common special cases.

Unix tooling, scripts, and a couple tools being fed via them.

I see. What does your workflow look like in practice? Do you use the Markdown export or the novelWriter markup export? Is a round trip possible with this? After all, we are talking about the editing stage. Or do you access the text files directly (*gulp*) from the back end?

peter88213 avatar Oct 05 '25 14:10 peter88213

We could keep it as simple as me just adding the en and em space entries to the Insert menu and leaving it at that. I can't really see any viable keyboard shortcuts, so they wouldn't be very easy to use. Certainly not for every sentence, but at least they're there when you need them for other reasons.

I'm considering adding a way to make it easier to use special symbols and characters anyway, like a custom toolbar or something like that. I've unfortunately opted to name my main character in my current novel project Zoë, which is annoying to type, even with the Linux compose key!

vkbo avatar Oct 05 '25 15:10 vkbo