Add quality check and cleanup for problematic unicode characters
Is your suggestion for improvement related to a problem? Please describe.
Some unicode characters cause problems, even with biblatex support (e.g., pdflatex still does not completely support unicode). For example, Garcı́a gives
Package inputenc Error: Unicode character ́ (U+0301)
A few of such problematic characters are:
- U+0300: https://tex.stackexchange.com/questions/555086/package-inputenc-error-unicode-char-%cc%80-u300-inputenc-not-set-up-for-use-wit
- U+0301: see https://tex.stackexchange.com/questions/329239/how-to-solve-a-unicode-char-u301-error and https://tex.stackexchange.com/questions/443018/package-inputenc-error-unicode-char-%cc%81-u301inputenc?noredirect=1&lq=1
- U+2212: https://tex.stackexchange.com/questions/361019/unicode-error-in-inputenc-package
Describe the solution you'd like
As these characters are hard to recognize, it would be nice to have an integrity check that warns about them, and an automatic cleanup that converts them to their unproblematic equivalents (e.g. U+0131 + U+0301 to U+00ED).
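To make the detection idea concrete, here is a minimal Java sketch that scans a field value for the three code points listed above and reports where they occur. The class name `ProblematicCodePointFinder`, the method `find`, and the hard-coded set are assumptions for illustration only, not existing JabRef code; a real check would likely need a longer list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: locate code points that pdfTeX's inputenc is reported to reject. */
final class ProblematicCodePointFinder {

    // Illustrative only: the three code points named in this issue.
    private static final Set<Integer> PROBLEMATIC = Set.of(0x0300, 0x0301, 0x2212);

    /** Returns one "U+XXXX at index N" entry per problematic code point found. */
    static List<String> find(String value) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < value.length(); ) {
            int codePoint = value.codePointAt(i);
            if (PROBLEMATIC.contains(codePoint)) {
                hits.add(String.format("U+%04X at index %d", codePoint, i));
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return hits;
    }

    public static void main(String[] args) {
        // "Garc", dotless i (U+0131), combining acute (U+0301), "a"
        System.out.println(find("Garc\u0131\u0301a")); // [U+0301 at index 5]
    }
}
```

Reporting the index makes it easier to see which part of a field contains the invisible or hard-to-spot character.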
Additional context
Might be helpful: https://github.com/zepinglee/citeproc-lua/blob/ab3ce712cc92073f12be26ff0b22b30eb906092d/citeproc/citeproc-latex-data.lua#L517
Have you tried converting them to latex? We have latex2unicode and vice versa conversion already
It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.
The feature will be listed in the Check integrity dialog of JabRef.
The implementation will be similar to org.jabref.logic.integrity.AmpersandChecker.
Hi koppor, thank you for suggesting this issue to me! I hope to take it. I tried to reproduce this problem:
- Create an example test.bib with a problematic unicode character:
@Article{test,
author = {Garcı́a},
title = {Test Article},
}
- Import test.bib into the library in JabRef. There's no error in this step.
- Create an example document.tex:
\documentclass[12pt]{article}
\begin{document}
\begin{enumerate}
\item Sample Citation: \cite{test}
\end{enumerate}
\bibliographystyle{apalike}
\bibliography{test.bib}
\end{document}
- Build document.tex:
$ pdflatex document.tex
$ bibtex document
$ pdflatex document.tex
$ pdflatex document.tex
There's the error
! LaTeX Error: Unicode character ́ (U+0301)
not set up for use with LaTeX.
I wonder if the goal is to automatically convert the problematic unicode characters when importing or adding bib files in JabRef?
Perfectly reproduced! 👍
Did you see my comment https://github.com/JabRef/jabref/issues/10506#issuecomment-1783939962?
- Click
- Issue appears
- [ ] Side TODO: Please let JabRef focus the tab where the issue occurs
I think what @tobiasdiez would like to have is a warning at a field if the field fails an integrity check:
Note that the non-ascii check should only be on in bibtex mode, not in biblatex mode.
Note that the integrity checks should be turned on/off per library (maybe too much for this PR).
If one wants to get it compiling:
Try biber instead of bibtex. Or try bibtex8. The normal bibtex tool doesn't handle utf8 properly.
You can also try to use biber. See https://tex.stackexchange.com/a/34136/9075 for a hint.
> integrity check warning about them ... and an automatic cleanup ... It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.
@tobiasdiez
- JabRef has a check for non-ASCII characters. See my screenshot at https://github.com/JabRef/jabref/issues/10506#issuecomment-1820790480. I think this fulfills your "integrity check warning" wish. Could you retry with your JabRef?
- We have the unicode-to-latex conversion. We also do have automatic save. Please try to activate the converter "on save"
On save, JabRef pops up "file was modified externally". Then, you even have a character diff.
Does that work for you?
@tobiasdiez @Siedlerchr I am not sure how to guide the student. I recommended that he put the checkers into the entry editor on type, because it did not work there. Is this OK, or should we find yet another issue?
The goal is not to automatically convert the symbols: while unicode engines like LuaTeX and XeTeX can read the unicode characters, older engines like pdfTeX have problems with them. We can bridge the gap by detecting these characters in JabRef and hope pdfTeX will eventually catch up or, what is more likely, users will stop using pdfTeX.
Given the choice, I would assume most people would prefer not having to convert an à in their text to \`{a} just because their font engine can't read it. They probably would prefer an engine that simply works without having to do magic conversions. By the way, it was hard to cite this non-precomposed character in markdown xD
I am not sure why we would force users who already use more modern unicode engines to convert their precomposed unicode characters like à back into non-precomposed characters like \`{a} in their entries. Manual conversion is fine, but there is no need for automatic conversion, I think, no?
PdfTeX is still maintained, but there are not a lot of updates to their repo. See here: https://tug.org/applications/pdftex/. Postscript fonts, which are natively supported by pdfTeX, seem outdated and are being dropped by many operating systems and applications, so at some point the reason for pdfTeX's existence will fade away and people will move to other font engines. I think we should make it hard for users to stick to the outdated pdfTeX and incentivise users to switch to unicode-compatible engines.
I propose the path forward for JabRef should be as follows:
- Have a (long) grace period with a warning that these characters are not supported by pdfTeX and offer converting characters to their unproblematic equivalents, but do not do so automatically, instead offer manual conversion in the cleanup dialogue. The warning should include pointing to alternative modern engines like LuaTeX or XeTeX that support unicode.
- In a future version of JabRef (very far in the future), drop support for manual conversion and only offer unicode characters.
Note that the issue goes beyond the usual "bibtex is not compatible with unicode". As @ThiloteE correctly analyzed, the problem is the combination pdftex + biber (in particular the ascii check is not helpful).
The simplest solution would indeed be an automatic conversion of unicode characters to Normal Form C, or at least to combine unicode characters if they have a single-character equivalent. So à can stay the same, but U+0131 + U+0301 is converted to U+00ED (but not to its LaTeX code). Since by definition these unicode representations are equivalent, lualatex/xetex will display the same character; it's just to help pdftex.
Alternatively, implement it as a save-action that is on by default.
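A minimal sketch of what NFC normalization does, using only `java.text.Normalizer` from the JDK. The class `NfcDemo` and the method `toNfc` are made-up names for illustration. One caveat worth checking: NFC only composes canonical equivalents, so `i` (U+0069) + U+0301 becomes U+00ED, whereas the dotless ı (U+0131) + U+0301 has no precomposed form and is left unchanged by a plain NFC pass. Cases like that would presumably need an extra mapping on top.

```java
import java.text.Normalizer;

/** Sketch: compose base letter + combining mark pairs via Unicode Normal Form C. */
final class NfcDemo {

    /** Returns the NFC form of the given value. */
    static String toNfc(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'i' followed by a combining acute accent is composed into U+00ED.
        System.out.println(toNfc("Garci\u0301a").equals("Garc\u00EDa")); // true

        // An already precomposed character stays as it is.
        System.out.println(toNfc("\u00E0").equals("\u00E0")); // true

        // Dotless i + combining acute has no canonical composition: unchanged.
        System.out.println(toNfc("Garc\u0131\u0301a").equals("Garc\u0131\u0301a")); // true
    }
}
```

Because the composed and decomposed forms are canonically equivalent, engines with full unicode support should render both the same way; only the byte sequence that pdfTeX sees changes.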
Ah, I see.
Naively, this can be achieved, by running unicode-to-latex and latex-to-unicode, because our unicode tables use the normal form c. -- However, this is much effort (See https://github.com/JabRef/jabref/pull/6155)
Similar functionality as org.jabref.logic.layout.format.ReplaceUnicodeLigaturesFormatter, but for the character "compression".
@tobiasdiez Do you propose a manual table as our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.
Yes. It also doesn't have to cover all known character compressions. The ones containing some of the problematic characters linked in the issue description should be good enough for now.
Latex2Unicode library also uses NFC, or at least we use it https://github.com/JabRef/jabref/blob/4718930a6f32d94956caa352c49777864ea2b823/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java#L28-L31
https://github.com/JabRef/jabref/blob/4718930a6f32d94956caa352c49777864ea2b823/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java#L43-L46
Two things to do
- New Integrity check: String result = normalize-to-fc(input); raise error if result != input
- New Cleanup/FieldFormatter/ ...: result = normalize-to-fc(input);
Hi @koppor, as you mentioned, we need to do two things. Could you elaborate on which result you are referring to, or just expand on your last comment?
I have successfully reproduced the issue and figured it out with the help of the comments above.
@harsh1898 Do you know Ctrl+Shift+F in IntelliJ? Here, you can search for code.
Integrity Check
The class is org.jabref.logic.integrity.IntegrityCheck. With Alt+F1 and then Enter, you can navigate to the package in the project view. Then, you find other integrity checks. I browsed around and found ValueChecker. I think the implementation is as follows:
- Implement UnicodeNormalFormCCheck in package org.jabref.logic.integrity. It implements interface ValueChecker.
  - See https://github.com/JabRef/jabref/issues/10506#issuecomment-1820998978 for an implementation hint.
- Use Ctrl+Shift+T to generate a skeleton of a test class. You can see other test classes outlining how to implement (e.g., org.jabref.logic.integrity.BibStringCheckerTest).
- Add the UnicodeNormalFormCCheck to org.jabref.logic.integrity.FieldCheckers#getAllMap (in the biblatex mode branch).
- Check if it appears in the UI and test it with the example.
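A rough sketch of how such a check could look. To keep it self-contained, the nested `ValueChecker` interface below is only a stand-in with an assumed shape (one method returning an optional message); the real JabRef interface, its localization of messages, and its registration in `FieldCheckers` may differ, so please compare with the existing checkers before copying anything.

```java
import java.text.Normalizer;
import java.util.Optional;

/** Sketch of the proposed check; all types here are local stand-ins. */
final class UnicodeNormalFormCCheckSketch {

    /** Assumed shape of a value checker: empty result means "no problem found". */
    interface ValueChecker {
        Optional<String> checkValue(String value);
    }

    static final class UnicodeNormalFormCCheck implements ValueChecker {
        @Override
        public Optional<String> checkValue(String value) {
            // isNormalized is equivalent to comparing the value with its NFC form.
            if (value == null || Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
                return Optional.empty();
            }
            // Hypothetical message text; a real check would use a localized string.
            return Optional.of("value is not in Unicode Normal Form C");
        }
    }

    public static void main(String[] args) {
        ValueChecker checker = new UnicodeNormalFormCCheck();
        System.out.println(checker.checkValue("Garc\u00EDa").isPresent());  // false
        System.out.println(checker.checkValue("Garci\u0301a").isPresent()); // true
    }
}
```

A test class along the lines of the existing checker tests could then assert an empty result for composed input and a present message for decomposed input.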
New cleanup action
- Create a new formatter NormalizeUnicodeFormatter in org.jabref.logic.formatter.bibtexfields. Also create test cases.
- Add it to org.jabref.logic.formatter.Formatters#getOthers.
- Check if it appears in the UI and test it with the example.
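The cleanup side could be sketched like this. Again, the nested `Formatter` interface is a hypothetical minimal stand-in so that the example compiles on its own; JabRef's actual formatter base type has more members (key, description, example input and so on), and the display name used here is made up.

```java
import java.text.Normalizer;
import java.util.Objects;

/** Sketch of the proposed cleanup formatter; all types here are local stand-ins. */
final class NormalizeUnicodeFormatterSketch {

    /** Hypothetical minimal formatter shape, not the real JabRef type. */
    interface Formatter {
        String getName();

        String format(String value);
    }

    static final class NormalizeUnicodeFormatter implements Formatter {
        @Override
        public String getName() {
            return "Normalize Unicode"; // made-up display name
        }

        @Override
        public String format(String value) {
            Objects.requireNonNull(value);
            // Compose base letters and combining marks where a single code point exists.
            return Normalizer.normalize(value, Normalizer.Form.NFC);
        }
    }

    public static void main(String[] args) {
        Formatter formatter = new NormalizeUnicodeFormatter();
        String cleaned = formatter.format("Garci\u0301a");
        System.out.println(cleaned.equals("Garc\u00EDa"));              // true
        // Formatting is idempotent: a second pass changes nothing.
        System.out.println(formatter.format(cleaned).equals(cleaned)); // true
    }
}
```

Idempotence matters if the formatter runs as a save action, since it will then be applied to the same field values on every save.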
As general advice for newcomers: check out Contributing for a start. Also, the guidelines for setting up a local workspace are worth having a look at.
Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.
Hi @koppor, as per your suggestion, I have tried to fix this issue with some updates to the code.
You can review pull request #10817 to see my changes.