Special characters breaking project pages
We are facing some difficulties with READMEs that contain certain special characters, specially "<" and ">".
The issue is that our MDX parser interprets that as a special syntax and causes 500 errors on the project page. See the example on the notes.
To fix this issue, we could escape special characters on README, so instead of:
>
Have:
{">"}
The problems with this approach are that:
- README becomes less readable when not being read on datahub-next
- We would have to fix all individual situations in all the README files
A better approach would be to find a way to make our parser ignore errors like this OR automatically escape these characters since we don't really use this syntax anywhere.
Acceptance
- [ ]
Tasks
- [ ]
Notes
Example:
Take a look at:
https://demo.datahub.io/@datasets/world-religion-projections
If we access this dataset on a dev environment, the following error can be found:

Note that what's breaking the page is this part of the README:
The data is sourced from [PEW RESEARCH CENTER ](https://www.pewforum.org/2015/04/02/religious-projection-table/2010/percent/all/). In original dataset the number and the percentage share of followers for some religions as "<10000" and "<10%". Because of technical limitations of the visualization tool, these values had to be changed into "10000".
This issue is still happening on https://datahub.io/core/world-religion-projections - i think we probably want a way to fix this.
Shaping
Problem:
MDX parser interprets special characters "<" and ">" and alikes in README files as special syntax, causing 500 errors on the project page.
Appetite:
~ 1 week to implement a robust solution.
Rabbit-holes:
- N/A
No Gos:
- Manually fixing all individual instances of special characters in all README files.
- Introducing a solution that significantly impacts the readability of README files outside of datahub-next.
- Implementing a fix that is not scalable or maintainable in the long term.
Solution:
Write a regex to escape characters properly
@LuisVCSilva where is the "Solution" part? :)
The "Rabbit holes" should be any potential facets of the solution you're sketching that you're either not sure of or that have a potential of draining lots of resources on resolving them. Those can also be any questions that might have come up to you and you don't have answers to them yet.
No-goes are usually rabbit-holes that you can safely discard - you think they are not worth working on.
@olayway Edits made, thanks!
~ 1 week to implement a robust solution.
It shouldn't take more than 1 day
@LuisVCSilva also IMO the solution should be worked out in much more detail b4 implementation here e.g. where in the code will you likely make these changes? Will you write tests? If so where ...
Is this just < and > that causes issues?
And what about things like => (common in obsidian for "implies") - will you replace just the > or will you replace the whole => with implies unicode character etc?
Do you actually need to fix this or do we just need to report to users what the error relates to so they can fix it (getting into regexes to fix stuff is a potentially complex business - or maybe it is very simple ...)
Have you done any research online e.g. googling about this issue.
PS: does this relate to #1206?
Situation
Facing difficulties with READMEs that contain certain special characters, specially "<" and ">".
Problem
MDX throws an error whenever special characters like < or > appears in the .md file
It could also mean problem with other special characters that might also give us an MDX parsing error.
Solution
- Currently content from
README.mdfiles are parsed through remark plugins in thelib/markdown.tsfile, thus making sure that before passing the content we can replace special characters to an HTML entities such> = >or< = %lt;. - It would also make sense to write unit-tests for all possible different cases, and it would also make sense replace maybe these ones as well:
' = '& = &" = "
- Making sure everything documented somewhere how the parsing of the
.mdfile works could be a good feature to add thoughts? @olayway
Appetite
Implementation should be pretty quick(15-30 mins), also writing different test cases might take some time to test(1-2h).
Rabbit-holes
- Adding more character conversion that might complicate the solution or require different approach
- Unexpected edge cases such as
&which might become&amp;and etc. - Also when adding new features such as this #1233 might not require converting special character like this
!which can break the whole implementation of newly added feature
No-goes
- Hardcoding solution only for specific characters, meaning focusing only
< or >this might be not a good approach
Appendix
cc @olayway @rufuspollock
@gradedSystem Ok, let's just do this. I think we're making it more complicated than it actually is at this point. For now let's just check >, <, =>, <=, <=> maybe.
Hint: we you want to escape these characters before this line: https://github.com/datopian/datahub-next/blob/34c53cdc9d9a1dbd96e6ce08f219c9e064e698b1/lib/markdown.ts#L30
Creating a small remark plugin will not work, as the error is thrown already by gray-matter.
Also, I advise starting by writing tests and coding the solution second.
I just did the shaping and thought to include above tooπ , ok noted will start working on tests before solution
FIXED
Reopening as we haven't thought about some cases and the current solution breaks e.g. callouts and mermaid diagrams.
WONTFIX
Closing as wont-fix as there are too many edge cases and supporting this would require quite a lot of work. For now any standalone > or < that are causing errors need to be escaped with expressions: {">"} .