portaljs icon indicating copy to clipboard operation
portaljs copied to clipboard

Special characters breaking project pages

Open demenech opened this issue 2 years ago β€’ 1 comments

We are facing some difficulties with READMEs that contain certain special characters, specially "<" and ">".

The issue is that our MDX parser interprets that as a special syntax and causes 500 errors on the project page. See the example on the notes.

To fix this issue, we could escape special characters on README, so instead of:

>

Have:

{">"}

The problems with this approach are that:

  • README becomes less readable when not being read on datahub-next
  • We would have to fix all individual situations in all the README files

A better approach would be to find a way to make our parser ignore errors like this OR automatically escape these characters since we don't really use this syntax anywhere.

Acceptance

  • [ ]

Tasks

  • [ ]

Notes

Example:

Take a look at:

https://demo.datahub.io/@datasets/world-religion-projections

If we access this dataset on a dev environment, the following error can be found:

image

Note that what's breaking the page is this part of the README:

The data is sourced from [PEW RESEARCH CENTER ](https://www.pewforum.org/2015/04/02/religious-projection-table/2010/percent/all/). In original dataset the number and the percentage share of followers for some religions as "<10000" and "<10%". Because of technical limitations of the visualization tool, these values had to be changed into "10000".

demenech avatar Apr 20 '23 19:04 demenech

This issue is still happening on https://datahub.io/core/world-religion-projections - i think we probably want a way to fix this.

rufuspollock avatar Apr 08 '24 11:04 rufuspollock

Shaping

Problem:

MDX parser interprets special characters "<" and ">" and alikes in README files as special syntax, causing 500 errors on the project page.

Appetite:

~ 1 week to implement a robust solution.

Rabbit-holes:

  • N/A

No Gos:

  • Manually fixing all individual instances of special characters in all README files.
  • Introducing a solution that significantly impacts the readability of README files outside of datahub-next.
  • Implementing a fix that is not scalable or maintainable in the long term.

Solution:

Write a regex to escape characters properly

LuisVCSilva avatar Jul 03 '24 12:07 LuisVCSilva

@LuisVCSilva where is the "Solution" part? :)

The "Rabbit holes" should be any potential facets of the solution you're sketching that you're either not sure of or that have a potential of draining lots of resources on resolving them. Those can also be any questions that might have come up to you and you don't have answers to them yet.

No-goes are usually rabbit-holes that you can safely discard - you think they are not worth working on.

olayway avatar Jul 03 '24 13:07 olayway

@olayway Edits made, thanks!

LuisVCSilva avatar Jul 04 '24 12:07 LuisVCSilva

~ 1 week to implement a robust solution.

It shouldn't take more than 1 day

olayway avatar Jul 05 '24 12:07 olayway

@LuisVCSilva also IMO the solution should be worked out in much more detail b4 implementation here e.g. where in the code will you likely make these changes? Will you write tests? If so where ...

Is this just < and > that causes issues?

And what about things like => (common in obsidian for "implies") - will you replace just the > or will you replace the whole => with implies unicode character etc?

Do you actually need to fix this or do we just need to report to users what the error relates to so they can fix it (getting into regexes to fix stuff is a potentially complex business - or maybe it is very simple ...)

Have you done any research online e.g. googling about this issue.

PS: does this relate to #1206?

rufuspollock avatar Jul 08 '24 21:07 rufuspollock

Situation

Facing difficulties with READMEs that contain certain special characters, specially "<" and ">".

Problem

MDX throws an error whenever special characters like < or > appears in the .md file It could also mean problem with other special characters that might also give us an MDX parsing error.

Solution

  • Currently content from README.md files are parsed through remark plugins in the lib/markdown.ts file, thus making sure that before passing the content we can replace special characters to an HTML entities such > = &gt; or < = %lt;.
  • It would also make sense to write unit-tests for all possible different cases, and it would also make sense replace maybe these ones as well:
    • ' = &apos;
    • & = &amp;
    • " = &quot;
  • Making sure everything documented somewhere how the parsing of the .md file works could be a good feature to add thoughts? @olayway

Appetite

Implementation should be pretty quick(15-30 mins), also writing different test cases might take some time to test(1-2h).

Rabbit-holes

  • Adding more character conversion that might complicate the solution or require different approach
  • Unexpected edge cases such as &amp; which might become &amp;amp; and etc.
  • Also when adding new features such as this #1233 might not require converting special character like this ! which can break the whole implementation of newly added feature

No-goes

  • Hardcoding solution only for specific characters, meaning focusing only < or > this might be not a good approach

Appendix

cc @olayway @rufuspollock

gradedSystem avatar Aug 08 '24 13:08 gradedSystem

@gradedSystem Ok, let's just do this. I think we're making it more complicated than it actually is at this point. For now let's just check >, <, =>, <=, <=> maybe.

Hint: we you want to escape these characters before this line: https://github.com/datopian/datahub-next/blob/34c53cdc9d9a1dbd96e6ce08f219c9e064e698b1/lib/markdown.ts#L30 Creating a small remark plugin will not work, as the error is thrown already by gray-matter.

Also, I advise starting by writing tests and coding the solution second.

olayway avatar Aug 09 '24 11:08 olayway

I just did the shaping and thought to include above tooπŸ˜…, ok noted will start working on tests before solution

gradedSystem avatar Aug 09 '24 11:08 gradedSystem

FIXED

olayway avatar Aug 16 '24 15:08 olayway

Reopening as we haven't thought about some cases and the current solution breaks e.g. callouts and mermaid diagrams.

olayway avatar Aug 19 '24 13:08 olayway

WONTFIX

Closing as wont-fix as there are too many edge cases and supporting this would require quite a lot of work. For now any standalone > or < that are causing errors need to be escaped with expressions: {">"} .

olayway avatar Aug 19 '24 13:08 olayway