HTML to Markdown conversion is wrong for nested lists
As found in https://github.com/OpenTermsArchive/p2b-compliance-declarations/pull/113, nested lists are converted incorrectly and give
6. **chapter 6**.
1. **subchapter i**. looks ok ...
7. **Chapter 7**. with no subchapter seems ok too
8. **Chapter 8**.
1. **Entire Agreement**. still ok
9. **Chapter 9**.
1. **Entire Agreement**. still ok
10. **Chapter 10**.
1. **Entire Agreement**. not ok anymore
11. **Chapter 11**.
1. **Entire Agreement**. not ok anymore
instead of
6. **chapter 6**.
1. **subchapter i**. looks ok ...
7. **Chapter 7**. with no subchapter seems ok too
8. **Chapter 8**.
1. **Entire Agreement**. still ok
9. **Chapter 9**.
1. **Entire Agreement**. still ok
10. **Chapter 10**.
1. **Entire Agreement**. not ok anymore
11. **Chapter 11**.
1. **Entire Agreement**. not ok anymore
(Note the additional space after 10. and 11.)
You can retrieve the problematic HTML with:
wget https://www.mturk.com/participation-agreement -O participation-agreement.html
I just tested updating turndown, and also this PR: https://github.com/mixmark-io/turndown/pull/358
Neither fixes the problem, so we will have to either:
- switch to another library (which might generate new versions of many documents if its output differs in other ways)
- fix this abandoned library
Let's discuss this at the next planning meeting.
I am reopening this, as it won't be fixed until this fork is actually included in the core.
According to https://github.com/mixmark-io/turndown/pull/419#issuecomment-1361030545, it is very unlikely https://github.com/OpenTermsArchive/turndown/pull/2 will ever be merged into upstream.
The alternative suggested by the library author is to use a custom rule. This would indeed be a more sustainable approach than maintaining our own fork.
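To make the custom rule idea concrete, here is a minimal sketch of what such a rule's replacement function might look like. It is not the fix that was actually applied. It rests on an assumption: that the breakage comes from nested lines being indented by a fixed width, which is no longer enough once the list marker reaches two digits. The names `listItemReplacement` and `alignedListItem` are hypothetical. Only `addRule`, with a `filter` and a `replacement(content, node, options)` function, is taken from the turndown API.

```javascript
// Hypothetical replacement function for a turndown rule matching <li> elements.
// Assumption: continuation lines must be indented by the width of the list
// marker ("1. " is 3 characters, "10. " is 4) so that nested lists stay nested.
function listItemReplacement(content, node, options) {
  const parent = node.parentNode;
  let prefix = `${options.bulletListMarker} `;

  if (parent.nodeName === 'OL') {
    // Honour an explicit start attribute, otherwise number from 1.
    const start = parent.getAttribute('start');
    const index = Array.prototype.indexOf.call(parent.children, node);
    prefix = `${start ? Number(start) + index : index + 1}. `;
  }

  const indent = ' '.repeat(prefix.length);
  const body = content
    .replace(/^\n+/, '') // drop leading newlines
    .replace(/\n+$/, '\n') // keep at most one trailing newline
    .replace(/\n(?=[^\n])/g, `\n${indent}`); // indent non-empty continuation lines

  return prefix + body + (node.nextSibling && !/\n$/.test(body) ? '\n' : '');
}

// Registration, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('alignedListItem', { filter: 'li', replacement: listItemReplacement });
```

Because the indentation is derived from the marker rather than hard-coded, item 10 gets four spaces of indentation for its children while item 9 gets three. Whether this matches the output expected for existing documents would need to be checked against real snapshots before adopting it.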
The issue described has been fixed in #943. The technical improvement needed to make this fix sustainable is now described in https://github.com/OpenTermsArchive/engine/issues/1019.