Medium.com headings are missing in parsed content
- Platform: Darwin C02XH15QJHD4 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
- Mercury Parser Version: 2.1.0
- Node Version (if a Node bug): 11.5.0
Expected Behavior
Expect all headings within Medium articles to be present within parsed content.
Current Behavior
h tags are stripped from the content.
Steps to Reproduce
- Use the following Medium article: https://medium.com/@JakobUlbrich/flag-attributes-in-android-how-to-use-them-ac4ec8aee7d1
- Either feed it to the parser or use Mercury Reader
- Look for "What Are Bit Flags?" - this text will not be found in the parsed article.
Possible Solution
It looks as though the issue here is the .graf class, which Medium puts on its headings. If I remove the graf entry from NEGATIVE_SCORE_HINTS, the headings show up successfully. Alternatively, I can create a content transform in MediumExtractor that strips all classes from h tags, and again the headings show up successfully.
Is there a way to override the negative hints without forking the repo? Or perhaps override the medium extractor already included?
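For anyone who wants to try the class-stripping idea without forking, here is a minimal sketch that does it as a pre-processing step on the raw HTML instead of inside MediumExtractor. `stripHeadingClasses` is a hypothetical helper, not part of Mercury Parser, and it relies on a regex rather than a real HTML parser, so it assumes attribute values do not contain `>`.

```javascript
// Hypothetical helper (not part of Mercury Parser): removes the class
// attribute from h1-h6 opening tags so class-based negative hints such
// as "graf" no longer apply to headings.
function stripHeadingClasses(html) {
  // Match only h1..h6 opening tags; <header>, <hr>, etc. are left alone.
  return html.replace(/<(h[1-6])\b([^>]*)>/gi, (match, tag, attrs) => {
    // Drop class="...", class='...' or an unquoted class value.
    const cleaned = attrs.replace(
      /\s+class\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi,
      ''
    );
    return `<${tag}${cleaned}>`;
  });
}

const sample =
  '<h3 name="a1" class="graf graf--h3">What Are Bit Flags?</h3>' +
  '<p class="graf graf--p">Some text</p>';

// The heading loses its class; the paragraph is untouched.
console.log(stripHeadingClasses(sample));
```

The cleaned HTML would then need to be handed to the parser yourself, for example through an option for pre-fetched HTML if your version of Mercury Parser provides one. That part is an assumption and is not shown here.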
Sorry if anything here is super dumb/obvious - I'm not an experienced coder.
Hello @mtashley, we have the same problem. Have you found a solution? As a workaround, we replaced h1 and h2 with a generic headline tag, e.g. h1 becomes headline1. It works.
Best regards, Gabi
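A rough sketch of what the rename workaround described above could look like, applied to the HTML string before parsing. The function names `renameHeadings` and `restoreHeadings` are made up for illustration; the thread does not show the actual code, and mapping the tags back afterwards is an assumption about what a consumer would want.

```javascript
// Hypothetical helpers illustrating the "h1 -> headline1" workaround.
// Renames opening and closing heading tags for the given levels
// (h1 and h2 by default, as in the comment above).
function renameHeadings(html, levels = [1, 2]) {
  const pattern = new RegExp(`<(/?)h([${levels.join('')}])\\b`, 'gi');
  return html.replace(pattern, '<$1headline$2');
}

// Assumed follow-up step: map headlineN back to hN in the parsed content.
function restoreHeadings(html) {
  return html.replace(/<(\/?)headline([1-6])\b/gi, '<$1h$2');
}

const before = '<h1 class="graf">Title</h1><h2>Sub</h2><h3>Other</h3>';
const renamed = renameHeadings(before);

console.log(renamed);
console.log(restoreHeadings(renamed) === before);
```

Only h1 and h2 are renamed by default, so other heading levels would still be affected by the original problem unless they are added to `levels`.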
@gschach @ptrmrrs Would you mind sharing your workaround?
I'm facing the same issue. Is there a fix for this now?