
Medium.com headings are missing in parsed content

Open ptrmrrs opened this issue 6 years ago • 3 comments

  • Platform: Darwin C02XH15QJHD4 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
  • Mercury Parser Version: 2.1.0
  • Node Version (if a Node bug): 11.5.0

Expected Behavior

Expect all headings within Medium articles to be present within parsed content.

Current Behavior

Heading (h) tags are stripped from the parsed content.

Steps to Reproduce

  1. Use the following Medium article: https://medium.com/@JakobUlbrich/flag-attributes-in-android-how-to-use-them-ac4ec8aee7d1
  2. Either feed it to the parser or use Mercury Reader
  3. Look for "What Are Bit Flags?" - this text will not be found in the parsed article.

Possible Solution

The issue appears to be the .graf class. If I remove this entry from NEGATIVE_SCORE_HINTS, the headings show up correctly. Alternatively, I can add a content transform in MediumExtractor that strips all classes from h tags, and the headings show up correctly as well.

Is there a way to override the negative hints without forking the repo? Or perhaps override the medium extractor already included?
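One way to get the effect of the second idea without touching the repo might be to strip the classes from the heading tags in the raw HTML before it reaches the parser. The sketch below is only an illustration: `stripHeadingClasses` is a hypothetical helper, the sample markup is made up, and a regex is used for brevity where a real HTML parser would be more robust.

```javascript
// Hypothetical helper (not part of Mercury Parser): removes the class
// attribute from <h1>..<h6> opening tags in a raw HTML string.
function stripHeadingClasses(html) {
  return html.replace(/<h([1-6])\b([^>]*)>/gi, (match, level, attrs) => {
    // Drop class="...", class='...' or an unquoted class value; keep other attributes.
    const cleaned = attrs.replace(/\sclass\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');
    return `<h${level}${cleaned}>`;
  });
}

// Illustrative input only; real Medium markup may differ.
const sample =
  '<h3 name="a1" class="graf graf--h3">What Are Bit Flags?</h3>' +
  '<p class="graf">Body text</p>';

console.log(stripHeadingClasses(sample));
```

If the installed parser version accepts pre-fetched HTML through the `html` option of `Mercury.parse(url, { html })`, the cleaned string could be passed in that way. Whether this is enough to keep the headings in the output has not been verified here.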

Sorry if anything here is super dumb/obvious - I'm not an experienced coder.

ptrmrrs • Apr 25 '19 11:04

Hello @mtashley, we have the same problem. Have you found a solution? We replaced h1 and h2 with a generic headline tag, e.g. h1 becomes headline1. It works.

Best regards, Gabi

gschach • Dec 10 '19 10:12
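The workaround gschach describes was not posted as code. A rough sketch of the idea, assuming the HTML is available as a string before parsing, might look like the following. The function names are hypothetical, and the sketch covers h1 through h6 rather than only h1 and h2.

```javascript
// Hypothetical helpers (not part of Mercury Parser).

// Rename <hN>/</hN> to <headlineN>/</headlineN> so the tags are no
// longer treated as headings while the content is scored.
function renameHeadings(html) {
  return html.replace(/<(\/?)h([1-6])\b/gi, '<$1headline$2');
}

// Reverse the rename on the parsed output.
function restoreHeadings(html) {
  return html.replace(/<(\/?)headline([1-6])\b/gi, '<$1h$2');
}

const before = '<h1 class="graf">Title</h1><p>Text</p>';
const renamed = renameHeadings(before);
console.log(renamed);
console.log(restoreHeadings(renamed) === before);
```

Whether the parser keeps the placeholder tags in its output is what the comment above reports, not something confirmed by this sketch.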

@gschach @ptrmrrs Would you mind sharing your workaround?

Barabazs • Mar 13 '20 20:03

I'm facing the same issue. Is there a fix for this now?

pbshgthm • Sep 15 '22 23:09