Readability4J

img tags with missing src which are set via javascript or noscript show as empty

Open Pranoy1c opened this issue 3 years ago • 0 comments

The following page:

https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249

has img tags with a missing or empty src attribute. I think the src is set via JavaScript on scroll; the real image is also present in a noscript tag right after each img tag.

Here's a piece of the page's HTML:

<img alt="" class="iq ir t u v is ak c" width="687" height="60" role="presentation"><noscript><img alt="" class="t u v is ak" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60" srcSet="https://miro.medium.com/max/552/1*JnixtUHJjNYXNT15P42eJQ.png 276w, https://miro.medium.com/max/1104/1*JnixtUHJjNYXNT15P42eJQ.png 552w, https://miro.medium.com/max/1280/1*JnixtUHJjNYXNT15P42eJQ.png 640w, https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png 687w" sizes="687px" role="presentation"/></noscript></div></div></div><figcaption class="jd je cm ck cl jf jg en b eo ep fv" data-selectable-paragraph="">SDLC components</figcaption></figure>

This causes Readability to return the large images with an empty src; with ReadabilityExtended, only tiny thumbnails are returned.

I was able to work around the issue by finding every img tag with a missing or empty src, checking whether that element has a noscript sibling containing an img, and if so, copying the src from the noscript img onto the original img. I placed the following code at the very beginning of the protected open fun removeNoscripts(document: Document) {} function in Preprocessor.kt:

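// Needs org.jsoup.Jsoup and org.jsoup.parser.Parser imported in Preprocessor.kt
try {
    // Match img elements whose src is empty or absent
    document.select("img[src=\"\"], img:not([src])").forEach { img ->
        // Use the first noscript sibling of the img, if there is one
        img.siblingElements().select("noscript").firstOrNull()?.let { noscript ->
            // Parse the noscript's inner HTML and read the src of the first img inside it
            val fallbackSrc = Jsoup.parse(noscript.html(), "", Parser.xmlParser())
                .selectFirst("img")?.attr("src")
            // Skip noscript blocks that contain no usable img instead of throwing
            if (!fallbackSrc.isNullOrEmpty()) {
                img.attr("src", fallbackSrc)
            }
        }
    }
} catch (e: Exception) {
    println("Exception in setting img for missing src from noscript tags: ${e.message}")
}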
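
For anyone who wants to try the idea outside of Preprocessor.kt, here is a minimal standalone sketch of the same workaround. The function name fillMissingSrcFromNoscript is hypothetical and not part of Readability4J. It also differs from the snippet above in two ways that are my own assumptions: it filters for blank src in Kotlin instead of in the CSS selector, and it only looks at the element immediately following the img (nextElementSibling) rather than the first noscript sibling, so a parent holding several img/noscript pairs does not reuse the first pair's image.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical helper (not a Readability4J API): copies the src of the img
// inside a following <noscript> onto an img whose own src is missing or blank.
// Returns the number of img elements that were repaired.
fun fillMissingSrcFromNoscript(document: Document): Int {
    var repaired = 0
    // attr("src") returns "" when the attribute is absent, so isBlank() covers both cases
    document.select("img").filter { it.attr("src").isBlank() }.forEach { img ->
        // Assumption: the fallback <noscript> directly follows the img, as in the HTML above
        val noscript = img.nextElementSibling()
            ?.takeIf { it.tagName() == "noscript" }
            ?: return@forEach
        // Re-parse the noscript's inner HTML and take the first img's src, if any
        val fallbackSrc = Jsoup.parseBodyFragment(noscript.html())
            .selectFirst("img")
            ?.attr("src")
        if (!fallbackSrc.isNullOrBlank()) {
            img.attr("src", fallbackSrc)
            repaired++
        }
    }
    return repaired
}
```

Calling fillMissingSrcFromNoscript(Jsoup.parse(html)) on the HTML fragment quoted earlier should leave the outer img pointing at the miro.medium.com URL from the noscript block. Images that already have a src, including the ones inside noscript, are left untouched.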

Pranoy1c, Jun 10 '21 11:06