python-markdownify

Handle relative image URLs

Open kaichen opened this issue 1 year ago • 5 comments

Add the ability to handle relative image URLs, such as 'path/to/img.png' or '/path/to/img.png'.

kaichen, Jun 23 '24 14:06

Hey, thanks for your contribution! Any reason why the base_url gets cut into host and protocol, instead of using it as-is as prefix? Maybe the user wants to prefix their URLs with a full locator.

AlexVonB, Nov 24 '24 16:11

@kaichen - could you provide an example use case for this feature? I don't fully understand it from the pull request description.

chrispy-snps, Jan 01 '25 18:01

> could you provide an example use case for this feature? I don't fully understand it from the pull request description.

Some webpages use relative paths for their image URLs. When downloading HTML and converting it to Markdown with this library, I need the full image URLs so that the images render correctly.
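To make the use case concrete, here is a minimal sketch of resolving the two relative forms from the issue description against the address of the page they came from. It uses only the standard library's `urljoin`; the page URL and the `resolve` helper are assumptions made up for illustration and are not part of the pull request.

```python
from urllib.parse import urljoin

# Assumed address of the downloaded page (illustrative only).
PAGE_URL = "https://example.com/blog/post.html"


def resolve(src):
    """Return src as an absolute URL, resolved against PAGE_URL."""
    return urljoin(PAGE_URL, src)


# Path relative to the page's directory.
print(resolve("path/to/img.png"))   # https://example.com/blog/path/to/img.png
# Path relative to the site root.
print(resolve("/path/to/img.png"))  # https://example.com/path/to/img.png
```

An image URL that is already absolute passes through `urljoin` unchanged, so the same helper can be applied to every `src` value.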

kaichen, Jan 02 '25 01:01

> Hey, thanks for your contribution! Any reason why the base_url gets cut into host and protocol, instead of using it as-is as prefix? Maybe the user wants to prefix their URLs with a full locator.

I just want to make sure that base_url is joined with relative paths correctly.
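The difference between the two approaches discussed here can be sketched as follows. This is not the code from the pull request; `prefix_as_is` and `join_as_url` are hypothetical helpers that contrast plain string prefixing with URL-aware joining via the standard library's `urljoin`.

```python
from urllib.parse import urljoin


def prefix_as_is(base_url, src):
    """Use base_url verbatim as a string prefix."""
    return base_url + src


def join_as_url(base_url, src):
    """Resolve src against base_url following URL resolution rules."""
    return urljoin(base_url, src)


base = "https://example.com/docs/"  # assumed base URL

# Both approaches agree for a path relative to the base directory.
print(prefix_as_is(base, "img/a.png"))   # https://example.com/docs/img/a.png
print(join_as_url(base, "img/a.png"))    # https://example.com/docs/img/a.png

# They differ for a root-relative path.
print(prefix_as_is(base, "/img/a.png"))  # https://example.com/docs//img/a.png
print(join_as_url(base, "/img/a.png"))   # https://example.com/img/a.png
```

Plain prefixing gives the user full control over the prefix, while URL-aware joining treats root-relative paths the way a browser would. Which behaviour is wanted depends on the use case.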

kaichen, Jan 02 '25 01:01

I have mixed feelings about this.

On one hand, I always appreciate a pull request contribution. And on the surface, this provides a nice convenience for this use case.

But on the other hand, Markdownify's job is to render the provided HTML to Markdown, and as the Unix mantra says, "do one thing and do it well." Modifying link content is content modification, not content rendering, which feels more like source preprocessing before Markdownify is called.
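As a rough illustration of the preprocessing route, the sketch below rewrites image `src` attributes in the HTML string before any conversion takes place. The `absolutize_img_srcs` helper is hypothetical, and the regular expression is deliberately simple: it only handles double-quoted `src` values and would also touch attributes such as `data-src`, so a real HTML parser would be more robust.

```python
import re
from urllib.parse import urljoin

# Captures the text before, inside, and after a double-quoted src value
# within an <img> tag.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def absolutize_img_srcs(html, base_url):
    """Return html with every matched <img> src resolved against base_url."""

    def _fix(match):
        absolute = urljoin(base_url, match.group(2))
        return match.group(1) + absolute + match.group(3)

    return _IMG_SRC.sub(_fix, html)


html = '<p><img alt="x" src="/a/b.png"></p>'
print(absolutize_img_srcs(html, "https://example.com/post/1"))
```

The resulting string could then be handed to Markdownify unchanged, which keeps link rewriting out of the renderer.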

Two more random thoughts:

  • <a> links should be given similar consideration.

  • Another approach is to use a link-formatting function in process_img() and process_a():

    def format_link(link_text):
        return link_text  # default is to use link text as-is
    

    then allow the user to override this, either through an option that takes a callback function or by overriding the method in a subclass.
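Both override styles can be sketched with a small stand-in class. `SketchConverter`, its `format_link` option, and `AbsoluteLinkConverter` are hypothetical and exist only for illustration; the method names `process_img()` and `process_a()` follow the comment above and are not claimed to match Markdownify's actual API.

```python
from urllib.parse import urljoin


class SketchConverter:
    """Hypothetical stand-in for a converter, not Markdownify's class."""

    def __init__(self, format_link=None):
        # Optional callback; when omitted, links are left untouched.
        self._format_link_callback = format_link

    def format_link(self, link_text):
        if self._format_link_callback is not None:
            return self._format_link_callback(link_text)
        return link_text  # default is to use link text as-is

    def process_img(self, alt, src):
        return f"![{alt}]({self.format_link(src)})"

    def process_a(self, text, href):
        return f"[{text}]({self.format_link(href)})"


class AbsoluteLinkConverter(SketchConverter):
    """Subclass override: resolve every link against an assumed base URL."""

    base_url = "https://example.com/docs/page.html"

    def format_link(self, link_text):
        return urljoin(self.base_url, link_text)


# Style 1: pass a callback as an option.
with_callback = SketchConverter(
    format_link=lambda url: urljoin("https://example.com/", url)
)
print(with_callback.process_a("home", "/index.html"))

# Style 2: override the method in a subclass.
print(AbsoluteLinkConverter().process_img("logo", "img/logo.png"))
```

Because the hook is shared by `process_img()` and `process_a()`, images and `<a>` links are treated the same way, and the default behaviour stays a pure pass-through.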

Or maybe I am overthinking it, and this is simply a nice convenience that we should implement. :)

@AlexVonB, what are your thoughts on this?

chrispy-snps, Jan 02 '25 11:01