kimuraframework icon indicating copy to clipboard operation
kimuraframework copied to clipboard

absolute_url corrupts url escaping it

Open mdesantis opened this issue 6 years ago • 5 comments

I have an URL like this: https://www.example.com/path?query_param=N%2CU. absolute_url method:

https://github.com/vifreefly/kimuraframework/blob/512217924cab76498ee20594be3e15d26cabe426/lib/kimurai/base_helper.rb#L5-L8

escapes it so it becomes https://www.example.com/path?query_param=N%252CU, corrupting the URL and breaking the spider link following. What about adding an argument to absolute_url in order to skip escaping?

mdesantis avatar Aug 30 '19 15:08 mdesantis

I also come across this issue in various scraping frameworks, and firewall NAT! It means you then have to spend time investigating why your URL is getting mangled.

You probably already know this, but you can consider decoding that %2C to , before passing it to absolute_url.

I would love if there was an easier way to get the developer on the right track more quickly. And I agree that there should be a way to signal that you want to send the URL as it is.

ttilberg avatar Aug 30 '19 15:08 ttilberg

As workaround at the moment I added the following method in the spider class:

  def absolute_url(url, base:, skip_url_escape: false)
    super(url, base: base) unless skip_url_escape

    return unless url

    URI.join(base, url).to_s
  end

mdesantis avatar Aug 30 '19 15:08 mdesantis

After reviewing the class a bit more thoroughly: https://github.com/vifreefly/kimuraframework/blob/master/lib/kimurai/base_helper.rb

I've come to the conclusion that the correct course of action would be to not include escaping in the absolute_url method. There are already other helpers for escaping (escape_url), and combining escape + absolute url (normalize_url). Generating an absolute URL is a pretty common task, but you might not always wish to have other processing happening.

@vifreefly Do you agree with this? Are there other parts of the code or expectations that absolute_url is also handling escaping? This could potentially be a breaking change.

ttilberg avatar Aug 30 '19 15:08 ttilberg

I agree with you about that

mdesantis avatar Aug 30 '19 15:08 mdesantis

@ttilberg I think you're probably right, absolute_url should making the url is absolute, not escaping it.

I think it the next version of the framework we can make absolute_url not escaping url by default, and provide an option to escape it as well, like: absolute_url(some_url, base: url, escape: true).

As a workaround for now it's easy to redefine absolute url method in ApplicationHelper the way as you need, (like @mdesantis already did).

vifreefly avatar Aug 31 '19 08:08 vifreefly