Ability to abort url request

Open jamesike opened this issue 4 years ago • 10 comments

Hello, I was wondering if there was a way to be able to intercept a URL request, and abort it before the request is able to be sent

jamesike avatar Sep 01 '21 17:09 jamesike

Hi @jamesike, there's currently no way to block specific requests. We do support blocking resources at a more general level through the blockedResourceTypes option (https://secretagent.dev/docs/overview/configuration#blocked-resources).

Do you want to block a pattern of URLs, or inspect actual requests and block them one by one?

blakebyrnes avatar Sep 01 '21 17:09 blakebyrnes

+1 for this feature.

Something like Puppeteer's

page.on('request', request => {

});

would be so awesome to have here.

perhaps there is some sort of workaround @blakebyrnes ?

janisblaus avatar Sep 25 '21 07:09 janisblaus

@janisblaus I definitely want to support blocking urls with a list of wildcard url patterns, but I'm hesitant to add a feature to push every resource to the client to get approval before proceeding in the Man-in-the-middle. Our client/server model just makes something like this feel very heavy. Would a url blocking pattern solve your use-case? Or do you need to inspect them one by one and look at things like header/body/etc?

blakebyrnes avatar Sep 26 '21 17:09 blakebyrnes
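
To make the idea of a wildcard URL blocklist from the comment above concrete, here is a minimal sketch. It assumes that `*` matches any run of characters and that everything else in a pattern is literal; `wildcardToRegExp` and `isBlocked` are hypothetical names for illustration and are not part of SecretAgent or Hero.

```javascript
// Hypothetical helper (not a SecretAgent/Hero API): turn a wildcard pattern
// into a RegExp. "*" matches any run of characters; the rest is literal.
function wildcardToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    // Escape regex metacharacters in each literal segment.
    .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`);
}

// Returns true when the URL matches at least one blocked pattern.
function isBlocked(url, blockedPatterns) {
  return blockedPatterns.some(pattern => wildcardToRegExp(pattern).test(url));
}

const blockedPatterns = ['*://*.doubleclick.net/*', '*/analytics.js'];
console.log(isBlocked('https://ads.doubleclick.net/pixel?id=1', blockedPatterns));
console.log(isBlocked('https://example.org/app.js', blockedPatterns));
```

A list like this can be evaluated entirely on the server side, which avoids the per-request round trip to the client that the comment is worried about.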

@janisblaus I definitely want to support blocking urls with a list of wildcard url patterns, but I'm hesitant to add a feature to push every resource to the client to get approval before proceeding in the Man-in-the-middle. Our client/server model just makes something like this feel very heavy. Would a url blocking pattern solve your use-case? Or do you need to inspect them one by one and look at things like header/body/etc?

Pattern blocking sounds limiting. On puppeteer I use this to inspect requests, retrieve headers and in some specific cases - even rewrite them.

janisblaus avatar Sep 27 '21 12:09 janisblaus

I've definitely done that too, but it was usually to fix headers to match what they should have been for Chrome - which should not be a use case for SecretAgent. You can currently inspect every resource that comes through, including its headers, and you can write your own plugin to manipulate any headers. That approach works if you have a pattern of things you want to do for every request, but it's a bit of the wrong approach if this is unique per scrape. Do you have cases where you've wanted to do something different on a per-scrape basis?

blakebyrnes avatar Sep 27 '21 13:09 blakebyrnes

I guess writing a plugin would be enough for such a use case, yes, will definitely look into it.

janisblaus avatar Sep 28 '21 04:09 janisblaus

This feature should be implemented by existing "request" events along with an "abort" function on them. We might need a mode here that "pauses" the request until a client responds with a continue or abort... This is a good place to also allow a "hook" to wait for the response body as a ReadableStream.

NOTE: we need discussion on whether the default "request" event should activate an "abort" feature or if that's an additional function.

NOTE 2: we could also consider implementing this as a reference plugin.

blakebyrnes avatar Oct 14 '22 14:10 blakebyrnes
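
The "pause until the client responds with a continue or abort" mode proposed above could look roughly like the following. This is only a sketch under stated assumptions: `PausedRequest` and `processRequest` are hypothetical names, nothing here is Hero's API, and a real implementation would also need a timeout for listeners that never answer.

```javascript
// Hypothetical sketch: none of these names come from Hero.
class PausedRequest {
  constructor(url) {
    this.url = url;
    // The decision promise settles once the listener calls continue() or abort().
    this.decision = new Promise(resolve => {
      this.resolveDecision = resolve;
    });
  }

  continue() {
    this.resolveDecision('continue');
  }

  abort() {
    this.resolveDecision('abort');
  }
}

// Hands the request to the listener, then waits for its decision
// before reporting whether the request would be sent or aborted.
async function processRequest(url, onRequest) {
  const request = new PausedRequest(url);
  onRequest(request);
  const decision = await request.decision;
  return { url, status: decision === 'abort' ? 'aborted' : 'sent' };
}

// Example listener: abort images, let everything else through.
const onRequest = request => {
  if (request.url.endsWith('.png')) request.abort();
  else request.continue();
};

processRequest('https://example.org/logo.png', onRequest).then(result => {
  console.log(result.status);
});
```

Because every request waits on the listener, this design adds a client round trip per resource, which is the cost discussed earlier in the thread.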

I agree that ideally this can be done as a plugin, which can be one of the available plugins in a ulixee-controlled repo/folder of plugins available for opt-in.

I also would very much like this, as it allows you to save on a lot of wasted resources, given how much unrequired crap most websites download in the background these days...

GlenDC avatar Oct 14 '22 17:10 GlenDC

The shorter path for much of this need is to simply add a blockedResourceUrls as a config to Hero (series of regexes or strings). It's already part of the Mitm code and interceptorHandlers pattern, we just need the configuration. Maybe we should log this separately - I think it's the feature you want.

A request interception plugin is a bit of a more advanced scenario where you need to block and/or modify post data, response data, etc. If we add blockedResourceUrls, the only time you would still need the ability to just plain abort is when you don't know what the URL will look like (or there are two good ones and a third bad one you want to block).

blakebyrnes avatar Oct 14 '22 18:10 blakebyrnes

I think it's the feature you want.

Sounds about right. Okay, let's go for that :)

GlenDC avatar Oct 14 '22 19:10 GlenDC

Here's a good starting place: you can see the existing configs coming in (they'll need to be added to client): https://github.com/ulixee/hero/blob/f7b3d0d07931fd8e06b5c6fa8c8477a4577450e3/core/lib/Tab.ts#L820

Regexps will automatically traverse the connection, but in places where we match URLs, we usually allow plain strings too (an example of this is how we do waitForResource).

blakebyrnes avatar Oct 20 '22 12:10 blakebyrnes
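
As an illustration of a pattern list that mixes plain strings and regexes and has to cross a client/server connection, here is a small sketch. The wire format is an assumption made for this example (Hero's own serializer may work differently), and treating a plain string as an exact match is also an assumption; `encodePatterns`, `decodePatterns`, and `matchesAny` are hypothetical names.

```javascript
// Hypothetical wire format: JSON cannot carry a RegExp directly,
// so each regex is sent as its source and flags. Strings pass through as-is.
function encodePatterns(patterns) {
  return JSON.stringify(
    patterns.map(p => (p instanceof RegExp ? { source: p.source, flags: p.flags } : p)),
  );
}

// Rebuilds RegExp objects on the receiving side.
function decodePatterns(json) {
  return JSON.parse(json).map(p =>
    typeof p === 'string' ? p : new RegExp(p.source, p.flags),
  );
}

// Assumption: a string must equal the URL exactly; a RegExp is tested against it.
// Avoid the "g" flag here, because test() is stateful on global regexes.
function matchesAny(url, patterns) {
  return patterns.some(p => (typeof p === 'string' ? url === p : p.test(url)));
}

const received = decodePatterns(
  encodePatterns(['https://example.org/ads.js', /\.png(\?.*)?$/i]),
);
console.log(matchesAny('https://example.org/LOGO.PNG', received));
console.log(matchesAny('https://example.org/app.js', received));
```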

A request interception plugin is a bit of a more advanced scenario where you need to block and/or modify post data, response data, etc.

If that's what I'm looking for, what's a good starting point, assuming a plugin is the place to do this? Are there any specific examples I can reference?

My use case would be to intercept a script resource, modify it and return the modified script.

rjbks avatar Oct 23 '22 18:10 rjbks

My use case would be to intercept a script resource, modify it and return the modified script.

I think we need to make some minor modifications to the current plugin structure to support this. Likely we should add an additional callback to the beforeHttpRequest and beforeHttpResponse calls that indicates you have handled request processing and would like to halt request processing.

If you simply need a temporary way around this, you can subscribe to new Agent creation directly in HeroCore (note that this is a semi-internal api and is not documented because of that).

import HeroCore from '@ulixee/hero-core';

// HeroCore has to be started before you can register the event listener.
// You can also do this by starting a Ulixee Server.
await HeroCore.start();

HeroCore.pool.on('agent-created', ({ agent }) => {
  agent.mitmRequestSession.interceptorHandlers.push({
    // Plain strings or regular expressions to match against the request URL.
    urls: ['SCRIPT_URL', new RegExp('Or regex')],
    handlerFn(url, type, request, response) {
      // Answer the request directly with your own script body.
      response.end(`<YOUR SCRIPT>`);
      return true;
    },
  });
});

blakebyrnes avatar Oct 24 '22 14:10 blakebyrnes
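
To show how a handler that returns true can halt further request processing, here is a self-contained sketch modelled on the `{ urls, handlerFn }` shape in the snippet above. It is not Hero's implementation: `runInterceptors` and `createFakeResponse` are hypothetical, and reading a `true` return value as "handled" is an assumption based on that snippet.

```javascript
// Hypothetical dispatcher, modelled on the { urls, handlerFn } shape above.
function runInterceptors(handlers, url, type, request, response) {
  for (const handler of handlers) {
    const matches = handler.urls.some(p => (p instanceof RegExp ? p.test(url) : p === url));
    if (!matches) continue;
    // Assumption: returning true means the handler has fully answered the request.
    if (handler.handlerFn(url, type, request, response) === true) return true;
  }
  // No handler answered, so the request would proceed upstream as usual.
  return false;
}

// Minimal stand-in for a server response that records what was written.
function createFakeResponse() {
  return {
    body: null,
    end(body) {
      this.body = body;
    },
  };
}

const handlers = [
  {
    urls: [/\/tracker\.js$/],
    handlerFn(url, type, request, response) {
      response.end('/* replaced script */');
      return true;
    },
  },
];

const response = createFakeResponse();
const handled = runInterceptors(handlers, 'https://example.org/tracker.js', 'Script', {}, response);
console.log(handled, response.body);
```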

blockedResourceUrls is now available and exposed. Still have to add automated tests.

GlenDC avatar Nov 09 '22 22:11 GlenDC

HeroCore.pool.on('agent-created', ({agent}) => {
  agent.mitmRequestSession.interceptorHandlers.push({
    urls: ['SCRIPT_URL', new RegExp('Or regex')],
    handlerFn(url, type, request, response) {
      response.end(`<YOUR SCRIPT>`);
      return true;
    },
  });
});

How would I then proceed to get the script contents by making the request from the browser tab context that originated the request? I can't seem to find anything in the agent object or the browserContext.

rjbks avatar Mar 06 '23 08:03 rjbks