http-crawler icon indicating copy to clipboard operation
http-crawler copied to clipboard

Allow user to choose whether to follow redirects

Open inglesp opened this issue 9 years ago • 11 comments

We currently use requests's default behaviour of following redirects.

A user might not always want this, as they might want to use the library to find unnecessary redirects on a site.

We should find a way to allow the user to configure the behaviour here.

inglesp avatar Jun 09 '16 11:06 inglesp

I am a beginner when it comes to contributing to opensource. I am interested in taking this up.

I was thinking crawl() could take keyword arguments (say, follow_redirects) along with base_url through which the user can specify whether to follow redirects or not. Then, we can set allow_redirects in get() method, accordingly.

Please let me know if my approach is a proper one.

rkrp avatar Jun 23 '16 23:06 rkrp

Hi @rkrp. That looks like a sensible approach.

One question: what should the default behaviour be? I think that following redirects should be the default, as this is what people who use requests will expect. What do you think?

Do you fancy trying to implement this?

inglesp avatar Jun 26 '16 18:06 inglesp

@inglesp I agree. The default behaviour must be to follow redirects. Anything otherwise, would be very counter-intuitive.

Also, I presume, I will need to write the corresponding unit tests for this new feature.

I would love to implement this. But, I am a student and I am in the middle of my exams. So, it would take sometime before I can send a pull request. I hope that is fine.

rkrp avatar Jun 28 '16 13:06 rkrp

Yes, when you come to implement this, please include tests.

Good luck with your exams!

inglesp avatar Jun 29 '16 08:06 inglesp

Thanks.

I was initially planning to create a new method test_redirect() in test_http_crawler.py to include the tests related to this. But, the code for starting the local server (serve())is defined in test_crawl(). So, should I write the tests for redirects in test_crawl() or should I move _serve() out of test_crawl()?

rkrp avatar Jul 08 '16 09:07 rkrp

Hi @rkrp!

I'm one step ahead! Last night, I added a commit that adds a new option to crawl(), and in the process, refactored the test code to move _serve() out of test_crawl(). Does this solve the problem for you?

inglesp avatar Jul 08 '16 09:07 inglesp

Hi @rkrp -- how's this going? Anything I can help with?

inglesp avatar Aug 07 '16 08:08 inglesp

@inglesp I am looking for the best way to setup local httpd for the redirection tests. I looked into the tests present already. And also, I am reading up on the documentations for http.server. Hopefully, I will be able to send a PR by this weekend.

rkrp avatar Aug 11 '16 15:08 rkrp

@rkrp -- good stuff!

inglesp avatar Aug 11 '16 20:08 inglesp

@rkrp, any progress? If there's anything I can help with, let me know. If not, I'd like to pass this issue on to somebody else to tackle.

inglesp avatar Oct 11 '16 19:10 inglesp

@inglesp I am sorry for the delay. I have sent a PR, implementing this. #8

rkrp avatar Oct 12 '16 14:10 rkrp