NewPipeExtractor icon indicating copy to clipboard operation
NewPipeExtractor copied to clipboard

Validate HTTP response codes across the extractor

Open absurdlylongusername opened this issue 5 months ago • 3 comments

  • [x] I carefully read the contribution guidelines and agree to them.
  • [x] I have tested the API against NewPipe.
  • [x] I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

Validate HTTP response codes for network requests throughout the extractor

EDIT: Repurposing this PR to be the first of a series of PRs that will refactor the extractor codebase to add HTTP response checks for all network requests to improve logging and error handling/reporting.

Almsot all code that uses the Downloader does not do any checking on the response code: therefore errors and exceptions thrown due to error response code give false information (e.g. tests failing due to 404 were giving error about clientId when the error was 404).

Based on discussions you can read below, I have initially added some methods in Response.java for validating response codes and used it in places for soundcloud extraction. It will throw exception with information about the error code, which will propagate upwards to the call site which will print a more useful stack trace for debugging.

As we discuss more on how we want to design and refactor this, further changes will be made in other places

absurdlylongusername avatar Jul 08 '25 09:07 absurdlylongusername

@litetex

  • There are Todo comments

The comment was kind of intentionally left in there in the hopes that whoever reviewed it would provide an explanation and then I would remove it.

  • Response codes that are ok are 2XX response codes

Not exactly sure what you mean by this? Do you mean that 2xx is the only "success" code range all network calls throughout the codebase?

  • I'm really not sure about another HttpUtils method and a new Exception class I think the most simple approach would be to have a boolean Response#isOk() method and then use this method to control the underlying use cases accordingly.

Do you mean another as in having two validateResponseCode methods, one in Response and one in HttpUtils, or do you mean that there's another HttpUtils class somewhere (because I can't seem to find one)?

And does this also mean we shouldn't have a HttpResponseException? If so, why? I think having a specific exception for Http errors is better than throwing a generic IOException, and also makes for better logging and diagnostics, because right now our code does not give any useful information at all for what causes an error if the cause is an invalid response.


My idea was to have a generic method we can use to check for expected response codes instead of just checking it's 2xx or not 2xx, because I was of the assumption that:

  1. We are definitely not validating response codes anywhere in the codebase (I can't see anywhere we do this)
  2. There may be scenarios where we want to treat a specific 3xx or 4xx as not an error and it's not efficient to catch an exception in these cases and then check the response code (and then throw another exception with info about the http error that will get propagated upwards), as I believe we should never throw exception if we don't have to, and exceptions should only be thrown for unexpected behaviour, or in the same scenarios we already throw exceptions.

The end goal is that if a network call ever fails, we want to know why, using code that is:

  • As small as possible
  • Easy to understand
  • Easy to use
  • Easy to extend

If there's only one place in code where non-2xx codes shouldn't be considered an error then we could handle that case explicitly at that call site, but if it's at least 3 times then I think we should have more than a simple isOk method. Could perhaps have other methods like Response#isRedirect for 3xx, Response#isClientError for 4xx, Response#isServerError instead of a single generic method.

Since I am almost certain 1. is true, I am more than happy to trawl through the entire codebase for every http call we make (not jusr for get, but for post and head too) and add response code validation and see which approach makes the most sense.

I could update this PR with these changes and/or split it into multiple PRs if it is too big.

Thoughts?

absurdlylongusername avatar Jul 14 '25 06:07 absurdlylongusername

Thanks! We do need to check response codes more often. For example, PeerTube sometimes throws 429 Too Many Requests (TeamNewPipe/NewPipe#2979), but that shouldn't be treated as a recaptcha by the Downloader (while that's what happens now). It should be the PeerTube service's job to throw a normal ExtractorException for 429s, and instead the YouTube service's job to throw a RecaptchaException (along with a proper URL that users can use to solve the captcha, instead of having the Downloader guess that URL).

It would be good if you could take the above paragraph into consideration for this PR, and properly validate the HTTP response codes of all requests made by all services in the extractor, so we don't have to change the design once again later.

@Stypox Some problems.

In BandcampChannelLinkHandlerFactory.getId if I want to add HTTP response code validation, then this means that I need to handle the HttpResponseException.

Now, LinkHandlerFactory.getId and other base methods all throw ParsingException, but not ExtractionException. getId in the bandcamp class catches IOException | ReCaptchaException | ArrayIndexOutOfBoundsException | JsonParserException and swallows them all into a ParsingException.

Semantically, only JsonParserException is a parsing exception, and I don't know the rationale behind catching ArrayIndexOutOfBoundsException.

Now, what exactly is the difference between an ExtractionException and a ParsingException?

LinkHandlerFactory and pretty much all the other abstract base class extractors and similar classes all throw ParsingException, and they do that by swallowing up all other exceptions and wrapping it in a ParsingException.

If we add http response validation throughout the extractor, should the HttpResponseException be swallowed and wrapped in a ParsingException? Because as far as I can see that is what happens throughout the codebase regardless of the type of exception thrown when we make network requests.

How exactly do we want to do this, so that we can have correct semantic exceptions?

absurdlylongusername avatar Oct 11 '25 20:10 absurdlylongusername

Now, what exactly is the difference between an ExtractionException and a ParsingException?

I guess this was a very early attempt to distinguish exceptions caused by getting and deserializing data from the internet, vs parsing that data using the methods defined in the extractor. However, I don't think that distinction holds now, and the there is no semantic distinction between the two. If you want to open a PR to delete ParsingException and replace it with ExtractionException everywhere feel free to.

If we add http response validation throughout the extractor, should the HttpResponseException be swallowed and wrapped in a ParsingException? Because as far as I can see that is what happens throughout the codebase regardless of the type of exception thrown when we make network requests.

I would create a NetworkException extends ExtractionException to wrap all network-related exception. Error handling is generally difficult to do, but I think the current state of the extractor mostly works as expected: wrapping exceptions means we still get to look at the original exception in the stacktrace in bugreports, but the API remains clean so NewPipeExtractor users don't have to catch tens of exception types.

Stypox avatar Oct 21 '25 17:10 Stypox