caddy-dynamicdns icon indicating copy to clipboard operation
caddy-dynamicdns copied to clipboard

Enhance IP Address Extraction in lookupIP to Handle Complex Responses

Open kkkgo opened this issue 1 year ago • 9 comments

This PR significantly improves the lookupIP function by making it more robust and capable of handling complex response formats. Key improvements include:

  1. Enhanced Regex for Strict Validation:

    • IPv4: Matches only valid ranges (0-255) for each segment.
    • IPv6: Supports standard, compressed, and scoped formats.
  2. Improved Handling of Multiple IPs:

    • Extracts all potential IP addresses from the response body using regex.
    • Processes matches in reverse order, ensuring the last valid IP is returned.
    • Handles invalid IPs gracefully, iterating through matches until a valid one is found.
  3. Increased Flexibility:

    • Previous implementation only supported responses containing a single clean IP address.
    • Now supports extraction from mixed content such as HTML or JSON responses.
  4. Safety Enhancements:

    • Limits the maximum response size to 10 KB to avoid excessive memory usage.
    • Ensures robust error handling for invalid responses or unexpected content.

Test Plan: The following test scenarios have been verified:

  1. Clean IP Responses:

    • Example: A response containing only a single clean IP address like 203.0.113.1.
  2. Complex HTML Content:

    • Example: https://www.cloudflare.com/cdn-cgi/trace
      • Contains mixed data where IP address needs extraction from a key-value style content.
      • Ensures the last valid IP (2001:db8::1) is returned if multiple matches exist.
  3. Invalid IP Responses:

    • Example: Content with invalid IPs (e.g., 999.999.999.999) mixed with valid ones.
    • Verifies that only valid IPs are returned, and errors are raised if none exist.

This change makes the function much more versatile and robust, supporting a wider range of real-world use cases while maintaining strict validation for IP formats. It also ensures better handling of diverse and potentially malformed server responses.

kkkgo avatar Dec 09 '24 14:12 kkkgo

Huh, wow. That's quite a regex. (Did you write it?)

I think this is a cool capability, and I'm probably OK with this, but a little nervous about accepting such freeform input.

I think the best thing to do would be to test this pretty rigorously with automated tests. Move the regex code into a small function that we can easily unit test, then set up some table tests (range over an array of test cases in a Test* function). How does that sound?

mholt avatar Dec 09 '24 19:12 mholt

Indeed, you are absolutely right. Using a combined regular expression for IPv4 and IPv6 makes it difficult for humans to read, review, and maintain. After reevaluating the code, I have improved the handling approach.

The following changes were made:

  1. Simplified Regular Expressions for IP Extraction:

    • The previous regular expressions were overly complex, making them difficult to read and maintain.
    • Separate regular expressions are now used for IPv4 and IPv6, aligning more closely with their respective formats.
    • Based on the project's README specification:

      "The interface option will return at most 1 IPv4 and 1 IPv6 address. It will only return addresses that are not private according to RFC 1918 (IPv4) and RFC 4193 (IPv6), and is a global unicast address according to RFC 1122, RFC 4632, and RFC 4291 with the exception of IPv4 directed broadcast addresses."

    • The IPv6 regex has been simplified to focus on global unicast addresses, which typically start with 2. Additionally, now use net.ParseIP at the final validation step for reliability.
  2. Improved Efficiency in IP Retrieval:

    • The logic now avoids unnecessary computations. For example, in environments where IPv6 is disabled or not supported, the IPv6-specific regex and validation logic are not executed.
    • This ensures that only the relevant network stack (IPv4 or IPv6) is queried, streamlining the code flow and reducing potential overhead.
  3. Improved User-Agent Handling in HTTP Client

kkkgo avatar Dec 10 '24 16:12 kkkgo

Pardon me, but are you a human?

mholt avatar Dec 10 '24 18:12 mholt

Forgive me for not being a native English speaker. Regarding DDNS-related projects, I have written some scripts before 2018:

  • https://github.com/kkkgo/CloudXNS-DDNS-with-BashShell
  • https://github.com/kkkgo/dnspod-ddns-with-bashshell

In recent years, I’ve also worked on projects targeting POSIX shell scripting:

  • https://github.com/kkkgo/UE-DDNS

Some of my experience comes from the scripts I’ve written (e.g., [ue-ddns.sh]. Recently, while building a Caddy Docker, I came across this project through a recommendation. The methods for retrieving IPs overlap significantly with my earlier shell scripting ideas, and I hope my insights can contribute to this project.

Additionally, I find it interesting that this project uses UPnP to retrieve IPs. While it's challenging to implement this in shell scripts, I believe it’s worth considering using STUN servers for IP retrieval, as demonstrated in this project https://github.com/HMBSbige/NatTypeTester.

kkkgo avatar Dec 10 '24 20:12 kkkgo

I was just asking, since the post seemed very GPT-like. If you are using LLMs (or other "AI" tools) I kindly request that it be disclosed.

In any case, we'll still want some good test cases for this. :+1:

mholt avatar Dec 13 '24 17:12 mholt

The following content has been translated into English using DeepL.

I have added some test data in the caddyfile_test.go file to simulate a web scraping scenario. This includes the following tasks:

  • Extracting IP addresses from complex text content.
  • Retrieving the last valid IP address.
  • Discarding invalid IPs or dirty data.
  • Returning either IPv4 or IPv6 addresses from mixed content based on specific requirements.

Additionally, considering that no one would own a /8 IPv6 address, I have further optimized the regular expression for IPv6 to avoid extreme cases, such as incorrectly capturing 2001::gggg:1234 as 2001::.

kkkgo avatar Dec 13 '24 20:12 kkkgo

Thanks for the revisions and the disclosure.

Sounds like what is wanted is a way to scrape an IP address out of unstructured data. I don't think the SimpleHTTP IP source is the right type for that (hence the name, "Simple"). It should always work and be reliable, so no scraping.

I think what we really want is a new IP source, maybe called "IPScraper" or something, that can extract an IP out of other data. I don't think it will be as reliable, but if it's simple and doesn't give false positives, we can consider merging it in. Otherwise, it might best remain a separate module that has more niche use cases.

mholt avatar Dec 13 '24 23:12 mholt

The actual use cases are relatively simple, clean, and structured, such as the return of JSON or key-value pairs. Here are some example URLs:

  • https://ipwho.is/
  • https://speed.cloudflare.com/meta
  • https://www.apnic.net/cdn-cgi/trace
  • https://surfshark.com/api/v1/server/user

These URLs generally return a single valid and reliable IP address, which can simply be extracted from the structure. Complex scraping is not necessary, and reliability typically depends on whether the webpage responds with a 200 OK status.

For simple, clean IP address returns, this does not affect the original functionality. With strict filtering, it may even slightly enhance security. The actual use case can be limited to clean webpages containing only a JSON object, key-value pairs, or a similar structure with an IP address. However, the code also considers extreme scenarios for reliability, such as handling multiple IP addresses. This could potentially be used for interesting purposes, such as scraping IP addresses published for specific uses on certain websites, though this additional functionality is not guaranteed to be reliable.

Code Update:
I updated the code to account for extreme IPv6 extraction scenarios by selecting the longest IPv6 address string. Additionally, I included some test cases.

kkkgo avatar Dec 14 '24 07:12 kkkgo

I still think we need to create a new module for complex/regex extraction; not try to cram this into the SimpleHTTP type.

mholt avatar Dec 30 '24 19:12 mholt