
fix(mdns): arithmetic overflow in mDNS TTL implementation

Open momoshell opened this issue 8 months ago • 10 comments

Description

This PR fixes possible arithmetic overflows in the mDNS implementation when processing response packets with very large TTL values.

Fixes https://github.com/libp2p/rust-libp2p/issues/5943

Notes & open questions

I’ve kept the source in its original form and added a new filter combinator. This feels like a small, minimal change and, in my opinion, the cleanest solution.

Change checklist

  • [x] I have performed a self-review of my own code
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] A changelog entry has been made in the appropriate crates

momoshell avatar Mar 31 '25 12:03 momoshell

Hey hey, thank you for your responses. After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs. A more effective solution could be to cap the new_expiration value to a reasonable limit, such as one day or so.

// Use a reasonable maximum - 1 day should be sufficient for most scenarios
const MAX_TTL: Duration = Duration::from_secs(60 * 60 * 24);

pub(crate) fn extract_discovered(
    &self,
    now: Instant,
    local_peer_id: PeerId,
) -> impl Iterator<Item = (PeerId, Multiaddr, Instant)> + '_ {
    self.discovered_peers()
        .filter(move |peer| peer.id() != &local_peer_id)
        // Remove this filter to keep all peers
        // .filter(move |peer| now.checked_add(peer.ttl()).is_some())
        .flat_map(move |peer| {
            let observed = self.observed_address();
            // Cap the expiration time to avoid overflow
            let new_expiration = match now.checked_add(peer.ttl()) {
                Some(expiration) => expiration,
                None => {
                    // Fall back to `now` if even that overflows (extremely unlikely)
                    now.checked_add(MAX_TTL).unwrap_or(now)
                }
            };

            peer.addresses().iter().filter_map(move |address| {
                let new_addr = _address_translation(address, &observed)?;
                let new_addr = new_addr.with_p2p(*peer.id()).ok()?;
                Some((*peer.id(), new_addr, new_expiration))
            })
        })
}
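For context, the overflow that the commented-out filter and the cap both guard against can be reproduced directly with the standard library. This is a minimal sketch; `u64::MAX` stands in for an attacker-controlled or malformed TTL value:

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();

    // A sane TTL (e.g. the typical mDNS 75 minutes) adds without issue.
    assert!(now.checked_add(Duration::from_secs(4500)).is_some());

    // An absurdly large TTL overflows `Instant`'s internal representation,
    // so `checked_add` returns `None` instead of panicking.
    assert!(now.checked_add(Duration::from_secs(u64::MAX)).is_none());
}
```

This is why the plain `now + peer.ttl()` addition was the overflow hazard: `Add<Duration> for Instant` panics on overflow, whereas `checked_add` surfaces it as a recoverable `None`.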

momoshell avatar Apr 04 '25 11:04 momoshell

On the other hand, we're still not addressing the root of this problem. It seems to me that there might be issues while decoding in MdnsPacket::new_from_bytes, but I guess addressing that would require a much larger effort.

momoshell avatar Apr 04 '25 11:04 momoshell

After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs.

Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?

And it seems to me that there might be issues while decoding the MdnsPacket::new_from_bytes.

What issue?

elenaf9 avatar Apr 05 '25 00:04 elenaf9

Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?

My initial reaction is that the mDNS protocol itself shouldn't be responsible for handling security. If we're implementing changes that address security concerns, we should consider handling them at a higher layer, outside the mDNS protocol itself.

Completely failing the response would be overly harsh - we'd lose all the valid data in the response because of one overly enthusiastic TTL. And if we filter out all nodes with problematic TTLs, it could potentially make debugging more difficult in the future. That’s why I believe capping the TTL is a more reasonable short-term fix, especially while we're still investigating the root cause of this issue.
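A minimal sketch of that capping approach, assuming the `MAX_TTL` constant from the earlier snippet (`capped_expiration` is a hypothetical helper for illustration, not part of the codebase):

```rust
use std::time::{Duration, Instant};

// Hypothetical cap, mirroring the MAX_TTL constant proposed above.
const MAX_TTL: Duration = Duration::from_secs(60 * 60 * 24);

// Clamp the advertised TTL before computing the expiration, so a single
// oversized TTL degrades gracefully instead of being dropped entirely.
fn capped_expiration(now: Instant, advertised_ttl: Duration) -> Instant {
    let ttl = advertised_ttl.min(MAX_TTL);
    // After clamping to one day the addition cannot realistically overflow,
    // but fall back to `now` rather than panic just in case.
    now.checked_add(ttl).unwrap_or(now)
}

fn main() {
    let now = Instant::now();

    // A hostile TTL is clamped to one day rather than discarded.
    let huge = Duration::from_secs(u64::MAX);
    assert_eq!(capped_expiration(now, huge), now + MAX_TTL);

    // Ordinary TTLs pass through unchanged.
    let minute = Duration::from_secs(60);
    assert_eq!(capped_expiration(now, minute), now + minute);
}
```

Clamping with `Duration::min` up front also removes the need for the `match` on `checked_add` in the earlier snippet, since the input can no longer trigger the overflow path.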

What issue?

I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.

Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

momoshell avatar Apr 06 '25 10:04 momoshell

I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.

Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

If a packet is malformed, we should discard the whole packet anyway. I don't think there is a way to recover from it. How are we supposed to know which other fields are correct?

elenaf9 avatar Apr 21 '25 11:04 elenaf9

This is turning into more of a design decision than we expected. We’re seeing two possible paths here:

  • Discard everything due to an overly aggressive TTL.
  • Cap the TTL, retain the data, and rely on the existing protocol implementations to handle the security aspects.

We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere. I feel that if we filter out everything, debugging this issue becomes very hard, if not impossible.

momoshell avatar Apr 22 '25 08:04 momoshell

I think that we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of MdnsResponse to ensure that we're capturing the right information.

Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

We are using hickory_proto::Message::from_vec to decode received packets, so this would be something to discuss in the upstream hickory-dns repo.

We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere. I feel that if we filter out everything, debugging this issue becomes a pretty hard thing to do, or somewhat impossible.

We could add a debug message when discarding a packet and print the discarded package, wouldn't that solve this problem? I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus potentially malformed packet) in the first place.
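A rough sketch of that discard-and-log idea (`validate_ttl` is a hypothetical helper; in rust-libp2p the message would presumably go through `tracing::debug!` rather than `eprintln!`):

```rust
use std::time::{Duration, Instant};

/// Hypothetical validation step: reject a record whose TTL would overflow
/// the expiration computation, returning the offending value in the error
/// so the bad packet remains visible to anyone debugging.
fn validate_ttl(now: Instant, ttl: Duration) -> Result<Instant, String> {
    now.checked_add(ttl).ok_or_else(|| {
        format!(
            "discarding record: TTL of {}s overflows expiration instant",
            ttl.as_secs()
        )
    })
}

fn main() {
    let now = Instant::now();

    // A normal TTL validates fine.
    assert!(validate_ttl(now, Duration::from_secs(120)).is_ok());

    // An overflowing TTL is rejected, but the reason is logged, not silent.
    let err = validate_ttl(now, Duration::from_secs(u64::MAX)).unwrap_err();
    eprintln!("{err}"); // stand-in for `tracing::debug!` in rust-libp2p
    assert!(err.contains("overflow"));
}
```

This keeps elenaf9's property: the invalid TTL is surfaced at the point of rejection instead of being silently clamped away.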

elenaf9 avatar May 05 '25 09:05 elenaf9

We are using hickory_proto::Message::from_vec to decode received packets, so this would be something to discuss in the upstream hickory-dns repo.

Exactly, and that crate is still in its alpha stage. That's why I was thinking about the ease of debugging.

We could add a debug message when discarding a packet and print the discarded package, wouldn't that solve this problem? I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus potentially malformed packet) in the first place.

Your call on this one — I’m happy to jump in and help whichever way you go! If I had to pick, I’d lean toward capping it and tossing in a debug message when it hits that point.

That said, your proposed solution works for me too — I just don’t have the same inside scoop on how you all usually like to handle these things.

momoshell avatar May 05 '25 11:05 momoshell

This pull request has merge conflicts. Could you please resolve them @momoshell? 🙏

mergify[bot] avatar Jul 14 '25 14:07 mergify[bot]

@dariusc93 @elenaf9 hey hey guys, how should we proceed with this one?

momoshell avatar Oct 24 '25 05:10 momoshell