fix(mdns): arithmetic overflow in mDNS TTL implementation
## Description

This PR fixes possible arithmetic overflows in the mDNS implementation when processing response packets that carry very large TTL values.
Fixes https://github.com/libp2p/rust-libp2p/issues/5943
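For context, here is a minimal sketch of the failure mode (assuming a mainstream platform where `Instant` is backed by a fixed-width monotonic clock; the exact panic message may differ):

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    // TTLs come straight off the wire, so a hostile or malformed record
    // can carry an arbitrarily large value.
    let huge_ttl = Duration::from_secs(u64::MAX);
    // The unchecked operator would panic here:
    // let _expiration = now + huge_ttl;
    // `checked_add` reports the overflow as `None` instead of panicking.
    assert!(now.checked_add(huge_ttl).is_none());
}
```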
## Notes & open questions
I’ve kept the source in its original form and added a new filter combinator. This feels like a minimal change and, in my opinion, the cleanest solution.
## Change checklist
- [x] I have performed a self-review of my own code
- [x] I have made corresponding changes to the documentation
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] A changelog entry has been made in the appropriate crates
Hey hey, thank you for your responses.
After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs. A more effective solution could be to cap the `new_expiration` value to a reasonable limit, such as one day:
```rust
// Use a reasonable maximum - 1 day should be sufficient for most scenarios.
const MAX_TTL: Duration = Duration::from_secs(60 * 60 * 24);

pub(crate) fn extract_discovered(
    &self,
    now: Instant,
    local_peer_id: PeerId,
) -> impl Iterator<Item = (PeerId, Multiaddr, Instant)> + '_ {
    self.discovered_peers()
        .filter(move |peer| peer.id() != &local_peer_id)
        // Remove this filter to keep all peers:
        // .filter(move |peer| now.checked_add(peer.ttl()).is_some())
        .flat_map(move |peer| {
            let observed = self.observed_address();
            // Cap the expiration time to avoid overflow.
            let new_expiration = match now.checked_add(peer.ttl()) {
                Some(expiration) => expiration,
                // Fall back to `now + MAX_TTL`, or `now` itself if even
                // that overflows (extremely unlikely).
                None => now.checked_add(MAX_TTL).unwrap_or(now),
            };
            peer.addresses().iter().filter_map(move |address| {
                let new_addr = _address_translation(address, &observed)?;
                let new_addr = new_addr.with_p2p(*peer.id()).ok()?;
                Some((*peer.id(), new_addr, new_expiration))
            })
        })
}
```
On the other hand, we're still not solving the root of the problem. It also seems to me that there might be issues in the `MdnsPacket::new_from_bytes` decoding path, but I guess addressing that would require a much larger effort on the whole subject.
> After reviewing the proposed changes, I realized something. It might be better not to filter out peers with large TTLs.

Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?
> It also seems to me that there might be issues in the `MdnsPacket::new_from_bytes` decoding path.

What issue?
> Can you expand on that? Why would a non-malicious peer set a TTL that is large enough to cause an overflow?

My initial reaction is that the mDNS protocol itself shouldn't be responsible for handling security. If we're implementing changes that address security concerns, we should handle them at a higher layer, outside the mDNS implementation.

Completely failing the response would be overly harsh; we'd lose all the valid data in the response because of one overly enthusiastic TTL. And if we filter out all nodes with problematic TTLs, it could make debugging more difficult in the future. That's why I believe capping the TTL is a more reasonable short-term fix, especially while we're still investigating the root cause of this issue.
> What issue?

I think we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of `MdnsResponse` to ensure that we're capturing the right information.

Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.
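To make the decode-stage idea concrete, here is a rough sketch (illustrative only; `decode`, `clamped_ttl`, and `MAX_SANE_TTL_SECS` are hypothetical names, not the actual rust-libp2p code) of clamping TTLs as soon as records leave the parser, so every later `now + ttl` is safe by construction:

```rust
use std::time::Duration;

use hickory_proto::op::Message;
use hickory_proto::rr::Record;

// One day, matching the MAX_TTL cap proposed above.
const MAX_SANE_TTL_SECS: u32 = 60 * 60 * 24;

// Clamp a record's TTL the moment it comes out of the decoder.
fn clamped_ttl(record: &Record) -> Duration {
    Duration::from_secs(u64::from(record.ttl().min(MAX_SANE_TTL_SECS)))
}

// Hypothetical decode entry point: parse the packet, then only ever derive
// expiration durations through `clamped_ttl`.
fn decode(buf: &[u8]) -> Option<Vec<(Record, Duration)>> {
    let msg = Message::from_vec(buf).ok()?;
    Some(
        msg.answers()
            .iter()
            .map(|record| (record.clone(), clamped_ttl(record)))
            .collect(),
    )
}
```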
> I think we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of `MdnsResponse` to ensure that we're capturing the right information.
>
> Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.

If the packet is malformed, we should discard the whole packet anyway. I don't think there is a way to recover from it. How are we supposed to know which other fields are correct?
This is turning into more of a design decision than we expected. We’re seeing two possible paths here:
- Discard everything due to an overly aggressive TTL.
- Cap the TTL, retain the data, and rely on the existing protocol implementations to handle the security aspects.
We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere. I feel that if we filter out everything, debugging this issue becomes very hard, if not impossible.
> I think we can't be sure whether the problematic TTLs are caused by malicious actors or if the responses were simply malformed. It seems important to investigate the creation of `MdnsResponse` to ensure that we're capturing the right information.
>
> Finally, I would argue that it might be best to address this at the decoding stage rather than later in the process. This way, we can catch any malformed data earlier on and handle it more gracefully.
We are using `hickory_proto::Message::from_vec` to decode received packets, so this would be something to discuss in the upstream hickory-dns repo.
> We’re still not entirely sure why TTL values are coming in this way, and it feels a bit “nuclear” to filter everything out, because it seems that there is some hidden problem somewhere. I feel that if we filter out everything, debugging this issue becomes very hard, if not impossible.

We could add a debug message when discarding a packet and print the discarded packet; wouldn't that solve this problem? I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus a potentially malformed packet) in the first place.
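A rough sketch of that suggestion (assuming the `tracing`-based logging rust-libp2p already uses; `handle_packet` and the log fields are illustrative, not the actual code path):

```rust
use hickory_proto::op::Message;

// Same one-day bound as discussed above.
const MAX_SANE_TTL_SECS: u32 = 60 * 60 * 24;

// Hypothetical receive path: decode, validate the TTLs, and log the whole
// packet at debug level before discarding it, so invalid TTLs stay visible.
fn handle_packet(buf: &[u8]) -> Option<Message> {
    match Message::from_vec(buf) {
        Ok(msg) if msg.answers().iter().all(|r| r.ttl() <= MAX_SANE_TTL_SECS) => Some(msg),
        Ok(msg) => {
            tracing::debug!(?msg, "discarding mDNS packet with out-of-range TTL");
            None
        }
        Err(error) => {
            tracing::debug!(%error, "failed to decode mDNS packet");
            None
        }
    }
}
```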
> We are using `hickory_proto::Message::from_vec` to decode received packets, so this would be something to discuss in the upstream hickory-dns repo.

Exactly, and that crate is still in its alpha stage. That's why I was thinking about the ease of debugging.
> We could add a debug message when discarding a packet and print the discarded packet; wouldn't that solve this problem? I don't think just capping the TTL makes debugging any easier, because then you wouldn't even discover that there was an invalid TTL (and thus a potentially malformed packet) in the first place.

Your call on this one; I'm happy to jump in and help whichever way you go! If I had to pick, I'd lean toward capping it and adding a debug message when it hits that point.

That said, your proposed solution works for me too; I just don't have the same inside scoop on how you all usually like to handle these things.
This pull request has merge conflicts. Could you please resolve them @momoshell? 🙏
@dariusc93 @elenaf9 hey hey guys, how should we proceed with this one?