Missing JA4 fingerprints in output
Hi :wave:. While working on a personal project that implements JA4, I noticed some discrepancies when comparing its JA4 (TCP) fingerprint output against ja4.py for some of the TLS PCAP files in your repo.
For example, my implementation produces the following TLS fingerprints from tls-handshake.pcapng:
$ python pcap.py --file ~/git/ext/ja4/pcap/tls-handshake.pcapng | sort | uniq -c | sort -nr
54 t13d1516h2_8daaf6152771_e5627efa2ab1
5 t13d1515h2_8daaf6152771_f37e75b10bcc
3 t13d1516h1_8daaf6152771_e5627efa2ab1
1 t13d1517h1_8daaf6152771_6cdcb247c39b
1 t13d151400_8daaf6152771_de4a06bb82e3
With ja4.py I get fewer occurrences of the most common fingerprint (49 instead of 54):
$ python ja4.py --ja4 ~/git/ext/ja4/pcap/tls-handshake.pcapng | grep -E -o 't\w{9}_\w{12}_\w{12}' | sort | uniq -c | sort -nr
49 t13d1516h2_8daaf6152771_e5627efa2ab1
5 t13d1515h2_8daaf6152771_f37e75b10bcc
3 t13d1516h1_8daaf6152771_e5627efa2ab1
1 t13d1517h1_8daaf6152771_6cdcb247c39b
1 t13d151400_8daaf6152771_de4a06bb82e3
With tshark (TShark (Wireshark) 4.2.6, Git commit fca52ffc018f) I get the same counts as my implementation:
$ tshark -r ~/git/ext/ja4/pcap/tls-handshake.pcapng -Y 'tls.handshake.type == 1' -Tfields -e 'tls.handshake.ja4' | grep '^t' | sort | uniq -c | sort -nr
54 t13d1516h2_8daaf6152771_e5627efa2ab1
5 t13d1515h2_8daaf6152771_f37e75b10bcc
3 t13d1516h1_8daaf6152771_e5627efa2ab1
1 t13d1517h1_8daaf6152771_6cdcb247c39b
1 t13d151400_8daaf6152771_de4a06bb82e3
Looking into this a bit further, I realised that the caching functionality in common.py is keyed by stream. So, if there is more than one fingerprint in a stream, does it get overwritten in the cache? Example stream:
I was able to resolve this locally by hacking together a change that uses a tuple containing the stream and frame number as the cache key, but this probably isn't suitable because it results in multiple outputs for a stream, instead of multiple fingerprints inside a single stream output.
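To illustrate the difference between the two cache keys, here is a minimal sketch. It assumes the cache behaves like a plain dict, which may not match common.py exactly; the records, function names, and stream/frame numbers are all hypothetical.

```python
# Hypothetical (stream, frame, ja4) records; the numbers are made up for illustration.
records = [
    (0, 4, "t13d1516h2_8daaf6152771_e5627efa2ab1"),
    (0, 19, "t13d1515h2_8daaf6152771_f37e75b10bcc"),  # second ClientHello, same stream
    (1, 7, "t13d1516h2_8daaf6152771_e5627efa2ab1"),
]


def cache_by_stream(records):
    """Key the cache on the stream only: a later fingerprint replaces an earlier one."""
    cache = {}
    for stream, _frame, ja4 in records:
        cache[stream] = ja4
    return cache


def cache_by_stream_and_frame(records):
    """Key the cache on (stream, frame): every fingerprint gets its own entry."""
    cache = {}
    for stream, frame, ja4 in records:
        cache[(stream, frame)] = ja4
    return cache


by_stream = cache_by_stream(records)
by_stream_and_frame = cache_by_stream_and_frame(records)
print(len(by_stream), len(by_stream_and_frame))
```

With the stream-only key, one of the three fingerprints is lost. With the tuple key all three survive, but each becomes a separate cache entry, which is why the output ends up with multiple records per stream.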
Thanks for bringing this up! I think we should add any additional JA4s seen in a stream to that stream's output as JA4.2, etc., like we do with JA4X. Would that work?
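A rough sketch of what that could look like, assuming each stream has a dict-like entry in the cache. The helper name `add_fingerprint`, the entry layout, and the exact key numbering (`JA4`, then `JA4.2`, `JA4.3`, ...) are assumptions for illustration, not the actual ja4.py code.

```python
def add_fingerprint(entry, ja4):
    """Store ja4 under 'JA4', or under 'JA4.2', 'JA4.3', ... if 'JA4' is already set."""
    if "JA4" not in entry:
        entry["JA4"] = ja4
        return entry
    n = 2
    while f"JA4.{n}" in entry:
        n += 1
    entry[f"JA4.{n}"] = ja4
    return entry


# Hypothetical (stream, ja4) pairs in capture order.
observed = [
    (0, "t13d1516h2_8daaf6152771_e5627efa2ab1"),
    (0, "t13d1515h2_8daaf6152771_f37e75b10bcc"),
    (1, "t13d1516h2_8daaf6152771_e5627efa2ab1"),
]

cache = {}
for stream, ja4 in observed:
    # One entry per stream; extra fingerprints are appended rather than overwritten.
    add_fingerprint(cache.setdefault(stream, {"stream": stream}), ja4)

for entry in cache.values():
    print(entry)
```

This keeps a single output record per stream while preserving every fingerprint seen in it.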
Considering the core functionality currently involves extracting fingerprints from each stream, that makes sense to me.
I'm simply grepping for the JA4 pattern, so it doesn't matter where it is in the output for my use-case. Thanks.
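For reference, a Python equivalent of the `grep -E -o ... | sort | uniq -c` pipeline above might look like the following sketch. The regular expression is the one from the grep command; the `sample` text only imitates tool output and is not the real ja4.py format.

```python
import re
from collections import Counter

# Same pattern as the grep command: 't', 9 word characters, then two 12-character parts.
JA4_PATTERN = re.compile(r"t\w{9}_\w{12}_\w{12}")


def count_ja4(text):
    """Count every JA4-looking string in text, regardless of which field holds it."""
    return Counter(JA4_PATTERN.findall(text))


# Hypothetical output lines, including an extra 'JA4.2' field on the first stream.
sample = (
    "{'stream': 0, 'JA4': 't13d1516h2_8daaf6152771_e5627efa2ab1', "
    "'JA4.2': 't13d1515h2_8daaf6152771_f37e75b10bcc'}\n"
    "{'stream': 1, 'JA4': 't13d1516h2_8daaf6152771_e5627efa2ab1'}\n"
)

counts = count_ja4(sample)
for ja4, n in counts.most_common():
    print(n, ja4)
```

Because the match does not depend on the field name, additional fingerprints are counted no matter which key they appear under.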
Thanks all!