
IPNS publishing performance

benhylau opened this issue on May 19, 2018 • 24 comments

IPNS publishing takes more than a minute to propagate which is currently a blocker.

benhylau avatar May 19 '18 18:05 benhylau

Running IPFS v .15 on a Debian-based system.

When I try to publish, it takes over 2 minutes. Additionally, when I try to access the published IPNS name through any gateway, it also takes ~2 minutes.

I have tried the following, with no change:

  • using 127.0.0.1:8080 as the gateway on the node where the data resides
  • checked swarm peers (over 600 at all times)
  • tried on both NATed and non-NATed IPFS servers

Example: Publish takes ~ 2 minutes

ipfs@IP:/var/www/html/hls$ date > date.txt
ipfs@IP:/var/www/html/hls$ ipfs add date.txt
added QmXxk7Mn2LaKjsbFQHvCLtaXUChMfqEmqJrKxRdL6y9R1S date.txt
ipfs@IP:/var/www/html/hls$ time ipfs name publish QmPn6i5JxPnobYcHqdieDPH82nXnhmCHLqkgfL95jQ4j3W
Published to QmU6NeD2Uu34WKest1NZGvGmScLhN1zVo66K35GeE6Jft2: /ipfs/QmPn6i5JxPnobYcHqdieDPH82nXnhmCHLqkgfL95jQ4j3W

real	2m0.267s
user	0m0.044s
sys	0m0.004s
ipfs@IP:/var/www/html/hls$

Example: requesting IPNS content takes ~2 minutes

root@HTPC:~# time wget https://ipfs.io/ipns/QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo
--2018-05-19 12:42:13--  https://ipfs.io/ipns/QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo
Resolving ipfs.io (ipfs.io)... 209.94.78.78
Connecting to ipfs.io (ipfs.io)|209.94.78.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 221 [text/plain]
Saving to: ‘QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo’

QmSp4Do1bwo8HjgQXwU 100%[===================>]     221  --.-KB/s    in 0s      

2018-05-19 12:44:04 (13.8 MB/s) - ‘QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo’ saved [221/221]

real	1m50.735s
user	0m0.008s
sys	0m0.004s

However, another user trying this had no problems; e.g. https://ipfs.io/ipns/Qmbk1vmcgSczUjiz6KPjV3DoHXirPKhBbQKEttCz2gvxFR/hls/date.txt resolves in seconds.

darkdrgn2k avatar May 20 '18 01:05 darkdrgn2k

@darkdrgn2k are you starting the wget immediately after the ipfs name publish completes? If you run wget a second time, does it take another 2 minutes, or does it become immediate?

benhylau avatar May 20 '18 06:05 benhylau

Whether it's a second or an hour later makes no difference.

I tried a day later with the same results.

Even after the first successful try, the second one still takes just as long.

darkdrgn2k avatar May 20 '18 17:05 darkdrgn2k

@flyingzumwalt shall we link this into an ipfs repo?

benhylau avatar May 20 '18 21:05 benhylau

Hey @benhylau, just now seeing this issue. To get good latencies, you will want to use IPNS over pubsub. There are some docs here: https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#ipns-pubsub

Also see the ipfs name pubsub subcommand for some control commands.

You will need to have this feature enabled on both the publishers and the resolvers (your gateway nodes, ideally) and they will need to be connected in some sort of topology (to receive pubsub messages from a publisher, you have to be connected through some graph of subscribed nodes).
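Roughly, the setup looks like this on both the publisher and each resolver/gateway node:

# Start the daemon with the experimental feature on every node involved:
ipfs daemon --enable-namesys-pubsub

# Sanity checks: confirm the feature is active and see which names this node
# is subscribed to (the subscription only appears after the first resolve):
ipfs name pubsub state
ipfs name pubsub subs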

Both @lgierth and @flyingzumwalt are out on vacation right now, so you can ping me if you need anything until they are back.

whyrusleeping avatar Jun 02 '18 11:06 whyrusleeping

@whyrusleeping we looked at this a while ago, but it does not seem to have any effect. Are we missing something, maybe?

ipfs daemon --enable-namesys-pubsub
ipfs name pubsub state

enabled

time ipfs name publish QmXZUTbf28Ra85GZ1gHqURYh1v3vAh9ZBXMs1Te15h6vGp
Published to QmTVrsREgGQ7QN5wGhDJshAmPXPV4ZVcw7XQ9X7wDSwEWc: /ipfs/QmXZUTbf28Ra85GZ1gHqURYh1v3vAh9ZBXMs1Te15h6vGp

real	4m26.323s
user	0m0.080s
sys	0m0.036s

darkdrgn2k avatar Jun 02 '18 11:06 darkdrgn2k

@darkdrgn2k ah, yeah. Under the hood it publishes to both pubsub and the slow DHT. Subscribers should receive updates very quickly (less than a second, ideally). We should probably add an option to publish only to pubsub, but doing that would prevent non-pubsub users from resolving the content (the DHT is a fallback).

If you just want to test out the pubsub stuff working quickly, you could add a --timeout=2s flag to your publish command, and see how things work.
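For example, reusing the hash from your publish above (the timeout just caps how long the command waits; expect the command itself to exit with a "context deadline exceeded" error, while the pubsub message should still go out):

# Cut the publish off after 2s: the pubsub send happens quickly, only the
# slow DHT put gets aborted.
time ipfs name publish --timeout=2s QmXZUTbf28Ra85GZ1gHqURYh1v3vAh9ZBXMs1Te15h6vGp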

whyrusleeping avatar Jun 02 '18 12:06 whyrusleeping

How are remote clients subscribed? Is it just a matter of enabling it on the client?

I would assume that if we are asking for the data from a gateway, it needs to have this enabled as well? And ipfs.io probably doesn't run experimental flags?

Also, another user is able to publish via regular IPNS in a fraction of the time.

What contributes to IPNS speed?

darkdrgn2k avatar Jun 02 '18 13:06 darkdrgn2k

Yeah, if the client has the feature enabled, they will use it. The first resolve will still be slow, as that's when the subscription starts, and since it doesn't have any value at the beginning of the subscription it has to fall back to the DHT, but subsequent lookups will be fast.

If you are asking for data from the gateway, the gateway will need this feature enabled. And yes, our ipfs.io gateways do not run experimental flags, though you might be able to ask @lgierth nicely when he comes back ;)
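Roughly what to expect on a pubsub-enabled resolver (name taken from the wget example earlier in this thread; the timings are expectations, not measurements):

time ipfs name resolve /ipns/QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo   # first lookup: starts the subscription, falls back to the DHT (slow)
time ipfs name resolve /ipns/QmSp4Do1bwo8HjgQXwUmicqkg8ddET1DZKBiCzxw2TdCyo   # after the publisher's next update: answered from pubsub (fast)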

whyrusleeping avatar Jun 02 '18 15:06 whyrusleeping

What contributes to IPNS speed?

Standard (non-pubsub) IPNS is a simple DHT 'Put' operation. The main reason it is slow is that connectivity across the network is rather poor right now (too many clients behind NATs). The way our DHT (Kademlia) works is that it selects the K peers (we use 20, as suggested by the paper) closest by XOR metric to the key you are publishing to, and sends them the value. With how many badly NATed clients we're seeing, the majority of these dials time out, making the 'Put' operation slow. To address this, we are working on a few different things, the two main ones being: first, an option for clients to auto-detect whether they are behind a bad NAT and, if so, not participate in the DHT; second, better relay logic, so we can ask other peers to relay connections for us to these NATed peers.
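As a toy illustration of that selection step (this is not IPFS code; real peer IDs are 256-bit hashes, shrunk here to 8-bit values so the XOR arithmetic is visible):

#!/bin/bash
# Kademlia-style selection: the record goes to the K peers whose IDs have the
# smallest XOR distance to the record's key. If many of those peers sit behind
# bad NATs, each dial has to time out before the Put can finish.
key=0x5a
peers="0x12 0x58 0xf0 0x4b 0x5e 0x9c 0x03 0x61"
K=3
for p in $peers; do
  printf '%3d  %s\n' $(( key ^ p )) "$p"   # XOR distance, then peer ID
done | sort -n | head -n "$K"              # the K closest peers receive the Put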

whyrusleeping avatar Jun 02 '18 15:06 whyrusleeping

Thanks for the explanation.

If I understand correctly, it is currently pure luck whether IPNS will be fast or slow.

Say I have a closed network (CJDNS) that has no access to the Internet. I place an IPFS server on a machine that can route to both networks (Internet and CJDNS), and bootstrap the nodes on the CJDNS network with that server's address.

Would the CJDNS nodes still get populated with unreachable Internet IP addresses, and as such still see the IPNS delays we have on the regular network?

darkdrgn2k avatar Jun 02 '18 19:06 darkdrgn2k

If I understand correctly, it is currently pure luck whether IPNS will be fast or slow.

Yes, as things stand now, it's pretty much just luck that determines DHT publishing speed.

If you ran ipfs entirely on a cjdns network (with no egress to the normal internet), then you wouldn't have a problem. The current issues are only happening because so many users are running ipfs nodes at home without any NAT traversal or port forwarding.
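For the setup you describe, the bootstrap side would look roughly like this (the fc00:: address and peer ID are placeholders): wipe the default bootstrap list so the CJDNS nodes never learn about unreachable Internet peers, then add only the dual-homed server:

# On each CJDNS-only node:
ipfs bootstrap rm --all
ipfs bootstrap add /ip6/fc00::1/tcp/4001/ipfs/<server-peer-id>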

whyrusleeping avatar Jun 03 '18 05:06 whyrusleeping

IPNS over PUBSUB seems to be the way to go after all.

We time out the publish so it doesn't spend time on the DHT put.

The first grab of the IPNS name is still a bit slow, but I think that only affects the "first" person pulling from that gateway.

darkdrgn2k avatar Jun 25 '18 12:06 darkdrgn2k

@darkdrgn2k before we close this one:

  1. Is our 1s timeout a little aggressive?
  2. Can we now drop our chunk duration to 10s?

Thanks for the help @whyrusleeping :)

benhylau avatar Jun 26 '18 09:06 benhylau

@benhylau Not a problem :) Glad to hear things are working, please please please let me know how it goes or if anything goes wrong. I'll be watching the stream when it's up!

whyrusleeping avatar Jun 26 '18 17:06 whyrusleeping

I have been running the stream for several hours now and it works well (http://tomesh.org/stream/).

The only issue so far is that IPNS sets a 1-minute cache, creating a "stall" before new segments appear.

darkdrgn2k avatar Jun 26 '18 23:06 darkdrgn2k

@darkdrgn2k ah, interesting. That happens only on new streams, right? Things run smoothly once it gets started?

The one-minute timeout is a server-side namesys cache. I think we should actually not be using that cache if we're using the pubsub routing (as the pubsub router effectively maintains its own cache). I'm gonna call this one a bug and file it for you.
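In the meantime, one knob you could experiment with is the --ttl option on publish, which hints how long resolvers should cache the record; I can't promise the gateway's namesys cache honours it in this pubsub setup, but it's cheap to try:

# <playlist-hash> is a placeholder for the hash of your current.m3u8
ipfs name publish --timeout=5s --ttl=10s <playlist-hash>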

whyrusleeping avatar Jun 27 '18 00:06 whyrusleeping

While spinning up a production rig for the stream, we found some issues with pubsub IPNS.

Background: HLS creates ~15-second chunks of video that are hashed and added to IPFS. A playlist listing the sequence of chunks is kept in a text file, the M3U8 file. The player constantly re-downloads this file to learn of new chunks.

We publish the m3u8 text file every ~15 seconds (after each new chunk is created and added to IPFS).

We add it as follows:

m3u8hash=$(ipfs add current.m3u8 | awk '{print $2}')
ipfs name publish --timeout=5s $m3u8hash &
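For context, those two lines sit in a loop roughly like this (a simplified sketch):

# Simplified sketch: in reality the publish is triggered each time a new
# segment is written, not by a fixed sleep.
while true; do
  m3u8hash=$(ipfs add current.m3u8 | awk '{print $2}')
  ipfs name publish --timeout=5s "$m3u8hash" &   # backgrounded so a slow publish never blocks the next segment
  sleep 15
done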

We found that IPNS would at times get stuck for a little while before updating the list. The data returned would be stale, and sometimes as many as 10 updates would pass before it returned to normal.

Other times it would seem to work decently.

A few things to possibly consider:

  • Is a 5s timeout enough to publish over IPNS pubsub?
  • Is it still publishing over the DHT, with the gateway maybe picking that up instead?

After switching to a plain HTTP-hosted version of the m3u8 file, things seem to stream much better.

darkdrgn2k avatar Jul 02 '18 05:07 darkdrgn2k

First we need to move the IPNS publishing into a separate process. For now I am doing it manually to observe the results, using publish and resolve scripts. They look like this:

publish

grep SEQUENCE live/current.m3u8 && ipfs add live/current.m3u8 | awk '{print $2}' | ipfs name publish --timeout=15s 2>/dev/null || true

resolve

ipfs id | jq .ID | ipfs name resolve --timeout=15s 2>/dev/null | ipfs cat | grep SEQUENCE

A publish-resolve run looks like this:

root@ipfs-server:~# ./publish
#EXT-X-MEDIA-SEQUENCE:572
root@ipfs-server:~# ./resolve
Error: argument "ipfs-path" is required <-- Immediately after publishing we hit this awkward window
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:572 <-- Yay we resolved to the new playlist
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:542 <-- Huh? How did we resolve back to the old one?
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:542 <-- And it stayed like that for a while; is this an IPFS bug?
...
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:542
root@ipfs-server:~# ./resolve
Error: argument "ipfs-path" is required <-- We enter these awkward times again with our 15s timeout
root@ipfs-server:~# ./resolve
Error: argument "ipfs-path" is required <-- This is when our players crap out; we don't want to be here
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:572 <-- Okay we are back to the current playlist
root@ipfs-server:~# ./resolve
#EXT-X-MEDIA-SEQUENCE:572 <-- These return in < 1s, we want our player to see this!

Let's see across multiple nodes:

root@ipfs-server:~# date +%M:%S && grep SEQUENCE live/current.m3u8 && ipfs add live/current.m3u8 | awk '{print $2}' | ipfs name publish --timeout=5s || true && grep SEQUENCE live/current.m3u8 && ipfs id | jq .ID | ipfs name resolve | ipfs cat | grep SEQUENCE && curl -s "http://ipfs-mirror-0.mesh.world:8080/ipns/$(ipfs id | jq .ID | tr -d \")" | grep SEQUENCE && date +%M:%S
46:46
#EXT-X-MEDIA-SEQUENCE:631 <-- The sequence number of the current.m3u8 I am publishing
Error: Post http://127.0.0.1:5001/api/v0/name/publish?encoding=json&key=self&lifetime=24h&resolve=true&stream-channels=true&timeout=5s: context deadline exceeded
#EXT-X-MEDIA-SEQUENCE:631 <-- By the time I am done publishing ffmpeg is here
#EXT-X-MEDIA-SEQUENCE:572 <-- My local resolves to old value still, IPFS bug?
#EXT-X-MEDIA-SEQUENCE:631 <-- On the ipfs-mirror-0 node it picks up the new playlist immediately
46:52

If we put in those 15s timeouts in both publish and resolve:

root@ipfs-server:~# date +%M:%S && grep SEQUENCE live/current.m3u8 && ipfs add live/current.m3u8 | awk '{print $2}' | ipfs name publish --timeout=15s 2>/dev/null || true && grep SEQUENCE live/current.m3u8 && ipfs id | jq .ID | ipfs name resolve --timeout=15s 2>/dev/null | ipfs cat | grep SEQUENCE && curl -s "http://ipfs-mirror-0.mesh.world:8080/ipns/$(ipfs id | jq .ID | tr -d \")" | grep SEQUENCE && date +%M:%S
52:53
#EXT-X-MEDIA-SEQUENCE:655 <-- Published
#EXT-X-MEDIA-SEQUENCE:656 <-- Current
#EXT-X-MEDIA-SEQUENCE:648 <-- Local resolve
#EXT-X-MEDIA-SEQUENCE:655 <-- Remote mirror resolve
53:08

I think the way forward is to publish less frequently and resolve more aggressively, so that most of the time resolutions by the player return fast. A few things will trip up the player:

  • Going back to an old playlist
  • Slow returns that hit the timeout

So we should aim for minimal downtime and have the playlist always available in < 1s.
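One way to get there would be a small retry wrapper around the resolve, sketched below (the name and timeouts are placeholders we would still have to tune):

#!/bin/bash
# Sketch: retry the IPNS resolve with a short timeout so the common case
# returns in well under a second, and a slow or failed attempt gets retried
# instead of being handed to the player.
NAME=/ipns/<publisher-peer-id>
for attempt in 1 2 3; do
  if path=$(ipfs name resolve --timeout=2s "$NAME" 2>/dev/null); then
    ipfs cat "$path" | grep SEQUENCE
    exit 0
  fi
done
echo "resolve failed after 3 attempts" >&2
exit 1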

benhylau avatar Jul 02 '18 08:07 benhylau

Sometimes we'd run into this bug where on one node you resolve:

root@ipfs-server:~# echo "QmWCNF1APg3nbsLyQDcrMTsd7hnjyGL1ddoi9MNrRdPQ2A" | ipfs name resolve --timeout=15s 2>/dev/null | ipfs cat
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:15
#EXT-X-MEDIA-SEQUENCE:752
#EXT-X-START:TIME-OFFSET=-60
#EXTINF:18.033000,
http://ipfs-server.mesh.world:8080/ipfs/QmZdMYtK2eVwVS3aeCfzPLuWHr9hw1Xg9uGFcYc8jpCR1y
#EXTINF:18.667000,
http://ipfs-server.mesh.world:8080/ipfs/QmS3gvcguGtdGAFhQ6rkgv35NCQfbSbNxmACbnMzRF2qTy
#EXTINF:9.766000,
http://ipfs-server.mesh.world:8080/ipfs/QmUyrb2522df9f1YxBUHzsy76MmQhXHeP6uw9zvdiEcYMc
#EXTINF:13.800000,
http://ipfs-server.mesh.world:8080/ipfs/QmVYXzkaj8dNXoEgmyYLnS8mvr2Xum3YV2AQyq2qJHs3UD
#EXTINF:19.734000,
http://ipfs-server.mesh.world:8080/ipfs/QmRRyWHD4QCd1RXuMMupPHWAnXRC758JLh7v1HgnZzy31k
#EXTINF:16.666000,
http://ipfs-server.mesh.world:8080/ipfs/QmNRdG631Zy4reatckDdJuPAo8x1zj4fjhmDDGc3q8T54A
#EXTINF:8.334000,
http://ipfs-server.mesh.world:8080/ipfs/QmUT7CZHAN7ans8B5UMG8KkwzsRy8FePQdUzYynvAssz6G
#EXTINF:16.666000,
http://ipfs-server.mesh.world:8080/ipfs/QmSBocE4LXR6r9uQBkB7pqnecFHznvRRS1UYahzgbKYYW9
#EXTINF:16.667000,
http://ipfs-server.mesh.world:8080/ipfs/QmPmv4qZ7LDRaxsMDRhe6Ye9ECrJCT5UpvqPdVAj1tMTi8
#EXTINF:17.200000,
http://ipfs-server.mesh.world:8080/ipfs/QmRLTdboXxhEeHqfcjCD8JBLs6YBsdoxSLRiCc5cjizFNP
#EXTINF:16.333000,
http://ipfs-server.mesh.world:8080/ipfs/QmeGuuVjECqdJ6hcN6AuR5JS1472JthpyBnarwG38kgCXF
#EXTINF:13.934000,
http://ipfs-server.mesh.world:8080/ipfs/Qmd4zwWYYgS8eGMKoY1ngWrMiGn6eMaGWi63vigN1BmbQ1
#EXTINF:16.666000,
http://ipfs-server.mesh.world:8080/ipfs/QmREnHmTSQGNCeqsMzpbeSVgpYohVUVhgSZtmBVCBqcYDc
#EXTINF:8.000000,
http://ipfs-server.mesh.world:8080/ipfs/QmbEGY7x59o5ZhwSDXmpXtbDvkKGbqRNC4XLW9u2EnZorT
#EXTINF:16.067000,
http://ipfs-server.mesh.world:8080/ipfs/QmNkAaLAemCbiHdbjFFzT9kTksvcXvh71JPfhoEddZKDsv
#EXTINF:16.200000,
http://ipfs-server.mesh.world:8080/ipfs/QmaTNvn7mzS2odSx8FqQnrzEuBiytpyfUcuLng3RxHssSJ
#EXTINF:19.767000,
http://ipfs-server.mesh.world:8080/ipfs/QmdmhRqv1X1nLPoARKA48AdU9ZUKSWaXyfMh1BLdS7ugGh
#EXTINF:12.500000,
http://ipfs-server.mesh.world:8080/ipfs/QmeDjuZq4L44SKrMtKANUD7wkXB2NU3kzMVebus19ydY5L
#EXTINF:11.233000,
http://ipfs-server.mesh.world:8080/ipfs/QmUPYoUJ5mcsDbSPiL5spCefVZ6mTJBDRRreMmQ6EW96iq
#EXTINF:20.067000,
http://ipfs-server.mesh.world:8080/ipfs/QmPx6D5Xiqhxo6SNx2gqHbLfgEuKzj62G9F4KwE2mgknwy
#EXT-X-ENDLIST

And on another node (in this case the mirror):

root@ipfs-mirror:~# echo "QmWCNF1APg3nbsLyQDcrMTsd7hnjyGL1ddoi9MNrRdPQ2A" | ipfs name resolve --timeout=15s 2>/dev/null | ipfs cat
Error: this dag node is a directory
root@ipfs-mirror:~# echo "QmWCNF1APg3nbsLyQDcrMTsd7hnjyGL1ddoi9MNrRdPQ2A" | ipfs name resolve --timeout=15s 2>/dev/null | ipfs cat
Error: argument "ipfs-path" is required

It is not a transient thing; ipfs-mirror will resolve like this indefinitely (it has been like this for at least half an hour).

cc. @whyrusleeping @lgierth

benhylau avatar Jul 02 '18 17:07 benhylau

@benhylau very interesting... I wonder if it's a connectivity thing. If a pubsub IPNS node gets disconnected from the swarm of the publisher, it won't be able to resolve until they get connected again (it's not receiving any updates). Could you verify with ipfs swarm peers that the two are connected while the failure is occurring?
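For example, on the mirror (substitute the publisher's actual peer ID):

ipfs swarm peers | grep <ipfs-server-peer-id>   # no output means the mirror is not connected to the publisher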

whyrusleeping avatar Jul 02 '18 18:07 whyrusleeping

@whyrusleeping unfortunately we have to remove the IPNS dependency for v1.0, which is the tag we intend to use for Our Networks. Perhaps we can have a sprint with @lgierth to debug these IPNS issues we are experiencing :)

benhylau avatar Jul 08 '18 05:07 benhylau

@benhylau I'm pretty sure I know the issue now. The pubsub code doesn't add any connection manager tags, so ipfs doesn't know to keep connections to pubsub topic friends alive, meaning they will randomly be closed. I'm gonna try and hack that in this week and we can see how it helps.
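Until that lands, a crude interim workaround (just a manual hack, nothing built into go-ipfs) would be to re-assert the connection from the mirror periodically, so the connection manager can't quietly drop the pubsub peer:

# e.g. from cron every minute; the multiaddr and peer ID are placeholders
ipfs swarm connect /ip4/<server-ip>/tcp/4001/ipfs/<ipfs-server-peer-id>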

whyrusleeping avatar Jul 11 '18 09:07 whyrusleeping

This will re-enable IPNS for testing purposes: https://github.com/tomeshnet/ipfs-live-streaming/tree/enable-ipns

benhylau avatar Jul 19 '18 02:07 benhylau