Lazy cohosting

Technical Background

Things in MFS are implicitly pinned (won't be garbage collected).

Copying IPFS content paths to MFS (ipfs files cp /ipfs/{CID} /path/in/mfs) does not ensure that the content is recursively fetched and stored in the local repo. All cp does is create an entry in MFS; new blocks are fetched only when the user actually navigates there and the requested block is not in the local repo.

That is why an additional ipfs refs -r call is needed to ensure all blocks are prefetched and present in the local repo (https://github.com/ipfs-shipyard/cohosting/issues/4, https://github.com/ipfs-shipyard/cohosting/pull/5).
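
The difference between the two behaviours can be sketched in shell. This is a minimal illustration, not the actual cohosting.sh code: the function names cohost_add_lazy and cohost_add_full are hypothetical, and the destination path is whatever layout the SPEC ends up using.

```shell
# Sketch only: function names are illustrative, not part of any existing tool.

# Lazy: create the MFS entry. cp does not walk the DAG, so child blocks
# are fetched later, when something actually requests them.
cohost_add_lazy() {
  cid="$1"; dest="$2"
  ipfs files mkdir -p "$(dirname "$dest")" &&
  ipfs files cp "/ipfs/$cid" "$dest"
}

# Full: same MFS entry, then list every reference recursively, which
# forces all blocks of the DAG to be fetched into the local repo.
cohost_add_full() {
  cohost_add_lazy "$1" "$2" &&
  ipfs refs -r "/ipfs/$1" > /dev/null
}
```

The output of ipfs refs -r is discarded here because only its side effect (fetching blocks) matters.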

Potential Use

Here's an idea: the lack of recursive fetch could be a feature that enables regular people to "cohost" parts of big websites such as https://ipfs.io/ipns/en.wikipedia-on-ipfs.org

Without explicit preload (https://github.com/ipfs-shipyard/cohosting/pull/5) only the parts that were actually visited are stored in local repo, saving disk space and enabling cohosting of more content overall.

Challenges

The key challenge here is UX. Initial ideas:

how to communicate the difference between "full" and "lazy" cohosting?
how to enable user to make the decision?
- when adding website via ipfs-companion user could be prompted if they want to cohost in "full" ("eager") or "lazy" fashion (while showing current total size of a website)
- for adding website via CLI or programmatic interface, there should be a separate addLazy command that does not run ipfs refs -r

cc @meiqimichelle @hacdias @autonome

Oct 08 '19 19:10 lidel

Okay, so on the SPEC, we could just change it and add (optional) and explain the differences between both types.

On cohosting.sh and ipfs-cohost: is it worth making this a different command? Can't we just pass a flag, --lazy for example?

About IPFS Companion: I think it's a great idea to prompt which type but it can also be annoying if the user tries to cohost multiple websites. Perhaps they could set a default type of cohosting (?) and there would be a different option to cohost with the non-default type.

About IPFS Desktop: should the check for updates every 12 hours make a lazy or a full copy then? That is also important in this case!

Oct 08 '19 21:10 hacdias

Extracting prefetch into separate command?

I lean on the side of doing lazy by default and making prefetch a separate command in the SPEC.

I always come back to wikipedia as the extreme example: ipfs refs will simply take too much time to be a blocking operation during add, especially in IPFS Companion GUI. If we run it async in background there is no API to track its progress, and user notices sudden bandwidth utilization without any explanation in UI.

Due to this, it makes sense to me to extract the call to ipfs refs into a standalone prefetch command (or --prefetch parameter of sync command). I think standalone command makes things simpler, parameters are noisy.

Something like:

$ cohosting.sh add docs.ipfs.io # lazy add
$ cohosting.sh sync # lazy sync
$ cohosting.sh prefetch # ensures data is in the repo

That way cohosting.sh prefetch could take a long time, but it's not a problem because it can happen in the background (cohosting.sh prefetch &) and does not block other operations.

Should prefetch happen by default? Mixing lazy and full snapshots?

prefetch proposed above would preload everything in /cohosting/ directory
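
A rough sketch of what such a prefetch could look like, assuming everything lives under a single /cohosting MFS directory. The function name cohost_prefetch is hypothetical. Because an MFS directory is itself a DAG node linking to all of its children, walking it recursively covers every snapshot below it.

```shell
# Sketch only: cohost_prefetch is an illustrative name.
# Resolves the MFS directory to its root CID, then walks the whole DAG
# so that every referenced block ends up in the local repo.
cohost_prefetch() {
  root="${1:-/cohosting}"
  root_cid=$(ipfs files stat --hash "$root") || return 1
  ipfs refs -r "/ipfs/$root_cid" > /dev/null
}
```

Run as cohost_prefetch & it stays out of the way of add and sync, though it still offers no progress reporting.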

You raised a good question:

Should we prefetch by default, as an opt-in, or an explicit manual step?
- I lean towards prefetch being manual or opt-in
  - In IPFS Companion Preferences we could have a toggle "preload cohosted websites" that flips the default from "lazy" to "full"
If we make it opt-in, would it be a global setting, or do we want a way of disabling prefetch for specific websites, making them always "lazy"?
- eg. I don't want to store entire en.wikipedia-on-ipfs.org, but I am ok with docs.ipfs.io
- If so, we need to figure out UX for communicating this choice at the moment user adds website to cohosting. Choices are:
  - "cohost pages I visited on this website"
  - "cohost full snapshot of this website"
    - ideally: "cohost full snapshot of this website up to X MB", but let's not go there yet :)

Oct 09 '19 11:10 lidel

Thanks for the feedback @lidel. I agree mostly with everything you said.

IPFS Companion UX

An idea for IPFS Companion UX: we could dynamically show a different option to the user by default: if the site is less than X MB, we emphasise the option to make a full cohosting. Otherwise, we default to lazy. However, this would not be consistent with the commands, even though I don't think that'd be a problem.

By doing it that way, we could have kind of a dropdown where we could pick the other way of cohosting.

If we don't go with that, I believe the best way is to just show 2 options and avoid terms such as "full" and "lazy". Maybe something like:

Save pages I visit on this website
Save the entire website

For the SPEC / commands

I think that syncing should follow what we set for each website so we need a way to know if the user wants to cohost fully or just lazily.

Three approaches on how to do this:

1. Directory-based

We could separate lazy and full snapshots through their directories. Lazy would go into /cohosting/lazy/example.com and full could go into /cohosting/full/example.com.

If we wanted to switch, we could just move the directory over to the other place, but that would put, for example, lazy snapshots in the full directory; or
We could keep both repositories and when listing the snapshots for a domain, we could have something like:

$ ipfs-cohost ls example.com
Snapshots for example.com:
	2019_10_22 (full)
	2019_10_21 (lazy)
	2019_10_20 (lazy)
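
A listing like the one above could be produced along these lines. This is only a sketch: cohost_ls is a hypothetical name, and it assumes snapshots are stored as date-named entries under /cohosting/full/<domain> and /cohosting/lazy/<domain>.

```shell
# Sketch only: merges snapshots from both directories, newest first,
# labelling each with the directory it came from.
cohost_ls() {
  domain="$1"
  echo "Snapshots for $domain:"
  for mode in full lazy; do
    ipfs files ls "/cohosting/$mode/$domain" 2>/dev/null |
      while read -r snap; do
        printf '\t%s (%s)\n' "$snap" "$mode"
      done
  done | sort -r
}
```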

Another con is that we wouldn't know which method of loading for a certain website would be active at the moment hence the sync command wouldn't know what to do.

2. Dot-file based

We could add a .full or .lazy (just one and having the other by default) on the root of /cohosting/example.com. That way, we would know the method of loading that was active at the moment. Thus, the sync command would certainly know what to do.
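
As a sketch of how sync could read that marker, assuming only .full exists and lazy is the default when it is absent (the function name cohost_mode is hypothetical):

```shell
# Sketch only: reports the active mode for a domain.
# `ipfs files stat` fails for a missing path, so an absent marker means lazy.
cohost_mode() {
  if ipfs files stat "/cohosting/$1/.full" > /dev/null 2>&1; then
    echo full
  else
    echo lazy
  fi
}
```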

3. "Database" file

Similar to 2, but with a file in the root /cohosting/full or /cohosting/lazy with the list of domains that are being cohosted in either way.

I'm leaning towards 2 or 3. Even though we wouldn't have the possibility to show which kind of snapshot we have, it would be much simpler.

Also, I agree with lazily cohosting by default and per domain. This also solves a problem we discussed before which was "how to save certain pages to read later".

In spite of all of this, I don't know if we should add a different command (prefetch). If the setting was per domain, sync would know what to do per domain and thus we wouldn't need prefetch.

Still, we'd need to tell somehow how to change the type of cohosting for a domain. Maybe something like:

$ cohost add ipfs.io
# ipfs.io added lazily
$ cohost sync
# ipfs.io sync'ed lazily
$ cohost add ipfs.io --full
# ipfs.io moved to full mode
$ cohost sync
# ipfs.io sync'ed fully

Wdyt about this?

Oct 10 '19 10:10 hacdias

IPFS Companion UX

If we don't go with that, I believe the best way is to just show 2 options and avoid terms such as "full" and "lazy". Maybe something like:

Save pages I visit on this website

Save the entire website

I think this is the way to go. Provides clear and uniform UX for all websites.

If the user clicks on "Cohost the entire website" we would check if it is bigger than X and display an additional "Are you sure?" dialog informing the user: "en.wikipedia-on-ipfs.org requires 650 GB of space, would you like to cohost only the pages you explicitly visited instead?". We could also check Datastore.StorageMax against the current repo size (ipfs repo stat) and disable the "full" option if the website is more than what the node can store.
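
A minimal sketch of that storage check, assuming the plain-text output of ipfs repo stat contains "RepoSize:" and "StorageMax:" lines with byte values, and treating StorageMax as a hard budget for the purpose of the check. The function names are hypothetical, and the website size is passed in by the caller.

```shell
# Sketch only: extracts one numeric field (e.g. RepoSize, StorageMax)
# from the text output of `ipfs repo stat`.
repo_stat_field() {
  ipfs repo stat | awk -v key="$1:" '$1 == key { print $2 }'
}

# Succeeds when a website of $1 bytes fits into the space left under
# StorageMax; returns 2 if the repo stats could not be read.
fits_in_repo() {
  site_bytes="$1"
  used=$(repo_stat_field RepoSize)
  max=$(repo_stat_field StorageMax)
  if [ -z "$used" ] || [ -z "$max" ]; then
    return 2
  fi
  [ $((used + site_bytes)) -le "$max" ]
}
```

A GUI could call fits_in_repo with the website's total size and grey out the "full" option when it fails.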

We could also proactively display size stats above those two cohosting options, enabling user to make more informed decision which one to click.

SPEC for marking website cohosting "lazy" vs "full"

Thank you for writing this up!

1. Directory-based

I actually like the directory approach the most :grin:

It follows the spirit of convention-over-configuration, does not introduce any special config/files and makes it extremely easy to reason about and manage via MFS/WebUI.

It adds one additional level under /cohosting/, but I believe it to be a feature. Something to think about is the user experience when they run out of disk space and want to trim things down. With dedicated directories, they would go to /cohosting/ and immediately see which websites are in full/ vs lazy/.

In the process of slimming down, they could remove full/ snapshot(s), and/or move some of them to lazy/ directory. This is much faster than (2) or (3) where user needs to go into each website to see if it is full or lazy.

Addressing some concerns:

If we wanted to switch, we could just move the directory over to the other place, but that would put, for example, lazy snapshots in the full directory

I don't think this is a problem: if I have a lazy one and visit every page on a website, I effectively have a full snapshot under lazy directory :) When website gets updated, "lazy&full" snapshot will become "lazy" again.

Another con is that we wouldn't know which method of loading for a certain website would be active at the moment hence the sync command wouldn't know what to do.

sync would do lazy/ subdirectory first, then full/ (with prefetch)

Possible edge case is if website is manually added to both lazy/ and full/, but it should not be a problem really: entries in lazy/ and full/ are separate namespaces, sync would do its job in both directories sequentially (lazy first). Blocks are deduplicated, so there is no waste.

On the GUI side, if website happens to be in both, we would simply pick the full one.

Are there any other cons I don't see here?

2. Dot-file based

This is not bad, but makes it harder to see which website is full (really takes space) and manage it. It also adds complexity, as we need to handle situation when there is no .file, and spec what happens then (do we create .lazy?)

3. "Database" file

I'd avoid this, it introduces full blown config file, making it hard to change (needs text editor on top of MFS), which we try to avoid in this experiment.

Separate prefetch command vs being a part of sync

I don't know if we should add a different command (prefetch). If the setting was per domain, sync would know what to do per domain and thus we wouldn't need prefetch.

Agree, if we have it per domain, then ipfs refs can stay in sync and be enabled by default. The only ask I have here is to implement optional sync --no-recursive-fetch flag (useful for testing entire setup end-to-end without waiting for fetch of full snapshots).
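
Putting the directory layout and the flag together, sync could look roughly like this. It is a sketch under several assumptions: the function names, the date-based snapshot naming, and the use of ipfs resolve -r on /ipns/<domain> are all illustrative choices rather than anything settled in the SPEC.

```shell
# Sketch only: records the current root of a domain as today's snapshot
# (skipped if one already exists) and prints the resolved /ipfs/ path.
cohost_snapshot() {
  s_mode="$1"; s_domain="$2"
  s_path=$(ipfs resolve -r "/ipns/$s_domain") || return 1
  s_dest="/cohosting/$s_mode/$s_domain/$(date -u +%Y_%m_%d)"
  ipfs files stat "$s_dest" > /dev/null 2>&1 || ipfs files cp "$s_path" "$s_dest"
  echo "$s_path"
}

# Processes lazy/ first, then full/. Only full/ entries are prefetched,
# and --no-recursive-fetch turns that off entirely.
cohost_sync() {
  prefetch=yes
  if [ "$1" = "--no-recursive-fetch" ]; then prefetch=no; fi
  for mode in lazy full; do
    for domain in $(ipfs files ls "/cohosting/$mode" 2>/dev/null); do
      path=$(cohost_snapshot "$mode" "$domain") || continue
      if [ "$mode" = full ] && [ "$prefetch" = yes ]; then
        ipfs refs -r "$path" > /dev/null
      fi
    done
  done
}
```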

How to set the type of cohosting for a domain

Let's make it super simple: idempotent add (lazy by default) + optional flags:

$ cohosting.sh add        docs.ipfs.io # implicitly lazy
$ cohosting.sh add --lazy docs.ipfs.io # explicitly lazy
$ cohosting.sh add --full docs.ipfs.io # explicitly full

add should check if domain already exists and update its type if needed. That way we don't add any new commands and keep spec simple.
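
The type-switching part of such an idempotent add could be sketched as follows, assuming the lazy/ and full/ directory layout. The function name cohost_add is hypothetical, and taking the first snapshot is left out to keep the focus on how the type is set and changed.

```shell
# Sketch only: ensures the domain exists under the requested type,
# moving it over if it currently lives under the other one.
cohost_add() {
  mode=lazy
  case "$1" in
    --lazy) mode=lazy; shift ;;
    --full) mode=full; shift ;;
  esac
  domain="$1"
  if [ "$mode" = lazy ]; then other=full; else other=lazy; fi

  if ipfs files stat "/cohosting/$mode/$domain" > /dev/null 2>&1; then
    return 0  # already cohosted with the requested type
  elif ipfs files stat "/cohosting/$other/$domain" > /dev/null 2>&1; then
    ipfs files mkdir -p "/cohosting/$mode" &&
    ipfs files mv "/cohosting/$other/$domain" "/cohosting/$mode/$domain"
  else
    ipfs files mkdir -p "/cohosting/$mode/$domain"
  fi
}
```

Running the same command twice is a no-op, and switching types keeps the existing snapshots because they are moved rather than recreated.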

Oct 10 '19 12:10 lidel

Are there any other cons I don't see here?

What if you want to stop fully archiving a website without removing its full snapshots?

Agree with what you said about 2 and 3.

How to set the type of cohosting for a domain

👍

Oct 10 '19 12:10 hacdias

What if you want to stop fully archiving a website without removing its full snapshots?

If you mark existing full snapshot as lazy, blocks that are already in local repo will still be there. When website gets updated, new blocks won't be fetched unless user explicitly requested new pages, making the switch a fairly smooth and predictable process.

Oct 10 '19 12:10 lidel

So your suggestion is to move the domain from the full to the lazy dir, right?

Oct 10 '19 12:10 hacdias