Legacy-Research-Engine Discussion: How to approach server architecture

Open blackforestboi opened this issue 7 years ago • 22 comments

The plan for a user's autonomy / Server infrastructure

Hey folks,

thanks for your input on this.

What is the goal?

Our promise is to keep users in full control of their data and to allow effective, asynchronous sharing of content recommendations, content associations and metadata. The data therefore needs to be hosted somewhere in the cloud, but under the users' control. This means that as soon as a server is needed, the idea is to make it as easy as possible for users to set up their own server with our firmware (e.g. a Docker container) that handles all the data storage and processing.

Secondary effect.

Making this form of decentralisation the default architecture could contribute to a more decentralised internet infrastructure, since we also reach many non-technical internet users. Because a user's data is always available via their own server, this could also build the foundation for other decentralised projects to reach broader use (e.g. P2P social networks or decentralised web search engines like Yacy/Sersia).

We hope that it would lead to a shift towards ecosystems that form around users, not around platforms. The current centralised, platform-focused situation leads to an unhealthy amassing of power on the web.

The architecture chart below shows how I imagine the system.

Stage 1:

In the first stage, it's just client-side software: the browser extension. No communication with the outside world is needed yet. The database currently used is PouchDB.

Stage 2:

Providing a server that handles all the logic of syncing with the attached services and processing the data (like building the search indexes or analysing for related content). It also includes the first version of the communication API (called "Ragnorok-Module" as an homage to Daniel Suarez' Daemon & Freedom books ;) ). In this stage the API is there to communicate with the different clients a user uses, as well as to provide a web-based interface the user can access from anywhere on the web. Here we possibly have to sync an index to the local machine in order to provide offline support.

Stage 3:

As soon as the system is working for the users themselves, we update the API so it can talk to other APIs in the network and exchange information, like content recommendations, or provide searchable indexes of the pages other users visited. In this stage people can start following each other and thereby build circles of trust.

I have a couple of questions:

How seamless can the process be made for the user to set up the servers? (Important for non-technical users, as most of ours will be.)
What kind of problems do you see with this architecture?
How can we make the code that runs on the server replicable and agnostic of the server choice?
What storage solution do you know that is capable of running/syncing in an extension as well as on the server? Maybe also including a built-in permission system to handle access. Afaik Pouch/CouchDB don't have that; remotestorage.js does, for example.
If we use a system like IPFS/IPDB, can we also host and run code there?
As far as I know, searching in encrypted datasets is not yet mature, so the question is: is it possible to add an encrypted access layer that would effectively sandbox the data and its processing, making it unavailable to outside people without the right credentials?

Dec 01 '16 20:12 blackforestboi

@tilgovi @bigbluehat thanks for your input here.

Dec 01 '16 20:12 blackforestboi

Just some quick thoughts:

How seamless can the process be made for the user to set up the servers?

It was the complication of setting up one's own servers that brought on the "social networking" and "software-as-a-service" onslaught. That said, groups like owncloud.org, cozy.io, and sandstorm.io provide some "DIY platform-as-a-service" options...but every one of them is as idiosyncratic as the next...

Ideally...the browser === server (+/- renting compute power and/or persistence/longevity).

What kind of problems do you see with this architecture?

Mostly the arrows coming from DB to idiosyncratic services (Evernote, etc). Those could/should be plugins and work as their own data transformers/publishers/sync-er-thingies.

How can we make the code that runs on the server replicable and agnostic of the server choice? What storage solution do you know that is capable of running/syncing in an extension as well as on the server? Maybe also including a built-in permission system to handle access. Afaik Pouch/CouchDB don't have that; remotestorage.js does, for example.

Well...it won't ever be 100% agnostic. Both sides have to speak the same data format and at least understand one (likely more) endpoints. With the PouchDB/CouchDB ecosystem you at least get a consistent, compatible ecosystem that also includes the likes of IBM's Cloudant and even Drupal. :smile:

Authentication and authorization are a thing unto themselves, and while remoteStorage.js does "have that", it's unique and tied to a storage API... In the case of PouchDB and friends, there are many options. Check out SuperLogin for a good server-side option that supports OAuth.

There are also other more adventurous options such as the things going on in https://github.com/solid/solid-spec and http://opencreds.org/ (related to the W3C's Web Payments IG and Verifiable Claims stuff).

Good times ahead. :smiley:

Dec 02 '16 14:12 BigBlueHat

It was the complication of setting up one's own servers that brought on the "social networking" and "software-as-a-service" onslaught. That said, groups like owncloud.org, cozy.io, and sandstorm.io provide some "DIY platform-as-a-service" options...but every one of them is as idiosyncratic as the next... Ideally...the browser === server (+/- renting compute power and/or persistence/longevity).

Yeah I actually thought about creating plugins for the most common providers (since they offer the ability to add plugins) and also providing a firmware that can be put on AWS or so. Would that already help?

Mostly the arrows coming from DB to idiosyncratic services (Evernote, etc). Those could/should be plugins and work as their own data transformers/publishers/sync-er-thingies.

Yes, I see this as a difficulty as well. The feasible approach I thought about is to build the integrations one by one according to users' needs, but to develop them as modular plugins to the system from the start. Later it should be possible for developers to add new channels by developing their own plugins that understand the search queries and respond accordingly, instead of creating a full index for the content of each source. Basically, providing an API endpoint in the software that can send queries and receive specifically formatted results.
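To make the plugin idea above a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not the project's API: the names `SearchResult`, `SourcePlugin`, `InMemoryNotesPlugin` and `federated_search` are all made up for this example, and the "plugin" just searches a dictionary instead of calling a real service.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class SearchResult:
    """One hit in the shared result format every plugin returns (hypothetical)."""
    title: str
    url: str
    source: str
    snippet: str = ""


class SourcePlugin(Protocol):
    """Anything with a name and a search() method can act as a channel."""
    name: str

    def search(self, query: str, limit: int) -> list[SearchResult]: ...


class InMemoryNotesPlugin:
    """Stand-in for a real integration (e.g. a note service); no network calls."""
    name = "notes"

    def __init__(self, notes: dict[str, str]) -> None:
        self._notes = notes  # maps URL -> note text

    def search(self, query: str, limit: int) -> list[SearchResult]:
        needle = query.lower()
        hits = [
            SearchResult(title=text[:40], url=url, source=self.name, snippet=text)
            for url, text in self._notes.items()
            if needle in text.lower()
        ]
        return hits[:limit]


def federated_search(
    plugins: list[SourcePlugin], query: str, limit: int = 10
) -> list[SearchResult]:
    """Send the same query to every plugin and merge the answers."""
    results: list[SearchResult] = []
    for plugin in plugins:
        results.extend(plugin.search(query, limit))
    return results[:limit]
```

The point of the sketch is that the core only needs to know the query and the result format; each source answers queries itself, so no full index of every source has to be built.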

Well...it won't ever be 100% agnostic. Both sides have to speak the same data format and at least understand one (likely more) endpoints.

I am not sure if I got that, or if there was a miscommunication on my side. (I didn't get the answer :) ) What I mean by agnostic is the ability to take the code for the server-side environment and put it on AWS or Red Hat, or whatever choice a user has for their servers, and then configure the client to talk to this server. As for the API endpoints, the data communication API will most likely have the same (standard) endpoints on each server running it. Without that I see problems in having a network of servers talking to each other, let alone to the user's different clients.

Dec 02 '16 15:12 blackforestboi

Not sure how you're storing things now (or how you plan to), but if you use Web Annotation Data Model (or if you could/would) then it seems sensible to use the Web Annotation Protocol.

If built on either of those standards, then it also seems reasonable that more than just you would be interested in building compatible proxies, transformers, etc, to make the cross-idiom-cloud integrations happen.

The Web Annotation Protocol has also been thought through from the vantage point of decentralization with places to put previous identifiers (via), store the canonical identifier (canonical), and how those should be handled/preserved in different scenarios.
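As a rough illustration of the `canonical` and `via` properties mentioned above, the following Python sketch builds a Web Annotation as a plain dict and "re-homes" it on a second server. The server URLs, the `/annotations/` path and the helper names `new_annotation` and `rehome` are assumptions for this example; only the property names and the `@context` come from the Web Annotation Data Model, and the exact handling rules should be checked against the spec.

```python
from __future__ import annotations

import uuid

ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def new_annotation(server_base: str, target_url: str, comment: str) -> dict:
    """Create an annotation on its first server (URL layout is hypothetical)."""
    anno_uuid = uuid.uuid4()
    return {
        "@context": ANNO_CONTEXT,
        "id": f"{server_base}/annotations/{anno_uuid}",
        "type": "Annotation",
        # Identifier that stays the same wherever the annotation is copied.
        "canonical": f"urn:uuid:{anno_uuid}",
        "body": {"type": "TextualBody", "value": comment},
        "target": target_url,
    }


def rehome(annotation: dict, new_server_base: str) -> dict:
    """Copy an annotation to another server without losing its history."""
    moved = dict(annotation)
    previous_id = annotation["id"]
    moved["id"] = f"{new_server_base}/annotations/{uuid.uuid4()}"
    # "canonical" is left untouched; the previous location is recorded in "via".
    moved["via"] = list(annotation.get("via", [])) + [previous_id]
    return moved
```

With this shape, two servers holding copies of the same annotation can recognise them as the same thing by comparing `canonical`, while `via` shows where each copy came from.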

There's a decent amount of existing protocol server code already, and more to come thanks to Apache Annotator (incubating)--which I know you're at least passingly familiar with. :grin:

Thoughts? 💭

Dec 02 '16 15:12 BigBlueHat

Not sure how you're storing things now (or how you plan to), but if you use Web Annotation Data Model (or if you could/would) then it seems sensible to use the Web Annotation Protocol.

The data model is in fact not clear yet; I thought about the Annotation data model as well. I have to dive into it a bit more as soon as we start working on the web-based view and have PouchDB implemented for the currently saved data points.

There's a decent amount of existing protocol server code already

I think this is starting to go beyond my knowledge and expertise in the field when it comes to understanding what is needed and what can be used. One more reason why we soon need someone to take the development steering wheel. If we get enough moneyz via the Patreon campaign, we might as well make that happen.

and more to come thanks to Apache Annotator (incubating)--which I know you're at least passingly familiar with. 😁

Don't know what you are talking about 😉 (Will definitely stay in the loop there and try to contribute to the extent that I can.)

Dec 02 '16 16:12 blackforestboi

Possible technologies supporting this: BigchainDB, IPDB, Arpa2, LeastAuthorityS4, remotestorage.JS. Any others?

Jan 12 '17 15:01 blackforestboi

Reflecting notes from a conversation with @oliversauter yesterday:

PouchDB runs in the browser (on IndexedDB) or in node (on LevelDB)
If making a desktop app with a node runtime—such as with electron—it makes much more sense to use PouchDB than to embed CouchDB

Jan 18 '17 22:01 tilgovi

@tilgovi @BigBlueHat @danrl @obsidianart

Have you ever worked with this? https://github.com/calvinmetcalf/crypto-pouch Or do you know/recommend any other solution to encrypt a PouchDB?

Do you have something I could read to better understand how we could encrypt the DB and still have the search index working?

Is it generally possible to encrypt the DB so that sensitive data is not readable from the outside, but have an unencrypted index? And how could we extend the index easily without having to decrypt the DB?

Jan 19 '17 19:01 blackforestboi

I have not worked with PouchDB or the encryption plugin yet. Unless we need attachments to be encrypted, too, crypto-pouch looks fine based on the documentation. Has this code been independently reviewed?

Is it generally possible to encrypt the DB so that sensitive data is not readable from the outside, but have an unencrypted index?

Since most indexes use hash tables internally anyway, why not piggy-back on that? --> https://en.wikipedia.org/wiki/Hash_table

One could think of an index where the keywords are hashes, not plaintext strings. The index would be unreadable to an attacker but still usable. The existence or nonexistence of a particular plaintext can be checked by looking for hash(plaintext) in the index. References to where the corresponding additional data can be found in PouchDB would be stored as the unencrypted IDs of the PouchDB entries. (Note that the encryption plugin does not encrypt the ID, so the ID should be somewhat random and not related to the content.)

That's one way of doing it.
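A minimal sketch of such a hashed index, in Python and using only the standard library. Note one deliberate swap: instead of a plain hash(plaintext) it uses a keyed hash (HMAC-SHA256), on the assumption that a secret index key is available, because plain hashes of common words can be guessed with a dictionary. The class name `BlindIndex` and the helper `keyword_token` are made up for this example, and the document IDs are random so they reveal nothing about the content.

```python
from __future__ import annotations

import hashlib
import hmac
import re
import secrets
from collections import defaultdict


def keyword_token(index_key: bytes, word: str) -> str:
    """Keyed hash of a lower-cased keyword; this is what gets stored."""
    return hmac.new(index_key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()


class BlindIndex:
    """Maps hashed keywords to document IDs; no plaintext keywords are kept."""

    def __init__(self, index_key: bytes) -> None:
        self._key = index_key
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        # Split the text into words and store only their keyed hashes.
        for word in set(re.findall(r"\w+", text.lower())):
            self._postings[keyword_token(self._key, word)].add(doc_id)

    def lookup(self, word: str) -> set[str]:
        # Hash the query the same way and look the token up.
        return set(self._postings.get(keyword_token(self._key, word), set()))


# Usage: random IDs stand in for the unencrypted PouchDB document IDs.
index = BlindIndex(secrets.token_bytes(32))
doc_id = secrets.token_hex(16)
index.add(doc_id, "Decentralised search with PouchDB")
matches = index.lookup("pouchdb")  # contains doc_id
```

Someone who only sees the stored tokens still learns which documents share a keyword and how often a token occurs, so this hides the words themselves, not every pattern in the data.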

Appending data to an encrypted database sounds like a use case for asymmetric cryptography (where the asymmetric key is ultimately used to encrypt the symmetric key). Possible, but it may not be very straightforward. However, it allows for nice features such as adding data from untrusted devices/instances without compromising or giving up the trusted device's/instance's encryption key.

Jan 19 '17 19:01 danrl

Is encryption a hard requirement? For which environment?

Jan 19 '17 20:01 tilgovi

Has this code been independently reviewed?

I don't know

However, allows for nice features as adding from untrusted devices/instances without compromising or giving up the trusted device/instance encryption key.

This sounds like something we could need if we want to provide mobile support, as soon as we have the servers for the users. I assume the same goes for querying from "unknown" devices? Particularly if users log in from other people's computers to search their index?

Jan 19 '17 21:01 blackforestboi

I assume the same goes for querying from "unknown" devices? Particularly if users log in from other people's computers to search their index?

Unfortunately, not. This is a one-way road.

Jan 19 '17 21:01 danrl

@tilgovi

Not in the very near future, but I guess we will tackle this sometime this year. We will store a web of quite personal stuff there and will possibly have to store access credentials to all sorts of personal services.

Jan 19 '17 21:01 blackforestboi

@danrl So how would people be able to search their index from other devices?

I was thinking of something like the login credentials doubling as the key to decrypt the DB. So if users have a web interface where they can log in, they can simply search from anywhere.
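One common way to get a decryption key out of login credentials is a password-based key derivation function. The sketch below uses PBKDF2 from Python's standard library; the function name `derive_db_key`, the iteration count and the idea of storing a per-user salt next to the database are assumptions for illustration, not something decided in this thread.

```python
import hashlib
import secrets

# Assumed work factor; a real deployment would tune this.
ITERATIONS = 200_000


def derive_db_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte key from the user's password and a per-user salt."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )


# The salt is not secret and could be stored alongside the encrypted DB.
salt = secrets.token_bytes(16)
key_on_laptop = derive_db_key("correct horse battery staple", salt)
key_on_other_device = derive_db_key("correct horse battery staple", salt)
# Same password and salt give the same key on any device.
```

Whether the key is derived in the browser or on the server decides who gets to see it, which ties into the question of where the DB sits.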

Jan 19 '17 22:01 blackforestboi

So where does the DB sit? On the server? I think we should discuss this in our next call; this is getting a bit off topic here, isn't it? Let me think a bit about the architecture and where crypto makes sense.

Jan 19 '17 22:01 danrl

How seamless can the process be made for the user to set up the servers? (Important for non-technical users, as most of ours will be.)

Could be a binary that users can copy and run, could be a Docker container that users can deploy to their favorite cloud, could be the sources for people who like to build everything themselves. Maybe an ownCloud plugin? I think many forms are possible, some more seamless than others, depending on the target audience.

What kind of problems do you see with this architecture?

Stage 2: An API that is not aware of the content (because of encryption) cannot do much more than HTTP methods/REST can already do, right? Does every user have their own DB on the server?

How can we make the code that runs on the server replicable and agnostic of the server choice?

I do not quite get this question.

As far as I know, searching in encrypted datasets is not yet mature, so the question is: is it possible to add an encrypted access layer that would effectively sandbox the data and its processing, making it unavailable to outside people without the right credentials?

Are you referring to homomorphic operations? --> https://en.wikipedia.org/wiki/Homomorphic_encryption I am not aware of anything reliable and stable in that field yet. May be wrong, though.

Jan 19 '17 22:01 danrl

I think many forms are possible, some more seamless than others, depending on the target audience.

Would it generally be possible to make it a one-click thing? Maybe at selected providers, where we provide a setup process that is effectively like selling a house and just giving the user the keys, which they could then change?

Does every user have their own DB on the server?

Ideally it's a closed environment for each user, yes. So everyone would get/have their own server.

I do not quite get this question.

I don't know what kind of differences there are between server providers in terms of setup process etc. Hence the question whether there is something to consider when building an application that (more technically savvy) users can run on any server they like. I assume Docker containers or binaries would be server agnostic. If that is the case, I think the question is answered.

Are you referring to homomorphic operations?

My initial question was related to what you wrote before about encrypting the database and index. Generally my motivation is to make the data stored locally (or on the server) secure from outside attacks, but still readable/modifiable for the user. Furthermore, as soon as data sharing takes place, the user should be able to give other people permission to access/decrypt specific data sets. This is why I mentioned implementing some sort of access layer that controls this sharing. I don't know how to solve that yet.

Obviously I am a total noob here and surely ask many inexperienced questions :)

Jan 19 '17 23:01 blackforestboi

Has anybody worked with blockstack before?

Do you think it's something worth exploring for our purpose?

Jan 23 '17 21:01 blackforestboi

Doesn't blockstack require users to possess bitcoins, e.g. to pay for identity or zones? I may be getting the concept wrong.

Feb 04 '17 16:02 danrl

Standalone P2P search applications (e.g. Yacy) don't really make sense from a usability standpoint. It's unrealistic to expect hundreds of millions of users to download a standalone app and configure a P2P search node. What would make more sense, and would lead to much more widespread adoption, is to use protocols like Web Sockets to facilitate P2P connectivity in the web browser, so that everything can be done via a simple browser plugin that can be installed by anyone with a few clicks and would then just allow people to use the browser search bar as usual. Browser integration would also have the bonus of simplifying the choice of what to index: it could just default to indexing bookmarked and frequently visited pages, and then be optionally customized by more advanced users to create custom indexes (i.e. all of the complexity of setting up indexing could be hidden from the user, unless they choose to look for it).

Here you can look into P : P2P networking with browsers. Actually, once a connection is established the middleman is no longer necessary; no proxies are involved.

Feb 04 '17 21:02 amitkumarj441

If I understand correctly you are suggesting a middle step that would facilitate fetching updates about good content in a P2P fashion?

So that means as soon as I am online, I fetch all the content-rating updates from my friends who are also online?

Feb 05 '17 10:02 blackforestboi

Alright @oliversauter .

Feb 05 '17 14:02 amitkumarj441
