ADR-XXXX: Use OOP for parsing framework

Date: 2020-01-07

Status

Proposed

Context

Currently, monolith's architecture is a bit, well, monolithic. The vast majority of functionality is all packed together in very few functions. On top of this, these functions are the same functions responsible for making HTTP requests, which makes them very difficult to test. Most functions are also recursive, which means that asynchronous behavior is not a simple thing to implement. And of course, the wide scope of functions makes documentation not super useful and a bit of a chore. If anyone were to use monolith as a library in its current state, they'd basically be forced to use the whole program exactly like it's used here.

On top of this, as the functionality monolith grows, the current architecture is rather averse to extension. For instance, take the recent addition of CSS parsing: Instead of conforming to any generalizable frameworks, this addition requires calling an ad-hoc method that's kinda sloppily joined with the main HTML parser (with no bridge in-between to allow for testability). If monolith ever adds SVG parsing, or a custom processor for images (#89), this process will be repeated, leaving a tangled mess of methods with no clear organization, masturbating all the other problems mentioned.

Current Flow

This diagram is a simplified representation of the call graph. Functions in grey call (either directly or indirectly) a method of reqwest that makes an external HTTP request, and are thus difficult to test and isolate.

TLDR:

Monolithic API
HTTP requests make testing near-impossible
Recursion makes async difficult
Wide scope of functions makes documentation rough
rigid, ad hoc methods make extension difficult
Growth requires making all of the above worse

Decision

I propose a complete architectural overhaul.

What we need is a comprehensive framework that can be used to fit multiple purposes, and provides a way for many parsers (and potentially end-user-added parsers) to co-operate without direct knowledge of each other. This framework needs to be able to provide all the logic needed for HTTP requests, without requiring parsing code to get mixed up in those details, leaving it completely testable. Finally, the framework should support the new async functionality in Rust, which means that recursion should be avoided where async code is involved.

In short, we need a framework that is:

Generalizable
Modular
Asynchronous
Capable of isolating IO logic

Specifically, I propose the following architecture:

The framework shall consist of two key components, Resource and Asset.

Resource shall be a trait that denotes a parser for a certain type of resource. Resources shall receive some data, produce a list of needed remote assets, and once those assets have been provided, shall render the completed document.

Asset shall be a structure wrapping some potentially undownloaded Resource. Assets will store the location and type of a resource. When given an empty Resource (that is, a specific instance that has yet to be fed any data), Assets will offer methods for downloading data and feeding it into the Resource.

Instead of recursive calls to fetch child assets, when a Resource has requisite child assets, it shall internally create Assets to represent them. It will then offer mutable references to these Assets such that the user or method working with the resource can download them in the same way (and stack level) as the original resource was downloaded.

In order to enable the modularity of parsers, Assets do not necessarily present a specific implementation of Resource that must be used to parse that Asset, as this would prevent end users from swapping in and out Resources without re-writing all parsing code. Instead, Assets provide a partial or complete MIME type, and allow the user to decide based on that MIME type what kind of Resource to use to parse the data.

By making all of this customization possible, however, we risk making basic use over complicated. In order to avert this, we shall include several methods that generalize or provide default behavior. Importantly, the Asset::auto_select_resource_type method shall provide a default logic for the selection of Resources, and the Asset::download_complete method shall provide the logic for downloading an Asset and all child Assets.

Proposed Flow

Consequences

The implementation of this will require a considerable amount of review, and the reconciliation of this design with the current architecture will likely take a sizable amount of work.
Such a large change opens up lots of opportunity for the development of bugs and other problems.
If this design is unsatisfactory, or we end up wanting a different architecture, or even just a change in this architecture, it will be even more work to develop that change after (and if) this is finalized.
This implementation, (and pretty much all async implementations) requires putting resources on the heap instead of on the stack, which may have consiquences

Notes

I've put the ADR for this change in an issue rather than a PR adding it to the collection for now, since ADRs aren't finalized for use on master yet.
I've already created a functioning proof-of-concept for this design, with a complete framework, currently branch Alch_Emi/redesign on commit 27534a. This implementation doesn't have any complex parsers or options implemented yet, and requires a significant amount more work to reconcile with the current code, but does demonstrate the functionality of this approach. The docs for this code (and thus this design) are located temporarily at https://alchemi.dev/doc/monolith. That said, I'm open to making any changes to the design, or throwing it away completely if necessary.
Sorry these graphs aren't hosted on GitHub (GH didn't like the SVGs), but the dot source code used to render them is available in the above repository.
I'd be willing to but in the work to merge this in, but I want to make sure it's definitely something we'd want and to make any needed changes before I continue.
This issue affects:
- Issue #20 / #72: This change exposes a less monolithic API that is more suitable for documentation and export.
- Issue #4 (even though it's closed): This change is already asynchronous
- Issue #89: Having a framework for handling assets by MIME type makes this task a lot easier

Jan 07 '20 20:01 alch-emi

Thank you for this, Emi. I need to study the proposal a bit more before I'm able to respond, I'll do so before the end of the week. So far I can tell that the code definitely must be overhauled in order to make it possible to convert not only remote but also local (fs) documents into monolithic HTML files. There's also a need to make monolith work as a library so that we could offer a GUI version and let other developers build mobile apps for archiving pages as a single file. Aiming at Mozilla Firefox inheriting pieces of this code to provide native ability of saving monolithic HTML files is something that we could use as part of the vision for this overhaul.

Jan 08 '20 06:01 snshn

Hello, I just discovered monolith and I'm loving it!

This proposal is a great idea, isolation of responsabilities and testability is so important in modern tools & software!

Has there been any progress on this?

I think that isolation of responsabilities can come incrementally, improving testability and tests in general is a step in the right direction and will make future architectural changes easier to introduce.

Jan 25 '21 09:01 bew

Hi @bew!

Thank you for the question, glad seeing people still finding this tool useful!

No, no progress on this one. I agree that the proposal is great, and I would love to work on it, but to be honest haven't done anything like that before, and wouldn't want to botch this relatively popular tool, being the only maintainer at the moment.

I currently have to fix a few important things before I'm able to re-architecture the code base.

If it can be done incrementally, could you please provide me with some resources where that process is described?

Jan 29 '21 05:01 snshn

@Alch-Emi Hello, just a note: your links for current flow & proposed flow are broken for me (on desktop/mobile) with message Cannot proxy the given URL. Could you re-upload them maybe?

Feb 03 '21 12:02 bew

Oops, my apologies

I've updated the links and it should display now, although I'm not sure how accurate the diagrams are now.

Feb 12 '21 20:02 alch-emi

[ADR-Proposal] Architectural Redisgn, move towards OOP & isolation of responsibilities

ADR-XXXX: Use OOP for parsing framework

Status

Context

Decision

Consequences

Notes