atjson
Introducing slices
This came out of a bunch of experimentation, driven largely by a short talk I gave at "have you tried to rub a database on it?". The code I wrote for the talk takes wikitext, a well-known and somewhat esoteric format, and normalizes it via atjson. While writing that code, I found it very difficult to think about the data relationally using the subdocument model, and ended up creating a new primitive in atjson with slices as the base concept.
It has worked very well in some real applications where I've done testing, and I would like to propose that we deprecate subdocuments in favor of this new primitive. For an example of this in code, take a look at the code for the talk.
As I see it, there are the following benefits to this approach:
- document structures are flat, making it easy to find annotations without walking over every attribute key
- we can now understand relationships between bits of text and reuse slices of a document
- no additional type of thing to learn in atjson
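To make the benefits above concrete, here is a minimal sketch of a flat document where a caption is a slice of the same text rather than a nested subdocument. The `Annotation` and `FlatDocument` shapes, the `caption` attribute, and the helper names are assumptions made for illustration; they are not the atjson API.

```typescript
// Hypothetical shapes for illustration only; not the atjson API.
interface Annotation {
  id: string;
  type: string;
  start: number;
  end: number;
  attributes: Record<string, unknown>;
}

interface FlatDocument {
  text: string;
  annotations: Annotation[];
}

// One flat document: the caption text lives in the same string as the image
// placeholder, is marked by a "slice" annotation, and the image refers to it by id.
const doc: FlatDocument = {
  text: "\uFFFCA cat on a mat",
  annotations: [
    { id: "a1", type: "image", start: 0, end: 1, attributes: { url: "cat.png", caption: "s1" } },
    { id: "s1", type: "slice", start: 1, end: 15, attributes: {} },
    { id: "a2", type: "bold", start: 3, end: 6, attributes: {} },
  ],
};

// Finding annotations is a single pass over one list; no attribute walking.
function findByType(d: FlatDocument, type: string): Annotation[] {
  return d.annotations.filter((a) => a.type === type);
}

// Resolve a slice id to the text it covers.
function sliceText(d: FlatDocument, sliceId: string): string | undefined {
  const slice = d.annotations.find((a) => a.type === "slice" && a.id === sliceId);
  return slice ? d.text.slice(slice.start, slice.end) : undefined;
}

// Annotations that fall entirely inside a slice (the slice itself excluded).
function annotationsInSlice(d: FlatDocument, sliceId: string): Annotation[] {
  const slice = d.annotations.find((a) => a.type === "slice" && a.id === sliceId);
  if (!slice) return [];
  return d.annotations.filter(
    (a) => a !== slice && a.start >= slice.start && a.end <= slice.end
  );
}
```

Because the slice is an ordinary annotation, any number of other annotations can point at the same id, which is where the reuse comes from.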
To note, I've been able to implement this in different systems without having to make changes to atjson itself; this is more me enshrining the pattern as a good approach for folks to take. I would encourage using slices instead of subdocuments, as there are a bunch of gotchas we've found in our code that make using subdocuments fairly complicated. I'm specifically thinking of you, nested conversions.
FYI @bobboiko2, you may be interested in taking a peek 😄
There are a few approaches here to resolve the security issue, where randomly produced magic strings may allow arbitrary values to point to annotations:
- We can introduce a new type to point to annotations in attributes
- We can introduce an id type that is used for both ids and for attributes that point to annotations
Both cases would need a special serialization that we can special-case, which would be harder to exploit. I'm currently failing to come up with a solution that avoids exploits other than introducing a new top-level attribute that handles links. So that's another possible solution here: introducing links as a top-level attribute, which means annotations would have the following structure:
{
  type: "citation",
  start: 0,
  end: 1,
  attributes: {},
  links: {
    citations: ["1", "2"]
  }
}
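As a rough sketch of why a top-level `links` attribute is harder to exploit: a resolver would only ever read ids from `links`, so a string that happens to sit in `attributes` is never treated as a reference. The `LinkedAnnotation` type, its `id` field, and `resolveLinks` are hypothetical names for illustration, not something atjson provides.

```typescript
// Hypothetical shape based on the structure above, with an `id` field assumed.
interface LinkedAnnotation {
  id: string;
  type: string;
  start: number;
  end: number;
  attributes: Record<string, unknown>;
  links: Record<string, string[]>;
}

// Resolve the ids listed under links[key] to annotations.
// Only `links` is consulted; `attributes` are never scanned for ids.
function resolveLinks(
  annotations: LinkedAnnotation[],
  source: LinkedAnnotation,
  key: string
): LinkedAnnotation[] {
  const byId = new Map<string, LinkedAnnotation>();
  for (const annotation of annotations) {
    byId.set(annotation.id, annotation);
  }
  const targets: LinkedAnnotation[] = [];
  for (const id of source.links[key] ?? []) {
    const target = byId.get(id);
    if (target !== undefined) {
      targets.push(target); // dangling ids are skipped, not guessed at
    }
  }
  return targets;
}

const citation: LinkedAnnotation = {
  id: "0",
  type: "citation",
  start: 0,
  end: 1,
  // A user-controlled string that looks like an id, but is not a link.
  attributes: { note: "1" },
  links: { citations: ["1", "2", "3"] },
};

const allAnnotations: LinkedAnnotation[] = [
  citation,
  { id: "1", type: "slice", start: 1, end: 10, attributes: {}, links: {} },
  { id: "2", type: "slice", start: 10, end: 20, attributes: {}, links: {} },
];

const resolved = resolveLinks(allAnnotations, citation, "citations");
```

Here `resolved` contains the two slices with ids "1" and "2"; the id "3" has no matching annotation and is dropped, and the `note` attribute is ignored entirely.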
I had a discussion with @blaine about possible security concerns with using a string for the slice reference. We've decided that this is fine, since the probability of a person guessing the id of a slice is extremely low, much lower than for a typical password, because ids are both random and long.