
Hashes as city object IDs

Open: liberostelios opened this issue 5 years ago • 1 comment

Why?

Currently, IDs of the city objects are meant to be any kind of unique identifiers without any restrictions. Messing around with different open datasets, I haven't seen a meaningful and functional use of the identifiers so far.

Instead of leaving this free to the implementation, I propose that we use hashes of the city object content as their IDs. This means that the hash can be representative of the city object. If something changes in the city object (geometry or attribute), then the hash changes as well.

That adds a lot of possibilities:

  • It can be used to validate the data integrity of city objects served on the web (which is one of the main aims of CityJSON). That could be extremely useful with regard to future plans for a web tiling scheme.
  • It allows us to do a lot of interesting tricks with the indexing of city object data. For instance, we can use hashes to keep track of versions and differences between objects (much as git uses hashes to store versions of files).

How?

A simple solution would be to hash the string of the JSON encoding of the city object as it is today. That would make the hash fragile, though, against small changes (such as alterations to the CityJSON attribute names). There are many other possibilities which could be discussed. But the idea is that the input of the hash function should be defined very concretely, so that developers always implement it in the same way.
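To make "very concrete" a bit more tangible, here is a minimal sketch of one possible rule set. Everything in it is an assumption for the sake of discussion, not a proposal for the spec: SHA-256 as the hash function, sorted keys and no insignificant whitespace as the canonical encoding, and the name `city_object_hash`.

```python
import hashlib
import json


def city_object_hash(city_object: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding.

    The canonicalisation rules (sorted keys, compact separators, UTF-8)
    are assumptions made for this sketch only.
    """
    canonical = json.dumps(
        city_object,
        sort_keys=True,         # key order in the file no longer matters
        separators=(",", ":"),  # no insignificant whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


building = {
    "type": "Building",
    "attributes": {"measuredHeight": 22.3, "roofType": "gable"},
}
# Same content, different key order.
reordered = {
    "attributes": {"roofType": "gable", "measuredHeight": 22.3},
    "type": "Building",
}
# Same object with one attribute value changed.
modified = {
    "type": "Building",
    "attributes": {"measuredHeight": 25.0, "roofType": "gable"},
}

same = city_object_hash(building) == city_object_hash(reordered)
changed = city_object_hash(building) != city_object_hash(modified)
```

With these rules, key order and whitespace do not affect the hash, while any change to an attribute value does. Renaming an attribute would still change it, which is the fragility mentioned above.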

Pros

  • IDs become meaningful, functional and consistent. Everyone knows what they are and can use them to validate and manipulate the content of the model.
  • ID uniqueness is ensured. Hash collisions are possible in theory, of course, but existing applications of the mechanism (git, for example) suggest they are negligible in practice.

Cons

  • Adds a little more complexity to the implementation. But implementing the hashing should be pretty straightforward for developers, as long as the rules of hashing are concrete. If it's considered too complex, we could still propose it as optional functionality.
  • Might make changes to the schema in future versions more difficult, as such changes can affect the hash. The point of using a hash as the ID is that two different hashes refer to two different city objects (or two different versions of the same city object). An unchanged city object getting a different hash just because the schema has changed breaks that purpose. This could be mitigated if the hashing mechanism relies not on the exact encoding but on a more stable representation of the city object. Alternatively, we could simply accept that two different CityJSON versions of the same city object can have different hashes.
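As one illustration of what "a more stable representation" could mean, the sketch below hashes a geometry after replacing its vertex indices with the actual coordinates, so that renumbering the shared vertex list does not change the hash. This is only my reading of the idea. The helper names and the choice of SHA-256 are assumptions, and the sketch ignores everything except `type` and `boundaries` (for example the file's `transform`).

```python
import hashlib
import json


def resolve_boundaries(boundaries, vertices):
    """Replace every vertex index in a nested boundary list by its coordinates."""
    if isinstance(boundaries, int):
        return vertices[boundaries]
    return [resolve_boundaries(item, vertices) for item in boundaries]


def geometry_hash(geometry: dict, vertices: list) -> str:
    """Hash a geometry by coordinates rather than by vertex indices (sketch)."""
    stable = {
        "type": geometry["type"],
        "boundaries": resolve_boundaries(geometry["boundaries"], vertices),
    }
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same triangle stored with two different vertex numberings.
vertices_a = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
geometry_a = {"type": "MultiSurface", "boundaries": [[[0, 1, 2]]]}

vertices_b = [[9, 9, 9], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
geometry_b = {"type": "MultiSurface", "boundaries": [[[3, 2, 1]]]}

# The same indices as geometry_a, but one vertex has moved.
vertices_c = [[0, 0, 0], [1, 0, 0], [0, 2, 0]]

hash_a = geometry_hash(geometry_a, vertices_a)
hash_b = geometry_hash(geometry_b, vertices_b)
hash_c = geometry_hash(geometry_a, vertices_c)
```

Here `hash_a` and `hash_b` are equal although the index lists differ, while `hash_c` differs because the geometry itself changed.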

liberostelios, Oct 22 '18 10:10

Briefly discussed my views about this with @liberostelios in person today. In short, I love the idea of hashes to verify an object's integrity and to enable versioning in the future, but I think they should be an optional attribute of every City Object, not their IDs.

My reasoning:

  • Having hashes as IDs significantly raises the bar to build quick-and-dirty applications on top of CityJSON. You can't do anything to an object (e.g. adding a new attribute) without recomputing the hash. Simple changes in a text editor wouldn't be possible. Debugging simple apps becomes much more complex.
  • User-readable IDs are nice.
  • Versioning is still possible with hashes as attributes by parsing the file and building a map of hashes. Compared to parsing the file, building the map is a cheap operation that doesn't change the computational complexity of the operation.
  • Integrity checking works just as well even if the hash is an attribute.
  • For any serious application, the hashes you read in a file need to be verified anyway.
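A rough sketch of the attribute variant, to show how little is needed: the object keeps its readable ID, the hash sits in an optional attribute, and a reader builds the map of hashes and verifies them while parsing. The attribute name `hash`, the helper names and the use of SHA-256 over sorted-key JSON are all assumptions made up for this example.

```python
import hashlib
import json

HASH_KEY = "hash"  # hypothetical attribute name, not part of CityJSON


def content_hash(city_object: dict) -> str:
    """Hash the object's content, leaving out the stored hash itself."""
    attributes = {
        key: value
        for key, value in city_object.get("attributes", {}).items()
        if key != HASH_KEY
    }
    content = {**city_object, "attributes": attributes}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_hash_map(city_objects: dict) -> dict:
    """Map each stored hash to the readable ID of the object carrying it."""
    return {
        obj["attributes"][HASH_KEY]: object_id
        for object_id, obj in city_objects.items()
        if HASH_KEY in obj.get("attributes", {})
    }


def find_stale(city_objects: dict) -> list:
    """Return the IDs whose stored hash no longer matches their content."""
    return [
        object_id
        for object_id, obj in city_objects.items()
        if HASH_KEY in obj.get("attributes", {})
        and obj["attributes"][HASH_KEY] != content_hash(obj)
    ]


city_objects = {
    "building-1": {"type": "Building", "attributes": {"storeys": 3}},
    "tree-7": {"type": "SolitaryVegetationObject", "attributes": {}},
}
# A writer stamps every object with its hash.
for obj in city_objects.values():
    obj["attributes"][HASH_KEY] = content_hash(obj)

hash_map = build_hash_map(city_objects)

# A quick manual edit that does not update the hash.
city_objects["building-1"]["attributes"]["storeys"] = 4
stale = find_stale(city_objects)  # ["building-1"]
```

The manual edit still leaves a usable file: the object keeps its ID, and a reader that cares about integrity simply reports `building-1` as stale.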

kenohori, Oct 22 '18 17:10