fdb-document-layer icon indicating copy to clipboard operation
fdb-document-layer copied to clipboard

Store numeric field names as string keys, instead of integers

Open apkar opened this issue 6 years ago • 4 comments

Document Layer is storing numeric field names as binary integers, this is limiting numeric field names to be an integer. I am guessing this decision is taken to maintain the order of numeric fields. This is not a valid assumption. All JSON field names are strings.

In [74]: db.coll.insert({'_id': 'foo', '12345678901234567890': 'test'})
Out[74]: 'foo'

In [75]: for row in db.coll.find():
    ...:     print row
    ...:
{u'_id': u'foo', u'-1': u'test'}

Also, changing this behavior would make array expansion code much cleaner. As array indexes are integers, it is hard to guess if a kye is pointing to an array element or a numeric field.

Related code is here

void insertElementRecursive(bson::BSONElement const& elem, Reference<IReadWriteContext> cx) {
	std::string fn = elem.fieldName();
	if (std::all_of(fn.begin(), fn.end(), ::isdigit)) {
		const char* c_fn = fn.c_str();
		insertElementRecursive(atoi(c_fn), elem, cx);
	} else {
		insertElementRecursive(fn, elem, cx);
	}
}

apkar avatar Jan 06 '19 03:01 apkar

I ran into this today while trying to see if we could use FoundationDB's document layer here at Discord. We have one part of our user record that is a nested dictionary where the keys are longs and currently the document layer is causing the string keys to be converted to ints and the data is therefore invalid.

Any chance this is a reasonably fixable issue? I see @apkar you made some improvements here, but not quite enough to solve it?

zorkian avatar May 08 '19 06:05 zorkian

Yeah, it's not fixed yet. Fix is to store all field names as strings, including numeric field names. This is actually an easy change. But, this is an on-disk format change. This can be handled in one of two ways, without breaking backward compatibility

  • Have an on-disk format upgrade process which will upgrade all the existing collections to the new format by rewriting all the documents. This can be done consistently and in the background in parallel to ongoing writes. This works more like an index rebuild. And the next version upgrade would stop supporting the old format.
  • The Second approach is to leave the existing collections as they are and only change the format for new collections. This makes it easy to roll out, as we support both old and new formats. But, the code would be a bit messy as it needs to handle both formats forever.

The second approach is least intrusive and easy to prove that it's working.

apkar avatar May 08 '19 07:05 apkar

Thank you for the response!

That makes sense. In the short term, we're working to do something on our end where we rewrite our own keys so they look like strings. It's ugly, for sure.

I think that your approaches make sense, the first one is probably better. Do you think someone relatively unfamiliar with your codebase (like me) could take a crack at it? And if so, do you have any pointers to where in the code these bits would need to be written and/or any similar processes that already exist I could model off of?

zorkian avatar May 24 '19 22:05 zorkian

Just figured I'd swing back by in case you had any thoughts on this one. =)

zorkian avatar Jul 07 '19 02:07 zorkian