rapidyaml
rapidyaml copied to clipboard
Fix ambiguity between null strings and zero-length strings
fixes #263
I implemented option no. 2:
-
len == 0 && str == nullptr
→ This is a "null" string -
len == 0 && str != nullptr
→ This is a zero-length, but non-null string
However you will need to take a thorough look, because I basically only looked at the points where tests failed. There may be some forgotten points.
Unfortunately, when we only store str == nullptr
for null scalars, we lose the ability to get the scalars' locations.
Codecov Report
Merging #264 (6009769) into master (d53b444) will increase coverage by
0.03%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #264 +/- ##
==========================================
+ Coverage 96.35% 96.38% +0.03%
==========================================
Files 22 22
Lines 7991 8060 +69
==========================================
+ Hits 7700 7769 +69
Misses 291 291
Impacted Files | Coverage Δ | |
---|---|---|
src/c4/yml/emit.def.hpp | 92.26% <100.00%> (ø) |
|
src/c4/yml/node.hpp | 97.85% <100.00%> (+0.06%) |
:arrow_up: |
src/c4/yml/parse.cpp | 95.75% <100.00%> (+0.05%) |
:arrow_up: |
src/c4/yml/parse.hpp | 100.00% <100.00%> (ø) |
|
src/c4/yml/tree.hpp | 98.97% <100.00%> (+0.06%) |
:arrow_up: |
src/c4/yml/common.hpp | 100.00% <0.00%> (ø) |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
@biojppm Can you look into this? I'm not sure, but since we would lose location information for null scalars, it might be a better option to invent a new flag for "key is null"/"value is null".
I've finally had some occasion to look at this. I've refined the tests and the code for it to pass those tests. While doing this I've found a parser bug that was also fixed in this PR. I pushed those changes to your branch.
Basically an important point is that for any given string, (whether null or otherwise), the result of operator<<
and operator=
should be semantically equivalent, with the only difference being where that resulting val is located.
There is a thorny ambiguity with std::string
because to_csubstr()
will return a null csubstr
if the input is empty. So this results in an ambiguity. I am inclined to rely in the small-string-optimization and return a nonnull pointing at the storage, but only if the SSO is mandated by the standard.
So for now I've left std::string
out of it.
As for the location issue, I have yet to look at it. It is possible that we have to reconsider the whole approach, and adding a flag as you suggest is something that could help.
@Gei0r sorry for the long delay in looking at this. I hope to be able to do it in the coming week or so.
@Gei0r I've rebased on master (which has breaking changes). Let me know if I can push to the PR (ie to your remote branch).
@Gei0r I went ahead and pushed to the PR. The handling of locations was improved such that when the original node is null, in most cases we get a location "close enough" to the original node. Only in some pathological cases does this yield to a null val.
It's not perfect but it's an improvement.
Yeah, likewise sorry for not being more engaged with this issue. If you want, I can take a look today. Nice that you found a way to store the location for null values, that's definitely an improvement.
The locations still need some more work; there is still a case where it goes null and didn't need to.
But right now I'm really bothered by what must be a compiler error in gcc's Release builds. Given this version of Tree::to_arena()
:
#define ______(id) \
printf("here --- %s --- a.len==%zu a.str=%p(%zu) (a.str==nullptr)==%d (a.str!=nullptr)==%d\n", \
#id, a.len, a.str, (intptr_t)a.str, \
(a.str == nullptr), (a.str != nullptr))
csubstr Tree::to_arena(csubstr a)
{
______(0);
substr rem(m_arena.sub(m_arena_pos));
size_t num = to_chars(rem, a);
______(1);
if(num > rem.len)
{
______(2);
rem = _grow_arena(num);
num = to_chars(rem, a);
RYML_ASSERT(num <= rem.len);
}
else if(num == 0u)
{
______(3);
if(a.str == nullptr) // ?????? a null string must enter this branch!
{
______(3.1);
return csubstr{};
}
else if(m_arena.str == nullptr)
{
______(3.2);
// Arena is empty and we want to store a non-null
// zero-length string.
// Even though the string has zero length, we need
// some "memory" to store a non-nullptr string
rem = _grow_arena(1);
}
}
______(4);
rem = _request_span(num);
return rem;
}
... I'm getting wrong behavior in gcc Release builds (Debug builds are ok). If I pass an empty csubstr
(where a.str == nullptr
), gcc will not take the branch marked ______(3.1)
:
TEST(empty_scalar, gcc_error)
{
Tree tree;
csubstr nullstr = {};
ASSERT_EQ(nullstr.str, nullptr);
ASSERT_EQ(nullstr.len, 0);
std::cout << "\nserializing with empty arena...\n";
csubstr result = tree.to_arena(nullstr);
EXPECT_EQ(result.str, nullptr); // fails!
EXPECT_EQ(result.len, 0);
std::cout << "\nserializing with nonempty arena...\n";
result = tree.to_arena(nullstr);
EXPECT_EQ(result.str, nullptr); // fails!
EXPECT_EQ(result.len, 0);
}
This is the output I'm getting:
[----------] 1 test from empty_scalar
[ RUN ] empty_scalar.gcc_error
serializing with empty arena...
here --- 0 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 1 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 3 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 3.2 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 4 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
/home/usr/proj/rapidyaml/test/test_empty_scalar.cpp:87: Failure
Expected equality of these values:
result.str
Which is: 0x555555685b70 pointing to "u<="
nullptr
Which is: (nullptr)
serializing with nonempty arena...
here --- 0 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 1 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 3 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
here --- 4 --- a.len==0 a.str=(nil)(0) (a.str==nullptr)==1 (a.str!=nullptr)==0
/home/usr/proj/rapidyaml/test/test_empty_scalar.cpp:91: Failure
Expected equality of these values:
result.str
Which is: 0x555555685b70 pointing to "u<="
nullptr
Which is: (nullptr)
[ FAILED ] empty_scalar.gcc_error (0 ms)
[----------] 1 test from empty_scalar (0 ms total)
So despite the fact that the 3.1 branch condition a.str == nullptr
is met (as the prints show), GCC refuses to take that branch (!). This occurs across several different gcc versions, as can be seen from the CI failures above.
The problem goes away in Debug builds, and everywhere with clang and msvc. So I think this is a GCC optimizer error.
I reshuffled the branches, and it now works:
modified src/c4/yml/tree.hpp
@@ -997,15 +997,19 @@ public:
* @see alloc_arena() */
csubstr to_arena(csubstr a)
{
- substr rem(m_arena.sub(m_arena_pos));
- size_t num = to_chars(rem, a);
- if(num > rem.len)
+ if(a.len > 0)
{
- rem = _grow_arena(num);
- num = to_chars(rem, a);
- RYML_ASSERT(num <= rem.len);
+ substr rem(m_arena.sub(m_arena_pos));
+ size_t num = to_chars(rem, a);
+ if(num > rem.len)
+ {
+ rem = _grow_arena(num);
+ num = to_chars(rem, a);
+ RYML_ASSERT(num <= rem.len);
+ }
+ return _request_span(num);
}
- else if(num == 0u)
+ else
{
if(a.str == nullptr) // ?????? should enter this branch!
{
@@ -1017,11 +1021,10 @@ public:
// zero-length string.
// Even though the string has zero length, we need
// some "memory" to store a non-nullptr string
- rem = _grow_arena(1);
+ _grow_arena(1);
}
+ return _request_span(0);
}
- rem = _request_span(num);
- return rem;
}
Right, that's one out of the way. This one sucked.
Three issues remain to be addressed:
- trim some more null cases from locations
- fix coverage misses
- what to do about
to_csubstr(std::string const&)
andto_substr(std::string &)
when the input string is empty, and add tests to verify
I hope to do these in the coming days.
Regarding the std::string
question, do you have any thoughts? Should it return a null or an empty csubstr
when the input std::string
is empty? (Remember that std::string
uses SSO, and the standard mandates SSO; so even though the .size()
of a std::string
is 0, there is still a buffer inside the std::string
)
I am inclined to make the resulting csubstr
be null instead of empty:
- it is consistent with most other cases
- less chance of pointing at a dangling buffer if the
std::string
goes out of scope.
But right now I'm really bothered by what must be a compiler error in gcc's Release builds.
This is a little over my head, but I suspect this is because in in to_chars()
, you pass a.str
to memcpy()
here.
Because calling memcpy()
with an invalid pointer is undefined behavior (I think?), the compiler is allowed to assume a.str != nullptr
and elide the branch.
Reminds me of this blog post by Raymond Chen.
Here's a short example of what I mean:
https://godbolt.org/z/KaYxa1qKG
As you can see, with -O3
gcc does not put the if expression, including the printf()
call into the output. clang does it.
I am inclined to make the resulting csubstr be null instead of empty
What's important to me is that an empty (e.g. default-constructed) std::string
is stored and serialized as an empty string scalar (''
) and not a null scalar.
Semantically, I'd prefer to make a csubstr from such a std::string
also non-null. That means there are no "null" std::strings, only empty scalars.
The thing with the dangling pointer if std::string
goes out of scope is always there... Maybe leave it all behind and switch to Rust 😜
I suspect this is because in in to_chars(), you pass a.str to memcpy() here.
Kudos and thanks, that is exactly the problem. Eg, looking at cppreference, it does say that:
If either dest or src is an invalid or null pointer, the behavior is undefined, even if count is zero.
... which is exactly the case that I had: count was zero, so I assumed it was ok to still do the call to avoid the branch, while not aware I was incurring UB by doing so. I will need to check the library for this problem, as there are quite a few calls to memcpy()
which may trigger this behavior.
What's important to me is that an empty (e.g. default-constructed) std::string is stored and serialized as an empty string scalar ('') and not a null scalar.
I'm curious. Why is it important, what is the use-case warranting that distinction?
Semantically, I'd prefer to make a csubstr from such a std::string also non-null. That means there are no "null" std::strings, only empty scalars.
Wanting a default-constructed std::string
to be empty effectively requires that to_csubstr()
returns non-null for that string. But again, that begs the question above.
Consider this:
TEST(empty_scalar, std_string)
{
std::string stdstr;
csubstr stdss = to_csubstr(stdstr);
csubstr nullss;
ASSERT_EQ(stdss, nullptr);
ASSERT_EQ(stdss.str, nullptr);
ASSERT_EQ(stdss.len, 0u);
ASSERT_EQ(nullss, nullptr);
ASSERT_EQ(nullss.str, nullptr);
ASSERT_EQ(nullss.len, 0u);
Tree tree = parse_in_arena("{ser: {}, eq: {}}");
tree["ser"]["stdstr"] << stdss;
tree["ser"]["nullss"] << nullss;
tree["eq"]["stdstr"] = stdss;
tree["eq"]["nullss"] = nullss;
EXPECT_EQ(emitrs_yaml<std::string>(tree),
"ser:\n"
" stdstr: \n"
" nullss: \n"
"eq:\n"
" stdstr: \n"
" nullss: \n"
);
}
This is the current situation. Everything is null, including a default-constructed std::string
. This is consistent and makes sense. However, if it goes your way it would be like this:
EXPECT_EQ(emitrs_yaml<std::string>(tree),
"ser:\n"
" stdstr: ''\n"
" nullss: \n"
"eq:\n"
" stdstr: ''\n"
" nullss: \n"
);
I think what you called "my way" is the correct way.
Everything is null, including a default-constructed
std::string
. This is consistent and makes sense.
But the yaml standard says:
Note that a null is different from an empty string.
So these asserts are wrong imo:
ASSERT_EQ(stdss, nullptr);
ASSERT_EQ(stdss.str, nullptr);
My use case:
The yaml I emit is read by a different program, which checks the datatype. Some fields must be strings. However, sometimes these strings are empty.
As I noted in my original issue, this was fine prior to 2707b25d7932bf70daf52edcbd8895ca45de85be, because empty strings were serialized to ''
. But I can't upgrade rapidyaml because of the change.
The way I see it, currently there is no way to emit an empty string using rapidyaml?
The yaml I emit is read by a different program, which checks the datatype. Some fields must be strings. However, sometimes these strings are empty.
Understood. Let me think about it. I do agree with the rust note, but I want to give some thought to the several issues.
But whatever happens, whatever defaults end up on rapidyaml, you should be aware that you need not be constrained by those choices.
Eg, even if rapidyaml would have the incoming std::string
be null after to_csubstr()
, you can ensure that its serialization is not, contrary to what rapidyaml would choose.
For example, you could do something like this in your code (assuming you're using <<):
node << nonnull(str);
where nonnull
could look like this:
struct nonnull { std::string const& subject; };
NodeRef& operator<< (NodeRef node, nonnull nn)
{
if(nn.subject.empty())
node << csubstr(""); // use a non-empty string
else
node << nn.subject;
return node;
}
Of course, then you would need to remember to ensure use of nonnull
everywhere in your serialization code where a std::string
is used. If that is too much of a downside (if for example there are a lot of places where you would need to remember to use it), instead of using something like nonnull
you could wrap <<
inside a function and specialize it for your types:
template<class T>
C4_ALWAYS_INLINE void myserialize(NodeRef &n, T const& var) { n << var; }
C4_ALWAYS_INLINE void myserialize(NodeRef &n, std::string const& var) { ... }
// then
myserialize(n, str);
myserialize(n, intvar);
Or you could choose to not use any of operator<<
and do the serialization yourself and just set the result. tree.alloc_arena()
may be helpful, or you may choose to have your own serialization buffer instead.
Or you could even ensure that ryml/std/string.hpp
is not included, and provide instead your own to_csubstr(std::string const&)
implementation.
I'm highlighting this because it is a conscious design decision for rapidyaml that although it may provide the basic facilities, it is not a framework forcing you to use its approach. It is not a pact with the devil; you are not selling your soul if you choose to use it. The point of rapidyaml is that it should be very easy (and fast) to do what you need, but if it isn't, you still have the freedom to do what's best for you.
And if there is something constraining you for which there is no recourse, I would consider it a design bug of rapidyaml.
I checked and apparently since c++11 it is no longer undefined behavior to call std::string()[0]
:
If pos == size(), a reference to the character with value CharT() (the null character) is returned.
For the first (non-const) version, the behavior is undefined if this character is modified to any value other than CharT() .
So I will change to_substr()
and to_csubstr()
to reflect this. That means that the behavior will be what you are looking for.