cyclonedds
Deserialization of sample failed due to reasons unknown, possibly related to a sequence
I am using the cyclonedds Python bindings to test some proof-of-concept connectivity and processes against another machine. That machine has a production-standard interface (that I have no control over) which publishes a topic with mocked-up IDL similar to the below (I am unable to share the full IDL).
struct ContributingData {
string sensorName;
short dataItemIndex;
};
typedef sequence<ContributingData, 20> ContributingDataType;
struct Data {
short systemDataIndex;
long dataValue;
ContributingDataType contributingSensors;
};
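For reference, my understanding of the plain (pre-XTypes) CDR encoding that this IDL implies, sketched with just the standard library; the helper names and the sample values are mine, little-endian assumed:

```python
import struct

def cdr_align(buf: bytearray, n: int) -> None:
    # Pad so the next field starts on an n-byte boundary
    # (alignment is counted from the start of the CDR body).
    while len(buf) % n:
        buf.append(0)

def encode_data(system_data_index: int, data_value: int, sensors) -> bytes:
    # Plain little-endian CDR, the pre-XTypes rules OpenSplice speaks:
    # short, (padding), long, then sequence = uint32 count + elements.
    buf = bytearray()
    cdr_align(buf, 2); buf += struct.pack("<h", system_data_index)
    cdr_align(buf, 4); buf += struct.pack("<i", data_value)
    cdr_align(buf, 4); buf += struct.pack("<I", len(sensors))
    for name, item_index in sensors:
        raw = name.encode() + b"\0"   # strings carry a NUL terminator
        cdr_align(buf, 4); buf += struct.pack("<I", len(raw)) + raw
        cdr_align(buf, 2); buf += struct.pack("<h", item_index)
    return bytes(buf)

blob = encode_data(3, 5, [("aap", 1)])
# 22 bytes: 2 (short) + 2 (pad) + 4 (long) + 4 (count)
#         + 4 (strlen) + 4 ("aap\0") + 2 (short)
```

Comparing captured payloads against something like this is how I have been trying to reason about the bytes.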
On a machine running OSPL Test I can receive samples from the Production VM, and if I create a sample with the same data on the Test VM I can also receive that in OSPL Test. However, the packets from the Production VM produce the error "Unable to deserialize TopicName (reasons unknown)" on the Test VM. I think that packets from the Test VM are received at the Production machine and trigger the OSPL error "Malformed Packet", but I am not able to monitor this in real time with any great ease.
It appears that there is something wrong in the serialization, possibly in the IDLC-generated Python code (I note that at least one of the members is named format, which is a Python built-in function). However, other errors along these lines appear to point to sequences, and I am not sure where to start with debugging.
Any help greatly appreciated! For additional context I have successfully communicated to this machine from cyclonedds using topics not containing populated sequences. But this is the first data type I am receiving that has a populated sequence.
I note that at least one of the members is named format which is a Python built in function
I checked, that doesn't do any harm. I tried it with the above type with one of the fields renamed to format, and the CDR looks just fine to me.
If the type is no more complicated than this, I would double-check that the definition used in Cyclone is the same as the one used in OSPL, because it seems somewhat unlikely something so basic fails.
Then, I would verify that Cyclone (non-Python, just to be sure) receives the data correctly. Cyclone's dynsub example often comes in handy here. Just give it the topic name, then it should find the topic, retrieve & print the type definition and start printing the samples as they arrive.
If that all checks out, then I suppose the next check is to take one failing sample, and write a little test program in OSPL that publishes it. If you're adventurous, it is possible to run OpenSplice's IDL pre-processor to obtain the XML type definition that OpenSplice uses, then feed that into pubsub, like:
# cat x.idl
module X {
struct ContributingData {
string sensorName;
short dataItemIndex;
};
typedef sequence<ContributingData, 20> ContributingDataType;
struct Data {
short systemDataIndex;
long format;
ContributingDataType contributingSensors;
};
#pragma keylist Data
};
# ./idl2md x.idl X::Data | tee x.md
X::Data
<MetaData version="1.0.0"><Module name="X"><Struct name="ContributingData"><Member name="sensorName"><String/></Member><Member name="dataItemIndex"><Short/></Member></Struct><TypeDef name="ContributingDataType"><Sequence size="20"><Type name="ContributingData"/></Sequence></TypeDef><Struct name="Data"><Member name="systemDataIndex"><Short/></Member><Member name="format"><Long/></Member><Member name="contributingSensors"><Type name="ContributingDataType"/></Member></Struct></Module></MetaData>
# ./pubsub -TS -Kx.md -qa:o=r -Ssm,pm ""
warning: subscriber entity ignoring inapplicable QoS "destination_order"
warning: publisher entity ignoring inapplicable QoS "destination_order"
[publication-matched: total=(1 change 1) current=(1 change 1) handle=2831578d:33f]
[subscription-matched: total=(1 change 1) current=(1 change 1) handle=2831578d:347]
[subscription-matched: total=(2 change 1) current=(2 change 1) handle=3a7049bc:658c]
ANN : { .systemDataIndex = 3, .format = 5, .contributingSensors = { { .sensorName = "aap", .dataItemIndex = 1 },{ .sensorName = "noot", .dataItemIndex = 2 } } }
ANO : { .systemDataIndex = 3, .format = 5, .contributingSensors = { { .sensorName = "aap", .dataItemIndex = 1 },{ .sensorName = "noot", .dataItemIndex = 2 } } }
[subscription-matched: total=(2 change 0) current=(1 change 1) handle=2831578d:347]
(Obviously the above type works for me; note the added #pragma keylist for triggering code generation by OpenSplice's IDL preprocessor.) Among those who know it, pubsub is notorious for its baroque set of command-line options, here:
-TS        topic name = "S"
-Kx.md     read the topic definition metadata from x.md (because OpenSplice doesn't implement XTypes type discovery, it needs a bit of help)
-qa:o=r    use by-reception-timestamp ordering: pubsub's defaults are different from DDS, it uses by-source-timestamp, reliable and no-auto-dispose by default (more sane settings generally, but perhaps it was a bit stupid in a tool like this! but it grew out of something else ...)
This is also using some old version of OpenSplice. The above uses an undocumented interface for idlpp and things may have changed a bit, but if you can find the XML metadata, the rest is easy (type name on the first line, key list on the second line).
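As a quick sanity check of the generated metadata, a few lines of Python can list the members in declaration order for comparison against the IDL (a sketch, using an abbreviated copy of the XML above):

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the idlpp MetaData shown above.
md = ('<MetaData version="1.0.0"><Module name="X">'
      '<Struct name="Data">'
      '<Member name="systemDataIndex"><Short/></Member>'
      '<Member name="format"><Long/></Member>'
      '<Member name="contributingSensors"><Type name="ContributingDataType"/></Member>'
      '</Struct></Module></MetaData>')

root = ET.fromstring(md)
# (member name, type tag) pairs in the order idlpp emitted them
members = [(m.get("name"), list(m)[0].tag)
           for s in root.iter("Struct") for m in s.iter("Member")]
# [('systemDataIndex', 'Short'), ('format', 'Long'),
#  ('contributingSensors', 'Type')]
```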
Thanks for the input, I will take a detailed look at that over the next couple of days.
The topic is a fair bit more complex than the IDL above, but not horrifically so. A couple of Unions, and a few structs within structs, I will see if I can sanitise it enough to publish it.
Earlier in the day I set up a little test run, matching the exact data sent from the production VM by OpenSplice with the data I can create in Cyclone on the test VM. In OSPL Test (which gets its type definition from the test machine, it cannot read the type definition from Cyclone) I am happily able to decode a sample from both machines, this leads me to believe that the IDL is identical (and I am assured by the team responsible for the Production VM they have given me the IDL for it!).
What is slightly odd is that when inspecting the packets from the Test and Prod VMs with Wireshark there are some notable differences. The serialized data from OpenSplice is 444 bytes in length and the data from Cyclone is 432 bytes in length and there are some definite differences in the serialised values (I will take some time and work through the data value by value to see if I can work out where the difference is). I still feel suspicious about these being around the two sequences in the datatype.
Given the same IDL and the same data, a different serialized length would be really weird. Cyclone supports multiple serialization versions (CDR/XCDR1/XCDR2), and the alignment rules for 64-bit ints and floats differ (8-byte alignment in CDR/XCDR1, 4-byte in XCDR2), which makes one wonder. When communicating with OpenSplice it should stick to the old rules, but perhaps there's some combination of things that can cause it to get the alignment wrong.
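To illustrate how that alignment difference changes the length (a toy sketch, not Cyclone's actual serializer): under XCDR2 an 8-byte primitive is only 4-byte aligned, so a struct mixing 4- and 8-byte fields can come out shorter:

```python
import struct

def encoded_size(fields, xcdr2: bool) -> int:
    # fields: struct format strings in declaration order. 8-byte
    # primitives are 8-byte aligned under CDR/XCDR1 but only
    # 4-byte aligned under XCDR2.
    size = 0
    for fmt in fields:
        n = struct.calcsize(fmt)
        align = min(n, 4) if xcdr2 else n
        size += (-size) % align   # padding up to the field's alignment
        size += n
    return size

sample = ["<I", "<d"]                     # uint32 followed by a double
old = encoded_size(sample, xcdr2=False)   # 16: 4 + 4 padding + 8
new = encoded_size(sample, xcdr2=True)    # 12: 4 + 8, no padding
```

A handful of doubles or 64-bit ints landing on odd offsets like this would be enough to explain a 444-vs-432-byte difference.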
On a side note, given that you're interoperating with OpenSplice: if you have support for it and can share the IDL (perhaps sanitized to turn all field and type names into meaningless words) under that support contract, you might want to try raising it through that channel as well. Having the content, the type and the serialised data really helps a lot.
For reference, a pair of Wireshark decodes that OSPL Test is quite happy to decode to the same data type.
And here is some heavily abused IDL which does at least compile in cyclone's IDLC
module SplattedIdl
{
typedef string<12> ConstrainedStringOne;
union UnionOne switch (boolean)
{
case TRUE : unsigned long value;
};
struct StructOne
{
ConstrainedStringOne stringOne;
UnionOne unionOne;
unsigned long longOne;
unsigned short shortOne;
};
typedef sequence<StructOne, 20> StructOneSequence;
enum EnumFour
{
EnumFour_Zero,
EnumFour_One
};
struct StructTen
{
//Not Used
};
struct StructTwelve
{
unsigned long long ulonglongOne;
};
union UnionTen switch (boolean)
{
case TRUE : double value;
};
struct StructThirteen
{
UnionTen unionTen;
double doubleOne;
double doubleTwo;
};
union UnionEleven switch (boolean)
{
case TRUE : double value;
};
struct StructFourteen
{
double doubleOne;
UnionEleven unionEleven;
double doubleTwo;
};
union UnionNine switch (boolean)
{
case TRUE : StructFourteen value;
};
struct StructEleven
{
StructTwelve structTwelve;
StructThirteen structThirteen;
boolean boolOne;
UnionNine unionNine;
};
union UnionEight switch (EnumFour)
{
case EnumFour_Zero : StructTen structTenValue;
case EnumFour_One : StructEleven structElevenValue;
};
union UnionTwo switch (boolean)
{
case TRUE : UnionEight value;
};
typedef string<6> ConstrainedStringTwo;
typedef string<5> ConstrainedStringThree;
typedef string<6> ConstrainedStringFour;
struct StructTwo
{
ConstrainedStringThree constrainedStringThree;
ConstrainedStringFour constrainedStringFour;
};
struct StructThree
{
ConstrainedStringTwo constrainedStringTwo;
};
enum EnumOne
{
structTwoKind,
structThreeKind
};
union UnionThree switch (EnumOne)
{
case structTwoKind : StructTwo structTwoValue;
case structThreeKind : StructThree structThreeValue;
};
struct StructFour
{
//Left blank as set to FALSE in union and its getting boring renaming all these things
};
union UnionFour switch (boolean)
{
case TRUE : StructFour value;
};
struct StructFive
{
//Left blank as set to FALSE in union and its getting boring renaming all these things
};
union UnionFive switch (boolean)
{
case TRUE : StructFive value;
};
enum EnumTwo
{
EnumTwo_Zero,
EnumTwo_One,
EnumTwo_Two,
EnumTwo_Three,
EnumTwo_Four,
EnumTwo_Five,
EnumTwo_Six
};
typedef sequence<unsigned long, 5> ULongSequenceOne;
struct StructSix
{
boolean boolOne;
boolean boolTwo;
boolean boolThree;
boolean boolFour;
boolean boolFive;
};
enum EnumSix
{
EnumSix_Zero,
EnumSix_One,
EnumSix_Two,
EnumSix_Three,
EnumSix_Four
};
union UnionSix switch (boolean)
{
case TRUE : EnumSix value;
};
enum EnumFive
{
EnumFive_Zero,
EnumFive_One,
EnumFive_Two,
EnumFive_Three,
EnumFive_Four
};
union UnionSeven switch (boolean)
{
case TRUE : EnumFive value;
};
typedef string<2> ConstrainedStringFive;
typedef string<8> ConstrainedStringSix;
typedef string<8> ConstrainedStringSeven;
typedef string<8> ConstrainedStringEight;
enum EnumThree
{
// This enum has about 200 different values
EnumThree_Zero
};
struct StructSeven
{
UnionSix unionSix;
ConstrainedStringFive constrainedStringFive;
UnionSeven unionSeven;
EnumThree enumThree;
ConstrainedStringEight constrainedStringEight;
ConstrainedStringSix constrainedStringSix;
ConstrainedStringSeven constrainedStringSeven;
};
struct StructEight
{
// Left blank as quite complex
};
struct StructFifteen
{
// Left blank as unused
};
union UnionTwelve switch (boolean)
{
case TRUE : StructFifteen value;
};
struct StructNine
{
boolean boolOne;
unsigned short ushortOne;
UnionThree unionThree;
boolean boolTwo;
#ifdef DDS_XTYPES
@Key unsigned long ulongOne;
#else
unsigned long ulongOne;
#endif
UnionFour unionFour;
UnionFive unionFive;
EnumTwo enumTwo;
ULongSequenceOne ulongSequenceOne;
StructSix structSix;
StructSeven structSeven;
UnionTwelve unionTwelve;
StructOne structOne;
UnionTwo unionTwo;
StructEight structEight;
StructOneSequence structOneSequence;
};
#pragma keylist StructNine ulongOne
};
It appears that the difference in data first occurs around byte 92. I think that this is where the data for StructOne structOne; sits, as I can identify the bytes for the text in StructOne.stringOne shortly after.
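To find that offset I compared the two captured payloads with a little helper (nothing clever, just the first differing byte; the function name is mine):

```python
def first_difference(a: bytes, b: bytes):
    # Offset of the first byte where the two payloads differ,
    # or None when they are identical. If one is a prefix of the
    # other, the divergence is at the shorter length.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

Feeding in the OpenSplice and Cyclone payload dumps from Wireshark is how I arrived at the offset above.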
@FirebarSim I assumed you are using the C binding of Cyclone but perhaps you are using the C++ binding. This detail does matter because the C++ one is different (it needs to be templated to fit the IDL-to-C++ mapping). I'd be grateful if you could tell me which language to focus on.
Unfortunately I'm in a complicated state: I'm using the Python bindings, as I am really a systems engineer rather than a software engineer, and I can't really write C++ or C!
You actually mentioned Python ... thanks for pointing it out to me again ☺️ It should work fine with all three languages, and Python is actually relatively easy for hacking a sample together.
So, I tried that, but my attempt works fine ... instead of risking going on a wild goose chase here, perhaps you can try my Python script to see if it works for you? There's always the possibility that you are using a different version where it triggers a bug, a big-endian machine — or even that I have the evil eye and the bugs decide to hide.
I've used my pubsub tool for OpenSplice that I mentioned above, in exactly this manner. OpenSplice gets the type from the XML, then dynamically builds a (de)serializer that converts between its normalised in-memory representation and CDR, all of which is completely language independent. This therefore ought to be good enough to reproduce an OpenSplice deserialization error. It outputs:
./pubsub -TS -KSplattedIdl_StructNine.md -qa:o=r -Ssm,pm ""
warning: subscriber entity ignoring inapplicable QoS "destination_order"
warning: publisher entity ignoring inapplicable QoS "destination_order"
[subscription-matched: total=(1 change 1) current=(1 change 1) handle=7441a132:33a]
[publication-matched: total=(1 change 1) current=(1 change 1) handle=7441a132:332]
[subscription-matched: total=(2 change 1) current=(2 change 1) handle=29c1939:1d25a7]
ANN : { .boolOne = true, .ushortOne = 2, .unionThree = structThreeKind:.structThreeValue = { .constrainedStringTwo = "banaan" }, .boolTwo = false, .ulongOne = 5, .unionFour = false:(invalid), .unionFive = false:(invalid), .enumTwo = EnumTwo_Six, .ulongSequenceOne = { 69236151,3284652,1423415 }, .structSix = { .boolOne = true, .boolTwo = false, .boolThree = false, .boolFour = true, .boolFive = true }, .structSeven = { .unionSix = true:.value = EnumSix_One, .constrainedStringFive = "xy", .unionSeven = true:.value = EnumFive_Three, .enumThree = EnumThree_Zero, .constrainedStringEight = "appel", .constrainedStringSix = "peer", .constrainedStringSeven = "kers" }, .unionTwelve = false:(invalid), .structOne = { .stringOne = "aap", .unionOne = true:.value = 94728463, .longOne = 529261, .shortOne = 9176 }, .unionTwo = true:.value = EnumFour_One:.structElevenValue = { .structTwelve = { .ulonglongOne = 314159265 }, .structThirteen = { .unionTen = true:.value = 7.120000, .doubleOne = 11.110000, .doubleTwo = 22.220000 }, .boolOne = false, .unionNine = true:.value = { .doubleOne = 3.314150, .unionEleven = false:(invalid), .doubleTwo = 2.200000 } }, .structEight = { .x = 10 }, .structOneSequence = { { .stringOne = "noot", .unionOne = true:.value = 89786756, .longOne = 12345679, .shortOne = 4321 },{ .stringOne = "zebra", .unionOne = false:(invalid), .longOne = 25791113, .shortOne = 4312 },{ .stringOne = "kameel", .unionOne = true:.value = 89786756, .longOne = 31415, .shortOne = 3421 } } }
ANO : { .boolOne = true, .ushortOne = 2, .unionThree = structThreeKind:.structThreeValue = { .constrainedStringTwo = "banaan" }, .boolTwo = false, .ulongOne = 5, .unionFour = false:(invalid), .unionFive = false:(invalid), .enumTwo = EnumTwo_Six, .ulongSequenceOne = { 69236151,3284652,1423415 }, .structSix = { .boolOne = true, .boolTwo = false, .boolThree = false, .boolFour = true, .boolFive = true }, .structSeven = { .unionSix = true:.value = EnumSix_One, .constrainedStringFive = "xy", .unionSeven = true:.value = EnumFive_Three, .enumThree = EnumThree_Zero, .constrainedStringEight = "appel", .constrainedStringSix = "peer", .constrainedStringSeven = "kers" }, .unionTwelve = false:(invalid), .structOne = { .stringOne = "aap", .unionOne = true:.value = 94728463, .longOne = 529261, .shortOne = 9176 }, .unionTwo = true:.value = EnumFour_One:.structElevenValue = { .structTwelve = { .ulonglongOne = 314159265 }, .structThirteen = { .unionTen = true:.value = 7.120000, .doubleOne = 11.110000, .doubleTwo = 22.220000 }, .boolOne = false, .unionNine = true:.value = { .doubleOne = 3.314150, .unionEleven = false:(invalid), .doubleTwo = 2.200000 } }, .structEight = { .x = 10 }, .structOneSequence = { { .stringOne = "noot", .unionOne = true:.value = 89786756, .longOne = 12345679, .shortOne = 4321 },{ .stringOne = "zebra", .unionOne = false:(invalid), .longOne = 25791113, .shortOne = 4312 },{ .stringOne = "kameel", .unionOne = true:.value = 89786756, .longOne = 31415, .shortOne = 3421 } } }
UNO : { .ulongOne = 5 }
[subscription-matched: total=(2 change 0) current=(1 change 1) handle=29c1939:1d25a7]
[subscription-matched: total=(2 change 0) current=(0 change 1) handle=7441a132:33a]
Don't be scared by the false:(invalid); it shouldn't have printed the (invalid) bit, because there's nothing invalid about an unknown discriminator value in a union without a default case. It should have printed nothing instead.
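For what it's worth, the wire format of such a boolean-discriminated union is simple: a one-byte discriminator, then the arm (if any). A sketch for UnionOne from the IDL above, using only the standard library (the helper is mine, and alignment is shown relative to the start of the union, assuming it begins on a 4-byte boundary):

```python
import struct

def encode_union_one(present: bool, value: int = 0) -> bytes:
    # UnionOne switch (boolean) { case TRUE: unsigned long value; }:
    # one boolean byte, then the 4-byte-aligned uint32 arm only when
    # the discriminator is TRUE; FALSE serializes no arm at all.
    buf = bytearray([1 if present else 0])
    if present:
        while len(buf) % 4:        # pad up to the uint32 alignment
            buf.append(0)
        buf += struct.pack("<I", value)
    return bytes(buf)
```

So a FALSE discriminator contributes a single byte, which is why getting discriminator handling wrong shifts everything that follows.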
I've included the output of the idl2md script for convenience. I had to change the IDL to have a field in the empty structs; I just put an octet in. It is included in the attached zip file.
This is going to be a bit of a challenge: the OSPL machines I have access to are all Windows-based! I have had a look at the tools setup you have going and it looks like it is not workable on Windows at present. Do the tools work with the community edition of OSPL? If they do, I will just spin up another VM and have a test.
They should work just fine with the community version 🙂
If you can compile the examples for the Windows platform, then modifying the "hello world" example to use the SplattedIdl::StructNine instead would be easy (if deserialization fails, there's nothing to print anyway!)
I got the opensplice-tools stuff built on an ubuntu machine and started following your steps like the below, except with the real type names.
# ./idl2md "/path/to the/idl/x.idl" X::Data | tee x.md
I am getting the following output from idl2md:
./idl2md: line 4: [: too many arguments
X::Data
cpp: fatal error: too many input files
compilation terminated.
topic X::Data undefined
<MetaData version .... truncated because I am lazy
If I run idlpp manually using the same flags as in the source of the script and adding the directory of the IDL file to the includes I get what appears to be a successful run and am able to add the type definition to the file as it appears in your output.
Subsequently having set Domain ID 1 in the ospl.xml file (as used by the publisher) and the NIC on this machine that is used for DDS I am able to run the below, again with the real typenames
# ./pubsub -TS -Kx.md -qa:o=r -Ssm,pm ""
This gives the following output
./pubsub: error: DDS_DomainParticipantFactory_create_participant
The same occurs when running as superuser
Usually when it fails to create a domain participant, it is because something is not quite right in the OpenSplice configuration. There'll probably be a ospl-error.log file with cryptic messages ...
Two common mistakes are OSPL_URI being set to the XML file's pathname without a file:// prefix, another one is that a configuration that doesn't have Domain/SingleProcess set to true expects to attach to shared memory, which requires starting the daemons but is not supported in the community edition.
Otherwise, well, I have seen my fair share of OpenSplice XML configuration files and its ospl-error.log and ospl-info.log file, and so I am happy to see if I can make something of it.
Finally, this:
# ./idl2md "/path/to the/idl/x.idl" X::Data | tee x.md
./idl2md: line 4: [: too many arguments
I suspect this is a shell quoting issue: the path looks like it has a space in there, and I wouldn't be surprised if that wreaked havoc in my little hacked-together scripts ... What you did as a workaround is perfect ☺️ it's really all just to run idlpp and to find the key fields. The input to pubsub is just three lines: (1) type name, (2) key fields, (3) the XML string generated by idlpp.
Ahh you got me there with the URI, you would think after months of going export CYCLONEDDS_URI=blah I would have remembered to set it.
Running now gives me the following output
warning: subscriber entity ignoring inapplicable QoS "destination_order"
warning: publisher entity ignoring inapplicable QoS "destination_order"
pubsub: tglib.c:404: parse_unioncase_cb: Assertion 'tu->dtype->kind == TG_INT || tu->dtype->kind == TG_UINT' failed
There isn't anything that seems particularly useful in the ospl-info.log and no other log files are generated by the run. But I guess the fact it is failing to get a union out of the data is telling in and of itself.
This is nothing to worry about:
warning: subscriber entity ignoring inapplicable QoS "destination_order"
warning: publisher entity ignoring inapplicable QoS "destination_order"
Those are simply because I tend to forget which QoS's are applicable where exactly, and originally I thought it would be best to warn if something you specified had no effect (perhaps you expected it to have an effect after all). I always use -qa:BLABLA and then completely ignore those warnings 🙂
This
pubsub: tglib.c:404: parse_unioncase_cb: Assertion 'tu->dtype->kind == TG_INT || tu->dtype->kind == TG_UINT' failed
means there's a bug in my pubsub tool in interpreting the type definition; it hasn't even gotten around to interpreting data. It looks like it didn't handle boolean and char discriminators correctly (the boolean support was present in my local copy, I must've forgotten to push it a couple of years ago ...)
I pushed an update to https://github.com/eboasson/opensplice-tools ... could you try it again?
Unfortunately that fails to build for me with the message
tglib.c: In function 'typekindstr':
tglib.c:214:10: error: 'DDS_TYPE_ELEMENT_KIND_FORWARDDECLARATION' undeclared (first use in this function); did you mean 'DDS_TYPE_ELEMENT_KIND_FLOAT'?
tglib.c: In function 'parse_type_cb':
tglib.c:572:10: error: 'DDS_TYPE_ELEMENT_KIND_FORWARDDECLARATION' undeclared (first use in this function); did you mean 'DDS_TYPE_ELEMENT_KIND_FLOAT'?
No doubt I am building against some other version of OpenSplice ... I added something to the makefile to auto-detect whether DDS_TYPE_ELEMENT_KIND_FORWARDDECLARATION is defined and only reference it if it is. I hadn't expected to be committing to that repo again 😂
Oops, my bad on getting this deep into the debugging! I kinda need to get this working though, and this is a really good refresh of my mucking about with linux skills.
Now builds successfully, with the following command and response:
# ./pubsub -TTopicType_Context -Kx.md -qa:o=s,O=s,d=t,r=n,l=1 -Ssm,pm ""
warning: blah, a few of these, seem irrelevant
[subscription-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:34d]
[publication-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:336]
[subscription-matched: total=(2 change 1) current=(2 change 1) handle=5d4a75ef:6c0]
[subscription-matched: total=(3 change 1) current=(3 change 1) handle=28264e5e:e5d]
[publication-matched: total=(1 change 1) current=(1 change 1) handle=28264e5e:e5s]
ANN : { .boolOne = false etc. at 1Hz as sent by sending system. Data looks good }
I guess this is a good sign that the IDL is correct and there is something funny going on!
Oops, my bad on getting this deep into the debugging! I kinda need to get this working though, and this is a really good refresh of my mucking about with linux skills.
No worries. This kind of interoperability issue really annoys me and so I don't mind helping a bit to solve it. Fixing up some minor detail in that pubsub tool is also worth the bother because it still comes in handy often enough 🙂
Now builds successfully
Great!
I guess this is a good sign that the IDL is correct and there is something funny going on!
Yep, and one thing it could be is that it has something to do with the software on the Windows VMs. I see you made the topic transient, so it'd be interesting to check on the Windows VMs if there are error messages (assuming there is communication between your Linux VM and those Windows ones).
I ran up the Production VM and monitored its OSPL log files, the log there reported 3 errors upon starting pubsub on the other VM.
WARNING: Detected Unmatching QoS Policy 'Latency' for Topic <TopicType_Context>.
WARNING: Detected Unmatching QoS Policy 'Reliability' for Topic <TopicType_Context>.
WARNING: Detected Unmatching QoS Policy 'Resource' for Topic <TopicType_Context>.
It reported no additional information when the failing Cyclone Subscriber started. Cyclone's log reports the error message about failed for unknown reason only. pubsub's log reports the same QoS mismatches as above.
When I stop the production VM I see the missed Heartbeat notifications in pubsub's log. The receive stops in Cyclone with no further information.
Upon starting the cyclone publisher (without stopping the two subscribers) the cyclone receiver immediately starts reporting the data that is coming out. On the other hand pubsub shows nothing except the below:
[subscription-matched: total=(6 change 1) current=(2 change 1) handle=5ae5740f:5e166]
Perhaps this narrows it down to something in the IDL Preprocessing chain for Python in Cyclone (or OSPL I guess)?
Did some QoS checking, and the QoS on pubsub and my Cyclone publisher appear compatible. But when I run cyclonedds subscribe there is one noticeable difference: of the 3 QoS variants available, one of them reports DataRepresentation(use_cdrv0_representation=True, use_xcdrv2_representation=True) and the others all report xcdrv2 as False.
I ran up the Production VM and monitored its OSPL log files, the log there reported 3 errors upon starting pubsub on the other VM.
WARNING: Detected Unmatching QoS Policy 'Latency' for Topic <TopicType_Context>.
WARNING: Detected Unmatching QoS Policy 'Reliability' for Topic <TopicType_Context>.
WARNING: Detected Unmatching QoS Policy 'Resource' for Topic <TopicType_Context>.
It reported no additional information when the failing Cyclone subscriber started. Cyclone's log reports only the error message about failing for an unknown reason. pubsub's log reports the same QoS mismatches as above.
Topics are meant to be defined consistently, and especially OpenSplice really cares. It warns if it detects differences. Most likely this means that the QoS setting you used in pubsub for the TopicType_Context are different from what is used elsewhere in the system. (Cyclone topic discovery and OpenSplice topic discovery are incompatible, so this warning can originate in OpenSplice applications.)
Cyclone's log reports the error message about failed for unknown reason only.
I take it you mean the "deserialization failed" message? With Cyclone-published data, or is this real data from the production VMs? Is it the real type or the SplattedIdl one? Because if you have a procedure for triggering the message with the SplattedIdl type, I can always try to match the versions and reproduce it.
When I stop the production VM I see the missed Heartbeat notifications in pubsub's log. The receive stops in Cyclone with no further information.
Yes, that fits; Cyclone doesn't print it when participants come or go, OpenSplice reports them leaving.
Upon starting the cyclone publisher (without stopping the two subscribers) the cyclone receiver immediately starts reporting the data that is coming out. On the other hand pubsub shows nothing except the below:
[subscription-matched: total=(6 change 1) current=(2 change 1) handle=5ae5740f:5e166]
Perhaps this narrows it down to something in the IDL preprocessing chain for Python in Cyclone (or OSPL I guess)?
That suggests that pubsub matches the Cyclone publisher, but that it doesn't receive data, either because Cyclone did not match the subscriber (I doubt it, but it can be checked by looking at the publisher-matched/get_matched_subscriptions in the Cyclone publisher) or because OpenSplice did not deserialize the data.
Did some QoS checking, and the QoS on pubsub and my Cyclone publisher appear compatible. But when I run cyclonedds subscribe there is one noticeable difference: of the 3 QoS variants available, one of them reports DataRepresentation(use_cdrv0_representation=True, use_xcdrv2_representation=True) and the others all report xcdrv2 as False.
That difference would fit with OpenSplice not supporting XTypes, hence not advertising any data representation information during discovery and hence defaulting to the old representation only.
All of this was with the real types, I can't realistically generate the Splatted Idl stuff on the production machine as the OSPL publisher is pretty tightly integrated into another process which runs it and uses it to bridge data to and from a proprietary inter system format. About all I can do on that machine is start and stop the system and check the system logs.
That suggests that pubsub matches the Cyclone publisher, but that it doesn't receive data, either because Cyclone did not match the subscriber (I doubt it, but it can be checked by looking at the publisher-matched/get_matched_subscriptions in the Cyclone publisher) or because OpenSplice did not deserialize the data.
I'll dig into the logs and see if I can get that information out, any thoughts on where would be the most telling places to look? Does this feel like it's a real error to you or a bit of a user error issue?
I tried running the system up in reverse, so pubsub first then the Production side. That produced an interesting error, on the production VM:
Error: Could not create 'TopicType_Context'.
Create Topic <TopicType_Context> failed: key "keyName" doesn't match already existing Topic key "".
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Resource'.
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Reliability'.
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Latency'.
And on pubsub no indication of a match as expected from a failure I suspect.
Tried the same with the cyclone publisher and there was no ospl error on the test VM.
If I start the cyclone publisher first, then pubsub, then the production VM I get the same OSPL error as above. I think I am reading the output from pubsub well enough when it reports that the following
[subscription-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:34d]
[publication-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:336]
These first two are pubsub's own Reader and Writer?
[subscription-matched: total=(2 change 1) current=(2 change 1) handle=5d4a75ef:6c0]
This is something external to pubsub. Should it say publication-matched if it is detecting a publisher though?
I am genuinely really lost on this and where to go next checking wise. I think a sensible step might be trying to extract the QoS used by the production VM to make sure I have an exact match on the cyclone side?
I tried running the system up in reverse, so pubsub first then the Production side. That produced an interesting error, on the production VM:
Error: Could not create 'TopicType_Context'.
Create Topic <TopicType_Context> failed: key "keyName" doesn't match already existing Topic key "".
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Resource'.
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Reliability'.
Create Topic <TopicType_Context> failed: Unmatching Qos Policy: 'Latency'.
And on pubsub, no indication of a match, as expected from a failure I suspect.
It looks to me like the topic definition used in pubsub has the key set differently, and that the QoS is set differently. For the purposes of the experiment the data set in pubsub is not that important, so you might think it is not a big deal if pubsub uses a different key, but in OpenSplice it generally means that if you start pubsub first, the production system uses the wrong definition. That'll cause trouble.
In the investigation it can also cause trouble, because somewhere deep down in the bowels of the protocol spec, it mumbles something about endpoints "with key" and endpoints "without key", and the two never matching. If you read the other specs, the "with key" variant should always be used, but unfortunately RTI never figured that out, and so here we are. In short, that can cause tricky problems ...
So yes, I think it would be important to get the topic type/key/QoS the same in pubsub as in the production system. For Cyclone, only the presence of key fields matters.
I think I am reading the output from pubsub well enough when it reports the following:
[subscription-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:34d]
[publication-matched: total=(1 change 1) current=(1 change 1) handle=4491cc2f:336]
These first two are pubsub's own Reader and Writer?
Yep 😀
[subscription-matched: total=(2 change 1) current=(2 change 1) handle=5d4a75ef:6c0]
This is something external to pubsub. Should it say publication-matched if it is detecting a publisher though?
Ah, naming. One of the hardest problems ... The DDS spec defines a "publication matched" event on the writer, which triggers when it matches a reader; and analogously a "subscription matched" event on the reader, triggering when it matches a writer. So "subscription-matched" means the reader matched a new writer. That is, it matched a publication 😵💫 I double-checked the code just to make sure ...
That suggests that pubsub matches the Cyclone publisher, but that it doesn't receive data: either Cyclone did not match the subscriber (I doubt it, but it can be checked by looking at publication-matched/get_matched_subscriptions in the Cyclone publisher), or OpenSplice did not deserialize the data.
I'll dig into the logs and see if I can get that information out. Any thoughts on where the most telling places to look would be? Does this feel like a real error to you, or a bit of a user error issue?
If say, reader R matches a new writer W, but that writer W doesn't match the reader R, then that typically means a problem. It should always be symmetrical.
I am genuinely lost on this and on where to check next. I think a sensible step might be trying to extract the QoS used by the production VM to make sure I have an exact match on the cyclone side?
I am trying to think along. There may be an alternative: if you can extract the serialized blob that fails to decode[^1], then it is possible to write a program that tries to deserialize it given the type definition. Then you can usually trace the execution of the deserializer[^2] and/or start with only the first field, then progressively add one field at a time until it fails.
For Cyclone, the program is fairly straightforward. You just run idlc as normal, then the topic descriptor contains the bytecode for the deserializer, and dds_cdrstream_desc_from_topic_desc will convert it to one that you can use directly with the deserializer[^3]. If you then call dds_stream_normalize with the blob in the debugger and set a breakpoint on normalize_error it stops right where it gets confused. In essence, this program already exists somewhere in the tests.
Or, if you strip the type down field by field, you can check the return value: the last call that succeeds will have the position where that blob ended stored in actual_size, and presumably what comes next is the troublesome bit. Sometimes you get unlucky and it hobbles along for a while before it detects a problem ...
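The field-by-field approach can be sketched without any DDS library at all. Below is a minimal hand-rolled reader for plain XCDR1 (little-endian, no encapsulation header) applied to the mocked-up Data type from the top of the issue; the type layout and encoding assumptions are mine, not taken from the real system, and the consumed position plays the role of actual_size:

```python
import struct

class CdrReader:
    """Tiny XCDR1 reader: little-endian, no encapsulation header (assumed)."""
    def __init__(self, blob):
        self.blob = blob
        self.pos = 0

    def _align(self, n):
        self.pos += (-self.pos) % n      # pad to n-byte boundary

    def _unpack(self, fmt, size, align):
        self._align(align)
        v, = struct.unpack_from(fmt, self.blob, self.pos)
        self.pos += size
        return v

    def i16(self): return self._unpack('<h', 2, 2)
    def i32(self): return self._unpack('<i', 4, 4)
    def u32(self): return self._unpack('<I', 4, 4)

    def string(self):
        n = self.u32()                   # CDR string length includes the NUL
        s = self.blob[self.pos:self.pos + n - 1].decode()
        self.pos += n
        return s

def decode_data(blob):
    """Decode the mocked-up Data struct one field at a time."""
    r = CdrReader(blob)
    out = {'systemDataIndex': r.i16(), 'dataValue': r.i32()}
    sensors = []
    for _ in range(r.u32()):             # sequence length prefix
        sensors.append((r.string(), r.i16()))
    out['contributingSensors'] = sensors
    return out, r.pos                    # r.pos is the analogue of actual_size
```

Decoding one field at a time like this makes the offset of the first suspicious byte easy to spot in the captured blob.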
For OpenSplice ... something similar ... but I would have to dig around a bit, the basics are in https://github.com/eclipse-cyclonedds/cyclonedds/blob/b53c54a981a1193a430ffb4bb504efef157671ee/src/core/xtests/cdrtest/cdrtest.pl#L165-L183 where it first imports the metadata (like pubsub also does), then serializes a sample that came out of the Cyclone deserializer, but calling sd_cdrDeserializeObject is not that different.
I can imagine you think this is a bit too much. I wouldn't blame you ... but I would like to get to the bottom of this and I do think you really deserve the help. Would taking the communication out of the public eye allow you to provide some more details? I can help with that.
There is one other thing, one that I just remembered because writing "hobbles along" reminded me of something: https://github.com/eclipse-cyclonedds/cyclonedds-python/pull/261. Funny things can happen when writing unions from Python if the data to be written doesn't quite match the type. That one came about because of a crash, but perhaps it's worth checking the union definitions.
[^1]: (I've done it from the Wireshark output by exporting the data to a text file and converting the hex dump to raw bytes, but fortunately only for fairly small amounts; perl is great at this, especially pack and unpack)
[^2]: in the debugger for Cyclone, in OpenSplice by compiling the community edition with some macros defined (https://github.com/ADLINK-IST/opensplice/blob/9d6a98262d6f4a418127650cca415e05946a042a/src/database/serialization/code/sd_cdr.c#L54)
[^3]: you'd probably need to build it with -DEXPORT_ALL_SYMBOLS=1 defined in cmake so that just about everything in the library is accessible
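For the first footnote's hex-dump conversion, a Python equivalent of the perl approach might look like this. The assumed input format is Wireshark's "offset, hex pairs, ASCII gutter" text export; the regex is a best-effort guess at that layout, so treat it as a sketch:

```python
import re

def hexdump_to_bytes(text):
    """Convert a Wireshark-style hex dump to raw bytes (layout assumed)."""
    out = bytearray()
    for line in text.splitlines():
        # Drop the leading offset column, keep the run of two-digit hex
        # pairs, and stop before the trailing ASCII gutter.
        m = re.match(r'^\s*[0-9a-fA-F]{4,}\s+((?:[0-9a-fA-F]{2}\s+){1,16})',
                     line)
        if m:
            out += bytes.fromhex(m.group(1))
    return bytes(out)
```

An ASCII gutter that happens to start with hex-pair-looking characters could confuse the regex, so eyeball the result against the dump for anything important.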
I've been poking the QoS a little more with pubsub having found the type definition used by the Production VM. I've managed to get down to one error message to do with Reliability. The document tells me it is Best Effort and on adding r=n to the qos string I still get a startup error of mismatched Reliability. Bizarre. There is also nothing in the xml type definition about key values, is there meant to be an attribute in the type?
I can certainly extract the data with wireshark, I did try on the whole string previously but I guess using the topic descriptors it is possible to understand the order that the data is packed in and work back from there byte array slice by byte array slice?
Yeah I could be a little less cagey about types in a private conversation.
Following the logic train of turning off fields one at a time down the chain, I have found that there is a particular field (a boolean in the equivalent of StructEight) that is failing to deserialise (as in, I can comment that line and everything after it out in the python code and a partial topic will deserialise; otherwise deserialisation fails!)
If I change the type of the field in the python class to types.int8 it deserialises to -1 and then the rest of the topic deserialises seamlessly!
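A two-line illustration of why the int8 workaround behaves this way; the assumption that the wire carries 0xff in that byte is my inference from the -1, not something confirmed by the source:

```python
import struct

wire_byte = b'\xff'   # assumed: the byte the production system puts on the wire

# Viewed as a signed 8-bit integer it is -1, matching what the Python
# class showed after changing the field type to types.int8.
as_int8, = struct.unpack('b', wire_byte)

# A strict CDR deserializer accepts only 0x00 and 0x01 for a boolean,
# so the same byte makes boolean deserialization fail.
is_valid_bool = wire_byte[0] in (0, 1)
```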
The byte length I mentioned in a previous message was a red herring. It turns out the production implementation pads string values to max length with spaces, so messages from cyclone and messages from OSPL are both 444 bytes in length.
I've been poking the QoS a little more with pubsub having found the type definition used by the Production VM. I've managed to get down to one error message to do with Reliability. The document tells me it is Best Effort and on adding r=n to the qos string I still get a startup error of mismatched Reliability. Bizarre.
That's odd indeed, unless the P system has an inconsistency itself, or there's an inconsistency between topic, reader and writer: it is really easy to get one of those wrong, and all of them can give rise to a warning about mismatched QoS. The topic ones get listed in the OpenSplice logs; the reader/writer ones I believe are reported by pubsub. I'm pretty sure pubsub sets the QoS correctly from the command-line.
If the P system is running before you start pubsub you don't actually need to specify the details of the topic definition with the -K and the -q options, because pubsub will use DDS_DomainParticipant_find_topic to find the existing definition. Sometimes it is easiest to make progress by side-stepping problems 🙂
There is also nothing in the xml type definition about key values, is there meant to be an attribute in the type?
In OpenSplice, the list of key fields is separate from the type definition. (They are part of the topic definition; my weird simpleidlpp perl script is there to find out what the key fields are given the IDL because it is not easily available from OpenSplice's idlpp output.) In "modern" DDS, that is, DDS with XTypes, the keys are part of the type definition.
I can certainly extract the data with wireshark, I did try on the whole string previously but I guess using the topic descriptors it is possible to understand the order that the data is packed in and work back from there byte array slice by byte array slice?
All you need to locate is the payload, because the type part and deserialization is best done with deserializers. I've deserialized by hand often enough to know it is best to avoid it! Locating the payload is easy, wireshark will highlight it in the hex dump for you if you click on the payload part of the dissected message.
(It gets trickier when the messages are really large, but 444 bytes fits in a single blob.)
Following the logic train of turning off fields one at a time down the chain, I have found that there is a particular field (a boolean in the equivalent of StructEight) that is failing to deserialise (as in, I can comment that line and everything after it out in the python code and a partial topic will deserialise; otherwise deserialisation fails!)
If I change the type of the field in the python class to types.int8 it deserialises to -1 and then the rest of the topic deserialises seamlessly!
Ah! Bingo!
I thought this was Cyclone DDS Python → OpenSplice, so I'm not sure I understand why changing the python type to types.int8 would make the deserialization succeed, but in general, that looks very much like trying to deserialize something other than 0 or 1 as a boolean.
The CDR spec says that those are the only valid encodings for a boolean and therefore deserializers should reject other values as invalid input. It follows that serializers should never produce a blob with some other value in it.
Now, with a bit of experimentation and some things I vaguely remember, bools get handled as:
Cyclone C:
- If the CDR and the in-memory layout are the same (not the case in your type):
  - ser & deser: do not check
- If they are not:
  - ser: rejects anything other than 0 or 1
  - deser: rejects anything other than 0 or 1
(I'm not sure I like the "do not check" path ...)
Cyclone Python:
- ser: True, 1, -1 and 2 as 1; False, 0 and None as 0
- deser: rejects anything other than 0 or 1
OpenSplice:
- ser & deser: not checked, any value ≠ 0 is treated as true
So it could be something like an OpenSplice application writing a sample with a boolean field set to 0xff instead of to 1, this getting sent out on the wire and then rejected by Cyclone.
If that's the case, then what is the best way of dealing with it? Transforming any non-0 boolean field in the CDR to 1 in the deserialized result? Only if an option is set? 🤔
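The strict-versus-lenient choice being mooted here could be sketched as follows; this is purely illustrative and not Cyclone's actual API:

```python
def decode_cdr_bool(byte_value, strict=True):
    """Decode one CDR boolean byte (illustrative sketch).

    strict=True follows the CDR spec: only 0 and 1 are legal encodings.
    strict=False is the proposed lenient mode that maps any non-zero
    byte (e.g. the 0xff seen here) to True.
    """
    if byte_value in (0, 1):
        return bool(byte_value)
    if strict:
        raise ValueError(f"invalid CDR boolean encoding: {byte_value:#04x}")
    return True
```

In lenient mode the 0xff sample from the production VM would decode as True instead of killing deserialization of the whole sample.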
Yeah I could be a little less cagey about types in a private conversation.
If it would help, my email address is easy to find in the commit log of Cyclone 🙂
The CDR spec says that those are the only valid encodings for a boolean and therefore deserializers should reject other values as invalid input. It follows that serializers should never produce a blob with some other value in it.
This could actually be an issue with the production VM, combined with OpenSplice perhaps being a little lax in checking whether a value is legal before serialising. Because the backend of some of this stuff is in Ada 95, if you access uninitialised variables you can sometimes get whatever is in the memory space (this is why it pads the strings with spaces; I've previously seen data coming in where the OpenSplice system received "DATA" and the production system read 8 characters from that and got "DATAahds").
Obviously, in my particular use case I would love it to deserialise with the same rules as either the serialising side or OpenSplice. But perhaps this is a niche case, and an error indicating a bool was outside the allowed values, plus a parameter to control how strict deserialisation is, might be better?
Yay! I'm always happy when I see people using Ada! (Part of the reason may very well be that I never actually wrote Ada code, but in my view it is still the only language worthy of the term software engineering. There's some happy family history involving Ada 🙂)
I would say it should at the very least be possible to let it transform any non-0 byte to a boolean value of true on deserialization. It may not be strictly to spec, but making it impossible leads to this type of trouble. Whether that should always be the behaviour, or whether it should be configurable (and how), I don't know yet.
I'm going to get around to actually doing something about this today, but if you're in a position to patch Cyclone, I think it might be worth a try to adjust the logic in these three places:
https://github.com/eclipse-cyclonedds/cyclonedds/blob/b53c54a981a1193a430ffb4bb504efef157671ee/src/core/cdr/src/dds_cdrstream_write.part.h#L15
https://github.com/eclipse-cyclonedds/cyclonedds/blob/b53c54a981a1193a430ffb4bb504efef157671ee/src/core/cdr/src/dds_cdrstream.c#L2330
https://github.com/eclipse-cyclonedds/cyclonedds/blob/b53c54a981a1193a430ffb4bb504efef157671ee/src/core/cdr/src/dds_cdrstream.c#L2342
I think this should now be fixed. I'll close it, but please do reopen if I shouldn't close it yet.