XbimEssentials recreate yacc and lex source code

Hello,

I have enhanced the string regular expression used in StepP21Lex.lex to fix a problem that we have encountered a few times with certain IFC files.

So I ran the MAKEPARSER.BAT batch file to regenerate the yacc and lex source files, but I got compile errors. I then undid my changes and ran MAKEPARSER.BAT again without any modifications. It seems to have run fine:

But the generated StepP21Lex.cs file has some changes in it that lead to compile errors:

Longs have been turned into ints:

Also an ifdef is lost:

Should all that be fixed manually or am I missing something or doing something wrong?

Oct 22 '24 20:10 santiagoIT

Yes, it's a hack from a while back. See https://github.com/xBimTeam/XbimEssentials/issues/561#issuecomment-2160556569

We should really look to replace this old Gardens Point parser.

Oct 23 '24 09:10 andyward

I meant to add: you should be able to run git cherry-pick -n 6517bc1 to re-apply the fix (commit 6517bc16042b3cfd820dd7eb45f72bbab92d13ad) to your local branch.

Oct 23 '24 14:10 andyward

@andyward It was precisely the single backslash issue that I am trying to address. Hope to be able to try this out soon and hopefully all unit tests will pass. If so, I will submit a pull request. We run into this problem frequently.

I hope there are tests covering the short Unicode encoding; if not, I will try to add some. I need to make sure that my regex does not break anything there.

Oct 23 '24 15:10 santiagoIT

Unfortunately, the change I made to the regex broke some tests. I wanted the parser to be tolerant of incorrectly encoded strings. I ran into the EncodeBackslash() test, which is now disabled, and I can see that this is how it used to work (fault tolerant), but it had to be changed.

I believe the correct approach would be to detect invalid strings by adding a new token type (Tokens.STRING_INVALID) in the lex file. An exception could then be thrown specifying the line number and string, which would make it clear to the user why the file does not load. I know very little about the encoding of strings in IFC.

Are these the only valid encodings for IFC? https://technical.buildingsmart.org/resources/ifcimplementationguidance/string-encoding/

\S : No idea where this comes from.
\PA : Are other code pages supported?

Basically, what I am trying to come up with is a regex that can be used to detect invalid strings. This regex would be run before the regular string regex.
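To make the idea concrete, here is a minimal Python sketch of such a check. It is not the actual lex rule: the list of accepted directives is an assumption based on the encodings on the buildingSMART page linked above, and the names `VALID_UNIT`, `find_invalid_string`, `InvalidStepStringError` and `check_line` are hypothetical.

```python
import re

# One valid "unit" inside the body of a STEP string (between the outer quotes).
# Assumption: only these directives are legal; adjust if the standard allows more.
VALID_UNIT = re.compile(
    r"""
      [^\\']                          # any character except backslash and apostrophe
    | ''                              # doubled (escaped) apostrophe
    | \\\\                            # doubled (escaped) backslash
    | \\S\\[\x20-\x7E]                # S directive followed by one printable ASCII char
    | \\P[A-I]\\                      # code page switch, letters A to I
    | \\X\\[0-9A-F]{2}                # X directive followed by two hex digits
    | \\X2\\(?:[0-9A-F]{4})+\\X0\\    # X2 block: groups of 4 hex digits, closed by X0
    | \\X4\\(?:[0-9A-F]{8})+\\X0\\    # X4 block: groups of 8 hex digits, closed by X0
    """,
    re.VERBOSE,
)


class InvalidStepStringError(ValueError):
    """Hypothetical error type, standing in for a STRING_INVALID token."""


def find_invalid_string(body):
    """Return the index of the first invalid position in body, or None if it is valid."""
    pos = 0
    while pos < len(body):
        match = VALID_UNIT.match(body, pos)
        if match is None:
            return pos
        pos = match.end()
    return None


def check_line(line_number, body):
    """Raise an error naming the line and string if the string body is invalid."""
    bad = find_invalid_string(body)
    if bad is not None:
        raise InvalidStepStringError(
            f"line {line_number}: invalid string encoding at offset {bad} in {body!r}"
        )
```

With this sketch, a body such as `C:\temp` (single backslash) is reported as invalid at offset 2, while `C:\\temp` passes.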

Thank you!

Oct 28 '24 20:10 santiagoIT

This should be fixed now in develop. I've also addressed the issues with regenerating the code from Lex/Yacc while retaining the fix for very large (>2GB) files.

I know very little about the encoding of strings in IFC.

Are these the only valid encodings for IFC? https://technical.buildingsmart.org/resources/ifcimplementationguidance/string-encoding/

\S : No idea where this comes from. \PA : Are other code pages supported?

The IFC STEP21 format is designed to hold only ASCII characters in the range 32-126 (0x20-0x7E). That means special characters (\n, \t etc.) and accented (and Unicode) characters need encoding.
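As a quick illustration of that range rule, a small Python sketch (the helper name `needs_encoding` is hypothetical, not part of xbim):

```python
def needs_encoding(ch):
    """True if ch falls outside the printable ASCII range 32..126 (0x20..0x7E)."""
    return not (0x20 <= ord(ch) <= 0x7E)


# Characters in "Café\n" that cannot be written literally in a STEP21 string.
flagged = [c for c in "Café\n" if needs_encoding(c)]
```

Here `flagged` holds the accented `é` and the newline. Apostrophes and backslashes are inside the range but still need doubling, which this sketch does not cover.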

\S\ and \P are the old IFC way of encoding characters outside that range, including those in different 'code pages', a.k.a. ISO 8859.

\S allows encoding characters > 127 by letting you add 0x80 (128) to the base character.

\P enables switching between 9 different ISO 8859 code pages: PA=1, PE=5 (Cyrillic), etc. These give access to different Latin, Greek, Cyrillic etc. alphabets when using \S.

The preferred way to do this now is with the \X encoding, which essentially opens up Unicode / ISO 10646.
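A rough Python sketch of how \S and \P interact, for illustration only (this is not the xbim decoder). It assumes \PA\ to \PI\ select ISO 8859-1 to ISO 8859-9, that ISO 8859-1 is the page in effect before any \P directive, and the function name `decode_s_and_p` is hypothetical.

```python
def decode_s_and_p(body):
    """Decode S and P directives in a STEP string body (sketch, no error reporting)."""
    codepage = "iso8859_1"  # assumed default until a P directive is seen
    out = []
    i = 0
    while i < len(body):
        if body.startswith("\\\\", i):
            # Doubled backslash is a literal backslash.
            out.append("\\")
            i += 2
        elif (body.startswith("\\S\\", i) and i + 3 < len(body)
              and ord(body[i + 3]) < 0x80):
            # Add 0x80 to the base character, then look it up in the active page.
            high = ord(body[i + 3]) + 0x80
            # errors="replace" because some ISO 8859 pages leave positions undefined.
            out.append(bytes([high]).decode(codepage, errors="replace"))
            i += 4
        elif (body.startswith("\\P", i) and i + 3 < len(body)
              and "A" <= body[i + 2] <= "I" and body[i + 3] == "\\"):
            # Letter A..I maps to ISO 8859 part 1..9.
            codepage = "iso8859_%d" % (ord(body[i + 2]) - ord("A") + 1)
            i += 4
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```

For example, `M\S\drz` decodes to `März` ('d' is 0x64, plus 0x80 gives 0xE4, which is `ä` in ISO 8859-1), and `\PE\\S\D` decodes to the Cyrillic `Ф` because the same arithmetic is resolved against ISO 8859-5.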

There are 3 flavours, essentially mapping to 8-bit, UTF-16 and UTF-32 characters: \X = 1-byte chars, \X2 = 2-byte chars, \X4 = 4-byte chars.
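The three flavours can be sketched in Python as follows. This is an illustration under assumptions, not the xbim decoder: it treats \X\ as a single code point below 256, \X2\ blocks as UTF-16 big-endian and \X4\ blocks as UTF-32 big-endian, both closed by \X0\. It does not handle a doubled backslash directly in front of a directive, and the name `decode_x` is hypothetical.

```python
import re

# Group 1: two hex digits after the X directive.
# Group 2: hex payload of an X2 block. Group 3: hex payload of an X4 block.
_X_DIRECTIVE = re.compile(
    r"\\X\\([0-9A-F]{2})"
    r"|\\X2\\((?:[0-9A-F]{4})+)\\X0\\"
    r"|\\X4\\((?:[0-9A-F]{8})+)\\X0\\"
)


def decode_x(body):
    """Replace X, X2 and X4 directives in a STEP string body with their characters."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        if match.group(2) is not None:
            return bytes.fromhex(match.group(2)).decode("utf-16-be")
        return bytes.fromhex(match.group(3)).decode("utf-32-be")

    return _X_DIRECTIVE.sub(replace, body)
```

For example, `Caf\X\E9` becomes `Café`, `\X2\03B1\X0\` becomes `α`, and `\X4\0001F600\X0\` becomes the single character U+1F600.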

Some examples of how it works in the tests I added: https://github.com/xBimTeam/XbimEssentials/blob/develop/Tests/XbimP21StringDecoderTests.cs