regex-automata icon indicating copy to clipboard operation
regex-automata copied to clipboard

improve dfa deserialization docs around alignment

Open sam0x17 opened this issue 2 years ago • 9 comments

I have been following the DFA serialization / deserialization docs but I am getting an AlignmentMismatch error:

DeserializeError(AlignmentMismatch { alignment: 4, address: 4383310073 })

What might be causing this? I am storing my DFAs as follows:

in proc macro that generates DFAs:

    let reg = Regex::new(parser_regex.as_str()).unwrap();
    let (fwd_bytes_le, fwd_pad_le) = reg.forward().to_bytes_little_endian();
    let (rev_bytes_le, rev_pad_le) = reg.reverse().to_bytes_little_endian();
    let fwd_bytes_le = Literal::byte_string(fwd_bytes_le.as_slice());
    let rev_bytes_le = Literal::byte_string(rev_bytes_le.as_slice());
    let (fwd_bytes_be, fwd_pad_be) = reg.forward().to_bytes_big_endian();
    let (rev_bytes_be, rev_pad_be) = reg.reverse().to_bytes_big_endian();
    let fwd_bytes_be = Literal::byte_string(fwd_bytes_be.as_slice());
    let rev_bytes_be = Literal::byte_string(rev_bytes_be.as_slice());

The proc macro generates syntax looking like this:

        lazy_static! {
            static ref BOLTS_ROUTE_show_article: Route<4> = {
                Route::new(
                    [StaticRoutePiece(PieceType::Static, "/users/test/"), StaticRoutePiece(PieceType::Str, "name"), StaticRoutePiece(PieceType::Static, "/other_stuff/cool-thing/"), StaticRoutePiece(PieceType::U128, "id")],
                    b"rust-regex-automata-dfa-dense\x00\x00\x00\xff\xfe\x00\x00\x02\x00\x00..."
                    b"rust-regex-automata-dfa-dense\x00\x00\x00\x00\x00\xfe\xff\x00\x00\x00..."
                    0,
                    0,
                    b"rust-regex-automata-dfa-dense\x00\x00\x00\xff\xfe\x00\x00\x02\x00\x00..."
                    b"rust-regex-automata-dfa-dense\x00\x00\x00\x00\x00\xfe\xff\x00\x00\x00..."
                    0,
                    0,
                )
            };
        }

The Route::new function tries to instantiate the regex by loading the bytes for the DFAs like this:

pub struct Route<const N: usize> {
    pub pieces: [StaticRoutePiece; N],
    pub regex: Regex<DFA<&'static [u32]>>,
}

impl<const N: usize> Route<{ N }> {
    pub fn new(
        pieces: [StaticRoutePiece; N],
        fwd_bytes_le: &'static [u8],
        fwd_bytes_be: &'static [u8],
        fwd_pad_le: usize,
        fwd_pad_be: usize,
        rev_bytes_le: &'static [u8],
        rev_bytes_be: &'static [u8],
        rev_pad_le: usize,
        rev_pad_be: usize,
    ) -> Route<{ N }> {
        // use proper endian style
        let (fwd_bytes, rev_bytes, fwd_pad, rev_pad) = match cpu_endian::working() {
            cpu_endian::Endian::Little | cpu_endian::Endian::Minor => {
                (fwd_bytes_le, rev_bytes_le, fwd_pad_le, rev_pad_le)
            }
            cpu_endian::Endian::Big => (fwd_bytes_be, rev_bytes_be, fwd_pad_be, rev_pad_be),
        };

        // construct DFAs
        let fwd: DFA<&[u32]> = DFA::from_bytes(&fwd_bytes[fwd_pad..]).unwrap().0;
        let rev: DFA<&[u32]> = DFA::from_bytes(&rev_bytes[rev_pad..]).unwrap().0;

        // build regex
        let regex = Regex::builder().build_from_dfas(fwd, rev);
        Route {
            pieces: pieces,
            regex: regex,
        }
    }
}

Unfortunately I always get the following error on the //construct DFAs section:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DeserializeError(AlignmentMismatch { alignment: 4, address: 4412145913 })', /Users/sam/workspace/bolts-router/src/routes.rs:60:71

Any idea why this might happen? I can confirm that my machine is little endian, it is correctly selecting little endian, and both the padding variables are resolving to 0. I've tried representing the binary data several ways including as a byte string (above), an array of u8s, etc. I always get the same error. I can also confirm that the bytestring for both little endian DFAs is identical to one that has worked before in a simpler example, so I have no idea why this is happening.

sam0x17 avatar Jun 05 '22 19:06 sam0x17

Full macro output can be found here if you want to poke around with the generated DFA: https://gist.githubusercontent.com/sam0x17/d3231467d2dce929548bc77b1609860d/raw/71ab2d53267395023f1622e03b071891c5ea3fb5/dfa.rs

sam0x17 avatar Jun 05 '22 19:06 sam0x17

update: looks like one of my regexes exhibits this issue, and the other does not, so I don't think this has anything to do with how I'm encoding/decoding.

Here are the bytes for a DFA not exhibiting this issue: https://gist.githubusercontent.com/sam0x17/ebfe52f8e8ab146345980aafc42f880d/raw/956ab3d66bb59f0ef0ce3fb5688270b2e310072f/good_dfa.rs Here are the bytes for a DFA exhibiting this issue: https://gist.githubusercontent.com/sam0x17/5640a8ec101ea8446f5f38d97041cad4/raw/941e20c8dde20771c59efb65ee733144c8c0a61a/bad_dfa.rs

The regex for the good DFA:

\A/users/([^/?#]+)\z

The regex for the bad DFA:

\A/users/test/([^/?#]+)/other_stuff/cool-thing/([^/?#]+)\z

Maybe the problem is having two groups of parenthesis?

sam0x17 avatar Jun 05 '22 21:06 sam0x17

update: experiencing the same issue still after simplifying the regex:

\A/users/test/[^/?#]+/other_stuff/cool-thing/[^/?#]+\z

sam0x17 avatar Jun 05 '22 21:06 sam0x17

I figured it out!! The bytes that get loaded have to be aligned as u32's, so using a struct like the following to store the [u8; n] consts and specifying 32 as the alignment fixes the issue. This is why when I tweaked with the size of the regex, things would randomly work or not work. Would be cool to document this!

#[repr(align(32))]
pub struct AlignedBytes<const N: usize> {
    pub bytes: [u8; N],
}

sam0x17 avatar Jun 06 '22 00:06 sam0x17

@sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].

Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)

BurntSushi avatar Jun 06 '22 14:06 BurntSushi

I was unfamiliar with alignment issues so only figured it out when I noticed changing the size of the regex had an effect on whether things were working

On Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:

@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].

Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)

— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>

sam0x17 avatar Jun 06 '22 18:06 sam0x17

Would be cool to document this!

Is there something that could have been added to the docs that would have helped you?

BurntSushi avatar Jun 06 '22 18:06 BurntSushi

What would be great is in the AlignmentMismatch error message, specifying that the alignment of the bytes that are to be loaded in via from_bytes must be 4 or a multiple of 4, maybe with a link to an example for how to do this

On Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:

@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].

Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)

— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>

sam0x17 avatar Jun 06 '22 19:06 sam0x17

Would also make a lot of sense to include this comment in the docs for the from_bytes method directly

On Mon, Jun 6, 2022 at 3:11 PM Sam Johnson @.***> wrote:

What would be great is in the AlignmentMismatch error message, specifying that the alignment of the bytes that are to be loaded in via from_bytes must be 4 or a multiple of 4, maybe with a link to an example for how to do this

On Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:

@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].

Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)

— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>

sam0x17 avatar Jun 06 '22 19:06 sam0x17

Docs should be improved now.

BurntSushi avatar Jul 07 '23 17:07 BurntSushi