regex-automata
regex-automata copied to clipboard
improve dfa deserialization docs around alignment
I have been following the DFA serialization / deserialization docs but I am getting an AlignmentMismatch
error:
DeserializeError(AlignmentMismatch { alignment: 4, address: 4383310073 })
What might be causing this? I am storing my DFAs as follows:
in proc macro that generates DFAs:
let reg = Regex::new(parser_regex.as_str()).unwrap();
let (fwd_bytes_le, fwd_pad_le) = reg.forward().to_bytes_little_endian();
let (rev_bytes_le, rev_pad_le) = reg.reverse().to_bytes_little_endian();
let fwd_bytes_le = Literal::byte_string(fwd_bytes_le.as_slice());
let rev_bytes_le = Literal::byte_string(rev_bytes_le.as_slice());
let (fwd_bytes_be, fwd_pad_be) = reg.forward().to_bytes_big_endian();
let (rev_bytes_be, rev_pad_be) = reg.reverse().to_bytes_big_endian();
let fwd_bytes_be = Literal::byte_string(fwd_bytes_be.as_slice());
let rev_bytes_be = Literal::byte_string(rev_bytes_be.as_slice());
The proc macro generates syntax looking like this:
lazy_static! {
static ref BOLTS_ROUTE_show_article: Route<4> = {
Route::new(
[StaticRoutePiece(PieceType::Static, "/users/test/"), StaticRoutePiece(PieceType::Str, "name"), StaticRoutePiece(PieceType::Static, "/other_stuff/cool-thing/"), StaticRoutePiece(PieceType::U128, "id")],
b"rust-regex-automata-dfa-dense\x00\x00\x00\xff\xfe\x00\x00\x02\x00\x00..."
b"rust-regex-automata-dfa-dense\x00\x00\x00\x00\x00\xfe\xff\x00\x00\x00..."
0,
0,
b"rust-regex-automata-dfa-dense\x00\x00\x00\xff\xfe\x00\x00\x02\x00\x00..."
b"rust-regex-automata-dfa-dense\x00\x00\x00\x00\x00\xfe\xff\x00\x00\x00..."
0,
0,
)
};
}
The Route::new
function tries to instantiate the regex by loading the bytes for the DFAs like this:
pub struct Route<const N: usize> {
pub pieces: [StaticRoutePiece; N],
pub regex: Regex<DFA<&'static [u32]>>,
}
impl<const N: usize> Route<{ N }> {
pub fn new(
pieces: [StaticRoutePiece; N],
fwd_bytes_le: &'static [u8],
fwd_bytes_be: &'static [u8],
fwd_pad_le: usize,
fwd_pad_be: usize,
rev_bytes_le: &'static [u8],
rev_bytes_be: &'static [u8],
rev_pad_le: usize,
rev_pad_be: usize,
) -> Route<{ N }> {
// use proper endian style
let (fwd_bytes, rev_bytes, fwd_pad, rev_pad) = match cpu_endian::working() {
cpu_endian::Endian::Little | cpu_endian::Endian::Minor => {
(fwd_bytes_le, rev_bytes_le, fwd_pad_le, rev_pad_le)
}
cpu_endian::Endian::Big => (fwd_bytes_be, rev_bytes_be, fwd_pad_be, rev_pad_be),
};
// construct DFAs
let fwd: DFA<&[u32]> = DFA::from_bytes(&fwd_bytes[fwd_pad..]).unwrap().0;
let rev: DFA<&[u32]> = DFA::from_bytes(&rev_bytes[rev_pad..]).unwrap().0;
// build regex
let regex = Regex::builder().build_from_dfas(fwd, rev);
Route {
pieces: pieces,
regex: regex,
}
}
}
Unfortunately I always get the following error on the //construct DFAs
section:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DeserializeError(AlignmentMismatch { alignment: 4, address: 4412145913 })', /Users/sam/workspace/bolts-router/src/routes.rs:60:71
Any idea why this might happen? I can confirm that my machine is little endian, it is correctly selecting little endian, and both the padding variables are resolving to 0. I've tried representing the binary data several ways including as a byte string (above), an array of u8s, etc. I always get the same error. I can also confirm that the bytestring for both little endian DFAs is identical to one that has worked before in a simpler example, so I have no idea why this is happening.
Full macro output can be found here if you want to poke around with the generated DFA: https://gist.githubusercontent.com/sam0x17/d3231467d2dce929548bc77b1609860d/raw/71ab2d53267395023f1622e03b071891c5ea3fb5/dfa.rs
update: looks like one of my regexes exhibits this issue, and the other does not, so I don't think this has anything to do with how I'm encoding/decoding.
Here are the bytes for a DFA not exhibiting this issue: https://gist.githubusercontent.com/sam0x17/ebfe52f8e8ab146345980aafc42f880d/raw/956ab3d66bb59f0ef0ce3fb5688270b2e310072f/good_dfa.rs Here are the bytes for a DFA exhibiting this issue: https://gist.githubusercontent.com/sam0x17/5640a8ec101ea8446f5f38d97041cad4/raw/941e20c8dde20771c59efb65ee733144c8c0a61a/bad_dfa.rs
The regex for the good DFA:
\A/users/([^/?#]+)\z
The regex for the bad DFA:
\A/users/test/([^/?#]+)/other_stuff/cool-thing/([^/?#]+)\z
Maybe the problem is having two groups of parenthesis?
update: experiencing the same issue still after simplifying the regex:
\A/users/test/[^/?#]+/other_stuff/cool-thing/[^/?#]+\z
I figured it out!! The bytes that get loaded have to be aligned as u32
's, so using a struct like the following to store the [u8; n]
consts and specifying 32 as the alignment fixes the issue. This is why when I tweaked with the size of the regex, things would randomly work or not work. Would be cool to document this!
#[repr(align(32))]
pub struct AlignedBytes<const N: usize> {
pub bytes: [u8; N],
}
@sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes
to improve here? Note that the alignment requirement is stated in the Errors
section (along with other requirements). There is also an example explicitly for dealing with alignment. That example includes a variant of your AlignedBytes
type. Also, align(32)
will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32
and the alignment of a u32
is 4
. So that should be #[repr(align(4))]
.
Did the AlignmentMismatch
error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)
I was unfamiliar with alignment issues so only figured it out when I noticed changing the size of the regex had an effect on whether things were working
On Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:
@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].
Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)
— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>
Would be cool to document this!
Is there something that could have been added to the docs that would have helped you?
What would be great is in the AlignmentMismatch
error message, specifying
that the alignment of the bytes that are to be loaded in via from_bytes
must be 4 or a multiple of 4, maybe with a link to an example for how to do
this
On Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:
@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].
Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)
— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>
Would also make a lot of sense to include this comment in the docs for the
from_bytes
method directly
On Mon, Jun 6, 2022 at 3:11 PM Sam Johnson @.***> wrote:
What would be great is in the
AlignmentMismatch
error message, specifying that the alignment of the bytes that are to be loaded in viafrom_bytes
must be 4 or a multiple of 4, maybe with a link to an example for how to do thisOn Mon, Jun 6, 2022 at 10:46 AM Andrew Gallant @.***> wrote:
@sam0x17 https://github.com/sam0x17 Can you say what should be added to the docs for, e.g., DFA::from_bytes https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#method.from_bytes to improve here? Note that the alignment requirement is stated in the Errors section (along with other requirements). There is also an example explicitly for dealing with alignment https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html#example-dealing-with-alignment-and-padding. That example includes a variant of your AlignedBytes type. Also, align(32) will work, but it is excessive. You don't want a 32-byte alignment. You want an alignment equivalent to do that of a u32 and the alignment of a u32 is 4. So that should be #[repr(align(4))].
Did the AlignmentMismatch error not give enough information to convey that you have to do something with alignment in order to make deserialization work? (I'm not blaming, I'm just trying to understand the conceptual failure mode here so that we can improve the docs.)
— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/regex-automata/issues/24#issuecomment-1147530903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE4LOPRIFMVZ2VHW4XQYLVNYFMRANCNFSM5X5ORNAA . You are receiving this because you were mentioned.Message ID: @.***>
Docs should be improved now.