
Potential feature: safe align_to

Open mcroomp opened this issue 2 months ago • 4 comments

What do you think of something like this? I realize that you can also do this with bytemuck, but this would let you write more generic code when operating on larger arrays handling the head/tail special case.

pub trait AlignedTo where Self : Sized
{
  type Element;

  fn align_to(array : &[Self::Element]) -> (&[Self::Element], &[Self], &[Self::Element]);
  fn align_to_mut(array : &mut [Self::Element]) -> (&mut [Self::Element], &mut [Self], &mut [Self::Element]);
}

mcroomp avatar Oct 12 '25 17:10 mcroomp

https://docs.rs/bytemuck/latest/bytemuck/fn.pod_align_to.html

This seems the same as the existing bytemuck version, including the head and tail. I'm unclear on the difference you're after.

Lokathor avatar Oct 12 '25 18:10 Lokathor

The background is that, now that there are AVX-512 types, I'm trying to figure out a nice way to write code that is independent of lane size.

For example, if you had this:

pub trait AlignTo<Element> where Element: Sized, Self: Sized,
{
  fn align_to(source: &[Element]) -> (&[Element], &[Self], &[Element]);
  fn align_to_mut(source: &mut [Element]) -> (&mut [Element], &mut [Self], &mut [Element]);
}

You could do this:

use std::ops::Add;

fn sum_vectors_generic<
  E: Add<Output = E> + Copy,
  WideType: AlignTo<E> + Add<Output = WideType> + Copy,
>(
  a: &[E],
  b: &[E],
  result: &mut [E],
) {
  assert_eq!(a.len(), b.len());
  assert_eq!(a.len(), result.len());

  let (a_head, a_mid, a_tail) = WideType::align_to(a);
  let (b_head, b_mid, b_tail) = WideType::align_to(b);
  let (c_head, c_mid, c_tail) = WideType::align_to_mut(result);

  for i in 0..a_head.len() {
    c_head[i] = a_head[i] + b_head[i];
  }

  for i in 0..a_mid.len() {
    let va = a_mid[i];
    let vb = b_mid[i];
    c_mid[i] = va + vb;
  }

  for i in 0..a_tail.len() {
    c_tail[i] = a_tail[i] + b_tail[i];
  }
}

then:

  // SSE2
  sum_vectors_generic::<i16, i16x8>(&a, &b, &mut c);

  // AVX2
  sum_vectors_generic::<i16, i16x16>(&a, &b, &mut c);

  // AVX-512
  sum_vectors_generic::<i16, i16x32>(&a, &b, &mut c);

mcroomp avatar Oct 13 '25 07:10 mcroomp

Well, two things jump out at me:

  1. If you want a trait for this behind the bytemuck feature, with impls that call out to bytemuck (to avoid duplicating unsafe code), I'll merge the PR for it.
  2. What you want in the example code doesn't necessarily work out like that. Imagine that slice a aligns to 8 while slice b aligns to 16. Now the head portions will be different lengths, and so will the mid and tail portions. Unfortunately, for something like this you have to check the alignment of all data runs and then use the least aligned one as the basis for how to transform the other two.
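
To make the second point concrete, here is a minimal sketch using only the standard library. The names Buf, head_len, and demo are hypothetical and exist only for this illustration. It takes two equal-length i16 slices that start at different byte offsets and counts how many leading elements each has before its first 16-byte boundary:

```rust
use std::mem::size_of;

/// Hypothetical backing buffer: 64 `i16` values, start is 32-byte aligned.
#[repr(C, align(32))]
struct Buf([i16; 64]);

/// Number of leading elements before `s` reaches an `align`-byte boundary.
/// Assumes `align` is a power of two and a multiple of `size_of::<E>()`.
fn head_len<E>(s: &[E], align: usize) -> usize {
    let addr = s.as_ptr() as usize;
    let bytes_to_boundary = (align - addr % align) % align;
    (bytes_to_boundary / size_of::<E>()).min(s.len())
}

/// Head lengths of two equal-length slices for a 16-byte-aligned wide type.
fn demo() -> (usize, usize) {
    let buf = Buf([0; 64]);
    let a = &buf.0[0..32]; // starts exactly on a 16-byte boundary
    let b = &buf.0[1..33]; // starts 2 bytes past that boundary
    (head_len(a, 16), head_len(b, 16))
}
```

Here demo() returns (0, 7): a has no head at all, while b needs seven scalar elements before it reaches a boundary, so indexing both heads with the same range would go out of bounds.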

Lokathor avatar Oct 13 '25 07:10 Lokathor

Yeah you're right... it would have to assert that they start on the same alignment.
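
A rough sketch of what such a check might look like, assuming the wide type's alignment is what determines the split. Wide16 is a hypothetical stand-in for a 128-bit vector type, and same_alignment, start_offset, Buf, and demo are made-up names, not existing API:

```rust
use std::mem::align_of;

/// Hypothetical stand-in for a 128-bit SIMD type such as i16x8.
#[allow(dead_code)]
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Wide16([i16; 8]);

/// Hypothetical backing buffer whose start is 32-byte aligned.
#[repr(C, align(32))]
struct Buf([i16; 64]);

/// Byte offset of the slice start past the previous `W`-aligned address.
fn start_offset<E, W>(s: &[E]) -> usize {
    (s.as_ptr() as usize) % align_of::<W>()
}

/// True when all three slices would split into head/mid/tail at the same
/// element indices (given that they also have equal lengths).
fn same_alignment<E, W>(a: &[E], b: &[E], out: &[E]) -> bool {
    let off = start_offset::<E, W>(a);
    off == start_offset::<E, W>(b) && off == start_offset::<E, W>(out)
}

fn demo() -> (bool, bool) {
    let buf = Buf([0; 64]);
    let a = &buf.0[0..16];
    let b = &buf.0[8..24]; // 16 bytes after a: same offset modulo 16
    let c = &buf.0[1..17]; // 2 bytes after a: different offset
    (
        same_alignment::<i16, Wide16>(a, b, a),
        same_alignment::<i16, Wide16>(a, b, c),
    )
}
```

A generic function could call something like this up front and either assert on the result or fall back to a plain scalar loop when it is false.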

Things are moving more and more towards vector size agnostic code so I was thinking of some small steps that would make code less vector size dependent.

One other nice thing that would fit into this would be trait versions of the common non-arithmetic operations, such as abs, saturating_add, max, and min.
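
As a rough illustration of the idea (not a proposal for the actual API), the sketch below defines a hypothetical LaneOps trait and implements it for i16 and for plain arrays, which stand in for the real SIMD types here. The trait name, the clamp_add helper, and the choice of wrapping behaviour for abs are all assumptions:

```rust
/// Hypothetical trait; nothing here is existing crate API.
pub trait LaneOps: Copy {
    fn abs(self) -> Self;
    fn saturating_add(self, rhs: Self) -> Self;
    fn max(self, rhs: Self) -> Self;
    fn min(self, rhs: Self) -> Self;
}

impl LaneOps for i16 {
    // Assumption: wrap on i16::MIN instead of panicking.
    fn abs(self) -> Self { i16::wrapping_abs(self) }
    fn saturating_add(self, rhs: Self) -> Self { i16::saturating_add(self, rhs) }
    // Name `Ord` explicitly so the call does not resolve back to this trait.
    fn max(self, rhs: Self) -> Self { Ord::max(self, rhs) }
    fn min(self, rhs: Self) -> Self { Ord::min(self, rhs) }
}

/// Arrays stand in for wide types: each operation is applied lane by lane.
impl<T: LaneOps, const N: usize> LaneOps for [T; N] {
    fn abs(self) -> Self { self.map(|x| LaneOps::abs(x)) }
    fn saturating_add(self, rhs: Self) -> Self {
        std::array::from_fn(|i| LaneOps::saturating_add(self[i], rhs[i]))
    }
    fn max(self, rhs: Self) -> Self {
        std::array::from_fn(|i| LaneOps::max(self[i], rhs[i]))
    }
    fn min(self, rhs: Self) -> Self {
        std::array::from_fn(|i| LaneOps::min(self[i], rhs[i]))
    }
}

/// Written once, usable for a scalar or for any lane count.
fn clamp_add<V: LaneOps>(a: V, b: V, lo: V, hi: V) -> V {
    let sum = LaneOps::saturating_add(a, b);
    LaneOps::min(LaneOps::max(sum, lo), hi)
}
```

With this, clamp_add(30000i16, 10000, -100, 32000) and clamp_add([30000i16, -5], [10000, 2], [0, 0], [32000, 32000]) go through the same generic body, which is the kind of lane-count independence described above.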

mcroomp avatar Oct 13 '25 18:10 mcroomp