
Expand the variable bit rate options

nfrechette opened this issue 4 years ago · 15 comments

See this PR for original inspiration: https://github.com/nfrechette/acl/pull/348 by @ddeadguyy.

...I'm not sure if 20-24 would help all that much. Did you measure before/after? Did it help much with the memory footprint?

Take this with a huge grain of salt, considering I just disabled it until I figure out a regression test failure, but yes and yes. For example:

 complex character (229 joints) with several long joint chains

 short walk transition:
 bits 3-19 -> 29.08K
 bits 1-24 -> 27.33K

 long walk cycle:
 bits 3-19 -> 124.49K
 bits 1-24 -> 116.67K

When given the opportunity to pick 1 or 2 instead of 3, or 20-24 instead of 32, animations tend to shrink. There could be edge cases where picking 24 on a parent instead of 32 leads to picking higher num_bits for children, but it doesn't seem to be common for us.

 ...dropping the largest quaternion component instead of always W and possibly storing the component's sign to avoid the need to bias when interpolating during decompression...mix full width quaternions with variable bit rates...

Those sound like solid (and familiar) ideas, but I wouldn't mind trying to make those features co-exist with extended bit depths, if only in our local version of ACL. We'll see how that goes.

The hacks required to help 24-bit survive the unit tests are unfortunate. Tweaking how quantization works would remove all of that (see: ACL_PACKING). Perhaps I'll cap ACL_BIT_RATE at 23 for now.

The current variable bit rate options range from 3 to 19 bits per component, inclusive (0 and 32 bits are also used for special cases). This is currently stored within a uint8_t and uses the lower 5 bits, but the whole range of values isn't currently used. Considering that integers up to 24 bits can be losslessly represented by 32-bit floating-point values, we could expand the bit rates from 0 to 24 bits. This still wouldn't use the full 5-bit range, but it would allow for more options.
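
To make both points concrete, here is a small standalone sketch (not ACL code): it verifies that integers round-trip through a 32-bit float exactly up to 2^24 and shows that a 0-24 bit rate still fits in the lower 5 bits of a uint8_t. The packing shown is purely illustrative and not ACL's actual layout.

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // A 32-bit float has a 24-bit significand (23 stored bits + an implicit leading 1),
    // so every integer up to 2^24 round-trips exactly, but 2^24 + 1 does not.
    const uint32_t limit = 1u << 24; // 16777216
    const bool exact_at_limit = static_cast<uint32_t>(static_cast<float>(limit)) == limit;
    const bool exact_past_limit = static_cast<uint32_t>(static_cast<float>(limit + 1)) == limit + 1;
    std::printf("2^24 round-trips: %s\n", exact_at_limit ? "yes" : "no");     // yes
    std::printf("2^24+1 round-trips: %s\n", exact_past_limit ? "yes" : "no"); // no

    // Bit rates 0-24 need 5 bits; a hypothetical packing into the lower 5 bits of a
    // uint8_t leaves the upper 3 bits free for other uses.
    const uint8_t bit_rate = 24;
    const uint8_t packed = bit_rate & 0x1F;
    std::printf("packed bit rate: %d (upper 3 bits unused)\n", packed);
    return 0;
}
```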

I was hoping to reduce the bit rate count so it fits in 4 bits, freeing the 5th bit for other things later. However, that is very far down the road. Considering that this appears to lead to a ~10% memory reduction, it is worth exploring. I'll have to double-check and run the numbers over CMU, Paragon, etc.

This branch will be used for the development work: feat/expand-bit-rates

nfrechette avatar Apr 30 '21 01:04 nfrechette

I'm moving on to this next. I'll start with feat/expand-bit-rates, but it may require the changes from https://github.com/nfrechette/acl/issues/351, so we'll see. I'll omit 24 bits since https://github.com/nfrechette/acl/issues/352 didn't pan out. Once this is done, I'll close https://github.com/nfrechette/acl/pull/348, because every change will have been ported to 2.0.

ddeadguyy avatar Jun 07 '21 14:06 ddeadguyy

Progress report: Our worst edge case is a large creature with long bone chains and subtle movement. It compresses in 30 seconds in stock 2.0. That's too slow, and I have a plan to improve that before I expand bit rates. I made an acl.sjson version for isolated testing. As it turns out, it fails regression due to a worst-case error of nearly 2 meters, which makes sense considering the artifacts we've noticed in-game. I have a plan for that, too.

ddeadguyy avatar Jun 08 '21 17:06 ddeadguyy

Interesting. Usually large errors like that aren't visually noticeable and often end up as a result of mixing very large/small scale values and translations.

30 seconds sounds pretty slow indeed. Is this with the Highest compression level or with Medium? The compression level in ACL is similar in spirit to the one in zlib: it specifies how hard ACL should try to optimize things. Quality is unaffected, only compression time and compressed size. This was added specifically for this sort of scenario where long bone chains are present. Right now the game engine has to make this decision, but I'm hoping to introduce an Auto compression level that picks a reasonable option depending on what the clip looks like. See #356 for details.

Note that Highest isn't always smaller in size, but it generally is.
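
For reference, selecting the level from the engine side looks roughly like the sketch below. The identifier names (compression_settings, compression_level8, get_default_compression_settings) are assumed from the ACL 2.x headers and should be checked against the version in use; treat this as a hedged sketch rather than a definitive usage example.

```cpp
#include <acl/compression/compression_settings.h>

// Hedged sketch, not authoritative: the level only changes how hard the optimizer
// works (compression time and compressed size), never the error threshold itself.
acl::compression_settings make_settings(bool has_long_bone_chains)
{
    acl::compression_settings settings = acl::get_default_compression_settings();
    settings.level = has_long_bone_chains ? acl::compression_level8::highest
                                          : acl::compression_level8::medium;
    return settings;
}
```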

nfrechette avatar Jun 09 '21 01:06 nfrechette

We always leave it at Medium. I can try Highest, but it probably won't help much, since higher-than-tolerance error shows up even where the bone chain is short enough that every parent channel is either constant or default. The main issues seem to be the following:

  • It's a greedy algorithm that stops as soon as error(bit_rate) > error(bit_rate-1), even though higher bit_rates could help (see the sketch after this list). My plan is to consider the farthest-away descendants with the largest shells and smallest precisions when initializing bit rates, instead of basing it entirely on local settings. I'll try this first. I expect speed and precision improvements, but we'll see.
  • There isn't any error correction in ACL (LMK if I missed it). It isn't always possible for compression of the original local channel to get close enough to the target, especially in (but not limited to) long chains with static/default ancestors. I'm used to keeping animation poses in model space and compressing breadth-first: convert to local -> decay -> plug the result back into the model-space pose. Error is corrected at the first opportunity (aka the next animated channels in the chain). I'll try this second, if necessary, but there are range limit complications here.
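
To make the first bullet concrete, here is a heavily simplified sketch of that greedy stopping rule. The error evaluation is a toy stand-in, not ACL's error metric, and the function names are made up:

```cpp
#include <cmath>
#include <cstdio>
#include <functional>

// Greedy search as described above: stop as soon as increasing the bit rate stops
// reducing the error, even if a higher bit rate further along would help.
// 'measure_error' stands in for the per-channel error evaluation (hypothetical).
int pick_bit_rate_greedy(const std::function<float(int)>& measure_error,
                         int min_bit_rate, int max_bit_rate)
{
    int best_bit_rate = min_bit_rate;
    float best_error = measure_error(min_bit_rate);
    for (int bit_rate = min_bit_rate + 1; bit_rate <= max_bit_rate; ++bit_rate)
    {
        const float error = measure_error(bit_rate);
        if (error > best_error)
            break; // first local minimum found; higher bit rates are never tried
        best_error = error;
        best_bit_rate = bit_rate;
    }
    return best_bit_rate;
}

int main()
{
    // Toy error curve with a small bump at 8 bits: the greedy search stops at 7
    // even though 9+ bits would be strictly more accurate.
    auto toy_error = [](int bit_rate) {
        float e = std::exp(-0.4f * static_cast<float>(bit_rate));
        if (bit_rate == 8)
            e += 0.05f; // artificial bump to illustrate early stopping
        return e;
    };
    std::printf("picked bit rate: %d\n", pick_bit_rate_greedy(toy_error, 3, 19)); // 7
    return 0;
}
```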

Usually large errors like that aren't visually noticeable

Perhaps not within a segment, if the error within the segment is consistent. Sudden changes in that error between segments are a lot more noticeable. This happens to be a creature with long bone chains lying on the ground, which makes both more obvious.

ddeadguyy avatar Jun 09 '21 14:06 ddeadguyy

FWIW, Highest takes 272 seconds, uses more memory, and still has a similar worst-case error.

ddeadguyy avatar Jun 09 '21 18:06 ddeadguyy

That is something I haven't seen so far. I think of all the clips I have, the one that takes the longest to compress with Medium takes only a few seconds. Out of curiosity, how many joints do you have in your longest chain, or how many long chains do you have?

It's a greedy algorithm that stops as soon as error(bit_rate) > error(bit_rate-1), even though higher bit_rates could help. My plan is to consider the farthest-away descendants with the largest shells and smallest precisions when initializing bit rates, instead of basing it entirely on local settings. I'll try this first. I expect speed and precision improvements, but we'll see.

That is one option I considered, see #373 which is along the lines of what you describe.

There isn't any error correction in ACL (LMK if I missed it). It isn't always possible for compression of the original local channel to get close enough to the target, especially in (but not limited to) long chains with static/default ancestors. I'm used to keeping animation poses in model space and compressing breadth-first: convert to local -> decay -> plug the result back into the model-space pose. Error is corrected at the first opportunity (aka the next animated channels in the chain). I'll try this second, if necessary, but there are range limit complications here.

ACL doesn't currently do error compensation for a few reasons:

  • Time constraints and it being low priority for me for the time being
  • Lack of real world data that shows the technique works
  • The underwhelming performance of error compensation in UE4 (see here)

I'm not a huge fan of error compensation; I'm not convinced it really performs well in practice. It is also highly likely to slow down compression significantly.

Perhaps not within a segment, if the error within the segment is consistent. Sudden changes in that error between segments are a lot more noticeable. This happens to be a creature with long bone chains lying on the ground, which makes both more obvious.

That is very true, but if that is the case, it means we fail to assign the full-precision bit rate when we should in order to meet our precision target. ACL generally succeeds in meeting the precision threshold provided, but it can fail as a result of dropping the quaternion's W component, which adds more error than I'd like. Switching to dropping the largest component would improve things, but the true fix to always meet the precision threshold is to support full-width quaternions mixed with variable-bit-rate packed quaternions. I haven't gotten to that part yet, but it's on the roadmap somewhere. In the short term, I would suggest tuning the settings to be more conservative, but if the bone chain is very long, it may not help much.
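
For readers unfamiliar with the technique being discussed, below is a minimal sketch of the standard "drop the largest component" (smallest-three) quaternion packing. It is illustrative only and not ACL's packing code; negating the quaternion so the dropped component is non-negative is what removes the need to store its sign.

```cpp
#include <cmath>
#include <cstdio>

struct Quat { float x, y, z, w; };

// Pack: find the largest-magnitude component, flip the quaternion's sign if that
// component is negative (q and -q encode the same rotation), then store the other
// three components plus a 2-bit index identifying the dropped one.
void pack_smallest_three(const Quat& q, float out_three[3], int& out_dropped_index)
{
    const float c[4] = { q.x, q.y, q.z, q.w };
    int largest = 0;
    for (int i = 1; i < 4; ++i)
        if (std::fabs(c[i]) > std::fabs(c[largest]))
            largest = i;

    const float sign = c[largest] < 0.0f ? -1.0f : 1.0f;
    int j = 0;
    for (int i = 0; i < 4; ++i)
        if (i != largest)
            out_three[j++] = c[i] * sign;
    out_dropped_index = largest;
}

// Unpack: the dropped component is recovered as sqrt(1 - x^2 - y^2 - z^2) and is
// always non-negative by construction, so no sign bit or interpolation bias is needed.
Quat unpack_smallest_three(const float three[3], int dropped_index)
{
    const float sq = three[0] * three[0] + three[1] * three[1] + three[2] * three[2];
    const float dropped = std::sqrt(std::fmax(0.0f, 1.0f - sq));
    float c[4];
    int j = 0;
    for (int i = 0; i < 4; ++i)
        c[i] = (i == dropped_index) ? dropped : three[j++];
    return Quat{ c[0], c[1], c[2], c[3] }; // may be -q of the input; same rotation
}

int main()
{
    const Quat q{ 0.1f, 0.7f, 0.1f, -0.7f }; // normalized; y and w tie for largest magnitude
    float three[3];
    int dropped = 0;
    pack_smallest_three(q, three, dropped);
    const Quat r = unpack_smallest_three(three, dropped);
    std::printf("dropped %d -> %.3f %.3f %.3f %.3f\n", dropped, r.x, r.y, r.z, r.w);
    return 0;
}
```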

Quality/speed discussions would probably be best moved to separate issues of their own but that being said, another way to speed up compression could be to try something like #374. I have plans to make compression more greedy and aggressive for medium/high/highest (with the current medium behavior moving to low). Dropping bit rates could be something low/lowest do when speed is more important.

nfrechette avatar Jun 11 '21 01:06 nfrechette

how many joints do you have in your longest chain or how many long chains do you have?

246 joints, longest chain is 50 joints, average chain is ~19 joints. The compression time for the worst animation (lying on the ground + subtle movement) is 30 seconds. Average is ~10 seconds. Perhaps I'll make acl.sjsons for all of them, to see if any pass regression.

It would be difficult to justify expanding bit rates without speeding up decompression first, so I'll continue to work within feat/expand-bit-rates, unless there's a better place for it. Either way, I'll post my findings on https://github.com/nfrechette/acl/issues/373.

ddeadguyy avatar Jun 11 '21 15:06 ddeadguyy

It would be difficult to justify expanding bit rates without speeding up decompression first, so I'll continue to work within feat/expand-bit-rates, unless there's a better place for it. Either way, I'll post my findings on #373.

You mean speed up compression, right? I imagine the extra bit rates don't slow down decompression much; at worst, it might imply one or two more cache misses for the whole pose (for the constants).

How long is compression with ACL 1.3/2.0 stock and how long with the extra bit rates?

There isn't much low-hanging fruit left to optimize compression. By working per segment, everything remains in L1 (and certainly in L2), so it is incredibly cache efficient. We don't do much unnecessary work either. Multithreading could easily be introduced, but it won't give gains if you already do other things while compression occurs (e.g. compressing multiple clips in parallel or packing other data as part of some cook build). It would make single-clip compression much faster, but not wider usage.

I tried fancier things like reworking the in-memory representation to use a Structure of Arrays format for even more efficient SIMD, but the performance gains were very minimal and the code complexity was much higher.

The easiest way to speed up compression at this point would be to use the compression level argument to do less work on the lower levels, possibly by trying fewer bit rates.
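
As a purely hypothetical illustration of that last idea (this is not something ACL does today), a lower compression level could simply probe a sparser set of candidate bit rates:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: lower levels probe a coarse subset of bit rates, trading
// compressed size for compression speed; medium and above sweep the full 3-19 range.
enum class compression_level { lowest, low, medium, high, highest };

std::vector<uint8_t> candidate_bit_rates(compression_level level)
{
    switch (level)
    {
    case compression_level::lowest: return { 3, 7, 11, 15, 19 };
    case compression_level::low:    return { 3, 5, 7, 9, 11, 13, 15, 17, 19 };
    default:
    {
        std::vector<uint8_t> all; // exhaustive sweep of the current 3-19 range
        for (uint8_t bit_rate = 3; bit_rate <= 19; ++bit_rate)
            all.push_back(bit_rate);
        return all;
    }
    }
}
```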

nfrechette avatar Jun 12 '21 14:06 nfrechette

You mean speed up compression, right?

Yeah, too late for me to fix that typo now, but definitely compression.

How long is compression with ACL 1.3/2.0 stock and how long with the extra bit rates?

IIRC, 1.3.5 without extra bit rates was 60 seconds. 1.3.5 with extra bit rates was 90 seconds. 2.0 without extra bit rates is 30 seconds. I haven't added extra bit rates to 2.0 yet.

There isn't much low-hanging fruit left to optimize compression...We don't do much unnecessary work either.

Assuming the algorithm (especially bone chain permutation) doesn't change, that might be correct, but I'm optimistic about ideas like https://github.com/nfrechette/acl/issues/373.

ddeadguyy avatar Jun 12 '21 20:06 ddeadguyy

Those numbers are consistent with my expectations. ACL 2.0 is about 2x faster to compress than 1.3. It's still slower than I'd like for your clip though. Is there any chance you could share it with me (just the sjson/raw clip file) for regression testing/inspection? I'd be happy to sign an NDA if required.

From the feedback I've gotten so far, my long-term goal with compression performance is to add an 'auto' compression level that picks the best option based on the actual clip being compressed. The overwhelming majority of clips are simple and short and could benefit greatly from very aggressive compression, since they take <50ms to compress. More exotic clips compress more slowly, but because they are so rare, it is generally fine if they don't compress as well, as long as we maintain a decent compression time. I'll do my best to optimize the code, but high-level decisions are really what drive the cost: things like how many bit rates we try, how many joints we permute, etc.

I should also mention that over the years, I've given a LOT of thought to finding an algorithm for the globally optimal bit rate permutation, and everything I have come up with so far would be dramatically slower than what we have now. The only way to speed it up further would be to leverage the GPU, and that's a whole other ballgame... I hope that someday I get the chance to implement all of this, even if only to validate how close the current approximation is (if it turns out to be too slow to be practical).

nfrechette avatar Jun 14 '21 01:06 nfrechette

Is there any chance you could share it with me (just the sjson/raw clip file) for regression testing/inspection? I'd be happy to sign an NDA if required.

I just checked; unfortunately, that won't be possible. In any case, stay tuned to https://github.com/nfrechette/acl/issues/373 for updates.

ddeadguyy avatar Jun 15 '21 15:06 ddeadguyy

It's fair to say that ACL_COMPRESSION_OPTIMIZED stole most of the thunder in https://github.com/nfrechette/acl/pull/376, but I don't mind. A 2-3% reduction in memory with minimal cost is still worth it, at least for our version of ACL. YMMV.

ddeadguyy avatar Jun 23 '21 20:06 ddeadguyy

@ddeadguyy I apologize for the massive delay regarding this. It turns out that the bind pose stripping took significantly more time to properly integrate into Unreal Engine. I had to update everything for UE5, and it required engine changes as well, which have recently been merged in. It also took 7 months, way longer than anticipated, to integrate regression tests into CI.

That being said, that's all out of the way now and I'll be rebasing the branch for this on latest develop shortly. Once that's done and a few minor things are sorted out, I'll look at reviewing these changes, running the regression tests, and comparing before/after.

Stay tuned and thank you for your patience!

nfrechette avatar Sep 20 '22 02:09 nfrechette

TODO list before integration into develop branch:

  • [x] Rebase on latest develop
  • [x] Review code, add comments, ensure coding style is maintained
  • [x] Remove new defines like ACL_BIT_RATE_EXPANSION
  • [x] Ensure backwards compatibility with bit rates during decompression (old bit rates must be used for 2.0)
  • [x] Run and validate with -unit_test and -regression_test
  • [x] Run, validate, and measure against CMU
  • [x] Run, validate, and measure against Paragon
  • [x] Run, validate, and measure against Fight scene

nfrechette avatar Sep 20 '22 02:09 nfrechette

Compression stats

Baseline develop

| Data Set | Compressed Size | Compression Speed | Error (99th percentile) |
| --- | --- | --- | --- |
| CMU | 72.14 MB | 13055.47 KB/sec | 0.0089 cm |
| Paragon | 208.72 MB | 10243.11 KB/sec | 0.0098 cm |
| Matinee Fight | 8.18 MB | 16419.63 KB/sec | 0.0201 cm |

Without ACL_COMPRESSION_OPTIMIZED

| Data Set | Compressed Size | Compression Speed | Error (99th percentile) |
| --- | --- | --- | --- |
| CMU | 71.43 MB (-1.0%) | 11698.63 KB/sec (0.9x) | 0.0089 cm |
| Paragon | 204.14 MB (-2.2%) | 8521.00 KB/sec (0.8x) | 0.0097 cm |
| Matinee Fight | 8.19 MB (+0.1%) | 13101.82 KB/sec (0.8x) | 0.0098 cm |

With ACL_COMPRESSION_OPTIMIZED

| Data Set | Compressed Size | Compression Speed | Error (99th percentile) |
| --- | --- | --- | --- |
| CMU | 65.83 MB (-8.7%) | 34682.23 KB/sec (2.7x) | 0.0088 cm |
| Paragon | 184.41 MB (-11.6%) | 20858.25 KB/sec (2.0x) | 0.0088 cm |
| Matinee Fight | 8.11 MB (-0.9%) | 17097.23 KB/sec (1.0x) | 0.0092 cm |

nfrechette avatar Oct 01 '22 02:10 nfrechette

I moved #373 into this milestone, as your code indeed does most of what I suggested. I spotted a few edge cases to improve things and I'll also clean up the code. This feature should also make it possible to remove the constant sub-track thresholds in the compression settings, since we can now approximate the error directly in object space using this trick and use the error metric for constant track detection and default track detection. I'll create a separate issue for this to be done later, but I'm excited; I've been trying to remove those for many years and only now realized we'll be able to do so by pushing this feature just a bit further.

nfrechette avatar Nov 27 '22 23:11 nfrechette

Progress is ongoing.

So far, I've cleaned up the dominant transform shell distance usage (approximating the object-space error in local space by using the sum of the children joints' shell distances), and it yields a good gain over baseline. I also fixed it to account for precision properly.
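
Loosely, the idea can be sketched as a bottom-up pass along the joint hierarchy, as in the rough illustration below. This is only one interpretation of "sum of children joints"; the structure and field names are made up and are not ACL's actual code.

```cpp
#include <cstdint>
#include <vector>

struct JointInfo
{
    int16_t parent_index;  // -1 for root joints
    float shell_distance;  // virtual vertex distance used by the error metric
};

// Each joint's "dominant" shell distance also accounts for its descendants, so an
// error measured in local space approximates the object-space error felt at the
// farthest children. Assumes parents are stored before their children.
std::vector<float> compute_dominant_shell_distances(const std::vector<JointInfo>& joints)
{
    std::vector<float> dominant(joints.size());
    for (size_t i = 0; i < joints.size(); ++i)
        dominant[i] = joints[i].shell_distance;

    // Visit children before parents and push each subtree's contribution upward.
    for (size_t i = joints.size(); i-- > 0;)
    {
        const int16_t parent = joints[i].parent_index;
        if (parent >= 0)
            dominant[static_cast<size_t>(parent)] += dominant[i];
    }
    return dominant;
}
```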

I've cleaned up and removed the constant sub-track error thresholds and now use the error metric to detect them, which is much cleaner and more robust.

I've also cleaned up the code that modifies the raw data following constant sub-track folding, to make sure the lossy data can reach the raw values when optimizing the bit rates.

Error compensation is trickier so far; I am getting mixed results. Sometimes it helps, sometimes it doesn't. I've tried to tweak it in various ways but can't consistently make it win. Overall, it can hurt compressed size quite a bit in certain circumstances (Paragon is 2% larger with error compensation). I'll continue to try various things, but if I fail, I'll likely keep the code commented out by default (with comments explaining why) and revisit it later. The fact that error compensation doesn't work in aggregate is also consistent with what I observed with Unreal Engine's implementation (see here). In the end, we may have to compress with and without it and pick the best result (possibly only when a high enough compression level is used, to keep compression cost under control).

nfrechette avatar Dec 17 '22 03:12 nfrechette

After trying many different options, I decided to disable the feature for now as mentioned previously.

Here is what I tried:

  • Your original code
  • Using the error metric to ensure we pick values that strictly improve accuracy
  • Compensating for the virtual vertices we skin as well using the error metric
  • Using 64-bit arithmetic to compute some of the transforms to ensure rounding and precision loss weren't causing issues (it helps a bit, but not enough)

I will leave your original code commented out in case you wish to revisit it later.

nfrechette avatar Jan 03 '23 03:01 nfrechette