CTDB (CUETools Database) error correction

Open Korkman opened this issue 1 year ago • 4 comments

CUETools has a great verification and error correction database which can be used to fix lightly damaged rips. Integrating this into fre:ac would be a major feature.

CTDB only works for whole discs, so integration would require fre:ac to implement a "whole disc mode". In its simplest form, I would imagine an indicator lit green after importing tracks from a CD until the joblist is modified.

fre:ac would then submit new samples to CTDB and offer repair attempts for damaged discs. The whole feature would be turned on or off just like AccurateRip in settings.

Korkman, Nov 10 '22 21:11

CTDB verification on its own would be a nice enhancement. I've stumbled across several CDs that aren't in the AccurateRip database, but are in the CTDB.

The documentation is pretty much nonexistent, but I have implemented CTDB verification in another program so can hopefully save you a lot of headaches!

In short, here are the steps:

  1. Reformat the TOC for lookup;
  2. Download the XML lookup data for said TOC;
  3. Parse out all the possible track CRCs and their confidence levels from the XML;
  4. Compute CRCs for each track at various offsets and match them against all possible CRCs for the track;
  5. Add up the confidence levels for each matching CRC (there can be more than one per track) and print the results!

TOC

The lookup TOC can be constructed from the CDTOC metadata fre:ac is already computing. It is simply a colon-separated list of each track's starting position (including data tracks and the leadout), relative to the start of the first track. Data tracks get prefixed with a "-".

For example:

// Four audio tracks.
CDTOC: 4+B6+F271+15798+22362+2CC18
CTDB TOC: 0:61883:87778:139948:183138

// Ten audio tracks and one data track.
CDTOC: B+96+3757+696D+C64F+10A13+14DA2+19E88+1DBAA+213A4+2784E+2D7AF+36F11
CTDB TOC: 0:14017:26839:50617:67965:85260:105970:121620:135950:161720:-186137:224891
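The conversion above could be sketched in Python roughly as follows. The function name and the data_tracks parameter are hypothetical. The sketch assumes the CDTOC string is a "+"-separated list of hexadecimal values (track count, each track start, then the leadout), as in the two examples, and that the caller already knows which track numbers are data tracks.

```python
def cdtoc_to_ctdb_toc(cdtoc: str, data_tracks=frozenset()) -> str:
    """Convert a CDTOC string to a CTDB lookup TOC (hypothetical helper).

    data_tracks holds the 1-based numbers of any data tracks; this sketch
    does not try to detect them from the CDTOC string itself.
    """
    fields = [int(part, 16) for part in cdtoc.split("+")]
    track_count, offsets = fields[0], fields[1:]
    if len(offsets) != track_count + 1:
        raise ValueError("expected one offset per track plus the leadout")

    first = offsets[0]
    parts = []
    for number, offset in enumerate(offsets, start=1):
        relative = offset - first  # positions are relative to track one
        prefix = "-" if number in data_tracks else ""
        parts.append(f"{prefix}{relative}")
    return ":".join(parts)


print(cdtoc_to_ctdb_toc("4+B6+F271+15798+22362+2CC18"))
```

For the second example, passing data_tracks={11} marks the eleventh track as data.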

XML

The lookup URL (XML) is: http://db.cuetools.net/lookup2.php?version=3&ctdb=1&fuzzy=1&toc=THE_TOC_HERE

Each "pressing" exists as an <entry> child tag with two relevant attributes: confidence holds the confidence level, and trackcrcs holds the individual audio track CRCs. The latter value is a space-separated list of checksums in track order (track1 track2 track3…).
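A minimal Python sketch of building the lookup URL and pulling the entries out of the response. The function names are hypothetical, and two things are assumptions rather than facts from this write-up: that the trackcrcs values are hexadecimal, and that the <entry> tags may sit inside an XML namespace (the code ignores any namespace so it works either way).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

LOOKUP_URL = "http://db.cuetools.net/lookup2.php"


def build_lookup_url(toc: str) -> str:
    """Build the lookup URL for a CTDB TOC string."""
    query = {"version": 3, "ctdb": 1, "fuzzy": 1, "toc": toc}
    # Keep ":" readable in the TOC instead of percent-encoding it.
    return LOOKUP_URL + "?" + urlencode(query, safe=":-")


def parse_entries(xml_text: str):
    """Return a list of (confidence, [track CRCs]) tuples, one per <entry>."""
    entries = []
    for element in ET.fromstring(xml_text).iter():
        # Strip any namespace so "{ns}entry" and "entry" both match.
        if element.tag.rsplit("}", 1)[-1] != "entry":
            continue
        crcs_attr = element.get("trackcrcs")
        if not crcs_attr:
            continue
        # Assumption: each checksum is written in hexadecimal.
        crcs = [int(value, 16) for value in crcs_attr.split()]
        entries.append((int(element.get("confidence", "0")), crcs))
    return entries
```

Fetching the URL is left out on purpose; any HTTP client will do.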

There are separate whole-disc CRCs, but I would recommend ignoring them because they A) are all-or-nothing and B) will cause you to undercount confidence levels. (Some individual tracks may match in multiple pressings while others may not.)

CRCs

CTDB CRCs are computed against raw PCM data (44.1 kHz, 16-bit, stereo), excluding file headers. (For a regular .wav file, that means stripping off the first 44 bytes.)

  • The first track CRC ignores the leading 23,520 bytes.
  • The last track CRC ignores the trailing 23,520 to 47,040 bytes (the exact truncation depends on the total album length). You can simply brute-force it, though, in steps of 588 bytes, until you find a match or reach the end of that range.
  • The middle track CRCs cover the full range of the song.
  • If a track is the only track on the album, it loses both the leading and trailing portions the first and last tracks would.
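The four rules above can be sketched in Python like this. The function name is hypothetical, and using zlib.crc32 assumes CTDB's track checksum is the standard CRC32, which is how this write-up describes it. For the last track the sketch returns one candidate per possible trim, to be compared against the database values.

```python
import zlib

EDGE = 23_520    # bytes ignored at the start of the first track
TRIM_STEP = 588  # step used when brute-forcing the last track's trim


def zero_offset_crcs(pcm: bytes, is_first: bool = False, is_last: bool = False):
    """Return candidate CRC32s for one track's raw PCM at offset zero.

    Most tracks yield a single value; the last track yields one value
    for each possible trailing trim between EDGE and 2 * EDGE bytes.
    """
    start = EDGE if is_first else 0
    if not is_last:
        return [zlib.crc32(pcm[start:])]

    candidates = []
    for trim in range(EDGE, 2 * EDGE + 1, TRIM_STEP):
        end = len(pcm) - trim
        if end <= start:
            break  # track too short for this trim
        candidates.append(zlib.crc32(pcm[start:end]))
    return candidates
```

A single-track album is handled by passing both is_first=True and is_last=True.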

The first and simplest offset to match is zero, in which case you just generate CRC32 values for each track following the above rules. This works (if it works) regardless of whether one is ripping the whole album or only part of it.

For optimal matching (whole-disc rips), the calculations should be repeated with the track boundaries shifted by 4 to 23,520 bytes in both directions, in steps of 4 bytes. The CRCs cover the same total number of bytes as before, but the ranges start and end a little earlier or later, incorporating the end of the previous track or start of the next one.

The ignored regions for the first and last tracks still apply when shifting offsets, but will be smaller when the offset runs in the same direction as the truncation. For example, if shifting track one to the left four bytes, you'd only be ignoring the leading 23,516 bytes of the song, since the other four ignored bytes are part of the imaginary area before it begins.
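One way to picture the shifting, sketched in Python with hypothetical names: treat the whole disc as one PCM byte string, work out the range a track's CRC covers at offset zero, then slide that range by the offset. Because the first track already skips 23,520 bytes and the last track trims at least that much, the shifted range stays inside the disc. This version recomputes every CRC from scratch, so it is slow and only meant to show the ranges; zlib.crc32 is again assumed to be the right checksum.

```python
import zlib

EDGE = 23_520  # maximum shift, and the first-track skip, in bytes


def shifted_window(start, end, offset, is_first=False, is_last=False,
                   tail_trim=EDGE):
    """Return the (begin, stop) byte range a track's CRC covers at an offset.

    start and end are the track's boundaries within the whole-disc PCM.
    """
    begin = start + (EDGE if is_first else 0)
    stop = end - (tail_trim if is_last else 0)
    return begin + offset, stop + offset


def find_matching_offsets(disc, start, end, wanted, **edge_flags):
    """Yield each offset in ±EDGE (4-byte steps) whose CRC32 is in wanted."""
    for offset in range(-EDGE, EDGE + 1, 4):
        begin, stop = shifted_window(start, end, offset, **edge_flags)
        if begin < 0 or stop > len(disc):
            continue  # range would leave the available audio
        if zlib.crc32(disc[begin:stop]) in wanted:
            yield offset
```

For instance, shifted_window(0, 100000, -4, is_first=True) gives (23516, 99996), matching the track-one example above.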

All this offset-checking is computationally intensive, so I'd definitely recommend employing a CRC-combining algorithm to lighten the load.

For example, you could keep the leading and trailing raw bytes of each track in memory (to reference later), but pre-compute CRC32s for the unchanging middle portions. Then when you're shifting, you'd only have to re-sum the relatively small leading/trailing portions, and combine those CRCs with the pre-computed middles.

For most tracks, you'd only need to save the first and last 23,520 bytes. The exceptions are the first and last tracks, which need to incorporate the ignorable regions too. (In other words, the first 47,040 bytes for the first track, and the last 70,560 bytes for the last track.)
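A sketch of CRC combining in Python. crc32_combine below is a port of the technique used by the crc32_combine function in the zlib C library, written out here because Python's zlib module does not appear to expose it. window_crc is a hypothetical helper showing how a pre-computed middle CRC could be reused while only the small head and tail pieces are re-summed.

```python
import zlib


def _times(matrix, vector):
    """Multiply a GF(2) matrix (list of 32 ints) by a 32-bit vector."""
    total, row = 0, 0
    while vector:
        if vector & 1:
            total ^= matrix[row]
        vector >>= 1
        row += 1
    return total


def _square(matrix):
    """Square a GF(2) matrix."""
    return [_times(matrix, row) for row in matrix]


def crc32_combine(crc1: int, crc2: int, len2: int) -> int:
    """Return the CRC32 of A+B given crc32(A), crc32(B) and len(B)."""
    if len2 <= 0:
        return crc1
    odd = [0xEDB88320] + [1 << n for n in range(31)]  # operator: 1 zero bit
    even = _square(odd)   # 2 zero bits
    odd = _square(even)   # 4 zero bits
    while True:
        even = _square(odd)          # first pass: 1 zero byte
        if len2 & 1:
            crc1 = _times(even, crc1)
        len2 >>= 1
        if not len2:
            break
        odd = _square(even)
        if len2 & 1:
            crc1 = _times(odd, crc1)
        len2 >>= 1
        if not len2:
            break
    return crc1 ^ crc2


def window_crc(head: bytes, middle_crc: int, middle_len: int, tail: bytes) -> int:
    """CRC32 of head + middle + tail, reusing the middle's stored CRC."""
    crc = crc32_combine(zlib.crc32(head), middle_crc, middle_len)
    return zlib.crc32(tail, crc)  # continue the running CRC over the tail
```

With this, each offset only costs a CRC over the head and tail bytes plus one combine, instead of a pass over the whole track.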

Confidence

As previously mentioned, a given track may match multiple CRCs at different offsets, so you'll need to add up all the matching confidence scores to arrive at the actual score for each track.

Other than that, the main thing to note is that you should only consider a rip "accurate" if the confidence is 2 or higher.
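The tallying can be sketched in Python as below. Both function names are hypothetical. entries is assumed to be a list of (confidence, [track CRCs]) tuples parsed from the lookup XML, and computed_crcs the set of CRC32 values produced for one track across all tested offsets.

```python
def track_confidence(entries, computed_crcs, track_index):
    """Sum the confidence of every entry whose CRC for this track matched.

    track_index is 0-based and counts audio tracks only.
    """
    total = 0
    for confidence, crcs in entries:
        if track_index < len(crcs) and crcs[track_index] in computed_crcs:
            total += confidence
    return total


def is_accurate(confidence: int) -> bool:
    """Apply the rule of thumb above: a score of 2 or more counts as accurate."""
    return confidence >= 2


entries = [(3, [10, 20]), (1, [11, 20]), (2, [10, 21])]
print(track_confidence(entries, {10, 11}, 0))
```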

Unlike AccurateRip, CTDB submissions are published immediately and require no secondary confirmation. Someone may end up matching their own earlier rip made with a separate program if the value is only one. :wink:

End of Knowledge

Well, that's as far into the rabbit hole as I've gone! I don't write C/C++ so can't contribute directly, but hopefully this write-up will help whoever ends up undertaking the enhancement!

EDIT: Widening the offset checksum search from ±11,760 to ±23,520 yields additional matches in some rare cases. I updated the write-up accordingly.

joshstoik1, Nov 27 '22 06:11

Thank you very much for this writeup, @joshstoik1! This will be very helpful when adding support for CTDB.

So there are track CRCs in CTDB actually... I used to think there weren't and that you'd always have to rip the whole disc to make use of CTDB verification. I guess that only applies to error correction then? That's good to know and may even speed up implementation of basic verify-only CTDB support as a first step.

enzo1982, Nov 27 '22 15:11

Yeah, I had thought CTDB was whole-disc-only too at first, but I guess that changed somewhere along the way.

You don't even really need to rip the whole disc for the offset-shifting sums. If someone only wants to rip track three, say, you could silently grab the 23,520 bytes immediately before and after, run the shifting tests, then throw them away. (Most of the time those bytes are all null, so you may not even need that much!)

joshstoik1, Nov 27 '22 16:11

It would also be nice to be able to verify files on the hard disk and generate a report like EAC does. This is one feature I am missing on Linux. Command-line support for this would be great for use in scripts.

fidoriel, Mar 24 '23 07:03