Please provide an option to not depend on downloading model data

Open bbhtt opened this issue 10 months ago • 10 comments

We build opus from git source and would like to avoid switching to release tarballs or downloading the model tarball. Given the recent xz situation, downloading tarballs that leave no trace in git is complicated and seems like a security issue.

An option to stop depending on that download would be great.

With the current meson configuration:

-Dfloat-api=true 
-Dasm=disabled 
-Dhardening=true 
-Dcustom-modes=true 
-Denable-deep-plc=true 
-Denable-dred=true 
-Denable-osce=true 
-Ddocs=enabled 
-Dextra-programs=disabled 
-Dtests=disabled 
-Dintrinsics=enabled 
-Drtcd=enabled

meson fails with "meson.build:636:24: ERROR: File dnn/fargan_data.h does not exist." and is not even aware of the model download step that lives in autogen.sh.

Meanwhile, autogen.sh passes an unknown commit to the model download script: https://github.com/xiph/opus/blob/ab4e83598e7fc8b2ce82dc633a0fc0c452b629aa/autogen.sh#L12C24-L12C31

What is the source of that commit? Can this model data be found on git somewhere? Can this be implemented as a git submodule?

Looking through the git history, it seems that at some point there was a submodule, which was reverted in favour of downloading the model tarball.

Thanks!

bbhtt, Apr 09 '24 17:04
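For anyone building without network access, the failure described above can be caught before meson runs. The following is a minimal sketch, not part of the opus build system: the function name `missing_model_files` and the `REQUIRED_MODEL_FILES` tuple are hypothetical, and the only path it knows about is dnn/fargan_data.h, the single file named in the error message.

```python
from pathlib import Path

# Only dnn/fargan_data.h is named in the meson error quoted above.
# Assumption: other generated model files may be needed too; extend this
# tuple with whatever your own build reports as missing.
REQUIRED_MODEL_FILES = ("dnn/fargan_data.h",)


def missing_model_files(source_root, required=REQUIRED_MODEL_FILES):
    """Return the required files that are absent under source_root."""
    root = Path(source_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Usage: python check_models.py /path/to/opus/checkout
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_model_files(root)
    for rel in missing:
        print(f"missing model file: {rel}", file=sys.stderr)
    sys.exit(1 if missing else 0)
```

Run against the source checkout, it exits non-zero when the model data has not been unpacked yet, which gives a clearer message than the meson error at configure time.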

Hello, also why does the model tarball include ~130 MB of binary model files which the opus tarballs do not?

Also, some of the headers are considerably larger in the model tarball than in the opus tarballs.

Model tarball:

[Screenshot from 2024-04-09 23-46-38]

Opus tarball:

[Screenshot from 2024-04-09 23-47-55]

bbhtt, Apr 09 '24 18:04

In the model tarball downloaded from git, the models include all float weights (e.g. for debugging), whereas for the Opus release tarball, the float data was removed to make the release as small as possible.

jmvalin, Apr 09 '24 18:04

What do you mean by downloading from git? The script downloads from here: https://media.xiph.org/opus/models/

Also, it makes it really hard to know which model tarball to update to, because the expected tarball commit appears only in the script. If you are building in an environment without network access, you need to know which filename to download beforehand.

> the models include all float weights (e.g. for debugging), whereas for the Opus release tarball, the float data was removed to make the release as small as possible.

Is there any documentation on how to do this? Knowing the source of the model data would be great too. How is the model tarball produced?

Thanks

bbhtt, Apr 09 '24 18:04

You can run scripts/shrink_model.sh to remove the debug float data from the extracted models. From there, you can just run "make dist" to build a tarball that has just what's needed.

jmvalin, Apr 09 '24 20:04

I also ran into this issue. Could you add the shrunken model data to the opus git repository, please, so that they're covered by your signed tags and opus can be built without any further downloads? The model tarballs are not an option as they're entirely unverified (and also very large).

heftig, Apr 18 '24 14:04

Maybe you can also provide an option to disable this new feature.

Erick555, Apr 18 '24 17:04

The newer models are just too big for git. That being said, the main branch has been updated to verify them so they cannot be compromised on the download server.

jmvalin, Apr 23 '24 00:04
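The thread does not show how the main branch verifies the downloaded models, so the following is only a generic sketch of the idea: hash the downloaded tarball and compare it with a digest that is pinned in the signed git repository. SHA-256 is an assumption here, and `sha256_of_file` and `verify_file` are hypothetical helper names, not functions from the opus scripts.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large tarball is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True only if the file's SHA-256 matches the pinned digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Because the expected digest comes from the repository rather than from the download server, a tarball that was tampered with on the server would fail the comparison instead of being built.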

Looks like the process is now more or less documented. Would it still be possible to have this checksum in a separate file, so that it wouldn't need to be parsed out of a shell script?

nanonyme, Apr 26 '24 20:04

I'm not against it, but I don't have the skills to update the Windows version of the script. @mklingb any thoughts on that?

jmvalin, Apr 27 '24 02:04

The batch file would probably only become simpler with the checksum in a separate file, since it would no longer have to parse the shell script.

nanonyme, Apr 27 '24 08:04
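To illustrate why a separate checksum file is easier for every consumer (shell script, batch file, or a distribution's packaging tooling), here is a rough sketch of a parser for such a file. The format is an assumption: one line per entry in the style of `sha256sum` output, that is, a 64-character hex digest optionally followed by a filename. The function name `parse_checksum_file` and the example filename are hypothetical; nothing here reflects a file that exists in the opus repository.

```python
import re

# Assumed line format: "<64 hex chars>[  [*]<filename>]", as printed by sha256sum.
_LINE = re.compile(r"^([0-9a-fA-F]{64})(?:\s+\*?(\S.*))?$")


def parse_checksum_file(text):
    """Map filename -> lowercase digest; a bare digest is stored under None."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"malformed checksum line: {raw!r}")
        entries[match.group(2)] = match.group(1).lower()
    return entries
```

With the digest kept in a data file like this, a packager building offline can read the expected value directly instead of extracting it from script source.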