
update fairseq version

Open sravyapopuri388 opened this issue 2 years ago • 3 comments

We created a hub model here https://huggingface.co/facebook/xm_transformer_s2ut_800m-en-hk-h1_2022 to support our English-to-Hokkien translation model, and also pushed some changes to fairseq to get the model working. Could you update the HF version of fairseq so we can test this model? Thanks so much in advance!

sravyapopuri388 avatar Jul 28 '22 17:07 sravyapopuri388

I changed this line to point to the new model and the test fails: https://github.com/huggingface/api-inference-community/blob/main/docker_images/fairseq/tests/test_api.py#L12. The inference code does not work for this model. My understanding from https://huggingface.co/facebook/xm_transformer_s2ut_800m-es-en-st-asr-bt_h1_2022 is that inference is more involved for this model, requiring the use of fastspeech as well.

osanseviero avatar Jul 29 '22 08:07 osanseviero

Actually, I misread the code. Now I realize this is an ASR model, which we don't yet support for fairseq in the API. I'll add ASR, but the model is almost 10 GB, which means it will be very slow to load.

osanseviero avatar Jul 29 '22 08:07 osanseviero

Hello @sravyapopuri388, I'm sorry if this is a misplaced question, but I'm asking here because it concerns the same model. I am trying to convert English to English using this model. The example at https://huggingface.co/facebook/xm_transformer_s2ut_800m-es-en-st-asr-bt_h1_2022 works, but when I run the code attached below it converts the English audio into Spanish:

import json
import os

import IPython.display as ipd
import torchaudio
from huggingface_hub import snapshot_download
from pathlib import Path

from fairseq import hub_utils
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech import CodeHiFiGANVocoder
from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface

cache_dir = os.getenv("HUGGINGFACE_HUB_CACHE")

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_s2ut_800m-es-en-st-asr-bt_h1_2022",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    cache_dir=cache_dir,
)


# run on CPU
model = models[0].cpu()
cfg["task"].cpu = True

generator = task.build_generator([model], cfg)


# requires 16000Hz mono channel audio
audio, _ = torchaudio.load("../gnz_10005_m3.wav")

sample = S2THubInterface.get_model_input(task, audio)
unit = S2THubInterface.get_prediction(task, model, generator, sample)

# speech synthesis
library_name = "fairseq"
cache_dir = (
    cache_dir or (Path.home() / ".cache" / library_name).as_posix()
)
cache_dir = snapshot_download(
    "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur",
    cache_dir=cache_dir,
    library_name=library_name,
)

x = hub_utils.from_pretrained(
    cache_dir,
    "model.pt",
    ".",
    archive_map=CodeHiFiGANVocoder.hub_models(),
    config_yaml="config.json",
    fp16=False,
    is_vocoder=True,
)

with open(f"{x['args']['data']}/config.json") as f:
    vocoder_cfg = json.load(f)
assert (
    len(x["args"]["model_path"]) == 1
), "Too many vocoder models in the input"



vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

tts_sample = tts_model.get_model_input(unit)
wav, sr = tts_model.get_prediction(tts_sample)

ipd.Audio(wav, rate=sr) 
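(Editor's note: the comment in the snippet above says the model requires 16 kHz, mono-channel audio. In practice `torchaudio.functional.resample` is the right tool for this, but purely as an illustrative sketch of what that preprocessing does, here are two hypothetical helpers, not part of fairseq, that downmix to mono and naively resample with linear interpolation:)

```python
def to_mono(channels):
    """Downmix a list of equal-length channel sample lists by averaging."""
    n = len(channels[0])
    return [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]


def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler; a sketch, not production code."""
    if src_rate == dst_rate:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    ratio = src_rate / dst_rate
    out = []
    for i in range(out_len):
        pos = i * ratio          # fractional position in the source signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(a * (1 - frac) + b * frac)
    return out


# e.g. downmix stereo, then resample from 44100 Hz to 16000 Hz:
# mono = to_mono([left_samples, right_samples])
# mono_16k = resample_linear(mono, 44100, 16000)
```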

Is there a parameter that needs to be modified, and if so, how? The goal, in short, is to convert broken English into proper English using this model.

thanks

mohammed-Emad avatar Mar 15 '23 11:03 mohammed-Emad