AudioToken

Problem with FP16

Open · arielkantorovich opened this issue 9 months ago · 0 comments

Hi, your code works great and I succeeded in generating images from audio. The problem is that when I pass the flag `--mixed_precision fp16`, I always get a black image. Have you tried running your code with this flag? I can't figure out why this happens.

```
05/04/2024 15:33:47 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

/home/student/anaconda3/envs/TempToken/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
/home/student/anaconda3/envs/TempToken/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
{'dropout', 'time_embedding_act_fn', 'resnet_out_scale_factor', 'cross_attention_norm', 'time_cond_proj_dim', 'resnet_skip_time_act', 'addition_embed_type_num_heads', 'conv_out_kernel', 'timestep_post_act', 'class_embeddings_concat', 'only_cross_attention', 'addition_time_embed_dim', 'upcast_attention', 'addition_embed_type', 'num_class_embeds', 'time_embedding_dim', 'encoder_hid_dim', 'encoder_hid_dim_type', 'num_attention_heads', 'time_embedding_type', 'attention_type', 'class_embed_type', 'mid_block_only_cross_attention', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'dual_cross_attention', 'transformer_layers_per_block', 'mid_block_type', 'resnet_time_scale_shift', 'use_linear_projection', 'projection_class_embeddings_input_dim'} was not found in config. Values will be initialized to default values.
{'norm_num_groups', 'latents_mean', 'force_upcast', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dynamic_thresholding_ratio', 'timestep_spacing', 'clip_sample_range', 'sample_max_value', 'prediction_type', 'rescale_betas_zero_snr', 'thresholding'} was not found in config. Values will be initialized to default values.
05/04/2024 15:33:50 - INFO - modules.BEATs.BEATs - BEATs Config: {'input_patch_size': 16, 'embed_dim': 512, 'conv_bias': False, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': 'gelu', 'layer_wise_gradient_decay_ratio': 0.6, 'layer_norm_first': False, 'deep_norm': True, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.0, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True, 'finetuned_model': True, 'predictor_dropout': 0.0, 'predictor_class': 527}
/home/student/anaconda3/envs/TempToken/lib/python3.8/site-packages/torchaudio/compliance/kaldi.py:616: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at /opt/conda/conda-bld/pytorch_1682343998658/work/aten/src/ATen/EmptyTensor.cpp:31.)
  spectrum = torch.fft.rfft(strided_input).abs()
scailing factor = 0.18215
/home/student/AudioToken/check_audioTOken.py:239: RuntimeWarning: invalid value encountered in cast
  images = (image * 255).round().astype("uint8")
```
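The final `RuntimeWarning: invalid value encountered in cast` is the telling line: it means NaN (or inf) values reached the uint8 cast, which typically means the fp16 forward pass overflowed somewhere and the decoded image is all NaN, saved as a black picture. A minimal sketch reproducing the symptom and a defensive guard (the variable names mirror the warning line; the NaN-scrubbing step is my suggestion, not part of the AudioToken code):

```python
import numpy as np

# Simulate a decoded image where fp16 overflow produced NaN pixels.
image = np.array([[0.5, np.nan, 0.25]], dtype=np.float32)

# This is the line from check_audioTOken.py:239 — casting NaN to uint8
# triggers "RuntimeWarning: invalid value encountered in cast" and the
# resulting pixel values are undefined (often 0, i.e. black).
with np.errstate(invalid="ignore"):
    images = (image * 255).round().astype("uint8")

# Defensive variant: scrub NaN/inf and clamp to [0, 1] before casting,
# so at worst the bad pixels come out black instead of the whole image
# being corrupted.
safe = np.nan_to_num(image, nan=0.0, posinf=1.0, neginf=0.0)
images_safe = (np.clip(safe, 0.0, 1.0) * 255).round().astype("uint8")
print(images_safe)  # valid pixels survive; only the NaN pixel is zeroed
```

Note this only masks the symptom; the NaNs themselves most likely originate upstream in the fp16 pipeline (the `ComplexHalf support is experimental` warning from `torch.fft.rfft` in torchaudio's kaldi module is a plausible source, since the half-precision FFT path is explicitly flagged as unsupported).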

arielkantorovich · May 04 '24 11:05