torchaudio mobile?
🚀 Feature
torchaudio should work on Android and iOS platforms
Motivation
Mobile apps need a fast way to preprocess audio (e.g. generating spectrograms).
Pitch
I would love to see a lightweight interface that provides access to all torchaudio functions from Java and Objective-C. Heavy logic like FFT should be optimized for performance (e.g. pushed to the GPU whenever available).
Alternatives
TensorFlow Lite is moving in this direction (MFCC and most other signal-processing ops are already whitelisted).
Additional context
Transforms and functionals are now jitable. Have you tried exporting your model using JIT and then importing it on mobile? See the mobile page. Is that what you are trying to do?
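For concreteness, a minimal sketch of that workflow (the transform parameters here are placeholders, not a recommendation):

import torch
import torchaudio

# Transforms are jitable, so a single transform can be scripted and saved...
transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024, hop_length=256)
scripted = torch.jit.script(transform)
scripted.save("melspec.pt")

# ...and loaded back (on mobile via the TorchScript runtime, or in Python for a sanity check).
reloaded = torch.jit.load("melspec.pt")
spec = reloaded(torch.randn(1, 16000))  # one second of dummy audio at 16 kHz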
Wow, this sounds great! I had not noticed this in the docs and assumed that heavy operations like FFT were not supported yet. I'll test it out over the next couple of days and share my results here.
@vincentqb I just traced the simplest possible module (with just a single torchaudio.transforms.MelSpectrogram) and attempted to run it in the demo app (github.com/pytorch/android-demo-app with org.pytorch:pytorch_android:1.4.0.1). Unfortunately, the forward method crashes on Android:
fft: ATen not compiled with MKL support
File "code/__torch__/torch/nn/modules/module/___torch_mangle_5.py", line 37
_16 = ops.prim.NumToTensor(torch.size(input1, 2))
input2 = torch.view(input1, [int(_15), int(_16)])
spec_f = torch.stft(input2, 1024, 143, 1024, window, False, True)
~~~~~~~~~~ <--- HERE
_17 = ops.prim.NumToTensor(torch.size(spec_f, 1))
_18 = ops.prim.NumToTensor(torch.size(spec_f, 2))
at org.pytorch.NativePeer.forward(Native Method)
at org.pytorch.Module.forward(Module.java:37)
at org.pytorch.helloworld.MainActivity.onCreate(MainActivity.java:74)
Can you please confirm whether the PyTorch Mobile runtime is indeed supposed to support the STFT transform?
Indeed, fft is not currently supported on PyTorch Mobile, as mentioned here.
Ouch :( Do you think this will be resolved soon? If not, I would highly recommend improving the PyTorch Mobile documentation to specify which transforms are supported and which are not. Also, I hope this feature request will stay open until all torchaudio features work on mobile.
@dreiss @supriyar -- do you have some information to share about the functions that are supported within mobile, or the ones that would welcome contributions from the community?
I'm working on an iOS project and hitting this error:
fft: ATen not compiled with MKL support
Although Apple's vDSP framework could do this kind of processing, I would like to stick with the torch solution for simplicity and consistency. I hope we can get fft support on mobile, since it's a very basic building block in audio processing and is needed in many speech-related models.
We enable every forward CPU op on mobile. I think the issue here is that this particular op doesn't have a portable implementation. We would be interested in a PR adding one. Another option would be to do the FFT before feeding the data to your model.
Is there a PR we could link to indicating which operations we would welcome contributions for?
I don't have a list because we haven't tested out every niche op, but if there is any op that doesn't have a non-Intel implementation, I'd be open to seeing a PR.
I don't have the time to PR this personally, but you can do a naive implementation of rfft with matmuls, e.g.:
(This code is translated from TensorFlow Magenta at https://github.com/tensorflow/magenta/blob/cf80d19fc0c2e935821f284ebb64a8885f793717/magenta/music/melspec_input.py#L64-L90; I removed some padding code I didn't need.)
import math
from typing import Tuple

import torch
from torch import Tensor


def _dft_matrix(dft_length):
    # type: (int) -> Tuple[Tensor, Tensor]
    """Build the real and imaginary parts of the DFT matrix, keeping only positive frequencies."""
    omega = 2 * math.pi / float(dft_length)
    # Outer product of the index vectors gives the (row * col) phase exponents.
    indices = torch.arange(dft_length, dtype=torch.float)
    sum_components = torch.ger(indices, indices)
    keep_values = dft_length // 2 + 1
    real_part = torch.cos(omega * sum_components)[:keep_values, :].transpose(0, 1)
    imag_part = -torch.sin(omega * sum_components)[:keep_values, :].transpose(0, 1)
    return real_part, imag_part


def _naive_rdft(signal_tensor, fft_length):
    # type: (Tensor, int) -> Tensor
    """Implement the real-input Fourier transform by matmul."""
    # We are right-multiplying by the DFT matrix, and we are keeping
    # only the first half ("positive frequencies").
    # So discard the second half of rows, but transpose the array for
    # right-multiplication.
    # The DFT matrix is symmetric, so we could have done it more
    # directly, but this reflects our intention better.
    real_dft_tensor, imag_dft_tensor = _dft_matrix(fft_length)
    # Frames must already be padded/truncated to fft_length (padding code removed).
    assert signal_tensor.shape[-1] == fft_length
    result_real_part = torch.bmm(signal_tensor, real_dft_tensor.unsqueeze(0))
    result_imag_part = torch.bmm(signal_tensor, imag_dft_tensor.unsqueeze(0))
    # Stack to match torch.rfft's [..., freq, 2] real/imag layout.
    return torch.stack([result_real_part, result_imag_part], dim=3)


# windowed_frames: a [1, T, 512] tensor of windowed audio frames
out_check = _naive_rdft(windowed_frames, 512)
out = torch.rfft(windowed_frames, 1)
assert torch.allclose(out_check, out, atol=1e-5)
When I try to import torchaudio into my Android project's app build.gradle file like so:
implementation 'org.pytorch:pytorch_android_torchaudio:1.5.0'
I get this error: Could not find org.pytorch:pytorch_android_torchaudio:1.5.0.
Is this the same issue as what's reported above?
@dstrube1 There is no Android package dedicated to torchaudio. You build your model or pipeline in Python, dump it as a TorchScript file, then load it from your app and run it with the TorchScript runtime. Please refer to the following:
https://pytorch.org/mobile/home/
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html
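In practice, that means the app's build.gradle pulls in only the core runtime, since the torchaudio logic ships inside the TorchScript file. A sketch (the version number is illustrative; pin it to your setup):

// app/build.gradle: only the core PyTorch Android runtime is required
implementation 'org.pytorch:pytorch_android:1.5.0'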
Hi @jeffxtang, it seems that we can run fft directly with PyTorch from version 1.10, thanks to the PocketFFT support added in this commit: https://github.com/pytorch/pytorch/commit/4036820506693b71a96b9e20989bfe286acccd89
Can you confirm? Thanks!
Thanks for your question. Yes, it's confirmed. We're working on an update of the demos to simplify the iOS and Android code.
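A minimal sketch of that 1.10+ path, assuming a PocketFFT-enabled build; the module and its parameters below are illustrative, not taken from the demos (note that torch.rfft has since been removed in favor of the torch.fft module):

import torch

class Spec(torch.nn.Module):
    """Magnitude spectrogram via torch.stft, which runs on mobile CPUs once FFT support landed."""

    def __init__(self, n_fft: int = 1024, hop_length: int = 256):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x, self.n_fft, self.hop_length, window=self.window, return_complex=True)
        return spec.abs()

scripted = torch.jit.script(Spec())
scripted._save_for_lite_interpreter("spec.ptl")  # loadable by the mobile lite interpreter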
Hi, just following up on whether torchaudio is available on React Native now? @mthrok @vincentqb @dkashkin Trying to do something like this:

audio, sample_rate = librosa.load(audio_filepath, sr=None)
# audio is a 1xN numpy array of floats, and sample_rate is an int

# Convert the audio numpy array to a spectrogram image (a numpy array or pytorch tensor)
clip = torch.Tensor(audio)
# clip is a pytorch tensor with the same dimensions as audio, 1xN
spec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=n_fft, win_length=window_length, hop_length=hop_length, n_mels=n_mels)(clip)
# spec is a spectrogram image (a pytorch tensor) with dimensions 3x128x15000
All on a React Native app :)
@raedle may know if torchaudio is available on React Native.
Btw, both the iOS and Android streaming ASR demo apps using PyTorch 1.12 and torchaudio 0.12 were updated in July.
@jeffxtang it depends. Generally, if it is possible in native Android and iOS, then it is theoretically possible in React Native.
I did a quick check on the PyTorch Android/iOS demo apps but couldn't see any dependency on torchaudio in the apps. How is the audio converted to tensors?
Yeah, I couldn't find the part of the code where we take a stream of audio from React Native and convert it to tensors, or the part that converts the raw audio stream into mel spectrograms. Anybody got a clue?
@henryleemr PlayTorch doesn't support audio streaming. Do you have an end-to-end example for either iOS or Android without React Native?
I might be open to looking into this for PlayTorch, which would enable others to use audio streaming in the future.
@raedle This script uses torchaudio to convert the audio and generate the model used in the iOS & Android apps. So internally the model uses torchaudio's transforms.MelSpectrogram to do the audio conversion. @henryleemr
Ah, cool! So if I were to, say, use a ResNet on the spectrogram (which is just an image), I could do something like this?
import torch
import torch.nn as nn
import torchaudio
import torchvision.models as models


class ResNet(nn.Module):
    def __init__(self, dataset, pretrained=True):
        super(ResNet, self).__init__()
        num_classes = 4
        self.transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400, n_mels=80, hop_length=160)
        self.model = models.resnet50(pretrained=pretrained)
        self.model.fc = nn.Linear(2048, num_classes)

    def forward(self, raw_audio_tensor):
        spectrogram = self.transform(raw_audio_tensor)
        output = self.model(spectrogram)
        return output
Correct me if I'm wrong: once we've exported the model trained with this class into a .ptl model file, we can just use the usual React Native methods to load the model and pass raw audio streams into it, and the mel spectrogram transformation logic will be applied to the raw audio without needing any React Native or JavaScript mel spectrogram functions?
Based on how the model that implements the melspec transform is used in the iOS & Android apps, yes that's correct.
@hwangjeff @mthrok can you please double-check?
Yeah, the script looks good to me.
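For reference, a hedged sketch of the export step for the class above, assuming it scripts cleanly (the file name is illustrative):

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = ResNet(dataset=None, pretrained=True).eval()
scripted = torch.jit.script(model)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("audio_resnet.ptl")  # the .ptl file the app loads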
@jeffxtang Hello, any update in 2024? ;-) We think consistency is critical: we trained on PC with torchaudio's feature extractors (MFCC, etc.), so we need the same library to do the feature extraction on other platforms like Android and iOS. Most likely, due to the performance gap, we will need to convert the PyTorch models to ONNX models during the deployment phase, so keeping the feature extraction consistent is critical to making the model run consistently across all platforms. Thanks a lot!