
Proposal: Implementing a P2P Distributed ASR System in Web Environments using transformers.js

Open · NMrJPine opened this issue 1 year ago · 1 comment

Feature request

Dear Xenova,

I'm impressed by your work, Xenova. Running Hugging Face Transformers directly in the browser is a great achievement and opens up new possibilities for web-based machine learning applications. Building on this foundation, I propose an ambitious extension: a WebRTC-powered peer-to-peer (P2P) distributed inference system for real-time speech-to-text (STT) conversion.

The essence of the system is using WebRTC to create a distributed network of browser clients, each contributing to the speech-to-text (STT) conversion process. The network would let browsers share the computational load of an STT task, with each share processed locally using transformers.js.

This approach aims to decentralize the computational workload, thereby reducing model size and inference times per browser, while potentially enhancing overall transcription accuracy.
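To make the networking side concrete, here is a minimal sketch of how the client (acting as coordinator) might open a data channel to one worker peer. It assumes an existing signaling mechanism (represented by a hypothetical `signaling` object, e.g. backed by a WebSocket) for exchanging the offer/answer and ICE candidates, and a client-defined `handleTranscribedSegment` callback (defined in the last sketch below); neither is part of the original proposal's text.

```javascript
// Sketch: coordinator side of one peer connection used to ship audio segments
// to a worker browser and receive transcriptions back.
async function connectToWorker(signaling) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
  });

  const channel = pc.createDataChannel('stt-segments');
  channel.binaryType = 'arraybuffer';
  channel.onmessage = (event) => {
    // Workers reply with JSON of the form { id, offsetSeconds, text, chunks }.
    handleTranscribedSegment(JSON.parse(event.data)); // hypothetical client callback
  };

  pc.onicecandidate = ({ candidate }) => {
    if (candidate) signaling.send({ type: 'candidate', candidate }); // hypothetical signaling client
  };

  await pc.setLocalDescription(await pc.createOffer());
  signaling.send({ type: 'offer', sdp: pc.localDescription });
  // The worker's answer and ICE candidates would be applied here via
  // pc.setRemoteDescription(...) and pc.addIceCandidate(...) as signaling delivers them.

  return channel;
}
```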

  1. Audio Splitting and Distribution: The audio input uploaded by a client is split into smaller segments. The segments are then distributed across the P2P network formed via WebRTC, ensuring an efficient distribution of processing tasks (a sketch of this step follows the list).

  2. Specialized Node Functionality: Each node in the network (a participating browser) hosts a custom-trained STT transformer model, specialized in a specific linguistic task. For example, one node might be adept at transcribing grammatical structures like articles or conjunctions, while another specializes in recognizing vocabulary related to specific objects or contexts. This modular approach allows for a fine-tuned and targeted transcription process while keeping model sizes small. Once an individual segment is transcribed, it is sent back to the client along with timestamps (see the worker sketch after the list).

  3. Recomposition and Contextual Correction: The client undertakes the task of recomposing the fragments into the full transcription. To further enhance accuracy, a GPT model runs a contextual analysis on the rebuilt transcription, identifying and adjusting any mis-transcribed words using the context surrounding these discrepancies (see the recomposition sketch after the list).
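As a rough illustration of step 1, the sketch below decodes an uploaded file at 16 kHz (the sample rate Whisper-family models in transformers.js expect), cuts the samples into fixed-length segments, and sends them round-robin over the open data channels. The 30-second segment length, the two-message framing (JSON metadata followed by raw samples), and the `workerChannels` array are assumptions for illustration; real data channels also impose message-size limits, so large segments would need additional chunking in practice.

```javascript
// Step 1 sketch: split an uploaded audio file into fixed-length 16 kHz segments.
async function splitAudio(file, segmentSeconds = 30) {
  const audioCtx = new AudioContext({ sampleRate: 16000 }); // Whisper-style models expect 16 kHz mono
  const buffer = await audioCtx.decodeAudioData(await file.arrayBuffer());
  const samples = buffer.getChannelData(0);
  const segmentLength = segmentSeconds * 16000;

  const segments = [];
  for (let start = 0; start < samples.length; start += segmentLength) {
    segments.push({
      id: segments.length,
      offsetSeconds: start / 16000, // kept so the client can reorder results later
      samples: samples.slice(start, start + segmentLength),
    });
  }
  return segments;
}

// Distribute segments round-robin over the connected worker peers.
// Each segment is sent as a JSON metadata message followed by the raw samples.
function distribute(segments, workerChannels) {
  segments.forEach((segment, i) => {
    const channel = workerChannels[i % workerChannels.length];
    channel.send(JSON.stringify({ id: segment.id, offsetSeconds: segment.offsetSeconds }));
    channel.send(segment.samples.buffer); // Float32 samples as an ArrayBuffer
  });
}
```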
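For step 2, a worker peer might look like the following sketch: it loads an ASR pipeline with transformers.js, transcribes each segment it receives over the data channel, and sends the text and timestamps back. The model name `Xenova/whisper-tiny.en` is only a placeholder; the proposal's per-node specialization would amount to loading a different fine-tuned checkpoint here.

```javascript
// Step 2 sketch: a worker peer transcribes incoming segments with transformers.js.
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en', // placeholder; a specialized checkpoint would go here
);

// `pc` is the worker's RTCPeerConnection, set up as the answering side of the
// connection sketched earlier.
pc.ondatachannel = ({ channel }) => {
  channel.binaryType = 'arraybuffer';
  let pendingMeta = null; // JSON metadata arrives just before the raw samples

  channel.onmessage = async (event) => {
    if (typeof event.data === 'string') {
      pendingMeta = JSON.parse(event.data); // { id, offsetSeconds }
      return;
    }
    const samples = new Float32Array(event.data); // raw 16 kHz mono audio
    const output = await transcriber(samples, { return_timestamps: true });
    channel.send(JSON.stringify({
      id: pendingMeta.id,
      offsetSeconds: pendingMeta.offsetSeconds,
      text: output.text,
      chunks: output.chunks, // [{ timestamp: [start, end], text }, ...]
    }));
  };
};
```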
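Finally, for step 3, the client can reorder the results by their original offsets and join the text. The GPT-based contextual correction is left as a placeholder comment, since it depends on which language model is used; `expectedSegmentCount` and `displayTranscript` are assumed to come from the surrounding application.

```javascript
// Step 3 sketch: collect per-segment results, reorder them, and rebuild the transcript.
const results = [];

function handleTranscribedSegment(result) {
  results.push(result);
  if (results.length === expectedSegmentCount) { // assumed known from the splitting step
    const transcript = results
      .sort((a, b) => a.offsetSeconds - b.offsetSeconds)
      .map((r) => r.text.trim())
      .join(' ');

    // Placeholder: a contextual-correction pass (e.g. a language model prompted
    // with the full transcript) could adjust mis-transcribed words here.
    displayTranscript(transcript); // hypothetical UI hook
  }
}
```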

I am really eager to discuss this with you, Xenova, to understand the possibilities and costs of bringing this to reality.

Motivation

Reducing model size and inference times per browser, while potentially enhancing overall transcription accuracy.

Your contribution

I cannot contribute directly to the coding and development of the proposal, but I am committed to supporting it by providing a detailed architecture outline for the proposed system.

NMrJPine · Dec 26 '23 06:12

Hi there 👋 This is indeed an interesting idea! Although it is definitely out of scope for this library (its main purpose is to provide a JS equivalent to the Python library), perhaps you can work with someone in the community who is also interested in this.

xenova · Jan 08 '24 13:01