Request for Guidance on Adding Punjabi Language Support to NLLB
Hello NLLB/fairseq team,
I’m reaching out to explore how to fine-tune the NLLB model to better support Punjabi, a vibrant language spoken by over 100 million people worldwide, including a historic Sikh community in California that has thrived since 1909.
As part of efforts to preserve and promote Punjabi in digital spaces, I’d like to understand:
Requirements for fine-tuning NLLB for Punjabi – Are there specific considerations for its Gurmukhi script or dialectal variations (e.g., Eastern vs. Western Punjabi)?
Existing tutorials – Is there a guide for adding new languages, particularly those with rich literary traditions, such as Punjabi?
Data needs – What type/amount of parallel data (e.g., Punjabi-English) would be optimal? Could community-translated datasets (e.g., religious texts, literature, or news) supplement existing resources?
Leveraging seed datasets – Are there templates (such as the NLLB-Seed dataset) that we could adapt for Punjabi?
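To make the script question concrete, here is a minimal sketch of how candidate sentences might be sorted by script before assembling a parallel corpus. It assumes the standard Unicode block for Gurmukhi (U+0A00 to U+0A7F) and approximates Shahmukhi with the basic Arabic block (U+0600 to U+06FF). The function name `dominant_script` and the `min_share` threshold are hypothetical choices for illustration only; nothing here is part of fairseq or NLLB.

```python
from collections import Counter

# Unicode block ranges (inclusive). Gurmukhi occupies U+0A00-U+0A7F.
# Shahmukhi is written in Arabic script, approximated here by U+0600-U+06FF.
SCRIPT_RANGES = {
    "gurmukhi": (0x0A00, 0x0A7F),
    "arabic": (0x0600, 0x06FF),
}


def dominant_script(text, min_share=0.5):
    """Return the script covering at least `min_share` of the letters, else None.

    Hypothetical helper: only characters for which str.isalpha() is true are
    counted, so digits, punctuation, and combining vowel signs are ignored.
    """
    counts = Counter()
    letters = 0
    for ch in text:
        if not ch.isalpha():
            continue
        letters += 1
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if letters == 0 or not counts:
        return None
    name, hits = counts.most_common(1)[0]
    return name if hits / letters >= min_share else None
```

A filter like this could route Gurmukhi and Shahmukhi sentences into separate files, or flag mixed-script lines for manual review, before any fine-tuning data is built. Whether that separation is actually needed for NLLB is part of what I am asking.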
Punjabi is a culturally significant language with deep roots in California’s Sikh diaspora, and I’d love to contribute to its inclusion in NLLB. Any advice or resources you could share would be invaluable!
Thank you for your time and for your work on multilingual AI.
Best regards, Manav