Fabric
[Feature request]: Multilingual yt - formatting issues
What do you need?
I'm trying to use the yt --transcript function with languages other than English (French, in my case). The transcript has formatting problems:
- writes "Ã©" instead of "é", "Ãª" instead of "ê", "Ã" instead of "à"...
- no punctuation
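For context, this kind of output is what typically appears when UTF-8 bytes are decoded as Windows-1252 (cp1252) or Latin-1. A minimal sketch that reproduces the symptom, assuming that is what happens here (I have not confirmed where in yt it would occur); `garble` is a hypothetical helper name:

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then decode the bytes with the wrong codec (cp1252)."""
    return text.encode("utf-8").decode("cp1252")


# "é" is the two bytes C3 A9 in UTF-8; read as cp1252 they become "Ã" and "©".
print(garble("été, être"))  # prints: Ã©tÃ©, Ãªtre
```

Each accented character turns into "Ã" followed by a second symbol, which matches the pairs listed above.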
I tried to fix it with 2 different approaches:
- using a custom pattern to fix those typos.
- editing the yt.py by including a function to replace the defective set of typos with the correct ones.
From my attempts, the best fix so far is a pattern. I tried it two ways: describing the fix in natural language, and asking the model to mimic a given Python function. The second worked best, but it is not perfect and the results vary from run to run. It often fixes the formatting, but sometimes replaces a few words with other words. The punctuation is also usually missing, or worse than it is in English. These problems happen even when YouTube has a correct French subtitle file. Sometimes it doesn't work at all and just returns comments about the Python function.
The results are encouraging, but very random. I'm willing to improve it but maybe it's not the right approach. Any suggestions are welcome.
Here is an example of a command I used:
yt --transcript --lang 'fr' https://www.youtube.com/watch?v=oiKj0Z_Xnjc | fabric --model llama3:latest -sp convert_fr
And the output:
`It seems you're asking me to run the fix_encoding_issues function on this text. I'll do that for you.
Please note that the original encoding of the text is not specified, but based on the presence of non-ASCII characters (e.g., â, è, ê, etc.), I assume it's encoded in UTF-8.
Here's the cleaned text:
Enfin je saurai où je vais Maman dit que lorsqu'on cherche bien On finit toujours par trouver Elle dit qu'il n'est jamais très loin Qu'il part très souvent travailler Maman dit "travailler c'est bien" Bien mieux qu'être mal accompagné Pas vrai ? Où est ton papa ? Dis-moi où est ton papa ? Sans même devoir lui parler Il sait ce qui ne va pas Ah sacré papa Dis-moi où es-tu caché ? ?a doit, faire au moins mille fois que j'ai Compté mes doigts Hey ! Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Quoi, qu'on y croit ou pas Y aura bien un jour où on n'y croira plus Un jour ou l'autre on sera tous papa Et d'un jour ? l'autre on aura disparu Serons-nous détestables ? Serons-nous admirables ? Des géniateurs ou des génies ? Dites-nous qui donne naissance aux irresponsables ? Ah dites-nous qui, tiens Tout le monde sait comment on fait des bébés Mais personne sait comment on fait des papas Monsieur Je-sais-tout en aurait hérité, c'est ça Faut l'sucer d'son pouce ou quoi ? Dites-nous où c'est caché, ça doit Faire au moins mille fois qu'on a Bouffé nos doigts Hey ! Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ? Où t'es, papaoutai ? Où t'es, papaoutai ? Où t'es, où t'es où, papaoutai ?`
Finally, here is the content of my custom pattern "convert_fr":
cat system.md
You will only execute the following python functions in a given text. You do not delete or add words and lines at all. Keep all the original text, even if there are any grammatical mistakes.
import re

def fix_encoding_issues(text):
    # Each key is a character's UTF-8 bytes misread as Windows-1252;
    # each value is the character that was intended.
    replacements = {
        "Ã©": "é", "Ã¨": "è", "Ãª": "ê", "Ã\xa0": "à", "Ã¢": "â", "Ã§": "ç", "Ã«": "ë",
        "Ã®": "î", "Ã´": "ô", "Ã¹": "ù", "Ã»": "û", "Ã¼": "ü", "Ã¿": "ÿ",
        "Ã€": "À", "Ã‚": "Â", "Ãƒ": "Ã", "Ã„": "Ä", "Ã…": "Å", "Ã†": "Æ", "Ã‡": "Ç",
        "Ãˆ": "È", "Ã‰": "É", "ÃŠ": "Ê", "Ã‹": "Ë", "ÃŒ": "Ì", "Ã\x8d": "Í", "ÃŽ": "Î",
        "Ã‘": "Ñ", "Ã’": "Ò", "Ã“": "Ó", "Ã”": "Ô", "Ã•": "Õ", "Ã–": "Ö", "Ã˜": "Ø",
        "Ã™": "Ù", "Ãš": "Ú", "Ã›": "Û", "Ãœ": "Ü", "Ã\x9d": "Ý", "Ãž": "Þ", "ÃŸ": "ß",
        "Ã¡": "á", "Ã£": "ã", "Ã¤": "ä", "Ã¥": "å", "Ã¦": "æ", "Ã¬": "ì", "Ã¯": "ï",
        "Ã°": "ð", "Ã±": "ñ", "Ã²": "ò", "Ã³": "ó", "Ãµ": "õ", "Ã¶": "ö", "Ã·": "÷",
        "Ã¸": "ø", "Ãº": "ú", "Ã½": "ý", "Ã¾": "þ",
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

def main():
    # Load the text from a file
    with open('input.txt', 'r', encoding='utf-8') as file:
        text = file.read()
    # Fix encoding issues
    text = fix_encoding_issues(text)
    # Save the cleaned text to a new file
    with open('output.txt', 'w', encoding='utf-8') as file:
        file.write(text)

if __name__ == "__main__":
    main()
UPDATE: The best solution I've found for now is to pipe the unaltered transcript into a python script in this way:
yt --transcript --lang 'fr' https://www.youtube.com/watch?v=HLD3BFdE0fU | python3 fix_french_typos.py
There must be a better way, for example specifying the correct encoding (UTF-8 or similar) somewhere, but I don't know how to do it, and this works.
Here is the python script:
import re
import sys

def fix_encoding_issues(text):
    # Each key is a character's UTF-8 bytes misread as Windows-1252;
    # each value is the character that was intended.
    replacements = {
        "Ã©": "é", "Ã¨": "è", "Ãª": "ê", "Ã\xa0": "à", "Ã¢": "â", "Ã§": "ç", "Ã«": "ë",
        "Ã®": "î", "Ã´": "ô", "Ã¹": "ù", "Ã»": "û", "Ã¼": "ü", "Ã¿": "ÿ",
        "Ã€": "À", "Ã‚": "Â", "Ãƒ": "Ã", "Ã„": "Ä", "Ã…": "Å", "Ã†": "Æ", "Ã‡": "Ç",
        "Ãˆ": "È", "Ã‰": "É", "ÃŠ": "Ê", "Ã‹": "Ë", "ÃŒ": "Ì", "Ã\x8d": "Í", "ÃŽ": "Î",
        "Ã‘": "Ñ", "Ã’": "Ò", "Ã“": "Ó", "Ã”": "Ô", "Ã•": "Õ", "Ã–": "Ö", "Ã˜": "Ø",
        "Ã™": "Ù", "Ãš": "Ú", "Ã›": "Û", "Ãœ": "Ü", "Ã\x9d": "Ý", "Ãž": "Þ", "ÃŸ": "ß",
        "Ã¡": "á", "Ã£": "ã", "Ã¤": "ä", "Ã¥": "å", "Ã¦": "æ", "Ã¬": "ì", "Ã¯": "ï",
        "Ã°": "ð", "Ã±": "ñ", "Ã²": "ò", "Ã³": "ó", "Ãµ": "õ", "Ã¶": "ö", "Ã·": "÷",
        "Ã¸": "ø", "Ãº": "ú", "Ã½": "ý", "Ã¾": "þ",
        # Workaround: maps a leftover "Ý" to "à". Only safe for text that
        # never contains a real "Ý" (fine for French).
        "Ý": "à",
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

def main():
    # Read text from standard input
    input_text = sys.stdin.read()
    # Fix encoding issues
    corrected_text = fix_encoding_issues(input_text)
    # Print the cleaned text to standard output
    print(corrected_text)

if __name__ == "__main__":
    main()
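On the "better way": if the garbling really is UTF-8 read as Windows-1252 or Latin-1 (an assumption, not verified against yt's code), the replacement table can be avoided by reversing the bad decode directly. A minimal sketch; `fix_mojibake` and `fix_lines` are hypothetical names:

```python
def fix_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-cp1252/Latin-1 mix-up.

    Returns the text unchanged when the round trip does not apply,
    for example when the text is already correct.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


def fix_lines(text: str) -> str:
    """Repair line by line, so an already-clean line does not block the rest."""
    return "\n".join(fix_mojibake(line) for line in text.split("\n"))
```

Saved as a hypothetical fix_mojibake.py, it could be used in the same pipe: `yt --transcript --lang 'fr' <url> | python3 -c "import sys; from fix_mojibake import fix_lines; sys.stdout.write(fix_lines(sys.stdin.read()))"`. This does nothing for the missing punctuation, and a line that mixes garbled and correct accents is left as it is.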
Changing the line above and rebuilding locally solved the yt multilingual encoding issue for me.
I reported this bug before, but there seems to be no fix for it yet.
You're right! I tried exactly this before, but didn't think about rebuilding. I did 'pipx install . --force' and it was done! Thanks!
This relies on yt-dlp now and we don't do any scraping in the Go Fabric.