VideoLingo: German-to-Chinese videos always fail with "ValueError: All arrays must be of the same length"

2024-11-04 12:58:07.299 Uncaught app exception
Traceback (most recent call last):
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "D:\AI\VideoLingo\st.py", line 117, in <module>
    main()
  File "D:\AI\VideoLingo\st.py", line 113, in main
    text_processing_section()
  File "D:\AI\VideoLingo\st.py", line 30, in text_processing_section
    process_text()
  File "D:\AI\VideoLingo\st.py", line 55, in process_text
    step5_splitforsub.split_for_sub_main()
  File "D:\AI\VideoLingo\core\step5_splitforsub.py", line 106, in split_for_sub_main
    pd.DataFrame({'Source': src_lines, 'Translation': tr_lines}).to_excel("output/log/translation_results_for_subtitles.xlsx", index=False)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

I tried several German videos and every one of them hits this error. English-to-Chinese works fine.

Nov 04 '24 05:11 piagodai

When asking a question, please mention which LLM you are using.

Nov 04 '24 09:11 Huanshere

OK. I followed the default documentation and used claude-3-5-sonnet-20240620. [Screenshot: 微信截图_20241104191845]

Nov 04 '24 11:11 piagodai

Thanks for the feedback. I will test German videos more thoroughly; so far I had only run a quick test.

Nov 08 '24 09:11 Huanshere

First of all, thank you for open-sourcing this; it is a real productivity tool. Longer English videos also raise this error. Could it be a problem with the large language model?

ValueError: All arrays must be of the same length
Traceback:
  File "C:\Users\我叫肥坚\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\Users\我叫肥坚\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "C:\VideoLingo2.0\st.py", line 123, in <module>
    main()
  File "C:\VideoLingo2.0\st.py", line 119, in main
    text_processing_section()
  File "C:\VideoLingo2.0\st.py", line 33, in text_processing_section
    process_text()
  File "C:\VideoLingo2.0\st.py", line 57, in process_text
    step5_splitforsub.split_for_sub_main()
  File "C:\VideoLingo2.0\core\step5_splitforsub.py", line 129, in split_for_sub_main
    pd.DataFrame({'Source': src, 'Translation': remerged}).to_excel(OUTPUT_REMERGED_FILE, index=False)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")

Nov 22 '24 06:11 Jimbager

I just tried it: with the same long video, switching the LLM to openai/gpt-4o-2024-11-20 made it work. @piagodai

Nov 22 '24 06:11 Jimbager

Thanks. I have been using claude 3.5 sonnet all along; were you using it before as well? If so, it looks like the LLM is the problem.

Nov 23 '24 05:11 piagodai

I am translating English to Chinese with gpt-4o-2024-08-06. The previous three videos all succeeded, but a two-and-a-half-hour video raised a similar error:

2024-11-23 22:48:17.860 Uncaught app exception
Traceback (most recent call last):
  File "C:\VideoLingo\installer_files\env\lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\VideoLingo\installer_files\env\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "C:\VideoLingo\st.py", line 123, in <module>
    main()
  File "C:\VideoLingo\st.py", line 119, in main
    text_processing_section()
  File "C:\VideoLingo\st.py", line 33, in text_processing_section
    process_text()
  File "C:\VideoLingo\st.py", line 57, in process_text
    step5_splitforsub.split_for_sub_main()
  File "C:\VideoLingo\core\step5_splitforsub.py", line 128, in split_for_sub_main
    pd.DataFrame({'Source': split_src, 'Translation': split_trans}).to_excel(OUTPUT_SPLIT_FILE, index=False)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Nov 23 '24 14:11 jfishlet

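I switched to a different model, reran it, and it worked again. This project is a blessing for people like us who want to watch English video content but whose listening comprehension is not good enough.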

Nov 24 '24 04:11 jfishlet

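I looked into the cause today. The failure happens at this statement:

pd.DataFrame({'Source': src_lines, 'Translation': tr_lines}).to_excel("output/log/translation_results_for_subtitles.xlsx", index=False)

At that point the inputs src_lines and tr_lines contain different numbers of lines. This has nothing to do with German; I hit the same error when translating an Arabic video into Chinese. The most likely cause is that the LLM makes mistakes when splitting sentences, or splits the source text and the translation in different ways. Every LLM returns different results, and even the same LLM returns different results on two separate calls, so switching to another LLM may get past the error. Reloading the video to clear the cache, so that the LLM is called again for sentence splitting, can also get past it.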
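To make the diagnosis above concrete, here is a minimal sketch of a length check that could run before the two lists are handed to pd.DataFrame. The helper name check_alignment and the sample sentences are hypothetical and are not taken from the VideoLingo source; the point is only to show how a mismatch can be reported with the rows where the two lists drift apart, instead of surfacing as the opaque pandas error.

```python
from itertools import zip_longest


def check_alignment(src_lines, tr_lines):
    """Raise a descriptive error if the two lists have different lengths.

    Hypothetical helper for illustration; not part of VideoLingo.
    """
    if len(src_lines) == len(tr_lines):
        return
    # Pair the lines up, padding the shorter list, so the report shows
    # which rows have no counterpart.
    rows = [
        f"{i}: {src!r} | {tr!r}"
        for i, (src, tr) in enumerate(
            zip_longest(src_lines, tr_lines, fillvalue="<missing>")
        )
    ]
    raise ValueError(
        f"source has {len(src_lines)} lines but translation has "
        f"{len(tr_lines)}; last rows:\n" + "\n".join(rows[-3:])
    )


# Made-up example: the LLM merged two source lines into one translation.
src_lines = ["Guten Morgen,", "wie geht es dir?"]
tr_lines = ["早上好，你好吗？"]

try:
    check_alignment(src_lines, tr_lines)
except ValueError as err:
    message = str(err)
    print(message)
```

A check like this does not fix the misalignment, but it would tell the user which subtitle lines to inspect before retrying with a cleared cache or a different model.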

Nov 28 '24 14:11 piagodai

Yes, this problem is related to the LLM. I will try adding stricter validation.

Dec 01 '24 08:12 Huanshere

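Recent testing showed that the cause was the LLM not returning correctly formatted JSON, so the text was not split during alignment, which led to the error. ac5dc9d adds stricter JSON requirements, which should fix this. Thanks for the feedback!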
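For readers wondering what stricter JSON validation can look like in general, the following is a rough sketch and not the actual change in ac5dc9d. The function names, the "parts" key, and the ask_llm callable are all assumptions made for illustration. The idea is to reject any reply that is not valid JSON or does not contain the expected number of pieces, and to ask again rather than continue with unsplit text.

```python
import json


def parse_split_response(raw, key="parts"):
    """Parse an LLM reply and return the list of split pieces.

    Hypothetical format: a JSON object such as {"parts": ["...", "..."]}.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or key not in data:
        raise ValueError(f"reply is not a JSON object with a {key!r} field")
    parts = data[key]
    if not isinstance(parts, list) or not parts:
        raise ValueError(f"{key!r} must be a non-empty list")
    if not all(isinstance(p, str) and p.strip() for p in parts):
        raise ValueError(f"every item in {key!r} must be a non-empty string")
    return parts


def split_with_retry(ask_llm, prompt, expected_parts, max_attempts=3):
    """Call ask_llm until it returns valid JSON with the expected piece count."""
    last_error = None
    for _ in range(max_attempts):
        try:
            parts = parse_split_response(ask_llm(prompt))
            if len(parts) != expected_parts:
                raise ValueError(
                    f"expected {expected_parts} parts, got {len(parts)}"
                )
            return parts
        except ValueError as err:
            last_error = err  # remember why this attempt was rejected
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Stand-in for a real model: the first reply is free text, the second is valid.
replies = iter([
    "Sure! Here is the split you asked for.",
    '{"parts": ["早上好，", "你好吗？"]}',
])
result = split_with_retry(lambda prompt: next(replies), "split into 2 parts", 2)
print(result)
```

Because the expected piece count comes from the source side, a reply that passes this check cannot produce lists of different lengths at the DataFrame step.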

Dec 01 '24 13:12 Huanshere