
German-to-Chinese videos always fail with "ValueError: All arrays must be of the same length"

Open piagodai opened this issue 1 year ago • 3 comments

2024-11-04 12:58:07.299 Uncaught app exception
Traceback (most recent call last):
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "D:\AI\VideoLingo\st.py", line 117, in <module>
    main()
  File "D:\AI\VideoLingo\st.py", line 113, in main
    text_processing_section()
  File "D:\AI\VideoLingo\st.py", line 30, in text_processing_section
    process_text()
  File "D:\AI\VideoLingo\st.py", line 55, in process_text
    step5_splitforsub.split_for_sub_main()
  File "D:\AI\VideoLingo\core\step5_splitforsub.py", line 106, in split_for_sub_main
    pd.DataFrame({'Source': src_lines, 'Translation': tr_lines}).to_excel("output/log/translation_results_for_subtitles.xlsx", index=False)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\Users\WINDOWS\anaconda3\envs\videolingo\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

I tried several German clips and they all hit this problem. English-to-Chinese works fine.

piagodai, Nov 04 '24 05:11

When asking, please include which LLM you are using.

Huanshere, Nov 04 '24 09:11

Sure. I followed the default documentation and used claude-3-5-sonnet-20240620. (Screenshot attached: 微信截图_20241104191845)

piagodai, Nov 04 '24 11:11

Thanks for the feedback. I will test German videos more thoroughly later. So far I had only run a quick test.

Huanshere, Nov 08 '24 09:11

First of all, thanks for open-sourcing this. It is a real productivity tool. Longer English videos also raise this error. Could it be a problem with the LLM?

ValueError: All arrays must be of the same length
Traceback:
  File "C:\Users\我叫肥坚\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\Users\我叫肥坚\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "C:\VideoLingo2.0\st.py", line 123, in <module>
    main()
  File "C:\VideoLingo2.0\st.py", line 119, in main
    text_processing_section()
  File "C:\VideoLingo2.0\st.py", line 33, in text_processing_section
    process_text()
  File "C:\VideoLingo2.0\st.py", line 57, in process_text
    step5_splitforsub.split_for_sub_main()
  File "C:\VideoLingo2.0\core\step5_splitforsub.py", line 129, in split_for_sub_main
    pd.DataFrame({'Source': src, 'Translation': remerged}).to_excel(OUTPUT_REMERGED_FILE, index=False)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\VideoLingo2.0\python\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")

Jimbager, Nov 22 '24 06:11

I just tried it: with the same long video, switching the LLM to openai/gpt-4o-2024-11-20 made it work. @piagodai

Jimbager, Nov 22 '24 06:11

Thanks. I have been using claude 3.5 sonnet all along. Were you using it before as well? If so, it looks like the problem lies with the LLM.

piagodai, Nov 23 '24 05:11

I am translating English to Chinese with gpt-4o-2024-08-06. My previous three videos all succeeded, but a two-and-a-half-hour video raised a similar error:

2024-11-23 22:48:17.860 Uncaught app exception
Traceback (most recent call last):
  File "C:\VideoLingo\installer_files\env\lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
  File "C:\VideoLingo\installer_files\env\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
  File "C:\VideoLingo\st.py", line 123, in <module>
    main()
  File "C:\VideoLingo\st.py", line 119, in main
    text_processing_section()
  File "C:\VideoLingo\st.py", line 33, in text_processing_section
    process_text()
  File "C:\VideoLingo\st.py", line 57, in process_text
    step5_splitforsub.split_for_sub_main()
  File "C:\VideoLingo\core\step5_splitforsub.py", line 128, in split_for_sub_main
    pd.DataFrame({'Source': split_src, 'Translation': split_trans}).to_excel(OUTPUT_SPLIT_FILE, index=False)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\frame.py", line 778, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\VideoLingo\installer_files\env\lib\site-packages\pandas\core\internals\construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

jfishlet, Nov 23 '24 14:11

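I switched to a different model, ran it again, and it worked. This project is a godsend for people like us who want to watch English video content but whose listening comprehension is not good enough.

jfishlet, Nov 24 '24 04:11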

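I looked into the cause today. The error is raised at this statement: pd.DataFrame({'Source': src_lines, 'Translation': tr_lines}).to_excel("output/log/translation_results_for_subtitles.xlsx", index=False), because the src_lines and tr_lines passed in have different numbers of lines.

It has nothing to do with German. I hit the same error when translating an Arabic video into Chinese. The likely cause is that the LLM makes mistakes when splitting sentences, or splits the source text and the translation differently.

Every LLM returns different results, and even the same LLM returns different results across two runs, so switching to another LLM may get past the error. Reloading the video to clear the cache, so that the LLM is called again for sentence splitting, can also get past it.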
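The mismatch described above can be caught before the lists reach pandas. What follows is only a minimal sketch, not code from the VideoLingo repository: the helper name check_aligned and the sample sentences are hypothetical, and only the variable names src_lines and tr_lines come from the traceback.

```python
def check_aligned(src_lines, tr_lines):
    """Return both lists unchanged, or raise a descriptive error if their lengths differ."""
    if len(src_lines) != len(tr_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} source vs {len(tr_lines)} translation"
        )
    return src_lines, tr_lines


# Hypothetical data: the model merged two translated lines into one,
# so the translation list is one entry shorter than the source list.
src_lines = ["Guten Morgen.", "Wie geht es dir?", "Danke."]
tr_lines = ["早上好。", "你好吗?谢谢。"]

try:
    check_aligned(src_lines, tr_lines)
    error_message = None
except ValueError as exc:
    # A caller could log this and re-request the split instead of crashing.
    error_message = str(exc)
```

Running a check like this just before the pd.DataFrame(...) call would report how many lines each side has, which the pandas message alone does not show.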

piagodai, Nov 28 '24 14:11

Yes, this problem is related to the LLM. I will try adding stricter validation.

Huanshere, Dec 01 '24 08:12

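Recent testing showed that the cause is the LLM not returning correctly formatted JSON, so no split was made during alignment, which led to the error. Commit ac5dc9d adds stricter JSON requirements, which should solve the problem. Thanks for the feedback!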
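One way to enforce a JSON requirement like this is to validate each reply and ask again when validation fails. The sketch below is an illustration under assumptions, not the contents of commit ac5dc9d: the function names parse_json_reply and ask_until_valid, the "align" key, and the canned replies are all hypothetical.

```python
import json


def parse_json_reply(raw, required_keys):
    """Parse an LLM reply as a JSON object and check that the required keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is valid JSON but not an object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


def ask_until_valid(ask, required_keys, max_attempts=3):
    """Call ask() until a reply passes parse_json_reply, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_json_reply(ask(), required_keys)
        except ValueError as exc:
            last_error = exc  # remember why the reply was rejected, then retry
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Stand-in for a real LLM call: the first reply is prose, the second is valid JSON.
# The "align" key is a hypothetical field name used only for this example.
replies = iter(['Sure! Here is the split: ...', '{"align": ["part one", "part two"]}'])
result = ask_until_valid(lambda: next(replies), required_keys=["align"])
```

Retrying fits the observation earlier in the thread that the same LLM can return different results on two runs, so a second request often succeeds where the first one failed.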

Huanshere, Dec 01 '24 13:12