
Enhanced Ollama Response Handling with Retries and Streaming

Open · 7shi opened this issue 1 year ago

This PR improves the reliability and user experience of the OllamaTranslator by addressing several issues I've encountered.

Current Issues:

  1. The current implementation sometimes gets stuck generating infinite responses, which can halt the translation process.

  2. I've also observed that certain models handle specific types of text better than others, but the current implementation has no way to leverage multiple models.

Implemented Solutions:

To address these issues, I've implemented several key improvements:

  1. I've added a length limitation that caps responses at either 2000 characters or three times the input length, whichever is greater. This threshold is based on typical translation length ratios and effectively prevents infinite responses while allowing for natural translation expansion.

  2. I've also added support for multiple models, allowing users to specify several models separated by semicolons. For example:

    OLLAMA_MODEL=gemma2:2b-instruct-q4_K_M;aya-expanse
    

    Each model gets two retry attempts, as my experience shows that additional retries rarely improve results. However, I've found that different models often succeed where others fail, making this multi-model approach particularly effective.

  3. I've implemented streaming responses to match the existing prompt display functionality. This provides immediate feedback during the translation process and helps identify where and when translation issues occur.

  4. I've added a fallback mechanism that returns the original text if all translation attempts fail, ensuring the program continues to operate without hanging while preserving the content.

These changes significantly improve translation reliability while maintaining compatibility with existing debug output patterns.
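As a rough illustration, the four safeguards above might fit together like this. This is a minimal sketch, not the actual PR code: the function names and the `stream_chat` generator (which stands in for Ollama's streaming chat call) are assumptions.

```python
def max_response_length(text: str) -> int:
    # Cap from the PR description: 2000 characters or 3x the input
    # length, whichever is greater.
    return max(2000, 3 * len(text))

def parse_models(env_value: str) -> list[str]:
    # OLLAMA_MODEL may list several models separated by semicolons,
    # e.g. "gemma2:2b-instruct-q4_K_M;aya-expanse".
    return [m.strip() for m in env_value.split(";") if m.strip()]

def translate(text: str, models: list[str], stream_chat, retries: int = 2) -> str:
    """Try each model up to `retries` times.

    `stream_chat(model, text)` is a hypothetical generator yielding
    response chunks as they arrive (matching the streaming display).
    """
    limit = max_response_length(text)
    for model in models:
        for _ in range(retries):
            result = ""
            try:
                for chunk in stream_chat(model, text):
                    result += chunk
                    if len(result) > limit:
                        # Likely a runaway response; abandon this attempt.
                        break
                else:
                    return result  # stream finished normally
            except Exception:
                continue  # e.g. a transient server error; retry or move on
    # Fallback from point 4 (later dropped during review):
    # return the original text so the pipeline never hangs.
    return text
```

In use, the model list would come from the environment, e.g. `parse_models(os.environ["OLLAMA_MODEL"])`, and `stream_chat` would wrap the Ollama client's streaming chat call.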

7shi avatar Dec 20 '24 11:12 7shi

Thanks for your contribution! However, maybe we can adjust the temperature parameter (currently 0) to avoid generating infinite response, which is more maintainable and elegant.
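For reference, the temperature knob is passed through the `options` field of an Ollama request; a hedged sketch of what a non-zero setting could look like (the specific values are illustrative only, not a recommendation from this thread):

```python
# Illustrative only: a small positive temperature adds sampling noise,
# which can break the deterministic repetition loops that temperature 0
# is prone to. repeat_penalty is another Ollama option sometimes used
# for the same symptom; neither value here comes from this discussion.
options = {
    "temperature": 0.7,   # the current implementation uses 0
    "repeat_penalty": 1.1,
}
```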

Byaidu avatar Dec 20 '24 13:12 Byaidu

Thank you for your suggestion about adjusting the temperature parameter. I understand and respect your approach.

However, I'd like to clarify the current critical issue: when Ollama enters an infinite-loop state, it completely blocks the translation process and prevents pdf2zh from even attempting a retry. If I forcibly terminate Ollama at that point, pdf2zh enters an infinite reconnection loop, and I then have to kill the Python process as well.

While adjusting the temperature may help reduce the frequency of infinite responses, I think we still need a safety mechanism to handle cases where the model gets stuck.

Would you consider accepting the timeout/length limitation part of the PR separately from the multi-model features? This would provide an essential safety net while we explore temperature-based solutions.

7shi avatar Dec 20 '24 14:12 7shi

Sure, the timeout/length limitation and multi-model are acceptable.

However, we might not accept a fallback mechanism that returns the original text, considering that program exceptions could be triggered not only by length limitations but also by issues within the ollama module itself (e.g., the user failing to correctly configure the ollama server, resulting in an HTTP error). This could complicate error diagnosis, especially considering the history of frequent ollama configuration errors reported in the issue tracker.
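A sketch of the alternative this suggests: propagate a descriptive error when every attempt fails, rather than silently substituting the source text. The exception type and message here are assumptions for illustration, not project code.

```python
class TranslationError(RuntimeError):
    """Raised when every model/retry attempt fails."""

def translate_or_raise(text, models, stream_chat, retries=2):
    last_error = None
    for model in models:
        for _ in range(retries):
            try:
                return "".join(stream_chat(model, text))
            except Exception as exc:  # e.g. HTTP error from a misconfigured server
                last_error = exc
    # No silent fallback: surface the underlying cause so that
    # configuration problems are easy to diagnose.
    raise TranslationError(f"all Ollama models failed: {last_error!r}")
```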

Byaidu avatar Dec 20 '24 14:12 Byaidu

I agree with removing the fallback mechanism. Clear error messages would be more helpful for troubleshooting.

Currently, when OllamaTranslator raises an error, the system keeps retrying indefinitely. I understand this retry behavior might be related to the overall error handling policy of the project, so I'll leave the specific implementation to your judgment.

7shi avatar Dec 20 '24 14:12 7shi

Thank you.

7shi avatar Dec 20 '24 15:12 7shi