Inference and fine-tuning support for GOT-OCR2.
Inference:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type got-ocr2 --model_id_or_path stepfun-ai/GOT-OCR2_0
<<< <image>OCR:
Input an image path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
简介 SWIFT支持250+LLM和35+MLLM(多模态大模型)的训练、推理、 评测和部署。开发者可以直接将我们的框架应用到自己的Research和 生产环境中,实现模型训练评测到应用的完整链路。我们除支持了 PEFT提供的轻量训练方案外,也提供了一个完整的Adapters库以支持 最新的训练技术,如NEFTune、LoRA+、LLaMA-PRO等,这个适配器 库可以脱离训练脚本直接使用在自己的自定流程中。 为方便不熟悉深度学习的用户使用,我们提供了一个Gradio的web-ui用 于控制训练和推理,并提供了配套的深度学习课程和最佳实践供新手入 门。 此外,我们也在拓展其他模态的能力,目前我们支持了AnimateDiff的 全参数训练和LoRA训练。 SWIFT具有丰富的文档体系,如有使用问题请请查看这里 可以在Huggingfacespace和ModelScope创空间中体验SWIFTweb ui功能了。
--------------------------------------------------
<<< clear
<<< <image>OCR:
Input an image path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
Introduction
SWIFT supports training, inference, evaluation and deployment of 250+ LLMs
and 35+ MLLMs (multimodal large models). Developers can directly apply our
framework to their own research and production environments to realize the
complete workflow from model training and evaluation to application. In addition
to supporting the lightweight training solutions provided by PEFT, we also
provide a complete Adapters library to support the latest training techniques
such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used
directly in your own custom workflow without our training scripts.
To facilitate use by users unfamiliar with deep learning, we provide a Gradio
web-ui for controlling training and inference, as well as accompanying deep
learning courses and best practices for beginners.
Additionally, we are expanding capabilities for other modalities. Currently, we
support full-parameter training and LoRA training for AnimateDiff.
SWIFT has rich documentations for users, please check here.
SWIFT web-ui is available both on Huggingface space and ModelScope studio,
please feel free to try!
Fine-tuning:
# fine-tuning LLM & projector, freeze vision encoder
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type got-ocr2 --model_id_or_path stepfun-ai/GOT-OCR2_0 \
--sft_type lora \
--dataset latex-ocr-print#5000
# DDP & ZeRO2
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type got-ocr2 --model_id_or_path stepfun-ai/GOT-OCR2_0 \
--sft_type lora \
--dataset latex-ocr-print#5000 \
--deepspeed default-zero2
Inference after fine-tuning:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/got-ocr2/vx-xxx/checkpoint-xxx \
--load_dataset_config true
Hello, may I ask whether swift vLLM deployment is supported? Something like the following command: CUDA_VISIBLE_DEVICES=0 swift deploy --model_type llava1_6-vicuna-13b-instruct --infer_backend vllm
When will vLLM inference be supported?
I am trying to fine-tune GOT in Hindi. The dataset I am using is from HuggingFace Datasets (damerajee/hindi-ocr). It contains only two columns: one is an image and the other is the text present in the image.
I have prepared a .json file in the following format (taken from the official GOT OCR2.0 repo):
{"query": "
{
"query": "
Is the above .json file right? Or should I be placing the image object (a PIL image object) instead of the image path? In "response", I have given the text (ground truth) that I am expecting from the model; am I right?
Now the issue is: how do I use this fine-tuned model? I went through the documentation; unlike your official GOT online demo, which directly accepts an image, in this fine-tuned version one must enter a prompt such as "<image>OCR:".
I am doing all this as part of a project to build a basic application using Streamlit. The GitHub repository for it is given below: https://github.com/AISpaceXDragon/GOT-OCR2.0.git
Thank you for taking the time to read my queries; I hope to receive your response as soon as possible.
@AISpaceXDragon I see you have successfully fine-tuned the model in another language, Hindi. Can you show me how to build a training dataset for a new language the way you did? I would be very grateful for that.
As I mentioned, I am using a dataset from HuggingFace Datasets (linked above) and I didn't build it myself. But I think you meant building the ".json file" for a given dataset, is that it? Please let me know so that I can assist you.
@AISpaceXDragon That's right, I mean how to build ".json file" from a standard data set
@AISpaceXDragon Can you tell me at what stage do you do it when fine tuning? And are the results after fine tuning similar to the original results published by the author? I mean is it approximately?
@AISpaceXDragon That's right, I mean how to build ".json file" from a standard data set
I wrote a Python script to prepare the .json file for a dataset. The format of the entries in the .json file is the same as mentioned in the comment before. The script takes the images and stores them in a folder, while the "response" part of each JSON entry contains the ground truth (in my case, the text present in the image). That is what we want the model to give as a reply when it is given the image path specified in the "images" part of the entry.
This is what I have done, but I was not able to evaluate the model with the same format. This is why I posted a comment in this issue thread.
Format - {"query": "55555", "response": "66666", "images": ["image_path"]}
@AISpaceXDragon Can you tell me at what stage do you do it when fine tuning? And are the results after fine tuning similar to the original results published by the author? I mean is it approximately?
What do you mean by "Can you tell me at what stage do you do it when fine tuning?"? I didn't get you. Please try to be clear.
Answer for "And are the results after fine tuning similar to the original results published by the author? I mean is it approximately?" The thing is that, I fine-tuned the model on Google Colab, which means limited compute resources. As per my observation, if fine-tuned for more number of epochs and on more data ,the results would be excellent(as mentioned in the research paper).
@AISpaceXDragon Reply to "Can you tell me at what stage do you do it when fine tuning?": I see the author mentioned the following in the README.md section:
0. Train sample can be found here. Note that the '<image>' in the 'conversations'-'human'-'value' is necessary!
1. This codebase only supports post-training (stage-2/stage-3) upon our GOT weights.
2. If you want to train from stage-1 described in our paper, you need this repo.
@minhduc01168 Reply to "I see the author mentioned the following in the README.md section: 0. Train sample can be found here. Note that the '<image>' in the 'conversations'-'human'-'value' is necessary! 1. This codebase only supports post-training (stage-2/stage-3) upon our GOT weights. 2. If you want to train from stage-1 described in our paper, you need this repo."
I see that you are referring to training of the model, but I am referring to fine-tuning of the model. This means I am working only at Stage 2 or 3.
Note that training is different from fine-tuning. Training means taking the defined model architecture with random weights and passing all the inputs through it until the model gives the corresponding correct outputs. Fine-tuning means taking those pretrained weights (what the model has already learned) and using them for a specific variation of the same task. In this case I want to perform OCR, which is the main aim of the model, but because the data used to train the model was mostly English and Chinese, the model is only good at those languages. I want to extend these capabilities to another language, in my case Hindi, so I took the pretrained weights (the model's ability to extract text from images) and trained it on a different language. In other words, I want the model to keep its ability to extract text from images, but for an additional language alongside the languages it was already trained on.
I hope you understand what I am trying to convey. Let me know, if you didn't understand any part of the explanation.
@Jintao-Huang Could you answer my question?
@AISpaceXDragon Have you had anyone explain the data format below? Can you explain it to me? I'm very grateful for that.
{"query": "
Hey, can someone please help and tell me how I can train this model on the MNIST dataset?
@AISpaceXDragon HELP please
@minhduc01168 Reply to "Have you had anyone explain the data format below? Can you explain it to me? I'm very grateful for that. {"query": "55555", "response": "66666", "images": ["image_path"]} {"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]} {"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}"
Answer to the first part of the question: I understood them myself; no one explained them to me.
Answer to the second part of the question: there are three data formats, as mentioned. The first one is "query", which contains the prompt and the image tag, i.e. <image>.
The explanation for the second data format is similar to the first one, except that it contains a new entry, "history", which records all the previous responses of the model for the given images.
The explanation for the third data format is similar to the above. Here, "history" contains a list of all the query and response pairs that you would otherwise have given separately in the first data format.
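To illustrate how the third format relates to the first, here is a small sketch that folds consecutive single-turn entries into one multi-turn entry with "history"; the pairing logic is only an example, and the placeholder values are taken from the format quoted above.

# Sketch: fold single-turn query/response pairs into one entry with "history".
# Earlier turns become [query, response] pairs in "history"; the final turn stays as query/response.
import json

single_turns = [
    {"query": "query1", "response": "response1"},
    {"query": "query2", "response": "response2"},
    {"query": "EEEEE", "response": "FFFFF"},
]

multi_turn = {
    "query": single_turns[-1]["query"],
    "response": single_turns[-1]["response"],
    "history": [[t["query"], t["response"]] for t in single_turns[:-1]],
}

print(json.dumps(multi_turn, ensure_ascii=False))
# -> {"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}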
I hope my explanation is clear; if not, let me know. Thank you.
Hey, can someone please help and tell me how I can train this model on the MNIST dataset?
Follow the instructions given in ModelScope's ms-swift documentation.
Let me know if you didn't get it, thank you.
I tried it on Google Colab and I got the error I sent above.
@AISpaceXDragon Did you train successfully and is everything working well? Thank you very much for your answer.
Yes, training works fine, but testing does not work at all.
infer vllm?
@Jintao-Huang Could you answer my question?
Hello, the holiday just ended, and I didn’t reply in time. What was the issue? 😊
@Jintao-Huang Can you explain it to me? I'm very grateful for that. {"query": "55555", "response": "66666", "images": ["image_path"]} {"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]} {"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
This format might be clearer.
{"query": "<image>55555", "response": "66666", "images": ["image_path"]}
{"query": "<image><image>eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
@Jintao-Huang Thank you for explaining it to me. Because my GPU resources are limited, can you tell me how I can load the model weights to continue training? Thank you.
@AISpaceXDragon Sorry, which OCR did you use to produce the response text in the data format? Pytesseract, GOT-OCR, or something else? Thank you.
I didn't get you. Please try to be clear.
@Jintao-Huang I want to fine-tune for OCR of table images in another language. I don't understand what the content of "response" should be: the table structure line by line, or a LaTeX tabular? Can you explain it to me? Thank you.
{"query": "
After fine-tuning, calling the fine-tuned model reports an error:
How can this be resolved? The model contents are as follows:
You need to merge the LoRA first; only then will there be a config.json file.
At which step should the merge lora be done? I don't quite understand, thank you!