【Model structure and principles】MiniCPM-Llama3-V 2.5

Feishu user 309

Last modified: August 26, 2024

👇

Bilibili accompanying video: https://www.bilibili.com/video/BV1qr421M7GU/?spm_id_from=333.337.search-card.all.click&vd_source=1534be4f756204643265d5f6aaa38c7b

Hugging Face: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5

GitHub: https://github.com/OpenBMB/MiniCPM-V/tree/main

Model introduction

MiniCPM-Llama3-V 2.5 is the latest and best-performing model in the MiniCPM-V series. It has 8B parameters in total, and its overall multimodal performance surpasses commercial closed-source models such as GPT-4V-1106, Gemini Pro, Claude 3, and Qwen-VL-Max. Its OCR and instruction-following capabilities are further improved, and it supports multimodal interaction in more than 30 languages. By systematically applying efficient inference techniques such as model quantization, CPU and NPU optimization, and compilation optimization, MiniCPM-Llama3-V 2.5 can be deployed efficiently on end-side devices.

Algorithm principle and process (simplified diagram)

1. First, divide the image's pixel count (w*h) by 448*448 to estimate the approximate number of slices.
2. Based on the aspect ratio of the original image, determine patch_size (the width and height of each sliced sub-image, chosen to stay as close as possible to the pre-training aspect ratio) and the number of slices, num_slice. The original image is then rescaled with linear interpolation so that it can be split evenly at the current patch_size (a separate write-up will cover this later).
3. Split the image, tag each patch with its position, sort the patches, and add special tokens:
<im_start> + <unk_token> * num + <im_end>
  Here, num is the number of tokens that represent one sub-image in the final language model.
  <unk_token> is a placeholder that will later be replaced by image features.
  <im_start> and <im_end> are separate tokens marking the start and end of an image, respectively.
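Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the max_slices cap, and the rule of picking the grid whose log aspect ratio is closest to the image's are all assumptions made for the example, and the second helper only computes a target size (the resize itself is not shown).

```python
import math


def choose_slice_grid(width, height, base=448, max_slices=9):
    """Return a hypothetical (cols, rows) grid for slicing the image.

    base is the 448-pixel side used in step 1; max_slices is an assumed cap.
    """
    # Step 1: estimate the slice count from the ratio of pixel counts.
    estimate = min(math.ceil(width * height / (base * base)), max_slices)
    if estimate <= 1:
        return (1, 1)  # small image: no slicing needed

    # Step 2 (assumed criterion): among grids with roughly `estimate` cells,
    # pick the one whose aspect ratio is closest to the image's, in log space.
    target = math.log(width / height)
    best, best_err = (1, 1), float("inf")
    for n in (estimate - 1, estimate, estimate + 1):
        if n < 2 or n > max_slices:
            continue
        for cols in range(1, n + 1):
            if n % cols:
                continue  # cols must divide n to form a full grid
            rows = n // cols
            err = abs(target - math.log(cols / rows))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best


def size_divisible_by_grid(width, height, grid):
    """Round each side to the nearest multiple of the grid dimension,
    so the rescaled image can be split into equal sub-images."""
    cols, rows = grid
    new_w = max(cols, round(width / cols) * cols)
    new_h = max(rows, round(height / rows) * rows)
    return new_w, new_h
```

For a 1344 x 896 image, the estimate in step 1 is 6 and this sketch picks a 3 x 2 grid, so each sub-image is 448 x 448.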

[Figure: example image]

After the image above is split into six slices, it becomes six consecutive <im_start> + <unk_token> * query_nums + <im_end> spans (query_nums is the same quantity as num above), and the six sub-images are saved as patches.
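The placeholder layout for a sliced image can be illustrated with a few lines of Python. This is only a sketch: the function name is hypothetical, the token strings are taken from the text above, and the value 4 used for query_num in the usage note is an arbitrary small number chosen for readability, not the model's real setting.

```python
def image_placeholder(num_slice, query_num,
                      im_start="<im_start>",
                      unk="<unk_token>",
                      im_end="<im_end>"):
    """Build the placeholder string for an image split into num_slice slices.

    Each slice contributes one <im_start>, query_num placeholder tokens,
    and one <im_end>; the slices are simply concatenated in order.
    """
    one_slice = im_start + unk * query_num + im_end
    return one_slice * num_slice
```

Calling image_placeholder(6, 4) yields six consecutive spans, each with four <unk_token> placeholders between <im_start> and <im_end>.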

4. Concatenate the text following the usual LLM convention, as in the following sample:
Q: What does this picture depict?
A: This picture is the logo of OpenBMB. OpenBMB is jointly supported and initiated by ModelBest and the Tsinghua Natural Language Processing Laboratory (THUNLP).

Processed into:
<用户>What does this picture depict?<AI>This picture is the logo of OpenBMB. OpenBMB is jointly supported and initiated by ModelBest and the Tsinghua Natural Language Processing Laboratory (THUNLP).

5. Convert the text above to IDs and concatenate it with the image placeholders from step 3 to obtain the final input:
im_text_hold_id = bos_id + (<im_start> + <unk_token> * num + <im_end>) * num_slice + tokenizer.encode('<用户>What does this picture depict?<AI>This picture is the logo of OpenBMB........Laboratory (THUNLP)') + eos_id
im_text_hold_id is the final token-ID sequence fed into the language model, where bos_id and eos_id are the language model's beginning-of-text and end-of-text tokens, respectively.
num is the number of tokens that one sub-image (patch) occupies.
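The assembly in step 5 can be sketched at the ID level as follows. Everything numeric here is made up for illustration: the special-token IDs are hypothetical, and toy_encode is a stand-in for tokenizer.encode that maps each character to one ID, so the sketch runs without any tokenizer installed.

```python
# Hypothetical special-token IDs, chosen only for this example.
BOS_ID, EOS_ID = 1, 2
IM_START_ID, IM_END_ID, UNK_ID = 101, 102, 0


def toy_encode(text):
    """Stand-in for tokenizer.encode: one ID per character,
    offset by 1000 so it never collides with the special IDs above."""
    return [1000 + ord(ch) for ch in text]


def build_input_ids(question, answer, num_slice, num, encode=toy_encode):
    """Assemble bos + image placeholders + encoded chat text + eos."""
    # One span per slice: <im_start>, num placeholders, <im_end>.
    image_ids = ([IM_START_ID] + [UNK_ID] * num + [IM_END_ID]) * num_slice
    # Chat text in the <用户>...<AI>... form shown in step 4.
    text_ids = encode("<用户>" + question + "<AI>" + answer)
    return [BOS_ID] + image_ids + text_ids + [EOS_ID]
```

The positions holding UNK_ID are the ones that will later receive image features, so they can be recovered with a simple scan such as [i for i, t in enumerate(ids) if t == UNK_ID].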

6. Feed im_text_hold_id into the embedding layer of the language model (LLM) to obtain llm_embedding.

7. Pass all sub-image patches, in order, through the SigLIP and resampler modules to extract the features img_hidden_state.

8. Replace the vectors at the unk_token positions in llm_embedding with img_hidden_state, fusing the image features into the sequence to obtain vlm_state.
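The replacement in step 8 amounts to overwriting rows of the embedding matrix at the placeholder positions. The sketch below uses plain Python lists so that it runs anywhere; in a real model these would be tensors and the write would be a single indexed assignment. The function name and the assumption that img_hidden_state is already flattened to one row per placeholder (num_slice * num rows in total) are illustrative choices, not taken from the project.

```python
def fuse_image_features(llm_embedding, input_ids, img_hidden_state, unk_id):
    """Return a copy of llm_embedding in which the rows at unk_id positions
    are replaced, in order, by the rows of img_hidden_state."""
    positions = [i for i, tok in enumerate(input_ids) if tok == unk_id]
    if len(positions) != len(img_hidden_state):
        raise ValueError(
            "expected %d image feature rows, got %d"
            % (len(positions), len(img_hidden_state))
        )
    # Copy so the original text embeddings are left untouched.
    vlm_state = [list(row) for row in llm_embedding]
    for pos, feat in zip(positions, img_hidden_state):
        vlm_state[pos] = list(feat)
    return vlm_state
```

The length check matters in practice: a mismatch between the number of placeholders and the number of feature rows usually means the slice count or num used when building the input differs from what the vision side produced.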

9. Perform forward propagation and backpropagation. The final architecture is shown in the figure:
