- MiniCPM-V 2.6 Deployment Tutorial
- vllm
- vllm python code inference
- 1.1 First go to Hugging Face to download the model weights:
- Or: you can also download the quantized AWQ model, which is twice as fast and requires only 7 GB of GPU memory.
- Simple comparison between fp16 and awq using vllm (4090 single gpu)
- 1.2 pip install vllm
- 1.3 Create python code to inference by vllm
- 1.4 Describe the video
- vllm api sever
- llama.cpp inference
- 2. Get the llama.cpp branch of openbmb:
- 3. Install the required packages
- 4. Get the gguf weight of MiniCPM-V 2.6.
- 5. Get the gguf weight of MiniCPM-V 2.6.
- 6. Start inference:
- 6.1 Picture reasoning instructions
- 6.2 Video reasoning instructions
- 6.3 Inference parameter description
- Ollama
- 1. Follow the llama.cpp tutorial above to obtain the gguf model. The quantized language model is recommended.
- 2. Install the required packages
- 3. Get the official ollama branch of openbmb:
- 4. Environment requirements:
- 5. Install large model dependencies:
- 6. Compile ollama
- 7. Once compilation succeeds, start ollama from the ollama root directory:
- 8. Create a Modelfile:
- 9. Create ollama model Instance:
- 10. Run ollama model instance:
- 11. Input the question and image URL separated by a space.
- FAQ
- 1. Q: I get OOM when initializing the model (vllm)
- 2. Q: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.(vllm)
- 3. [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 23.64 GiB of which 769.69 MiB is free. Including non-PyTorch memory, this process has 22.88 GiB memory in use.
MiniCPM-V 2.6 Deployment Tutorial
Last modified: August 20, 2024
This document is a deployment tutorial for MiniCPM-V 2.6, covering several inference methods and solutions to problems you may encounter. Key points:
1. Model and reference links: the model address, GitHub repository, and Bilibili tutorial video are provided; the tutorial is suitable for people who can modify parameters in Python scripts and use basic Bash.
2. vllm inference: explains how to download the model weights, compares fp16 and AWQ in terms of speed, time, and memory usage, and covers Python code inference, video description, and the vllm API server.
3. llama.cpp inference: the device needs a certain amount of memory; explains how to get the branch, install the packages, obtain the weights, and run inference, and describes the meaning of each inference parameter.
4. Ollama: the device needs a certain amount of memory; covers getting the branch, installing dependencies, compiling, and creating and running a model instance.
5. FAQ: gives solutions for out-of-memory errors during model initialization, abnormal tags in the output, and installation/compilation errors.
Bilibili Accompanying Video: https://www.bilibili.com/video/BV1sM4m1172r/?vd_source=cd29f4e20ef69babd26f4f34cc7c8b3f
Suitable for Individuals: those who can modify simple parameters in Python scripts and use basic Bash.
vllm
vllm python code inference
1.1 First go to Hugging Face to download the model weights:
git clone https://huggingface.co/openbmb/MiniCPM-V-2_6
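Note that a plain git clone needs git-lfs to actually pull the large weight files. If git-lfs is not available, a minimal alternative (not part of the original tutorial) is to fetch the repository with the huggingface_hub Python package; the local_dir below is just an example path:

# Sketch: download the weights with huggingface_hub instead of git clone.
# Assumes: pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/MiniCPM-V-2_6",  # same repository as the git clone above
    local_dir="MiniCPM-V-2_6",        # example target directory
)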
Or: you can also download the quantized AWQ model, which is twice as fast and requires only 7 GB of GPU memory.
# Download the AWQ int4 weights from ModelScope
git clone https://www.modelscope.cn/models/linglingdan/MiniCPM-V_2_6_awq_int4
# Install the AutoAWQ fork in editable mode
git clone https://github.com/LDLINGLINGLING/AutoAWQ.git
cd AutoAWQ
pip install -e .
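As a quick sanity check that the AWQ weights load, here is a minimal text-only sketch with vLLM (it assumes vllm is already installed as in step 1.2; the model path, memory fraction, and sampling settings are placeholders, and the full multimodal inference script is given in section 1.3):

# Sketch: text-only smoke test of the AWQ checkpoint in vLLM.
# Assumes: pip install vllm, and the model path points at the cloned AWQ weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniCPM-V_2_6_awq_int4",   # local path from the git clone above
    trust_remote_code=True,
    quantization="awq",               # the weights are AWQ int4
    max_model_len=2048,
    gpu_memory_utilization=0.8,
)
outputs = llm.generate(
    ["Briefly introduce yourself."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)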