MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with a total of 8 billion parameters, it introduces significant performance improvements and new features for multi-image and video understanding, marking a substantial advance over its predecessor, MiniCPM-Llama3-V 2.5.
Key Features of MiniCPM-V 2.6:
- Leading Performance: MiniCPM-V 2.6 achieves an average score of 65.2 on OpenCompass, a comprehensive evaluation across eight popular benchmarks. With only 8 billion parameters, it surpasses prominent proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
- Multi-Image Understanding and In-context Learning: Capable of conversation and reasoning over multiple images, MiniCPM-V 2.6 achieves state-of-the-art results on multi-image benchmarks including Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv. It also exhibits promising in-context learning abilities.
- Video Understanding: Accepting video inputs, MiniCPM-V 2.6 provides conversation and dense captions for spatial-temporal information. It outperforms models such as GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME, both with and without subtitles.
- Strong OCR Capability: Processing images with various aspect ratios and up to 1.8 million pixels, MiniCPM-V 2.6 sets a new standard on OCRBench, outperforming proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Leveraging the latest RLAIF-V and VisCPM techniques, it exhibits trustworthy behavior with significantly lower hallucination rates on Object HalBench, and it supports multilingual capabilities across English, Chinese, German, French, Italian, and Korean.
- Superior Efficiency: Despite its compact size, MiniCPM-V 2.6 exhibits state-of-the-art token density, encoding a 1.8-million-pixel image into just 640 tokens, 75% fewer than most models. This improves inference speed, first-token latency, memory usage, and power consumption, enabling efficient real-time video understanding on devices such as iPads.
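The token-density claim above is easy to sanity-check with back-of-envelope arithmetic. The figures below come straight from the article; the "typical model" token count is only what the stated "75% fewer" would imply, not a measurement:

```python
# Back-of-envelope check of the token-density claim (illustrative only).

PIXELS = 1_800_000        # maximum image size handled, per the article
TOKENS_MINICPM = 640      # visual tokens per 1.8 MP image (claimed)

# "75% fewer than most models" implies a typical model would need 4x as many:
tokens_typical = TOKENS_MINICPM / (1 - 0.75)

# Token density = pixels encoded per visual token (higher is denser).
density_minicpm = PIXELS / TOKENS_MINICPM
density_typical = PIXELS / tokens_typical

print(f"MiniCPM-V 2.6: {density_minicpm:.1f} px/token")   # 2812.5 px/token
print(f"Typical model: {density_typical:.1f} px/token")   # 703.1 px/token
```

Fewer visual tokens per image shortens the prompt the language model must process, which is why the same claim translates into lower first-token latency and memory use.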
- Ease of Use: MiniCPM-V 2.6 is versatile in deployment: it supports efficient CPU inference on local devices through llama.cpp and ollama, offers quantized models in int4 and GGUF formats in 16 sizes, provides vLLM support for high-throughput, memory-efficient inference, and allows domain-specific fine-tuning, quick local WebUI demo setup with Gradio, and online web demos.
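The local-inference paths listed above might look like the following. This is a hedged sketch: the `minicpm-v` ollama tag, the `openbmb/MiniCPM-V-2_6` Hugging Face repo name, and the exact flags are assumptions based on common ollama and vLLM conventions, so check the project's README for the authoritative commands:

```shell
# Quantized CPU inference via ollama (model tag assumed to be "minicpm-v"):
ollama run minicpm-v "Describe this image: ./photo.jpg"

# High-throughput serving via vLLM (repo name and flags are assumptions;
# --trust-remote-code is typically required for custom model code):
vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code --max-model-len 4096
```

Either route avoids writing custom inference code: ollama targets local, quantized GGUF inference, while vLLM exposes an OpenAI-compatible HTTP endpoint for batched serving.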
MiniCPM-V 2.6 represents a significant leap in machine learning for visual understanding, offering strong performance, efficiency, and usability across single-image, multi-image, and video processing tasks.
Check out the HF Model and GitHub. All credit for this research goes to the researchers of this project.