An open-source model pioneering multimodal reasoning with chain-of-thought (CoT), targeting advanced visual and textual reasoning.
A comprehensive survey on benchmarks for Multimodal Large Language Models (MLLMs).
Open-source evaluation toolkit for large multi-modality models, supporting 220+ models and 80+ benchmarks.
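This appears to be VLMEvalKit, which also exposes a per-model Python API alongside its benchmark runner. A minimal inference sketch, assuming the `vlmeval` package is installed; the registry key and image path are illustrative values from the project's documentation:

```python
# Minimal VLMEvalKit inference sketch (assumes `vlmeval` is installed).
from vlmeval.config import supported_VLM

# Models are constructed via registry keys; 'idefics_9b_instruct' is one
# example key from the toolkit's docs.
model = supported_VLM['idefics_9b_instruct']()

# Single-image forward pass: [image_path, prompt].
response = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(response)
```

Full benchmark runs typically go through the toolkit's `run.py` script rather than this per-image API.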
DeTikZify synthesizes TikZ graphics programs for scientific figures and hand-drawn sketches.
Qwen2.5-Omni is an end-to-end multimodal model from Alibaba Cloud that understands text, audio, images, and video, and can respond with both text and natural speech.
VideoMind is a Chain-of-LoRA agent for long-video reasoning that mimics a human-like workflow of planning, grounding, verifying, and answering.
AnimeGamer is an infinite anime life simulator that predicts the next game state with multimodal large language models.
A collection of high-quality pretrained models and resources for Chinese natural language processing.
Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour; a hands-on project for deep learning enthusiasts.
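To make the from-scratch recipe concrete, below is a hedged sketch of the projector-style architecture that tiny VLMs like this commonly use: vision features are linearly projected into the language model's embedding space and prepended to the text tokens, then trained with a next-token loss. All module names and sizes are illustrative assumptions, not the repository's actual code.

```python
# Hedged sketch of a projector-style tiny VLM (illustrative only; names
# and sizes are assumptions, not the repository's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVLM(nn.Module):
    def __init__(self, vocab=4096, d=256, n_patches=16, d_vision=384):
        super().__init__()
        self.n_patches = n_patches
        self.tok_emb = nn.Embedding(vocab, d)
        # Linear projector maps vision features into the LM embedding space.
        self.projector = nn.Linear(d_vision, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, patch_feats, tokens):
        # patch_feats: (B, n_patches, d_vision), e.g. from a small ViT.
        # tokens: (B, T) text token ids.
        vis = self.projector(patch_feats)           # (B, n_patches, d)
        txt = self.tok_emb(tokens)                  # (B, T, d)
        h = torch.cat([vis, txt], dim=1)            # prepend image "tokens"
        S = h.size(1)
        # Causal mask: -inf strictly above the diagonal blocks attention
        # to future positions.
        causal = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.lm_head(h[:, self.n_patches:])  # logits for text positions

model = TinyVLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
feats = torch.randn(2, 16, 384)                     # stand-in vision features
toks = torch.randint(0, 4096, (2, 32))              # stand-in caption tokens
logits = model(feats, toks)                         # (B, T, vocab)
# Next-token prediction: targets are the tokens shifted left by one.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                       toks[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(f"one training step done, loss={loss.item():.3f}")
```

A single optimizer step on random stand-in data, as above, is enough to verify shapes; a real run would swap in ViT patch features and tokenized captions.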
OmniSVG is an end-to-end multimodal generator that leverages vision-language models to produce detailed SVG graphics.
LLMFarm is an iOS and macOS app for running large language models offline, built on the GGML library.
Jina AI offers neural search solutions for multilingual and multimodal data, including embedding and reranking models.
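As one concrete entry point, Jina publishes an embeddings endpoint. A hedged sketch, assuming the public `https://api.jina.ai/v1/embeddings` API and the `jina-embeddings-v3` model name (both taken from the public docs and subject to change); the API key is a placeholder:

```python
# Hedged sketch of calling the Jina embeddings endpoint; model name and
# response shape are assumptions based on the public documentation.
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <YOUR_JINA_API_KEY>",  # placeholder key
    },
    json={
        "model": "jina-embeddings-v3",
        "input": ["A multilingual sentence", "Une phrase multilingue"],
    },
    timeout=30,
)
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(len(embeddings), len(embeddings[0]))  # 2 vectors of model dimension
```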