MoshiVis is a Vision Speech Model (VSM) integrating speech and image processing for interactive conversations.
MoshiVis is designed to hold engaging discussions about images while maintaining a natural conversational style. It builds on the foundational speech model Moshi and adds 206M adapter parameters on top of the base model.