The NemoStation team has open-sourced Marlin 2B, a video vision-language model fine-tuned from Qwen3.5-2B. It targets two practical questions developers ask of a video, ‘what happens in the video’ and ‘when it happens’, and offers one invocation mode for each. In caption mode, the user passes in a video and receives a structured dictionary containing a general scene description and a list of events, each marked with start and end timestamps in seconds (for instance, "<14.3 – 18.2> A person pushes a door open"). Find mode accepts a natural-language query and returns the start and end timestamps of the matching segments in the video. Both modes are available through the standard Hugging Face transformers API without any extra wrappers.
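To make the caption-mode event format concrete, here is a minimal Python sketch that parses one event string of the form shown in the example above into numeric start and end times plus a description. It does not call the model. The `Event` type and `parse_event` helper are hypothetical names introduced for illustration, and the exact layout of the dictionary the model returns is not specified here, so the sketch only assumes the "<start – end> text" pattern.

```python
import re
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """One timestamped event (hypothetical container, not part of the model's API)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Accept a hyphen, en dash, or em dash between the two timestamps,
# since the separator may vary with how the text was rendered.
_EVENT_RE = re.compile(
    r"^\s*<\s*(\d+(?:\.\d+)?)\s*[-\u2013\u2014]\s*(\d+(?:\.\d+)?)\s*>\s*(.+?)\s*$"
)


def parse_event(line: str) -> Optional[Event]:
    """Parse '<14.3 – 18.2> Some description' into an Event, or return None."""
    match = _EVENT_RE.match(line)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    if end < start:
        # Reject segments whose end precedes their start.
        return None
    return Event(start, end, match.group(3))


if __name__ == "__main__":
    print(parse_event("<14.3 \u2013 18.2> A person pushes a door open"))
```

Returning `None` for malformed lines, rather than raising, lets a caller skip an occasional badly formatted event without discarding the rest of the caption.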
In benchmark tests, Marlin 2B ranks first among all 2B-parameter models on CaReBench, a fine-grained captioning benchmark. On TimeLens-Bench, a temporal localization evaluation suite, it outperforms Qwen2.5-VL-7B by 6.4 mIoU points and matches Gemini-2.0-Flash. On DREAM-1K, its score falls between those of Tarsier-34B and Gemini-1.5-Pro. Taken together, these results make it the strongest open-source video model at the 2B-parameter scale for combining dense description with accurate temporal localization.
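For readers unfamiliar with the mIoU figure quoted for temporal localization, the following Python sketch shows the usual definition: the intersection-over-union of a predicted time segment and a ground-truth segment, averaged over all queries. This is a generic illustration, not the TimeLens-Bench evaluation code, whose exact protocol may differ. The function names are hypothetical, and benchmarks commonly report the value multiplied by 100.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time segments."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    predictions = [(10.0, 20.0), (0.0, 5.0)]
    ground_truth = [(15.0, 25.0), (0.0, 5.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

In the usage example, the first pair overlaps for 5 seconds out of a 15-second union and the second pair matches exactly, so the two per-query scores are averaged into a single figure.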
Training followed a two-phase approach. First, supervised fine-tuning (SFT) was performed on roughly 400,000 high-quality video clips that were re-labeled using Gemini-3-Flash’s reasoning mode and then verified by human reviewers. Second, the model was aligned with SimPO preference optimization. The entire training process was completed on a single H100 GPU. To run Marlin 2B locally, users need transformers 5.7.0 or newer, torch 2.11.0 or newer, and torchcodec. It runs on consumer-grade GPUs and on Macs with an M1 chip and 16GB of memory, and it also works with inference frameworks such as vLLM and Swift. Model weights are distributed free of charge upon request, though commercial use requires prior authorization from the team under the BSL-1.1 license terms.
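The version requirements above can be checked before loading the model. The sketch below is one possible way to do that using only the Python standard library. The helper names (`meets_minimum`, `check_environment`) are hypothetical, the minimum versions are the ones stated in this article, and no minimum is assumed for torchcodec because none is given. The comparison looks only at the leading numeric part of a version string, so a pre-release such as "5.7.0rc1" is treated as 5.7.0.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimums as stated in the article; None means "any installed version".
REQUIREMENTS: Dict[str, Optional[str]] = {
    "transformers": "5.7.0",
    "torch": "2.11.0",
    "torchcodec": None,
}


def _numeric_prefix(version: str) -> Tuple[int, ...]:
    """Turn '2.11.0+cu128' into (2, 11, 0), ignoring any suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so '2.9.1' is correctly below '2.11.0'."""
    a, b = _numeric_prefix(installed), _numeric_prefix(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so '2.11' and '2.11.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, str]:
    """Return a human-readable status for each required package."""
    report: Dict[str, str] = {}
    for name, minimum in REQUIREMENTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is None or meets_minimum(installed, minimum):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Comparing versions as integer tuples rather than as strings matters here, because a plain string comparison would rank "2.9.1" above "2.11.0".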