ByteDance debuts Vidi2, a 12B-parameter multimodal LLM for video understanding and editing that can generate TikTok videos from simple prompts.
On December 1, 2025, ByteDance announced Vidi2, its latest multimodal large language model, built specifically for video understanding and editing. The 12-billion-parameter model, based on the Gemma-3 architecture, can process hours of raw footage and localize objects and people at one-second intervals, enabling fine-grained edits such as tracking an object through a dynamic scene. An adaptive token-compression scheme preserves key visual detail while keeping long videos computationally tractable.

Together, these capabilities let users generate complete TikTok shorts or movie clips from a plain text prompt, a notable shift in video content creation workflows. By leveraging TikTok's vast daily active user base and its enormous store of video training data, ByteDance aims to build an AI flywheel that challenges traditional video-editing and AI companies alike. Vidi2 is still in the research phase, with a demonstration expected soon, underscoring the growing role of domain-specialized multimodal LLMs in media production.[4]
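The announcement does not explain how Vidi2's adaptive token compression works. A common approach in long-video LLMs is to merge or prune visual tokens where consecutive frames are nearly identical, so static shots cost few tokens while fast-changing scenes keep full detail. The sketch below illustrates that general idea only; the function name `compress_video_tokens`, the cosine-similarity threshold, and the run-merging rule are hypothetical assumptions, not ByteDance's actual method or API.

```python
import numpy as np

def compress_video_tokens(frame_tokens: np.ndarray, sim_threshold: float = 0.95) -> list[np.ndarray]:
    """Merge runs of visually similar consecutive frames into single token sets.

    frame_tokens: (num_frames, tokens_per_frame, dim) array of visual tokens,
    e.g. one frame sampled per second of video. While consecutive frames stay
    close in embedding space (a low-motion shot), they accumulate in a run;
    when the scene changes, the run is collapsed to one averaged frame's
    tokens, so long static segments contribute few tokens downstream.
    """
    kept: list[np.ndarray] = []
    run: list[np.ndarray] = [frame_tokens[0]]
    for frame in frame_tokens[1:]:
        a = run[-1].mean(axis=0)  # summary embedding of the last frame in the run
        b = frame.mean(axis=0)    # summary embedding of the incoming frame
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if cos >= sim_threshold:
            run.append(frame)                   # low motion: extend the current run
        else:
            kept.append(np.mean(run, axis=0))   # scene change: collapse the run
            run = [frame]
    kept.append(np.mean(run, axis=0))
    return kept

# Toy usage: an hour of one-second frames (3600), 64 tokens each, 256-dim.
rng = np.random.default_rng(0)
video = np.repeat(rng.normal(size=(60, 64, 256)), 60, axis=0)  # 60 distinct "shots"
compressed = compress_video_tokens(video)
print(f"{video.shape[0]} frames -> {len(compressed)} token groups")
```

Under this scheme the token budget scales with the number of distinct shots rather than with raw duration, which is one plausible way a model could keep hour-long footage within a fixed context while still localizing content at one-second granularity.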
Source: AIbase