LTX 2.3 Generates 4K Video with Synchronized Audio in One Pass

Lightricks has released LTX 2.3, a 22-billion-parameter open-source model that generates 4K video with synchronized audio in a single forward pass. Most earlier video generation models produced silent clips that needed a separate audio synthesis step. LTX 2.3 combines both into one pipeline, generating up to 20 seconds of video at 50 frames per second with matched sound effects and ambient audio.

The model runs on a single consumer-grade GPU with at least 24 GB of VRAM, which makes it accessible to independent creators and small studios. Because it is open source, anyone can download, fine-tune, and deploy it without licensing fees.

Key Features of LTX 2.3 Video Generation

  • 4K resolution output at up to 50 FPS in a single generation pass
  • Synchronized audio generated alongside video, not stitched on afterward
  • Native portrait mode support for vertical video content
  • Up to 20 seconds of continuous video per generation
  • Open-source release with full model weights available for download
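
To put the headline numbers in perspective, the short sketch below works out how many frames a maximum-length clip contains. It assumes that "4K" means UHD (3840x2160), which the article does not specify, and the function name is purely illustrative.

```python
# Back-of-the-envelope frame budget for one generation.
# Assumption: "4K" means UHD, 3840x2160; the source does not state exact dimensions.

def frame_budget(seconds: float, fps: int, width: int = 3840, height: int = 2160) -> dict:
    """Return the frame count and pixel counts for a clip of the given length."""
    frames = round(seconds * fps)
    return {
        "frames": frames,
        "pixels_per_frame": width * height,
        "total_pixels": frames * width * height,
    }

budget = frame_budget(seconds=20, fps=50)
print(budget["frames"])            # 1000
print(budget["pixels_per_frame"])  # 8294400
print(budget["total_pixels"])      # 8294400000
```

A 20-second clip at 50 FPS is therefore 1,000 frames, each of which needs audio that stays aligned with it.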

How LTX 2.3 Handles Audio-Video Synchronization

Earlier video models treated audio as an afterthought. You would generate a video clip, then run a separate model to create matching sound. The results rarely aligned well. Footsteps would not match foot movements. Door slams would arrive a half-second late.

LTX 2.3 uses a Diffusion Transformer architecture that processes visual and audio tokens together during generation. The model learns temporal relationships between what happens on screen and what sounds should accompany it. In practice, this means a generated video of ocean waves includes wave crash sounds that match the visual timing.
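
The idea of processing both modalities in one sequence can be illustrated with a toy example. The sketch below is not the LTX 2.3 architecture. It only assumes the general pattern of a joint sequence, in which video and audio tokens are concatenated so that one self-attention step lets every audio token attend to every video token. The token values, the two-dimensional embeddings, and the function names are invented for illustration, and the learned projections of a real transformer are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (toy simplification).

    Returns the mixed token vectors and the attention weight matrix.
    """
    dim = len(tokens[0])
    mixed, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all token vectors, video and audio alike.
        mixed.append([sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)])
    return mixed, weights

# Made-up embeddings: two video tokens followed by two audio tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[0.9, 0.1], [0.1, 0.9]]

joint_sequence = video_tokens + audio_tokens
mixed, weights = self_attention(joint_sequence)

# Row 2 is the first audio token; columns 0 and 1 are the two video tokens.
first_audio_row = weights[2]
print(first_audio_row[0] > first_audio_row[1])  # True
```

In this toy setup the first audio token puts more weight on the video token it resembles. A model trained on paired data would learn such cross-modal weights rather than inherit them from hand-picked vectors.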

LTX 2.3 is the first open-source model to generate synchronized 4K video and audio in a single pass, removing the need for separate audio synthesis pipelines.

Performance on Consumer Hardware

Lightricks optimized LTX 2.3 to run on GPUs with 24 GB or more of VRAM. A 10-second clip at 1080p resolution generates in approximately 90 seconds on an RTX 4090. Output at 4K takes longer but remains practical for batch workflows. The model also supports lower-resolution preview generation, so you can iterate quickly before committing to a full-quality render.
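
The quoted 1080p figure is enough for a rough estimate of batch time. The sketch below assumes that render time scales linearly with clip length, which the article does not claim and which may not hold in practice. The constant and function names are illustrative.

```python
# Rough batch-time estimate from the quoted figure:
# about 90 s of compute for a 10 s clip at 1080p on an RTX 4090.
# Assumption: render time scales linearly with clip length.

QUOTED_RENDER_SECONDS = 90.0
QUOTED_CLIP_SECONDS = 10.0

def estimate_batch_minutes(num_clips: int, clip_seconds: float) -> float:
    """Estimate total 1080p render time in minutes for a batch of equal-length clips."""
    compute_per_video_second = QUOTED_RENDER_SECONDS / QUOTED_CLIP_SECONDS
    return num_clips * clip_seconds * compute_per_video_second / 60.0

print(estimate_batch_minutes(num_clips=20, clip_seconds=10))  # 30.0
```

Under that assumption, twenty 10-second clips at 1080p take about half an hour on the same card. The estimate does not extend to 4K, since the article gives no timing for it.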

Portrait mode works without any special configuration. You specify the aspect ratio in the prompt, and the model generates vertical video natively rather than cropping a landscape output. This matters for creators producing content for TikTok, Instagram Reels, and YouTube Shorts.
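
The difference between native vertical output and a cropped landscape frame is easy to quantify. The sketch below is plain arithmetic and does not call the model. The helper names are hypothetical, and the 1080p sizes are chosen only as an example.

```python
def native_dimensions(aspect: str, long_side: int) -> tuple[int, int]:
    """Return (width, height) for an aspect ratio such as '9:16',
    with the longer side equal to long_side."""
    w_ratio, h_ratio = (int(part) for part in aspect.split(":"))
    if w_ratio >= h_ratio:
        return long_side, long_side * h_ratio // w_ratio
    return long_side * w_ratio // h_ratio, long_side

def cropped_portrait(width: int, height: int, aspect: str = "9:16") -> tuple[int, int]:
    """Return the (width, height) left after cropping a landscape frame
    to a portrait aspect ratio while keeping its full height."""
    w_ratio, h_ratio = (int(part) for part in aspect.split(":"))
    return height * w_ratio // h_ratio, height

native = native_dimensions("9:16", 1920)
cropped = cropped_portrait(1920, 1080)

print(native)   # (1080, 1920)
print(cropped)  # (607, 1080)
print(native[0] * native[1], cropped[0] * cropped[1])  # 2073600 655560
```

A 9:16 crop of a 1920x1080 frame keeps roughly a third of the pixels that a natively generated 1080x1920 frame contains, which is why native portrait generation matters for vertical platforms.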

What LTX 2.3 Means for Video Creators

Open-source 4K video generation with audio changes the economics of short-form content production. Stock footage libraries, sound effect subscriptions, and motion graphics packages all become less necessary when a single model can generate both visual and audio content to specification. Fine-tuning on custom datasets means studios can train the model on their brand aesthetic and produce consistent output.

The model is available now on Hugging Face and through Lightricks’ ComfyUI integration. Community-built fine-tunes are already appearing for specific styles including product visualization, architectural fly-throughs, and nature documentaries.