Abstract:

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods rely either solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
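
To make the two-stage idea in the abstract concrete, here is a rough toy sketch of the data flow only: a low-resolution video is produced first, then upsampled. Everything in it is an assumption for illustration. The names `pixel_stage`, `latent_stage`, and `generate` are hypothetical, the frame sizes and 4x scale are made up, and the "models" are trivial placeholders (a constant fill and nearest-neighbour upsampling), not the diffusion models or the expert translation method from the paper.

```python
from typing import List

# A video is a list of frames; each frame is a 2D grid of float pixel values.
Frame = List[List[float]]
Video = List[Frame]


def pixel_stage(prompt: str, frames: int = 4, height: int = 2, width: int = 3) -> Video:
    """Hypothetical stand-in for the pixel-based VDM.

    Returns a low-resolution video whose content depends on the prompt.
    A real model would run text-conditioned diffusion in pixel space.
    """
    value = (sum(ord(c) for c in prompt) % 256) / 255.0
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


def latent_stage(video: Video, prompt: str, scale: int = 4) -> Video:
    """Hypothetical stand-in for the latent-based upsampling stage.

    Uses nearest-neighbour upsampling purely as a placeholder. The prompt
    is accepted to mirror text conditioning but is unused in this toy.
    """
    upsampled: Video = []
    for frame in video:
        new_frame: Frame = []
        for row in frame:
            # Repeat each pixel `scale` times horizontally...
            wide_row = [v for v in row for _ in range(scale)]
            # ...and repeat each widened row `scale` times vertically.
            new_frame.extend(list(wide_row) for _ in range(scale))
        upsampled.append(new_frame)
    return upsampled


def generate(prompt: str) -> Video:
    """Chain the two stages: low-resolution first, then upsample."""
    low_res = pixel_stage(prompt)
    return latent_stage(low_res, prompt)


if __name__ == "__main__":
    video = generate("a panda eating bamboo")
    print(len(video), len(video[0]), len(video[0][0]))
```

The point of the split, as the abstract describes it, is that the expensive pixel-space model only ever works at low resolution, while the cheaper latent-space model handles the jump to high resolution.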

  • kakes@sh.itjust.works
    1 year ago

    Show-1 is an AI platform that is able to write, produce, direct, animate, and even voice entirely new episodes of TV shows. Show-1 uses different diffusion models, such as Stable Diffusion and Tortoise TTS, to generate high-quality images and speech features based on reference clips of existing shows. It also uses a multi-agent simulation to provide contextualization, story progression, and behavioral control for the characters and events in the episodes.

    Maybe I'm blind, but where does it say literally any of what you've written here? As far as I can tell, this is just a text-to-video model, not any of this other stuff.