AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance


1The Chinese University of Hong Kong  2Tsinghua University  3The University of Hong Kong

*Equal contribution. Corresponding author.



Overview of Comparison and Generated Videos

Abstract

Character video generation is a significant real-world application focused on producing high-quality videos featuring specific characters. Recent advances have introduced various control signals to animate static characters, successfully enhancing control over the generation process. However, these methods often lack flexibility, limiting their applicability and making it challenging for users to synthesize a source character into a desired target scene. To address this issue, we propose a novel framework, AnyCharV, that flexibly generates character videos from arbitrary source characters and target scenes, guided by pose information. Our approach involves a two-stage training process. In the first stage, we develop a base model capable of integrating the source character into the target scene under pose guidance. The second stage further bootstraps controllable generation through a self-boosting mechanism, in which the video generated in the first stage is reused for training with the fine mask replaced by a coarse one, yielding better preservation of character details. Experimental results demonstrate the effectiveness and robustness of our proposed method.

Method

Overview of our proposed AnyCharV. We design a two-stage training strategy with fine-to-coarse guidance for controllable character video generation. In the first stage, we train a base model in a self-supervised manner to integrate a reference character into the target scene, guided by a fine segmentation mask and a 2D pose sequence. In the second stage, we propose a self-boosting training strategy that lets the reference and target characters interact under coarse bounding-box mask guidance. This mechanism better preserves the identity of the reference character and removes the adverse influence of the fine mask shape. The CLIP encoder and VAE are always frozen. We train the denoising UNet, ReferenceNet, and pose guider in the first stage, and finetune only the denoising UNet in the second stage.
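As a concrete illustration of this schedule, the sketch below shows how the two stages might be wired up in PyTorch: the VAE and CLIP encoder stay frozen throughout, stage 1 updates the denoising UNet, ReferenceNet, and pose guider with fine segmentation masks, and stage 2 finetunes only the UNet with coarse bounding-box masks on data built from stage-1 generations. All module interfaces and batch keys here are illustrative assumptions, not the released AnyCharV code.

# Minimal PyTorch-style sketch of the fine-to-coarse, two-stage schedule described
# above. Module interfaces and batch keys (ref, video, pose, seg_mask, box_mask)
# are illustrative assumptions, not the released AnyCharV implementation.
import torch
from diffusers import DDPMScheduler


def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def train_stage(unet, reference_net, pose_guider, vae, clip_encoder,
                dataloader, *, coarse_mask: bool, num_steps: int, lr: float = 1e-5):
    # coarse_mask=False -> stage 1 (fine segmentation masks);
    # coarse_mask=True  -> stage 2 (coarse box masks on stage-1 generations).
    scheduler = DDPMScheduler(num_train_timesteps=1000)

    # The CLIP encoder and VAE stay frozen in both stages.
    set_trainable(vae, False)
    set_trainable(clip_encoder, False)
    # Stage 1 trains the denoising UNet, ReferenceNet, and pose guider;
    # stage 2 finetunes only the denoising UNet.
    set_trainable(unet, True)
    set_trainable(reference_net, not coarse_mask)
    set_trainable(pose_guider, not coarse_mask)

    params = [p for m in (unet, reference_net, pose_guider)
              for p in m.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)

    for _, batch in zip(range(num_steps), dataloader):
        ref_img, video, pose_seq = batch["ref"], batch["video"], batch["pose"]
        mask = batch["box_mask"] if coarse_mask else batch["seg_mask"]

        with torch.no_grad():
            latents = vae.encode(video)            # target-video latents
            clip_feat = clip_encoder(ref_img)      # appearance embedding

        ref_feat = reference_net(vae.encode(ref_img), clip_feat)
        pose_feat = pose_guider(pose_seq)

        # Standard denoising objective: predict the added noise.
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy_latents = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy_latents, t, ref_feat, pose_feat, mask)
        loss = torch.nn.functional.mse_loss(pred, noise)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In this sketch, the second call to train_stage would use a dataloader built from stage-1 generations paired with coarse bounding-box masks of the target character, while the first call uses the original self-supervised pairs with fine segmentation masks.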

Comparison with State-of-the-arts

Qualitative results of our method compared with previous state-of-the-art methods.

Application: Combining with T2I and T2V Models

Reference Character

Target Video

Generated Video

Qualitative results of combining AnyCharV with FLUX [4] and HunyuanVideo [5].
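As an illustration of this workflow, the sketch below generates a reference character with FLUX [4] and a target scene video with HunyuanVideo [5] via their diffusers pipelines, then hands both to AnyCharV for pose-guided composition. The anycharv_generate call is a hypothetical placeholder for the project's inference entry point, and the prompts and model IDs are examples only.

# Hypothetical end-to-end composition: T2I character + T2V target scene -> AnyCharV.
import torch
from diffusers import FluxPipeline, HunyuanVideoPipeline


def anycharv_generate(reference, target):
    # Placeholder for AnyCharV's pose-guided composition step (see Method above);
    # the actual inference interface may differ.
    raise NotImplementedError


# 1) Synthesize a reference character image with FLUX [4].
t2i = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
reference_character = t2i(
    prompt="a knight in ornate silver armor, full body, studio lighting",
    num_inference_steps=28).images[0]

# 2) Synthesize a target scene video with HunyuanVideo [5].
t2v = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16).to("cuda")
target_video = t2v(
    prompt="a person dancing on a beach at sunset",
    num_frames=61, num_inference_steps=30).frames[0]

# 3) Compose: the pose sequence extracted from the target video drives the
#    reference character inside the target scene.
generated_video = anycharv_generate(reference_character, target_video)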

More Generated Videos

Reference Character

Target Video

Generated Video

Visualization results of the generated videos using our proposed AnyCharV model in multiple visual scenarios.

References

[1] Zhao, Yuyang, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. "Make-A-Protagonist: Generic video editing with an ensemble of experts." arXiv, 2023.

[2] Viggle. Viggle AI mix application. https://viggleai.io, 2024.

[3] Men, Yifang, Yuan Yao, Miaomiao Cui, and Liefeng Bo. "MIMO: Controllable character video synthesis with spatial decomposed modeling." arXiv, 2024.

[4] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2023.

[5] Kong, Weijie, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, et al. "HunyuanVideo: A systematic framework for large video generative models." arXiv, 2024.

BibTeX


@article{wang2025anycharv,
  title={AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance},
  author={Wang, Zhao and Wen, Hao and Zhu, Lingting and Shang, Chengming and Yang, Yujiu and Dou, Qi},
  journal={arXiv preprint arXiv:2502.08189},
  year={2025}
}