MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Shubo Lin¹, Xuanyang Zhang²^*, Wei Cheng², Weiming Hu¹, Gang Yu²^†, Jin Gao²^†

¹CASIA ²StepFun ^*Project Lead ^†Corresponding authors

Under Review

Abstract

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher’s physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
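The abstract does not spell out how the two zero-initialized control links are built. The following is a minimal sketch, assuming each link is a linear projection whose weights and bias start at zero (in the spirit of ControlNet-style zero convolutions) and that each branch adds the projected features of the other branch to its own. The names `ZeroInitLink` and `exchange` and the toy feature shapes are hypothetical and not taken from the paper.

```python
import numpy as np


class ZeroInitLink:
    """Linear projection whose parameters start at zero (hypothetical name)."""

    def __init__(self, dim):
        self.weight = np.zeros((dim, dim))
        self.bias = np.zeros(dim)

    def __call__(self, features):
        return features @ self.weight + self.bias


def exchange(rgb_feat, perc_feat, perc_to_rgb, rgb_to_perc):
    """Assumed bidirectional control: each branch adds a projected copy
    of the other branch's features to its own features."""
    new_rgb = rgb_feat + perc_to_rgb(perc_feat)
    new_perc = perc_feat + rgb_to_perc(rgb_feat)
    return new_rgb, new_perc


rng = np.random.default_rng(0)
rgb_feat = rng.normal(size=(4, 8))   # toy RGB-branch features: 4 tokens, 8 channels
perc_feat = rng.normal(size=(4, 8))  # toy perception-branch features

perc_to_rgb = ZeroInitLink(8)
rgb_to_perc = ZeroInitLink(8)

# At initialization both links output zeros, so each branch is left untouched.
out_rgb, out_perc = exchange(rgb_feat, perc_feat, perc_to_rgb, rgb_to_perc)
unchanged_at_init = bool(
    np.array_equal(out_rgb, rgb_feat) and np.array_equal(out_perc, perc_feat)
)

# Emulate a trained link by giving it non-zero weights: the perception
# features now contribute to the RGB branch.
perc_to_rgb.weight += 0.1 * np.eye(8)
mixed_rgb, _ = exchange(rgb_feat, perc_feat, perc_to_rgb, rgb_to_perc)
```

Under this assumption the two branches behave as fully independent streams at the start of training, and cross-modal influence grows only as the link weights move away from zero, which is one way to read "gradually learn pixel-wise consistency".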

Method Overview

Qualitative Comparison Results

Baseline: CogVideoX-2B

Methods compared: CogVideoX-2B, VideoREPA, MMPhysVideo (Ours)

A car gliding over a road slick with rainwater.

Hand holds the phone.

A hose sprays water on plants in a garden.

Honey diffusing into warm milk.

A hose sprays water onto a burning pile of tires, extinguishing the flames and creating steam.

Baseline: CogVideoX-5B

Methods compared: CogVideoX-5B, VideoREPA, MMPhysVideo (Ours)

Mustard squirting out of a plastic bottle onto a hotdog.

A honey dipper drizzles honey onto Greek yogurt.

A tyre rolls through a deep puddle, splashing water.

A large log floats downstream in a rushing river.

Hands rub luscious lotion on dry skin.

Baseline: Wan2.1-1.3B

Methods compared: Wan2.1-1.3B, MMPhysVideo (Ours)

A wine bottle pours a red blend into a glass.

The sharp knife severs the fresh loaf of bread.

Leather glove catching a hard baseball.

A bottle pours olive oil over a pan of vegetables.

An apple falls into a vat of cider, sending up a spray.

Visualization of Joint Multimodal Modeling

Panel labels: RGB, Unified; RGB, XYZ

BibTeX

@article{lin2026mmphysvideoscalingphysicalplausibility,
  title={MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling},
  author={Shubo Lin and Xuanyang Zhang and Wei Cheng and Weiming Hu and Gang Yu and Jin Gao},
  eprint={2604.02817},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  year={2026},
  url={https://arxiv.org/abs/2604.02817}
}