TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image in Real Time

Xin Zhang1,2*     Yabo Chen2*     Yijie Fang2     Wanying Qu1    
Haibin Huang2     Chi Zhang2     Feng Xu1     Xuelong Li2
1Fudan University    2Institute of Artificial Intelligence, China Telecom (TeleAI)
*Equal contribution
TelePhysics facilitates holistic 3D scene generation and physically grounded video synthesis from a single input image.

Abstract

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. This formulation enables accurate multi-object interactions and mechanics-based control, such as applying forces and simulating rigid-body dynamics, while preserving the visual fidelity of the input. Unlike diffusion-based or autoregressive video priors that incur high inference latency, our approach allows instantaneous physics simulation and real-time rendering after initialization. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability.

Motivation

Prevalent issues in existing methods: video generation suffers from physically uncontrollable motion; single-image-to-3D reconstruction exhibits multi-object inter-penetration; physics-based scene-level 3D generation shows spatial misalignment and inconsistency with the input image; and physics-based 3D generation and control often result in a cartoonish appearance.

Method Overview

Overview of the TelePhysics framework. (a) Given a single input image and user controls, the pipeline applies Scene Perception to reconstruct 3D object meshes and synthesize a background environment. (b) These components are grounded in a unified global coordinate system through Scene Alignment to ensure geometric consistency, while a VLM-driven parameter estimation module concurrently deduces the physical properties of each entity. (c) Guided by these semantic priors, the Physics Simulation stage computes physically compliant trajectories and collision responses. (d) WonderTrace then bridges the visual domain gap by refining the coarse simulation renders into photorealistic video sequences.
TelePhysics Pipeline Overview
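To make stage (c) concrete, the sketch below shows one common way a force-driven rigid-body update of the kind described above can be computed: semi-implicit Euler integration of a point mass under gravity plus a user-applied force, with a crude ground-collision response. All names and parameters here are illustrative assumptions, not the actual TelePhysics solver.

```python
import numpy as np

def simulate_rigid_body(pos, vel, force, mass, dt, steps,
                        ground_y=0.0, restitution=0.5):
    """Semi-implicit Euler integration of a point-mass body under gravity
    plus an externally applied force, with a simple ground bounce.
    Illustrative sketch only -- not the paper's implementation."""
    g = np.array([0.0, -9.81, 0.0])          # gravity (y is up)
    pos, vel = pos.astype(float).copy(), vel.astype(float).copy()
    trajectory = [pos.copy()]
    for _ in range(steps):
        acc = g + force / mass               # total acceleration
        vel += acc * dt                      # update velocity first (semi-implicit)
        pos += vel * dt                      # then position from the new velocity
        if pos[1] < ground_y:                # crude collision response: clamp + bounce
            pos[1] = ground_y
            vel[1] = -restitution * vel[1]
        trajectory.append(pos.copy())
    return np.stack(trajectory)

# A ball starting at rest, pushed horizontally (as in the ball-and-bucket prompt).
traj = simulate_rigid_body(
    pos=np.array([0.0, 1.0, 0.0]),
    vel=np.zeros(3),
    force=np.array([2.0, 0.0, 0.0]),
    mass=1.0, dt=1 / 60, steps=120,
)
```

In the full framework, the trajectory produced by such a step would be rendered from the aligned 3D scene and then refined into a photorealistic video by the WonderTrace stage.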

Qualitative Comparisons

Comparison with state-of-the-art video generation models. The input includes an image, a motion prompt, and a text prompt.
Prompt "Initially, the dominoes are stationary. A row of white dominoes falls from left to right, each domino tilting forward and striking the next, as if a propagating force is transmitted along the chain of dominoes, causing each domino to rotate around its base until it falls completely flat. The viewing perspective is fixed."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the three balls in the scene were stationary. Then, they were simultaneously subjected to a horizontal force directed towards the bucket, causing them to roll across the table, strike the bucket, and then fall to the ground."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the two ceramic bowls were stationary. Then, the smaller bowl was smoothly lifted upwards, moved to the left, and then, due to gravity, dropped onto the larger bowl, eventually stacking inside it. The camera angle shifted slightly downwards."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the teddy bear was at rest with zero initial velocity. Subsequently, due to a force applied from the right, the teddy bear began to move and eventually fell off the edge of the chair. The viewing perspective switches from the front to the right side."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, all the whales were stationary; the blue whale was first subjected to a force pushing it towards the upper right, and the gray shark was subjected to an upward force. Eventually, the gray shark plush toy was knocked out of view. The camera perspective remained unchanged throughout the entire process."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, all three oranges were stationary on a flat surface. Their texture was somewhat like jelly. Subsequently, each of the three oranges was subjected to an upward force, causing them to move, and eventually fall. The camera perspective remained unchanged throughout the entire process."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B