TelePhysics is a unified, training-free framework designed to facilitate holistic 3D scene generation and physically grounded video synthesis from a single input image. The video showcases interactions among multiple objects across diverse scenes.

Abstract

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scene-level multi-object interactions and introduces richer, complex control types for advanced mechanics-based manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability.

TelePhysics teaser

Motivation

Primary issues of physics-grounded scene generation. Given a single reference image and a text prompt describing a physical event (e.g., a ball falling under gravity), existing paradigms exhibit five distinct limitations: (a) Video Generation is physically uncontrollable; (b) Single-Image to 3D suffers from multi-object inter-penetration; (c) Scene-Level 3D Generation shows spatial misalignment; (d) 3D Generation and Control yields cartoonish appearances; and (e) Scene-Level 3D Generation introduces inconsistency with the input image.

Framework Overview

TelePhysics Framework Pipeline

Overview of the TelePhysics framework. (a) Given a single input image and user controls, Scene Perception reconstructs 3D object meshes and synthesizes a background environment. (b) Scene Alignment grounds these components in a unified global coordinate system to ensure geometric consistency, while a VLM-driven parameter estimation module concurrently infers the physical properties of each entity. (c) Guided by these semantic priors, the Physics Simulation stage, built on Genesis with rigid-body, MPM (Material Point Method), and PBD (Position-Based Dynamics) solvers chosen per material type, computes physically plausible trajectories and collision responses. (d) Finally, WonderTrace bridges the visual domain gap by refining the coarse simulation renders into photorealistic video sequences, using Wan2.1 VACE (the fast default) or Wan2.2 VACE 14B (higher quality, offline). After one-time perception and alignment, the simulation–rendering loop runs at interactive rates (~15 FPS on a single H100).
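
As a rough illustration of "solvers chosen per material type", the sketch below routes each entity to a solver label based on an estimated material category. The category names, the `Entity` class, the `assign_solvers` helper, and the fallback to a rigid-body solver are all hypothetical; the snippet does not call Genesis and is not the project's actual dispatch logic.

```python
from dataclasses import dataclass

# Hypothetical material categories and solver labels, for illustration only.
SOLVER_BY_MATERIAL = {
    "rigid": "rigid_body",
    "elastic": "mpm",
    "granular": "mpm",
    "cloth": "pbd",
}

@dataclass
class Entity:
    name: str
    material: str  # assumed to come from the VLM-driven parameter estimation

def assign_solvers(entities, default="rigid_body"):
    """Map each entity name to a solver label; unknown materials use `default`."""
    return {e.name: SOLVER_BY_MATERIAL.get(e.material, default) for e in entities}

scene = [
    Entity("domino", "rigid"),
    Entity("jelly_orange", "elastic"),
    Entity("tablecloth", "cloth"),
]
plan = assign_solvers(scene)
```

Under these assumptions, a jelly-like orange is routed to MPM while a domino stays with the rigid-body solver.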

Qualitative Comparisons

Comparison with state-of-the-art video generation models. The input includes an image, a motion prompt, and a text prompt.

Prompt "Initially, the dominoes are stationary. A row of white dominoes falls from left to right, each domino tilting forward and striking the next, as if a propagating force is transmitted along the chain of dominoes, causing each domino to rotate around its base until it falls completely flat. The viewing perspective is fixed."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the three balls in the scene were stationary. Then, they were simultaneously subjected to a horizontal force directed towards the bucket, causing them to roll across the table, strike the bucket, and then fall to the ground."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the two ceramic bowls were stationary. Then, the smaller bowl was smoothly lifted upwards, moved to the left, and then, due to gravity, dropped onto the larger bowl, eventually stacking inside it. The camera angle shifted slightly downwards."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, the teddy bear was at rest with zero initial velocity. Subsequently, due to a force applied from the right, the teddy bear began to move and eventually fell off the edge of the chair. The viewing perspective switches from the front to the right side."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, all the whales were stationary; the blue whale was first subjected to a force pushing it towards the upper right, and the gray shark was subjected to an upward force. Eventually, the gray shark plush toy was knocked out of view. The camera perspective remained unchanged throughout the entire process."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B
Prompt "Initially, all three oranges were stationary on a flat surface. Their texture was somewhat like jelly. Subsequently, each of the three oranges was subjected to an upward force, causing them to move, and eventually fall. The camera perspective remained unchanged throughout the entire process."
Input Image
Ours
CogVideoX1.5
PhysCtrl
Sora2-pro
Veo3.1
Wan2.2-5B
Wan2.2-A14B

Scene Alignment

To resolve the spatial misalignment introduced by egocentric reconstruction, we map all reconstructed meshes into a unified world coordinate system anchored to a canonical ground plane. For multi-object scenes, the vertices of all meshes are aggregated into a global point cloud and a dominant ground plane is fit; a rigid transformation derived via Rodrigues’ formula then aligns its normal to z-up and anchors the lowest point of the scene to the origin, yielding a gravity-aligned world frame W where the ground plane coincides with z=0. To avoid the failure modes of standard RANSAC in cluttered scenes, we further introduce Anchor-Guided Manifold Fitting (AGMF): only points near each object’s local z-minimum are sampled as ground anchors, and a robust M-estimator (Huber/Tukey) is minimised over this anchor set—decoupling the ground manifold from vertical distractors such as torsos or walls.

Ground Plane Alignment Pipeline. Scene-level alignment procedure: starting from the aggregated point cloud (a), RANSAC fits the dominant plane (b) and its surface normal is analysed against the world vertical axis (c); a rotation aligns the normal to z-up (d), the lowest point is translated to the origin (e), and the final calibrated frame coincides with z=0 (f).
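
The alignment steps in this caption can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the RANSAC settings (`n_iters`, `tol`), and the heuristic that orients the normal toward the side where most of the scene lies are assumptions.

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, tol=0.01, seed=0):
    """Fit a dominant plane normal . x + offset = 0 with a basic RANSAC loop."""
    rng = np.random.default_rng(seed)
    best_count, best_model = -1, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        length = np.linalg.norm(normal)
        if length < 1e-12:  # skip degenerate (collinear) samples
            continue
        normal = normal / length
        offset = -normal @ p0
        count = np.count_nonzero(np.abs(points @ normal + offset) < tol)
        if count > best_count:
            best_count, best_model = count, (normal, offset)
    normal, offset = best_model
    # Assumption: the scene sits above the ground, so flip the normal if needed.
    if np.mean(points @ normal + offset) < 0:
        normal, offset = -normal, -offset
    return normal, offset

def rodrigues_rotation(a, b):
    """Rotation matrix taking direction a onto direction b (Rodrigues' formula)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), float(a @ b)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)
        # Antiparallel vectors: rotate by pi about any axis perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        u = np.cross(a, helper)
        u /= np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + K + K @ K * ((1.0 - c) / s**2)

def align_to_ground(points, **ransac_kwargs):
    """Rotate the plane normal to z-up, then move the lowest point to the origin."""
    normal, _ = fit_plane_ransac(points, **ransac_kwargs)
    R = rodrigues_rotation(normal, np.array([0.0, 0.0, 1.0]))
    rotated = points @ R.T
    lowest = rotated[np.argmin(rotated[:, 2])]
    return rotated - lowest, R
```

Calling `align_to_ground(points)` on an `(N, 3)` array returns the gravity-aligned points together with the rotation that was applied, so the same rotation can be reused for the individual meshes.
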
AGMF vs RANSAC plane estimation
Comparison of plane-estimation strategies. (a) Standard RANSAC is biased toward high-density vertical distractors (e.g., human torsos or walls), leading to erroneous orientation estimates. (b) Anchor-Guided Manifold Fitting (Ours) samples only local minima P_base from object extrema, decoupling the ground manifold from vertical structures and recovering a physically plausible surface normal n.
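
One way to picture the anchor-guided idea is the sketch below: it keeps only points near each object's lowest extent along an approximate up direction, then fits a plane by iteratively reweighted least squares with Huber weights. The approximate `up` vector, the `band` and `delta` thresholds, and all function names are assumptions made for illustration; the actual AGMF procedure may differ.

```python
import numpy as np

def ground_anchors(objects, up, band=0.02):
    """Per object, keep points within `band` of its lowest extent along `up`."""
    anchors = []
    for pts in objects:
        height = pts @ up
        anchors.append(pts[height <= height.min() + band])
    return np.vstack(anchors)

def huber_plane_fit(points, delta=0.01, n_iters=20):
    """IRLS plane fit (normal . x + offset = 0) using Huber weights."""
    weights = np.ones(len(points))
    for _ in range(n_iters):
        centroid = (weights[:, None] * points).sum(axis=0) / weights.sum()
        centered = points - centroid
        cov = (weights[:, None] * centered).T @ centered
        # The eigenvector with the smallest eigenvalue is the plane normal.
        _, eigvecs = np.linalg.eigh(cov)
        normal = eigvecs[:, 0]
        residual = np.abs(centered @ normal)
        # Huber weights: 1 inside the threshold, delta / |r| outside it.
        weights = np.where(residual <= delta, 1.0,
                           delta / np.maximum(residual, 1e-12))
    return normal, -normal @ centroid

def anchor_guided_ground(objects, up, band=0.02, delta=0.01):
    """Fit the ground plane from per-object base anchors, normal oriented along `up`."""
    anchors = ground_anchors(objects, up, band)
    normal, offset = huber_plane_fit(anchors, delta)
    if normal @ up < 0:
        normal, offset = -normal, -offset
    return normal, offset
```

Because tall vertical structures contribute only their base points, they cannot dominate the fit the way dense torso or wall points can in a plain RANSAC vote.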

Perspective Alignment

Differentiable rendering pipeline for explicit camera pose estimation. The framework optimises the camera spatial parameters by minimising the discrepancy between the rendered output and the target image: a renderer synthesises an image and a binary silhouette mask under the current pose, and these are compared with the target through a region-aware appearance loss (foreground + background L1) and a Dice-based mask loss. Because the rendering loss landscape is non-convex, a coarse-to-fine strategy is adopted—a global stochastic search over a bounded space B first identifies a reliable initialisation, then the derivative-free Powell method under box constraints performs local refinement, converging to the final pose v.
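
The two loss terms named above can be written compactly in NumPy. The sketch below assumes images are `(H, W, C)` arrays in [0, 1] and masks are `(H, W)` arrays; the weights `w_fg`, `w_bg`, and `w_mask` and the function names are hypothetical, since the page does not specify how the terms are balanced.

```python
import numpy as np

def dice_loss(pred_mask, target_mask, eps=1e-6):
    """1 - Dice coefficient between two (soft or binary) masks."""
    inter = (pred_mask * target_mask).sum()
    return 1.0 - (2.0 * inter + eps) / (pred_mask.sum() + target_mask.sum() + eps)

def region_aware_l1(rendered, target, target_mask, w_fg=1.0, w_bg=1.0):
    """Mean L1 error computed separately over foreground and background pixels."""
    per_pixel = np.abs(rendered - target).mean(axis=-1)
    fg = target_mask > 0.5
    fg_term = per_pixel[fg].mean() if fg.any() else 0.0
    bg_term = per_pixel[~fg].mean() if (~fg).any() else 0.0
    return w_fg * fg_term + w_bg * bg_term

def alignment_loss(rendered, rendered_mask, target, target_mask, w_mask=1.0):
    """Appearance term plus Dice-based silhouette term."""
    appearance = region_aware_l1(rendered, target, target_mask)
    return appearance + w_mask * dice_loss(rendered_mask, target_mask)
```

Averaging the foreground and background separately keeps a small object from being swamped by a large, easy-to-match background.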

Qualitative visualization of the perspective alignment optimisation process. From top to bottom, the rendering loss gradually decreases as the camera parameters are refined. Each row shows the original image, the rendered perspective under the current camera parameters, and the corresponding binary mask. Suboptimal alignment at this stage would otherwise lead to significant geometric distortion and rendering artifacts in downstream synthesis; our coarse-to-fine optimisation enables robust convergence and progressively improved alignment.
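
A toy version of the coarse-to-fine search can be sketched as follows, using SciPy's bounded Powell method for the local stage. Here a soft disk silhouette with parameters (cx, cy, r) stands in for the camera pose and the renderer, and the loss is a soft Dice term only; these stand-ins, the sample count, and the tolerances are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def render_disk(params, size=48, softness=0.03):
    """Toy stand-in for a renderer: soft silhouette of a disk (cx, cy, r)."""
    cx, cy, r = params
    axis = np.linspace(-1.0, 1.0, size)
    xx, yy = np.meshgrid(axis, axis)
    dist = np.hypot(xx - cx, yy - cy)
    return 1.0 / (1.0 + np.exp((dist - r) / softness))

def soft_dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def coarse_to_fine(loss_fn, bounds, n_samples=256, seed=0):
    """Global random search inside the box, then bounded Powell refinement."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    samples = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    losses = np.array([loss_fn(s) for s in samples])
    x0, coarse_loss = samples[np.argmin(losses)], float(losses.min())
    result = minimize(loss_fn, x0, method="Powell", bounds=bounds,
                      options={"xtol": 1e-4, "ftol": 1e-6})
    # Keep the coarse candidate if the local stage did not improve on it.
    if result.fun <= coarse_loss:
        return result.x, float(result.fun), coarse_loss
    return x0, coarse_loss, coarse_loss

target = render_disk((0.2, -0.1, 0.3))
loss = lambda p: soft_dice_loss(render_disk(p), target)
bounds = [(-1.0, 1.0), (-1.0, 1.0), (0.05, 0.6)]
pose, final_loss, coarse_loss = coarse_to_fine(loss, bounds)
```

The random stage only needs to land in the basin of the correct pose; the derivative-free refinement then works without gradients of the renderer.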

WonderTrace

Illustration of camera motion and viewpoint variation. WonderTrace bridges the gap between visually coarse simulator output and photorealistic video by treating the latent space of a state-of-the-art video model as a rendering prior: simulated states and trajectories are projected into conditioning signals (depth, optical flow, object masks), and a partial denoising strategy applies only the final diffusion steps—preserving the simulator’s spatiotemporal structure while enriching the frames with high-frequency visual detail. To support novel-view synthesis and aggressive camera trajectories without out-of-bounds artifacts, the panoramic background generated during Scene Perception is reused here: the rerendered, expanded background is composited with the dynamic foreground in a dual-stream design, ensuring continuous visual fidelity and temporal consistency under entirely novel camera trajectories.
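
The partial denoising idea can be illustrated with a generic, SDEdit-style loop: the simulator render is noised to an intermediate level and only the remaining final steps are run. The sketch assumes a DDPM-style cumulative-alpha schedule and a placeholder `denoise_step` callable; both are hypothetical, and the sampler and conditioning used by the actual video model may differ.

```python
import numpy as np

def partial_denoise(sim_latent, denoise_step, alphas_cumprod, strength=0.3, seed=0):
    """Noise a simulator latent to an intermediate level, then run only the
    last `strength` fraction of the denoising steps.

    denoise_step(x, t) is a hypothetical callable returning the latent after
    one reverse step at index t (where conditioning such as depth, flow, or
    masks would be applied).
    """
    rng = np.random.default_rng(seed)
    total_steps = len(alphas_cumprod)
    n_final = int(round(strength * total_steps))  # steps actually executed
    if n_final == 0:
        return sim_latent, []
    a = alphas_cumprod[n_final - 1]
    noise = rng.standard_normal(sim_latent.shape)
    x = np.sqrt(a) * sim_latent + np.sqrt(1.0 - a) * noise
    visited = []
    for t in range(n_final - 1, -1, -1):
        x = denoise_step(x, t)
        visited.append(t)
    return x, visited
```

A small `strength` keeps the output close to the simulated layout and motion, while a larger value gives the prior more freedom to repaint detail at the cost of drifting from the simulation.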

BibTeX

@misc{zhang2026telephysicsphysicsgroundedmultiobjectscene,
      title={TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction},
      author={Xin Zhang and Yabo Chen and Yijie Fang and Wanying Qu and Haibin Huang and Chi Zhang and Feng Xu and Xuelong Li},
      year={2026},
      eprint={2605.20290},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2605.20290},
}