LooseControlVideo Project Page

TL;DR: We present LooseControlVideo (LCV), a framework for controlling video generation and editing using sparse, oriented 3D boxes. Unlike dense depth maps, optical flow, or 3D point tracks, 3D boxes are easy to draw by hand, giving users a simple yet expressive interface to creatively direct object trajectories, rotations, occlusions, and camera motion.

[Videos: two examples, each showing the input condition and the LooseControlVideo result; the second also shows the original video.]

LooseControlVideo lets users direct trajectories, rotations, occlusion, camera motion, and localized edits using sparse oriented 3D boxes.

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo (LCV), a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a blocking proxy. This allows users to author high-level layout and trajectories while leveraging a video generative model to produce realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks show that LCV significantly outperforms existing 2D-box and flow-based baselines. Compared with current state-of-the-art layout-conditioned models, LCV achieves a 1.2-3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5-2x increase in Occlusion Accuracy, indicating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
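To make the idea of sparse, oriented 3D boxes as a blocking proxy more concrete, here is a minimal sketch of how a user-authored box trajectory could be represented and densified. Everything in it is an assumption for illustration: the `OrientedBox` record, the yaw-only orientation, the y-up axis convention, and the linear keyframe interpolation are hypothetical and are not taken from the LCV implementation or from the DNOCS encoding.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBox:
    """Hypothetical box record (not the LCV format)."""
    center: tuple  # (x, y, z) in scene units
    size: tuple    # (width, height, depth)
    yaw: float     # rotation about the vertical (y) axis, in radians


def lerp(a, b, t):
    return a + (b - a) * t


def lerp_angle(a, b, t):
    """Interpolate between two angles along the shortest arc."""
    delta = (b - a + math.pi) % (2 * math.pi) - math.pi
    return a + delta * t


def interpolate(k0, k1, t):
    """Blend two keyframe boxes for t in [0, 1]."""
    return OrientedBox(
        center=tuple(lerp(p, q, t) for p, q in zip(k0.center, k1.center)),
        size=tuple(lerp(p, q, t) for p, q in zip(k0.size, k1.size)),
        yaw=lerp_angle(k0.yaw, k1.yaw, t),
    )


def trajectory(keyframes, num_frames):
    """Expand sparse keyframes {frame_index: box} into one box per frame."""
    times = sorted(keyframes)
    boxes = []
    for f in range(num_frames):
        if f <= times[0]:
            boxes.append(keyframes[times[0]])   # hold the first keyframe
        elif f >= times[-1]:
            boxes.append(keyframes[times[-1]])  # hold the last keyframe
        else:
            # Index of the last keyframe at or before frame f.
            i = max(j for j, t in enumerate(times) if t <= f)
            t0, t1 = times[i], times[i + 1]
            boxes.append(interpolate(keyframes[t0], keyframes[t1],
                                     (f - t0) / (t1 - t0)))
    return boxes


def corners(box):
    """Return the eight world-space corners of a box."""
    cx, cy, cz = box.center
    w, h, d = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    pts = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                lx, ly, lz = sx * w, sy * h, sz * d
                # Rotate the local offset about the y axis, then translate.
                pts.append((cx + c * lx + s * lz, cy + ly, cz - s * lx + c * lz))
    return pts


# Two hand-placed keyframes: the box slides along x while turning 90 degrees.
keys = {
    0: OrientedBox((0.0, 0.0, 5.0), (1.0, 1.0, 2.0), 0.0),
    4: OrientedBox((4.0, 0.0, 5.0), (1.0, 1.0, 2.0), math.pi / 2),
}
track = trajectory(keys, num_frames=5)
```

In this sketch, two keyframes are enough to describe both a translation and a rotation over five frames, which is the sense in which box input is sparse compared with per-pixel depth or flow. A real system would likely need a full 3D rotation and a camera model in addition to what is shown here.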

Control Video Objects in 3D Space

LCV gives precise control over object motion in 3D, including trajectories and rotations.

[Videos: five examples, each showing the input boxes and the LooseControlVideo result; three also show the original video.]

and interactions...

[Video: input boxes and the LooseControlVideo result.]

Add Objects and Make Them Interact with Video Objects

[Video: original video, input boxes, and the LooseControlVideo result.]

Handles Occlusion

[Videos: two examples, each showing the input boxes and the LooseControlVideo result.]

Rotations

[Videos: three examples, each showing the input boxes and the LooseControlVideo result; the first also shows the original video.]

Control Camera

[Videos: three examples, each showing the input boxes and the LooseControlVideo result; the last two also show the original video.]

Boxes as Parts

[Videos: two examples, each showing the input boxes and the LooseControlVideo result; the first also shows the original video.]

Comparisons with Baselines

Generation

[Videos: four examples, each showing the input boxes, the baseline result, and the LooseControlVideo result.]

Editing

[Videos: four examples, each showing the original video, the input boxes, the baseline result, and the LooseControlVideo result.]

User Study Preference Matrix

Each entry shows the percentage of pairwise comparisons in which the row method was preferred over the column method, across completed sessions.

Overall

	LCV (Ours)	Depth Only	Optical Flow	2D Bounding Boxes
LCV (Ours)	-	78.1%	87.5%	92.2%
Depth Only	21.9%	-	65.6%	75.0%
Optical Flow	12.5%	34.4%	-	50.0%
2D Bounding Boxes	7.8%	25.0%	50.0%	-

Editing

	LCV (Ours)	Depth Only	Optical Flow	2D Bounding Boxes
LCV (Ours)	-	84.4%	90.6%	84.4%
Depth Only	15.6%	-	90.6%	59.4%
Optical Flow	9.4%	9.4%	-	12.5%
2D Bounding Boxes	15.6%	40.6%	87.5%	-

Generation

	LCV (Ours)	Depth Only	Optical Flow	2D Bounding Boxes
LCV (Ours)	-	71.9%	84.4%	100.0%
Depth Only	28.1%	-	40.6%	90.6%
Optical Flow	15.6%	59.4%	-	87.5%
2D Bounding Boxes	0.0%	9.4%	12.5%	-
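The matrices above are pairwise: the entry for (row, column) and the entry for (column, row) sum to 100%, for example 78.1% and 21.9% for LCV versus Depth Only in the Overall table. The following is a minimal sketch of how such a matrix can be tallied from individual pairwise votes. The function name, the vote format, and the toy data are hypothetical and are not the study's actual records or analysis code.

```python
from collections import defaultdict


def preference_matrix(votes, methods):
    """Tally pairwise votes into row-over-column preference shares.

    votes:   iterable of (preferred, other) method-name pairs (assumed format).
    methods: list of method names to include.
    Returns a dict mapping (row, col) to the percentage of row-vs-col
    comparisons won by row, or None if the pair was never compared.
    """
    wins = defaultdict(int)
    for preferred, other in votes:
        wins[(preferred, other)] += 1

    matrix = {}
    for row in methods:
        for col in methods:
            if row == col:
                continue  # the diagonal is left empty, as in the tables
            total = wins[(row, col)] + wins[(col, row)]
            matrix[(row, col)] = 100.0 * wins[(row, col)] / total if total else None
    return matrix


# Toy votes with made-up method names: method_a wins 3 of 4 comparisons.
toy_votes = [("method_a", "method_b")] * 3 + [("method_b", "method_a")]
toy = preference_matrix(toy_votes, ["method_a", "method_b", "method_c"])
```

With these toy votes, `toy[("method_a", "method_b")]` is 75.0 and `toy[("method_b", "method_a")]` is 25.0, while pairs involving `method_c` are `None` because no comparison was recorded for them.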

Citation

@misc{bhat2026loosecontrolvideodirectorialvideocontrol,
      title={LooseControlVideo: Directorial Video Control using Spatial Blocking},
      author={Shariq Farooq Bhat and Niloy J. Mitra and Kalyan Sunkavalli},
      year={2026},
      eprint={2606.19495},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.19495},
}