LooseControlVideo: Directorial Video Control using Spatial Blocking
Adobe Research
ECCV 2026
TL;DR We present LooseControlVideo (LCV), a framework for controlling video generation and editing using sparse, oriented 3D boxes. Unlike dense depth maps, optical flow or 3D point tracks, 3D boxes are easy to draw by hand, giving users a simple yet expressive interface to creatively direct object trajectories, rotations, occlusions, and camera motion.
Abstract
Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo (LCV), a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a blocking proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LCV significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2-3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5-2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
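To make the idea of sparse, oriented 3D boxes more concrete, here is a minimal sketch of how such a blocking proxy might be represented. It is an illustration under assumptions, not the paper's implementation: it assumes a box is a center, a size, and a yaw angle about the vertical axis, that boxes are given only at keyframes, and that intermediate frames are filled in by linear interpolation. The names `Box3D`, `interpolate`, `corners`, and `project`, as well as the camera intrinsics, are hypothetical, and the DNOCS encoding itself is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    # Hypothetical parameterisation (an assumption, not the paper's format).
    center: tuple  # (x, y, z) in camera coordinates, z pointing forward
    size: tuple    # (width, height, length)
    yaw: float     # rotation about the vertical (y) axis, in radians


def lerp(a, b, t):
    return a + (b - a) * t


def interpolate(k0, k1, t):
    """Linearly blend two keyframe boxes, turning along the shortest yaw arc."""
    dyaw = (k1.yaw - k0.yaw + math.pi) % (2 * math.pi) - math.pi
    return Box3D(
        center=tuple(lerp(a, b, t) for a, b in zip(k0.center, k1.center)),
        size=tuple(lerp(a, b, t) for a, b in zip(k0.size, k1.size)),
        yaw=k0.yaw + dyaw * t,
    )


def corners(box):
    """Return the 8 corners of the oriented box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    pts = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the local offset about the y axis, then translate.
                pts.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return pts


def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection; the intrinsics are made-up illustrative values."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


# Two keyframes: the box moves right and away from the camera while turning 90 degrees.
start = Box3D(center=(-1.0, 0.0, 4.0), size=(1.0, 1.0, 2.0), yaw=0.0)
end = Box3D(center=(1.0, 0.0, 6.0), size=(1.0, 1.0, 2.0), yaw=math.pi / 2)
mid = interpolate(start, end, 0.5)
outline = [project(p) for p in corners(mid)]  # 2D outline of the box at the midpoint
```

Under these assumptions, a trajectory and a rotation are fully described by a handful of keyframe boxes, which is what makes this kind of input quick to author compared with dense per-frame depth or flow.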
Control Video Objects in 3D Space
LCV gives precise control over object motion in 3D, including trajectories and rotations.
[5 video examples. Panels: Original video (3 of 5), Input boxes, Loose Control Video]
and interactions...
[1 video example. Panels: Input boxes, Loose Control Video]
Add Objects and Make Them Interact with Video Objects
[1 video example. Panels: Original video, Input boxes, Loose Control Video]
Handles Occlusion
[2 video examples. Panels: Input boxes, Loose Control Video]
Rotations
[3 video examples. Panels: Original video (1 of 3), Input boxes, Loose Control Video]
Control Camera
[3 video examples. Panels: Original video (2 of 3), Input boxes, Loose Control Video]
Boxes as Parts
[2 video examples. Panels: Original video (1 of 2), Input boxes, Loose Control Video]
Comparisons with Baselines
[Comparison videos in two settings: Generation and Editing]
User Study Preference Matrix
Each entry shows the share of preferences for the row method over the column method across completed sessions.
Overall
| | LCV (Ours) | Depth Only | Optical Flow | 2D Bounding Boxes |
|---|---|---|---|---|
| LCV (Ours) | - | 78.1% | 87.5% | 92.2% |
| Depth Only | 21.9% | - | 65.6% | 75.0% |
| Optical Flow | 12.5% | 34.4% | - | 50.0% |
| 2D Bounding Boxes | 7.8% | 25.0% | 50.0% | - |
Editing
| | LCV (Ours) | Depth Only | Optical Flow | 2D Bounding Boxes |
|---|---|---|---|---|
| LCV (Ours) | - | 84.4% | 90.6% | 84.4% |
| Depth Only | 15.6% | - | 90.6% | 59.4% |
| Optical Flow | 9.4% | 9.4% | - | 12.5% |
| 2D Bounding Boxes | 15.6% | 40.6% | 87.5% | - |
Generation
| | LCV (Ours) | Depth Only | Optical Flow | 2D Bounding Boxes |
|---|---|---|---|---|
| LCV (Ours) | - | 71.9% | 84.4% | 100.0% |
| Depth Only | 28.1% | - | 40.6% | 90.6% |
| Optical Flow | 15.6% | 59.4% | - | 87.5% |
| 2D Bounding Boxes | 0.0% | 9.4% | 12.5% | - |
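For readers who want to build the same kind of matrix from their own pairwise comparisons, the following is a minimal sketch. It is an illustration under assumptions: the vote format (a list of `(winner, loser)` pairs) and the function name `preference_matrix` are hypothetical, and the sample votes are toy data, not responses from this study.

```python
from collections import Counter


def preference_matrix(votes, methods):
    """Return {row: {col: percent of row-vs-col comparisons won by row}}."""
    wins = Counter(votes)  # (winner, loser) -> number of votes
    matrix = {}
    for row in methods:
        matrix[row] = {}
        for col in methods:
            if row == col:
                matrix[row][col] = None  # a method is never compared with itself
                continue
            total = wins[(row, col)] + wins[(col, row)]
            # Leave the entry undefined if the pair was never compared.
            matrix[row][col] = 100.0 * wins[(row, col)] / total if total else None
    return matrix


# Toy votes, made up for illustration only.
votes = [("A", "B")] * 3 + [("B", "A")] * 1 + [("A", "C")] * 2
m = preference_matrix(votes, ["A", "B", "C"])
```

With this definition, mirrored entries always sum to 100%, which matches the tables above (for example, 78.1% and 21.9% for LCV versus Depth Only in the Overall table).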
Citation
@misc{bhat2026loosecontrolvideodirectorialvideocontrol,
title={LooseControlVideo: Directorial Video Control using Spatial Blocking},
author={Shariq Farooq Bhat and Niloy J. Mitra and Kalyan Sunkavalli},
year={2026},
eprint={2606.19495},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.19495},
}