ActCam

Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Omar El Khalifi1, Thomas Rossi1, Oscar Fossey1, Thibault Fouque1, Ulysse Mizrahi1, Philip Torr3, Ivan Laptev2, Fabio Pizzati2, Baptiste Bellot-Gurlet1
1Kinetix  ·  2MBZUAI  ·  3University of Oxford
ActCam inputs: acting video, first frame, and camera shots

Abstract

For artistic applications, video generation requires fine-grained control over both performance and cinematography—i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly (i) transfers character motion from a driving video into a new scene and (ii) enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on a pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. Compared to pose-only control and other pose+camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations—especially under large viewpoint changes.


Key Contributions

1 · Zero-Shot Joint Control

Joint acting-motion and camera-trajectory control in image-conditioned video generation—enabling controllable cinematography without additional training, outperforming finetuned approaches.

2 · Camera-Aligned Conditions

A geometry-grounded conditioning pipeline aligning motion and scene geometry to the target camera, with reference-character removal and depth alignment preventing static/dynamic interference.

3 · Two-Phase Schedule

An inference pipeline adapting conditioning to the denoising step—depth enforces structure early, then pose-only guidance refines details—preserving dynamic scene elements.


Method

ActCam is a pure inference-time method. No finetuning required—just carefully constructed conditioning signals fed to a pretrained backbone.

ActCam pipeline diagram

Background Scene Extraction

The reference character is inpainted out of the reference image, and a monocular depth estimator (MoGe) creates a background-only 3D mesh. This prevents static character geometry from conflicting with dynamic motion conditioning.
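To make the geometry concrete, here is a minimal sketch of the back-projection step, assuming a simple pinhole camera. The function name and the fx/fy/cx/cy intrinsics are illustrative; the inpainting and MoGe calls are omitted, and the actual pipeline builds a mesh rather than a raw point set.

import numpy as np

def unproject_background(depth, mask, fx, fy, cx, cy):
    # Back-project valid background pixels of a character-free depth map
    # into a camera-frame 3D point set (standard pinhole geometry).
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth[mask]
    x = (u[mask] - cx) * z / fx
    y = (v[mask] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) background points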

4D Motion Recovery

A monocular 3D human motion estimator (GVHMR) recovers an articulated motion sequence from the acting video, providing a stable 3D signal that avoids 2D keypoint ambiguities under viewpoint changes.
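As a rough sketch of what this stage hands downstream, the container below holds a per-frame articulated motion sequence. The field names are assumptions for illustration, not GVHMR's actual output format.

from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSequence:
    # Illustrative container for a recovered motion sequence; working in
    # 3D world coordinates sidesteps the depth/scale ambiguity of 2D keypoints.
    joints_world: np.ndarray  # (T, J, 3) joint positions in the world frame
    root_transl: np.ndarray   # (T, 3) global root translation per frame
    root_orient: np.ndarray   # (T, 3) global root orientation (axis-angle)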

Scene Transfer & 3D Fitting

The recovered motion is aligned to the background 3D mesh via a weighted affine depth transformation, ensuring geometrically consistent placement respecting character-environment contacts.
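A weighted affine fit of depth values has a standard closed form; the sketch below is a minimal stand-in, and the paper's exact weighting scheme is not reproduced here.

import numpy as np

def fit_affine_depth(d_char, d_scene, w):
    # Weighted least-squares fit of d -> a*d + b, aligning character depths
    # to scene depths; w can up-weight character-environment contact points.
    w = w / w.sum()
    mx = np.sum(w * d_char)
    my = np.sum(w * d_scene)
    a = np.sum(w * (d_char - mx) * (d_scene - my)) / np.sum(w * (d_char - mx) ** 2)
    b = my - a * mx
    return a, b

# a, b = fit_affine_depth(char_z, scene_z, contact_weights)
# aligned_z = a * char_z + b   # character now sits in the scene's depth range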

Camera-Aligned Rasterization

Both pose and depth+pose control signals are rasterized under the target camera trajectory with depth-aware back-to-front ordering, handling self-occlusions correctly.
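The ordering logic is essentially the painter's algorithm. A toy point-splatting version is sketched below (function name and per-point inputs are illustrative; the real rasterizer operates on the full control maps):

import numpy as np

def splat_back_to_front(uv, z, rgb, h, w):
    # Draw points far-to-near so nearer points overwrite farther ones,
    # resolving self-occlusions without an explicit z-buffer.
    img = np.zeros((h, w, 3), dtype=np.float32)
    u = np.clip(uv[:, 0].round().astype(int), 0, w - 1)
    v = np.clip(uv[:, 1].round().astype(int), 0, h - 1)
    for i in np.argsort(-z):  # farthest first
        img[v[i], u[i]] = rgb[i]
    return img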

Two-Phase Conditioning

Early denoising steps use depth+pose to lock scene structure and camera motion. Later steps switch to pose-only, allowing high-frequency detail refinement without propagating depth artifacts. A minimal sketch of this schedule follows the two phase summaries below.

Phase 1 · t ≤ t_stop

Depth + Pose

Full geometric conditioning establishes 3D scene structure, camera viewpoint, and motion layout during high-noise early steps.

Phase 2 · t > t_stop

Pose Only

Depth is dropped. Pose-only guidance refines textures, lighting, and high-frequency details without over-constraining generation.
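The sketch below shows the schedule as a single sampling loop, assuming a hypothetical denoise_step wrapper around one backbone denoising step and treating t_stop as a step index; only the conditioning dictionary changes between phases.

def sample_two_phase(denoise_step, latents, pose, depth, timesteps, t_stop):
    # One sampling run with the two-phase conditioning schedule.
    for step, t in enumerate(timesteps):
        if step <= t_stop:
            cond = {"pose": pose, "depth": depth}  # Phase 1: lock structure
        else:
            cond = {"pose": pose}                  # Phase 2: refine details
        latents = denoise_step(latents, t, cond)
    return latents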


Results

ActCam is evaluated on both static- and moving-camera benchmarks using VBench metrics, motion fidelity (MPJPE), and geometric consistency (Sampson error), alongside a human preference study.
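For reference, the two non-VBench metrics can be computed as below; these are the standard definitions, and the paper's exact evaluation protocol may differ.

import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error over (T, J, 3) joint trajectories.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def sampson_error(F, x1, x2):
    # First-order geometric error of correspondences (x1, x2) under a
    # fundamental matrix F; points are (N, 3) homogeneous coordinates.
    Fx1 = x1 @ F.T    # epipolar lines of x1 in image 2
    Ftx2 = x2 @ F     # epipolar lines of x2 in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den  # per-correspondence Sampson error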

Joint Camera & Character Control

Model            Average ↑  SC ↑    BC ↑    AF ↑    IQ ↑    TC ↑    MS ↑    MPJPE ↓  SE ↓
Uni3C            0.8370     0.9084  0.9380  0.5688  0.6640  0.9607  0.9821  0.2121   0.5665
RealisDance DiT  0.8351     0.9209  0.9342  0.5417  0.6448  0.9803  0.9888  0.2123   0.4528
ActCam (Ours)    0.8497     0.9212  0.9350  0.5767  0.7212  0.9571  0.9872  0.2087   0.4546

Motion Control (Static Camera)

Model                Average ↑  SC ↑   BC ↑   AF ↑   IQ ↑   TC ↑   MS ↑
Moore-AnimateAnyone  83.78      94.65  94.90  51.56  66.34  97.16  98.07
HumanVid             84.68      93.69  94.94  55.58  67.45  97.87  98.52
MimicMotion          82.27      92.21  93.60  52.09  59.67  97.46  98.61
Animate-X            82.93      93.39  95.11  51.72  60.91  97.79  98.68
Hyper-Motion         84.04      93.58  94.97  52.97  65.52  98.19  99.01
UniAnimate-DiT       84.29      94.56  95.44  52.18  65.52  98.78  99.24
VACE                 85.33      93.56  95.03  57.81  70.61  96.74  98.25
Wan-Animate          84.38      93.06  94.52  54.47  66.87  98.42  98.96
SteadyDancer         85.15      93.48  95.18  56.80  68.45  97.99  99.02
ActCam (Ours)        86.47      95.28  95.83  58.66  70.83  98.88  99.34

Human Preference Study

A two-alternative forced-choice (2AFC) study with 17 users compared ActCam against Uni3C on identical inputs. ActCam is clearly preferred on every criterion.

Criterion       ActCam (Ours)  Uni3C  Tie
Camera          53.1%          27.8%  19.1%
Motion          66.9%          24.1%  9.0%
Visual Quality  53.2%          36.7%  10.1%

Conditioning Signal Visualization

Each video visualizes the full ActCam conditioning pipeline. The top row shows the four input signals; the bottom shows the generated output.

Top row, left to right: reference image, depth+pose map, pose-only map, acting video.

Generated Videos

Each pair shows the acting input (left) and ActCam's generated output (right), synchronized frame-by-frame. The camera motion preset is highlighted for each result.


Citation

If you find this work useful, please cite:

@inproceedings{elkhalifi2026actcam,
  title     = {ActCam: Zero-Shot Joint Camera and 3D Motion
               Control for Video Generation},
  author    = {Omar El Khalifi and Thomas Rossi and
               Oscar Fossey and Thibault Fouque and
               Ulysse Mizrahi and Philip Torr and
               Ivan Laptev and Fabio Pizzati and
               Baptiste Bellot-Gurlet},
  year      = {2026}
}