Learn: MediaPipe Hand Panels + WebGPU/TSL

Reverse-engineering Berto's gesture-driven panel experiment
3 March 2026 · R1

I. TL;DR + Verdict

Donna | Partial | Architecture reproduced
Eric | Cleared | Needs live camera tuning
Main blocker | Webcam | No real-time stream in this env
Combined | BOTH | Masterable in one session
Combined Verdict: BOTH MASTER. The core stack is clear and reproducible: WebGPU render loop + TSL shader graph + MediaPipe hand landmarks + panel control mapping. Donna can scaffold and debug the pipeline, while Eric does the final live camera calibration and motion-feel tuning.

II. Artifact Decoded

The source post shows an interactive visual demo where hand tracking (MediaPipe) controls a WebGPU + TSL panel scene built in a React Three Fiber starter [1][2]. The quoted parent post frames it as "Stage dives into WebGPU Render Targets + TSL," then the follow-up adds hand detection on top [1].

Layer | Role | Why it matters
WebGPU | Modern GPU backend | Supports high-frequency render-target experimentation with less CPU overhead [6]
TSL / Node Material | Shader-graph authoring | Lets iteration happen in JS/TS nodes instead of raw WGSL/GLSL for faster prototyping [7]
MediaPipe hand landmarks | Gesture signal | Converts camera input into normalized hand keypoints for controls [5]
R3F | React orchestration | Keeps render loop, components, and controls composable [8]

III. Donna Reproduction Attempt

Attempted

Replicate the same architecture path: demo inspection -> runtime signal check -> implementation scaffold with hand landmarks driving panel transforms.

Observed Runtime Signals

mediapipe-panels.vercel.app
r3f-webgpu-starter
TSL / WebGPU / Webcam / MediaPipe
Webcam: waiting for permission
Palm: waiting for webcam
Result: PARTIAL. The implementation strategy is reproducible, but this environment cannot grant live webcam/browser GPU interactivity for end-to-end gesture calibration. That last 15% (stability, smoothing constants, "feel") remains unverified here.

Prototype Control Mapper (reproduction output)

// Maps MediaPipe landmarks -> panel transform targets
export function mapHandToPanel(landmarks) {
  // MediaPipe emits 21 keypoints per hand; indices up to 12 are used below
  if (!landmarks || landmarks.length < 21) return null;
  const wrist = landmarks[0];
  const indexMcp = landmarks[5];
  const middleTip = landmarks[12];

  const dx = indexMcp.x - wrist.x;
  const dy = indexMcp.y - wrist.y;
  const pinch = Math.hypot(
    landmarks[4].x - landmarks[8].x,
    landmarks[4].y - landmarks[8].y
  );

  return {
    rotY: (dx - 0.08) * 2.6,      // horizontal hand shift -> panel yaw
    rotX: (dy - 0.10) * -2.0,     // vertical hand shift -> panel pitch
    z: Math.max(-1.8, -0.9 - pinch * 2.4), // pinch -> push/pull depth
    glow: Math.max(0, 1.0 - pinch * 3.0),  // pinch closes glow
    cursorX: middleTip.x,
    cursorY: middleTip.y
  };
}

This mirrors the likely interaction grammar in the artifact: stable anchor points, a pinch-distance scalar, then low-pass smoothing in the render loop.
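The low-pass step mentioned above can be sketched as a per-frame exponential moving average. The `smoothControls` helper and its `alpha` value are illustrative, not taken from the original build:

```javascript
// Exponential moving average over the control object produced by
// mapHandToPanel. `alpha` near 1 tracks the hand tightly; near 0 it
// damps jitter at the cost of latency. 0.25 is an illustrative default.
function smoothControls(prev, next, alpha = 0.25) {
  if (!next) return prev;          // hand lost: hold last known state
  if (!prev) return { ...next };   // first frame: no history to blend
  const out = {};
  for (const key of Object.keys(next)) {
    out[key] = prev[key] + (next[key] - prev[key]) * alpha;
  }
  return out;
}
```

In an R3F render loop this would run once per frame inside `useFrame`, with the previous control object held in a ref.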

IV. Blockers and Resolution Map

Blocker | Type | Status | Resolution
No live webcam stream in this execution context | Access gap | Unresolved here | Run locally on Mac/Chrome with camera permissions and calibrate landmarks in-browser
Unknown exact smoothing constants from creator build | Taste/feel gap | Partially resolved | Tune EMA/spring constants while testing real hand motion
TSL graph specifics not directly visible from post | Tool gap | Resolved by pattern | Start with node-based color/depth modulation, then iterate visually
Important: Gesture demos fail most often at jitter filtering and coordinate normalization, not rendering. Prioritize smoothing and dead-zones before visual polish.
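A minimal dead-zone filter can be sketched as below; `deadZone` is a hypothetical helper and its threshold is illustrative, not a constant from the creator's build:

```javascript
// Dead-zone filter: offsets smaller than `zone` are treated as zero so a
// resting hand does not wobble the panel. Larger offsets are re-mapped
// from [zone, 1] onto [0, 1] so there is no jump at the threshold edge.
function deadZone(value, zone = 0.03) {
  const mag = Math.abs(value);
  if (mag < zone) return 0;
  return Math.sign(value) * ((mag - zone) / (1 - zone));
}
```

Applied to the `dx`/`dy` offsets before the yaw/pitch scaling, this keeps the panel still when the hand hovers near its rest position.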

V. Build Path (Fastest Practical Route)

  1. Start from the R3F + WebGPU starter (Canvas + WebGPU renderer path) [8][6].
  2. Create one panel mesh with TSL node material controls for color/intensity/depth response [7].
  3. Wire MediaPipe hand landmarks into a normalized control object (like mapHandToPanel) [5].
  4. Apply EMA smoothing + dead-zone clamp before mutating panel state.
  5. Only then add MRT / multi-panel compositing for the "stage dive" look.
Time to first good result: 60-120 minutes for a working single-panel gesture interaction; one additional session for aesthetic parity with the original demo.

VI. Critical Assessment

Question | Assessment
Is this a durable skill or a one-off novelty? | Durable. The pattern (vision landmarks -> normalized control bus -> GPU scene modulation) generalizes to creative tools and UI prototyping.
Does it require rare hardware? | No. Modern Chrome + a camera + a WebGPU-capable machine is sufficient; scaling quality is mostly software tuning.
Biggest false-assumption risk | Thinking shader complexity is the hard part. In practice, motion filtering and gesture semantics dominate perceived quality.
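As a concrete instance of gesture semantics mattering more than shaders: a pinch toggle reads better with two thresholds (hysteresis) than one, because the raw pinch distance hovers around any single cutoff. The `makePinchDetector` helper and its thresholds are illustrative assumptions, not values from the original demo:

```javascript
// Pinch state with hysteresis: the gesture closes below `closeAt` and
// only reopens above `openAt`, so small jitter between the two
// thresholds cannot flicker the state. Thresholds are illustrative.
function makePinchDetector(closeAt = 0.05, openAt = 0.09) {
  let pinched = false;
  return function update(pinchDistance) {
    if (!pinched && pinchDistance < closeAt) pinched = true;
    else if (pinched && pinchDistance > openAt) pinched = false;
    return pinched;
  };
}
```

Fed with the `pinch` scalar from mapHandToPanel each frame, this yields a stable boolean suitable for discrete actions (grab, release) alongside the continuous depth/glow mapping.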

Final Verdict: BOTH MASTER

Donna masters the architecture and implementation scaffolding; Eric masters live calibration. This is an ideal split task: Donna writes and iterates the pipeline quickly, while Eric spends short hands-on time tuning real camera behavior for production feel.

If the goal is shipping a polished clone, the next action is to run the scaffold locally with webcam permission, tune 3-5 control constants, and record a before/after interaction clip.


References

[1] Bautista Berto post on X (status 2028577952154165474). Primary artifact and claim: added MediaPipe hand detection on top of WebGPU/TSL experiment.
[2] mediapipe-panels.vercel.app. Live demo header text indicates stack: TSL / WebGPU / Webcam / MediaPipe.
[3] hardcore-tsl-webgpu.vercel.app. Quoted parent experiment: WebGPU render targets + TSL starter context.
[4] fxtwitter mirror metadata for status. Extracted post text and linked experiment URLs when direct X fetch was blocked.
[5] MediaPipe Hand Landmarker docs. Landmark output model used for gesture-to-control mapping.
[6] Three.js WebGPURenderer docs. WebGPU renderer layer for real-time GPU scene pipeline.
[7] Three.js NodeMaterial / TSL docs. Node-based shader authoring model for TSL-style material iteration.
[8] React Three Fiber documentation. React orchestration layer used by the starter projects shown in the demos.