MediaPipe - Technical Overview
MediaPipe is an open-source framework developed by Google for building cross-platform, customizable machine learning solutions for live and streaming media. It provides pre-built pipelines for computer vision tasks such as face detection, hand tracking, and pose estimation, and now also supports on-device LLM inference. As part of Google AI Edge, MediaPipe is designed for real-time performance on mobile, edge, web, and desktop platforms.
Diagram panels (figures not reproduced here): High-Level Architecture; How It Works - Graph-Based Processing; Detection and Tracking Pipeline; BlazePose/BlazeFace/BlazePalm Model Architecture; MediaPipe Solutions Suite; Landmark Topologies; Model Maker - Transfer Learning Flow; LLM Inference Pipeline; Ecosystem - Platforms and Integrations.
Key Concepts
Graph-Based Architecture
MediaPipe uses a Directed Acyclic Graph (DAG) for processing:
| Component | Description |
|---|---|
| Packet | Basic data unit with timestamp and immutable payload |
| Stream | Sequence of packets between nodes |
| Calculator | Processing node that transforms packets |
| Graph | DAG defining data flow between calculators |
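All four concepts are exposed through MediaPipe's low-level Python bindings. The minimal sketch below mirrors the framework's documented pass-through example: it builds a one-node graph, sends timestamped string packets through an input stream, and observes them on the output stream (the stream names are arbitrary):

```python
import mediapipe as mp

# Graph config: one calculator that copies its input stream to its output stream.
config_text = """
  input_stream: 'in_stream'
  output_stream: 'out_stream'
  node {
    calculator: 'PassThroughCalculator'
    input_stream: 'in_stream'
    output_stream: 'out_stream'
  }
"""

graph = mp.CalculatorGraph(graph_config=config_text)
output_packets = []

# Collect every packet that arrives on the graph's output stream.
graph.observe_output_stream(
    'out_stream',
    lambda stream_name, packet: output_packets.append(
        mp.packet_getter.get_str(packet)))

graph.start_run()

# A packet is an immutable payload stamped with a timestamp via .at(...).
for ts, text in enumerate(['hello', 'world'], start=1):
    graph.add_packet_to_input_stream(
        'in_stream', mp.packet_creator.create_string(text).at(ts))

graph.close()          # Flush the graph and wait until processing finishes.
print(output_packets)  # ['hello', 'world']
```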
Two-Stage Detection Pattern
Most MediaPipe vision solutions use a detector-tracker pattern:
- Detector (Slow): Runs on selected frames to find regions of interest
- Tracker (Fast): Tracks ROI across frames, re-detecting when confidence drops
This enables real-time performance by running heavy detection only when needed.
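A schematic sketch of that control flow is shown below; detect_roi(), track_roi(), and the video source are hypothetical placeholders standing in for the heavy detector model, the lightweight landmark/tracking model, and the frame reader, not MediaPipe APIs:

```python
# Schematic detector-tracker loop. detect_roi(), track_roi(), and
# video.read_frame() are hypothetical placeholders.
CONFIDENCE_THRESHOLD = 0.5

def run_pipeline(video):
    roi = None
    while True:
        frame = video.read_frame()
        if frame is None:
            break
        if roi is None:
            # Slow path: full-frame detection to (re)acquire the region of interest.
            roi = detect_roi(frame)
            if roi is None:
                continue  # Nothing found; try again on the next frame.
        # Fast path: the landmark model runs only on the cropped ROI and
        # also predicts the ROI for the next frame.
        landmarks, next_roi, confidence = track_roi(frame, roi)
        roi = next_roi if confidence >= CONFIDENCE_THRESHOLD else None
        yield landmarks
```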
Blaze Model Family
| Model | Task | Landmarks | Performance |
|---|---|---|---|
| BlazeFace | Face Detection | 6 keypoints | Sub-millisecond on mobile GPU |
| BlazePalm | Palm Detection | 7 keypoints | Real-time palm localization |
| BlazePose Lite | Pose Estimation | 33 keypoints | 2.7 MFlop, 1.3M params |
| BlazePose Full | Pose Estimation | 33 keypoints | 6.9 MFlop, 3.5M params |
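BlazePose, for example, is reachable through the (now legacy) Python Solutions API, where model_complexity selects the Lite (0), Full (1), or Heavy (2) variant; a short sketch, assuming a local image file named pose.jpg:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# model_complexity: 0 = Lite, 1 = Full, 2 = Heavy.
with mp_pose.Pose(static_image_mode=True, model_complexity=1) as pose:
    image = cv2.imread("pose.jpg")  # hypothetical input image
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    print(len(results.pose_landmarks.landmark))  # 33 keypoints
    nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
    print(nose.x, nose.y, nose.z, nose.visibility)
```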
Model Maker Customization
- Uses transfer learning to retrain models with custom data
- Requires ~100 samples per class
- Training typically completes in minutes
- Supports: Object Detection, Image Classification, Gesture Recognition, Text Classification
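A sketch of that transfer-learning flow for image classification with the mediapipe-model-maker package; the dataset path and hyperparameters are illustrative, and option names can shift between releases:

```python
from mediapipe_model_maker import image_classifier

# Expects one sub-folder per class, each with ~100+ example images.
data = image_classifier.Dataset.from_folder("flower_photos/")  # hypothetical path
train_data, rest = data.split(0.8)
validation_data, test_data = rest.split(0.5)

options = image_classifier.ImageClassifierOptions(
    supported_model=image_classifier.SupportedModels.MOBILENET_V2,
    hparams=image_classifier.HParams(epochs=10, export_dir="exported_model"),
)

# Transfer learning on top of the pre-trained backbone.
model = image_classifier.ImageClassifier.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options,
)

loss, accuracy = model.evaluate(test_data)
model.export_model()  # Writes a .tflite model usable with the Tasks API.
```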
Key Facts (2025)
- Framework Version: 0.10.31 (December 2025)
- Repository: github.com/google-ai-edge/mediapipe
- License: Apache 2.0 (Open Source)
- Primary Languages: C++, Python, JavaScript, Java, Swift
- Face Landmarks: 478 3D points with 52 blendshape expressions (see the sketch after this list)
- Hand Landmarks: 21 keypoints per hand
- Pose Landmarks: 33 full-body keypoints
- Holistic Total: 543 combined landmarks (pose + face + hands)
- LLM Support: Gemma 2B/7B, Phi-2, Falcon, StableLM
- LoRA Support: Fine-tuning for Gemma and Phi-2 models
- WebGPU: Recently open-sourced WebGPU helpers
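A short sketch of reading the face landmark and blendshape counts above with the Tasks Python API; the .task model bundle is downloaded separately, and both file paths are placeholders:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# 'face_landmarker.task' is a placeholder path to the downloaded model bundle.
options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("face.jpg")  # hypothetical input image
result = landmarker.detect(image)

if result.face_landmarks:
    print(len(result.face_landmarks[0]))    # 478 3D landmarks
    print(len(result.face_blendshapes[0]))  # 52 blendshape scores
```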
Performance Benchmarks
| Solution | Platform | Backend | Performance |
|---|---|---|---|
| Face Detection | Mobile | GPU | Sub-millisecond |
| Pose Estimation | Jetson Nano | CPU | ~7 FPS |
| Pose Estimation | Mobile | GPU | 20+ FPS |
| Selfie Segmentation | Web | GPU | <3 ms inference |
| Selfie Segmentation | Web | CPU | 120+ ms |
| LLM (Gemma 2B) | Mobile | GPU | 50-200 ms response |
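These figures vary widely by device and backend, so they are best re-measured on the target hardware; a minimal timing sketch, assuming a landmarker and mp.Image created as in the earlier sketches:

```python
import time

# Assumes `landmarker` and `image` were created as in the earlier sketches.
WARMUP, RUNS = 5, 50

for _ in range(WARMUP):          # Warm-up: first calls include graph/GPU setup.
    landmarker.detect(image)

start = time.perf_counter()
for _ in range(RUNS):
    landmarker.detect(image)
elapsed_ms = (time.perf_counter() - start) * 1000 / RUNS
print(f"average inference latency: {elapsed_ms:.1f} ms")
```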
GPU Support
- Android/Linux: OpenGL ES up to 3.2
- iOS: OpenGL ES 3.0, Metal
- Web: WebGL, WebGPU (experimental)
- Requirement: OpenGL ES 3.1+ (on Android/Linux) for ML inference calculators and graphs
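Where a GPU backend is available, the Tasks Python API selects it through BaseOptions; a sketch assuming a downloaded hand landmarker bundle (the path is a placeholder, and GPU delegate availability depends on platform and release):

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Request GPU inference; availability depends on platform, drivers, and release.
base_options = python.BaseOptions(
    model_asset_path="hand_landmarker.task",   # hypothetical model path
    delegate=python.BaseOptions.Delegate.GPU,  # default is Delegate.CPU
)
options = vision.HandLandmarkerOptions(base_options=base_options, num_hands=2)
landmarker = vision.HandLandmarker.create_from_options(options)
```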
Common Use Cases
- Augmented Reality: Face filters, virtual try-on, makeup effects
- Fitness & Sports: Form analysis, rep counting, motion tracking
- Healthcare: Fall prevention, physical therapy monitoring
- Sign Language Recognition: Accessibility applications
- Gaming: Gesture controls, full-body game input
- Video Conferencing: Background blur, face effects
- Robotics & Drones: Object tracking, navigation
- Creative Tools: Animation, motion capture
Technical Specifications
| Component | Technology |
|---|---|
| Core Language | C++ |
| Build System | Bazel |
| Model Format | TensorFlow Lite (.tflite) |
| Video Processing | OpenCV |
| Audio Processing | FFMPEG |
| GPU Compute | OpenGL ES, Metal, WebGPU |
| CPU Inference | XNNPACK |
| Quantization | INT8, FP16 (QAT and PTQ) |
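The quantization row refers to standard TensorFlow Lite techniques rather than anything MediaPipe-specific; a generic post-training quantization sketch with a hypothetical SavedModel path:

```python
import tensorflow as tf

# Post-training quantization of a SavedModel into a .tflite file.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # hypothetical path

# Dynamic-range quantization (INT8 weights):
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# For FP16 weights instead, additionally restrict the target types:
# converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```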
Security Considerations
- On-Device Processing: Data never leaves the device for inference
- Privacy by Design: No cloud dependency for core ML tasks
- Model Protection: TFLite models can be encrypted
- Input Validation: Sanitize input dimensions to prevent buffer overflows
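A sketch of the kind of check meant by the input-validation point, with hypothetical size limits, applied before a frame reaches the inference graph:

```python
import numpy as np

# Hypothetical bounds; choose limits that match the graph's expected input.
MAX_WIDTH, MAX_HEIGHT = 4096, 4096

def validate_frame(frame: np.ndarray) -> np.ndarray:
    """Reject frames whose shape or dtype the pipeline was not built for."""
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError(f"expected an HxWx3 image, got shape {frame.shape}")
    h, w, _ = frame.shape
    if not (0 < w <= MAX_WIDTH and 0 < h <= MAX_HEIGHT):
        raise ValueError(f"frame size {w}x{h} outside allowed bounds")
    if frame.dtype != np.uint8:
        raise ValueError(f"expected uint8 pixels, got {frame.dtype}")
    return frame
```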