
MediaPipe - Technical Overview

MediaPipe is an open-source framework developed by Google for building cross-platform, customizable machine learning solutions for live and streaming media. It provides pre-built pipelines for computer vision tasks like face detection, hand tracking, pose estimation, and now includes support for on-device LLM inference. Part of Google AI Edge, MediaPipe is designed for real-time performance on mobile, edge, web, and desktop platforms.

High-Level Architecture

[Diagrams not reproduced here. Topics covered:]

  • How It Works - Graph-Based Processing
  • Detection and Tracking Pipeline
  • BlazePose/BlazeFace/BlazePalm Model Architecture
  • MediaPipe Solutions Suite
  • Landmark Topologies
  • Model Maker - Transfer Learning Flow
  • LLM Inference Pipeline
  • Ecosystem - Platforms and Integrations

Key Concepts

Graph-Based Architecture

MediaPipe uses a Directed Acyclic Graph (DAG) for processing:

Component  | Description
Packet     | Basic data unit with a timestamp and an immutable payload
Stream     | Sequence of packets between nodes
Calculator | Processing node that transforms packets
Graph      | DAG defining data flow between calculators
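The four concepts above can be illustrated with a toy Python sketch. This is a conceptual illustration only, not MediaPipe's real C++ framework API; the chain here is a minimal DAG where each calculator feeds the next.

```python
# Toy illustration of MediaPipe's graph concepts: Packet, Calculator, Graph.
# (Conceptual sketch only; not the real MediaPipe framework API.)
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass(frozen=True)  # frozen mimics the packet's immutable payload
class Packet:
    timestamp: int
    payload: Any

class Calculator:
    """A processing node: transforms an input packet into an output packet."""
    def __init__(self, name: str, fn: Callable[[Any], Any]):
        self.name, self.fn = name, fn

    def process(self, packet: Packet) -> Packet:
        # Output packets keep the input timestamp, as in MediaPipe streams.
        return Packet(packet.timestamp, self.fn(packet.payload))

class Graph:
    """A linear chain of calculators: the simplest possible DAG."""
    def __init__(self, calculators: List[Calculator]):
        self.calculators = calculators

    def run(self, packet: Packet) -> Packet:
        for calc in self.calculators:  # streams connect consecutive nodes
            packet = calc.process(packet)
        return packet

graph = Graph([
    Calculator("scale", lambda x: x * 2),
    Calculator("shift", lambda x: x + 1),
])
out = graph.run(Packet(timestamp=0, payload=10))
print(out)  # Packet(timestamp=0, payload=21)
```

In the real framework the graph is declared in a protobuf text config and calculators run concurrently as packets become available; the sketch only shows the data-flow idea.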

Two-Stage Detection Pattern

Most MediaPipe vision solutions use a detector-tracker pattern:

  1. Detector (Slow): Runs on selected frames to find regions of interest
  2. Tracker (Fast): Tracks ROI across frames, re-detecting when confidence drops

This enables real-time performance by running heavy detection only when needed.
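The detector-tracker loop can be sketched as follows. The `detect` and `track` functions here are hypothetical stubs standing in for ML models; the control flow (track until confidence drops, then re-detect) is the point.

```python
# Sketch of the two-stage detector-tracker loop described above.
# detect() and track() are hypothetical stubs, not real MediaPipe calls.
def detect(frame):
    # Slow path: full-frame detection, always confident in this toy.
    return frame["roi"], 0.9

def track(frame, prev_roi):
    # Fast path: follows the previous ROI; confidence collapses on hard frames.
    return frame["roi"], 0.2 if frame.get("hard") else 0.9

def run_pipeline(frames, confidence_threshold=0.5):
    roi = None
    detector_runs = 0
    rois = []
    for frame in frames:
        if roi is None:
            roi, confidence = detect(frame)          # heavy model, rare
            detector_runs += 1
        else:
            roi, confidence = track(frame, roi)      # light model, every frame
        rois.append(roi)
        if confidence < confidence_threshold:
            roi = None  # force re-detection on the next frame
    return rois, detector_runs

frames = [{"roi": i} for i in range(5)]
frames[2]["hard"] = True  # tracking fails here, so frame 3 re-detects
rois, detector_runs = run_pipeline(frames)
print(detector_runs)  # 2: the initial detection plus one re-detection
```

Only two of the five frames pay the detector's cost, which is exactly how the pattern keeps the pipeline real-time.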

Blaze Model Family

Model          | Task            | Landmarks    | Performance
BlazeFace      | Face Detection  | 6 keypoints  | Sub-millisecond on mobile GPU
BlazePalm      | Palm Detection  | 7 keypoints  | Real-time palm localization
BlazePose Lite | Pose Estimation | 33 keypoints | 2.7 MFLOPs, 1.3M params
BlazePose Full | Pose Estimation | 33 keypoints | 6.9 MFLOPs, 3.5M params

Model Maker Customization

  • Uses transfer learning to retrain models with custom data
  • Requires ~100 samples per class
  • Training typically completes in minutes
  • Supports: Object Detection, Image Classification, Gesture Recognition, Text Classification
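The core idea Model Maker automates, a frozen pretrained backbone with a small retrainable head, can be shown with a toy stdlib-only sketch. This is not the `mediapipe_model_maker` API; the "backbone" is a fixed hand-written feature function and the "head" is a perceptron trained on four hypothetical samples.

```python
# Toy transfer-learning sketch: frozen backbone + trainable head.
# (Illustrative only; not the mediapipe_model_maker API.)
def frozen_backbone(x):
    # Stand-in for a pretrained feature extractor: its weights never change.
    return [x[0] + x[1], x[0] - x[1]]

def train_head(samples, labels, epochs=50, lr=0.1):
    # Only this small linear head is trained on the custom data.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_backbone(x)
            pred = 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
            err = y - pred  # perceptron update; zero once converged
            w[0] += lr * err * f[0]
            w[1] += lr * err * f[1]
            b += lr * err
    return w, b

def predict(w, b, x):
    f = frozen_backbone(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

samples = [(0, 0), (0, 1), (2, 2), (3, 1)]  # tiny custom dataset
labels = [0, 0, 1, 1]
w, b = train_head(samples, labels)
preds = [predict(w, b, x) for x in samples]
print(preds)  # [0, 0, 1, 1]
```

Because only the head's few parameters are updated, training converges in moments, which is why Model Maker can get away with ~100 samples per class and minutes of training.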

Key Facts (2025)

  • Framework Version: 0.10.31 (December 2025)
  • Repository: github.com/google-ai-edge/mediapipe
  • License: Apache 2.0 (Open Source)
  • Primary Languages: C++, Python, JavaScript, Java, Swift
  • Face Landmarks: 478 3D points with 52 blendshape expressions
  • Hand Landmarks: 21 keypoints per hand
  • Pose Landmarks: 33 full-body keypoints
  • Holistic Total: 543 combined landmarks (pose + face + hands)
  • LLM Support: Gemma 2B/7B, Phi-2, Falcon, StableLM
  • LoRA Support: Fine-tuning for Gemma and Phi-2 models
  • WebGPU: Recently open-sourced WebGPU helpers
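The landmark counts above are mutually consistent, assuming the Holistic face mesh uses the 468-point topology (the 478-point Face Landmarker figure adds 10 iris landmarks):

```python
# Consistency check on the landmark counts listed above.
POSE, FACE_MESH, IRIS, HAND = 33, 468, 10, 21

assert FACE_MESH + IRIS == 478          # Face Landmarker: mesh + iris points
holistic = POSE + FACE_MESH + 2 * HAND  # pose + face mesh + both hands
print(holistic)  # 543
```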

Performance Benchmarks

Solution            | Platform | Backend | Latency / Throughput
Face Detection      | Mobile   | GPU     | Sub-millisecond
Pose Estimation     | Mobile   | CPU     | ~7 FPS (Jetson Nano)
Pose Estimation     | Mobile   | GPU     | ~20+ FPS
Selfie Segmentation | Web      | GPU     | <3 ms inference
Selfie Segmentation | Web      | CPU     | 120+ ms
LLM (Gemma 2B)      | Mobile   | GPU     | 50-200 ms response
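To compare the millisecond and FPS rows above on one scale, latency can be converted to a throughput upper bound (this helper is illustrative, not part of MediaPipe):

```python
# Hypothetical helper: convert per-frame latency to a throughput upper bound.
def latency_ms_to_fps(latency_ms: float) -> float:
    return 1000.0 / latency_ms

# Selfie segmentation on the web: GPU at <3 ms vs CPU at 120+ ms.
gpu_fps = latency_ms_to_fps(3)    # roughly 333 FPS upper bound
cpu_fps = latency_ms_to_fps(120)  # roughly 8.3 FPS upper bound
print(round(gpu_fps), round(cpu_fps, 1))
```

The ~40x gap is why the GPU backends matter for interactive web use cases.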

GPU Support

  • Android/Linux: OpenGL ES up to 3.2
  • iOS: OpenGL ES 3.0, Metal
  • Web: WebGL, WebGPU (experimental)
  • Requirement: OpenGL ES 3.1+ for ML inference calculators

Common Use Cases

  1. Augmented Reality: Face filters, virtual try-on, makeup effects
  2. Fitness & Sports: Form analysis, rep counting, motion tracking
  3. Healthcare: Fall prevention, physical therapy monitoring
  4. Sign Language Recognition: Accessibility applications
  5. Gaming: Gesture controls, full-body game input
  6. Video Conferencing: Background blur, face effects
  7. Robotics & Drones: Object tracking, navigation
  8. Creative Tools: Animation, motion capture

Technical Specifications

Component        | Technology
Core Language    | C++
Build System     | Bazel
Model Format     | TensorFlow Lite (.tflite)
Video Processing | OpenCV
Audio Processing | FFmpeg
GPU Compute      | OpenGL ES, Metal, WebGPU
CPU Inference    | XNNPACK
Quantization     | INT8, FP16 (QAT and PTQ)

Security Considerations

  • On-Device Processing: Data never leaves the device for inference
  • Privacy by Design: No cloud dependency for core ML tasks
  • Model Protection: TFLite models can be encrypted
  • Input Validation: Sanitize input dimensions to prevent buffer overflows
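The input-validation point can be sketched as a dimension sanitizer. The limits and function name here are hypothetical; real bounds depend on the model and platform.

```python
# Sketch of input-dimension sanitization (hypothetical limits and names).
MAX_WIDTH, MAX_HEIGHT, MAX_CHANNELS = 8192, 8192, 4

def validate_image_dims(width: int, height: int, channels: int) -> None:
    """Reject dimensions that could overflow buffer-size arithmetic."""
    if not (0 < width <= MAX_WIDTH and 0 < height <= MAX_HEIGHT):
        raise ValueError(f"image size {width}x{height} out of range")
    if channels not in (1, 3, 4):
        raise ValueError(f"unsupported channel count: {channels}")
    # Also bound the byte-size product: large-but-valid axes can still
    # overflow downstream size computations.
    if width * height * channels > MAX_WIDTH * MAX_HEIGHT * MAX_CHANNELS:
        raise ValueError("image buffer too large")

validate_image_dims(1920, 1080, 3)  # a typical HD frame passes
try:
    validate_image_dims(-1, 1080, 3)
except ValueError as e:
    print(e)  # image size -1x1080 out of range
```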
