From Raw 360-Degree Footage to AI Insights: A Complete Processing Pipeline

360-degree video captures everything at once, but turning raw dual-fisheye footage into something analytically useful and shareable requires a pipeline of tools working in concert. At adagger, we have built and deployed such a pipeline across several domains, from action sports coaching to industrial inspection. This article walks through each stage: clean stitching using the camera manufacturer’s SDK, applying machine learning models to extract structured data from the footage, previewing and stabilising the result in a browser-based tool, and exporting to the formats that major platforms expect.

The approach is not tied to any single use case. Whether the camera is mounted at the masthead of a sailing dinghy, on a helmet in a cycling race, on a pole above a construction site, or in a retail space, the processing stages are the same. What changes is the trained model and the metrics you want to derive.

Stage 1: Clean Stitching with the Manufacturer SDK

The first and most foundational step is stitching the two fisheye lenses into a single equirectangular frame. Getting this wrong undermines everything downstream: object detection models produce split detections at the stitch seam, keypoint models lose body parts that straddle the boundary, and the video looks wrong to a human reviewer.

A common shortcut is to pass raw camera files through FFmpeg with manually guessed projection parameters. This produces a technically valid equirectangular image, but the stitch quality depends entirely on how well the parameters match the physical lens geometry of the specific camera model. The result is often a visible seam, colour mismatch between the two halves, or geometric distortion near the poles. These artefacts are difficult to fix in post-processing, and they degrade model accuracy precisely where objects of interest frequently appear: at the edges of each lens.

The correct approach is to use the camera manufacturer’s SDK, which applies the factory calibration data baked into each camera body. The SDK knows the precise lens parameters, the inter-lens baseline, and the sensor characteristics for each model. Manufacturers such as Insta360, GoPro (MAX), Ricoh (Theta), and Kandao all provide SDKs or processing libraries for this purpose. The output is a geometrically accurate equirectangular or fisheye-corrected frame with a clean blend at the stitch seam and consistent colour across both lenses.

Our pipeline calls the relevant SDK programmatically so that stitching is fully automated as part of the ingestion step. Raw files are passed in, and the SDK returns calibrated equirectangular video without any manual parameter tuning. This makes the pipeline portable across different camera models from the same manufacturer. The camera-specific calibration is handled by the SDK rather than requiring a set of per-model FFmpeg parameters to be maintained.
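
The ingestion step can be as simple as a thin wrapper around whatever command-line front end the SDK provides. The sketch below is illustrative only: the binary name, flags, and output naming are placeholders, and the real invocation depends entirely on the manufacturer’s tooling.

```python
import subprocess
from pathlib import Path

# Hypothetical command-line front end to the manufacturer SDK; the binary
# name and every flag below are placeholders, not a real interface.
STITCHER_BIN = "vendor_stitcher"

def stitch(raw_file: Path, out_dir: Path) -> Path:
    """Pass a raw dual-fisheye file to the vendor stitcher and return the
    calibrated equirectangular MP4 it writes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    output = out_dir / f"{raw_file.stem}_equirect.mp4"
    subprocess.run(
        [STITCHER_BIN,
         "--input", str(raw_file),
         "--output", str(output),
         "--projection", "equirectangular"],
        check=True,  # fail the ingestion job if the SDK reports an error
    )
    return output
```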

Stage 2: Perspective Reprojection

Equirectangular projection is ideal for storage and transmission but introduces significant distortion, particularly near the poles and at the edges of the frame. Running machine learning inference directly on equirectangular frames degrades accuracy for objects that appear in high-distortion regions, and most models are trained on conventional rectilinear images that do not exhibit this distortion.

After stitching, the pipeline reprojects regions of interest into rectilinear perspective crops. The pointing direction and field of view of each crop are configurable. For a masthead sailing application, for example, we extract a forward-facing crop centred on the sailor, a wider crop covering the fleet ahead, and a 360-degree equatorial strip for fleet tracking. For a helmet-mounted camera, a forward-facing crop covers the road ahead and a rearward crop covers following traffic. The reprojection is computed analytically from the equirectangular coordinates and adds negligible processing time.
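
For readers who want the mechanics, the sketch below shows one way to compute such a crop with NumPy and OpenCV: build a ray for every output pixel, rotate the rays by the chosen yaw and pitch, and sample the equirectangular frame with a remap. The axis and sign conventions here are assumptions for illustration, not a description of our exact implementation.

```python
import cv2
import numpy as np

def equirect_to_perspective(equi: np.ndarray, yaw_deg: float, pitch_deg: float,
                            h_fov_deg: float, out_w: int, out_h: int) -> np.ndarray:
    """Extract a rectilinear crop looking along (yaw, pitch) with the given
    horizontal field of view from an equirectangular frame."""
    eh, ew = equi.shape[:2]
    f = (out_w / 2) / np.tan(np.radians(h_fov_deg) / 2)  # focal length in pixels

    # Ray per output pixel in the crop's camera frame (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(out_w) - out_w / 2, np.arange(out_h) - out_h / 2)
    x, y, z = u / f, v / f, np.ones_like(u)
    n = np.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n

    # Rotate rays by pitch (about x), then yaw (about y).
    pitch, yaw = np.radians(pitch_deg), np.radians(yaw_deg)
    y2 = y * np.cos(pitch) - z * np.sin(pitch)
    z2 = y * np.sin(pitch) + z * np.cos(pitch)
    x3 = x * np.cos(yaw) + z2 * np.sin(yaw)
    z3 = -x * np.sin(yaw) + z2 * np.cos(yaw)

    # Ray direction -> longitude/latitude -> equirectangular pixel coordinates.
    lon = np.arctan2(x3, z3)                  # [-pi, pi]
    lat = np.arcsin(np.clip(-y2, -1.0, 1.0))  # [-pi/2, pi/2], positive = up
    map_x = ((lon / (2 * np.pi)) + 0.5) * ew
    map_y = (0.5 - lat / np.pi) * eh
    return cv2.remap(equi, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR)
```

Because the viewing direction and field of view of each crop are fixed, the remap tables only need to be computed once per crop and can be reused for every frame.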

Stage 3: Machine Learning Inference

With clean, well-projected frames available, machine learning models can be applied reliably. Our pipeline supports two primary model types, which are often run together in a shared inference loop.

Object Detection

Object detection models identify and localise instances of defined object classes within each frame. In a sailing context this means detecting dinghy hulls; in a construction context it might mean detecting workers, vehicles, or equipment; in retail it might mean detecting customers or product placements. We use a YOLOv8-based architecture that runs efficiently on both CPU and GPU.

Detections are linked across frames using a multi-object tracker to produce continuous trajectories. From these trajectories, derived metrics such as distance estimates, approach rates, dwell times, and positional relationships between tracked objects can be computed automatically.
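
A minimal sketch of this stage, using the open-source ultralytics package and its built-in ByteTrack tracker, might look like the following; the weights file and video path are placeholders for a model fine-tuned on the domain’s classes.

```python
from ultralytics import YOLO

# Placeholder weights: a YOLOv8 detection model fine-tuned on the domain's
# object classes (hulls, vehicles, people, ...).
model = YOLO("domain_detector.pt")

tracks: dict[int, list[tuple[int, float, float, float, float]]] = {}

# Stream frames from a reprojected perspective crop and link detections
# across frames with the built-in ByteTrack tracker.
for frame_idx, result in enumerate(
        model.track("forward_crop.mp4", stream=True, persist=True,
                    tracker="bytetrack.yaml")):
    if result.boxes.id is None:   # no confirmed tracks in this frame
        continue
    for box, tid in zip(result.boxes.xywh.tolist(),
                        result.boxes.id.int().tolist()):
        cx, cy, w, h = box
        tracks.setdefault(tid, []).append((frame_idx, cx, cy, w, h))

# Dwell times, approach rates and positional relationships are then derived
# from the per-track trajectories collected in `tracks`.
```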

Keypoint Detection

Keypoint detection models identify the positions of semantically meaningful landmarks within an object: body joints for a person, anatomical reference points for an animal, or structural features for a vehicle. From these landmark positions, a wide range of derived metrics can be computed: joint angles, limb extensions, posture scores, and temporal metrics such as reaction times and movement consistency.

Standard pose estimation models trained on conventional photography do not transfer well to unconventional camera angles: a masthead view looking downward at a sailor, for example, produces an unfamiliar perspective. We use a custom keypoint training pipeline with transfer learning from a base pose model, fine-tuned on a labelled dataset built from frames captured at the target camera angle. Even a few hundred labelled frames are sufficient to produce a reliable model for a specific viewpoint, because the background variation is limited and the camera position is fixed.
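
As an illustration of what viewpoint-specific fine-tuning can look like with off-the-shelf tooling (not necessarily our exact training pipeline), the sketch below starts from a pretrained YOLOv8 pose checkpoint via the ultralytics API; the dataset YAML, which defines the keypoint schema and the labelled frames, and the hyperparameters are placeholders.

```python
from ultralytics import YOLO

# Start from a pretrained pose checkpoint and fine-tune on a small
# viewpoint-specific dataset described by a (placeholder) dataset YAML.
base = YOLO("yolov8n-pose.pt")
base.train(
    data="masthead_pose.yaml",  # keypoint schema + a few hundred labelled frames
    epochs=100,
    imgsz=960,
    freeze=10,                  # keep early backbone layers at pretrained weights
    batch=16,
)
metrics = base.val()            # evaluate on the held-out split before deployment
```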

Stage 4: Browser-Based Preview, Stabilisation, and Export

Raw 360-degree footage shot from a moving platform such as a vehicle or a person running is almost always shaky. Without stabilisation, the footage is disorienting to watch and the motion makes machine learning inference harder. We have developed a browser-based tool that handles preview, stabilisation, and export in a single interface.

Interactive Preview

The preview tool loads equirectangular video and renders it as an interactive spherical view in the browser using WebGL. The viewer can pan in any direction, adjust the field of view, and jump to any point in the footage. This makes it practical to review long recordings without downloading large files or installing desktop software. The tool also displays any annotations generated by the machine learning pipeline, such as bounding boxes, keypoint overlays, and derived metric readouts, overlaid on the spherical view in real time.

Stabilisation

The stabilisation step estimates the rotational motion of the camera across frames and applies a compensating rotation to each frame, effectively removing the high-frequency shake while preserving intentional slow pans. The algorithm works in spherical coordinates, which avoids the border artefacts that appear when stabilising flat video. A smoothing window is configurable so that the trade-off between shake reduction and following intentional camera movement can be tuned to the use case. The result is footage that is comfortable to watch and that presents a stable horizon, which is important both for viewer experience and for model accuracy.
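
The sketch below illustrates the two core operations under simple assumptions: smoothing a per-frame rotation sequence with a moving average, and re-rendering an equirectangular frame under a compensating rotation. How the per-frame rotations are estimated (gyro metadata or feature tracking) is outside the scope of the sketch, and the composition order of the compensating rotation depends on orientation conventions.

```python
import cv2
import numpy as np
from scipy.spatial.transform import Rotation

def smooth_rotations(rots: Rotation, window: int) -> Rotation:
    """Moving-average smoothing of a per-frame camera orientation sequence."""
    out = []
    for i in range(len(rots)):
        lo, hi = max(0, i - window), min(len(rots), i + window + 1)
        out.append(rots[lo:hi].mean())
    return Rotation.concatenate(out)

def rotate_equirect(frame: np.ndarray, rot: Rotation) -> np.ndarray:
    """Re-render an equirectangular frame as if the camera were rotated by `rot`."""
    h, w = frame.shape[:2]
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Unit direction for every output pixel, pulled back into the source frame.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    src = rot.inv().apply(dirs.reshape(-1, 3)).reshape(h, w, 3)
    src_lon = np.arctan2(src[..., 0], src[..., 2])
    src_lat = np.arcsin(np.clip(src[..., 1], -1.0, 1.0))
    map_x = ((src_lon + np.pi) / (2 * np.pi) * w).astype(np.float32)
    map_y = ((np.pi / 2 - src_lat) / np.pi * h).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Per frame i, render with the rotation that takes the raw orientation to the
# smoothed one (exact composition depends on how orientations are defined):
#   stabilised_i = rotate_equirect(frame_i, raw[i].inv() * smoothed[i])
```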

Export to Social Media Formats

Different platforms have different requirements for 360-degree video. YouTube expects equirectangular MP4 with specific XMP spatial metadata injected into the container so that the platform activates its 360 viewer. Instagram does not support 360 playback natively and instead expects a conventional flat rectilinear crop, typically a 1:1 or 4:5 frame extracted from the equirectangular source. Other platforms have their own aspect ratios and codec requirements.

The export tool handles all of this automatically. The user selects a target platform (YouTube, Instagram, TikTok, or a generic download) and the tool applies the correct projection, aspect ratio, bitrate, codec settings, and metadata. For YouTube, the full equirectangular output is produced with injected 360 metadata. For Instagram and TikTok, a rectilinear crop is computed from a configurable viewing direction, with aspect ratio and resolution matched to platform specifications. The stabilised footage is used as the source for all exports, so shaky source material is not passed through to the published output.
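
Under the hood this kind of export can be driven with standard open tooling; the sketch below uses FFmpeg’s v360 filter for the flat crop and Google’s open-source spatial-media tool to inject YouTube’s 360 metadata. The filter options, bitrates, and output sizes are illustrative, not the tool’s actual settings.

```python
import subprocess

def export_instagram(src: str, dst: str, yaw: float = 0.0, pitch: float = 0.0) -> None:
    """Render a flat 1:1 crop from stabilised equirectangular footage using
    FFmpeg's v360 filter (option names per recent FFmpeg releases)."""
    vf = (f"v360=input=equirect:output=flat:yaw={yaw}:pitch={pitch}:"
          f"h_fov=90:v_fov=90,scale=1080:1080")
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-b:v", "8M", "-c:a", "aac", dst],
        check=True,
    )

def export_youtube(src: str, dst: str) -> None:
    """Keep the equirectangular projection and inject 360 XMP metadata using
    Google's spatial-media tool (assumes the package is on the Python path)."""
    subprocess.run(["python", "-m", "spatialmedia", "-i", src, dst], check=True)
```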

Putting It Together: A Complete Pipeline

End to end, the pipeline moves from raw camera files to published, annotated output through the following stages:

  1. Ingestion and stitching: raw camera files are passed to the manufacturer’s SDK (e.g. the Insta360 SDK for Insta360 cameras), which returns a calibrated equirectangular MP4 with a clean stitch seam.
  2. Reprojection: rectilinear perspective crops are extracted from the equirectangular frame at configurable directions and field-of-view angles, reducing distortion in regions of interest.
  3. Inference: object detection and keypoint detection models run on the reprojected crops; detections are mapped back into spherical coordinates and tracked across frames.
  4. Stabilisation: rotational motion is estimated and compensated in spherical coordinates, producing smooth footage suitable for review and publication.
  5. Preview and review: the browser-based tool renders the stabilised footage as an interactive spherical view with annotation overlays, enabling efficient review of long recordings.
  6. Export: the tool exports to target platform formats: equirectangular with 360 metadata for YouTube, rectilinear crops for Instagram and TikTok, or custom crops for other downstream uses.

Practical Challenges

Moving Platform and Rolling Shutter

Cameras mounted on fast-moving platforms such as helmets or vehicles introduce significant motion blur and rolling shutter distortion during rapid movements. Frame selection logic that prefers lower-motion moments reduces the impact on inference quality. Data augmentation during training that simulates motion blur and tilted horizons improves model robustness to these conditions.
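
One simple way to implement that frame selection is to score sampled frames by inter-frame difference and keep the calmer ones; the OpenCV sketch below treats the sampling stride and the fraction kept as tunable assumptions.

```python
import cv2
import numpy as np

def low_motion_frames(video_path: str, stride: int = 5,
                      keep_fraction: float = 0.5) -> list[int]:
    """Return indices of sampled frames with the least inter-frame motion,
    a cheap proxy for motion blur severity."""
    cap = cv2.VideoCapture(video_path)
    scores, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(cv2.resize(frame, (320, 160)), cv2.COLOR_BGR2GRAY)
            if prev is not None:
                scores.append((idx, float(np.mean(cv2.absdiff(gray, prev)))))
            prev = gray
        idx += 1
    cap.release()
    scores.sort(key=lambda s: s[1])          # lowest motion first
    return [i for i, _ in scores[: int(len(scores) * keep_fraction)]]
```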

Variable Lighting

Outdoor 360-degree recording spans a wide range of lighting conditions within a single take, e.g. bright sky, shadows, water reflections, lens flare. Strong data augmentation during training (brightness, contrast, hue, random shadows) is the most effective mitigation. Where the pipeline runs as a service, normalisation applied at the frame level before inference also helps.
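
As an example of what such an augmentation stack can look like, the snippet below uses the albumentations library; the specific transforms and probabilities are illustrative rather than our production settings.

```python
import albumentations as A

# Training-time augmentation approximating outdoor lighting variation:
# exposure swings, colour shifts, cast shadows, and lens flare.
augment = A.Compose(
    [
        A.RandomBrightnessContrast(brightness_limit=0.4, contrast_limit=0.4, p=0.7),
        A.HueSaturationValue(p=0.5),
        A.RandomShadow(p=0.3),
        A.RandomSunFlare(src_radius=120, p=0.1),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Applied per training sample: augment(image=img, bboxes=boxes, class_labels=labels)
```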

Stitch Seam and Model Accuracy

Even with SDK-quality stitching, the equatorial region where the two lenses meet can produce subtle colour gradients or micro-blending artefacts under challenging lighting. For applications where detections near the seam are critical, running overlapping perspective crops from each lens independently, with detections merged using non-maximum suppression, produces better results than working in the stitched equirectangular frame.
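
Merging the per-crop detections is ordinary non-maximum suppression once the boxes have been mapped into a shared coordinate frame; a minimal NumPy version is sketched below.

```python
import numpy as np

def merge_detections(boxes: np.ndarray, scores: np.ndarray,
                     iou_thresh: float = 0.5) -> list[int]:
    """Greedy NMS over detections from overlapping crops.

    boxes: (N, 4) [x1, y1, x2, y2] already mapped into a shared frame
    (e.g. equirectangular pixels); scores: (N,) confidences.
    """
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```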

Scale Calibration

Distance estimation from visual detections requires a known scale reference. Where GPS data is available from the camera, it can be fused with visual tracking to improve distance accuracy. Where it is not, a known reference object in the scene (a hull of known length, a vehicle of known dimensions) provides a pixel-per-metre calibration, computed from a small set of annotated calibration frames.
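
The calibration itself is a simple ratio, illustrated below with made-up numbers; note that a single pixel-per-metre factor is only valid for objects at roughly the same distance as the reference.

```python
def metres_per_pixel(ref_length_m: float, ref_length_px: float) -> float:
    """Scale factor from a reference object of known physical length."""
    return ref_length_m / ref_length_px

# Illustrative numbers: a 4.2 m hull spanning 380 px in a calibration frame.
scale = metres_per_pixel(4.2, 380.0)   # ~0.011 m per pixel
gap_px = 950.0                         # measured pixel gap between two tracked hulls
gap_m = gap_px * scale                 # ~10.5 m, valid near the calibration distance
```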

Use Cases

The pipeline described here is domain-agnostic. We have applied it, or elements of it, across the following areas:

  • Sports coaching: masthead 360-degree video on dinghies for body position analysis, fleet tracking, and tactical review; helmet-mounted cameras in cycling for road position and competitor tracking
  • Industrial inspection: pole-mounted 360-degree cameras on production lines and construction sites for worker safety monitoring, equipment tracking, and progress documentation
  • Event coverage: 360-degree cameras at sporting events and conferences, with automated highlights clipping and multi-platform export
  • Retail and space analytics: ceiling-mounted 360-degree cameras for customer flow analysis, dwell time measurement, and planogram compliance checking

In every case the core pipeline (manufacturer SDK stitching, reprojection, inference, stabilisation, browser preview, platform export) is reused. The domain-specific work is the labelled training data and the metric definitions.

Get Involved

If you have 360-degree video data and a performance or analytics question you want answered, we would be glad to discuss what a custom pipeline could look like for your specific context. We have experience building sports analytics MVPs, custom keypoint detection models, and automated visual inspection systems from initial concept through to production deployment.

Get in touch with the adagger team to discuss your project.
