How Vyom Integrates with SAM2 and ONNX Runtime for Cutting-Edge Image Processing

What is SAM2?

SAM2 (Segment Anything Model 2) is a powerful model from Meta AI designed to identify and segment any object in an image. Here’s how it fits into Vyom:

By integrating the SAM2 model, Vyom automates object segmentation, simplifying downstream tasks such as classification, tagging, or augmented reality overlays.

Segment Anything Model 2 (SAM2) extends the foundational Segment Anything Model to handle both images and video in real time. Originally, Segment Anything was designed for promptable visual segmentation: users provide minimal “prompts” (like clicks or boxes), and the model automatically identifies the corresponding objects.

Key Features of SAM2

  • Video Extension: Views images as single-frame videos to unify image and video segmentation.
  • Transformer Architecture: Uses streaming memory for real-time video processing.
  • Model-in-the-Loop Data Engine: Collects and refines data via user interaction, building what is currently the largest video segmentation dataset.
  • High Performance: Achieves strong results across a wide range of tasks and visual domains.

What is ONNX Runtime?

ONNX Runtime is a cross-platform, high-performance runtime for machine learning models in the ONNX (Open Neural Network Exchange) format.

Embedding foundation AI models on your mobile app

While VyomOS-powered robots and UAVs rely on powerful companion computers to run advanced deep learning models, our users sometimes need that same capability in their hands, on their mobile phones.

Today’s phones do have powerful CPUs and GPUs, sometimes rivalling even our companion computers. Vyom’s mobile GCS app integrates proprietary and open-source foundation models to use this power for our users.

Our integration of Meta’s SAM2 with satellite imagery helps our users plan drone missions better, reducing the planning stage from hours to mere seconds.

Understanding How SAM2 Works: Architecture & Workflow Steps

SAM2 comprises two core components:

  1. Encoder
    • Processes the input image (scaled to 1024×1024) to create intermediate embeddings.
  2. Decoder
    • Accepts user “prompt points” (also scaled to 1024×1024) plus the encoder outputs to generate segmentation masks.

The decoder produces multiple masks, each with a confidence score (0–1). Each point in the resulting masks also carries its own confidence value, allowing you to filter low-confidence points or masks.

Important Considerations

  • Image Dimensions: SAM2 is trained on 1024×1024 inputs. Make sure you scale images and any user input coordinates accordingly (see the sketch after this list).
  • Confidence Scores: Filter out masks or points that fall below your chosen confidence threshold.
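
As a rough illustration, here is how the scaling and confidence filtering above might look in Kotlin. The helper names and the 0.8 threshold are ours for this sketch, not part of SAM2 or the Vyom repo:

// Hypothetical helpers; the repo's actual code may differ.
data class PromptPoint(val x: Float, val y: Float, val label: Int)

// Map a tap on the displayed image into SAM2's 1024x1024 input space.
fun toModelSpace(tapX: Float, tapY: Float, viewW: Float, viewH: Float, label: Int): PromptPoint {
    val target = 1024f  // SAM2 expects 1024x1024 inputs
    return PromptPoint(tapX / viewW * target, tapY / viewH * target, label)
}

// Keep only masks whose confidence score clears a chosen threshold.
fun filterMasks(masks: List<FloatArray>, scores: FloatArray, threshold: Float = 0.8f): List<FloatArray> =
    masks.filterIndexed { i, _ -> scores[i] >= threshold }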

Key conversion details:

  • The encoder produces three feature maps: high_res_feats_0, high_res_feats_1, and image_embed.
  • The decoder supports dynamic input shapes for point coordinates and masks.
  • The multimask output option allows the model to generate multiple segmentation masks.

Below is a simplified illustration of our segmentation pipeline, followed by a short code sketch of the same flow:

   ┌──────────────┐          ┌──────────────┐
   │ Input Image  │          │ User Prompts │  (x, y) coords + labels
   └──────┬───────┘          └──────┬───────┘
          │                         │
          ▼                         │
   ┌──────────────┐                 │
   │ SAM2 Encoder │                 │
   │(Encoder.onnx)│                 │
   └──────┬───────┘                 │
          │ image_embed,            │
          │ high_res_feats_0/1      │
          ▼                         ▼
   ┌────────────────────────────────────┐
   │            SAM2 Decoder            │
   │            (Decoder.onnx)          │
   └──────────────────┬─────────────────┘
                      │
                      ▼
            ┌──────────────────┐
            │  Segmentation    │
            │      Masks       │
            └──────────────────┘
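
To make the flow concrete, here is a condensed Kotlin sketch of the same pipeline using ONNX Runtime’s Android API. The tensor names come from the ONNX export described later in this post; the 256×256 mask_input shape and the output ordering are assumptions that may differ for your exported model:

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import java.nio.FloatBuffer

// Condensed sketch of the encoder -> decoder pipeline (error handling and
// tensor clean-up omitted for brevity).
fun runPipeline(
    encoder: OrtSession,
    decoder: OrtSession,
    image: FloatArray,    // preprocessed RGB image, 1 x 3 x 1024 x 1024
    points: FloatArray,   // prompt coordinates already scaled to 1024 x 1024
    labels: FloatArray    // 1 = add segment, 0 = remove segment
): OrtSession.Result {
    val env = OrtEnvironment.getEnvironment()
    val numPoints = labels.size.toLong()

    // 1. Encoder: produces image_embed plus two high-resolution feature maps.
    val imageTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(image), longArrayOf(1, 3, 1024, 1024))
    val encOut = encoder.run(mapOf("image" to imageTensor))

    // 2. Decoder: encoder features plus the user's prompt points and labels.
    //    mask_input / has_mask_input are zeroed here (no previous mask);
    //    the 1 x 1 x 256 x 256 shape is an assumption for this export.
    val decInputs = mapOf(
        "image_embed" to encOut.get("image_embed").get() as OnnxTensor,
        "high_res_feats_0" to encOut.get("high_res_feats_0").get() as OnnxTensor,
        "high_res_feats_1" to encOut.get("high_res_feats_1").get() as OnnxTensor,
        "point_coords" to OnnxTensor.createTensor(env, FloatBuffer.wrap(points), longArrayOf(1, numPoints, 2)),
        "point_labels" to OnnxTensor.createTensor(env, FloatBuffer.wrap(labels), longArrayOf(1, numPoints)),
        "mask_input" to OnnxTensor.createTensor(env, FloatBuffer.wrap(FloatArray(256 * 256)), longArrayOf(1, 1, 256, 256)),
        "has_mask_input" to OnnxTensor.createTensor(env, FloatBuffer.wrap(floatArrayOf(0f)), longArrayOf(1))
    )

    // 3. The result holds the predicted masks and their confidence scores;
    //    exact output names depend on how the decoder was exported.
    return decoder.run(decInputs)
}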

Implementation Details in React Native (Android)

In our Vyom application, we’ve implemented SAM2 with Kotlin coroutines so that inference and image processing run asynchronously without blocking the main UI thread.

Kotlin Coroutines & Native Module

  • Asynchronous Loading: Model files (Encoder.onnx and Decoder.onnx) are loaded in the background.
  • React Native Bridge: We created a native module that exposes the segmentation functionality to our JavaScript components (a simplified sketch follows).
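
A stripped-down version of such a module might look like the sketch below; the module and method names are illustrative, not necessarily those used in the repo:

import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import com.facebook.react.bridge.Promise
import com.facebook.react.bridge.ReactApplicationContext
import com.facebook.react.bridge.ReactContextBaseJavaModule
import com.facebook.react.bridge.ReactMethod
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

// Illustrative native module; "Sam2Segmenter" and loadModels are placeholder names.
class Sam2Module(context: ReactApplicationContext) : ReactContextBaseJavaModule(context) {
    private val scope = CoroutineScope(Dispatchers.IO)
    private val env = OrtEnvironment.getEnvironment()
    private var encoder: OrtSession? = null
    private var decoder: OrtSession? = null

    override fun getName() = "Sam2Segmenter"

    // Load Encoder.onnx / Decoder.onnx off the main thread so the UI stays responsive.
    @ReactMethod
    fun loadModels(encoderPath: String, decoderPath: String, promise: Promise) {
        scope.launch {
            try {
                encoder = env.createSession(encoderPath)
                decoder = env.createSession(decoderPath)
                promise.resolve(true)
            } catch (e: Exception) {
                promise.reject("MODEL_LOAD_FAILED", e)
            }
        }
    }
}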

Multiple Points & Labels

  • Selecting Multiple Segments: Append points with label = 1 to the points array to segment additional areas.
  • Removing Segments: Pass a point inside an existing segment with label = 0 to remove it.

These labels and points are fed into the decoder, which recalculates masks based on the new prompts.
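
Conceptually, the prompt bookkeeping is simple; here is a simplified sketch (not the repo’s exact data structures):

// Every tap appends a point; its label decides whether the decoder should
// grow the selection (1) or carve a region out of it (0). Re-running the
// decoder with the full list keeps the masks consistent with all prompts.
data class Prompt(val x: Float, val y: Float, val label: Int)

class PromptStore {
    private val prompts = mutableListOf<Prompt>()

    fun addSegmentPoint(x: Float, y: Float) { prompts += Prompt(x, y, label = 1) }
    fun removeSegmentPoint(x: Float, y: Float) { prompts += Prompt(x, y, label = 0) }

    // Flatten into the coordinate and label arrays the decoder expects.
    fun coords(): FloatArray = prompts.flatMap { listOf(it.x, it.y) }.toFloatArray()
    fun labels(): FloatArray = prompts.map { it.label.toFloat() }.toFloatArray()
}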

Getting Started: Running the App

Ready to try it out yourself? Follow these steps to run our sample React Native app using SAM2 and ONNX Runtime:

1. Clone the Repo

git clone https://github.com/vyom-os/SAM2ImplementationReactNative.git

2. Create an Assets Folder

Create an assets folder in the project and place the two ONNX model files (Encoder.onnx and Decoder.onnx) inside it; the conversion section below explains how to obtain them.

Converting SAM2 to ONNX Format

To run the model with ONNX Runtime and keep its size manageable, you can convert the SAM2 PyTorch model to ONNX format. Follow these steps to convert your model:

1. Download the Model

Start by downloading the SAM 2 PyTorch checkpoint. For this example, we’ll use the smallest variant:

Download SAM 2 Checkpoint

2. Set Up the Environment

First, clone the SAM 2 repository:

git clone https://github.com/facebookresearch/sam2.git

cd sam2

Then, install the necessary dependencies:

pip3 install -e .  # SAM 2 requires Python >= 3.10

pip3 install onnx onnxscript onnxsim onnxruntime

The export process will produce two ONNX files:

  • sam2_hiera_tiny_encoder.onnx (~109 MB)
  • sam2_hiera_tiny_decoder.onnx (~16 MB)

Here’s the code to export the encoder:

import torch

# sam2_encoder is a wrapper around the SAM 2 image encoder, img is a dummy
# 1 x 3 x 1024 x 1024 input tensor, and model_type names the checkpoint
# (e.g. "sam2_hiera_tiny"); see the SAM 2 export scripts for the full setup.
torch.onnx.export(
    sam2_encoder,
    img,
    f"{model_type}_encoder.onnx",
    export_params=True,
    opset_version=17,
    input_names=['image'],
    output_names=['high_res_feats_0', 'high_res_feats_1', 'image_embed']
)

For the decoder, use the following export command with dynamic input shapes for interactive usage:

torch.onnx.export(
    sam2_decoder,
    # ... input parameters ...
    dynamic_axes={
        "point_coords": {0: "num_labels", 1: "num_points"},
        "point_labels": {0: "num_labels", 1: "num_points"},
        "mask_input": {0: "num_labels"},
        "has_mask_input": {0: "num_labels"}
    }
)

Alternatively, you can download a pre-converted SAM2 ONNX model from Hugging Face: https://huggingface.co/models?p=1&sort=trending&search=segment+anything

3. Install Dependencies

cd SAM2ImplementationReactNative

npm install

4. Run the App

npx react-native run-android

5. Explore

  • Launch on an Android device or emulator.
  • Tap the screen to add points (label=1) or remove existing segments (label=0).

Note:

  • Don’t forget to add the ONNX Runtime Android library (e.g., com.microsoft.onnxruntime:onnxruntime-android:<latest.release>) in your app/build.gradle.
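
For reference, if your project uses the Gradle Kotlin DSL, the dependency declaration would look roughly like this (replace the placeholder with the actual latest version):

// app/build.gradle.kts
dependencies {
    implementation("com.microsoft.onnxruntime:onnxruntime-android:<latest.release>")
}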

Check Out Our GitHub Repository

We’ve open-sourced a reference implementation for you to explore:

Vyom SAM2 Implementation on React Native

Feel free to fork the repo, experiment, and contribute back. It includes a detailed README (labeled “Readme 2 file”) explaining each integration step in greater detail.

GitHub Repo: Check out our code in the SAM2ImplementationReactNative repository for more details on the structure and implementation.

Conclusion

With SAM2, ONNX Runtime, and a robust React Native setup, Vyom demonstrates that real-time, on-device segmentation is not just possible but highly efficient. By maintaining control over user prompts, scaling, and model inference, our approach ensures a flexible, interactive user experience - whether handling single images or streaming video.

We hope you enjoy exploring our approach and harnessing the power of SAM2 in your own React Native apps.

For further details or troubleshooting tips, check out our GitHub repo or consult the official documentation links referenced above.