cactus

1.4.3-beta

Fast, lightweight inference framework for energy-efficient on-device AI: numerical computation graph API, OpenAI-compatible inference engine, INT8 optimizations and model/tooling for compact, low-power deployments.

Android, Native · cactus-compute/cactus

5.4k stars

Cactus

A hybrid edge-cloud AI engine for mobile devices & wearables.

Fast & accurate: fastest inference on ARM CPUs; Cactus Quants at 4-bit match f16 accuracy
Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
Multimodal: one engine for speech, vision, and language models
Cloud fallback: automatically routes requests to cloud models when needed
Model-agnostic: custom PyTorch models can be exported to the Cactus runtime
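
To make the "zero-copy memory mapping" point more concrete, here is a minimal sketch of the general technique using only the Python standard library. It is not Cactus's implementation: the file layout (raw native-endian float32 values) and the helper names `write_demo_weights` and `read_weight` are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile


def write_demo_weights(path: str, count: int) -> None:
    """Write `count` native-endian float32 values (0.0, 1.0, ...) to a file."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{count}f", *(float(i) for i in range(count))))


def read_weight(path: str, index: int) -> float:
    """Read one float32 through a read-only memory map.

    The OS pages data in lazily on first access, and memoryview.cast
    reinterprets the mapped bytes as float32 without copying the file
    into a separate buffer.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as raw, raw.cast("f") as weights:
            return weights[index]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_demo_weights(demo_path, 1024)
    print(read_weight(demo_path, 10))
```

Only the pages that are actually touched need to be resident in memory, which is the usual reason memory mapping keeps RAM use low for large weight files.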

┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for text, speech, and vision.
└─────────────────┘     
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph ensures 10x lower RAM 
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Fastest ARM SIMD kernels (Apple, Samsung, Pixel, etc)
└─────────────────┘     
         │
┌─────────────────┐
│ Cactus Quants   │ ←── Cactus Quants at 4-bit uniform matches f16.
└─────────────────┘  
         │
┌─────────────────┐
│Cactus Transpiler│ ←── Transpiles custom PyTorch model to Cactus.
└─────────────────┘
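
As a rough illustration of what "4-bit uniform" quantization means, the sketch below maps floats onto 16 evenly spaced levels and packs two codes per byte. This is plain min/max uniform quantization written for illustration; it is not the scheme Cactus Quants actually uses, and the function names are hypothetical.

```python
from typing import List, Tuple

LEVELS = 16  # 4 bits -> 2**4 distinct codes


def quantize_4bit(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map each weight to the nearest of 16 evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (LEVELS - 1) or 1.0  # constant tensors: any scale works
    codes = [min(LEVELS - 1, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights from their 4-bit codes."""
    return [lo + c * scale for c in codes]


def pack_nibbles(codes: List[int]) -> bytes:
    """Store two 4-bit codes per byte (low nibble first)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


if __name__ == "__main__":
    w = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, lo = quantize_4bit(w)
    print(codes, dequantize_4bit(codes, scale, lo), pack_nibbles(codes))
```

With this scheme each restored weight differs from the original by at most half a quantization step (`scale / 2`), and storage drops to half a byte per weight plus the two floats `scale` and `lo`.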

Quick Demo (Mac)

Step 1: brew install cactus-compute/cactus/cactus
Step 2: cactus run

Cactus Engine

Example response from Gemma3-270m

Cactus Graph

Benchmarks

LLM: Gemma-4-E2B-CQ4 (CPU, no speculative decode), 1k-prefill tps / 100-decode tps
VLM: Gemma-4-E2B-CQ4 (NPU prefill, CPU decode), 256px input, latency / decode tps
Transcribe: Parakeet-TDT-0.6B-CQ4 (NPU prefill, CPU decode), 20s audio, latency / decode tps
A missing latency value means the device has no NPU support.
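
For readers unfamiliar with the "prefill tps / decode tps" notation, the sketch below shows how such a figure could be computed from wall-clock timings. It assumes "1k" means 1000 prompt tokens; the helper names and the timings in the usage line are illustrative, not measured results.

```python
def tokens_per_second(num_tokens: int, seconds: float) -> float:
    """Throughput in tokens per second for one phase of generation."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def format_llm_cell(prefill_seconds: float, decode_seconds: float,
                    prefill_tokens: int = 1000, decode_tokens: int = 100) -> str:
    """Format an LLM result as 'prefill tps / decode tps'."""
    prefill = tokens_per_second(prefill_tokens, prefill_seconds)
    decode = tokens_per_second(decode_tokens, decode_seconds)
    return f"{prefill:.0f} / {decode:.0f}"


if __name__ == "__main__":
    # Made-up timings: 1000 prompt tokens in 4 s, 100 generated tokens in 5 s.
    print(format_llm_cell(4.0, 5.0))
```

Prefill and decode are reported separately because prefill processes the whole prompt in parallel while decode produces one token at a time, so the two throughputs can differ by an order of magnitude.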

Supported Models

Any Hugging Face model can be converted using cactus convert [HF-Name], though this is experimental.
The Liquid, Gemma, Whisper, Parakeet, and Qwen model families are the most thoroughly tested.
Some models have been pre-uploaded; just run cactus download [HF-Name].
cactus run [HF-Name] first downloads or converts the model if it is not found locally.
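
If you want to script these commands, a thin wrapper like the one below is one option. It uses only the three subcommands named above with a single model-name argument; no additional flags are assumed, and the helper names and the `org/model-name` placeholder are hypothetical.

```python
import shutil
import subprocess
from typing import List

SUBCOMMANDS = ("convert", "download", "run")


def cactus_command(subcommand: str, hf_name: str) -> List[str]:
    """Build the argument list for `cactus <subcommand> <HF-Name>`."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    if not hf_name:
        raise ValueError("hf_name must not be empty")
    return ["cactus", subcommand, hf_name]


def run_cactus(subcommand: str, hf_name: str) -> int:
    """Run the command and return its exit code."""
    if shutil.which("cactus") is None:
        raise FileNotFoundError("cactus CLI not found on PATH")
    return subprocess.run(cactus_command(subcommand, hf_name), check=False).returncode


if __name__ == "__main__":
    # Placeholder model name; substitute a real Hugging Face identifier.
    print(cactus_command("run", "org/model-name"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with model names.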

Learn More

Bindings

Using this repo

Maintaining Organisations

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones \& Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

N.B.: Scroll all the way up and click the shields links for resources!

Device	LLM (prefill / decode tps)	VLM (latency / decode tps)	Transcribe (latency / decode tps)	RAM
Mac M4 Pro	324 / 39	1.2s / 48	0.2s / 10.6M	1385 MB
Mac M3 Pro	390 / 26	2.76s / 28.06	0.32s / 2.29M	1376 MB
iPhone 17 Pro	-	-	-	-
iPhone 13 Mini	-	-	-	-
Galaxy S26	248 / 21	- / 16	- / 5.7M	-
Galaxy A17 5G	-	-	-	-
Pixel 10 Pro	-	-	-	-
Pixel 6a	-	-	-	-
Raspberry Pi 5	-	-	-	-

Component	Language	Description
Cactus Engine	C	Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, vector index, cloud handoff
Cactus Graph	C++	Tensor operations, matrix multiplication, attention, normalization, activation functions
Cactus Kernels	C++	ARM NEON SIMD kernels for matmul, attention, convolution, quantization, DSP, image processing
Cactus Quants	C++	Rotation-and-codebook quantization from 4-bit to 1-bit for all weight tensors
Cactus Hybrid	C/Python	Route hard queries to the cloud automatically based on local model confidence
Cactus Transpiler	Python	Convert any PyTorch model to a Cactus runtime graph for on-device inference
Python Package	Python	Python package and CLI
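
The Cactus Hybrid entry describes routing hard queries to the cloud based on local model confidence. The sketch below shows one simple way such routing could work: a fixed confidence threshold. How Cactus actually scores confidence is not specified here, so `LocalResult`, `route`, the 0.7 threshold, and the stub models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LocalResult:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def route(prompt: str,
          local_model: Callable[[str], LocalResult],
          cloud_model: Callable[[str], str],
          threshold: float = 0.7) -> Tuple[str, str]:
    """Answer locally when confident enough, otherwise hand off to the cloud.

    Returns a (source, text) pair where source is "local" or "cloud".
    """
    result = local_model(prompt)
    if result.confidence >= threshold:
        return "local", result.text
    return "cloud", cloud_model(prompt)


if __name__ == "__main__":
    # Stub models standing in for real on-device and cloud inference.
    def local(prompt: str) -> LocalResult:
        return LocalResult(text="4", confidence=0.95 if "2+2" in prompt else 0.2)

    def cloud(prompt: str) -> str:
        return "cloud answer"

    print(route("What is 2+2?", local, cloud))
    print(route("Summarize this contract.", local, cloud))
```

The local model always runs first in this design, so easy queries never leave the device and the cloud is only paid for when the local answer looks unreliable.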