Model Inference
Engines and platforms for running and serving AI models (self-hosted or managed)
18 tools
BentoML
Open Source: Platform for building, shipping, and scaling AI inference services. Supports any ML framework, with GPU autoscaling and deployment to any cloud or self-hosted environment.
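A minimal sketch of a BentoML 1.x service, assuming the current @bentoml.service / @bentoml.api decorator API; the service name and placeholder logic are illustrative:

```python
import bentoml

# Illustrative service: a real one would load a model once in __init__
# and run actual inference in the endpoint method.
@bentoml.service(resources={"gpu": 1})
class Summarizer:
    @bentoml.api
    def summarize(self, text: str) -> str:
        # Placeholder standing in for real model inference.
        return text[:100]
```

Running `bentoml serve` against this file should start a local HTTP server exposing the endpoint.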
Cerebras Inference
Freemium: Inference API built on Cerebras Wafer-Scale Engine hardware, delivering extremely high token throughput. Achieves 2,000+ tokens per second on popular open-weight models.
Cloudflare Workers AI
Freemium: Serverless GPU inference at the network edge. Run LLMs, image generation, speech, and embedding models with no infrastructure to manage.
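A hedged sketch of calling Workers AI over its REST API; the account ID, token, and model slug are placeholders, and the URL follows Cloudflare's documented /ai/run pattern:

```python
import os
import requests

# Placeholders: supply your own Cloudflare account ID and API token.
ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
API_TOKEN = os.environ["CF_API_TOKEN"]

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"messages": [{"role": "user", "content": "Hello from the edge"}]},
)
print(resp.json())
```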
Fireworks AI
Freemium: High-performance inference platform for open-source models with compound AI system support. Offers serverless and dedicated deployments.
Groq
Freemium: Cloud inference API providing access to open-source LLMs, speech, and TTS models via an OpenAI-compatible endpoint. Uses custom LPU (Language Processing Unit) hardware for very low-latency generation.
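Because the endpoint is OpenAI-compatible, the standard OpenAI Python SDK works with a swapped base_url; the model name below is illustrative:

```python
from openai import OpenAI

# Point the OpenAI SDK at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",  # placeholder
)

chat = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # illustrative model name
    messages=[{"role": "user", "content": "Explain LPUs in one sentence."}],
)
print(chat.choices[0].message.content)
```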
Lepton AI
Freemium: Cloud platform for running AI inference with a Pythonic SDK and OpenAI-compatible endpoints. Built by ex-Meta AI researchers.
llama.cpp
Open Source: C/C++ LLM inference engine. Runs quantized models on CPU and GPU with minimal dependencies; home of the GGUF model format standard.
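A minimal local-inference sketch using the community llama-cpp-python bindings; the GGUF file path is a placeholder for any quantized model you have downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a local GGUF model; the path and context size are illustrative.
llm = Llama(model_path="./models/llama-3.1-8b-q4_k_m.gguf", n_ctx=4096)

out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```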
llamafile
Open Source: Mozilla project that bundles an LLM and runtime into a single cross-platform executable (llama.cpp + Cosmopolitan Libc). One file runs locally on macOS, Windows, Linux, and the BSDs with no installation.
LocalAI
Open Source: Self-hosted, OpenAI-API-compatible inference server for running LLMs, image generation, and audio models on consumer-grade hardware, no GPU required.
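Since LocalAI mirrors the OpenAI API, existing clients only need a new base_url; port 8080 is LocalAI's default, and the model name must match one configured on your server:

```python
from openai import OpenAI

# LocalAI accepts any API key string since auth is handled locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

chat = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # must match a locally configured model
    messages=[{"role": "user", "content": "Hello, local model!"}],
)
print(chat.choices[0].message.content)
```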
MLC LLM
Open Source: Framework for compiling and running LLMs natively on any hardware, including mobile, edge, and browser. Uses machine learning compilation (built on Apache TVM) to optimize models for each target platform.
Predibase
Paid: Serverless fine-tuning and inference platform for open LLMs using LoRA adapters. Train and serve hundreds of custom LoRA adapters from a single GPU deployment.
Ray Serve
Open Source: Scalable model-serving library built on Ray. Composes multiple ML models and Python functions into a single production endpoint, with autoscaling and fractional-GPU support.
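A minimal Ray Serve deployment sketch; the replica count and echo logic are illustrative:

```python
from ray import serve  # pip install "ray[serve]"

# One deployment behind an HTTP endpoint; Serve handles replication.
@serve.deployment(num_replicas=2)
class Echo:
    async def __call__(self, request):
        body = await request.json()
        return {"echo": body.get("text", "")}

serve.run(Echo.bind())  # serves on http://localhost:8000 by default
```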
Replicate
Freemium: Cloud platform for running, fine-tuning, and deploying machine learning models via API. Offers thousands of community and official models with pay-per-use pricing.
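A short sketch using the official Python client; the model identifier is one public example, and any "owner/name" slug works:

```python
import replicate  # pip install replicate; set REPLICATE_API_TOKEN

# Run a hosted model by its slug; inputs depend on the model's schema.
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",
    input={"prompt": "Write a haiku about inference."},
)
print("".join(output))  # language models stream output as text chunks
```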
TensorRT-LLM
Open Source: Library for optimizing large language model inference on NVIDIA GPUs. Leverages deep hardware-software co-design, with optimized attention kernels, in-flight batching, and low-precision quantization.
Text Generation Inference
Open Source: Hugging Face's production-ready inference server. Optimized for Transformer models with tensor parallelism support.
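A sketch of querying a running TGI server with huggingface_hub's InferenceClient; the localhost URL assumes a server you have launched yourself:

```python
from huggingface_hub import InferenceClient

# Point the client at your own TGI deployment (URL is an assumption).
client = InferenceClient("http://localhost:8080")

print(client.text_generation("What is tensor parallelism?", max_new_tokens=64))
```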
Together AI
Freemium: Cloud inference platform for running open-source and proprietary AI models via serverless or dedicated endpoints. Differentiates on inference speed and cost for open models.
vLLM
Open Source: High-throughput LLM inference and serving engine. Uses PagedAttention for efficient KV-cache memory management and continuous batching for high GPU utilization.
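An offline batch-inference sketch with vLLM's Python API; the model identifier is illustrative:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Load any Hugging Face model identifier you have access to.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["What is PagedAttention?"], params):
    print(out.outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server (`vllm serve`) for online serving of the same models.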
Xinference
Open Source: Platform for running inference with LLMs, embedding models, multimodal models, and more on cloud or on-premises infrastructure.