Glossary

AI Inference Terms & Definitions

Key terms and concepts in AI inference, GPU orchestration, and data center platforms.

AI Inference

AI inference is the process of running a trained machine learning model to generate predictions or outputs from new input data. Unlike training, which requires massive compute to build the model, inference uses the trained model to produce results, such as generating text, classifying images, or producing embeddings, in real time or near real time.
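The distinction is easiest to see in code. Below is a minimal sketch using scikit-learn (the framework choice is arbitrary): the compute-heavy `fit` call is training, and the cheap `predict` call on unseen data is inference.

```python
# Minimal training-vs-inference sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Training: the compute-heavy step that builds the model (done once).
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inference: running the trained model on new input to get a prediction.
new_sample = [[5.1, 3.5, 1.4, 0.2]]      # one unseen measurement
print(model.predict(new_sample))          # predicted class label
print(model.predict_proba(new_sample))    # class probabilities
```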

AI Inference Platform

An AI inference platform is a complete software system for deploying, managing, and monetizing AI inference services. It combines model serving, API management, multi-tenant billing, usage metering, GPU orchestration, and administrative tools into a unified platform that enables operators to offer AI inference as a service to their customers.

GPU Orchestration

GPU orchestration is the automated management of GPU resources across a computing cluster, including workload scheduling, resource allocation, scaling, and health monitoring. A GPU orchestration system decides which workloads run on which GPUs, manages model loading and unloading, and ensures efficient utilization of expensive GPU hardware.
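One concrete slice of that job is deciding which models stay resident in GPU memory. Here is a simplified sketch, assuming a single GPU and made-up model names and sizes, of load-on-demand with least-recently-used eviction:

```python
# Sketch of one orchestration concern: loading models on demand and
# evicting the least-recently-used model when GPU memory runs low.
# The model names and sizes below are hypothetical placeholders.
from collections import OrderedDict

class GpuModelCache:
    """Tracks which models are resident on one GPU, evicting LRU models."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.resident: OrderedDict[str, float] = OrderedDict()  # model -> size

    def ensure_loaded(self, model: str, size_gb: float) -> None:
        if model in self.resident:
            self.resident.move_to_end(model)   # mark as recently used
            return
        # Evict least-recently-used models until the new one fits.
        while sum(self.resident.values()) + size_gb > self.capacity_gb:
            evicted, _ = self.resident.popitem(last=False)
            print(f"unloading {evicted} to free GPU memory")
        self.resident[model] = size_gb
        print(f"loading {model} ({size_gb} GB)")

cache = GpuModelCache(capacity_gb=80.0)        # e.g. one 80 GB GPU
cache.ensure_loaded("llama-8b", 16.0)
cache.ensure_loaded("embed-small", 2.0)
cache.ensure_loaded("llama-70b", 70.0)         # forces eviction of llama-8b
```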

GPU Scheduling

GPU scheduling is the process of assigning compute workloads to specific GPUs in a cluster based on resource requirements, availability, priority, and optimization goals. A GPU scheduler decides which model runs on which GPU, when to preempt lower-priority work, and how to balance load across heterogeneous GPU hardware.
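As an illustration, here is a toy version of one common policy, best-fit placement: among GPUs with enough free memory, choose the one with the least headroom, which keeps larger GPUs free for larger models. GPU names and sizes are illustrative; a production scheduler would also weigh priority, preemption, and locality.

```python
# Toy best-fit placement policy for assigning a workload to a GPU.
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    free_gb: float

def schedule(workload_gb: float, gpus: list[Gpu]) -> Gpu | None:
    """Return the best-fit GPU for a workload, or None if nothing fits."""
    candidates = [g for g in gpus if g.free_gb >= workload_gb]
    if not candidates:
        return None  # a real scheduler might queue or preempt here
    best = min(candidates, key=lambda g: g.free_gb)  # tightest fit wins
    best.free_gb -= workload_gb
    return best

cluster = [Gpu("a100-0", free_gb=40.0), Gpu("a100-1", free_gb=24.0)]
chosen = schedule(16.0, cluster)
print(chosen)  # -> Gpu(name='a100-1', free_gb=8.0)
```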

Model Serving

Model serving is the process of deploying trained machine learning models and making them available for inference requests via an API or other interface. A model serving system handles model loading into GPU memory, request batching, response generation, and lifecycle management including versioning, updates, and retirement.
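The centerpiece of most serving systems is dynamic batching: hold incoming requests for a few milliseconds, then run them through the model as one batch so a single forward pass serves many callers. A stripped-down sketch, with a stand-in function in place of a real model:

```python
# Stripped-down dynamic batching loop; fake_model stands in for a GPU model.
import queue
import threading
import time

pending: queue.Queue = queue.Queue()

def fake_model(batch: list[str]) -> list[str]:
    # Stand-in for a batched GPU forward pass.
    return [f"output for {prompt}" for prompt in batch]

def batching_loop(max_batch: int = 8, max_wait_s: float = 0.01) -> None:
    while True:
        batch = [pending.get()]               # block until one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(pending.get(timeout=max(deadline - time.monotonic(), 0)))
            except queue.Empty:
                break
        for result in fake_model(batch):      # one pass answers the whole batch
            print(result)

threading.Thread(target=batching_loop, daemon=True).start()
for i in range(5):
    pending.put(f"prompt-{i}")
time.sleep(0.1)                               # let the batcher drain the queue
```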

Multi-Tenant GPU

Multi-tenant GPU refers to the practice of sharing GPU resources across multiple independent customers (tenants) while maintaining strict isolation between them. Each tenant operates as if they have dedicated resources, with separate billing, API access, usage tracking, and security boundaries — all running on shared GPU infrastructure.
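A sketch of the bookkeeping half of that isolation, with hypothetical tenant names and quotas: every unit of usage is recorded against exactly one tenant, even though the underlying GPUs are shared.

```python
# Per-tenant usage ledger sketch; tenants and quotas are illustrative.
from collections import defaultdict

class TenantLedger:
    def __init__(self, quotas: dict[str, int]):
        self.quotas = quotas                          # tenant -> token quota
        self.used: dict[str, int] = defaultdict(int)  # tenant -> tokens used

    def record(self, tenant: str, tokens: int) -> None:
        if tenant not in self.quotas:
            raise PermissionError(f"unknown tenant {tenant!r}")  # isolation boundary
        if self.used[tenant] + tokens > self.quotas[tenant]:
            raise RuntimeError(f"{tenant} exceeded its quota")
        self.used[tenant] += tokens                   # billed to this tenant only

ledger = TenantLedger({"acme": 1_000_000, "globex": 250_000})
ledger.record("acme", 1_500)     # acme's usage never touches globex's books
print(ledger.used["acme"])       # -> 1500
```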

OpenAI-Compatible API

An OpenAI-compatible API is an inference endpoint that follows the same request and response format as OpenAI's API specification. Applications built for the OpenAI API can switch to an OpenAI-compatible endpoint by changing only the base URL and API key; no other code changes are required. This compatibility enables portability between AI providers.
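In practice, that means client code like the following works unchanged against any compatible provider; only `base_url` and `api_key` differ. The endpoint URL and model name below are placeholders, not a real service.

```python
# Pointing the official OpenAI Python client at a compatible provider.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # was https://api.openai.com/v1
    api_key="YOUR_PROVIDER_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",               # whatever the provider serves
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```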

Serverless Inference

Serverless inference is a model serving architecture where the infrastructure provider automatically manages compute resources for AI inference requests. Users send requests to an API endpoint and receive responses without provisioning, managing, or scaling GPU servers. The platform handles model loading, scaling, and resource allocation transparently.

Token Metering

Token metering is the practice of measuring AI inference usage at the token level — counting individual input tokens (the prompt) and output tokens (the generated response) separately. This granular metering enables usage-based pricing models where customers are billed based on their actual compute consumption rather than flat subscription fees.
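The underlying arithmetic is simple. A sketch with hypothetical per-1K-token prices:

```python
# Billing arithmetic behind token metering; prices are illustrative.
PRICE_PER_1K_INPUT = 0.0005    # hypothetical $/1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.0015   # hypothetical $/1K completion tokens

def cost(input_tokens: int, output_tokens: int) -> float:
    """Price input and output tokens separately, then sum."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)

# e.g. a request with a 1,200-token prompt and a 300-token response:
print(f"${cost(1200, 300):.6f}")   # -> $0.001050
```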

White-Label AI Platform

A white-label AI platform is software that enables companies to offer AI services under their own brand without building the underlying technology. The platform provider handles model serving, billing, API management, and infrastructure orchestration, while the operator customizes branding, pricing, and customer experience.