Skip to main content

AI / LLM Gateway

The Otoroshi LLM Extension transforms Otoroshi into a full-featured AI Gateway. It provides a unified, OpenAI-compatible API layer in front of 50+ LLM providers, with built-in cost tracking, guardrails, caching, load balancing, and observability. The extension is distributed as a jar that plugs into Otoroshi's admin extension system.

Overview

The extension adds the following capabilities to Otoroshi:

  • Unified API: expose a single OpenAI-compatible endpoint that routes to any LLM provider
  • 50+ providers: OpenAI, Anthropic, Azure OpenAI, Mistral, Groq, Cohere, Google Gemini, Ollama, DeepSeek, X.ai, Hugging Face, Cloudflare, Scaleway, OVH, and many more
  • Multi-modal: chat completions, embeddings, image generation, audio (TTS/STT), video generation, and content moderation
  • Cost management: automatic cost tracking per request with budget enforcement per API key, user, or service
  • Guardrails: regex, LLM-based, moderation API, webhook, WASM, prompt injection detection, PII filtering, toxicity detection, and more
  • Caching: simple TTL-based cache and semantic cache using embeddings
  • Load balancing: round-robin, random, and best-response-time strategies across multiple providers
  • Provider fallback: automatic retry on a different provider when the primary fails
  • Persistent memory: session-based conversation history across requests
  • Rate limiting: token-based rate limiting per consumer
  • Observability: full audit trail of every LLM call with tokens, costs, cache status, and guardrail results

Installation

Download the extension jar and place it in Otoroshi's classpath, then declare it as an admin extension:

# Download the extension jar
curl -L -o otoroshi-llm-extension.jar \
https://github.com/cloud-apim/otoroshi-llm-extension/releases/download/xxx/otoroshi-llm-extension_2.12-xxx.jar

Start Otoroshi with the extension:

java -cp "otoroshi.jar:otoroshi-llm-extension.jar" play.core.server.ProdServerStart

Or using environment variables:

export OTOROSHI_ADMIN_EXTENSIONS_CONFIGURATIONS='[{"enabled":true}]'
java -cp "otoroshi.jar:otoroshi-llm-extension.jar" play.core.server.ProdServerStart

Once loaded, the extension registers all its entities and plugins in the Otoroshi admin API and UI.

Entities

The extension introduces several new entity types, all manageable through the admin API and UI.

AI Provider

The core entity. An AI Provider defines the connection to an LLM service, including authentication, model selection, caching strategy, guardrails, and fallback behavior.

{
"id": "openai-gpt4o",
"name": "OpenAI GPT-4o",
"description": "OpenAI GPT-4o provider",
"provider": "openai",
"connection": {
"base_url": "https://api.openai.com/v1",
"token": "vault://env/OPENAI_TOKEN",
"timeout": 180000
},
"options": {
"model": "gpt-4o",
"temperature": 0.7,
"max_tokens": 4096,
"allow_config_override": true
},
"provider_fallback": "anthropic-claude",
"cache": {
"strategy": "semantic",
"ttl": 86400000,
"score": 0.85
},
"guardrails": [
{
"type": "moderation",
"config": {
"moderation_model": "openai-moderation"
}
}
],
"guardrails_fail_on_deny": true,
"models": {
"include": ["gpt-4o.*"],
"exclude": []
},
"tags": ["production"],
"metadata": {}
}

Supported providers

Providerprovider valueNotes
OpenAIopenaiGPT-4o, GPT-4, GPT-3.5, o1, o3, etc.
AnthropicanthropicClaude 4, Claude 3.5, Claude 3
Azure OpenAIazure-openaiRequires resource_name, deployment_id, api_version
MistralmistralMistral Large, Medium, Small, Codestral
GroqgroqLlama, Mixtral on Groq hardware
CoherecohereCommand R, Command R+
Google GeminigeminiGemini Pro, Gemini Ultra
OllamaollamaLocal models via Ollama
DeepSeekdeepseekDeepSeek V3, Coder
X.aix-aiGrok models
Hugging FacehuggingfaceInference API models
CloudflarecloudflareWorkers AI models
ScalewayscalewayGenerative APIs
OVHovh-ai-endpointsAI Endpoints
Cloud Templecloud-templeLLMaaS
OpenAI-compatibleopenai-compatibleAny OpenAI-compatible API (LM Studio, vLLM, etc.)
Load BalancerloadbalancerMeta-provider distributing across others

And 30+ additional providers via the OpenAI-compatible adapter.

Connection configuration

The connection object varies by provider:

OpenAI / Most providers:

{
"base_url": "https://api.openai.com/v1",
"token": "sk-...",
"timeout": 180000
}

Azure OpenAI:

{
"resource_name": "my-azure-resource",
"deployment_id": "my-deployment",
"api_version": "2024-02-01",
"api_key": "...",
"timeout": 180000
}

Cloudflare Workers AI:

{
"account_id": "...",
"model_name": "@cf/meta/llama-3-8b-instruct",
"token": "..."
}

OpenAI-compatible (custom endpoint):

{
"base_url": "http://localhost:8000/v1",
"token": "not-needed",
"supports_tools": true,
"supports_streaming": true,
"supports_completion": true,
"headers": {},
"additional_body_params": {}
}
note

All token fields support Otoroshi's secret vaults syntax: vault://env/MY_VAR, vault://hashicorp/path/to/secret, etc.

Model options

{
"model": "gpt-4o",
"temperature": 0.7,
"top_p": 1.0,
"max_tokens": 4096,
"allow_config_override": true
}

When allow_config_override is true, clients can override model, temperature, max_tokens, etc. in their request body. When false, the provider's options are always used regardless of what the client sends.

Model constraints

You can restrict which models a provider exposes using regex patterns:

{
"models": {
"include": ["gpt-4o.*", "gpt-4-turbo.*"],
"exclude": [".*preview.*"]
}
}

Budget

Budgets enforce spending or token limits per consumer (API key, user, or service) over a time window.

{
"id": "standard-budget",
"name": "Standard tier budget",
"budget_type": "money",
"consumer_type": "api_key",
"consumer_filter": "ai_tier=standard",
"limit_amount": 50.0,
"time_window": "30days",
"alert_threshold": 80,
"metadata": {}
}
{
"id": "budget_b261a762-a0f4-4c11-9edb-3cbefae5874a",
"name": "Standard budget",
"description": "Standard Budget",
"metadata": {
"updated_at": "2025-11-06T15:54:15.157+01:00"
},
"tags": [],
"enabled": true,
"start_at": "2025-11-04T22:07:41.404+01:00",
"end_at": "2026-11-04T22:07:41.404+01:00",
"duration": {
"value": 30,
"unit": "day"
},
"limits": {
"total_tokens": 10000000,
"total_usd": 200,
"inference_tokens": null,
"inference_usd": null,
"image_tokens": null,
"image_usd": null,
"audio_tokens": null,
"audio_usd": null,
"video_tokens": null,
"video_usd": null,
"embedding_tokens": null,
"embedding_usd": null,
"moderation_tokens": null,
"moderation_usd": null
},
"scope": {
"extract_from_apikey_meta": true,
"extract_from_apikey_group_meta": true,
"extract_from_user_meta": true,
"extract_from_user_auth_module_meta": true,
"extract_from_provider_meta": true,
"apikeys": [],
"users": [],
"groups": [],
"providers": [],
"models": [],
"always_apply_rules": false,
"rules": [
{
"kind": "json-path-validator",
"path": "$.foo",
"value": "bar",
"error": null
}
],
"rules_match_mode": "all"
},
"action_on_exceed": {
"mode": "block",
"alert_on_exceed": true,
"alert_on_almost_exceed": true,
"alert_on_almost_exceed_percentage": 80
},
"kind": "ai-gateway.extensions.cloud-apim.com/AiBudget"
}
FieldDescription
budget_typemoney (USD) or tokens
consumer_typeapi_key, user, or service
consumer_filterMetadata key=value pattern to match consumers to this budget
limit_amountMaximum amount (dollars or tokens) per time window
time_windowDuration string: 1hour, 1day, 30days, 1year
alert_thresholdPercentage (0-100) at which to trigger an alert

The budget system tracks costs separately for inference, images, audio, video, embeddings, and moderation. When a consumer exceeds their budget, subsequent requests are rejected until the time window resets.

Prompt

Reusable prompt texts that can be referenced by guardrails or other configurations:

{
"id": "safety-validator",
"name": "Safety validation prompt",
"description": "Used by LLM guardrail to validate user messages",
"prompt": "Analyze the following message and determine if it is safe and appropriate. Answer only 'true' or 'false'.",
"tags": ["guardrail"],
"metadata": {}
}

Prompt Context

Injects additional context (system messages) into every request to a provider:

{
"id": "company-context",
"name": "Company context",
"context": "You are a helpful assistant for ACME Corp. Always be professional and accurate. Never share internal company data.",
"tags": [],
"metadata": {}
}

Contexts are linked to providers via:

{
"context": {
"default": "company-context",
"contexts": ["company-context", "safety-context"]
}
}

Prompt Template

Templates with variable substitution for dynamic prompt generation:

{
"id": "customer-support-template",
"name": "Customer support template",
"template": "[{\"role\": \"system\", \"content\": \"You are a support agent for @{body.company}. The customer plan is @{body.plan}.\"}, {\"role\": \"user\", \"content\": \"@{body.question}\"}]",
"tags": [],
"metadata": {}
}

Template expressions:

  • @{body.field} - fields from the request body
  • @{request.path} - request properties
  • @{properties.field} - custom properties

Persistent Memory

Enables conversation history across multiple requests, using session IDs to group messages:

{
"id": "default-memory",
"provider": "local",
"config": {
"connection": {"name": "local"},
"options": {
"kind": "MessagesSlidingWindow",
"max_messages": 50,
"session_id": "${apikey.id}"
}
}
}

The session_id field supports Otoroshi expression language. Common patterns:

  • ${apikey.id} - one conversation per API key
  • ${user.email} - one conversation per user
  • ${req.header.X-Session-Id} - client-controlled sessions
  • ${apikey.metadata.team_id} - one conversation per team

Memory kinds:

  • MessagesSlidingWindow - keeps the last N messages
  • FullHistory - keeps all messages (no limit)

Embedding Model

Configuration for embedding models, used by semantic cache and vector search:

{
"id": "openai-embeddings",
"name": "OpenAI Embeddings",
"provider": "openai",
"connection": {
"base_url": "https://api.openai.com/v1",
"token": "vault://env/OPENAI_TOKEN"
},
"options": {
"model": "text-embedding-3-small"
}
}

Supported embedding providers: OpenAI, Mistral, Cohere, Ollama, Hugging Face, Gemini, DeepSeek, X.ai, Scaleway, OVH, Azure, and more. A local AllMiniLmL6V2 model is also available with no external API calls required.

Image Model

Configuration for image generation:

{
"id": "dalle3",
"name": "DALL-E 3",
"provider": "openai",
"connection": {
"base_url": "https://api.openai.com/v1",
"token": "vault://env/OPENAI_TOKEN"
},
"options": {
"model": "dall-e-3"
}
}

Supported providers: OpenAI (DALL-E), Azure, X.ai (Grok), Luma, Leonardo.AI.

Audio Model

Configuration for text-to-speech and speech-to-text:

{
"id": "openai-tts",
"name": "OpenAI TTS",
"provider": "openai",
"connection": {
"base_url": "https://api.openai.com/v1",
"token": "vault://env/OPENAI_TOKEN"
},
"options": {
"model": "tts-1"
}
}

Supported: OpenAI (TTS/STT), Groq, ElevenLabs, Mistral (translation).

Video Model

Configuration for video generation:

{
"id": "luma-video",
"name": "Luma Video",
"provider": "luma",
"connection": {
"token": "vault://env/LUMA_TOKEN"
},
"options": {
"model": "ray-flash-2"
}
}

Moderation Model

Configuration for content moderation:

{
"id": "openai-moderation",
"name": "OpenAI Moderation",
"provider": "openai",
"connection": {
"base_url": "https://api.openai.com/v1",
"token": "vault://env/OPENAI_TOKEN"
},
"options": {
"model": "omni-moderation-latest"
}
}

Plugins

The extension provides several Otoroshi plugins that you attach to routes.

OpenAI Compatible Proxy (OpenAiCompatProxy)

The main plugin. Exposes an OpenAI-compatible chat completions API (/v1/chat/completions) backed by any configured provider. Supports streaming and non-streaming modes.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAiCompatProxy

Configuration:

{
"refs": ["provider-id-1", "provider-id-2"],
"extract_provider_from_body": true,
"selector_expr": "$.provider"
}
  • refs: list of provider IDs this route can use
  • extract_provider_from_body: when true, the client can specify "provider": "id" in the request body to select which provider to use
  • selector_expr: JsonPath expression to extract the provider ID from the request body

Request format (OpenAI-compatible):

{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Otoroshi?"}
],
"temperature": 0.7,
"max_tokens": 2000,
"stream": false,
"provider": "openai-gpt4o"
}

Response format (OpenAI-compatible):

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Otoroshi is a modern API gateway..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 50,
"total_tokens": 75
}
}

LLM Proxy (AiLlmProxy)

A simpler proxy plugin that passes the raw request body to the provider without OpenAI format wrapping. Useful for custom integrations.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.AiLlmProxy

Anthropic Compatible Proxy (AnthropicCompatProxy)

Exposes an Anthropic Messages API compatible endpoint.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.AnthropicCompatProxy

Embeddings (OpenAICompatEmbedding)

OpenAI-compatible embeddings endpoint.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatEmbedding

Request:

{
"input": "The quick brown fox",
"model": "provider-id###text-embedding-3-small",
"encoding_format": "float"
}

Image Generation (OpenAICompatImagesGen)

OpenAI-compatible image generation endpoint.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatImagesGen

Request:

{
"prompt": "A cat wearing a top hat",
"model": "provider-id###dall-e-3",
"n": 1,
"size": "1024x1024",
"quality": "hd"
}

Audio - Text to Speech (OpenAICompatTextToSpeech)

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatTextToSpeech

Audio - Speech to Text (OpenAICompatSpeechToText)

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatSpeechToText

Video Generation (OpenAICompatVideosGen)

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatVideosGen

Content Moderation (OpenAICompatModeration)

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.OpenAICompatModeration

Token Rate Limiting (LlmTokensRateLimitingValidator)

Rate limits requests based on token consumption rather than request count.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.LlmTokensRateLimitingValidator

Configuration:

{
"window_millis": 60000,
"throttling_quota": 10000,
"group_expr": "${apikey.id}"
}

Adds response headers:

  • X-Llm-Ratelimit-Max-Tokens: token limit for the consumer
  • X-Llm-Ratelimit-Consumed-Tokens: tokens used in the current window
  • X-Llm-Ratelimit-Remaining-Tokens: tokens remaining

Prompt Validator (AiPromptValidator)

Validates incoming prompts against regex patterns before they reach the LLM.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.AiPromptValidator

Configuration:

{
"allow": [".*"],
"deny": [".*password.*", ".*secret.*"]
}

Prompt Template (AiPromptTemplate)

Applies a prompt template to transform incoming requests.

Plugin reference: cp:otoroshi_plugins.com.cloud.apim.otoroshi.extensions.aigateway.plugins.AiPromptTemplate

Guardrails

Guardrails validate prompts before they are sent to the LLM (input guardrails) and responses after they are received (output guardrails). They are configured directly on the provider entity.

Available guardrail types

Regex guardrail

Matches message content against allow/deny regex patterns:

{
"type": "regex",
"config": {
"allow": ["^[a-zA-Z0-9\\s.,!?]+$"],
"deny": ["(?i)ignore previous instructions", "(?i)system prompt"]
}
}

LLM guardrail

Uses another LLM provider to validate content. The validation prompt should return true or false:

{
"type": "llm",
"config": {
"provider": "guardrail-llm-provider-id",
"prompt": "safety-validator-prompt-id"
}
}

Moderation guardrail

Uses a moderation model (e.g., OpenAI's moderation API) to check content safety:

{
"type": "moderation",
"config": {
"moderation_model": "openai-moderation-id"
}
}

Detects categories like violence, sexual content, hate speech, self-harm, etc.

Webhook guardrail

Sends content to an external HTTP service for validation:

{
"type": "webhook",
"config": {
"url": "https://my-validator.example.com/validate",
"headers": {"Authorization": "Bearer my-token"},
"ttl": 300000
}
}

The webhook receives the messages array as POST body and must respond with {"result": true} or {"result": false}.

Built-in detection guardrails

Several pre-configured guardrail types are available for common safety concerns:

TypeDetection
prompt_injectionPrompt injection attempts
pifPersonal information (PII)
secrets_leakageAPI keys, passwords, credentials
toxic_languageToxic or harmful language
racial_biasRacially biased content
gender_biasGender-biased content
personal_health_informationHealth-related PII
gibberishNonsensical content
faithfulnessResponse accuracy vs. source context

Text pattern guardrails

Content/length-based validation:

TypePurpose
wordsWord count constraints
sentencesSentence count constraints
charactersCharacter count constraints
containsMust contain specific text
semantic_containsMust semantically match content

WASM / QuickJS guardrails

Custom validation logic via WebAssembly or JavaScript:

{
"type": "wasm",
"config": {
"wasm_ref": "my-wasm-plugin-id"
}
}

Guardrail behavior

When guardrails_fail_on_deny is true on the provider, a denied request returns an error immediately. When false, the request is flagged but still processed. Guardrail results are included in the response metadata and audit events.

Caching

Simple cache

TTL-based caching. Identical requests (by SHA-512 hash of the body) return cached responses:

{
"cache": {
"strategy": "simple",
"ttl": 3600000
}
}

Semantic cache

Uses embeddings to find semantically similar past queries. Even if the wording is different, similar questions return cached answers:

{
"cache": {
"strategy": "semantic",
"ttl": 86400000,
"score": 0.85
}
}
  • score: similarity threshold (0 to 1). Higher values require closer matches.
  • Uses an embedded AllMiniLmL6V2 model by default (no external API call).
  • In-memory store with up to 5000 entries per provider.

The response includes cache metadata:

{
"cache": {
"status": "hit",
"key": "abc123...",
"ttl": 86400000,
"latency": 3
}
}

Load balancing

A loadbalancer meta-provider distributes requests across multiple backend providers:

{
"id": "multi-llm",
"name": "Multi-LLM load balancer",
"provider": "loadbalancer",
"options": {
"loadbalancing": "best_response_time",
"refs": [
{"ref": "openai-provider", "weight": 2},
{"ref": "anthropic-provider", "weight": 1},
{"ref": "mistral-provider", "weight": 1}
]
}
}

Strategies

StrategyBehavior
round_robinDistributes evenly in sequence
randomRandom provider selection
best_response_timePicks the provider with the lowest average latency

The weight field controls distribution: a provider with weight 3 receives 3x more traffic than one with weight 1.

Provider fallback

Each provider can specify a fallback provider. If the primary fails, the request is automatically retried on the fallback:

{
"id": "openai-primary",
"provider": "openai",
"provider_fallback": "anthropic-backup",
"connection": { "..." : "..." }
}

Cost tracking

The extension automatically tracks costs using a pricing database covering 1000+ models. Costs are calculated per request and reported in the response metadata:

{
"costs": {
"input_cost": 0.00025,
"output_cost": 0.001,
"reasoning_cost": 0.0,
"total_cost": 0.00125,
"currency": "dollar"
}
}

Costs are also tracked separately for:

  • Inference (chat completions)
  • Images (generation/editing)
  • Audio (TTS/STT)
  • Video (generation)
  • Embeddings
  • Moderation

All costs flow into the budget system and into Otoroshi's event exporters for analytics.

Observability

Every LLM request generates an audit event containing:

  • Consumer identity (API key, user, or service)
  • Provider and model used
  • Token usage (input, output, reasoning)
  • Cost in USD
  • Cache status (hit/miss)
  • Guardrail results (pass/deny with details)
  • Latency
  • Error details (if any)

These events are exported through Otoroshi's standard data exporters (Elasticsearch, Kafka, webhooks, etc.) for dashboarding and alerting.

MCP Integration

The extension supports the Model Context Protocol (MCP) for tool/function calling:

{
"id": "mcp-tools",
"name": "MCP Tools Connector",
"transport": {
"type": "sse",
"url": "http://mcp-server:3000/sse"
}
}

Transport types: stdio, sse, websocket, http.

API reference

All entities are accessible through Otoroshi's admin API:

EntityEndpoint
Providers/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/providers
Budgets/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/budgets
Prompts/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/prompts
Prompt Templates/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/prompt-templates
Prompt Contexts/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/prompt-contexts
Embedding Models/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/embedding-models
Image Models/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/image-models
Audio Models/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/audio-models
Video Models/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/video-models
Moderation Models/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/moderation-models
Persistent Memories/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/persistent-memories
MCP Connectors/bo/api/proxy/apis/ai-gateway.extensions.cloud-apim.com/v1/mcp-connectors

Further reading