π‘οΈ Circuit Breaker in Otoroshi
In Otoroshi, the circuit breaker is built into the gateway to protect both your users and your infrastructure. When a backend starts failing, timing out, or returning too many errors, Otoroshi doesnβt endlessly retry: it tracks errors over time, and if thresholds are exceeded, it βopens the circuit.β While the circuit is open, requests are blocked before reaching the failing backend, preventing overload. After a cooldown, Otoroshi will automatically test the backend again to decide whether to close the circuit and resume traffic
Full Request Lifecycle with Otoroshi Circuit Breaker
[ Start Request ]
|
|-- Try to connect ----------------------------X if > 10s [β± connectionTimeout]
|
|-- Send request ------------------------------|
| |
| |-- Wait response ------------------X if > 30s [β± callTimeout]
| |
| |-- If streaming --------------------------X if >120s inactivity [β± callAndStreamTimeout]
|
|-- If failure (β) β retry with delays:
| Retry1 after 50ms
| Retry2 after 100ms
| Retry3 after 200ms ...
|
|-- All retries + calls cannot exceed --------X [β± globalTimeout = 30s]
|
|-- If connection stays open but idle --------X after 60s [β± idleTimeout]
|
|-- Errors tracked in 2s windows
| If >20 consecutive β errors β Circuit opens π
|
[ End Request or Circuit Opened ]
Parameters Overview
| Parameter | Example Value | Meaning |
| retries | 2 | Number of extra attempts after a failure |
| retryInitialDelay | 50 ms | Delay before the first retry |
| backoffFactor | 2 | Exponential growth of retry delays (50ms β 100ms β 200ms β¦) |
| maxErrors | 20 | Consecutive errors tolerated before opening the circuit |
| callTimeout | 30 000 ms | Max time to receive a full response (non-streaming) |
| callAndStreamTimeout | 120 000 ms | Max time to handle a streaming response (SSE, gRPC, WebSocket) |
| connectionTimeout | 10 000 ms | Max time to establish TCP/SSL connection |
| idleTimeout | 60 000 ms | Max idle time before closing a keep-alive connection |
| globalTimeout | 30 000 ms | Max total request duration including retries |
| sampleInterval | 2000 ms | Observation window for errors/successes |
Retries
- Definition: Number of additional attempts after a failure.
- Example: If
retries = 1and your call fails once (timeout, 5xx, etc.), Otoroshi retries one more time. - Use case: Helps absorb temporary errors (like a short backend overload).
Parameters:
retries = 2
retryInitialDelay = 50ms
backoffFactor = 2
globalTimeout = 30 000ms
Timeline:
t=0ms 50ms 100ms 200ms 30 000ms
|--- Req1 β --| |--- Req2 β --| |--- Req3 β --| [β± Global Timeout]
Explanation:
Req1 fails β wait 50ms β Req2 fails β wait 100ms β Req3 fails
Even with retries, the total time cannot exceed 30s.
β Max Errors
- Definition: Maximum number of consecutive errors before Otoroshi opens the circuit.
- Example: If your backend keeps returning
500, after 20 such errors, Otoroshi will stop calling it for a while β the circuit is opened to protect your users and infrastructure.
Retry Initial Delay
- Definition: Time to wait before the first retry attempt.
- Example: With
50 ms, if a request fails, the retry is triggered after 50 ms, not immediately.
Backoff Factor
- Definition: Exponential growth factor for retry delays.
- Example with initial delay = 50 ms:
- Retry 1 β after 50 ms
- Retry 2 β after 100 ms
- Retry 3 β after 200 ms β¦
- Purpose: Prevents hammering a backend thatβs already in trouble.
Call Timeout
- Definition: Maximum time to receive a full response from the backend (non-streaming).
- Example: If your API takes more than 30s for a
GET, Otoroshi will terminate the request.
Parameters:
callTimeout = 30 000ms
Timeline:
t=0ms 30 000ms
|---- Waiting for full response ----------X [β± Call Timeout]
Explanation:
If the backend does not send the complete response within 30s β Otoroshi closes the request.
Call and Stream Timeout
- Definition: Maximum time for handling a streaming response (SSE, WebSocket, gRPC stream).
- Example: If a log streaming API runs continuously, the connection is cut after 2 minutes of inactivity.
Parameters:
callAndStreamTimeout = 120 000ms
Timeline:
0ms 20s 45s 70s 90s 120s
|Msg π’| |Msg π’| |Msg π’| |Msg π’| |Msg π’| ---- inactivity ----X [β± Stream Timeout]
Explanation:
If no data is received for 120s β Otoroshi stops the stream.
Connection Timeout
- Definition: Maximum time to establish a TCP/SSL connection.
- Example: If the backend takes longer than 10s to accept, Otoroshi gives up.
Parameters:
connectionTimeout = 10 000ms
Timeline:
t=0ms 10 000ms
|---- Trying to connect ---------------X [β± Connection Timeout]
Explanation:
If the backend does not accept the TCP/SSL connection within 10s β Otoroshi aborts.
Idle Timeout (Akka http client only)
- Definition: Maximum time a connection can stay open without activity.
- Example: If a keep-alive connection has no traffic for 60s, Otoroshi closes it.
Parameters:
idleTimeout = 60 000ms
Timeline:
0ms 12s 60s
|--- Activity π’ ---| ....... no activity .........X [β± Idle Timeout]
Explanation:
A keep-alive connection with no data for 60s is closed β prevents zombie sockets.
Global Timeout
- Definition: Maximum total time a request (including retries) can take.
Distinction between callTimeout and globalTimeout
Parameters:
retries = 2
retryInitialDelay = 50ms
backoffFactor = 2
callTimeout = 10 000ms
globalTimeout = 25 000ms
Timeline:
t=0ms 10 000ms 20 000ms 25 000ms
|-- Attempt1 β |-- Attempt2 β |-- Attempt3 β | [β± Global Timeout reached]
Explanation:
- Each attempt cannot exceed callTimeout (10s) β X marks when a single call fails.
- All retries combined cannot exceed globalTimeout (25s).
- After 25s, no further attempts are made, even if retries remain.
- callTimeout stops a single request attempt if it takes too long.
- globalTimeout stops all retries combined, enforcing a hard upper limit on the total request duration.
Sample Interval
- Definition: Specify the sliding window time for the circuit breaker in milliseconds, after this time, error count will be reseted
Parameters:
sampleInterval = 2000ms
maxErrors = 20
Timeline:
Errors: β β β β β β β β β β β β β β β β β β β β
Time β |-- 2s window --|-- 2s window --|-- 2s window --|
Explanation:
After 20 consecutive errors in observed intervals β Circuit opens
Otoroshi stops calling the backend temporarily to protect users & infra.
Healthcheck on routes
Otoroshi continuously checks the health of backend services to decide whether they are safe to receive traffic. The circuit breaker uses these results to move between healthy (green), degraded (yellow), and unhealthy (red) states.
Parameters
- Enabled: Enable or disable health checks.
- URL: The endpoint on the backend that Otoroshi will call to verify health.
- Timeout: Maximum time allowed for the health check request (**default: 5 seconds**).
- Healthy statuses: By default, HTTP response codes in the range 200β499 (excluding 500) are considered healthy.
- Unhealthy statuses: HTTP status codes <199 or >500 are considered unhealthy.
- Per-route overrides: Both healthy and unhealthy status code lists can be overridden for each route, allowing fine-grained control depending on the service behavior.
Custom Header Validation
For each health check, Otoroshi sends a header:
Otoroshi-Health-Check-Logic-Test: <numeric_value>
where <numeric_value> is a number dynamically generated by Otoroshi for that specific request.
Your backend must respond with:
Otoroshi-Health-Check-Logic-Test-Result: <numeric_value> + 42
Example:
- Request header β
Otoroshi-Health-Check-Logic-Test: 735 - Expected response header β
Otoroshi-Health-Check-Logic-Test-Result: 777
Outcomes
Green (healthy)
- Backend responds within the timeout.
- Status code is in the healthy range (200β499).
- Header
Otoroshi-Health-Check-Logic-Test-Resultis present and correct.
Yellow (degraded)
Backend responds, but the response header is missing, malformed, or contains an incorrect value.
Red (unhealthy)
Backend times out or returns a status code <199 or >500.
When a backend is marked red and the flag BlockOnRed is enabled, the circuit breaker will open, cutting traffic to that host until it recovers.
The Block on red flag can be enabled in the healthcheck section of the backend configuration or by using the corresponding environment variable.