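Distributed Health Checks: Customizing Spring Boot Actuator

Business value: health checks let the system detect problems and recover on its own, which directly supports the 99% inventory accuracy target from the series introduction. An unstable system cannot keep inventory accurate.

Introduction: Why Do We Need Health Checks?

In a microservice architecture, a single service may depend on several external components: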

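Component	Purpose	Impact When Down
PostgreSQL	Primary database	Orders cannot be read or written
Redis	Cache	Performance degrades
Kafka	Message queue	Asynchronous processing stops
Solr	Search engine	Orders cannot be searched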

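Problem: out of the box, Kubernetes only knows whether the container process is running, and a basic HTTP probe only confirms that an endpoint responds. Neither tells you whether the database is healthy.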

Spring Boot Actuator Health Checks

Basic Configuration

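# application.yml
management:
  endpoints:
    web:
      base-path: /
      exposure:
        include: health, info, metrics

  endpoint:
    health:
      show-details: always
      show-components: always
      probes:
        # Exposes /health/liveness and /health/readiness.
        # Auto-enabled on Kubernetes; set explicitly for local runs.
        enabled: true

  health:
    # Enable the built-in health checks for each component
    db:
      enabled: true
    redis:
      enabled: true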

Health Check Endpoints

Endpoint	Purpose	Used By
/health	Full health status	Monitoring systems
/health/liveness	Liveness check	K8s liveness probe
/health/readiness	Readiness check	K8s readiness probe
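A probe only sees the HTTP status code of these endpoints, not the JSON body. The sketch below models that hand-off with plain Java: Actuator's default mapping returns 503 for DOWN and OUT_OF_SERVICE and 200 for everything else, and a Kubernetes httpGet probe counts any code from 200 to 399 as success. The class and method names (`ProbeHttpMapping`, `httpCodeFor`, `probeSucceeds`) are hypothetical and exist only for illustration.

```java
import java.util.List;

final class ProbeHttpMapping {

    /** HTTP code the health endpoint returns for a status under Actuator's default mapping. */
    static int httpCodeFor(String status) {
        switch (status) {
            case "DOWN":
            case "OUT_OF_SERVICE":
                return 503;
            default:
                return 200; // UP, UNKNOWN, and any unmapped status
        }
    }

    /** Kubernetes httpGet probes treat codes from 200 to 399 as success. */
    static boolean probeSucceeds(int httpCode) {
        return httpCode >= 200 && httpCode < 400;
    }

    public static void main(String[] args) {
        for (String status : List.of("UP", "UNKNOWN", "OUT_OF_SERVICE", "DOWN")) {
            int code = httpCodeFor(status);
            System.out.println(status + " -> HTTP " + code + " -> probe ok: " + probeSucceeds(code));
        }
    }
}
```

One consequence worth noting: an indicator that still reports UNKNOWN does not fail a probe by itself.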

Custom Health Indicators

Kafka Health Check

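// Package names as in Spring Boot 2.x/3.x
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.kafka.clients.admin.AdminClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    @Value("${kafka.bootstrap-servers}")
    private String bootstrapServers;

    // Latest result; stays UNKNOWN until the first background check completes
    private final AtomicReference<Health> cachedHealth =
            new AtomicReference<>(Health.unknown().build());

    @Override
    public Health health() {
        // Never touches the network: returns the cached result only
        return cachedHealth.get();
    }

    /**
     * Checks Kafka periodically on a scheduler thread so the health endpoint never blocks.
     * Requires @EnableScheduling on a configuration class.
     */
    @Scheduled(fixedRate = 30000) // check every 30 seconds
    public void checkHealth() {
        try {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("request.timeout.ms", "5000");

            try (AdminClient admin = AdminClient.create(props)) {
                admin.listTopics().names().get(5, TimeUnit.SECONDS);
            }

            cachedHealth.set(Health.up()
                    .withDetail("servers", bootstrapServers)
                    .build());

        } catch (Exception e) {
            // Health.down(e) stores the exception class and message under "error".
            // withDetail("error", e.getMessage()) rejects a null message (e.g. TimeoutException).
            cachedHealth.set(Health.down(e).build());
        }
    }
}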
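The 5-second bound on the Kafka call is what keeps one slow broker from stalling the background refresh. The same idea can be sketched without Spring or Kafka, using only the JDK: run the probe on a worker thread and give up when the deadline passes. `BoundedCheck` and `runWithTimeout` are hypothetical names, and the string results stand in for Actuator's `Health` objects.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCheck {

    /** Runs the probe; returns "UP" if it finishes in time, "DOWN" if it throws or times out. */
    static <T> String runWithTimeout(Callable<T> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(probe);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "UP";
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck probe
            return "DOWN";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return "DOWN";
        } catch (Exception e) {
            return "DOWN"; // ExecutionException: the probe itself failed
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> "pong", 500));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(2_000); // simulates a slow dependency
            return "late";
        }, 100));
    }
}
```

Creating an executor per call keeps the sketch short; a long-running service would more likely reuse one.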

Solr Health Check

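// Package names as in Spring Boot 2.x/3.x
import java.util.concurrent.atomic.AtomicReference;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SolrHealthIndicator implements HealthIndicator {

    @Autowired
    private SolrClient solrClient;

    // Latest result; stays UNKNOWN until the first background check completes
    private final AtomicReference<Health> cachedHealth =
            new AtomicReference<>(Health.unknown().build());

    @Override
    public Health health() {
        return cachedHealth.get();
    }

    @Scheduled(fixedRate = 30000) // check every 30 seconds
    public void checkHealth() {
        try {
            SolrPingResponse response = solrClient.ping();
            int status = response.getStatus();

            if (status == 0) { // 0 means the ping succeeded
                cachedHealth.set(Health.up()
                        .withDetail("responseTime", response.getQTime())
                        .build());
            } else {
                cachedHealth.set(Health.down()
                        .withDetail("status", status)
                        .build());
            }

        } catch (Exception e) {
            // Health.down(e) avoids passing a possibly null message to withDetail
            cachedHealth.set(Health.down(e).build());
        }
    }
}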

Sample Health Check Response

{
    "status": "UP",
    "components": {
        "db": {
            "status": "UP",
            "details": {
                "database": "PostgreSQL",
                "validationQuery": "isValid()"
            }
        },
        "kafka": {
            "status": "UP",
            "details": {
                "servers": "kafka:9092"
            }
        },
        "redis": {
            "status": "UP",
            "details": {
                "version": "7.0.0"
            }
        },
        "solr": {
            "status": "UP",
            "details": {
                "responseTime": 5
            }
        }
    }
}
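The top-level status in the response is derived from the component statuses. By default Actuator ranks them DOWN, OUT_OF_SERVICE, UP, UNKNOWN and reports the most severe one present. A minimal sketch of that rule, assuming plain strings in place of Actuator's `Status` type (`StatusAggregation` and `aggregate` are hypothetical names):

```java
import java.util.List;
import java.util.Map;

final class StatusAggregation {

    // Most severe first, mirroring Actuator's documented default order
    private static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe status found among the components, or UNKNOWN if none match. */
    static String aggregate(Map<String, String> componentStatuses) {
        for (String candidate : ORDER) {
            if (componentStatuses.containsValue(candidate)) {
                return candidate;
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        Map<String, String> components =
                Map.of("db", "UP", "kafka", "DOWN", "redis", "UP", "solr", "UP");
        System.out.println(aggregate(components)); // prints DOWN
    }
}
```

Because UNKNOWN ranks below UP, an indicator that has not finished its first background check does not pull the overall status down.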

Kubernetes Integration

# deployment.yaml (container section of the Pod template)
spec:
  containers:
    - name: oms-service
      # Liveness probe: is the process still alive?
      livenessProbe:
        httpGet:
          path: /health/liveness
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3

      # Readiness probe: can the Pod accept traffic?
      readinessProbe:
        httpGet:
          path: /health/readiness
          port: 8080
        initialDelaySeconds: 20
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3

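Probe Type	Action on Failure	Use Case
liveness	Restarts the container	Process hung or unresponsive
readiness	Removes the Pod from Service endpoints	Temporarily unable to serve (e.g. DB connection lost)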
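Neither action fires on a single bad response. Kubernetes acts only after `failureThreshold` consecutive failures, and a success resets the count. The sketch below is a simplified model of that counter; it ignores `successThreshold`, timeouts, and timing, and `ProbeCounter` is a hypothetical name rather than anything in Kubernetes.

```java
final class ProbeCounter {

    private final int failureThreshold;
    private int consecutiveFailures;

    ProbeCounter(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Records one probe result; returns true once enough consecutive failures have accumulated. */
    boolean record(boolean success) {
        if (success) {
            consecutiveFailures = 0; // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= failureThreshold;
    }

    public static void main(String[] args) {
        ProbeCounter liveness = new ProbeCounter(3);
        boolean[] results = {false, false, true, false, false, false};
        for (boolean ok : results) {
            System.out.println("probe ok=" + ok + " -> act=" + liveness.record(ok));
        }
    }
}
```

With the values in the manifest above (`periodSeconds: 10`, `failureThreshold: 3`), a hung process is restarted only after roughly three failed liveness probes in a row.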

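Design Considerations

Why a Background Thread Plus a Cache?

The health endpoint must respond quickly (under 1 second)
Checks against external components can be slow (network latency)
Kubernetes calls the endpoint frequently (every 5 to 10 seconds)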

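Design	Description
Background check	Runs every 30 seconds without blocking the endpoint
Result cache	AtomicReference holds the latest status
Timeout	Each check times out after 5 seconds so it cannot hang
Status details	Carries details such as check time and error messages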
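The background-check-plus-cache design can be shown without Spring. The sketch below uses a JDK `ScheduledExecutorService` in place of `@Scheduled` and plain strings in place of `Health`; `CachedCheck` and its methods are hypothetical names chosen for this illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

final class CachedCheck implements AutoCloseable {

    private final AtomicReference<String> cached = new AtomicReference<>("UNKNOWN");
    private final BooleanSupplier probe;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "health-refresh");
                thread.setDaemon(true); // never keeps the JVM alive
                return thread;
            });

    CachedCheck(BooleanSupplier probe) {
        this.probe = probe;
    }

    /** Starts refreshing in the background; the first run happens immediately. */
    void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(this::refresh, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Runs the (possibly slow) probe and stores the outcome. */
    void refresh() {
        try {
            cached.set(probe.getAsBoolean() ? "UP" : "DOWN");
        } catch (RuntimeException e) {
            // An escaping exception would cancel all future runs of scheduleAtFixedRate
            cached.set("DOWN");
        }
    }

    /** What the health endpoint calls: a memory read, never a network call. */
    String status() {
        return cached.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        try (CachedCheck check = new CachedCheck(() -> true)) {
            System.out.println("before first refresh: " + check.status());
            check.start(30_000);
            Thread.sleep(200); // give the background thread time to run once
            System.out.println("after first refresh: " + check.status());
        }
    }
}
```

The trade-off is staleness: with a 30-second period, the endpoint can report a state that is up to 30 seconds old.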

Monitoring Integration

Export the health status to Prometheus:

# Health status gauge
health_check_status{component="kafka"} 1
health_check_status{component="solr"} 1
health_check_status{component="redis"} 1
health_check_status{component="db"} 1

# Check duration
health_check_duration_seconds{component="kafka"} 0.023
health_check_duration_seconds{component="solr"} 0.005
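The article does not show how these series are produced, and the metric names look like custom gauges rather than something Actuator publishes by itself. As a rough illustration only, the sketch below maps a status to a number (assuming 1 for UP and 0 for anything else) and renders lines in the Prometheus text format. In a real Spring Boot service this would more likely be a Micrometer gauge; `HealthMetrics`, `toGauge`, and `render` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HealthMetrics {

    /** Assumed encoding: 1 when the component is UP, 0 for every other status. */
    static int toGauge(String status) {
        return "UP".equals(status) ? 1 : 0;
    }

    /** Renders one health_check_status line per component in Prometheus text format. */
    static String render(Map<String, String> componentStatuses) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : componentStatuses.entrySet()) {
            out.append("health_check_status{component=\"")
               .append(entry.getKey())
               .append("\"} ")
               .append(toGauge(entry.getValue()))
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> statuses = new LinkedHashMap<>(); // keeps insertion order
        statuses.put("kafka", "UP");
        statuses.put("solr", "DOWN");
        System.out.print(render(statuses));
    }
}
```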

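Summary

Design	Effect
Custom HealthIndicator	Covers every dependency
Background execution + cache	Fast endpoint responses
K8s probe integration	Failing Pods are restarted or taken out of rotation automatically
Prometheus export	Historical trend monitoring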

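Why Not Other Approaches?

Approach	Pros	Cons	Verdict
K8s default checks only	Zero configuration	Sees only process or HTTP response status; blind to DB state	Not enough
External monitoring tool calling the API	No code changes	Sees only API responses, not internal state	Useful as a supplement
Hand-written health check API	Full control	Caching and timeouts must be handled yourself	Reinventing the wheel
Actuator + custom indicators	Well integrated, extensible	Requires learning the Spring ecosystem	First choice for Spring projects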

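Pitfalls in Practice

Pitfall 1: Slow Health Checks Get the Pod Killed

Originally the health check connected to Kafka directly, and on a slow network it took 10 seconds to respond. K8s concluded the Pod was dead and kept restarting it. Fix: move the check to a periodic background thread and have the health endpoint return only the cached result.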

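Pitfall 2: Mixing Up Liveness and Readiness

Originally both probes pointed at the same endpoint. When Kafka went down, the liveness probe failed and every Pod was restarted. The correct split: liveness checks only whether the process is still alive, and readiness checks whether it can take traffic. A Kafka outage should fail readiness (the Pod is removed from the Service), not liveness (the container is restarted). In Spring Boot, that means adding the dependency indicators to the readiness health group (management.endpoint.health.group.readiness.include) and leaving the liveness group alone.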
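The split in Pitfall 2 can be sketched as two separate decisions over the same component statuses. Which components belong in readiness is an assumption here (db and kafka only), and in a Spring Boot service the grouping would be configuration rather than code. `ProbeGroups`, `ready`, and `live` are hypothetical names.

```java
import java.util.Map;
import java.util.Set;

final class ProbeGroups {

    // Assumed grouping: only these external dependencies gate traffic
    private static final Set<String> READINESS_MEMBERS = Set.of("db", "kafka");

    /** Ready only when no readiness member is DOWN. */
    static boolean ready(Map<String, String> componentStatuses) {
        return READINESS_MEMBERS.stream()
                .noneMatch(name -> "DOWN".equals(componentStatuses.get(name)));
    }

    /** Live whenever the process itself responds; external dependencies are deliberately ignored. */
    static boolean live(boolean processResponsive) {
        return processResponsive;
    }

    public static void main(String[] args) {
        Map<String, String> kafkaOutage = Map.of("db", "UP", "kafka", "DOWN", "redis", "UP");
        System.out.println("ready=" + ready(kafkaOutage)); // prints ready=false
        System.out.println("live=" + live(true));          // prints live=true
    }
}
```

During the Kafka outage the Pod stops receiving traffic but is not restarted, so it can recover by itself once Kafka returns.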

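Pitfall 3: Forgetting initialDelaySeconds

The application takes 30 seconds to start, but health checks began after 10 seconds. The Pod never came up and was restarted in a loop. Fix: set initialDelaySeconds comfortably above the startup time, or use a startupProbe to hold off liveness checks until the application has started.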
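A back-of-the-envelope calculation shows why the Pod in Pitfall 3 could never finish starting. If every probe fails, the restart comes at roughly the initial delay plus the time needed to accumulate the remaining failures. This is a rough model that ignores kubelet scheduling jitter and probe timeouts, and the 10-second period and threshold of 3 used for the failing case are assumptions, since the article gives only the 10-second start. `ProbeTiming` is a hypothetical helper.

```java
final class ProbeTiming {

    /** Approximate seconds after container start at which liveness triggers a restart if every probe fails. */
    static int earliestRestartSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + (failureThreshold - 1) * periodSeconds;
    }

    /** True when the application finishes starting before the earliest possible restart. */
    static boolean survivesStartup(int startupSeconds, int initialDelaySeconds,
                                   int periodSeconds, int failureThreshold) {
        return startupSeconds
                < earliestRestartSeconds(initialDelaySeconds, periodSeconds, failureThreshold);
    }

    public static void main(String[] args) {
        // Assumed original setup: probes start at 10 s, every 10 s, threshold 3
        System.out.println(earliestRestartSeconds(10, 10, 3)); // prints 30
        System.out.println(survivesStartup(30, 10, 10, 3));    // prints false
        // Values from the manifest in the Kubernetes Integration section
        System.out.println(earliestRestartSeconds(30, 10, 3)); // prints 50
        System.out.println(survivesStartup(30, 30, 10, 3));    // prints true
    }
}
```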

