Version: 25.3

Seqera Platform Monitoring

Enabling observability metrics

Seqera Platform provides built-in observability metrics that can be enabled by adding prometheus to the MICRONAUT_ENVIRONMENTS environment variable. This exposes a Prometheus endpoint at /prometheus on the default listen port (for example, http://localhost:8080/prometheus).
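
A minimal sketch of enabling the endpoint, assuming the service reads MICRONAUT_ENVIRONMENTS from its environment (the prod value is illustrative; keep whatever environments your installation already uses):

# Append "prometheus" to any environments already set ("prod" is illustrative)
export MICRONAUT_ENVIRONMENTS="prod,prometheus"
# After restarting the service, confirm the endpoint responds:
curl -s http://localhost:8080/prometheus | head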

Combined with infrastructure monitoring tools such as Node Exporter, these metrics give you visibility across your entire deployment.


Key metrics to monitor

Seqera Platform-specific metrics

Data Studio metrics

| Metric | Description |
| ------ | ----------- |
| data_studio_startup_time_success_seconds_sum | Time for successful Data Studio startups |
| data_studio_startup_time_success_seconds_count | Successful Data Studio startup count |
| data_studio_startup_time_failure_seconds_sum | Time for failed Data Studio startups |
| data_studio_startup_time_failure_seconds_count | Failed Data Studio startup count |

Track Data Studio startup performance to identify environment provisioning issues. Slow or failing startups impact user productivity.

Average startup time by tool:

sum by (tool) (increase(data_studio_startup_time_success_seconds_sum{app="backend", namespace="$namespace"}[$__rate_interval]))
/
sum by (tool) (increase(data_studio_startup_time_success_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval]))

Failed startup rate:

rate(data_studio_startup_time_failure_seconds_count{namespace="$namespace"}[$__rate_interval])

Error tracking

| Metric | Description |
| ------ | ----------- |
| tower_logs_errors_10secCount | Errors in last 10 seconds |
| tower_logs_errors_1minCount | Errors in last minute |
| tower_logs_errors_5minCount | Errors in last 5 minutes |

Monitor application errors across different time windows. Rolling error counts help identify transient issues versus sustained problems.

Recent error counts:

tower_logs_errors_10secCount{namespace="$namespace"}
tower_logs_errors_1minCount{namespace="$namespace"}
tower_logs_errors_5minCount{namespace="$namespace"}

Log events by severity level:

rate(logback_events_total{namespace="$namespace"}[$__rate_interval])

Infrastructure resources

CPU usage

Monitor container CPU consumption against requested resources to identify capacity issues or inefficient resource allocation.

Backend CPU usage:

rate(container_cpu_usage_seconds_total{container="backend", namespace="$namespace"}[$__rate_interval])

Compare against requested resources to determine whether the container is over- or under-provisioned:

max(kube_pod_container_resource_requests{container="backend", namespace="$namespace", resource="cpu"})
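
The two can be combined into a single utilization-versus-requests ratio (a sketch built from the series above; a sustained value near or above 1 means usage is at or beyond what was requested):

sum(rate(container_cpu_usage_seconds_total{container="backend", namespace="$namespace"}[$__rate_interval]))
/
sum(kube_pod_container_resource_requests{container="backend", namespace="$namespace", resource="cpu"})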

Memory usage

Track working set memory, committed memory, and limits to prevent OOM conditions.

Backend memory working set shows actual memory in use:

container_memory_working_set_bytes{container="backend", namespace="$namespace"}

Memory requests and limits define the bounds for container memory allocation:

max(kube_pod_container_resource_requests{container="backend", namespace="$namespace", resource="memory"})
max(kube_pod_container_resource_limits{container="backend", namespace="$namespace", resource="memory"})
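
Working set as a percentage of the limit gives a single OOM-risk indicator (a sketch built from the series above; values near 100% mean the container is close to being OOM-killed):

100 * sum(container_memory_working_set_bytes{container="backend", namespace="$namespace"})
/
sum(kube_pod_container_resource_limits{container="backend", namespace="$namespace", resource="memory"})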

HTTP server requests

| Metric | Description |
| ------ | ----------- |
| http_server_requests_seconds_count | Total request count by method, status, and URI |
| http_server_requests_seconds_sum | Total request duration by method, status, and URI |
| http_server_requests_seconds_max | Maximum request duration |
| http_server_requests_seconds (quantiles) | Request latency percentiles (p50, p95, p99, p999) |

HTTP metrics reveal application throughput, error rates, and latency patterns. These are essential for understanding user-facing performance.

Total request throughput shows overall API activity:

sum(rate(http_server_requests_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval]))

Error rate (4xx and 5xx responses) indicates client errors and server failures:

sum(rate(http_server_requests_seconds_count{app="backend", namespace="$namespace", status=~"[45].."}[$__rate_interval]))
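
To compare server failures against the 5% threshold suggested under Alerting recommendations below, express 5xx responses as a share of all requests (a sketch built from the queries above):

100 * sum(rate(http_server_requests_seconds_count{app="backend", namespace="$namespace", status=~"5.."}[$__rate_interval]))
/
sum(rate(http_server_requests_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval]))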

Average latency per endpoint helps identify slow API paths:

sum by (method, uri) (rate(http_server_requests_seconds_sum{app="backend", namespace="$namespace"}[$__rate_interval]))
/
sum by (method, uri) (rate(http_server_requests_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval]))

The top 10 endpoints by total time spent highlight where server time is consumed, making them good targets for optimization:

topk(10, sum by(method, uri) (rate(http_server_requests_seconds_sum{namespace="$namespace", app="backend"}[$__rate_interval])))
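
The latency percentiles listed in the table above can be read directly, assuming they are published under the usual Micrometer quantile label (an assumption - verify the label name against your /prometheus output):

max by (method, uri) (http_server_requests_seconds{app="backend", namespace="$namespace", quantile="0.99"})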

HTTP client requests

| Metric | Description |
| ------ | ----------- |
| http_client_requests_seconds_count | Outbound request count |
| http_client_requests_seconds_sum | Total outbound request duration |
| http_client_requests_seconds_max | Maximum outbound request duration |

Monitor external API calls and integrations. Slow or failing outbound requests can cascade into application performance issues.

Outbound request rate:

rate(http_client_requests_seconds_count{namespace="$namespace"}[$__rate_interval])

Average outbound request duration:

rate(http_client_requests_seconds_sum{namespace="$namespace"}[$__rate_interval])
/
rate(http_client_requests_seconds_count{namespace="$namespace"}[$__rate_interval])

Maximum outbound request duration identifies slow external dependencies:

http_client_requests_seconds_max{namespace="$namespace"}

JVM memory metrics

| Metric | Description |
| ------ | ----------- |
| jvm_buffer_memory_used_bytes | Memory used by JVM buffer pools (direct, mapped) |
| jvm_memory_used_bytes | Used memory by area (heap/non-heap) and region |
| jvm_memory_committed_bytes | Memory committed for JVM use |
| jvm_memory_max_bytes | Maximum memory available for memory management |
| jvm_gc_live_data_size_bytes | Size of the long-lived heap memory pool after reclamation |
| jvm_gc_max_data_size_bytes | Maximum size of the long-lived heap memory pool |

JVM memory metrics are critical for preventing OutOfMemoryErrors and identifying memory leaks. Monitor both heap (Java objects) and non-heap (metaspace, code cache) regions.

Heap memory usage shows memory used for Java objects:

jvm_memory_used_bytes{app="backend", namespace="$namespace", area="heap"}
jvm_memory_committed_bytes{app="backend", namespace="$namespace", area="heap"}
jvm_memory_max_bytes{app="backend", namespace="$namespace", area="heap"}

Non-heap memory includes metaspace and code cache:

jvm_memory_used_bytes{app="backend", namespace="$namespace", area="nonheap"}
jvm_memory_committed_bytes{app="backend", namespace="$namespace", area="nonheap"}
jvm_memory_max_bytes{app="backend", namespace="$namespace", area="nonheap"}

Heap usage percentage provides a quick health indicator. Alert when this exceeds 85%:

sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) * 100

Direct buffer usage is important for Netty-based applications. High usage can cause native memory issues:

jvm_buffer_memory_used_bytes{namespace="$namespace", app="backend", id="direct"}
jvm_buffer_total_capacity_bytes{namespace="$namespace", app="backend", id="direct"}

JVM garbage collection

| Metric | Description |
| ------ | ----------- |
| jvm_gc_pause_seconds_sum | Total time spent in GC pauses |
| jvm_gc_pause_seconds_count | Number of GC pause events |
| jvm_gc_pause_seconds_max | Maximum GC pause duration |
| jvm_gc_memory_allocated_bytes_total | Total bytes allocated in the young generation |
| jvm_gc_memory_promoted_bytes_total | Bytes promoted to the old generation |

Garbage collection metrics reveal memory pressure and its impact on application responsiveness. Long GC pauses cause request latency spikes.

Average GC pause duration should remain low (under 100ms for most applications):

rate(jvm_gc_pause_seconds_sum{app="backend", namespace="$namespace"}[$__rate_interval])
/
rate(jvm_gc_pause_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval])

Maximum GC pause identifies worst-case latency impact. Alert if this exceeds 1 second:

jvm_gc_pause_seconds_max{app="backend", namespace="$namespace"}

Live data size after GC shows long-lived objects. If this grows over time, you may have a memory leak:

jvm_gc_live_data_size_bytes{app="backend", namespace="$namespace"}
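
To turn this into a trend check, the slope of the gauge over a long window should hover near zero on a healthy instance (a sketch; the one-hour window is illustrative):

deriv(jvm_gc_live_data_size_bytes{app="backend", namespace="$namespace"}[1h])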

Memory allocation and promotion rates indicate object creation patterns. High promotion rates suggest objects are living longer than expected:

rate(jvm_gc_memory_allocated_bytes_total{app="backend", namespace="$namespace"}[$__rate_interval])
rate(jvm_gc_memory_promoted_bytes_total{app="backend", namespace="$namespace"}[$__rate_interval])

JVM threads

| Metric | Description |
| ------ | ----------- |
| jvm_threads_live_threads | Current number of live threads (daemon + non-daemon) |
| jvm_threads_daemon_threads | Current number of daemon threads |
| jvm_threads_peak_threads | Peak thread count since JVM start |
| jvm_threads_states_threads | Thread count by state (runnable, blocked, waiting, timed-waiting) |

Thread metrics help identify deadlocks, thread pool exhaustion, and concurrency issues.

Thread counts show overall thread activity:

jvm_threads_live_threads{app="backend", namespace="$namespace"}
jvm_threads_daemon_threads{app="backend", namespace="$namespace"}
jvm_threads_peak_threads{app="backend", namespace="$namespace"}

Thread states reveal blocking issues. High blocked thread counts indicate lock contention:

jvm_threads_states_threads{app="backend", namespace="$namespace"}
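
To isolate lock contention, filter to blocked threads (the state label value shown here matches the quick-reference table at the end of this page):

jvm_threads_states_threads{app="backend", namespace="$namespace", state="blocked"}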

JVM classes

| Metric | Description |
| ------ | ----------- |
| jvm_classes_loaded_classes | Currently loaded classes |
| jvm_classes_unloaded_classes_total | Total classes unloaded since JVM start |

Class loading metrics help identify class loader leaks or excessive dynamic class generation.

Loaded classes should stabilize after startup. Continuous growth may indicate a class loader leak:

jvm_classes_loaded_classes{namespace="$namespace", app="backend"}

Class unload rate:

rate(jvm_classes_unloaded_classes_total{namespace="$namespace", app="backend"}[$__rate_interval])

Process metrics

| Metric | Description |
| ------ | ----------- |
| process_cpu_usage | Recent CPU usage for the JVM process |
| process_cpu_time_ns_total | Total CPU time used by the JVM |
| process_files_open_files | Open file descriptor count |
| process_files_max_files | Maximum file descriptor limit |
| process_uptime_seconds | JVM uptime |
| process_start_time_seconds | Process start time (Unix epoch) |

Process-level metrics provide visibility into resource consumption and system limits.

JVM process CPU usage:

process_cpu_usage{namespace="$namespace"}

Open file descriptors should be monitored against limits. Exhaustion causes connection failures:

process_files_open_files{namespace="$namespace"}

File descriptor utilization percentage - alert when this exceeds 90%:

(process_files_open_files{namespace="$namespace"} / process_files_max_files{namespace="$namespace"}) * 100

Process uptime helps identify restart events. Low uptime may indicate stability issues:

process_uptime_seconds{namespace="$namespace"}
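
Restart events can also be counted directly from the start-time gauge (a sketch; the one-hour window is illustrative):

changes(process_start_time_seconds{namespace="$namespace"}[1h])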

System metrics

| Metric | Description |
| ------ | ----------- |
| system_cpu_usage | System-wide CPU usage |
| system_cpu_count | Number of processors available to the JVM |
| system_load_average_1m | 1-minute load average |

System metrics provide host-level context for application performance.

System-wide CPU usage:

system_cpu_usage{namespace="$namespace"}

System load average should remain below the CPU count for healthy systems:

system_load_average_1m{namespace="$namespace"}

Available CPU count:

system_cpu_count{namespace="$namespace"}
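
Dividing the two gauges yields load per core, which is easier to compare across differently sized nodes (a sketch; sustained values above 1 suggest CPU saturation):

system_load_average_1m{namespace="$namespace"} / system_cpu_count{namespace="$namespace"}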

Executor thread pools

| Metric | Description |
| ------ | ----------- |
| executor_active_threads | Currently active threads by pool (io, blocking, scheduled) |
| executor_pool_size_threads | Current thread pool size |
| executor_pool_max_threads | Maximum allowed threads in the pool |
| executor_queued_tasks | Tasks queued for execution |
| executor_completed_tasks_total | Total completed tasks |
| executor_seconds_sum | Total execution time |

Thread pool metrics reveal concurrency bottlenecks. Saturated pools cause request queuing and increased latency.

Thread pool utilization (active threads as a fraction of pool size) - a value approaching 1 indicates the pool is near capacity:

executor_active_threads{service="backend", namespace="$namespace", name!="scheduled"}
/
executor_pool_size_threads{service="backend", namespace="$namespace", name!="scheduled"}

Cron scheduled executor utilization:

executor_active_threads{service="cron", namespace="$namespace", name="scheduled"}
/
executor_pool_size_threads{service="cron", namespace="$namespace", name="scheduled"}

Queued tasks indicate backlog. Growing queues suggest the pool cannot keep up with demand:

executor_queued_tasks{app="backend", namespace="$namespace"}

Task completion rate:

rate(executor_completed_tasks_total{namespace="$namespace"}[$__rate_interval])

Cache metrics

| Metric | Description |
| ------ | ----------- |
| cache_size | Number of entries in the cache |
| cache_gets_total | Cache hits and misses by cache name |
| cache_puts_total | Cache entries added |
| cache_evictions_total | Cache eviction count |

Cache effectiveness directly impacts database load and response times. Low hit rates indicate caching issues.

Redis cache hit rate - should be above 70% for effective caching:

avg(irate(redis_keyspace_hits_total{app="platform-redis-exporter"}[$__rate_interval])
/
(irate(redis_keyspace_misses_total{app="platform-redis-exporter"}[$__rate_interval]) + irate(redis_keyspace_hits_total{app="platform-redis-exporter"}[$__rate_interval])))

Cache size by name:

cache_size{namespace="$namespace"}

Cache operation rates:

rate(cache_gets_total{namespace="$namespace"}[$__rate_interval])
rate(cache_puts_total{namespace="$namespace"}[$__rate_interval])
rate(cache_evictions_total{namespace="$namespace"}[$__rate_interval])

Hibernate/Database metrics

| Metric | Description |
| ------ | ----------- |
| hibernate_sessions_open_total | Total sessions opened |
| hibernate_sessions_closed_total | Total sessions closed |
| hibernate_connections_obtained_total | Database connections obtained |
| hibernate_query_executions_total | Total queries executed |
| hibernate_query_executions_max_seconds | Slowest query time |
| hibernate_entities_inserts_total | Entity insert operations |
| hibernate_entities_updates_total | Entity update operations |
| hibernate_entities_deletes_total | Entity delete operations |
| hibernate_entities_loads_total | Entity load operations |
| hibernate_transactions_total | Transaction count |
| hibernate_flushes_total | Session flush count |
| hibernate_optimistic_failures_total | Optimistic lock failures (StaleObjectStateException) |

Database metrics reveal query performance, connection management, and transaction health.

Session operations - open and closed counts should be roughly equal. A growing gap indicates session leaks:

rate(hibernate_sessions_open_total{app="backend", namespace="$namespace"}[$__rate_interval])
rate(hibernate_sessions_closed_total{app="backend", namespace="$namespace"}[$__rate_interval])
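
The outstanding-session gap can also be charted directly; it should stay small and roughly constant, and steady growth points to a session leak:

hibernate_sessions_open_total{app="backend", namespace="$namespace"}
-
hibernate_sessions_closed_total{app="backend", namespace="$namespace"}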

Connection acquisition rate:

rate(hibernate_connections_obtained_total{app="backend", namespace="$namespace"}[$__rate_interval])

Query execution rate:

rate(hibernate_query_executions_total{app="backend", namespace="$namespace"}[$__rate_interval])

Query latency by type helps identify slow queries for optimization:

sum by (query) (rate(hibernate_query_execution_total_seconds_sum{app="backend", namespace="$namespace"}[$__rate_interval]))
/
sum by (query) (rate(hibernate_query_execution_total_seconds_count{app="backend", namespace="$namespace"}[$__rate_interval]))

Slowest query time - alert if this exceeds 5 seconds:

hibernate_query_executions_max_seconds{app="backend", namespace="$namespace"}

Entity operation rates show database write patterns:

rate(hibernate_entities_inserts_total{app="backend", namespace="$namespace"}[$__rate_interval])
rate(hibernate_entities_updates_total{app="backend", namespace="$namespace"}[$__rate_interval])
rate(hibernate_entities_deletes_total{app="backend", namespace="$namespace"}[$__rate_interval])
rate(hibernate_entities_loads_total{app="backend", namespace="$namespace"}[$__rate_interval])

Transaction success/failure rate:

sum by (result) (rate(hibernate_transactions_total{app="backend", namespace="$namespace"}[$__rate_interval]))

Optimistic lock failures indicate concurrent modification conflicts. High rates suggest contention issues:

rate(hibernate_optimistic_failures_total{app="backend", namespace="$namespace"}[$__rate_interval])

Connection pool metrics

| Metric | Description |
| ------ | ----------- |
| jdbc_connections_active | Active database connections |
| jdbc_connections_max | Maximum connection pool size |
| jdbc_connections_min | Minimum connection pool size |
| jdbc_connections_usage | Connection pool usage |

Connection pool metrics prevent connection exhaustion during traffic bursts.

Active connections vs pool limits - alert when active connections approach the maximum:

sum(jdbc_connections_active{app="backend", namespace="$namespace"})
sum(jdbc_connections_max{app="backend", namespace="$namespace"})
sum(jdbc_connections_min{app="backend", namespace="$namespace"})
sum(jdbc_connections_usage{app="backend", namespace="$namespace"})
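
As a single utilization figure that maps onto the critical-alert threshold below (a sketch based on the series above):

100 * sum(jdbc_connections_active{app="backend", namespace="$namespace"})
/
sum(jdbc_connections_max{app="backend", namespace="$namespace"})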

Hibernate cache metrics

Hibernate caching reduces database load. Monitor hit rates to ensure caches are effective.

Query cache hit rate - should exceed 60%:

sum(increase(hibernate_cache_query_requests_total{app="backend", namespace="$namespace", result="hit"}[$__rate_interval]))
/
sum(increase(hibernate_cache_query_requests_total{app="backend", namespace="$namespace"}[$__rate_interval]))

Query plan cache hit rate:

sum(increase(hibernate_cache_query_plan_total{app="backend", namespace="$namespace", result="hit"}[$__rate_interval]))
/
sum(increase(hibernate_cache_query_plan_total{app="backend", namespace="$namespace"}[$__rate_interval]))

Second level cache hit rate by region:

sum by (region) (increase(hibernate_second_level_cache_requests_total{app="backend", namespace="$namespace", result="hit"}[$__rate_interval]))
/
sum by (region) (increase(hibernate_second_level_cache_requests_total{app="backend", namespace="$namespace"}[$__rate_interval]))

Logging metrics

| Metric | Description |
| ------ | ----------- |
| logback_events_total | Log events by level (debug, info, warn, error, trace) |

Log event metrics provide early warning of application issues.

Error rate - track error log frequency for anomaly detection:

rate(logback_events_total{level="error"}[5m])

Kubernetes health

Monitor pod health to catch deployment or infrastructure issues early.

Pods in unhealthy states:

sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", namespace!="wave-build"}) > 0

Alerting recommendations

Critical alerts

  • jvm_memory_used_bytes{area="heap"} > 90% of jvm_memory_max_bytes
  • process_files_open_files > 90% of process_files_max_files
  • logback_events_total{level="error"} rate > threshold
  • tower_logs_errors_1minCount > 0
  • HTTP 5xx errors > 5% of total requests
  • jdbc_connections_active > 90% of jdbc_connections_max
  • Any pods in Failed/Unknown state for > 5 minutes
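
To make these concrete, a minimal Prometheus alerting-rules sketch covering two of the conditions above (group and alert names, durations, and thresholds are illustrative; adapt the selectors to your deployment):

groups:
  - name: seqera-platform-critical
    rules:
      - alert: JvmHeapUsageHigh
        # Heap above 90% of max, sustained for 5 minutes (see JVM memory metrics above)
        expr: sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) > 0.90
        for: 5m
        labels:
          severity: critical
      - alert: FileDescriptorsNearLimit
        # Open file descriptors above 90% of the process limit (see Process metrics above)
        expr: process_files_open_files / process_files_max_files > 0.90
        for: 5m
        labels:
          severity: critical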

Warning alerts

  • jvm_gc_pause_seconds_max > 1 second
  • jvm_gc_live_data_size_bytes approaching jvm_gc_max_data_size_bytes
  • Heap usage > 85% of max heap
  • executor_queued_tasks > threshold
  • Executor utilization > 90%
  • hibernate_optimistic_failures_total rate increasing
  • hibernate_query_executions_max_seconds > 5 seconds
  • http_server_requests_seconds p99 > acceptable latency
  • Redis cache hit rate < 70%
  • Hibernate query cache hit rate < 60%
  • Growing gap between credits_estimation_workflow_added_total and credits_estimation_workflow_ended_total
  • hibernate_sessions_open_total >> hibernate_sessions_closed_total over time

Quick reference: Metrics by troubleshooting scenario

| Issue | Key metrics to check |
| ----- | -------------------- |
| Slow application response | http_server_requests_seconds (latency), jvm_gc_pause_seconds_max, hibernate_query_executions_max_seconds, executor_active_threads |
| Out of memory errors | jvm_memory_used_bytes, jvm_gc_pause_seconds, jvm_gc_live_data_size_bytes, jvm_buffer_memory_used_bytes |
| Database performance | hibernate_query_executions_max_seconds, jdbc_connections_active, hibernate_transactions_total, cache hit rates |
| High CPU usage | process_cpu_usage, system_cpu_usage, jvm_threads_live_threads, executor_active_threads |
| Connection exhaustion | jdbc_connections_active, jdbc_connections_max, hibernate_sessions_open_total vs hibernate_sessions_closed_total |
| Cache issues | Redis hit rate, hibernate_cache_query_requests_total, cache_gets_total, cache_evictions_total |
| Workflow processing delays | credits_estimation_workflow_*, credits_estimation_task_*, executor_queued_tasks, tower_logs_errors_* |
| Thread starvation | executor_active_threads, executor_queued_tasks, jvm_threads_states_threads{state="blocked"} |
| Memory leaks | jvm_memory_used_bytes trending up, jvm_gc_live_data_size_bytes growing, jvm_classes_loaded_classes growing |
| GC pressure | jvm_gc_pause_seconds_max, jvm_gc_memory_promoted_bytes_total, time in GC vs application time |