Metrics and Monitoring
This document covers the comprehensive Prometheus metrics and monitoring capabilities provided by the DataMover Operator.
Overview
The DataMover Operator provides detailed Prometheus metrics for monitoring backup operations, performance tracking, and operational observability. These metrics enable effective monitoring, alerting, and troubleshooting of data movement operations.
Available Metrics
Operation Metrics
datamover_operations_total
Counter tracking total operations by phase and status.
Labels:
- phase
: Operation phase (CreatingClonedPVC, CreatingPod, CleaningUp)
- status
: Operation status (started, success, failure)
- namespace
: Kubernetes namespace
Examples:
datamover_operations_total{phase="CreatingClonedPVC", status="started", namespace="default"} 10
datamover_operations_total{phase="CreatingClonedPVC", status="success", namespace="default"} 9
datamover_operations_total{phase="CreatingClonedPVC", status="failure", namespace="default"} 1
datamover_phase_duration_seconds
Histogram tracking phase execution duration.
Labels:
- phase
: Operation phase
- namespace
: Kubernetes namespace
Buckets: 0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300, 600, 1200, +Inf seconds
Examples:
datamover_phase_duration_seconds_bucket{phase="CreatingPod", namespace="default", le="300"} 45
datamover_phase_duration_seconds_sum{phase="CreatingPod", namespace="default"} 12847.3
datamover_phase_duration_seconds_count{phase="CreatingPod", namespace="default"} 50
Cleanup Metrics
datamover_cleanup_operations_total
Counter tracking PVC cleanup operations.
Labels:
- status
: Cleanup status (success, failure)
- namespace
: Kubernetes namespace
Examples:
datamover_cleanup_operations_total{status="success", namespace="default"} 25
datamover_cleanup_operations_total{status="failure", namespace="default"} 2
Job Metrics
datamover_pod_creation_operations_total
Counter tracking job/pod creation operations.
Labels:
- status
: Operation status (started, success, failure)
- namespace
: Kubernetes namespace
Examples:
datamover_pod_creation_operations_total{status="started", namespace="default"} 30
datamover_pod_creation_operations_total{status="success", namespace="default"} 28
datamover_pod_creation_operations_total{status="failure", namespace="default"} 2
datamover_data_sync_operations_total
Counter tracking data synchronization operations.
Labels:
- status
: Sync status (success, failure)
- namespace
: Kubernetes namespace
Examples:
datamover_data_sync_operations_total{status="success", namespace="default"} 25
datamover_data_sync_operations_total{status="failure", namespace="default"} 3
Metric Collection
Prometheus Configuration
Configure Prometheus to scrape DataMover metrics:
# prometheus.yml
scrape_configs:
- job_name: 'datamover-operator'
static_configs:
- targets: ['datamover-operator-controller-manager-metrics-service:8080']
scrape_interval: 30s
metrics_path: /metrics
ServiceMonitor (Prometheus Operator)
For Prometheus Operator deployments:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: datamover-operator
namespace: datamover-operator-system
spec:
selector:
matchLabels:
control-plane: controller-manager
endpoints:
- port: https
scheme: https
path: /metrics
tlsConfig:
insecureSkipVerify: true
Manual Metric Access
Access metrics directly:
# Port-forward to metrics endpoint
kubectl port-forward -n datamover-operator-system service/datamover-operator-controller-manager-metrics-service 8080:8080
# Query metrics
curl http://localhost:8080/metrics | grep datamover
Monitoring Dashboards
Grafana Dashboard Example
Create comprehensive Grafana dashboard:
{
"dashboard": {
"title": "DataMover Operator",
"panels": [
{
"title": "Operation Success Rate",
"type": "stat",
"targets": [
{
"expr": "rate(datamover_operations_total{status=\"success\"}[5m]) / rate(datamover_operations_total[5m]) * 100"
}
]
},
{
"title": "Phase Duration",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(datamover_phase_duration_seconds_bucket[5m]))"
}
]
},
{
"title": "Operations by Phase",
"type": "graph",
"targets": [
{
"expr": "rate(datamover_operations_total[5m])"
}
]
}
]
}
}
Key Performance Indicators (KPIs)
Monitor these critical metrics:
- Success Rate: Overall operation success percentage
- Phase Duration: Time taken for each operation phase
- Failure Rate: Operations failing per time period
- Cleanup Success: PVC cleanup operation success rate
Dashboard Panels
Success Rate Panel
# Overall success rate
rate(datamover_operations_total{status="success"}[5m]) /
rate(datamover_operations_total[5m]) * 100
Average Phase Duration Panel
# Average phase duration by phase
rate(datamover_phase_duration_seconds_sum[5m]) /
rate(datamover_phase_duration_seconds_count[5m])
Operations Volume Panel
# Operations per minute by phase
rate(datamover_operations_total[1m]) * 60
Cleanup Success Rate Panel
# Cleanup success rate
rate(datamover_cleanup_operations_total{status="success"}[5m]) /
rate(datamover_cleanup_operations_total[5m]) * 100
Alerting
Prometheus Alerting Rules
Configure alerts for operational issues:
groups:
- name: datamover.rules
rules:
# High failure rate alert
- alert: DataMoverHighFailureRate
expr: |
(
rate(datamover_operations_total{status="failure"}[5m]) /
rate(datamover_operations_total[5m])
) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "DataMover high failure rate"
description: "DataMover failure rate is above 10% in namespace {{ $labels.namespace }}"
# Long-running operations alert
- alert: DataMoverLongRunningOperation
expr: |
histogram_quantile(0.95, rate(datamover_phase_duration_seconds_bucket[5m])) > 1800
for: 5m
labels:
severity: warning
annotations:
summary: "DataMover operations taking too long"
description: "95th percentile of DataMover operations exceeds 30 minutes"
# Cleanup failures alert
- alert: DataMoverCleanupFailures
expr: |
rate(datamover_cleanup_operations_total{status="failure"}[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "DataMover cleanup operations failing"
description: "PVC cleanup operations are failing in namespace {{ $labels.namespace }}"
# No operations alert (for scheduled backups)
- alert: DataMoverNoOperations
expr: |
rate(datamover_operations_total[1h]) == 0
for: 2h
labels:
severity: info
annotations:
summary: "No DataMover operations detected"
description: "No DataMover operations in the last hour"
# Job retry exhaustion alert
- alert: DataMoverJobRetriesExhausted
expr: |
rate(datamover_pod_creation_operations_total{status="failure"}[5m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "DataMover jobs exhausting retries"
description: "DataMover jobs are failing after all retry attempts"
Alert Severity Levels
Critical: Immediate attention required - Job retries exhausted - Complete operation failures - Security violations
Warning: Investigation needed - High failure rates - Performance degradation - Cleanup failures
Info: Awareness notifications - No operations detected - Configuration changes - Routine maintenance
Performance Monitoring
Baseline Metrics
Establish baseline performance metrics:
Metric | Typical Range | Alert Threshold |
---|---|---|
PVC Clone Creation | 30s - 5min | > 10min |
Job Execution | 1min - 30min | > 60min |
Cleanup Operation | 5s - 30s | > 2min |
Overall Success Rate | > 95% | < 90% |
Performance Queries
Average Operation Duration
# Average duration by phase over last hour
avg_over_time(
rate(datamover_phase_duration_seconds_sum[5m])[1h:5m]
) /
avg_over_time(
rate(datamover_phase_duration_seconds_count[5m])[1h:5m]
)
Success Rate Trend
# Success rate trend over 24 hours
rate(datamover_operations_total{status="success"}[1h]) /
rate(datamover_operations_total[1h])
Resource Utilization Correlation
# Correlate phase duration with resource usage
rate(datamover_phase_duration_seconds_sum[5m]) and on()
rate(container_cpu_usage_seconds_total{pod=~"datamover-operator.*"}[5m])