Healthcheck & monitoring
Healthcheck
Use Datalore's built-in HTTP endpoint (accessible at /health) to verify that the instance is online and responsive.
This endpoint returns OK when no issues are detected.
If the default Helm charts are used for the deployment, use the same endpoint as the Kubernetes liveness probe.
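The check described above can be scripted, for example to wait for an instance to come up after a deployment. The following is a minimal sketch using only the Python standard library. It assumes that a healthy instance answers with HTTP status 200 and a body of OK (the text above only states that the endpoint returns OK). The function names, retry count, and delay are illustrative and not part of Datalore.

```python
import time
import urllib.error
import urllib.request


def fetch_health(base_url, timeout=5.0):
    """Return (status_code, body) for GET <base_url>/health."""
    url = base_url.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def is_healthy(status, body):
    # Assumption: "returns OK" means HTTP 200 with the text OK in the body.
    return status == 200 and body.strip() == "OK"


def wait_until_healthy(base_url, attempts=30, delay=2.0,
                       fetch=fetch_health, sleep=time.sleep):
    """Poll /health until it reports OK or the attempts run out."""
    for attempt in range(attempts):
        try:
            status, body = fetch(base_url)
            if is_healthy(status, body):
                return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx response:
            # treat the instance as not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_healthy with the base URL of your instance returns True as soon as the endpoint reports OK, and False if it never does within the given number of attempts. The fetch and sleep parameters exist only so the polling logic can be exercised without a running server.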
Monitoring
Datalore has a built-in metrics exporter, which is disabled by default and accessible at the /metrics path when enabled explicitly.
Two mutually exclusive environment variables of the Datalore server can be used to enable metrics:
Monitoring environment variables
| Name | Type | Default value | Description |
|---|---|---|---|
|  | String | Not defined | Enables the exporter and defines the authentication token required to collect metrics. Mutually exclusive with the other variable. |
|  | String | Not defined | Enables the exporter. No authentication is required to read metrics. Mutually exclusive with the other variable. |
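Once the exporter is enabled, the output of the /metrics path can also be inspected without a Prometheus server. The sketch below assumes the exporter emits the Prometheus text exposition format, which the Prometheus queries on this page suggest but do not state. It parses such text and sums one metric by one label, similar to a sum by (...) query. The sample text and its label values are made up for illustration and are not real Datalore output; the function names are hypothetical.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
# A trailing timestamp, if present, is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for every sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue  # skip anything that does not look like a sample
        name, raw_labels, value = match.groups()
        # Label values are kept as written; escape sequences are not decoded.
        labels = dict(_LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


def sum_by_label(text, metric, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in parse_metrics(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
# TYPE agent_pool_size gauge
agent_pool_size{instance_name="small"} 2
agent_pool_size{instance_name="large"} 1
agent_pool_size{instance_name="small"} 3
"""

if __name__ == "__main__":
    print(sum_by_label(SAMPLE, "agent_pool_size", "instance_name"))
```

How the authentication token is passed when token-based access is enabled is not covered here, so the sketch deliberately starts from text that has already been retrieved.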
Metrics
agent_pool_size: shows how many agents the pool currently has. Prometheus query:
sum by (instance_name)(agent_pool_size)
agent_waiting_time_bucket: represents how long the user waited for an instance to start. Prometheus query:
sum(increase(agent_waiting_time_bucket[10m])) by (le)
agent_in_pool_time_bucket: represents how long the agent was online and idle before being assigned to a specific notebook. Prometheus query:
sum(increase(agent_in_pool_time_bucket[10m])) by (le)
agents_started_total: shows how many agents were started per minute. Prometheus query:
sum by (instance_name)(rate(agents_started_total[5m])) * 60
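The two bucket queries above return cumulative counts per upper bound (the le label), which is easier to read once reduced to a percentile. The sketch below is a simplified Python version of the linear interpolation idea behind PromQL's histogram_quantile() function. It is not a full reimplementation, and the bucket bounds and counts in the example are invented. The bounds are in whatever unit the exporter uses, which this page does not specify.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    The buckets must include the +Inf bucket, whose count is the total
    number of observations.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # The rank falls into the open-ended bucket (or an empty
                # one): report the last known finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: 40 observations spread over buckets le=1, 5, 10, +Inf.
EXAMPLE_BUCKETS = [(1.0, 10), (5.0, 30), (10.0, 40), (math.inf, 40)]

if __name__ == "__main__":
    print(histogram_quantile(0.5, EXAMPLE_BUCKETS))
```

In Prometheus itself the same estimate is usually obtained by wrapping the bucket query in histogram_quantile(), so this sketch is mainly useful for understanding what the bucketed numbers mean.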