System Monitoring

ScienceOps exposes several endpoints that can be integrated with system monitoring services such as Nagios, Graphite, StatsD, Datadog, and New Relic. Both master and worker nodes are monitored, as are individual models.

Configuring Monitoring Systems

To configure a monitoring system, navigate to Admin > Systems > Monitoring.

Graphite and StatsD

  1. In the first field, enter your Graphite / StatsD hostname (the URL where your monitoring service is located). If necessary, include the port.

  2. Add a prefix for your metrics. This text will be prepended to all metrics collected by the monitoring service. This makes it easier to identify which metrics are coming from ScienceOps. Note that for some hosted versions of Graphite / StatsD, this prefix must be preceded by the API Key provided by the hosting service.

Datadog

  1. In the first field, add your Datadog API key.

  2. In the second field, add your Datadog App key.

  3. Finally, add a prefix for your metrics. This text will be prepended to all metrics collected by the monitoring service. This makes it easier to identify which metrics are coming from ScienceOps.

New Relic

Because New Relic monitors detailed application-level performance, it must be configured in the /etc/scienceops.yaml file.

To add New Relic monitoring, open /etc/scienceops.yaml and add the following lines:

newrelic:
    license_key: <license_key>
    app_name: <metrics-prefix>

Once this is added, save the file and restart the scienceops process:

sudo stop scienceops
sudo start scienceops

Metrics Tracked

As of release 2.4.1, the following metrics are supported:

System Metrics             Network Metrics               Prediction Metrics
CPU usage                  Bytes Received                Success
Memory Free                Bytes Sent                    Errors
Memory Total               Network Drop In               Total
Percent of Memory in Use   Network Dropout               ErrorRate
Disk Space Free            Error Input Rate
                           Error Output Rate
                           Number of Packets Sent
                           Number of Packets Received

Model Status:

  • Model Status, similar to a Nagios status endpoint

Graphite Prefixes:

Every metric prefix listed below is preceded by { prefix }.{ node }.

For example: scienceops.master_node.Mem.Active
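As an illustration of the naming scheme, a full metric name can be assembled by joining the configured prefix, the node name, and the metric prefix with dots. The helper below is a hypothetical sketch, not part of ScienceOps. The optional api_key argument reflects the note about hosted Graphite / StatsD services; joining it with a dot is an assumption, so check your hosting service's documentation.

```python
def metric_path(prefix, node, metric, api_key=None):
    """Join the pieces of a Graphite metric name with dots.

    Hypothetical helper for illustration only.
    """
    parts = [prefix, node, metric]
    if api_key:
        # Assumption: a hosted service expects "<API key>.<prefix>...".
        parts.insert(0, api_key)
    return ".".join(parts)


# Reproduces the example name given above.
print(metric_path("scienceops", "master_node", "Mem.Active"))
```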

System Metrics:

Metric Prefix                                        Description
CPU.TotalPct                                         total % of CPU usage across all cores
CPU.Total                                            total CPU usage across all cores, not expressed as a %
Mem.UsedPct                                          % of total memory in use
Mem.Active, Avail, Free, Inactive, Total, and Used   memory usage, in KB
Disk.{ disk partition }.UsedPercent                  % of disk used on this partition

Network Metrics:

Metric Prefix                                Description
Net.Interface.{Interface Name}.BytesRecv     bytes received
Net.Interface.{Interface Name}.BytesSent     bytes sent
Net.Interface.{Interface Name}.Dropin        network drop in
Net.Interface.{Interface Name}.Dropout       network dropout
Net.Interface.{Interface Name}.ErrIn         error input rate
Net.Interface.{Interface Name}.ErrOut        error output rate
Net.Interface.{Interface Name}.PacketsSent   number of packets sent
Net.Interface.{Interface Name}.PacketsRecv   number of packets received

Prediction Metrics:

Metric Prefix           Description
Predictions.Success     count of successful predictions over the last hour
Predictions.Errors      count of prediction errors over the last hour
Predictions.Total       count of all predictions over the last hour
Predictions.ErrorRate   error rate of predictions over the last hour

Note: ErrorRate is computed as ErrorRate = Errors / 60s, in units of 1 / t.
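A small worked example may help when reading this value. The sketch below takes the formula literally, assuming that "60s" means a 60-second window so that the result is in errors per second. The function name and its window_seconds parameter are hypothetical.

```python
def error_rate(errors, window_seconds=60.0):
    """Errors per second, reading the formula in the note literally.

    Assumption: "60s" denotes a 60-second window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return errors / window_seconds


# 30 errors divided by 60 seconds gives 0.5 errors per second.
print(error_rate(30))
```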

Model Status:

Metric Prefix                                    Description
ModelStatus.{Ops Username}.{Model Name}.Status   status of a model

Note: a value of 1 indicates the model is online
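When consuming these metrics downstream, it can be handy to split a model status metric name back into its username and model name. The function below is a hypothetical sketch. It assumes that neither the username nor the model name contains a dot, and it treats any value other than 1 as "not online", which the documentation above does not specify.

```python
def parse_model_status(metric_name, value):
    """Return (username, model_name, is_online) for a ModelStatus metric.

    Hypothetical helper; assumes names contain no dots.
    """
    parts = metric_name.split(".")
    start = parts.index("ModelStatus")  # raises ValueError if absent
    # Unpacking raises ValueError if the name is too short.
    user, model, leaf = parts[start + 1:start + 4]
    if leaf != "Status":
        raise ValueError("not a model status metric: " + metric_name)
    # Only the value 1 is documented (online); anything else is
    # treated as not online here, which is an assumption.
    return user, model, value == 1


print(parse_model_status(
    "scienceops.master_node.ModelStatus.alice.HelloWorld.Status", 1))
```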

Nagios

Status endpoints for models are Nagios compliant: they return status codes indicating whether the model is accessible.

Return a list of models:

Note: this does not return the models' status

$ curl --user USER:APIKEY https://IP_ADDRESS/api/USER/models/
{
 "status": "ok",
 "models": [
  "HelloWorld",
  "test1"
 ]
}
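Because the listing does not include status, a monitoring script would typically follow it with one status request per model. The sketch below only parses a listing like the one above and builds the per-model status URLs, following the URL pattern used in this section's status example. The function name is hypothetical, and no network request is made.

```python
import json
from urllib.parse import quote


def status_urls(listing_json, host, user):
    """Build one status URL per model named in a models listing.

    Hypothetical helper; the URL pattern mirrors the status example
    in this section.
    """
    data = json.loads(listing_json)
    if data.get("status") != "ok":
        raise ValueError("listing did not report status 'ok'")
    return [
        "https://{}/{}/models/{}/status/".format(host, user, quote(name, safe=""))
        for name in data["models"]
    ]


listing = '{"status": "ok", "models": ["HelloWorld", "test1"]}'
for url in status_urls(listing, "IP_ADDRESS", "USER"):
    print(url)
```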

Return a specific model’s status:

$ curl --user USER:APIKEY https://IP_ADDRESS/USER/models/MODEL/status/
{"status":"OK","date":"2016-01-08T16:16:44Z"}
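To wire this endpoint into a Nagios check, the response body can be mapped to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is one possible mapping, not an official ScienceOps plugin. The documentation only shows the "OK" response, so treating any other status as CRITICAL and unparseable responses as UNKNOWN is an assumption.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_check(body):
    """Map a model status response body to (exit_code, message).

    Hypothetical mapping: only the "OK" response is documented.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return UNKNOWN, "UNKNOWN - response is not valid JSON"
    if not isinstance(payload, dict):
        return UNKNOWN, "UNKNOWN - unexpected response shape"
    if str(payload.get("status", "")).upper() == "OK":
        when = payload.get("date", "unknown time")
        return OK, "OK - model reported OK at {}".format(when)
    # Assumption: any status other than "OK" means the model is down.
    return CRITICAL, "CRITICAL - model status is {!r}".format(payload.get("status"))


code, message = nagios_check('{"status":"OK","date":"2016-01-08T16:16:44Z"}')
print(code, message)
```

In a real plugin, the body would come from the curl request shown above (or an equivalent HTTP call), the message would be printed, and the script would exit with the returned code.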
