Pingdom

Legacy

Area: infrastructure-services
Tech Lead:
Description: Monitoring tool.
Quality: No Sentry, no SONAR

Introduction

We use Pingdom as one of our blackbox monitoring tools. We also use Alertra, but we will be deprecating Alertra and merging its functionality into Pingdom ASAP. In the meantime, if you have questions about Alertra, it is best to contact @tpickard.

Crossref uses Pingdom to perform simple HTTP uptime and response time checks against Crossref systems and systems that Crossref depends on (e.g. the Handle system). The checks are performed from multiple geographic locations, and a check is only reported as failed if the failure is confirmed from a separate location. This means that it is highly unlikely that a failure reported by Pingdom is a false positive.

Each check returns uptime and response time information. Pingdom keeps a historical log of this information. Some Crossref services have Pingdom data going back to 2007.

For new services, we have tried to standardise on a well-known route, /heartbeat, that provides information on the health of the respective service. The /heartbeat route returns a JSON response, and the top level of that response should have a key called ‘status’. Pingdom checks that the value of the ‘status’ key is “ok”. The “ok” value should only be returned if the respective service is functioning. Note that we say “functioning” and not just “up”. The two terms are often conflated and this causes problems. More details on the /heartbeat route can be found here.
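
As a minimal sketch of the rule described above (not Pingdom’s actual implementation), the following Python function decides whether a /heartbeat response body counts as healthy. The function name is hypothetical; the only details taken from the text are the top-level ‘status’ key and the “ok” value.

```python
import json


def heartbeat_is_ok(body):
    """Return True only if body is a JSON object whose top-level "status" is "ok".

    Hypothetical helper for illustration; it mirrors the rule in the text,
    not any real Pingdom or Crossref code.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not valid JSON: the service answered, but not with a usable heartbeat.
        return False
    # The key must sit at the top level of a JSON object, not nested deeper.
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

A body such as {"status": "ok"} passes, while an HTML error page, a different status value, or a ‘status’ key nested inside another object does not.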

Our legacy services and some of the external services we depend on typically do not have a /heartbeat endpoint. In those cases, we point Pingdom at a suitable endpoint that indicates that the service is at least up and responding.

Some services may not have an HTTP endpoint at all (e.g. the VPN server). For these we use Pingdom’s ICMP or SSH check.

Pingdom checks can send alerts when services are down. Note that Pingdom is limited to “binary” up/down judgements. It has no way of determining that a service is “degraded” (e.g. running slower than a defined baseline).

We have divided Pingdom checks into a few categories:

  1. interest: These do not send alerts to anybody. For example, we monitor ORCID’s search uptime as a matter of interest and for comparison.
  2. status-test: These are used when we are experimenting with something and don’t want to bother anybody. They send no alerts.
  3. status-beta: These are used by services that are in beta. They only send alerts to the status-beta channel on Slack.
  4. status-info: These generally point at free services that have no SLAs. They only send alerts to the status-info channel on Slack.
  5. status-critical: These point at critical services and SLA-backed services such as our Plus services. They send alerts to the status-critical channel on Slack and they send SMS/email alerts to relevant tech team members.
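
The categories above can be summarised as a small lookup table. This is only an illustrative sketch: the dictionary, the target labels (such as "slack:status-beta") and the function name are hypothetical and are not a Pingdom configuration format.

```python
# Hypothetical summary of the check categories listed above.
# An empty list means the category sends no alerts.
ALERT_ROUTING = {
    "interest": [],
    "status-test": [],
    "status-beta": ["slack:status-beta"],
    "status-info": ["slack:status-info"],
    "status-critical": ["slack:status-critical", "sms", "email"],
}


def alert_targets(category):
    """Return a copy of the alert targets for a check category."""
    return list(ALERT_ROUTING[category])
```

For example, alert_targets("interest") returns an empty list, and alert_targets("status-critical") includes the Slack channel as well as SMS and email.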

Interpreting Pingdom data.

You can look at Pingdom data either on the Pingdom site itself (contact the infrastructure team for credentials) or, for a subset of services, at the bottom of the Crossref status page.

This section focuses on using the Pingdom site itself.

Crossref service checks are all located in Pingdom’s “Uptime” dashboard. Services that are currently “down” will be listed at the top of the dashboard.

Below that will be services that are currently “up”, but which may be experiencing intermittent outages or slowdowns. You can search and filter this list for the service you are interested in.

Note that the default time scale for the service data that you look at is 24 hours. This might be good for seeing any current issues, but it doesn’t provide context. For example, you may look at the chart and it might look like there is a lot of activity and that the service is slow, but when you zoom out (to 30 days, for example), you might see that the 24 hour period in question is, in fact, seeing low traffic and fast response times compared to the previous 30 days.

TIP: Always make sure you zoom in and out of Pingdom data in order to understand the context.

When you are looking at the data for an individual service, the top of the service page will show a chart depicting uptime and response time. Below that, you will see a table that also lists the uptimes/downtimes along with specific timestamps.

TIP: Pingdom shows all times in UTC.

Along the right-hand side of the “Uptime changes” table, you will see two buttons:

  • Test Log (next to all entries)
  • Root Cause (only next to outages)

Clicking on the “Test Log” button will show you a dialog with details of each of the individual tests that were performed in that cycle, including the location of each of the test servers.

Sometimes, when there is a failure, you might see that all of the failures are coming from a particular region. This may indicate a local internet problem rather than a problem with our service. Note, however, that this is very rare, and that you should only conclude that there might be a problem with the Pingdom test itself after consulting the Pingdom status page to see if they have detected anomalies as well. Also, try not to think about how “meta” this is. It will make your head hurt.

In short, the information provided by the “Test Log” button is interesting, but it is not typically useful in diagnosing a problem.
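
If you do want to check whether the failures in a test log cluster in a single region, the reasoning can be sketched as below. Everything here is an assumption made for illustration: the (region, succeeded) tuples are a hand-transcribed stand-in for what the “Test Log” dialog shows, and the region names and function names are hypothetical.

```python
from collections import Counter


def failing_regions(test_log):
    """Count failed probes per region.

    test_log is a hypothetical list of (region, succeeded) tuples
    copied by hand from the "Test Log" dialog.
    """
    return Counter(region for region, succeeded in test_log if not succeeded)


def failures_confined_to_one_region(test_log):
    """True if every failure comes from one region and at least one
    probe from a different region succeeded."""
    failed = failing_regions(test_log)
    ok_regions = {region for region, succeeded in test_log if succeeded}
    return len(failed) == 1 and bool(ok_regions - set(failed))
```

A log such as [("Europe", False), ("Europe", False), ("North America", True)] would return True, which is a hint to consult the Pingdom status page before concluding anything about our own service.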

Clicking on the “Root Cause” button next to a failure will give you details of the response that was received from any failed tests.

If the request did not time out, it will show you the HTTP error that was returned, the IP address the request resolved to, the time the request took to complete, the HTTP headers that were received in the response, and the contents of the HTTP response itself. For example:

Received header

503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

Received content

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

If, on the other hand, the request timed out (i.e. took longer than 30 seconds to complete), the “Root Cause” dialog will show you a traceroute to the domain of the requested resource. Unfortunately, this is of limited value because the Crossref firewall does not respond to traceroute requests.

Also, confusingly, you may sometimes see what looks like a successful and complete response from the service. This simply means that the response came after the timeout period of 30 seconds. This should not be interpreted to mean the service is, in fact, working. An HTTP service that takes over 30 seconds to respond is not working.
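
The point about late responses can be expressed as a simple rule. The sketch below is illustrative only: the function name is hypothetical, and treating any status code from 200 to 399 as success is an assumption rather than something Pingdom documents here. The 30 second timeout is the one described above.

```python
TIMEOUT_SECONDS = 30  # timeout period described in the text above


def check_passed(status_code, elapsed_seconds):
    """Hypothetical helper: decide whether a single check counts as "up"."""
    if elapsed_seconds > TIMEOUT_SECONDS:
        # A complete, valid-looking response that arrives late still counts as down.
        return False
    # Assumption: 2xx and 3xx responses count as success.
    return 200 <= status_code < 400
```

So check_passed(200, 31.0) is False even though the response itself looks healthy, while check_passed(200, 0.4) is True.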

Pingdom updates to the Crossref public status page.

Crossref uses statuspage.io to host the Crossref public status page.

Some of the components on the status page are updated automatically via Pingdom.

To get Pingdom to automatically update a component on statuspage.io, you must:

  1. In statuspage.io, click on the “automation” button next to the component that should be updated by Pingdom. It will list the “specific email address for this component” which will look something like this: component+73ec22b4-dfd4-492b-xy59-91c7ab112f3a@notifications.statuspage.io.
  2. Copy that email address and go to the Pingdom admin page.
  3. Go to the “Alert contacts” section of the “Users” page on Pingdom.
  4. Check to see if there is already an “Alert Contact” for the statuspage component in question. If there is not, create one. If there is, edit it.
  5. Make sure the email address of the “Alert contact” is the one you copied from statuspage.io.
  6. Click on “Test alert settings”.
  7. Go back to the component on the statuspage.io site. Click on automation again and you should see the test mail from Pingdom appear in the dialog.

TIP: If you edit or change a component on statuspage.io, it will generate a new email address and break the linkage between Pingdom and statuspage.io unless you update the Pingdom alert contact’s email address with the new statuspage.io component email.
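
When checking for a broken linkage by hand, the comparison amounts to the sketch below. Both function names are hypothetical, and the address shape (a “component+” prefix and the notifications.statuspage.io domain) is inferred only from the example address shown in the steps above. The code does not call either service; you paste in the two addresses yourself.

```python
STATUSPAGE_DOMAIN = "notifications.statuspage.io"
COMPONENT_PREFIX = "component+"


def looks_like_component_address(email):
    """True if email has the shape of the example component address above."""
    local, sep, domain = email.strip().rpartition("@")
    return (
        sep == "@"
        and domain.lower() == STATUSPAGE_DOMAIN
        and local.startswith(COMPONENT_PREFIX)
        and len(local) > len(COMPONENT_PREFIX)
    )


def contact_needs_update(pingdom_contact_email, statuspage_component_email):
    """True if the Pingdom alert contact no longer matches the address
    currently shown for the component on statuspage.io."""
    return pingdom_contact_email.strip() != statuspage_component_email.strip()
```

If contact_needs_update returns True after a component has been edited, repeat the steps above with the new address.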