Monitoring health status from Fastly health checks
Hi fastly team,
I am building our service failure detection.
Is there a way to get information on the status of Fastly's health checks?
We expected a log entry when a backend becomes unhealthy, but my initial testing has not shown one.
Thank you,
Malte
-
@drwilco Any word on Fastly monitoring for health checks?
Being unable to view health check status makes debugging, especially real-time debugging during a 3rd party outage, a real pain.
-
Hi rmharrison Chris Usick Malte Lauenroth and all,
Regarding this topic, yes, we're aware of your needs as this is one of the frequently asked questions. We have no ETA to share at the moment, but have an internal feature request ticket and the product team is working on it.
In the meantime, I'd like to share some tips on how you can monitor backend health status using VCL.
https://developer.fastly.com/reference/vcl/variables/backend-connection/req-backend-healthy/
req.backend.healthy
- Read Only
Whether or not the backend has been marked unhealthy by health checks.
https://developer.fastly.com/reference/vcl/variables/backend-connection/backend-name-healthy/
backend.{NAME}.healthy
- Read Only
Whether a particular backend is healthy.
We can use these variables. Some of you may have thought about (or tried) putting them into your custom logging format and monitoring them through your logging pipeline. The problem with this approach is that you only receive logs after clients send requests to us, so it isn't really "proactive" monitoring. If the service continuously gets a large amount of traffic this may be fine, but if not, you won't have enough data (logs) to determine the backends' health status.
So instead of relying on streaming logs, we can make a tiny API endpoint in your VCL that returns the values of those backend health variables.
In your VCL, you'll see that top-level backend objects are defined like below:
backend ${backend_name_1} { ... }
backend ${backend_name_2} { ... }
${backend_name_n} is important here. Next, you create a VCL Snippet in the vcl_recv subroutine, and then another VCL Snippet in the vcl_error subroutine.
You can test a live demo here: https://fiddle.fastlydemo.net/fiddle/ac69251f
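For anyone who can't open the fiddle, here is a rough sketch of what the two snippets could look like. This is not the original code from the post. The path /fastly/api/hc-status is taken from the curl example further down; the internal status code 660, the X-HC-* header names, and the JSON keys are assumptions picked for illustration. Replace backend_name_1 and backend_name_2 with the backend names that appear in your own generated VCL.

```vcl
# Snippet for vcl_recv: send requests for the status path to vcl_error.
# 660 is an arbitrary internal status code (assumption).
if (req.url.path == "/fastly/api/hc-status") {
  error 660;
}
```

```vcl
# Snippet for vcl_error: build a synthetic response carrying the health status.
if (obj.status == 660) {
  set obj.status = 200;
  set obj.response = "OK";
  set obj.http.Content-Type = "application/json";
  set obj.http.Cache-Control = "no-store";

  # Hypothetical header names, one per backend: "1" = healthy, "0" = unhealthy.
  set obj.http.X-HC-Backend-1 = if(backend.backend_name_1.healthy, "1", "0");
  set obj.http.X-HC-Backend-2 = if(backend.backend_name_2.healthy, "1", "0");

  # Same values again in a small JSON body.
  synthetic {"{"backend_name_1": "} + obj.http.X-HC-Backend-1 + {", "backend_name_2": "} + obj.http.X-HC-Backend-2 + {"}"};
  return(deliver);
}
```

With something like this in place, a request to /fastly/api/hc-status should come back with a body along the lines of {"backend_name_1": 1, "backend_name_2": 0} plus the matching X-HC-* response headers. The status is duplicated into headers on purpose, so it can also be read by tools that only look at response headers.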
What we've done here is create an API endpoint that returns a JSON body containing your backend health status. Now you can point any third-party monitoring tool at this endpoint to monitor your backend health status as seen from a Fastly POP.
But this solution is still not perfect. Because we have multiple POPs around the world, you can only check a single path (POP A <-> backend), the one from the POP closest to your monitoring server. You can't see the health status from the other POPs (POP B <-> backend, POP C <-> backend, etc.), which you likely want to monitor as well.
Luckily, there's still a trick to achieve that. Most of you probably haven't used it before, but we have a cool API called "edge check":
https://developer.fastly.com/reference/api/utils/content/
/content/edge_check
- GET
Retrieve headers and MD5 hash of the content for a particular URL from each Fastly edge server.
This is the exact same functionality as the "Check cache" button in the Fastly service configuration UI. It sends a request for the specified URL to all Fastly POPs and returns the headers and an MD5 hash of the content.
"headers"... yes, this is why we're also adding the health status to the response headers of that synthetic response in the snippet above. So if you call this hand-crafted API endpoint through the edge check API, you get the backend health status from each of our POPs at once. Then you should be able to parse the JSON response and retrieve each status.
$ curl "https://api.fastly.com/content/edge_check?url=${hostname}/fastly/api/hc-status" -H "Fastly-Key: ${your_fastly_token}"
-
Hi Yongjik,
Thanks for reaching out. We've just migrated our forum from Discourse to Zendesk community forum. There are a couple of posts we are reviewing to clean up and make easier to view.
I've gone ahead and made some changes to this post to replicate the original post you referenced.
Thanks
Adam
-
We recently had an outage on the OpenStreetMap Standard Tile Layer, which uses reasonably complex backend selection, and the outage was harder to diagnose because we didn't have visibility into the healthchecks as seen by Fastly. In response, we've implemented the above API and added it to our monitoring.
Since no one has described the monitoring part of it yet: we're using a custom Prometheus exporter at https://github.com/openstreetmap/prometheus-exporters/blob/main/exporters/fastly_healthcheck/fastly_healthcheck_exporter that queries the edge_check API, parses the JSON, and returns metrics for each backend as viewed by each POP.
In Grafana, we then have a dashboard which shows, among other stats, avg(fastly_healthcheck_status{host="tile.openstreetmap.org"}) by (backend). Graphing this as a percentage shows availability from different POPs.
I've also written an openstreetmap.org diary post covering this at https://www.openstreetmap.org/user/pnorman/diary/399627