Monitoring health status from Fastly health checks

Comments

7 comments

  • Malte Lauenroth

    Hi Doc, thanks, any method would be fine for me. For now we run a local health check routine and some extra load monitoring.

    Cheers

  • Chris Usick

    Hi, has there been any development on this feature @drwilco? I'm working on the exact same thing.

    Thanks,

    Chris

  • rmharrison

    @drwilco Any word on Fastly monitoring for health checks?

    Being unable to view health check status makes debugging, especially real-time debugging during a 3rd party outage, a real pain.

  • Shohei smaeda

    Hi rmharrison, Chris Usick, Malte Lauenroth, and all,

    Regarding this topic, yes, we're aware of your needs as this is one of the frequently asked questions. We have no ETA to share at the moment, but have an internal feature request ticket and the product team is working on it.

    In the meantime, I'd like to share some tips on how you can monitor backend health status using VCL.

    https://developer.fastly.com/reference/vcl/variables/backend-connection/req-backend-healthy/

    req.backend.healthy - Read Only
    Whether or not the backend has been marked unhealthy by healthchecks.

    https://developer.fastly.com/reference/vcl/variables/backend-connection/backend-name-healthy/

    backend.{NAME}.healthy - Read Only
    Whether a particular backend is healthy.

    We can use these variables. Some of you may have thought about (or tried) adding them to your custom logging format and monitoring them through your logging pipeline. The problem with this approach is that you only receive logs after clients send requests to us, so it doesn't really achieve "proactive" monitoring. If the service continuously gets a large amount of traffic, that may be fine, but otherwise you won't have enough data (logs) to identify the backends' health status.

    So instead of relying on the streaming logs, we can make a tiny API endpoint in your VCL that returns the values of those backend health variables.

    In your VCL, backends are defined as top-level objects like this:

    backend ${backend_name_1} { ... }
    backend ${backend_name_2} { ... }
    

    The ${backend_name_n} names are important here. Next, create the following VCL snippet in the vcl_recv subroutine:
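
    The snippet itself is not shown here. A minimal sketch of what it could look like, assuming the /fastly/api/hc-status path used in the curl example further down and a custom status code to hand the request over to vcl_error (600 is only an illustrative choice; any otherwise unused code works, as long as vcl_error checks for the same one):

    ```vcl
    # vcl_recv snippet (sketch, not the original code)
    # Short-circuit requests for the health-status path so that they are
    # answered by vcl_error instead of being sent to a backend.
    if (req.url.path == "/fastly/api/hc-status") {
      error 600;  # 600 is an arbitrary, assumed custom status code
    }
    ```

    In practice you would probably also want to restrict who can call this path, since it exposes backend names and their state.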

    and then create the following VCL snippet in the vcl_error subroutine:
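
    This snippet is not shown here either. A minimal sketch, assuming vcl_recv raises a custom status code for the health-status path (600 here, an arbitrary choice), that F_origin_1 and F_origin_2 are hypothetical stand-ins for your own ${backend_name_n} values, and that the X-HC-* header names are made up for illustration:

    ```vcl
    # vcl_error snippet (sketch, not the original code)
    if (obj.status == 600) {
      set obj.status = 200;
      set obj.response = "OK";
      set obj.http.Content-Type = "application/json";

      # Expose each backend's health as a response header:
      # "1" means healthy, "0" means unhealthy.
      set obj.http.X-HC-F-origin-1 = if(backend.F_origin_1.healthy, "1", "0");
      set obj.http.X-HC-F-origin-2 = if(backend.F_origin_2.healthy, "1", "0");

      # Repeat the same values in a small JSON body,
      # e.g. {"F_origin_1":1,"F_origin_2":0}
      synthetic {"{"F_origin_1":"} + obj.http.X-HC-F-origin-1 + {","F_origin_2":"} + obj.http.X-HC-F-origin-2 + "}";
      return(deliver);
    }
    ```

    Putting the status into headers as well as the body means it stays visible to tools that only report response headers.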

    You can test a live demo here: https://fiddle.fastlydemo.net/fiddle/ac69251f

    What we've done here is create an API endpoint that returns a JSON body containing your backend health status. Now you can point any third-party monitoring tool at this endpoint and monitor your backend health status as seen from a Fastly POP.

    But this solution is still not perfect. Because we have multiple POPs around the world, you can only check a single path (POP A <-> backend), from the POP closest to your monitoring server. You can't see the health status from the other POPs (POP B <-> backend, POP C <-> backend, etc.), which you likely want to monitor as well.

    Luckily, there's still a trick to achieve that. Most of you probably haven't used it before, but we have a cool API called "edge check":
    https://developer.fastly.com/reference/api/utils/content/

    /content/edge_check - GET 
    Retrieve headers and MD5 hash of the content for a particular URL from each Fastly edge server

    This is the exact same functionality as the "Check cache" button in the Fastly service configuration UI. It sends a request for the specified URL to all Fastly POPs and returns the headers and MD5 hash of the content.

    "Headers"... yes, this is why the snippet above also adds the health status to the response headers of that synthetic response. So if you call this hand-crafted endpoint through the edge check API, you get the backend health status from each of our POPs at once. You can then parse the JSON response and retrieve each status.

    $ curl 'https://api.fastly.com/content/edge_check?url=${hostname}/fastly/api/hc-status' -H 'Fastly-Key: ${your_fastly_token}'
  • Yongjik Kim

    Hi there,

    Could somebody fix the formatting of Shohei's (incredibly helpful) response above? It used to look great, but I think there was some rendering backend change. Now I can hardly read the code...

    - Yongjik

  • Adam

    Hi Yongjik,

    Thanks for reaching out. We've just migrated our forum from Discourse to a Zendesk community forum. We are reviewing a couple of posts to clean them up and make them easier to read.

    I've gone ahead and made some changes to this post to replicate the original post you referenced.

    Thanks

    Adam

  • Paul Norman

    We recently had an outage on the OpenStreetMap Standard Tile Layer, which uses reasonably complex backend selection, and the outage was harder to diagnose because we didn't have visibility into the healthchecks as seen by Fastly. In response, we've implemented the above API and added it to our monitoring.

     

    Since no one has described the monitoring part of it yet, we're using a custom Prometheus exporter at https://github.com/openstreetmap/prometheus-exporters/blob/main/exporters/fastly_healthcheck/fastly_healthcheck_exporter that queries the edge_check API, parses the JSON, and returns metrics for each backend as viewed by each POP.

    In Grafana, we then have a dashboard which shows, among other stats, avg(fastly_healthcheck_status{host="tile.openstreetmap.org"}) by (backend). Graphing this as a percentage shows each backend's availability as seen across the POPs.

    I've also written an openstreetmap.org diary post covering this at https://www.openstreetmap.org/user/pnorman/diary/399627
