Partial outage
Identified
At PanSift we try to embrace and use IPv6 as the primary transport. Today we failed to spot an issue with the IPv6 system configuration on a new load balancer, which caused intermittent reachability issues for the web application. IPv6-enabled agents buffered their data during the brief outages and uploaded it once full IPv6 connectivity was restored. IPv4 connectivity for agents and the web app was unaffected.
Time of Issue: 12 UTC - 14 UTC (intermittent reachability)
Impact (Detail): Web dashboards, agent graphing/reporting, and IPv6-based agent ingestion were intermittently unavailable for a total of 15 minutes between 12 UTC and 14 UTC due to a scheduled change. The change applied only to paid accounts using the `ingest5` datastore, and no data was lost or otherwise adversely affected. Agents buffered data whenever they could not write via IPv6, and graphs on the web app timed out.
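For readers unfamiliar with the buffering behaviour described above, here is a minimal store-and-forward sketch in Python. It is not PanSift's agent code: the class name `BufferedUploader`, the `send` callable, and the choice of an unbounded in-memory queue are all assumptions made for illustration.

```python
from collections import deque


class BufferedUploader:
    """Hypothetical store-and-forward buffer (illustrative, not PanSift's agent)."""

    def __init__(self, send):
        # `send` is an assumed callable that uploads one sample and raises
        # OSError when the ingestion endpoint cannot be reached.
        self._send = send
        # Unbounded for simplicity; a real agent would likely cap this.
        self._pending = deque()

    def submit(self, sample):
        """Queue a sample, then try to drain the queue. Returns samples sent."""
        self._pending.append(sample)
        return self.flush()

    def flush(self):
        """Upload queued samples oldest first, stopping at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # endpoint unreachable: keep the sample for a later attempt
            self._pending.popleft()  # remove only after a successful upload
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

A sample is removed from the queue only after its upload succeeds, so a failed write leaves it in place for the next attempt and upload order is preserved.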
Investigation (Detail): The investigation identified the trigger as a DNS update that cut traffic over to a new load balancer. DNS TTLs had been set low (60s) before the migration, so once the issue was discovered the DNS entries were rapidly rolled back (and later rolled forward again for testing once the additional system-level config had been added). This highlighted an IP resolution and reachability issue for IPv6 traffic from the web app to the new load balancer, and thus onwards to the cloud-based time series database. IPv4 traffic between agents, time series databases, and the web app was unaffected at all times. During the brief outages, affected agent data was either directed to the previous load balancer once the low-TTL DNS records were rolled back, or buffered for later upload.
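A per-address-family reachability check run before a DNS cutover is one way to surface this class of issue. The sketch below is an illustration only, not the tooling used in this incident. The function name `check_reachability`, its result strings, and the injectable `resolve` and `connect` parameters are hypothetical; only `socket.getaddrinfo` and the standard socket calls come from Python's standard library.

```python
import socket

FAMILIES = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}


def _tcp_connect(family, sockaddr, timeout):
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)


def check_reachability(host, port, timeout=3.0,
                       resolve=socket.getaddrinfo, connect=_tcp_connect):
    """Return a status for each address family, resolved and tested separately.

    Each value is "reachable", "unreachable", or "no-address".
    """
    results = {}
    for name, family in FAMILIES.items():
        try:
            infos = resolve(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            infos = []  # no A/AAAA record for this family
        if not infos:
            results[name] = "no-address"
            continue
        status = "unreachable"
        for *_unused, sockaddr in infos:
            try:
                connect(family, sockaddr, timeout)
                status = "reachable"
                break
            except OSError:
                continue  # try the next address for this family
        results[name] = status
    return results
```

Calling `check_reachability("lb.example.com", 443)` (a placeholder hostname) against the new load balancer before switching DNS would report IPv4 and IPv6 separately, so a host that answers only over IPv4 is not mistaken for a healthy one.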
2022-10-21 11:00:00 UTC