Intermittent IPv6 Reachability Issues for Web App and IPv6-Enabled Agents

Resolved in 2 hours

Incident timeline

Partial outage
Resolved
RCA (Root Cause Analysis): While migrating to a new load balancer for commercial cloud accounts, part of the IPv6 configuration was overlooked. Although the new load balancer had been given resolvable IPv6 records, it lacked a default IPv6 route in its "netplan" configuration. As a result, IPv6 responses could not be returned or forwarded by the load balancer, which affected access to the commercial Influx cloud for the web app and IPv6-enabled agents. The problem was quickly identified, and the missing configuration was added and then tested. Going forward, there will be more automated reachability testing when commissioning new infrastructure to ensure full dual-stack reachability.
2022-10-21 13:00:00 UTC
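The automated dual-stack reachability testing mentioned in the RCA could look something like the following. This is a minimal sketch, not PanSift's actual tooling: the function names `check_family` and `check_dual_stack`, the default port 443, and the timeout are all assumptions made for illustration. It only verifies that a TCP connection can be opened over each address family, which is enough to catch a host that resolves over IPv6 but cannot send replies back.

```python
import socket


def check_family(host, port, family, timeout=3.0):
    """Try to open a TCP connection to host:port over one address family.

    Returns (ok, detail). Hypothetical helper for illustration only.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # No A (IPv4) or AAAA (IPv6) record, or the resolver failed.
        return False, f"resolution failed: {exc}"

    last_error = "no addresses returned"
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                # A host that cannot route its replies (for example, one
                # with no default IPv6 route) would typically time out here.
                sock.connect(sockaddr)
                return True, f"connected to {sockaddr[0]}"
        except OSError as exc:
            last_error = f"{sockaddr[0]}: {exc}"
    return False, last_error


def check_dual_stack(host, port=443, timeout=3.0):
    """Check IPv4 and IPv6 reachability separately, so that a working
    IPv4 path cannot mask a broken IPv6 path."""
    return {
        "ipv4": check_family(host, port, socket.AF_INET, timeout),
        "ipv6": check_family(host, port, socket.AF_INET6, timeout),
    }
```

A commissioning check might call `check_dual_stack("lb.example.com")` (a placeholder hostname) and refuse to proceed with a DNS cutover unless both entries report success. Testing each family on its own matters because clients that fall back to IPv4 can otherwise hide an IPv6 fault.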
Partial outage
Identified
At PanSift we try to embrace and use IPv6 as the primary transport. Today we failed to spot an issue with the IPv6 system configuration on a new load balancer, which caused intermittent reachability issues for the web application. IPv6-enabled agents buffered their data during the brief outages and uploaded it once full IPv6 connectivity was restored. IPv4 agent and web app connectivity was unaffected.

Time of Issue: 12 UTC - 14 UTC (intermittent reachability)

Impact (Detail): Web dashboards, agent graphing/reporting, and IPv6-based agent ingestion were intermittently unavailable for a total of 15 minutes between 12 UTC and 14 UTC due to a scheduled change. This change applied only to paid accounts using the `ingest5` datastore, and no data was adversely affected or lost. Agents buffered data whenever they could not write via IPv6, and graphs on the web app timed out.

Investigation (Detail): The investigation identified the trigger as the DNS update that cut traffic over to the new load balancer. DNS TTLs had been set low (60s) before the migration, so once the issue was discovered, the DNS entries were rapidly rolled back (and subsequently rolled forward again to test once the additional system-level configuration had been added). This highlighted an IP resolution and reachability issue for IPv6 traffic from the web app to the new load balancer and thus onwards to the cloud-based time series database. IPv4 traffic between agents, time series databases, and the web app was unaffected at all times. Affected agent data was either redirected to the previous load balancer once the low-TTL DNS rollback took effect or buffered for later upload.
2022-10-21 11:00:00 UTC
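The buffer-then-upload behaviour described for the agents can be illustrated with a short sketch. This is not the PanSift agent's implementation, which the report does not describe: the class name `BufferedUploader`, the `send` callable, and the unbounded in-memory queue are all assumptions made for illustration.

```python
from collections import deque


class BufferedUploader:
    """Queue records locally and upload them in order when the path is up.

    Hypothetical illustration: `send` is any callable that raises OSError
    (for example ConnectionError or a socket timeout) when the upload fails.
    """

    def __init__(self, send):
        self._send = send
        self._buffer = deque()  # unbounded here; a real agent may cap this

    def submit(self, record):
        """Queue a record, then try to flush everything queued so far.

        Returns the number of records uploaded by this call.
        """
        self._buffer.append(record)
        return self.flush()

    def flush(self):
        """Upload queued records oldest-first, stopping at the first failure."""
        sent = 0
        while self._buffer:
            record = self._buffer[0]
            try:
                self._send(record)
            except OSError:
                # Path is down: keep the record and retry on a later call.
                break
            # Remove the record only after a successful upload.
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        """Number of records still waiting to be uploaded."""
        return len(self._buffer)
```

Because a record is removed from the queue only after `send` succeeds, a failed write leaves the data in place for the next attempt, which matches the report's observation that no data was lost during the intermittent outages.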