Kubernetes Fix - One-Line Change Saves 600 Hours Annually
In short: a one-line Kubernetes configuration change cut restart times for a tool from 30 minutes to about 30 seconds.
A one-line fix in Kubernetes cut restart times for Atlantis from 30 minutes to roughly 30 seconds, reclaiming about 600 hours of engineering time per year. Teams managing large persistent volumes should check their configurations for the same bottleneck.
What Happened
A significant bottleneck was discovered in the Kubernetes setup for Atlantis, a tool used for managing Terraform projects. Each restart of Atlantis took an astonishing 30 minutes, blocking engineering efforts and causing frustration. This issue arose from how Kubernetes handles volume permissions, particularly as the persistent volume grew in size. With around 100 restarts per month, the cumulative downtime added up to over 50 hours of lost productivity every month.
The problem was traced back to the fsGroup setting in Kubernetes. When fsGroup is set, Kubernetes recursively changes the group ownership and permissions of every file on the persistent volume at mount time, so mount latency grows with file count. As the number of files on the volume grew into the millions, this ownership pass came to dominate each restart.
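Conceptually, the default behavior amounts to a full recursive walk of the volume on every mount, touching each entry once. The following Python sketch (the directory layout and counts are illustrative, not from the source) shows why the cost scales linearly with the number of files:

```python
import os
import tempfile

def recursive_ownership_pass(root: str) -> int:
    """Walk every entry under root, as the kubelet does when the
    fsGroup change policy is 'Always'; return the number of entries
    touched. On a real volume, each entry would get an os.chown call."""
    touched = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            _ = os.path.join(dirpath, name)
            # A real pass would chown/chmod here; with millions of
            # files, this loop is what consumed the 30-minute restarts.
            touched += 1
    return touched

# Build a small illustrative tree: 10 directories x 100 files each.
root = tempfile.mkdtemp()
for d in range(10):
    sub = os.path.join(root, f"dir{d}")
    os.makedirs(sub)
    for f in range(100):
        open(os.path.join(sub, f"file{f}"), "w").close()

print(recursive_ownership_pass(root))  # 1010 entries: 10 dirs + 1000 files
```

The work is strictly proportional to the number of entries, which is why the delay grew with the volume rather than with the size of any change being applied.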
Who's Affected
The engineering teams using Atlantis bore the brunt of this inefficiency. With frequent credential rotations and onboarding processes, the delays in restarting the tool hampered their ability to make timely infrastructure changes. On-call engineers were also paged on every restart, adding alert noise on top of the lost time.
This situation highlights a common challenge faced by teams managing large-scale Kubernetes workloads. As persistent volumes grow, the default settings in Kubernetes can become bottlenecks that slow down operations, affecting overall productivity and response times.
What Data Was Exposed
No sensitive data was exposed by this issue. The cost was operational: slow restarts meant teams could not apply infrastructure changes promptly, and delayed updates to security credentials or configurations increased the risk of disruption.
The fix was to set fsGroupChangePolicy to OnRootMismatch in the pod's securityContext. With this policy, Kubernetes skips the recursive ownership pass unless the volume's root directory has the wrong ownership or permissions, which makes restarts fast in the common case.
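Applied to a pod spec, the change is a single field alongside the existing fsGroup setting (the group ID 1000 here is illustrative, not taken from the source):

```yaml
spec:
  securityContext:
    fsGroup: 1000
    # Default is "Always", which recursively chowns every file on mount.
    # "OnRootMismatch" skips the pass when the volume root already matches.
    fsGroupChangePolicy: "OnRootMismatch"
```

fsGroupChangePolicy is a standard field of PodSecurityContext and only applies to volume types that support fsGroup-based ownership management.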
What You Should Do
For teams experiencing similar issues, audit your Kubernetes configurations, especially for workloads with large persistent volumes. Review the securityContext settings, particularly fsGroup and fsGroupChangePolicy; this one-line change can dramatically improve restart times.
If your team is frequently restarting Kubernetes pods and facing delays, investigate whether permission changes are causing slowdowns. The adjustment made in this case has reclaimed nearly 600 hours of productive work annually, underscoring the importance of optimizing Kubernetes settings as workloads scale. Remember, not every fix needs to be complex; often, it’s about asking the right questions and understanding system behavior.
Cloudflare Blog