K8s Start Up Race Conditions

QUICK FIX #0: Kubernetes Health and Readiness Probes Race Conditions

In QUICK FIX posts I narrowly focus on issues that I've come across while working or hobbying in software development in the wild. The issues could be bugs; nuances of the program or service in question; or simply my own misunderstanding. Or something in between. The goal is to quickly help out people who run into the same problem.

When Kubernetes pods start up, their health state is assessed by probes. This allows K8s to perform management operations: it ensures that pods which aren't up or ready do not receive traffic, and that pods can be restarted to restore a healthy state if they fall ill.

The health (liveness) probe asks the pod if it is healthy, usually meaning that the server is up and its endpoints can be hit.

The readiness probe asks the pod if it is ready. An application may be healthy but still busy with some other process, such as data loading or dependency checking, and as such wouldn't yet be fully operational.
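For illustration, both probes are declared on the container in the Deployment spec, roughly like this (a minimal sketch assuming an HTTP service; the /healthz and /ready paths and port 8080 are placeholders for whatever your application actually exposes):

```yaml
# Minimal probe sketch; paths and port are placeholders for your own service.
livenessProbe:
  httpGet:
    path: /healthz   # answers "is the server up at all?"
    port: 8080
readinessProbe:
  httpGet:
    path: /ready     # answers "is the app done loading and able to take traffic?"
    port: 8080
```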

Recently I was hitting an error where pods could not start. When pods are failing to start, you can query the reason:
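For example, something along these lines pulls up the events and the last container state for a failing pod (the pod name and namespace are placeholders):

```sh
# Show events and the last container state (Reason / Exit Code) for the pod
kubectl describe pod <pod-name> -n <namespace>

# Or pull just the last-state fields from the pod status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState}'
```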

From this we found a 'getsockopt' error, with Reason: 'Error' and Exit Code: '137'.

After googling around I found that this exit code appears when the pod runs out of memory. However, I didn't think that was the cause here, because every time my pods had run out of memory I got this code together with the message 'OOMKilled'.

I then found that there are some issue tickets raised for ‘getsockopt’ and amidst the various root causes, one was around probes failing.

The probes create a race condition for the pod on start-up.

A 'race condition' is where a system has dependent events: certain events must occur before others, but there is no execution control enforcing that order. The race is that event A must complete before event B is executed, and nothing guarantees that it will.

Our race condition occurs while a pod is booting up: it must finish starting and be able to indicate a healthy state before the probes begin to fire. If the liveness probe keeps failing during that window, the kubelet kills and restarts the container, which is how you end up with Exit Code 137 (SIGKILL) without any 'OOMKilled' message.

As such, you have two choices:

1. Allocate higher quantities of CPU to your Deployments so that the boot process is faster and the pod is up in time for the liveness and readiness checks.

2. Set a longer initial wait time on the liveness and readiness probes, and extend the failure threshold and the check interval. This should give your service plenty of time to boot up and become ready.

If you raise CPU, this can lead to an overallocation of CPU after boot. This could be handled with vertical pod autoscaling, allowing the pod to request more CPU when needed. But most of us will already be using horizontal pod autoscaling, which is easier to manage with a static CPU configuration. (See my post on k8s sizing: https://medium.com/pareture/k8s-reusable-cross-env-microservice-sizing-template-d03dd8bfebf2 )
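For reference, option 1 comes down to raising the CPU request (and possibly limit) in the container spec; the values below are purely illustrative, not a recommendation:

```yaml
# Purely illustrative sizing; tune for your own workload.
resources:
  requests:
    cpu: "500m"     # more CPU at boot means the app comes up faster
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
```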

Below is a deployment.yaml setting which waits 1 minute before the first check, then performs 5 additional checks 30 seconds apart, giving the pod a total of 3.5 minutes to become ready before the container is restarted. It will succeed on the first positive response, and each probe times out after 3 seconds. Both the readiness and liveness probes are the same in this case.
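A sketch of what that looks like in the container spec (the httpGet path and port are placeholders; the timing values follow the description above):

```yaml
# Probe timings as described above; path and port are placeholders.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60   # wait 1 minute before the first check
  periodSeconds: 30         # then check every 30 seconds
  failureThreshold: 6       # the first check plus 5 more, ~3.5 minutes in total
  successThreshold: 1       # a single success is enough
  timeoutSeconds: 3         # each probe times out after 3 seconds
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 6
  successThreshold: 1
  timeoutSeconds: 3
```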

Even if the CPU allocation does need to be higher, it is important to know these settings and what they are doing for you.

Obviously there could be other reasons for the probe failing and ending up with this type of error, but this is what solved it for me. I hope it can be useful for others as well.
