High Availability is one of TraefikEE's core features. Because TraefikEE runs on top of various infrastructures and orchestration engines, some of them impose specific limitations on the SLA that TraefikEE can achieve.
In Kubernetes, healthchecks are performed by the kubelet. While the pods behind a service are healthchecked every few seconds by the kubelet, it can take up to 45 seconds until Kubernetes realises that a kubelet itself is down. Because Kubernetes does not react to that situation until it knows that the kubelet is down, this might have an impact on your SLA.
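To get a feel for what such a detection window can mean, the following is a rough back-of-the-envelope sketch. It assumes that traffic is spread evenly across all nodes and that every request routed to a failed node during the detection window fails; both are simplifying assumptions, and the function name and parameters are hypothetical.

```python
def requests_at_risk(detection_window_s: float,
                     requests_per_s: float,
                     total_nodes: int,
                     failed_nodes: int = 1) -> float:
    """Estimate how many requests hit failed nodes before the failure is detected.

    Assumes an even traffic split across nodes (an illustrative simplification).
    """
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    # Share of traffic that lands on failed nodes, multiplied by the
    # number of requests arriving during the detection window.
    return detection_window_s * requests_per_s * failed_nodes / total_nodes


# Example: 45 s detection window, 100 requests/s, 1 of 3 nodes down.
print(requests_at_risk(45, 100, 3))
```

Real numbers depend on how your load balancer distributes traffic and whether clients retry, so treat the result as an order-of-magnitude estimate only.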
However, there are some possible workarounds, depending on the current setup:
- Modifying the Kubernetes configuration and/or adding tooling that improves Kubernetes' internal health monitoring, such as node-problem-detector.
- Customizing the TraefikEE installation by exposing the data nodes directly, without using a LoadBalancer service. This allows a direct connection to TraefikEE, whose availability can potentially be checked faster from an external load balancer, as it bypasses the Kubernetes internal route update mechanism to the TraefikEE nodes.
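As an illustration of the second workaround, the sketch below shows the kind of fast TCP probe an external load balancer could run against directly exposed data nodes. It is a minimal example, not how any particular load balancer implements its checks; the function names, host addresses, and port are hypothetical.

```python
import socket
from typing import Iterable, List, Tuple

Backend = Tuple[str, int]


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def healthy_backends(backends: Iterable[Backend], timeout: float = 1.0) -> List[Backend]:
    """Keep only the backends that currently accept TCP connections."""
    return [(host, port) for host, port in backends if tcp_check(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical node addresses and fixed data node port.
    nodes = [("10.0.0.11", 8443), ("10.0.0.12", 8443)]
    print(healthy_backends(nodes, timeout=0.5))
```

With a short timeout and a check interval of a second or two, a probe like this can notice an unreachable node well before the orchestrator does, which is the point of bypassing the internal route update mechanism.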
However, customizing the installation has a couple of limitations:
- It is not possible to deploy multiple TraefikEE clusters on the same Kubernetes cluster, should you need to.
- TraefikEE requires a manual installation instead of using the
- In that manual installation, you have to assign fixed ports to the data node pods, and these ports have to be available on each node.
- It will make scaling your TraefikEE data nodes more difficult.
In Swarm, healthchecks for the swarm nodes themselves are performed by the manager. Typically, Swarm recognizes a worker being down within 15 seconds. Docker Swarm recognizes a missing replica much faster, within about 5 seconds.
In the default Swarm setup, high availability is achieved differently: the TraefikEE data plane is published via the Ingress Routing Mesh, which handles the incoming requests. This routing mesh is composed of IPTables and IPVS, which provide healthcheck endpoints and TCP connection retries. If a swarm node holding a TraefikEE data plane container dies, all in-flight requests to that node fail as well. Luckily, the internal IPVS picks up failing requests and attempts to redeliver them to a different node. This means that while a replica is missing or failing, some requests might be retried internally. However, as long as there is at least one available node in the data plane, a node will answer the retried requests and no requests will be dropped from the user's perspective.
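The retry behaviour described above can be modelled with a small simulation. This is only a conceptual sketch of "try another node until one answers"; it does not reproduce how IPVS actually schedules or retries connections, and the node names and the `deliver` helper are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def deliver(nodes: Sequence[str],
            is_alive: Callable[[str], bool]) -> Tuple[Optional[str], int]:
    """Try each node in order until one answers.

    Returns the node that answered (or None if none did) and the
    number of delivery attempts that were made.
    """
    attempts = 0
    for node in nodes:
        attempts += 1
        if is_alive(node):
            return node, attempts
    return None, attempts


# Example: node-a has died, node-b and node-c are still available.
alive = {"node-b", "node-c"}
print(deliver(["node-a", "node-b", "node-c"], alive.__contains__))
```

In this model the request costs one extra attempt but still succeeds, which mirrors the statement that requests are retried internally rather than dropped as long as one data plane node remains available.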
Alternatively, it is also possible to publish the data nodes' ports directly with a host bind. This allows your external load balancer to communicate directly with the nodes and provides better performance. However, you will lose benefits such as the port being available on every node.