Custom Pod Controller
=====================

Kubernetes uses the Controller pattern to align the current cluster state
with the desired one.

Stateful applications are usually managed with the ``StatefulSet``
controller, which creates and reconciles a set of Pods built from the same
specification, and assigns them a sticky identity.

CloudNativePG implements its own custom controller to manage PostgreSQL
instances, instead of relying on the ``StatefulSet`` controller.
While bringing more complexity to the implementation, this design choice
gives the operator more flexibility in how it manages the cluster, while
remaining transparent about the topology of PostgreSQL clusters.

Like most design choices, this one involves trade-offs. The following
sections discuss a few points where we believe this design choice has made
the implementation of CloudNativePG more reliable and easier to understand.

PVC resizing
------------

This is a well-known limitation of ``StatefulSet``: it does not support
resizing PVCs. This is inconvenient for a database, because resizing volumes
requires convoluted workarounds.

In contrast, CloudNativePG leverages the configured storage class to
manage the underlying PVCs directly, and can handle PVC resizing if
the storage class supports it.

Primary Instances versus Replicas
---------------------------------

The ``StatefulSet`` controller is designed to create a set of Pods
from just one template. Given that we use one ``Pod`` per PostgreSQL
instance, we have two kinds of Pods:

1. primary instance (only one)
2. replicas (multiple, optional)

This difference is relevant when deciding the correct deployment strategy to
execute for a given operation.

Some operations should be performed on the replicas first,
and then on the primary, but only after an updated replica is promoted
as the new primary.
This is the case, for example, when you apply a different PostgreSQL image
version, or when you increase configuration parameters like
``max_connections`` (which are treated specially by PostgreSQL because
CloudNativePG uses hot standby replicas).

While doing that, CloudNativePG considers the PostgreSQL instance’s
role, and not just its serial number.

Sometimes the operator needs to follow the opposite process: work on the
primary first and then on the replicas. This happens, for example, when you
lower ``max_connections``. In that case, CloudNativePG will:

- apply the new setting to the primary instance
- restart it
- apply the new setting on the replicas

The ``StatefulSet`` controller, being application-independent, can’t
incorporate this behavior, which is specific to PostgreSQL’s native
replication technology.

Coherence of PVCs
-----------------

PostgreSQL instances can be configured to work with multiple PVCs: this is
how WAL storage can be separated from ``PGDATA``.

The two data stores need to be coherent from the PostgreSQL point of view,
as they’re used simultaneously. If you delete the PVC corresponding to
the WAL storage of an instance, the PVC where ``PGDATA`` is stored will no
longer be usable.

This behavior is specific to PostgreSQL and is not implemented in the
``StatefulSet`` controller, which is not application-specific.

If a user dropped one of the two PVCs, a ``StatefulSet`` would just recreate
it, leading to a corrupted PostgreSQL instance.

CloudNativePG would instead classify the remaining PVC as unusable, and
start creating a new pair of PVCs for another instance to join the cluster
correctly.

Local storage, remote storage, and database size
------------------------------------------------

Sometimes you need to take down a Kubernetes node to do an upgrade.
After the upgrade, depending on your upgrade strategy, the updated node
could come back up, or a new node could replace it.
Supposing the unavailable node was hosting a PostgreSQL instance,
you may prefer one of the following actions, depending on your database
size and your cloud infrastructure:

1. drop the PVC and the Pod residing on the downed node;
   create a new PVC cloning the data from another PVC;
   after that, schedule a Pod for it

2. drop the Pod, schedule the Pod on a different node, and mount
   the PVC from there

3. leave the Pod and the PVC as they are, and wait for the node to
   come back up

The first solution is practical when your database size permits, allowing
you to immediately bring back the desired number of replicas.

The second solution is only feasible when you’re not using the storage of
the local node, and re-mounting the PVC on another host is possible in a
reasonable amount of time (a threshold that only you and your organization
can define).

The third solution is appropriate when the database is big and uses local
node storage for maximum performance and data durability.

The CloudNativePG controller implements all these strategies so that the
user can select the preferred behavior at the cluster level (read the
"Kubernetes upgrade and maintenance" section for details).

Being generic, the ``StatefulSet`` doesn’t allow this level of
customization.
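
As an illustration of selecting one of the behaviors above at the cluster
level, the fragment below is a minimal sketch that assumes the
``nodeMaintenanceWindow`` stanza of the ``Cluster`` resource and its
``inProgress`` and ``reusePVC`` fields. The cluster name, instance count, and
storage size are placeholders, and the mapping of each value to a strategy is
our reading of the options listed above, so verify the exact semantics
against the maintenance documentation for the version you run.

.. code-block:: yaml

   apiVersion: postgresql.cnpg.io/v1
   kind: Cluster
   metadata:
     name: cluster-example        # placeholder name
   spec:
     instances: 3                 # placeholder: one primary, two replicas
     storage:
       size: 1Gi                  # placeholder size
     nodeMaintenanceWindow:
       # Declares that a node maintenance operation is under way.
       inProgress: true
       # Assumed mapping: true keeps the existing PVC and waits for it to be
       # usable again (closest to the third action above); false lets the
       # operator discard the PVC and recreate the instance elsewhere
       # (closest to the first action above).
       reusePVC: true

With network-attached storage, keeping the PVC may also allow the Pod to be
rescheduled on another node, which corresponds to the second action; whether
that happens depends on the storage class and is not something this fragment
controls.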