Automated Maintenance¶
VMs perform automated maintenance activities in announced maintenance windows. Typical activities are system updates and reboots necessary to activate VM property changes like memory size and the number of CPUs.
When new activities are scheduled by our central VM directory, a mail is sent out to technical contacts with information about what’s happening and when.
Activities can be merged into existing activities, updating them. For significant changes, like additional service restarts, a mail is sent again and the activity might get rescheduled to a later time. For small changes and cancelled activities, no mail is being sent currently.
The fc-agent
service executes due activities which happens every 10 minutes.
Some activities require a reboot which is done at the end of an agent run
after executing all activities.
Activities that are overdue (more than 30 minutes after planned time) are postponed for at least 8 hours and scheduled again.
Before executing activities, the machine is put into maintenance mode (it’s not in service) to prevent triggering false alarms for expected service interruptions during maintenance.
Maintenance is scheduled in a way so activities on different VMs shouldn’t run at the same time but this is not enforced by default. The execution of activities can be delayed for various reasons so activities on different VMs may overlap.
Additional Maintenance Constraints¶
To make sure that VMs don’t execute activities at the same time, possibly affecting
availability of a redundant system, the NixOS option
flyingcircus.agent.maintenanceConstraints.machinesInService
can be used.
This means that the specified machines from the same resource group have to be in service (not in maintenance mode) when the machine tries to enter maintenance mode. The constraint is checked shortly after entering maintenance mode, before executing activities. If it’s not met, due activities are postponed to a later time and the machines leaves maintenance mode immediately.
For the following example, assume that the VMs example10
, example11
and
example12
are running redundant instances of an application and we want at
least two of the instances in service at any time.
This is enforced by this config, which has to be placed on each machine:
# /etc/local/nixos/maintenance_settings.nix
{ config, ... }:
{
flyingcircus.agent.maintenanceConstraints.machinesInService = [
"example10"
"example11"
"example12"
];
}
The name of the current machine is ignored, so the config can be the same on all machines.