Flatcar Self-Paced Learning Series: Immutable OS, Boot Process, In‐Place Updates, and Automating Rollback
In this session, we’ll take a close look at Flatcar’s immutability and partition layout, and dissect the operating system’s start-up process. Building on this, we’ll do a deep dive into the update process, run an in-place upgrade, and configure an automated roll-back. Lastly, we’ll discuss Flatcar release channels and the release stabilisation process.
The session will cover immutability, boot, provisioning, and A/B partition layout first. While a bit dry, these are necessary to understand the inner workings of in-place updates.
Prerequisites
The session builds on the session “Basic Operation and Local Testing”. It assumes you
- have created a local test environment.
- are able to start ephemeral Flatcar VMs.
- know how to transpile Butane YAML to Ignition JSON.
- know how to pass Ignition JSON configuration to a VM at launch.
Download a previous OS release
In the Basics session, we downloaded the latest Alpha. Since we want to perform live in-place updates in this session, we need to use a less-than-latest release.
Go to https://www.flatcar.org/releases/#alpha-release , scroll down a bit until you find the second-to-latest release, and click the link corresponding to your host architecture (amd64 or arm64).
As before, download
- flatcar_production_qemu_uefi.sh (which we also need to make executable)
- flatcar_production_qemu_uefi_efi_code.qcow2
- flatcar_production_qemu_uefi_efi_vars.qcow2
- flatcar_production_qemu_uefi_image.img
Or use the bash automation below.
Adjust release and arch to your needs.
# replace with second-latest Alpha release number (or just use as-is, as this release should be "old enough")
release='4230.0.0'
# amd64 or arm64
arch='amd64'
Then run
wget https://alpha.release.flatcar-linux.net/"${arch}"-usr/"${release}"/{flatcar_production_qemu_uefi.sh,flatcar_production_qemu_uefi_efi_code.qcow2,flatcar_production_qemu_uefi_efi_vars.qcow2,flatcar_production_qemu_uefi_image.img}
chmod 755 flatcar_production_qemu_uefi.sh
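Before moving on, it can help to confirm that all four files arrived and that the launcher script is executable. The following is a minimal sketch; the function name check_flatcar_downloads is made up for this example and is not part of any Flatcar tooling.

```shell
# Sketch: report which of the four expected files are missing, empty,
# or (for the launcher script) not executable.
# check_flatcar_downloads is a hypothetical helper name.
check_flatcar_downloads() {
    dl_dir="${1:-.}"
    dl_problem=0
    for dl_file in flatcar_production_qemu_uefi.sh \
                   flatcar_production_qemu_uefi_efi_code.qcow2 \
                   flatcar_production_qemu_uefi_efi_vars.qcow2 \
                   flatcar_production_qemu_uefi_image.img; do
        # -s is true only for files that exist and are not empty
        if [ ! -s "${dl_dir}/${dl_file}" ]; then
            echo "missing or empty: ${dl_file}"
            dl_problem=1
        fi
    done
    if [ ! -x "${dl_dir}/flatcar_production_qemu_uefi.sh" ]; then
        echo "not executable: flatcar_production_qemu_uefi.sh"
        dl_problem=1
    fi
    return "${dl_problem}"
}

# Usage, from the download directory:
#   check_flatcar_downloads . && echo "all files present"
```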
Flatcar Partition Layout
For our first boot, we don’t actually want the update client to interfere with us. By default, it checks for updates regularly, stages them, and reboots. We can avoid that by simply not starting the update client.
...
systemd:
  units:
    - name: update-engine.service
      mask: true
    - name: locksmithd.service
      mask: true
...
Let’s add this to the web service from our “Basics” session! It’s always a good thing to have an actual service running.
Complete Butane YAML for convenience
variant: flatcar
version: 1.0.0
systemd:
  units:
    - name: update-engine.service
      mask: true
    - name: locksmithd.service
      mask: true
    - name: nginx.service
      enabled: true
      contents: |
        [Unit]
        Description=NGINX example
        After=docker.service
        Requires=docker.service
        [Service]
        TimeoutStartSec=0
        ExecStartPre=-/usr/bin/docker rm --force nginx1
        ExecStart=/usr/bin/docker run --name nginx1 --pull always --log-driver=journald --net host docker.io/nginx:1
        ExecStop=/usr/bin/docker stop nginx1
        Restart=always
        RestartSec=5s
        [Install]
        WantedBy=multi-user.target
Then transpile and start.
cat nginx.yaml | docker run --rm -i quay.io/coreos/butane:latest > nginx.json
./flatcar_production_qemu_uefi.sh -i nginx.json -f 12345:80 -- -nographic -snapshot
NOTE We’ll require root access for most of what we do in this session, as we’re introspecting sensitive areas of the system.
Once the VM has finished booting, use
sudo -i
to switch to the root account.
Leave the VM running for interactively exploring the Flatcar OS.
Immutable operating system
All of Flatcar’s binaries reside in /usr.
/usr is on a separate partition, and that partition is strictly read-only.
Everything else is either sym-linked into /usr (like /bin, /sbin, /lib, and /lib64), or generated at first boot (see the tmpfiles step below).
Check it out!
ls -la /
Try creating a file in /usr:
echo 'test' > /usr/testfile
Let’s check out how the OS disk is used. Which partitions of the OS disk are mounted?
mount | grep vda
Wait, / and /oem are there, but not /usr?
Well, this needs a bit of detective work.
First, we can verify /usr is, in fact, based on a partition on /dev/vda:
rootdev -s /usr
returns vda3. But why doesn’t it show up in our mounts?
Let’s check what is actually mounted on /usr:
mount | grep -w /usr
Let’s ignore the systemd-sysext line for now; we’ll elaborate on this in a later session.
So /usr is handled by devicemapper, more specifically
ls -la /dev/mapper/usr
it’s dm-0. Let’s ask the device mapper about it, then:
dmsetup status /dev/dm-0
OOOoohh, it’s a dm-verity device!
- DM-Verity is a special Device Mapper target that is guaranteed to be read-only - in fact, the integrity of the stored blocks is guarded by cryptographic checksums.
- DM-Verity was added to the Linux kernel in 2011 by Netflix and Google, and is used in Chromebooks - which share ancestry with Flatcar.
So let’s see which partition dm-0 is actually using:
veritysetup status usr
Right, it’s /dev/vda3.
So dm-verity inserts itself by means of a device mapper layer between the physical vda3 and what’s mounted on /usr.
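If you’d rather not read the backing device off the screen, the relevant line can be filtered out. This is a small sketch that assumes the status output contains a line starting with "data device:"; the helper name verity_data_device is made up for this example.

```shell
# Sketch: read `veritysetup status` output on stdin and print the value
# of the "data device:" line (assumed label; adjust if your output differs).
# verity_data_device is a hypothetical helper name.
verity_data_device() {
    sed -n 's/^[[:space:]]*data device:[[:space:]]*//p'
}

# Usage on the VM, as root:
#   veritysetup status usr | verity_data_device
```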
For now we have:
- / backed by vda9 - the root partition. This is populated at first boot; we’ll discuss in a second how exactly that happens. There’s also a reason why it is the last partition in the table. Find out more below.
- /oem backed by vda6 contains vendor specific tools (think wa-agent in the Azure image, or amazon-ssm-agent for AWS).
- /usr is a device mapper storage
  - backed by dm-0, the verity layer, which is
  - backed by vda3, the currently active OS partition.
There are other partitions, some of which are reserved and are currently not in use. EFI-SYSTEM, ROOT, USR-A / USR-B, and OEM are the most interesting ones.
Let’s look at the boot process to better understand how these partitions interoperate.
Flatcar Boot Process
---
title: Flatcar Boot Process
---
flowchart LR
EFI1[EFI-SYSTEM
UEFI start-up]
EFI2[EFI-SYSTEM
Grub bootloader: Active partition?]
EFI2@{shape: decision}
EFI3a[EFI-SYSTEM
Kernel + Initrd release #A]
EFI3b[EFI-SYSTEM
Kernel + Initrd release #B]
USR4a[USR-A
user space release #A]
USR4b[USR-B
user space release #B]
ROOT[Pivot to root w/ USR-A or USR-B mounted to /usr]
EFI1 --> EFI2
EFI2 --Partition A is active--> EFI3a --> USR4a --> ROOT
EFI2 --Partition B is active--> EFI3b --> USR4b --> ROOT
The boot process is quite similar to regular Linux start-up, with minor Flatcar specific changes.
1. EFI-SYSTEM, or BIOS-BOOT ==> EFI-SYSTEM on legacy BIOS machines: UEFI (or the Grub BIOS stub on legacy systems) starts, the system performs basic hardware initialisation, then loads...
2. EFI-SYSTEM: Grub, the bootloader. Grub reads its configuration and determines which kernel+initrd to load and which OS (USR) partition to use, based on GPT attributes of both USR-A and USR-B partitions. It loads kernel and initrd into RAM, then starts the kernel (passing the correct USR partition via kernel command line).
3. EFI-SYSTEM: Kernel and init-ramdisk run in memory. This is when Ignition fetches its configuration and executes on it.
4. USR-A or USR-B: Root FS is prepared and set up. /usr is mounted.
5. Ignition finishes, root is switched from the initrd to the root filesystem, and systemd reloads all services.
6. ROOT and USR-*: Regular system services start.
Flatcar’s OS disk (see partition table in our public docs) contains 2 separate partitions for OS user spaces. The respective two kernel+initrd blobs are stored together in the EFI-SYSTEM partition.
Let’s explore ourselves!
Since we’re using qemu (which uses virtio devices), the OS disk is vda.
Let’s list the partitions first.
gdisk -l /dev/vda
USR-A and USR-B both are OS partitions.
One is considered the “active” partition, the other is “spare” and will be used to stage updates.
These partitions contain the whole of the user space.
The corresponding kernel and initrd are stored in the EFI partition mounted on /boot.
Let’s take a look.
ls -la /boot/flatcar/
Currently there’s only one kernel+initrd - vmlinuz-a since we just provisioned a fresh system that never updated.
Which partition is active?
Let’s pretend we’re Grub, the bootloader.
We need to decide which kernel to boot!
For this, we can check which USR partition is the currently active one.
From Flatcar user space we can use the cgpt tool:
cgpt show /dev/vda
and look for the Attr lines of both partitions in the output. The active one should show
Attr: priority=1 tries=0 successful=1
We can see that USR-A has priority, and has booted successfully.
Therefore, the kernel+initrd from vmlinuz-a and user space from USR-A currently make up the OS version we’re running.
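To pick individual values out of such an Attr line in a script, a small filter is enough. The sketch below reuses the line format shown above; attr_value is a made-up helper name.

```shell
# Sketch: extract one attribute (priority, tries, or successful) from a
# single "Attr:" line in the format shown above.
# attr_value is a hypothetical helper name.
attr_value() {
    # $1: the Attr line, $2: the attribute name
    printf '%s\n' "$1" \
        | sed -nE "s/.*Attr:.*$2=([0-9]+)([[:space:]]|\$).*/\1/p"
}

# Usage:
#   attr_value "Attr: priority=1 tries=0 successful=1" successful
```

Feeding it the Attr lines of USR-A and USR-B lets a script compare priorities without a human reading the table.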
Flatcar Provisioning Process
Flatcar’s first boot is special. The system is initialised and user configuration is applied at first boot.
---
title: Flatcar Provisioning
---
flowchart LR
GRUB[Grub
Detects first boot, sets kernel command line]
Ignition1[InitRD: Ignition
fetches vendor + user config]
Ignition2[InitRD: Ignition
Prepares disks and root partition
Resizes root partition]
tmpfiles[InitRD: Systemd-Tmpfiles
Populates root filesystem]
Ignition3[InitRD: Ignition
Applies user configuration
Downloads user files]
root[Systemd reloads
Regular System start-up]
GRUB --> Ignition1 --> Ignition2 --> tmpfiles --> Ignition3 -- pivot-root to OS disk root partition --> root
Populate root: Flatcar’s first boot
First boot is determined by Grub.
It checks for the presence of a file /flatcar/first_boot in the EFI-SYSTEM partition and sets a kernel command line option accordingly.
This file is removed later, after provisioning has finished.
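From the running VM you can look for that flag file yourself. The sketch below assumes the EFI-SYSTEM partition is mounted on /boot, as seen earlier; first_boot_pending is a made-up helper name.

```shell
# Sketch: succeed if the first-boot flag file exists below the given
# mount point of the EFI-SYSTEM partition (default: /boot).
# first_boot_pending is a hypothetical helper name.
first_boot_pending() {
    fb_boot_dir="${1:-/boot}"
    [ -e "${fb_boot_dir}/flatcar/first_boot" ]
}

# Usage on the VM:
#   if first_boot_pending; then echo "provisioning pending"; else echo "already provisioned"; fi
```

On our already provisioned VM the second branch is the expected one, since the file has been removed by then.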
System Provisioning runs from the initrd
If first boot is detected in the initrd, the Ignition provisioning agent is started.
Ignition fetches vendor specific configuration - think username / ssh key, network configuration etc. that you can set up e.g. via the Azure Portal when launching a VM - and “user data”.
User data is expected to be in Ignition JSON format - exactly what we’ve been transpiling to for our web service and “don’t update” configurations.
Ignition initialises storage devices and file systems - which can be customised and modified from user data configuration, as we’ll learn in a later session.
It also resizes the root partition to fill all of the OS disk.
This is the reason why the root partition is at the very end of Flatcar’s OS disk partition list (vda9).
System defaults - tmpfiles that are not temporary
In a second stage, and also from the initrd, a service called systemd-tmpfiles creates all files and directories required in the root filesystem outside of /usr. systemd-tmpfiles is a great tool that suffers from less-than-optimal naming, in that it handles far more than temporary files. systemd-system-files-manager would be a better, though slightly too verbose, name. The misnomer even led to adventurous users inadvertently deleting their home directory, a documentation issue later addressed by systemd maintainers.
If you’d like to check out for yourself how Flatcar uses systemd-tmpfiles, just list the tmpfiles configuration we ship with each release:
ls /usr/lib/tmpfiles.d/
and check them out individually.
For instance, if you’d like to see who’s creating the symlinks from /bin and /sbin into /usr, consult baselayout-usr.conf.
cat /usr/lib/tmpfiles.d/baselayout-usr.conf
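These files can get long. To see only the symlinks a tmpfiles.d file sets up, you can filter for lines whose type starts with L (the tmpfiles.d type for symlinks). This is a minimal sketch; tmpfiles_symlinks is a made-up helper name, and it assumes the usual seven-column layout with the link target in the last column.

```shell
# Sketch: list "path -> target" for every symlink entry (type L, L+, ...)
# in a tmpfiles.d configuration file. Comment lines start with '#' and
# therefore never match.
# tmpfiles_symlinks is a hypothetical helper name.
tmpfiles_symlinks() {
    # $1: path to a tmpfiles.d configuration file
    awk '$1 ~ /^L/ { print $2 " -> " $7 }' "$1"
}

# Usage on the VM:
#   tmpfiles_symlinks /usr/lib/tmpfiles.d/baselayout-usr.conf
```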
Applying user customisation
Lastly, after the “distro” files and directories were created, all file-based user customisations are applied. This includes creating users, groups, and files, and downloading user content specified in Ignition configuration. Systemd units specified in user data will be created and existing units will be modified in accordance with the user’s configuration.
In our configuration above, this includes disabling (masking) the update-engine and locksmithd services, creating a new service unit based on the inline configuration for our NGINX service, and marking that service active.
Pivot Root
After preparing the root partition and rendering all files not shipped in the Flatcar OS image in /usr, the system changes its filesystem root from the in-memory initrd to the actual root filesystem.
At that point, Systemd reloads all service files.
Services and modifications to services (drop-ins, masks, enablement) shipped with Ignition configuration are now considered and become active as the system boots normally.
In-Place Updates
Flatcar OS updates need-to-know
- Automated / unattended. Updates are staged in the background, while the system is running. Since updates need a reboot to activate, various mechanisms for controlling node reboots are provided.
- Atomic. There is no intermediate state (think: half of the new packages were installed, then suddenly there’s a power outage). OS version 1 before reboot, OS version 2 afterwards.
- 100% reversible. You can roll back to the previous version in case of issues, to boot into a known-good environment. Roll-backs are automatable / customisable to your needs, and atomic too.
- Update from any version to any (newer) version. Flatcar can be updated from any previous release to the latest release.
After all that theory we’ll now FINALLY get back to some more hands-on stuff. This is the reason we downloaded a previous OS release. So let’s go and update!
First, open a browser and point it to http://localhost:12345 . Oh yeah, our NGINX demo. It’s still alive!
Now, on Flatcar, unmask and enable the update client update_engine.
Note that while binaries and commands use underscores _, the systemd unit uses a dash -.
Use systemctl to start the client:
systemctl unmask update-engine
systemctl enable --now update-engine
The service now runs in the background and will regularly (default: hourly) check for updates. We can query its status via
update_engine_client -status
It’s most likely idle right now. We can ask it to check for an update:
update_engine_client -check_for_update
It is expected to find an update since we downloaded an old version. We can run
update_engine_client -status
to follow the download process: CURRENT_OP will be UPDATE_STATUS_DOWNLOADING, and PROGRESS will display the download progress in fractions of 1 (e.g. 0.5 equals 50%, 1 equals 100%).
Eventually, CURRENT_OP switches to UPDATE_STATUS_UPDATED_NEED_REBOOT.
This means the update has been verified and stored in the spare partition.
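Instead of re-running the status command by hand, the two fields can be extracted in a script. The sketch below assumes the status output consists of KEY=VALUE lines such as CURRENT_OP=... and PROGRESS=...; status_field and progress_percent are made-up helper names.

```shell
# Sketch: read KEY=VALUE lines on stdin and print the value for key $1.
# Assumes one "KEY=VALUE" pair per line; other lines are ignored.
# status_field is a hypothetical helper name.
status_field() {
    sed -n "s/^$1=//p"
}

# Sketch: convert a PROGRESS fraction (0..1) into a whole percentage.
# progress_percent is a hypothetical helper name.
progress_percent() {
    awk -v p="$1" 'BEGIN { printf "%d\n", p * 100 + 0.5 }'
}

# Usage on the VM (interrupt with Ctrl-C if the update never completes):
#   until [ "$(update_engine_client -status | status_field CURRENT_OP)" \
#           = "UPDATE_STATUS_UPDATED_NEED_REBOOT" ]; do
#       sleep 5
#   done
```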
We can even see the new kernel+initrd stored in the EFI-SYSTEM partition:
ls -la /boot/flatcar/
Let’s check partition attributes while we’re at it:
cgpt show /dev/vda
and we see that now, USR-B has a priority higher than USR-A.
tries=1 is used by the bootloader to check how many tries to boot into that partition are left.
It will be decremented by the bootloader before starting the kernel.
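To make the interplay of priority, tries, and successful more tangible, here is a deliberately simplified model of the selection. It is a sketch based on the description in this session, not Grub’s actual code: it assumes a partition is bootable if it is marked successful or has tries left, and that the bootable partition with the higher priority wins. slot_bootable and pick_slot are made-up names.

```shell
# Simplified model, not the bootloader's real implementation.
# A slot counts as bootable if successful=1 or tries > 0.
slot_bootable() {
    # $1: tries, $2: successful
    [ "$2" -eq 1 ] || [ "$1" -gt 0 ]
}

# Pick between USR-A and USR-B.
# $1 $2 $3: priority, tries, successful of USR-A
# $4 $5 $6: priority, tries, successful of USR-B
pick_slot() {
    a_ok=0
    b_ok=0
    slot_bootable "$2" "$3" && a_ok=1
    slot_bootable "$5" "$6" && b_ok=1
    if [ "$a_ok" -eq 1 ] && { [ "$b_ok" -eq 0 ] || [ "$1" -ge "$4" ]; }; then
        echo USR-A
    elif [ "$b_ok" -eq 1 ]; then
        echo USR-B
    else
        echo none
    fi
}

# Freshly staged update on USR-B (higher priority, one try left):
#   pick_slot 1 0 1  2 1 0
# Same state after a failed first boot (tries used up, never successful):
#   pick_slot 1 0 1  2 0 0
```

In this model the second call falls back to USR-A, which is the roll-back behaviour we will rely on later in this session.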
Before we reboot, let’s note down the OS version and the kernel version we’re on:
cat /etc/os-release
uname -a
Now let’s activate the update:
reboot
After the VM has rebooted, make sure you’re root again (sudo -i), then run
cat /etc/os-release
uname -a
and compare with your notes.
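If you prefer a scripted comparison over handwritten notes, the version can be read from the VERSION_ID field of the os-release file. A minimal sketch follows; os_version is a made-up helper name, and /var/tmp/version-before is an arbitrary file chosen for this example.

```shell
# Sketch: print VERSION_ID from an os-release style file (default:
# /etc/os-release). The file is sourced in a subshell so that its
# variables do not leak into the calling shell.
# os_version is a hypothetical helper name.
os_version() {
    ( . "${1:-/etc/os-release}" && printf '%s\n' "${VERSION_ID}" )
}

# Usage on the VM:
#   before the reboot:  os_version > /var/tmp/version-before
#   after the reboot:   echo "was $(cat /var/tmp/version-before), now $(os_version)"
```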
And check if our service is running on http://localhost:12345 !
Lastly, let’s consult partition table attributes:
cgpt show /dev/vda
We see that USR-B now is active (higher priority than USR-A) and “successful”.
This is because update_engine makes sure the successful attribute is set when it starts.
Critical Services and Updates: Automating Roll-Backs
The above discusses the OS mechanism for booting into new OS versions and declaring the new OS release stable - solely based on the successful start-up of update_engine.
It’s quite easy to build on this and to devise a set-up that ensures critical services come up before a new release is declared stable.
---
title: Flatcar First-boot after update
---
flowchart LR
GRUB[Grub
Selects new version for boot
Decrements tries counter]
OS[OS
Boots from new partition]
SRV[Services start]
UE[update_engine
Starts and marks partition as successful]
GRUB --> OS --> SRV
OS --> UE
We want update_engine to depend on a successful start of our critical services, and when our services fail to start after a timeout, we want a reboot.
Then Grub will fall back to the previous OS version.
The tricky bit is to only apply this process right after an update happened, when we boot into the updated OS for the first time.
Otherwise we risk ending up in a reboot loop when our “critical services” don’t start under regular (non-update) circumstances, which will impede debugging.
Such a dependency chain can be built with systemd units and seamlessly integrated into the generic Flatcar start-up. For this, we want:
1. A check for determining if this is the first boot after an upgrade. If it is not, the check declares the system “healthy” straight away. We can build this in a short shell script from what we’ve learned about Flatcar’s partition attributes above.
2. A health check meta-service that only runs when the check in 1. detected a first boot after an upgrade. Users can make that service depend on their critical services, so it can only start after these services have started. Once all dependencies are satisfied, the update is considered healthy.
3. A trigger for update_engine to only start when either 1. or 2. marked the boot as healthy.
4. A timer that triggers a reboot if neither 1. nor 2. concluded successfully.
---
title: Flatcar First-boot after update
---
flowchart LR
GRUB[Grub
Selects new version for boot
Decrements tries counter]
OS[OS
Boots from new partition]
SRV[Critical Service
required to start before update_engine]
UE[update_engine
Starts and marks partition as successful]
UP[First boot after update?]
UP@{shape: decision}
H[Wait for critical service]
H@{shape: delay}
HEALTHY{{Boot Declared Healthy}}
T[Timer Unit
Waits for e.g. 10 minutes]
T@{shape: delay}
RD[Health check successful?]
RD@{shape: decision}
R[Trigger reboot]
RN[Do nothing]
GRUB --> OS --> SRV
OS --> UP -- No --> HEALTHY
HEALTHY --> UE
HEALTHY --> RD
UP -- Yes, check health --> H --> HEALTHY
SRV -- critical service started successfully --> H
OS --> T --> RD -- no --> R
RD -- yes --> RN
We will use a flag file, /run/first-boot-healthy, to signify that the boot is healthy (i.e. either 1. or 2. above returned successful).
This allows us to flexibly use systemd’s ConditionPathExists unit conditions to wire up our logic as well as a path unit to ultimately trigger the start of update_engine.
Let’s lay this out!
1. Detecting a first boot after an OS upgrade.
We can use a script around cgpt to check if
- we booted from the partition with the highest priority, and
- the successful bit hasn’t been set yet.
Helper script for detecting first boot after upgrade
#!/bin/bash
healthy_flag_file="${1:-/run/first-boot-healthy}"
function get_part_attr() {
    local partition="$1"
    local attribute="$2"
    cgpt show "${partition}" \
        | sed -nE "s/.*Attr:.*${attribute}=([0-9]+)([[:space:]]|\$).*/\1/p"
}
function is_first_boot_after_upgrade() {
    active_part="$(rootdev -s /usr)"
    active_prio="$(get_part_attr "${active_part}" priority)"
    spare_part="$(cgpt find -t flatcar-usr 2>/dev/null | grep -v "${active_part}")"
    spare_prio="$(get_part_attr "${spare_part}" priority)"
    # Is current /usr partition the highest priority?
    # (A previous manual roll-back can cause it not to be)
    if [[ ${active_prio} -le ${spare_prio} ]] ; then
        echo "Active partition '${active_part}' has lower or equal priority ('${active_prio}') than spare ('${spare_part}': '${spare_prio}')."
        return 1
    fi
    echo "Active partition '${active_part}' has highest priority '${active_prio}' (spare '${spare_part}': '${spare_prio}')."
    # Is active partition marked successful already by previous boot?
    if [[ "$(get_part_attr "${active_part}" "successful")" -eq 1 ]] ; then
        echo "Current USR partition '${active_part}' has been marked as successful boot in a previous boot."
        return 1
    fi
    return 0
}
if ! is_first_boot_after_upgrade; then
    echo "No first boot after upgrade detected, quitting."
    touch "${healthy_flag_file}"
    exit 0
fi
echo "First boot after upgrade detected"
The script will generate a file /run/first-boot-healthy only if this is NOT the first boot after an update.
We also need a corresponding service definition to run it.
- name: is-first-boot-after-upgrade.service
  enabled: true
  contents: |
    [Unit]
    Description=Detect if this is a first boot after an OS upgrade.
    [Service]
    ExecStart=/opt/detect-first-boot-after-upgrade.sh
    [Install]
    WantedBy=multi-user.target
2. Force a health check that ensures our critical service is running
If step 1. did detect a first boot after upgrade, the system is not marked healthy yet.
We can define a simple service unit that creates /run/first-boot-healthy.
Users can then make this unit depend on their critical services, so all of these need to start before our unit runs.
Consider this service definition:
- name: first-boot-healtcheck.service
  enabled: true
  contents: |
    [Unit]
    Description=Meta service to mark the first boot after an OS upgrade as healthy.
    After=is-first-boot-after-upgrade.service
    Requires=is-first-boot-after-upgrade.service
    ConditionPathExists=!/run/first-boot-healthy
    [Service]
    ExecStartPre=/usr/bin/echo "All critical services are up, start-up is healthy."
    ExecStart=/usr/bin/touch /run/first-boot-healthy
    [Install]
    WantedBy=multi-user.target
It runs after is-first-boot-after-upgrade.service, and it will only run when /run/first-boot-healthy hasn’t been created yet.
Users could now use
systemd:
  units:
    - name: first-boot-healtcheck.service
      dropins:
        - name: nginx-essential-service.conf
          contents: |
            [Unit]
            Requires=nginx.service
            After=nginx.service
to make sure the health check can only start after NGINX did.
3. Start update_engine only after 1. or 2. succeed
Unit dependencies in systemd itself unfortunately are not flexible enough to map either/or, branches, and branch merge flows. Fortunately, path units can be used to work around this, and to start arbitrary units based on the presence (or creation) of a file.
Let’s add a path unit that starts update_engine for us when our flag file is created:
- name: first-boot-healthy.path
  enabled: true
  contents: |
    [Unit]
    Description=Triggers either after the first boot after an OS upgrade was healthy or if there was no OS upgrade.
    [Path]
    PathExists=/run/first-boot-healthy
    Unit=update-engine.service
    [Install]
    WantedBy=multi-user.target
And ensure it does not start when the flag file does not exist - this effectively covers all wants: and requires: dependencies of other units on update_engine spread across Flatcar.
- name: update-engine.service
  dropins:
    - name: first-boot-healthy-must-exist.conf
      contents: |
        [Unit]
        ConditionPathExists=/run/first-boot-healthy
4. Reboot after timeout if healthy flag was not set
Lastly, we define a timer unit that waits a set amount of time after systemd started, before starting a service which, if /run/first-boot-healthy does not exist, triggers a reboot.
- name: reboot-after-unhealthy-upgrade.timer
  enabled: true
  contents: |
    [Unit]
    Description=Triggers a reboot (causing a rollback) when the OS is unhealthy after an upgrade
    [Timer]
    OnStartupSec=60
    [Install]
    WantedBy=timers.target
- name: reboot-after-unhealthy-upgrade.service
  contents: |
    [Unit]
    Description=Triggers a reboot (causing a rollback) when the OS is unhealthy after an upgrade
    ConditionPathExists=!/run/first-boot-healthy
    [Service]
    ExecStartPre=/usr/bin/echo "WARNING: unclean boot detected after OS upgrade."
    ExecStartPre=/usr/bin/echo "WARNING: Rebooting to trigger a roll-back."
    ExecStart=/usr/bin/reboot
Note that the timeout is very tight - 60 seconds - in this example. This is for demo purposes; in production environments it should be aligned with the expected critical service start-up time, likely 10 minutes or more.
Finishing touches and test run
Before we test the above, we actually need a service that fails!
We can amend our NGINX unit to fail start-up if a file /nginx-fail exists:
...
ExecStartPre=/usr/bin/test ! -f /nginx-fail
...
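The added line relies on the exit status of test: "test ! -f FILE" succeeds while FILE is absent and fails once it exists, and a failing ExecStartPre command (one without the "-" prefix) makes the unit’s start fail. A tiny sketch you can try in any shell; nginx_may_start is a made-up name.

```shell
# Sketch: mirrors the ExecStartPre check. Succeeds (exit status 0) only
# while the given flag file does not exist.
# nginx_may_start is a hypothetical helper name.
nginx_may_start() {
    # $1: flag file path (this session uses /nginx-fail)
    test ! -f "$1"
}

# Usage:
#   nginx_may_start /nginx-fail && echo "would start" || echo "would fail"
```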
Now we’re all set for a test run.
For convenience, find the whole config here:
variant: flatcar
version: 1.0.0
storage:
  files:
    - path: /opt/detect-first-boot-after-upgrade.sh
      mode: 0500
      contents:
        inline: |
          #!/bin/bash
          healthy_flag_file="${1:-/run/first-boot-healthy}"
          function get_part_attr() {
              local partition="$1"
              local attribute="$2"
              cgpt show "${partition}" \
                  | sed -nE "s/.*Attr:.*${attribute}=([0-9]+)([[:space:]]|\$).*/\1/p"
          }
          function is_first_boot_after_upgrade() {
              active_part="$(rootdev -s /usr)"
              active_prio="$(get_part_attr "${active_part}" priority)"
              spare_part="$(cgpt find -t flatcar-usr 2>/dev/null | grep -v "${active_part}")"
              spare_prio="$(get_part_attr "${spare_part}" priority)"
              # Is current /usr partition the highest priority?
              # (A previous manual roll-back can cause it not to be)
              if [[ ${active_prio} -le ${spare_prio} ]] ; then
                  echo "Active partition '${active_part}' has lower or equal priority ('${active_prio}') than spare ('${spare_part}': '${spare_prio}')."
                  return 1
              fi
              echo "Active partition '${active_part}' has highest priority '${active_prio}' (spare '${spare_part}': '${spare_prio}')."
              # Is active partition marked successful already by previous boot?
              if [[ "$(get_part_attr "${active_part}" "successful")" -eq 1 ]] ; then
                  echo "Current USR partition '${active_part}' has been marked as successful boot in a previous boot."
                  return 1
              fi
              return 0
          }
          if ! is_first_boot_after_upgrade; then
              echo "No first boot after upgrade detected, quitting."
              touch "${healthy_flag_file}"
              exit 0
          fi
          echo "First boot after upgrade detected"
systemd:
  units:
    - name: locksmithd.service
      mask: true
    - name: nginx.service
      enabled: true
      contents: |
        [Unit]
        Description=NGINX example
        After=docker.service
        Requires=docker.service
        [Service]
        TimeoutStartSec=0
        ExecStartPre=-/usr/bin/docker rm --force nginx1
        ExecStartPre=/usr/bin/test ! -f /nginx-fail
        ExecStart=/usr/bin/docker run --name nginx1 --pull always --log-driver=journald --net host docker.io/nginx:1
        ExecStop=/usr/bin/docker stop nginx1
        Restart=always
        RestartSec=5s
        [Install]
        WantedBy=multi-user.target
    - name: is-first-boot-after-upgrade.service
      enabled: true
      contents: |
        [Unit]
        Description=Detect if this is a first boot after an OS upgrade.
        [Service]
        ExecStart=/opt/detect-first-boot-after-upgrade.sh
        [Install]
        WantedBy=multi-user.target
    - name: first-boot-healtcheck.service
      enabled: true
      contents: |
        [Unit]
        Description=Meta service to mark the first boot after an OS upgrade as healthy.
        After=is-first-boot-after-upgrade.service
        Requires=is-first-boot-after-upgrade.service
        ConditionPathExists=!/run/first-boot-healthy
        [Service]
        ExecStartPre=/usr/bin/echo "All critical services are up, start-up is healthy."
        ExecStart=/usr/bin/touch /run/first-boot-healthy
        [Install]
        WantedBy=multi-user.target
      dropins:
        - name: nginx-essential-service.conf
          contents: |
            [Unit]
            Requires=nginx.service
            After=nginx.service
    - name: first-boot-healthy.path
      enabled: true
      contents: |
        [Unit]
        Description=Triggers either after the first boot after an OS upgrade was healthy or if there was no OS upgrade.
        [Path]
        PathExists=/run/first-boot-healthy
        Unit=update-engine.service
        [Install]
        WantedBy=multi-user.target
    - name: update-engine.service
      dropins:
        - name: first-boot-healthy-must-exist.conf
          contents: |
            [Unit]
            ConditionPathExists=/run/first-boot-healthy
    - name: reboot-after-unhealthy-upgrade.timer
      enabled: true
      contents: |
        [Unit]
        Description=Triggers a reboot (causing a rollback) when the OS is unhealthy after an upgrade
        [Timer]
        OnStartupSec=60
        [Install]
        WantedBy=timers.target
    - name: reboot-after-unhealthy-upgrade.service
      contents: |
        [Unit]
        Description=Triggers a reboot (causing a rollback) when the OS is unhealthy after an upgrade
        ConditionPathExists=!/run/first-boot-healthy
        [Service]
        ExecStartPre=/usr/bin/echo "WARNING: unclean boot detected after OS upgrade."
        ExecStartPre=/usr/bin/echo "WARNING: Rebooting to trigger a roll-back."
        ExecStart=/usr/bin/reboot
And don’t forget to transpile 😉
Start a fresh Flatcar VM from our second-to-last Alpha release image.
./flatcar_production_qemu_uefi.sh -i nginx.json -f 12345:80 -- -nographic -snapshot
After boot, become root (sudo -i).
Check the NGINX web server from your local browser, and check the status of the various services we defined:
systemctl status nginx.service update-engine.service is-first-boot-after-upgrade.service first-boot-healtcheck.service reboot-after-unhealthy-upgrade.service -l --no-pager
Among other things, we can see that reboot-after-unhealthy-upgrade.service tried to start 60 seconds after boot, but fortunately did not trigger a reboot as its precondition was not met (the non-existence of /run/first-boot-healthy).
Let’s see if we can make NGINX fail:
touch /nginx-fail
systemctl restart nginx
With our NGINX failure staged, we can once again upgrade the node:
update_engine_client -check_for_update
update_engine_client -status
and, after the update was staged, reboot.
You can check the Flatcar OS release on the login prompt:
Flatcar Container Linux by Kinvolk alpha XXXX for QEMU
XXXX should be the latest Alpha release.
Then we just wait - the VM will auto-reboot within 60 seconds after it started. After a short while you’ll see
Flatcar Container Linux by Kinvolk alpha YYYY for QEMU
YYYY should be the Alpha release we downloaded at the beginning of this session.
Rollback successful!
Since we’re now back to the previous version, step #1 above should mark the boot as healthy (so the instance does not continue to reboot). Great job - we just built an automated roll-back into a known good environment when a critical service does not come up after an OS upgrade.
Done!
In this session, you learned
- about Flatcar’s immutable and verity-protected OS partition
- the Flatcar boot process, and initial provisioning
- the A/B update scheme and how the bootloader determines what to boot
- the upgrade process
- how to customise Flatcar to roll back OS upgrades when critical services fail after an update