systemd Hardening: A Production Sandboxing Cookbook

Most sysadmins reach for AppArmor or SELinux when they want to confine a service. Both work, both have learning curves, and both expect either pre-shipped policies or substantial custom-rule writing. Meanwhile, systemd has shipped — for years — a sandboxing toolkit that gets you most of the protection with maybe 25 lines of INI.

This is the cookbook I use to harden services on production servers. Worked examples for nginx, MariaDB, and a custom Python app, plus the debugging workflow when something inevitably breaks.

Why systemd sandboxing first

systemd's hardening directives are built on kernel primitives that sit alongside the LSM hooks AppArmor and SELinux use: namespaces, seccomp, capabilities, cgroups, mount restrictions, and BPF. They are configured per service in plain INI. No policy compiler. No reload-the-whole-LSM workflow. Enforcement happens in the kernel, so a forbidden operation fails (or kills the process) no matter what code the service ends up running.

What this stops:

Most filesystem-based privilege escalation (read-only root, no /home, private /tmp)
Kernel module loading from a compromised service
Capability abuse (CAP_SYS_ADMIN, CAP_NET_RAW for sniffing, etc.)
Snooping on other processes through /proc (ProtectProc=invisible hides processes owned by other users)
syscall-based exploits filtered by seccomp
Network egress to unintended destinations (per-cgroup BPF firewalling)

What it doesn't replace:

Patch management. Sandboxing limits what an exploit can do; it doesn't remove the vulnerability.
Authentication and authorization at the application layer.
Defense against in-process logic bugs — the service still runs as itself.

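The win is layering. An unpatched RCE inside a sandboxed nginx can't drop a webshell into /etc/, can't load a kernel module, can't use the syscalls you've removed with SystemCallFilter=, and can't exfiltrate to an arbitrary IP if you've locked egress with IPAddressDeny=. None of the recipes here stop the service from executing /bin/sh (execve is part of @system-service); what they limit is what that shell can do afterwards. That's a meaningful blast-radius reduction for almost no cost.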

Read this first: drop-in overrides, not unit edits

Never edit /usr/lib/systemd/system/nginx.service directly. Package upgrades will overwrite it. Use drop-in overrides:

systemctl edit nginx

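This opens an editor for /etc/systemd/system/nginx.service.d/override.conf. Add only the [Service] section with your additions. systemctl edit reloads the unit files itself when you save and exit; the explicit daemon-reload below only matters when the file was created some other way (configuration management, cp) and is harmless otherwise:

systemctl daemon-reload
systemctl restart nginx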

To see the unit file plus every drop-in applied on top of it:

systemctl cat nginx
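
When the override has to be created without an interactive editor (a provisioning script, an image build), the same drop-in layout can be written by hand. A minimal sketch: write_dropin is a hypothetical helper name, and its optional third argument exists only so the function can be tried against a scratch directory first.

```shell
# write_dropin UNIT FILENAME [ROOT]
# Hypothetical helper (not part of systemd): reads a drop-in body from stdin,
# stores it as ROOT/UNIT.d/FILENAME and prints the resulting path.
# ROOT defaults to /etc/systemd/system, the tree systemctl edit writes to.
write_dropin() {
    wd_dir="${3:-/etc/systemd/system}/$1.d"
    mkdir -p "$wd_dir" || return 1
    cat > "$wd_dir/$2" || return 1
    printf '%s\n' "$wd_dir/$2"
}
```

Feed it the [Service] section on stdin, for example from a file kept in version control, then run systemctl daemon-reload and systemctl restart nginx yourself: nothing reloads the manager for you on this path.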

Cookbook 1: nginx

systemctl edit nginx

Drop in:

[Service]
# Filesystem
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/nginx /var/lib/nginx /var/cache/nginx /run

# Kernel
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid

# Process
NoNewPrivileges=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RemoveIPC=true
UMask=0027

# Capabilities (only what nginx actually needs)
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_CHOWN CAP_DAC_OVERRIDE CAP_SETGID CAP_SETUID
AmbientCapabilities=CAP_NET_BIND_SERVICE

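# Syscalls
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
# @privileged includes the setuid/setgid/chown families. A master process
# started as root needs them to drop its workers to the nginx user.
SystemCallFilter=@setuid @chown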

# Network
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6

Verify the impact:

systemd-analyze security nginx

A stock nginx unit on most distros scores in the 6.5–9.5 range (rated MEDIUM to UNSAFE). After this drop-in, expect roughly 2.0–2.5 (OK), depending on systemd version.
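
To make that before/after check scriptable, the score can be pulled out of the report and compared against a ceiling. This is a sketch, assuming the summary line keeps the "Overall exposure level for UNIT: N.N LABEL" shape; exposure_score and score_at_most are hypothetical helper names.

```shell
# exposure_score: read `systemd-analyze security UNIT` output on stdin and
# print the decimal number found on its "Overall exposure level" line.
exposure_score() {
    awk '/Overall exposure level/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+$/) { print $i; exit }
    }'
}

# score_at_most SCORE LIMIT: succeed when SCORE <= LIMIT.
# awk does the comparison because POSIX shell arithmetic is integer-only.
score_at_most() {
    awk -v s="$1" -v l="$2" 'BEGIN { exit !(s + 0 <= l + 0) }'
}
```

In a deploy script that could look like: score=$(systemd-analyze security nginx | exposure_score); score_at_most "$score" 3.0 || echo "nginx hardening regressed: $score". The 3.0 ceiling is an arbitrary example value.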

⚠️ MemoryDenyWriteExecute=true breaks any process that JITs. It is fine for stock nginx, but it will break Node, PHP-FPM with JIT enabled, LuaJIT, and tracing JITs. Skip it for those.

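⚠️ ReadWritePaths= is the escape hatch under ProtectSystem=strict. If nginx writes somewhere not in that list (custom log path, cache, upload buffer), it will fail with permission errors that look like config bugs. Audit your nginx.conf for any *_path directive and make sure the parent dir is in ReadWritePaths=. Every listed path must also exist when the unit starts, otherwise the start fails with status=226/NAMESPACE; prefix a path with - (for example -/var/cache/nginx) to have it skipped when it is absent.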
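
That audit is easy to script. The sketch below is a rough filter, not an nginx parser: it assumes one directive per line, and nginx_write_paths is a hypothetical helper name.

```shell
# nginx_write_paths: read nginx configuration text on stdin and print the
# path argument of every *_path, access_log and error_log directive.
# Only absolute paths are printed, which also drops special targets such as
# "off", "stderr" and "syslog:...". Commented-out directives are ignored.
nginx_write_paths() {
    awk '
        $1 !~ /^#/ && ($1 ~ /_path$/ || $1 == "access_log" || $1 == "error_log") {
            p = $2
            sub(/;$/, "", p)
            if (p ~ /^\//) print p
        }
    ' | sort -u
}
```

Running nginx -T 2>/dev/null | nginx_write_paths dumps the full configuration, includes and all, through the filter. Anything in the output that is not under a directory named in ReadWritePaths= is a future permission error.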

Cookbook 2: MariaDB

systemctl edit mariadb

[Service]
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/mysql /var/log/mysql /run/mysqld

ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid

NoNewPrivileges=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
RemoveIPC=true
UMask=0027

CapabilityBoundingSet=CAP_IPC_LOCK
AmbientCapabilities=

SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @mount

RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6

⚠️ Don't enable MemoryDenyWriteExecute=true for MariaDB — it can interfere with stored procedure execution paths in some versions. Test thoroughly in staging if you want it.

⚠️ MariaDB's InnoDB uses io_setup/io_submit for native async I/O. These are in @system-service, so the filter above is fine. If you write a stricter custom filter, make sure async I/O syscalls are allowed or InnoDB will refuse to start with cryptic errors about read threads.

Cookbook 3: A custom Python service from scratch

A Flask app running gunicorn on port 8000, reading config from /etc/myapp/, writing logs to /var/log/myapp/, and connecting to PostgreSQL at 10.0.0.5:5432.

/etc/systemd/system/myapp.service:

[Unit]
Description=My Flask App
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
Type=notify
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/gunicorn --bind 127.0.0.1:8000 wsgi:app
Restart=on-failure
RestartSec=5

# Filesystem
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadOnlyPaths=/etc/myapp
ReadWritePaths=/var/log/myapp
InaccessiblePaths=/boot /opt/backups

# Kernel
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
ProtectHostname=true
ProtectProc=invisible
ProcSubset=pid

# Process
NoNewPrivileges=true
RestrictSUIDSGID=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
PrivateDevices=true
RemoveIPC=true
UMask=0077

# No caps needed — runs as 'myapp' on a high port
CapabilityBoundingSet=
AmbientCapabilities=

# Syscalls
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources @mount @reboot @swap @cpu-emulation @debug @module @raw-io

# Network: only PostgreSQL and localhost (the filter applies to ingress and egress)
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
IPAddressDeny=any
IPAddressAllow=10.0.0.5
IPAddressAllow=127.0.0.0/8

# Resource limits
MemoryMax=512M
TasksMax=64
LimitNOFILE=4096

[Install]
WantedBy=multi-user.target

IPAddressDeny=any plus a small IPAddressAllow= list is the underrated gem here. A compromised app can't curl evil.com to pull stage-2. It can only reach destinations you've explicitly allowed. This is enforced via per-cgroup BPF, so the service can't escape it by switching libraries or using raw sockets.

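⚠️ If your app makes outbound calls (Stripe, SendGrid, an internal microservice), each destination needs an IPAddressAllow= entry. The directive takes addresses and CIDR prefixes, not hostnames, so a third-party API whose IPs rotate is hard to pin down this way. DNS resolution itself needs the resolver allowed: 127.0.0.53 for systemd-resolved is already covered by 127.0.0.0/8 above, while a remote resolver needs its own entry.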

⚠️ Don't put the empty CapabilityBoundingSet= line in services that need to bind low ports as non-root. Use CapabilityBoundingSet=CAP_NET_BIND_SERVICE + AmbientCapabilities=CAP_NET_BIND_SERVICE instead, as in the nginx example.
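
Once the allow-list grows past two or three entries, I find it easier to keep the addresses in a plain text file and generate the directives. A sketch under stated assumptions: ip_allow_dropin is a hypothetical helper, and its syntax check is deliberately rough (it rejects hostnames and systemd's symbolic names such as localhost, but does not fully validate addresses).

```shell
# ip_allow_dropin: read one IPv4/IPv6 address or CIDR prefix per line on stdin
# and print a [Service] drop-in that denies everything else.
# Blank lines and lines starting with # are skipped. Anything that does not
# look like an address makes the function exit non-zero, because
# IPAddressAllow= does not resolve names.
ip_allow_dropin() {
    awk '
        BEGIN { print "[Service]"; print "IPAddressDeny=any" }
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        $1 !~ /^[0-9A-Fa-f:.]+(\/[0-9]+)?$/ {
            print "not an address or prefix: " $1 > "/dev/stderr"
            bad = 1
            next
        }
        { print "IPAddressAllow=" $1 }
        END { exit bad + 0 }
    '
}
```

The output is a complete drop-in body, so it can be redirected straight into a file under /etc/systemd/system/myapp.service.d/ and reviewed in version control like any other change.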

Tooling: systemd-analyze security

systemd-analyze security myapp.service

You get a per-directive table with explanations of what each setting buys you, plus an overall score:

→ Overall exposure level for myapp.service: 1.6 OK

Run it across every service on a host. Anything scoring above 5.0 is a candidate. Distro-shipped units rarely include the full hardening set — they aim for compatibility, not minimum exposure.

To audit every service on a host:

systemd-analyze security --no-pager 2>/dev/null | sort -k2 -n -r | head -20

This gives you the top 20 most-exposed services in one shot. Good way to prioritize what to harden first.
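
If you would rather cut at a score than at a count, the same table can be filtered by threshold. This sketch assumes the four-column layout (UNIT, EXPOSURE, PREDICATE, HAPPY) that systemd-analyze security prints without arguments; exposed_units is a hypothetical helper name.

```shell
# exposed_units LIMIT: read the `systemd-analyze security` table on stdin and
# print "SCORE UNIT" for every unit whose exposure is above LIMIT, worst first.
# The header row is skipped because its second column is not a number.
exposed_units() {
    awk -v limit="$1" '
        $2 ~ /^[0-9]+(\.[0-9]+)?$/ && $2 + 0 > limit + 0 { print $2, $1 }
    ' | sort -k1,1nr
}
```

For example, systemd-analyze security --no-pager 2>/dev/null | exposed_units 5.0 lists everything above the 5.0 cutoff.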

Debugging when hardening breaks things

Standard failure modes:

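Suspect a path restriction. Run the binary outside systemd under strace as the service user, from the service's working directory, to see every path it touches (%file traces all path-taking syscalls, which also works on architectures that have no plain stat or access syscall):

cd /opt/myapp && sudo -u myapp strace -f -e trace=%file \
  /opt/myapp/venv/bin/gunicorn wsgi:app 2>&1 \
  | grep -v ENOENT | grep -v '= 0'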

Cross-check the paths against your ReadWritePaths= / ReadOnlyPaths= lists.
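
The interesting lines in that trace are the opens with write intent, because those are the ones ProtectSystem=strict will reject. A small filter narrows the output down. It is a sketch that assumes strace's default line format, and written_paths is a hypothetical helper name.

```shell
# written_paths: read strace output on stdin and print each path that was
# opened with write intent (O_WRONLY, O_RDWR or O_CREAT), without duplicates.
# Failed opens are kept on purpose: an EROFS or EACCES line is exactly the
# path that is missing from ReadWritePaths=.
written_paths() {
    grep -E 'open(at)?\(.*O_(WRONLY|RDWR|CREAT)' |
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' |
        sort -u
}
```

Pipe the raw strace output into it (2>&1 | written_paths) and compare the result against the unit's ReadWritePaths= line.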

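Suspect a syscall filter is the cause. Use SystemCallLog= (systemd 247 or newer) instead of SystemCallFilter= to log without blocking:

SystemCallFilter=
SystemCallLog=@privileged @resources @mount

Run the service, exercise it, then look for SECCOMP audit records to see what it actually called. They land in the kernel log (journalctl -k) or, when auditd is running, in the audit log, and usually not in the unit's own journal stream. Tighten the real filter based on that.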
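
Those records identify syscalls by number, one record per call, so a quick tally helps. The sketch below assumes each record carries a syscall=N field, as kernel audit SECCOMP records (type 1326) do; logged_syscalls is a hypothetical helper name.

```shell
# logged_syscalls: read kernel or audit log lines on stdin and print
# "COUNT SYSCALL_NUMBER" for every SECCOMP record, most frequent first.
# Matches both "type=1326" (kernel log) and "type=SECCOMP" (auditd) records;
# other lines are ignored even if they happen to contain "syscall=".
logged_syscalls() {
    grep -E 'type=(1326|SECCOMP)' |
        sed -n 's/.* syscall=\([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c | sort -k1,1nr -k2,2n |
        awk '{ print $1, $2 }'
}
```

Something like journalctl -k --no-pager | logged_syscalls gives the tally. Numbers are architecture-specific; if the audit userspace tools are installed, ausyscall can translate them to names.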

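Service runs but functionality is broken. Turn up the application's own logging. Environment=SYSTEMD_LOG_LEVEL=debug is only read by systemd's own programs and libraries, so it does nothing for gunicorn; pass gunicorn's --log-level debug flag instead, temporarily:

[Service]
ExecStart=
ExecStart=/opt/myapp/venv/bin/gunicorn --bind 127.0.0.1:8000 --log-level debug wsgi:app

Then systemctl daemon-reload && systemctl restart myapp and check the journal.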

Service exits immediately, non-zero status.

journalctl -u myapp -n 50 --no-pager

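Look for Permission denied, Operation not permitted, EACCES, or Read-only file system: usually a missing ReadWritePaths= entry. A filtered syscall looks different. By default the process is killed with SIGSYS (status=31/SYS in the journal) instead of getting an error back. status=226/NAMESPACE means systemd could not build the sandbox at all, typically because a listed path does not exist.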
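
The status value on the "Main process exited" line is usually enough to tell which directive group to look at. A small lookup, as a sketch: the mapping covers only a handful of the exit statuses documented in systemd.exec(5), and both function names are hypothetical.

```shell
# failure_status: read journal lines on stdin and print every "status=" value,
# e.g. "31/SYS" from "Main process exited, code=killed, status=31/SYS".
failure_status() {
    sed -n 's/.*status=\([0-9][0-9]*\/[A-Z_]*\).*/\1/p'
}

# explain_status STATUS: map a status value to the sandboxing directives
# most likely involved. Anything else is left to the service's own logs.
explain_status() {
    case "$1" in
        31/SYS)           echo "killed by seccomp: check SystemCallFilter=" ;;
        226/NAMESPACE)    echo "mount namespace setup failed: check ReadWritePaths=, ReadOnlyPaths=, InaccessiblePaths=" ;;
        218/CAPABILITIES) echo "capability setup failed: check CapabilityBoundingSet=, AmbientCapabilities=" ;;
        228/SECCOMP)      echo "seccomp filter could not be installed: check SystemCallFilter=" ;;
        *)                echo "not a sandbox-setup status: read the service's own log output" ;;
    esac
}
```

For example: journalctl -u myapp -n 50 --no-pager | failure_status | tail -n 1 yields the most recent status, which can then be passed to explain_status.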

Capabilities: drop them all by default

The capability-bounding directives are the most underused tool in the kit. Most services don't need any capabilities — they run as a non-root user.

CapabilityBoundingSet=
AmbientCapabilities=

That empty CapabilityBoundingSet= removes all capabilities from the bounding set, meaning even if the service is somehow tricked into a setuid path or a privilege transition, there are no privileged capabilities to use. Combined with NoNewPrivileges=true, this is a hard ceiling on capability acquisition.

For services that need exactly one (binding port 80 as a non-root user, for example):

User=nginx
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
NoNewPrivileges=true

This is how you bind low ports as a non-root service without ever running as root, and without setcap on the binary.
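
It is worth confirming on the running process that the capability sets really ended up as intended. The kernel exposes them as hex masks in /proc/PID/status. The sketch below reads one mask and tests a single bit; cap_mask and has_cap are hypothetical helper names, and bit numbers come from linux/capability.h (CAP_NET_BIND_SERVICE is 10, CAP_SYS_ADMIN is 21).

```shell
# cap_mask FIELD: read /proc/PID/status text on stdin and print the hex mask
# of one capability set (CapInh, CapPrm, CapEff, CapBnd or CapAmb).
cap_mask() {
    awk -v field="$1:" '$1 == field { print $2 }'
}

# has_cap MASK BIT: succeed when bit number BIT is set in the hex MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}
```

For the nginx example above: pid=$(systemctl show -p MainPID --value nginx); bnd=$(cap_mask CapBnd < "/proc/$pid/status"); has_cap "$bnd" 10 should succeed, and has_cap "$bnd" 21 should fail.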

Threat model: what this stops, what it doesn't

Stops or significantly slows:

Filesystem-based persistence: read-only root, no /home, private /tmp
Kernel-level tampering: the Protect* family makes /proc/sys and /sys read-only, blocks module loading, and hides the kernel log
Lateral movement: IPAddressDeny= + dropped caps + namespace restrictions
Many published RCE-to-shell payloads: they expect a writable filesystem and unrestricted outbound connections

Doesn't stop:

App-layer logic bugs: auth bypass, IDOR, SQL injection all run inside the sandbox
Resource exhaustion: use MemoryMax=, CPUQuota=, TasksMax= for that
Compromise of the service user itself: add defense in depth with per-service users and rootless containers
Kernel 0-days: the sandbox is built from kernel features, so a kernel compromise is game over regardless

Layered with patch hygiene, per-service unprivileged users, and either rootless Podman or a proper container runtime, you've made a serious dent in your attack surface for the cost of a 25-line drop-in per service.

Roll-out playbook

For an existing fleet:

Pick the highest-risk service first — public-facing, complex codebase. Usually nginx, your application server, or your mail stack.
Deploy a baseline drop-in to one staging host. Restart, verify functionality with smoke tests.
Confirm score improvement with systemd-analyze security.
Watch the journal for 24–48 hours under real traffic. Permission errors usually surface within a day.
Tighten incrementally — SystemCallFilter, IPAddressDeny, CapabilityBoundingSet — one directive group at a time. Don't change five things at once.
Roll to production via Ansible:

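# tasks/main.yml
- name: Create nginx drop-in directory
  ansible.builtin.file:
    path: /etc/systemd/system/nginx.service.d
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Deploy nginx hardening drop-in
  ansible.builtin.copy:
    src: hardening/nginx.override.conf
    dest: /etc/systemd/system/nginx.service.d/override.conf
    owner: root
    group: root
    mode: "0644"
  notify:
    - daemon-reload
    - restart nginx

# handlers/main.yml (handlers run in the order they are defined here)
- name: daemon-reload
  ansible.builtin.systemd:
    daemon_reload: true

- name: restart nginx
  ansible.builtin.service:
    name: nginx
    state: restarted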

Move to the next service. After three or four iterations, you have a library of per-service hardening drop-ins and an Ansible role that deploys them across the fleet.

The exercise pays for itself the first time a CVE drops on a service you've sandboxed and the public proof-of-concept exploit doesn't work because half the syscalls it needs are filtered. That's not theoretical — it's happened to me twice.