Availability
Availability does not happen by chance; it results from planned redundancy, automated processes, and continuous monitoring. I design and implement infrastructures that keep running through failures, from individual services to distributed HA clusters. The goal is stable operation despite hardware defects, software errors, or maintenance work, backed by structured monitoring and alerting and by clearly defined backup and recovery concepts.
Architectural decisions are always made with an eye to operational reality, maintainability, and clear responsibilities in order to handle failures in a controlled manner. Availability thus becomes a plannable feature of the infrastructure rather than a reactive emergency measure.

Architecture & Redundancy

The goal is to consistently avoid single points of failure. I design infrastructures in which central components are duplicated or distributed—from the network and storage level to the application.
- Redundant network paths (LACP bonding, VLAN separation)
- Cluster designs for databases, virtualization, and containers
- Storage redundancy with Ceph (distributed replication) or ZFS mirrors (local)
- Failover mechanisms with Keepalived, Corosync, Pacemaker
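
The effect of removing a single point of failure can be estimated with simple arithmetic. The following is a minimal sketch, assuming that components fail independently of each other (which real systems only approximate); the function names and the 99 % figure are illustrative and not taken from a specific project.

```python
HOURS_PER_YEAR = 24 * 365


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` replicas when one working replica suffices.

    Assumption: replicas fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # illustrative value for one node
    pair = redundant_availability(single, copies=2)
    print(f"single node: {downtime_hours_per_year(single):.1f} h/year")
    print(f"two nodes:   {downtime_hours_per_year(pair):.2f} h/year")
```

Under these assumptions, duplicating a component helps far more than hardening a single instance, while every additional component in a serial chain lowers the overall figure.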

Backup & Recovery

Operational stability does not end with failover—it also includes recoverability after serious failures. I rely on open-source backup systems and documented recovery processes.
- Snapshots, deduplication, and incremental backups (Borg, Restic, Bareos, PBS)
- Automated restore tests and disaster recovery playbooks
- Versioned documentation and recovery guides
- Integration into monitoring and notification
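
An automated restore test ultimately has to answer one question: does the restored data match what was backed up? The sketch below shows one possible check using SHA-256 checksums. It is independent of any particular backup tool; `build_manifest` and `verify_restore` are hypothetical helper names, and the restore itself is assumed to have been performed beforehand by the backup system.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to `root` to its checksum."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(source_manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Compare a restored tree against the manifest taken at backup time.

    Returns a list of problems; an empty list means the restore matches.
    """
    restored = build_manifest(restored_root)
    problems = []
    for name, checksum in source_manifest.items():
        if name not in restored:
            problems.append(f"missing: {name}")
        elif restored[name] != checksum:
            problems.append(f"changed: {name}")
    return problems
```

A scheduled job could restore into a scratch directory, call `verify_restore`, and forward a non-empty result to the notification channel.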

Monitoring & Alerting

Transparency is a prerequisite for stability. I implement holistic monitoring chains that detect problems early and report them automatically.
- Prometheus, Grafana, Alertmanager
- Journal and log aggregation with Loki and systemd-journald
- SNMP-based hardware monitoring and capacity planning
- Integration of email and ChatOps notifications
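
Useful alerting distinguishes a short spike from a sustained problem. Prometheus alerting rules express this with a hold duration, so that an alert stays pending until its condition has been true for long enough. The following sketch illustrates only the idea; it is not Prometheus code, and the class name, threshold, and timings are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    """Fire only when a value stays above `threshold` for `hold_seconds`."""

    threshold: float
    hold_seconds: float
    _breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> str:
        """Record one measurement taken at time `now` (in seconds)."""
        if value <= self.threshold:
            # Condition cleared: forget the earlier breach.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Illustrative: disk usage above 90 % for at least five minutes.
    alert = PendingAlert(threshold=0.9, hold_seconds=300)
    for timestamp, usage in [(0, 0.95), (120, 0.96), (300, 0.97)]:
        print(timestamp, alert.observe(usage, now=timestamp))
```

The hold time trades detection speed for fewer false alarms, so it is usually chosen per signal rather than globally.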

High-availability clusters

I plan and operate cluster environments that provide services without interruption – whether local or distributed.
- HA clusters for PostgreSQL, MariaDB, NGINX, and Kubernetes
- Virtual IP addresses and automated failover
- Monitoring of synchronization and cluster state via Prometheus exporters
- Integration into Ansible playbooks for automated recovery
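
Automated failover is only safe if the cluster can tell whether it still holds a majority; otherwise two halves of a split cluster may both become active. The sketch below shows the simple majority rule commonly used for this decision. It ignores the special cases that real cluster stacks offer (for example dedicated two-node modes or external quorum devices), and the function names are illustrative.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes may fail while the rest still holds quorum."""
    return total_nodes - quorum_size(total_nodes)


def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable partition may keep running services."""
    return reachable_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    for size in (2, 3, 4, 5):
        print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```

Under this rule a four-node cluster tolerates no more failures than a three-node cluster, which is why odd node counts are a common design choice.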

Training
You can find specific training courses and current topics in the Comelio GmbH training catalog.
Courses are available in-house at your company, as a webinar, or as an open course, and are designed to meet different requirements.

Frequently asked questions about availability
In this FAQ, you will find the topics that come up most frequently in consultations and training sessions. Each answer is kept brief and refers to further content where necessary. Can’t find your question? Feel free to contact me.

Is high availability alone sufficient for critical systems?
No. High availability reduces downtime for individual components, but does not protect against data corruption, misconfigurations, or site failures. For critical systems, a combination of HA, backups, and clear disaster recovery strategies is essential.
Active-active or active-passive—which makes more sense?
Active-active is suitable for stateless services or load balancing, but requires clean synchronization. For stateful applications, active-passive with controlled failover is often more stable and predictable. The decision depends on consistency requirements and the operating model.
Is monitoring alone sufficient to prevent failures?
No. Monitoring identifies problems, but does not prevent them. Only in combination with redundancy, automation, and clearly defined response processes can true operational stability be achieved.
