Architecture Deep Dive: How We Built a Modern Compute Platform with KVM and API-First

The cloud hosting market is full of buzzwords. But in practice, three things matter most to developers, DevOps teams, and businesses in 2024: predictable performance, seamless automation, and transparent costs.

This is precisely why we fundamentally rethought our previous "Cloud Server" platform from the ground up and rebuilt it into an open, API-first "Compute" platform. It was a shift away from rigid monthly costs toward granular, tangible resources: vCPU, RAM, storage, and network.

In this technical deep dive, we'll explain why this transformation was necessary, what the new architecture looks like, and what benefits you as a user can expect.

What do users expect from a cloud today?

The catalyst for our reboot was clear customer feedback that aligned with our own technical goals. The requirements for modern infrastructure are:

  • Predictable Performance: Users want to be able to rely on the booked performance (e.g., 4 vCPUs) being consistently available, without "noisy neighbors" or vague fair-use policies.

  • Simple Automation: Infrastructure must be treatable as code (Infrastructure as Code). A good web panel is essential, but no longer sufficient on its own; the platform must be fully controllable via API (e.g., via Terraform or Ansible).

  • Transparency & Flexibility: Instead of opaque packages, users want clear resources (vCPU, RAM, storage) at clear prices.

Our old software base had become too complex for these requirements. Rather than adding more layers, we streamlined and placed the platform on a new foundation.

A technical deep dive: How is the new platform built?

To achieve these goals, we rely on open standards and a robust, decoupled architecture.

The Foundation: Why we use KVM

In the compute layer, we consistently rely on the industry standard KVM (Kernel-based Virtual Machine). We deliberately avoid proprietary experiments and use the native Linux kernel technology that also forms the backbone of the major hyperscalers. We chose KVM for pragmatic reasons:

  • Performance: The overhead from virtualization is minimal.

  • Stability: KVM is technologically mature, proven millions of times over, and extremely robust.

  • Compatibility: As an open standard, KVM prevents vendor lock-in.
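
A quick way to see whether a given Linux host can act as a KVM hypervisor is to check for the `/dev/kvm` device the kernel exposes. A minimal sketch (illustrative only, not part of our tooling):

```python
import os

def kvm_available() -> bool:
    """True if the kernel exposes /dev/kvm and this process may open it."""
    return os.path.exists("/dev/kvm") and os.access("/dev/kvm", os.R_OK | os.W_OK)

print(kvm_available())
```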

The entire platform is orchestrated by Apache CloudStack, which we've adapted at critical points for our requirements. Our goal is to provide robust building blocks (vCPU, RAM, network) on which our customers can independently operate their own platforms – up to and including Kubernetes clusters.
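
To give a feel for what driving CloudStack programmatically looks like, here is a hedged Python sketch of CloudStack's documented request-signing scheme (sorted, URL-encoded, lower-cased query string, HMAC-SHA1, Base64). The command is real; the keys are placeholders, and you would normally use an SDK or Terraform provider rather than signing by hand:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote, urlencode

def sign_request(params: dict, api_key: str, secret_key: str) -> str:
    """Build a signed CloudStack API query string (HMAC-SHA1 scheme)."""
    params = dict(params, apikey=api_key, response="json")
    # CloudStack signs the sorted, URL-encoded, lower-cased query string.
    to_sign = "&".join(
        f"{k}={quote(str(v), safe='*')}" for k, v in sorted(params.items())
    ).lower()
    digest = hmac.new(secret_key.encode(), to_sign.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return urlencode(params) + "&signature=" + quote(signature, safe="")

# Placeholder credentials for illustration only.
qs = sign_request({"command": "listVirtualMachines"}, "API_KEY", "SECRET")
print(qs)
```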

Network & Availability: What happens during host maintenance?

A common problem with traditional servers is downtime during maintenance windows. Our new platform solves this through live migration: Virtual machines (your instances) can be moved from one physical host system to another during operation without noticeable interruption.

The underlying network is generously dimensioned: each host node is connected at 50 Gbit/s and protected by default with powerful network-level DDoS mitigation. You can operate your instances in completely private networks (ideal for backend services) or equip them with public IPv4 and IPv6 addresses.

Storage: When do I need HA storage and when High-IOPS?

No two applications are alike. A web app needs fail-safe storage, while a database needs maximum I/O performance. We solve this with a clear two-tier model:

  1. HA Storage (High Availability):

  • Technology: Based on Ceph, a distributed storage system.

  • Function: Data is replicated over the network across multiple physical systems. If a storage node fails, your instance continues to run.

  • Use: In practice, we achieve around 2,000 IOPS – ideal for web servers, standard applications, and systems where availability is more important than raw I/O performance.

  2. HI Storage (High-IOPS):

  • Technology: Locally attached NVMe SSDs.

  • Function: Storage is located directly in the instance's host system, which drastically reduces latency.

  • Use: Here we achieve typical values of up to 8,000 IOPS. Perfect for I/O-intensive databases, caching servers, or big data workloads.

Additionally, you can attach flexible block storage (up to 10 TB per volume) to your instances to expand storage space as needed.
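
To put the IOPS figures of the two tiers into perspective, a quick back-of-the-envelope calculation translates them into throughput, assuming 4 KiB random blocks (a common benchmark size, not a guaranteed workload profile):

```python
def iops_to_mib_s(iops: int, block_size_kib: int = 4) -> float:
    """Approximate throughput of a purely random workload in MiB/s."""
    return iops * block_size_kib / 1024  # KiB/s -> MiB/s

# Figures from the two storage tiers above, at 4 KiB random blocks:
print(iops_to_mib_s(2_000))  # HA storage (Ceph): 7.8125 MiB/s
print(iops_to_mib_s(8_000))  # HI storage (local NVMe): 31.25 MiB/s
```

Sequential transfers use much larger blocks, so real throughput is far higher; the point is that IOPS, not bandwidth, is usually the limiting factor for databases.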

The most important question: What does flexibility cost?

A central aspect of the new platform is billing. Instead of fixed monthly prices for rigid packages, we bill by the hour. This switch makes completely new scenarios economically viable:

  • Seasonal Peaks: Operate additional web servers only during the Christmas shopping season.

  • CI/CD Pipelines: Start runners only for the duration of a release window.

  • Test & Staging: Clone your production environment for a two-hour test and delete the instance again.

The crucial advantage: Shut-down instances generate no compute costs (only a minimal fee for the occupied storage space). You pay exclusively for actual usage.
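
The billing model above can be sketched with a toy calculation. The rates here are made-up placeholders for illustration, not our price list:

```python
# Illustrative only: hypothetical rates, not an actual price list.
COMPUTE_RATE = 0.02    # EUR per instance-hour while running (assumed)
STORAGE_RATE = 0.0001  # EUR per GB-hour, billed even when stopped (assumed)

def monthly_cost(run_hours: float, disk_gb: float,
                 hours_in_month: float = 730) -> float:
    """Compute cost accrues only while running; storage accrues always."""
    return run_hours * COMPUTE_RATE + disk_gb * hours_in_month * STORAGE_RATE

# A staging clone that exists all month but runs for only two hours:
print(round(monthly_cost(run_hours=2, disk_gb=50), 2))
```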

How do I automate my infrastructure?

Our platform follows the API-first approach. In technical terms, the convenient web panel you use in your browser is itself just a client of our API. This guarantees that every function available by click can also be controlled from code, enabling true "Infrastructure as Code" (IaC) workflows:

  • Terraform: Describe your entire infrastructure (servers, networks, firewalls) declaratively in code.

  • Ansible: Configure your systems automatically and repeatably.
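
The core idea behind these tools (declare the target state and let tooling reconcile it against reality) can be sketched in a few lines of Python. The resource names are hypothetical; real workflows would use the Terraform provider rather than hand-rolled diffing:

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Diff desired vs. actual resources: the core loop behind IaC tools."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(k for k in desired.keys() & actual.keys()
                         if desired[k] != actual[k]),
    }

# Hypothetical instance names, for illustration only.
desired = {"web-1": {"vcpus": 4}, "web-2": {"vcpus": 4}}
actual  = {"web-1": {"vcpus": 2}, "old-db": {"vcpus": 8}}
print(reconcile(desired, actual))
```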

For monitoring the core platform, we use Zabbix (local) and iLert (alerting) to ensure GDPR-compliant operations within the EU. We monitor traffic, CPU utilization, disk states, and temperatures so you can focus on your applications.

Transparency: What were the biggest technical hurdles?

A rebuild of this magnitude never goes smoothly. Our biggest challenge was complex VXLAN effects in combination with white-box switches running SONiC (Software for Open Networking in the Cloud).

Faulty flags in the network stack initially led to sporadic packet loss in the overlay network. Until a manufacturer patch was available, our engineers had to work deep in the Linux network stack and implement targeted filters in the kernel. That worked – but at the cost of a delayed go-live. Edge-case bugs in CloudStack as well as driver dependencies also challenged us. The positive effect for you: Our test and staging pipelines are significantly more mature today, which noticeably increases platform stability.

Your advantage: What does the new Compute platform bring you?

For your daily operations, this means:

  • Predictable performance through clear resource allocation and dedicated resources depending on "flavor".

  • No vendor lock-in thanks to open standards (KVM, CloudStack, Ceph).

  • True automation (IaC) via a well-thought-out API, Terraform, and Ansible.

  • GDPR compliance through operation and monitoring in Germany/EU.

  • Less maintenance effort for your teams thanks to live migration and robust base components.

Typical use cases of our customers:

  • E-Commerce: Short-term handling of load spikes (e.g., Black Friday).

  • DevOps: Rapid setup of test environments in CI/CD pipelines.

  • Analytics: Temporary booking of high-IOPS instances for ETL jobs.

  • SaaS: Strict tenant separation via private networks.

  • Kubernetes: Building clusters on a standardized compute base without "special features".

Conclusion

The step from a traditional "Cloud Server" structure to an open "Compute" platform was not mere rebranding for us, but a technical restart based on proven standards. Today, an open, powerful, and scalable platform runs on our own hardware – with transparent billing, a resilient architecture, and a clear roadmap for future requirements.

If you're planning a migration, starting a new project, or want to become more independent from the hyperscalers, contact us: Our tech team will be happy to think through the solution with you.

FAQ

Is my data protected in compliance with the GDPR?

Yes. As a German provider, we operate our own hardware exclusively in certified data centers in Germany (or the EU). Monitoring and alerting are also handled via EU services, so data sovereignty is maintained at all times.

How does hourly billing work?

You only pay for the resources you actually use. When you stop an instance (shut it down), there are no costs for CPU and RAM – you only pay a minimal fee for the occupied storage space. This makes short-term test environments or scaling during load spikes very cost-efficient.

When should I choose HA storage and when High-IOPS (NVMe)?

Use HA storage (Ceph) for web servers and critical services that require high availability, as data is replicated multiple times here. Choose HI storage (local NVMe) for databases or caching services that require maximum write/read speed (up to 8,000 IOPS) and lowest latencies.

Can I control the cloud infrastructure with Terraform or Ansible?

Yes. Since our platform follows the API-first approach, all functions can be automated. We support common "Infrastructure as Code" tools like Terraform and Ansible, so you can make deployments repeatable and scalable.
