Application backups are not as simple as in the world of database lectures. Have you ever heard about the ACID properties with “A” representing “atomicity” and “D” durability? After a database commit, everything is on disk. Nothing can get lost. Plus, all commands within the transaction are on disk – or everything is undone. When a database crashes and restarts, its data reflects precisely the effects of all committed transactions. It works as designed if an application relies on exactly one database. Sadly, applications are more complex, as Figure 1 illustrates. They tend to access not only one database but several in parallel, plus write data to files and file shares – and disks attached to VMs.
Understanding Application-Consistent Backups and their Benefits
While private users can save their data by manually copying all their files to a different hard disk or the cloud, this approach is too simplistic for larger applications, even if we focus only on VM disks. If it is a large disk, files with names starting with “A” might get copied at 5:00 am and those starting with “Z” at 5:05 am. Thus, the “Z-files” could have been changed between 5:00 am and 5:05 am, making the A-files and the Z-files inconsistent. Furthermore, copying open files causes issues, changes might be only in the memory and not written to disk, or there could be pending I/O transactions. Thus, a clean-up of files and their data might be necessary when restarting the application using such file copies. The clean-up can be automated or a manual task for engineers. In any case, it prolongs application outages.
Application consistency overcomes this challenge. The idea is to perform backups such that applications run after restarting a VM without any clean-up actions. Thus, business-critical applications benefit from lower downtimes, i.e., they can provide better Recovery Time Objectives. Plus, the organization benefits in crisis events, aka business continuity management situations. When a company has to evacuate all workloads to a different cloud data center, the engineers can rely on the VMs to restart and applications to come up without manual intervention. The engineers can focus on fixing more complex issues, e.g., related to integrations with other components, rather than the complete IT department being stuck with cleaning up file systems.
The most prominent solution for application-consistent backups is Microsoft’s Windows Volume Shadow Copy Service (VSS). Microsoft products come with it, and your organization’s applications (and your third-party software provider) can also implement it for Windows workloads. The exact details of VSS are, however, not so relevant from a cloud security architecture perspective. What matters is the available features for application-consistent backups in AWS, GCP, and Azure.
Application-Consistent Backups in the Cloud
Azure comes with a solution for application-consistent backups for Linux VMs, at least for those deployed with the Azure Resource Manager and not the Service Manager. It is a framework enabling application developers or operations specialists to integrate pre- and post-scripts into Azure’s backup process. Pre-scripts can invoke, for example, APIs of the application to tell the application to finish off all “write” activities. Then, Azure performs the backup copy. Afterward, Azure invokes the post-script, and normal operations continue. For this purpose, a configuration file (Figure 2) must be on all relevant VMs.
The pre- and post-scripts and this configuration file are critical from a security perspective. They run with root privileges. Thus, they must be secured to prevent attackers having gained access to the VM to change these settings and execute malicious code as root.
The situation for Windows VMs on Azure is much easier compared to the Linux world. By default, all VM backups use the Microsoft VSS service. So, if (and only if!) the applications on the VM implement VSS, all backups are application-consistent without the need for extra VM configurations. If not, the disk backup is not application- but only file-consistent.
Finally, a quick remark on the AWS and the Google Cloud Platform (GCP) features. Both follow the same approach as Azure: pre- and post-scripts for Linux VMs, and Microsoft’s VSS for Windows VMs.
Back to the Big Picture
To conclude: Application-consistent backups reduce downtimes of applications by reducing the work for engineers after crashes or VM evacuations. However, the term application consistency can be misleading. When looking again at Figure 1, it is clear that the consistency between the VM disks and the database backups is not guaranteed. Applications have to cover the challenge that the VM disk backup is from 4:07 am, one database backup is from 4:05, and the second from 4:17. So, even with application-consistent backups, there are still exciting tasks and challenges for engineers in the area of backup and recovery!
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are essential concepts for business-critical applications. The RTO defines the maximum time an application might be unavailable after a crash – or after a data center burns down (Figure 1). That is the time engineers (or automatic processes) have to start the application on other servers – potentially at a different location – and restore all necessary data from backups.
How much data might get lost max in such an event, this defines the RPO, the Recovery Point Objective. It depends on the backup frequency. If a company makes a backup each night at 4 am, the RPO is 24 hours. Then, if the data center burns down at 4:15, only the data from the last 15 minutes is lost. If it burns down at 3 am, the data from 23 hours is lost.
One factor influencing the RTO is the time needed to restore the data from a backup. Network bandwidth and storage are significant factors, but so is the backup type. Full, incremental, and differential backups take different long.
The starting point is always a full backup. The system writes all data to a secure backup storage (Figure 2). So, why would an architect not always go for a full backup? It is simple: Suppose an application is highly critical. Thus, the RPO is 5 min. Let us further assume that it is a massive database, but not much data changes within five minutes. Then, a full backup every five minutes seems too much. It might even be technically impossible, depending on network bandwidth and the amount of data. Incremental backups can be a better solution for such a scenario.
An incremental backup copies only the most recent changes away. If the last full backup was at 4 am, an incremental backup at 4:05 am copies the changes from the last five minutes. The next backup at 4:10 am copies, as well, only the changes of the last five minutes, i.e., the changes between 4:05 am and 4:10 am. Incremental backups have on disadvantage compared to full backups: the restoration time might be longer. To restore the situation of 4:10 am, one has to fully restore the data from 4 am, then apply what changed till 4.05 am, and then the incremental changes till 4:10 am.
A middle way between incremental and full backups is the differential backup. A differential backup copies away all changes from the last full backup. At 4:05 am, a differential backup writes all changes of the last five minutes; at 4:10 am, all changes between 4:00 and 4:10, and at 4:15, all changes between 4:00 and 4:15. Thus, a restore of an incremental backup means: restore the full backup and (only) apply the changes of the last differential backup. However, these concepts get transparent with the cloud: the cloud providers implement standard configurations.
The Azure Backup Center, for example, enables customers to configure several backups per day for VMs based on the Service Recovery Vaults, e.g., every 4 hours. The first one is a full backup; the subsequent backups are incremental (Figure 3). Microsoft has decided. However, these concepts are still crucial for architects and engineers if they implement their own solutions or use 3rd party software with sophisticated configuration features. Then, they have to optimize and experiment themselves to identify an appropriate balance between …
backup frequency and full versus increment backups, and
the needed time for restoring a backup impacting the RTO
Ever wondered how large corporations design their networks in the cloud? The hub-and-spoke pattern is probably the most important to understand their on-prem and cloud network designs, no matter whether an IT department runs on GCP, AWS, Azure, or any other cloud provider.
Mesh is the “free love” vision transferred to enterprise network designs, whereas the hub-and-spoke pattern implements more of a harem concept. But to start with the big picture: The focus is not on individual applications. Network design looks at how to organize connectivity and isolation for networks with hundreds or even thousands of applications. These applications serve business needs, such as SAP or self-developed insurance solutions. The design must consider as well technical or security-related applications such as Web Application Firewalls, IAM services, Messaging Middleware, or Data Loss Prevention solutions.
Hub-and-Spoke vs Mesh Network
The first aspect is zoning. IT department group their resources. VMs belonging to an HR application might be separated from air traffic control systems. They are in two separate zones. Network design patterns define how such zones interact, i.e., which zones have connectivity and can interact directly with which other. When looking at concepts, two patterns are particularly important: the mesh network pattern and the hub-and-spoke pattern (Figure 1).
A hub-and-spoke network is hierarchical. One zone is the hub. It connects with every other zone – the spokes. It is a classic 1-to-n relationship known from harems. Resources in two separate spokes zones always interact via the hub. A mesh network builds on the free-love-idea: every zone can interact with every other zone (and zones might forward traffic to other zones in the absence of direct connectivity).
Why Hub-and-Spoke Patterns are Popular
Mesh networks (and free love) cause a mess if relationships and connectivity are not tightly managed. In contrast, the hierarchical model of hub-and-spoke networks allows for centralized governance and operations of components and eases controlling the network traffic flow.
Routing all traffic to and from the internet through one hub zone eases control and security. In this zone, DDoS protection, the (network-) data-loss-prevention solution, and other critical internet-connectivity-related applications must run. Then, perimeter security is in place for the complete data center.
One dedicated zone with internet connectivity also eases setting up web application firewalls (WAFs), which usually require integration with an IAM solution. Having both in dedicated central zones is much easier than having WAFs in ten zones and IAM solutions in seven.
Centralizing components in a hub is useful beyond internet connectivity: (Azure or on-prem) Active Directories, messaging middleware, application performance monitoring, and many more benefit from centralization. This insight does not mean building one monolithic hub with all centralized applications. Enterprise-scale networks can have several hubs, e.g., one serving as a management zone with monitoring and another for internet connectivity (Figure 2).
Grouping Criteria for Zoning
Zones are groups of (cloud) resources, e.g., VMs, isolated from other resources and VMs in other zones. The grouping criteria differ between organizations, companies, and departments. Typical grouping criteria are (Figure 2):
Applications or groups of applications forming one solution
Stages such as Production, Preproduction, Integration, Test, and Development.
Teams or departments managing the resources and potentially following different change management processes
Sensitivity of the data of applications
Cloud Features for Implementing Zones
Cloud providers do not offer a “zone” or a “hub-and-spoke network” feature. They provide sophisticated building blocks for structuring workloads and networks. Tenants, Virtual Private Clouds (VPCs), and Subnets are features for organizing networks and resources hierarchically. AWS, Azure, GCP, and all the others provide routing tables for – well – routing. Access Control Lists and Network Security Groups (the exact names differ between the clouds) enable engineers to implement firewalls for blocking or allowing specific network traffic. The clouds come with Internet Gateways, NATs, etc. Everything is ready for you to implement effective and secure network designs. However, whether you do that and how many free-love-like connections you allow is up to you and all other network architects of the world.
The concept of the Amazon Virtual Privat Cloud (VPC) and Subnets are the AWS terms for structuring the network. A VPC consists of a continuous IP address range represented by a CIDR block, e.g., 10.0.16.0/24. A network design with overlapping IP ranges for different VPCs is technically possible though it can result in issues when connecting, aka peering, such VPCs later.
The network design divides the IP range of a VPC further into Subnets. Subnets are the canvases into which engineers deploy VMs – EC2 instances in AWS speak, short for Elastic Compute Cloud (Figure 1). Subnets belonging to the same VPC must not have overlapping IP ranges, but they do not have to consume the complete range. Non-used IP ranges ease responding to changing business needs and IT landscapes. So, everything is similar to what we know from GCP and Azure. The way subnets are protected, however, differs slightly.
Figure 1: Network diagram with VPCs and Subnets and corresponding screens in the AWS GUI
Controlling Network Traffic with Network ACLs
One AWS concept for securing the subnet perimeter is the Network Access Control Lists (Network ACLs) feature. Network ACLs allow or deny ingress and egress traffic. They act as a stateless firewall that checks traffic against a simple ruleset. Inbound rules (Figure 2, A) allow or deny traffic from specified IP addresses outside the subnet. Outbound rules (Figure 2, B) inspect traffic leaving the subnet. In addition to the source or target IP, rules can consider the port number (e.g., 23, 8080) and the protocol type. If there are conflicting rules, the priority of the rules is the deciding factor.
Figure 2: Sample Network ACL
When designing a concrete network, architects should be aware of default behavior and setups – and how Network ACLs, VPCs, and subnets relate. First, each VPC has a default Network ACL. It allows all ingress and egress traffic to and from the subnets within the VPC. However, the VPC (by default) has no connectivity with the internet, external networks, or other AWS VPC, be it in the same or another AWS account (we discuss this in the following section). Engineers can either change the default Network ACL or create a new one and associate it with one or more subnets of the same VPC.
A simplistic example is a VPC hosting several web applications. Each application might have a frontend and a backend subnet. Frontend subnets have open HTTP and HTTPS ports plus an open JDBC port to connect to the database backend subnets, which allow only JDBC traffic. Figure 3 illustrates such a scenario. VPC A has a Network ACL “Frontends” associated with Subnets SN1 and SN2. There is a differenet second Network ACL “DB Backend” associated with Subnet SN3. VPC A has a default Network ACL, but it is not in use. In contrast, the default Network ACL for VPC B is associated with the only subnet there, SN7.
Figure 3: Understanding Network Access Control Lists NACLs) for AWS IaaS Workloads
In addition to Network ACLs, AWS provides a more sophisticated second feature for controlling network traffic: Security Groups. While Network ACLs impact the subnet perimeter, Security Groups focus on the traffic going through the network interfaces of individual VMs (or EC2 instances), for which AWS uses the term Elastic Network Interface (ENI). Applicable Security Groups can be configured per ENI. Each ENI must have at least one but can also have multiple Security Groups associated.
The rules for Security Groups specify the protocol as well as the source, respectively target IP addresses on the target with the following main differences:
Security Groups control traffic going through the ENI. Thus, they can also restrict traffic within a subnet, and different VMs within the same network can have different applicable Security Groups.
Security Groups act as stateful firewalls. Only the initiating in- or outbound traffic has to be allowed. Reply traffic is allowed automatically.
There are only “allow” rules, no “deny” rules. Adding an additional Security Group to a VM might result in more ports, protocols, and target or source IPs (or Security Groups) being allowed. Adding rules to Security Groups never restricts or forbids traffic.
By default, all outgoing traffic is allowed, and all incoming traffic is denied.
An important aspect from the security perspective is that EC2 instances can have secondary ENIs with a second IPv4 address which puts EC2 instances logically into two subnets. There are even more options in a world with IPv6.
Figure 4 illustrates a network design with two subnets in one VPC with one default Security Group and a newly created second security group named SG-1. Both apply to the ENI2 network interface, whereas all other VMs only have one security group. The example also illustrates two network interfaces being attached to the same EC2 instance, which, as a result, becomes part of subnet SN1 and SN2.
Figure 4: Security Groups as Means of Protecting IaaS Workloads in AWS
Connecting VPCs with the Outside
A VPC without connectivity to other VPCs or the internet is securely protected against outside attacks – and completely useless. Suppose users, admins, customers, or other programs cannot start any processing because no one can reach the EC2 instances, and they cannot access outcomes and results. In that case, there is no need to have a VPC At least some EC2 instances, respectively, subnets and VPCs they belong to must interact with external servers and services. It creates an attack surface, but that is unavoidable and taggling the risks is part of the network security design.
VPC Peering is the essential concept enabling communication between VPCs belonging to the same AWS account or other AWS accounts within the same AWS organization. Components and VMs in peered VPCs interact as if they belong to the same VPC. All traffic stays internal. No traffic goes via the internet, keeping the attack surface low regarding outside attacks.
Shared VPCs are a similar concept. AWS accounts can share a VPC with other AWS accounts within the same AWS organization. Then, the other AWS accounts can create and manage components within this shared VPC. However, while everyone uses the same Shared VPC, resources of the different AWS accounts are managed and seen only by the AWS accounts to which they belong. AWS markets Shared VPCs as a feature to reduce the number of VPCs while still ensuring high interconnectivity, separate billing, and isolation or access control.
VPC Peering and Shared VPCs are powerful concepts but are limited to the IT landscape of one single company. However, Internet Gateways and NAT Gateways are key AWS features when companies interact with others. In this context, Architects have to understand: Do they need only traffic from within a subnet or VPC to the internet, or should EC2 instances be directly reachable from the internet? The latter, obviously, poses a higher attack surface and security risk since attackers can directly reach out to VMs.
Exposure to the internet requires EC2 instances to have a Public IP and the subnet/VPC to have an Internet Gateway component with a suitable routing table. Such a design allows ingress and egress traffic from and to the internet.
If EC2 instances only need to initiate egress traffic, a NAT Gateway is the solution. It takes invocations from EC2 instances within a subnet, routes them to external IPs, and sends replies back to the initiating EC2 instance. On the way, the NAT replaces the internal caller IP with its external IP and vice versa for replies.
AWS comes with various additional and less-frequently used concepts. Three features or services center around interactions between an on-premise and cloud infrastructure: AWS Virtual Private Network, AWS Direct Connect (dedicated, private connections), and AWS Transit Gateways (managing hybrid networks).
Finally, there is a connectivity or network feature named AWS Private Link. It is less relevant for the IaaS world of EC2 but crucial for application landscapes incorporating AWS PaaS and DBaaS services. I will cover the concept in one of my next posts.
PaaS services promise software engineers nothing less than effortless eternal happiness: a few clicks and you have ready-to-use, secure, and unlimited scalable cloud service, be it an AI, a database, or a super-exotic service. But do cloud providers deliver as promised? Is, for example, redundancy a child’s play? Redundancy is much trickier than backups. Backups are about having a copy of an application’s data for emergencies. In contrast, redundancy guarantees that the application runs 7×24. Even major disasters must not impact a solution’s stability.
In the following, we look closely at one crucial Azure service, Azure Storage Accounts. It is Microsoft’s web service for storing data such as objects respectively, blobs, or files. We elaborate on which kind of disasters or failures the web service copes with – and how transparent it is from an application or operations perspective. Azure provides different options for configuring the recovery behavior of Storage Accounts (Figure 1), which the following paragraphs explain in more detail. Companies need specific guidelines on when to use which one – and writing such a guideline is a crucial cloud (security) architecture task.
Cloud architects have to make (up to) three redundancy-related configuration decisions:
Should the data be replicated three times within one data center (local redundancy), or should there be a copy in three separate data centers (zone redundancy)?
Should there be copies in a data center further away in a secondary region (geo-redundancy)? For regulatory and compliance reasons, geo-redundant copies are usually in the same jurisdiction as the original data. Azure stores, e.g., geo-redundant copies for data in Switzerland-North (greater Zurich area) in the Azure region Switzerland-West around Geneva.
Should applications be allowed to read the geo-redundant copies if the primary region fails (“read access”)?
Figure 2 and Figure 3 illustrate the configuration options for a better understanding. The architecture in Figure 2 visualizes local redundancy. An application reads and writes data stored in one data center (A). The different copies are synchronously replicated and always in sync. If – and only if – the engineers choose the geo-redundant add-on, Azure replicates the data asynchronously to one data center in a secondary region (B). In such a case, the Storage Account can be configured to allow applications to read from the storage account in the secondary region if the first region is unavailable (C).
Figure 3 looks very similar. It illustrates zone-redundancy. In such a set-up, the three copies are in different data centers in the primary region (A). If geo-redundancy is active, all asynchronous copies in the secondary are stored in the same single data center (B)as in the previous case in Figure 2. Applications can get read access in case of unavailability of the primary region – as with local redundancy (C).
In the next step, we look closer at three failure scenarios: a failure of a single device, a data center, and an entire region. Which configuration option allows the application to continue to run? How transparent are such incidents for applications? What does the cloud operations team or the applications team have to do?
First, a failure of a single device in a data center has no impact on any application or operational staff. Azure always maintains three up-to-date copies of any data, be it in one or be it in three data centers. These copies are always in sync. Thus, if a device fails, another device in the same or another data center takes over and has precisely the same data. There is neither data loss nor interruption of application availability, nor any necessity for manual interventions.
A failure of a complete data center is transparent for zone-redundant storage accounts. Two other data centers have in-sync copies of the data if one data center burns down or is temporarily unavailable, e.g., due to power issues. Applications can continue to read or write to the Storage Account. No engineer has to intervene. In contrast, a data center failure can have severe consequences if a Storage Account is configured with local redundancy. The data becomes unavailable for read and write operations. If the data center does not come up again, all data is gone.
If an entire region fails due to a large-scale disaster, all Storage Accounts without a geo-redundancy option are temporarily or forever unavailable. Storage Accounts with geo-redundancy have a copy in a different region. In contrast to “normal” local or zone redundancy, geo-redundancy is not transparent. It might require application code adaption. Plus, the cloud staff has to perform failover and cleanup tasks.
Failover to the secondary region is an explicit decision of the customer. Its cloud staff must initiate the failover. Then, the secondary site becomes the new primary one though two aspects require attention. First, loss of data might occur. Microsoft has no formal SLA but mentions that the recovery point objective (RPO) is 15 minutes. In other words: data written in the last 15 minutes before the failure is lost. The application (and the business) must be able to cope with that. Second, Storage Accounts store their data only in one data center after the failover (local redundancy). Engineers have to reconfigure the Storage Accounts to reassure zone- or geo-redundancy. Eventually, they might want to move the data back to the original primary region.
When an entire region fails, this impacts applications even with the geo-redundancy option and even if the application code continues to run. Applications cannot read or write any data to or from their Storage Account in the primary region. Either the application code handles such circumstances (e.g., by showing users a “temporarily unavailable” page), or the application behaves unpredictably or crashes. Applications can reduce the impact on users with more sophisticated routines. The option to allow applications to read from secondary regions in case of a failure of the primary one enables applications to show users at least the currently available data. Writing changes is not possible – or only to another storage account not impacted; quite some work, but potentially worth the effort for highly business-critical applications.
To conclude: Azure offers a variety of configuration options for redundancy for Azure Storage Accounts, one of their most used services. Choosing an adequate option is an essential task for cloud architects. They have to balance the costs of the more sophisticated options with the actual availability needs based on the criticality of their applications.
With all the efficient and innovative Platform-as-a-Service (PaaS) services in the cloud world, a catastrophe is only one click away. An engineer wrongly configures a database or object storage service, and cybercriminals have access to all data from anywhere on the internet. Booz Allen Hamilton, WWE, Verizon Wireless, Accenture, the Pentagon: these are just some prominent companies and organizations that misconfigured their AWS S3 object storage and lost millions of data records. So, how can cloud security architects avert such catastrophes?
The article exemplifies typical but risky configuration options in Microsoft’s Azure and the Google Cloud Platform (GCP). Then, its focus shifts to Azure’s concepts of Private Endpoints and Private Links and to GPC Service Controls. Looking at the different features of the two sample public cloud providers allows for a better understanding of the various security risks and mitigation approaches when protecting PaaS services.
Basic PaaS Configurations
When creating PaaS service instances via the GUI, the cloud vendors usually suggest (also) security configuration options with a relaxed security level. Their motto: simple and easy to set up, every engineer trying out the cloud and a specific service should succeed. But such configurations come with risks that this section elaborates on in more detail.
In classic data centers, a solution can invoke internal web or microservices only if meeting two conditions:
Authentication (and authorization), i.e., only specific applications, solutions, and users can invoke the service after verifying their identity and when they have the needed access rights and roles.
Reachability of the service in the network: Networks are typically divided into different zones, separated by firewalls. Invoking services in different zones requires opening up connections.
PaaS service default configurations usually require authentication but come with limited network-level protection. By default, PaaS services belong to one zone: the worldwide internet, where everybody can reach everything.
Figure 1 illustrates the network security configurations when creating instances for two typical GCP PaaS services: Cloud Storage, Google’s object storage service, and managed SQL database instances. When engineers create a database service in GCP, they have to choose between a public and a private IP address (Figure 1, 1). A public IP means that everyone on the internet can connect to this service. If the authentication mechanisms are in place and configured correctly and if attackers cannot get access keys etc., this is a perfectly safe approach. However, it is a challenge to ensure that none of hundreds or thousands of engineers in a larger IT organization ever make a critical mistake. And indeed, Goole provides the option to restrict access to certain network zones to reduce the risk (Figure 1, 2).
When creating a Cloud Storage bucket, engineers must decide whether to enforce the prevention of public access to the data stored in this bucket. The challenges with any object storage are – be it in GCP, Azure, AWS, or any other cloud – contradicting network security needs for the two primary use cases.
The first use case for object storage is storing classic (internal) application data. An archiving system might store Pdf files, security solutions potentially video recordings that document who performed which command on a particular system. Such data must not be made available to the internet. The perfect security setting preventing for such an object storage service instance is “forbid all internet access.”
The second use case for object storage is storing and delivering websites, e.g., a web page with an interview text plus incorporated pictures. Everyone on the internet should be able to access the web pages and read the interview. There is no need for network firewalls or authentication mechanisms.
So, there are two highly relevant use cases, and one technology solves both needs and challenges. That is, usually, perfect, just not in this case. In this peculiar case, it is a security risk. When you configure the object storage for a bank’s know-your-customer documents by mistake as publicly accessible website storage, you might only notice when your documents appear for sale in the darknet.
Public IPs and the risk of misconfiguration is not a GCP specialty. Figure 2 presents a GUI mask for creating a Cosmos DB database service via the Azure portal. Two settings create public endpoints: “all networks” and “public endpoint (selected networks).” Especially in the case of “all networks,” anyone on the internet can reach the Cosmos service instance. In other words, access control misconfigurations can have catastrophic results.
To point out the situation clearly – and this is the same for GCP Cosmos DB or Azure SQL databases and many more database services in the cloud – there is no sensible reason why databases should have a public IP. Databases are backend systems. There is no point reaching them from outside the organization.
The screenshot in Figure 2 has a third option for Cosmos DB: Private Endpoints. The following section looks at the details.
Azure Private Endpoint and Link
Some time ago, Azure introduced Private Endpoints and Private Links. They provide an extra layer of security. Thanks to them, engineers can incorporate PaaS service instances (e.g., Cosmos databases) into a VNet and access PaaS services without going via public IPs. When creating a Cosmos Database Service instance with this option, engineers create a Private Endpoint in their VNet with a VNet-internal, private, non-public IP. The Private Endpoint points to the actual resource, e.g., a Cosmos Database, via a Private Link.
“Private Endpoint” is the third connectivity option in the example of creating a Cosmos service instance (Figure 3). When chosen, engineers can create and add a private endpoint. The Private Endpoint has a name – and the engineer specifies into which VNet Azure shall deploy the Private Endpoint.
Figure 4 illustrates what happens in the background in the case of a private endpoint / private link connectivity (B) versus the traditional approach (A).
In addition to configuring Azure Private Endpoints and Private Links, engineers must be aware of and mitigate two more risks. First, Private Links and Private Endpoints are for themselves save, but the cloud architecture must ensure no firewalls are open on the resource itself (risk C). The benefit is that there is no need to open a firewall to make solutions work and components interact. Thus, forbidding opening ports gets feasible. Second, criminal employees might try to exfiltrate data. They could send company data to a personal Cosmos Database service instance controlled and owned by themselves as private persons (D). Cloud security architects can demand that all traffic to storage or database services goes via Azure Private Endpoints thanks to Private Endpoints and Private Links. Auditing and analyzing networks for exfiltration paths gets much more straightforward.
Before discussing Google’s real cool feature for PaaS perimeter protection, a final remark about Azure Private Links. Azure Private Link Service enables engineers to make their own services and resources available via Azure Private Endpoints. Thereby, they benefit from the same security features for their code that Microsoft relies on for protecting its Azure PaaS services.
GCP Perimeter Protection with VPC Service Controls
Google’s VPC Service Controls concept enables cloud security architects to build a perimeter-protected network zone composed of IaaS and (!) PaaS resources. A VPC Service Control definition comprises five main elements: projects, services in scope, accessible services, access level, and traffic exceptions.
The GCP projects of a VPC Service Control define the trust perimeter. These projects trust each other and can freely interact as defined by the network configurations. In contrast, applications in other VPC Service Control perimeters must not access the resources if not permitted explicitly.
The GCPServices part of a VPC Service Control definition specifies for which services the perimeter protection of the VPC Service Control applies. That is a crucial setting. If, for example, Cloud Storage is not part of the perimeter definition, engineers can easily transfer data in and out of the VPC Service Control perimeter. A note: Google adds the feature to more and more services, but it might not yet be available for all needed by a company. Thus, disabling unprotectable GCP services can e a necessity.
GCP Accessible Services add another nuance to perimeter definitions. The feature enables engineers to limit the GCP services, which resources can invoke within the Service Control perimeter.
As a result of a VPC Service Control definition, services respectively their callability fall into three categories:
Service instances of GCP services under VPC Service Control perimeter protection. They are only accessible from within the perimeter. Plus, they cannot invoke external service instances.
Service instances of GCP services without restriction. VMs within a VPC Service Control can connect to VPC-internal and external service instances. Plus, VPC external resources can (try to) connect to VPC-internal instances of this type.
Services blocked for use within a VPC Service Control (“GCP Accessible Services”) setting.
The GCP Access Level is an option for context-aware authentication and authorization. It is a new and upcoming approach requiring a detailed analysis worth a dedicated article.
A last core element is an option for defining exceptions, i.e., adding ingress and egress trafficrules allowing otherwise not allowed traffic into and out of the VPC Service Control. Thus, VMs from within one service parameter might get read access to Cloud Storage in a different service perimeter.
VPC Service Controls in the GCP world and the Azure Private Link/Private Endpoint help cloud security architects shield PaaS (and IaaS) service instances from the internet or other internal network zones. Their absoluteness – especially the exceptionally rigid VPC Service Controls – is what makes them so valuable. No one has to define, analyze, and manage thousands of rules and exceptions to understand and ensure the overall security posture of a network consisting of IaaS and PaaS resources. Instead, PaaS security becomes (nearly) a child’s play.
Commonalities and differences between securing VM-based Workloads on Microsoft Azure and the Google Cloud Platform (GCP)
When moving from an on-premises data center to the cloud, architecting and engineering a secure network is the first challenge. The days of buying hardware and wiring devices are over. Instead, cloud network engineers configure the network with templates and web GUIs. A secure network setup reflects three needs:
Workload structuring and separation, i.e., the high-level network design
Fine-grain connectivity and security controls
Enabling intra- and inter-company interactions between systems, applications, and components
For these three areas, we answer the following questions. First, which network architectural features do cloud vendors offer to reach these aims? What are the commonalities and differences between cloud providers such as Microsoft Azure or Google’s GCP? The latter helps architects to understand underlying principles rather than just product features.
The High-Level Network Design for Clouds
Four layers structure the cloud network security (Figure 1), with the second-highest being the most intuitive to understand. This second-highest layer corresponds to classic, simple on-prem networks. An IP range constitutes a network or network zone and gets a name and identifier. It is a moment of creation: let there be a network with the IP range 10.0.0.0 to 10.15.255.255 and call it “production environment headquarters.” In general, network architects should choose addresses from the private IP range, i.e., from 10.0.0.0 to 10.255.255.255, 172.16.0.0 to 172.31.255.255, or 192.168.0.0 to 192.168.255.255.
Zoning concepts typically reflect one or more of the following dimensions to structure the network landscape and workloads:
Geographical locations: Germany, Switzerland, Singapore, and China – or New York, Boston, San Francisco
Stages, for example, production, preproduction, test, and development
Business lines, e.g., investment banking, retail banking, corporate banking, private banking, or wealth management
Data sensitivity classes such as secret, confidential, internal, or public
IT departments subdivide network zones into subnets, each taking over a subset of the zone’s IP range. A zone with IP range 10.0.0.0 to 10.15.255.255, for example, cloud have two subnets: 10.0.0.0 to 10.0.255.255 and 10.4.0.0 to 10.4.63.255. Subnets allow for finer granular structuring.
Zoning is a complex task – and that is why cloud security architects have to support the design process. Should all VMs of an application reside in one subnet – or all VMs belonging to the same application or solution cluster? Should frontend and backend components reside in one subnet, or should they be in entirely different zones? Security architecture is one (important) stakeholder. Network architecture and business and enterprise architecture have a say as well.
Implementing High-Level Zoning Concepts in Azure
In Azure, VNets and Subnets are the features for implementing zones and subnets. A VNet’s IP range is coherent. IP addresses within a VNet are unique, though different VNets can have overlapping IP addresses. Within VNet A with all its subnets, there can be only one VM with IP 10.10.10.10. However, if an IT department has two VNets Z1 and Z2, both can contain one VM with IP 10.10.10.10. Furthermore, all resources of a VNet, such as VMs, reside in one region.
A fundamental security aspect is the connectivity between VNets and Subnets. Which VMs can talk with each other by default, which ones not, and which kind of shielding can engineers implement? Azure isolates VNets with their subnets and VMs from each other. Without explicit configuration, a VM in one VNet cannot connect to a VM in a different VNet. In contrast, VMs within a VNet can contact each other by default, even if residing in different Subnets. So, placing VMs in two subnets of the same VNet does not provide any extra level of security without additional (fine-granular level) measures we discuss later.
Implementing High-Level Zoning Concepts in GCP
VPC Networks and Subnets are GCP’s corresponding concepts for structuring a network and its IP range. VMs placed in a VPC Network, whether in the same or different GCP Subnets, can reach every other VM within the VPC. GCP allows VPCs to have overlapping IP ranges, as Figure 2 illustrates. In the example, there are VPC Networks with the IP address ranges 10.0.0.0/16 and one with 10.0.0.0/20, which obviously overlap.
A GCP Subnet is a feature preventing chaos in large VPC Networks. A Subnet takes over parts of the IP range of the VPC Network to which it belongs. All IPs are unique within a Subnet, implying that Subnet IP ranges do not overlap. In GCP, Subnets also have a geographical aspect. Resources within a GCP Subnet reside in one geographic GCP region.
The Management Layers
The need for a sophisticated management layer for organizing hundreds or more VNets or VPC Networks is evident when looking at the companies and customers the cloud vendors target. They do not want (only) small companies or startups paying for 5 or 10 VMs. They want the big fishes as well: corporations listed in the Swiss Market Index, the Eurostoxx 50, etc. These companies need thousands or ten-thousands of VMs. Two layers – zones such as VNets or VPCs plus Subnets – are insufficient for large clouds. Thus, the cloud providers introduced additional management layers on top.
When trying out the clouds the first time, many get the impression that accounts are the highest ordering principle, typically represented by an email address and an additional name or identifier. But cloud providers offer features for combining accounts to GPC Organizations in the Google world and Azure Management Groups for Microsoft. They help for governance and billing purposes or identities. These features help companies for which the move to the cloud was a decentral (chaotic) grass-root movement – or when company networks and cloud infrastructures evolve over the years due to mergers and acquisitions.
Below an Azure Subscription, aka an account, Microsoft provides the concept of Resource Groups, into which all VMs, VNets, and all other components are placed. Google has the concept of GCP Projects, which corresponds to the Azure Resource Groups. But there is one big difference: Google’s idea of GCP Folders, a tree-like flexible structure with multiple layers and nesting. GCP folders help in modeling and structuring workloads based on various dimensions such as business units, geographic regions, or stages with ease (Figure 3). In contrast, Azure has a rigid 3-levels model: organization, subscription, resource group.
The management layers typically provide options to enforce and monitor configurations (“policies”). However, here we focus on directly relevant network security configuration options. In this context, GCP has one related feature: Hierarchical Firewall Policies. GCP allows associating organizations and folders with such policies, which enforce specific firewall rules for all folders and VMs hierarchically lower (or to delegate decisions to lower levels) or restricting the applicability of rules to specific VMs by specifying target networks or service accounts.
Fine-Granular Access Control for Azure
Microsoft and Google follow different philosophies regarding fine-granular access control (and firewalls) within a VNet or VPC Network. Azure provides more – and more rigid – concepts making it essential for engineers to understand the interplay. GCP has fewer, though partially more general or flexible features.
The Azure World
Azure enables architects to secure networks with three concepts: network security groups, application security groups, and firewalls.
A network security group (NSG) is a firewall attached to Azure Subnets or individual VMs (specifically: a VM’s network interface), enabling engineers to control connectivity even within Azure Subnets. An NSG defines which IP addresses outside the Azure Subnet can reach which inside IP address – and vice versa. The rules can apply to individual IPs, groups of IPs, or the complete Subnet. If VMs serve different purposes – frontend, backend, or database servers – NSGs allow configuring their exposure to other subnets or the global internet differently, based on exact necessities and the risk appetite.
Application Security Group (ASG) are a refinement of NSGs and overcome their biggest downside: IP-based rules. IP addresses might change and/or if the number of frontend servers handling customer requests increases, all relevant rules in the NSG rule sets require a modification.
An ASG is a set of VMs with a dedicated tag. They are not (!) Identified based on an IP. Instead of opening the HTTPS port to 10.5.10.26 in a pure NSG setup, ASGs allow opening the HTPPS port for all VMs of the subnet belonging to the ASG “asg-frontend-vms”. The ASG might contain a VM with the IP 10.5.10.26, but any other VM within the subnet would be governed by the same rules as soon, and as long the VM is part of the ASG.
Figure 4 provides an example of an ASG – “asg-frontend-api-servers” – on the left. The sample VM on the right belongs to precisely one ASG: “asg-frontend-api-servers”.
An ASG impacts the VMs associated with it once the ASG becomes part of an NSG. Suppose an NSG allows HTTPS traffic to a VM handling incoming API requests. Then, rather than explicitly listing IPs in the NSG (and updating them when starting or stopping VMs), the engineers create an ASG and add the relevant VMs to this ASG.
In addition to VNets, Subnets, Network Security Groups, and Application Security Groups, Azure also allows configuring a dedicated security service, Azure Firewall. Such a firewall helps block unwanted traffic. It is a component typically placed at the perimeter between company VNets and the global internet. In contrast, VNets, Subnets, NSGs, and ASGs help control the company-internal network traffic.
The various components and features might be overwhelming when looked at the first time – and seem to be overlapping. That’s one way to see it. The other way to look at all these features is that managers and architects can achieve the same goal with overlapping features – and thus share responsibilities between teams more efficiently. A high level of network security when going live is necessary, but it is not enough. IT departments have to manage and maintain all network components to stay on the same security level in the years to come. More sophisticated and partially overlapping features make the work distribution more manageable.
Fine-Granular Access Control for GCP
NSGs on the Subnets and the VMs level together with ASGs – Azure provides various options for fine-granular connectivity configuration. In contrast, GCP provides firewalls (only) on the VPC level. Rules applying only to one VM are part of the VPC firewall rulesets. This minor technical finesse impacts network or firewall processes in IT organizations. Rules on the VPC level favor central firewall management, making it more challenging to delegate (some) firewall rule decisions to individual application teams.
Figure 6 illustrates options for defining VPC Firewall Rules. A rule can apply to all VMs of a VPC Network. It can apply to specifically tagged VMs (like the ASG concept in Azure) or VM with specific service accounts. The latter is a concept not found in Azure. When defining firewall rules, architects can configure the applicability further related to where the network traffic comes from based on IP ranges, source tags, or a service account within the same VPC.
To sum things up: firewall rules and connectivity in GCP and Azure provide different concepts. The result: different setups and organizational approaches. Just one final warning for GCP: The GCP portal presents default rules for opening SSH, ICMP, or even all ingress traffic. These rules simplify the first steps in GCP for newbies – and might result in risky configurations.
By default, VMs in different VNets and VPC Networks cannot communicate. However, there are solutions for inter-zone communication – and they do not require traffic via the internet: Azure VNet Peering and Google Cloud VPC Network Peering.
Suppose one zone has the IP range 10.0.0.0 to 10.0.255.255 and a second one 10.4.0.0 to 10.4.0.255, VMs, VMs can interact easily. No colliding IP addresses cause issues when determining an actual target VM for traffic. However, why should cloud architects peer zones, thereby allowing communications between VMs placed in two zones to isolate the workloads?
Peering is not beneficial for two-zone networks but for networks with many. GCP, for example, allows peering VPC Networks of the same or different GCP Projects. The VPC Networks can even belong to different GCP Organizations. VMs in peered VPC Networks can communicate with each other if not prevented by firewall rules. The twist is the non-transitive nature of peering. Suppose three VMs – VM1, VM2, and VM3 – are in three different VPC Networks: Z1, Z2, and Z3. Peering Z1 and Z2 plus Z2 and Z3 enables the communication between VM1 and VM2 and between VM2 and VM3, but not between VM1 and VM3. Non-transitive peering is vital when setting up typical hub-and-spoke network designs (Figure 7).
VNet Peering in Azure comes with different features and two main configuration options:
Enabling inbound and/or outbound traffic separately – not just allowing or disabling bidirectional traffic as in GCP
Restricting or approving traffic not originating in the peered VNet, also a feature not available in GCP
Peering is a simple yet powerful concept. It helps enable and control company-internal network communications, especially for setting up zones hosting communication middleware or management components.
Connecting VMs and the Internet
Public IPs for VMs are evil; this is the first message a security architect should hammer into all engineers’ minds. Vulnerabilities on the VM or application layers or simple misconfigurations are potential entry doors for attackers. However, locking up all VMs is also not possible. Admins cannot plugin keyboards to VMs to administer them. They come via the network as well.
The solution? Intermediate components or services that forward only parts of the network traffic (e.g., only TCP while blocking UDP traffic) and potentially even inspect the traffic. In particular, the following components and services are relevant:
Network Address Translation (NAT) Services such as the GCP Cloud NAT or Azure Virtual Network NAT. They allow VMs without public IPs to invoke services on the internet without exposing the VMs to external attackers.
Load Balancers such as the Microsoft Azure Load Balancer or Google’s Cloud Load Balancer. Their primary purpose is to distribute high loads to multiple backend VMs, hiding this fact for the invoking applications. At the same time, a load balancer reduces the attack surface, first by preventing direct access to VMs and, second, by limiting the traffic that gets through, e.g., to the HTTP(S) protocol.
Web Application Firewalls restrict even the content of http(s) requests, preventing, e.g., attacks with malformed invocations.
Azure Bastion allows admins to connect to and manage VMs without publicly exposing a VM with an IP. It works as follows: administrators connect first to the Azure portal. From there, the bastion host provides access to the VMs. A bastion host is a hardened server, making it less likely that (unneeded) components potentially have security issues.
These services are worth more profound analysis. Plus, for some cases, 3rd party solutions are an option, not only the ones from the cloud vendors. But all this is another long story, as is protecting VMs and their workloads against malware and other attacks.
To conclude, Microsoft Azure and Google’s GCP provide similar yet different features for creating and securing networks. Engineers aiming at getting applications to run on VMs benefit from their similarity. In contrast, security architects have a different perspective. They do not want to open up network connections. They must close the obvious and hidden wholes in the network design. For them, it is essential to understand all these little features and details where GCP and Azure – or any other public cloud – differ. It is less searching for easter eggs on a sunny day in spring. It is more isolating and avoiding mines in challenging terrain or stormy sea.
“Check whether all needed services are active!” It was a ubiquitous warning at the beginning of all hands-on labs when I took my first Google Cloud Platform (GCP) tutorials. I could not understand Google’s philosophy. Why would you disable a service? Why does Google not activate them by default and keep them on all the time?
Today, about two years later, I am a big fan of this great feature. It eases my work as a cloud security architect. IT departments benefit from lesser costs and less friction between engineering and security. Plus, the feature helps to protect Platform-as-a-Service (PaaS)-heavy workloads.
Firewalls keep many attackers away in the old world of servers and virtual machines. In a PaaS world, services are directly exposed internally or to the internet, making inadequate authentication and authorization mechanisms and incomplete configurations a considerable risk. Thus, hardening PaaS services is a necessity. If left to application engineers, some might forget about the hardening or projects under pressure delay the hardening to a later point in the future – and it remains a future task to eternity and beyond. Here, the option to disable PaaS services makes a difference. Figure 1 compares a hide-and-seek and a governance-based working model.
The upper process illustrates what is a matter of time in a larger organization. Application teams use a new, innovative PaaS service without proper hardening. The CISO organization finds out and is furious. Next, they analyze together with the cloud platform management team which configurations are appropriate and necessary for hardening the service. Typically, the hardening bases upon a CIS benchmark or cloud-vendor best practices. Next, the platform management organizes the hardening – and the engineering team (hopefully) does not have to make too many modifications to make their code work with the hardened service.
Figure 1: Hardening approaches with and without security governance
The corporate culture and employee motivation do not benefit from CISOs-turned-cops working style always looking for guilty engineers. Reworking code to work with hardened services causes extra costs, especially when more extensive architectural adjustments are necessary. Plus, there is an increased risk for security incidents till the completion of the hardening. Thus, a more bureaucratic process (Figure 1, lower process) that seems to slow innovation and agility is often the better choice.
When an engineer wants to use a new service, he informs the cloud platform management. The latter elaborates the hardening requirements with the CISO organization and ensures their implementation. Afterward, the cloud platform management makes the newly hardened PaaS service available for the engineers, who integrate it into their solutions.
To enable a new PaaS service in the Google Cloud Platform, the cloud platform management changes to the “all products” overview in the GCP console, selects the service it wants to activate, and presses “Enable” (Figure 2). In Google’s GCP, the activation is just a single click on a button.
Over the holiday, I took my time to read this book about state-sponsored attacks. Quite interesting if you want to model potential attacks on your company and organization. My summary in three parts distilled my learnings from reading the book.
Comparing TLS and OAuth 2.0 sounds like comparing apples and pears for seasoned security experts. Still, it is a great way to illustrate the various facets of authentication in complex, open, and interconnected IT landscapes.