Azure Backup Functionality for IaaS Workloads

VMs and Fileshares are fundamental building blocks for IaaS workloads in on-prem or cloud environments such as Azure. So, which features does Azure provide for companies to back up their IaaS-related data and components? And how does Azure help prevent undesired manipulations or deletions of such backups? To answer these questions, this article elaborates on the interplay between, first, Azure’s VM and file share services and their configurations with, second, Azure Recovery Services vaults for providing, managing, and maintaining VM and file share backups.

An initial remark before going into details: geo- and zone-redundancy are related but different concepts. They reduce the risk of data loss due to the unavailability or destruction of hardware components and data center incidents. They do not allow restoring earlier versions of the data and backups after erroneous or malicious data manipulation. “Going back in time” is the unique selling proposition of backups. Still, keeping backups in various geographic locations helps, e.g., if a large power outage brings down data centers for days across a larger geographic area.

Backup Features for VMs in Azure

Azure’s solution for backups of IaaS cloud workloads is the Recovery Services vault (RSV). The configuration for VM backups in the portal is straightforward. With the “enhanced backup” feature, Azure can back up VMs as often as every 4 hours (Figure 1, 1). There is an option to keep backups close to allow for a quick restore (2). Most important, and not available for all types of data and services in the cloud, Azure provides a long-term storage option to keep one backup per week, month, or year for months or even years (3). The final configuration option in the Azure portal is the list of VMs in scope for the backup – then, the configuration is complete.

Figure 1: Configuring VM Backups in the Azure Portal
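
To make the idea of frequent backups combined with long-term retention more tangible, here is a minimal Python sketch of a tiered retention rule. It is not Azure's implementation; the function name, the tiers, and the default periods are assumptions chosen for illustration. It keeps every recent backup, plus the newest backup per ISO week and per calendar month for longer periods.

```python
from datetime import datetime, timedelta


def backups_to_keep(backups, now, recent=timedelta(days=2),
                    weekly=timedelta(weeks=4), monthly=timedelta(days=365)):
    """Return the backup timestamps a tiered retention rule would keep.

    Hypothetical rule (not Azure's algorithm):
    - keep every backup younger than `recent`,
    - keep the newest backup of each ISO week younger than `weekly`,
    - keep the newest backup of each calendar month younger than `monthly`.
    """
    keep = set()
    newest_per_week = {}
    newest_per_month = {}
    for ts in sorted(backups):  # ascending, so later entries overwrite earlier ones
        age = now - ts
        if age <= recent:
            keep.add(ts)
        if age <= weekly:
            newest_per_week[tuple(ts.isocalendar())[:2]] = ts  # key: (ISO year, ISO week)
        if age <= monthly:
            newest_per_month[(ts.year, ts.month)] = ts
    keep.update(newest_per_week.values())
    keep.update(newest_per_month.values())
    return keep
```

Feeding the function 60 days of four-hourly backup timestamps returns only a small subset: everything from the last two days and one representative per week and month. Everything else would be eligible for deletion under this assumed rule.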

Protecting and Securing VM Backups in Azure

Azure has various built-in features for preventing accidental or intentional deletion of backups. The most radical is the “immutable” option (Figure 2, 1). If switched on (and locked in), backups cannot be deleted before the retention period expires. This immutability feature is a lifesaver if ransomware attackers delete or encrypt critical data. A second feature, multi-user authorization (2), enables IT organizations to demand that a second person approves critical activities such as vault deletion operations or modifications of backup policies. It benefits organizations by preventing severe misconfigurations resulting in the loss or unavailability of current or future backups, whether by mistake or on purpose. To put it differently: Immutable backups help rebuild your data center after really severe incidents. Multi-user authorization helps prevent such a mess from happening in the first place and ensures that your backups exist.

Finally, the soft delete setting enables rolling back deletion operations (Figure 2, 3 and 4). In the context of VM backups in RSVs, the feature is especially beneficial for restoring the status quo ante after smaller application management or engineering mistakes. If application managers notice something was deleted by mistake some time ago, they can easily restore it. This is helpful for operational mistakes but only of limited value for ransomware attacks. Engineers can circumvent the soft delete feature – and even Microsoft documents how to delete all data forever, even if soft delete is active. A closing remark for VM backups: These configuration options apply to VMs and their backups, even though the configuration takes place via Recovery Services vaults.

Azure Fileshare Backups

File shares are not a CISO’s darling, but the technology has existed for decades and will probably continue to live for some more years. File shares enable uncomplicated interactions between users themselves, applications themselves, and between users and applications. It might be an often-redundant technology, but the ability to use file shares is a must in any context of legacy applications.

Fileshare Backups in Azure

While (in the portal) the relevant configuration options for Azure VM backups are on the Recovery Services vault, the situation is different for file shares. Many backup-related configurations take place on the Azure Storage Accounts, which contain the file shares. And a warning for those who understand backing up Azure Blobs, which are also stored in Azure Storage Accounts: backups for blobs (see “Blob and PostgreSQL Backups in Azure”) differ from the ones for file shares.

Azure supports ad-hoc backups (just click “add snapshot” in the portal’s snapshots view) and periodic backups (Figure 3). Configuration options for periodic backups are the frequency (every four hours or less often) and how long backups are kept. Azure allows configuring retention periods of several years.

Protecting and Securing Fileshare Backups in Azure

Protecting file share backups is delicate because it is not just about protecting the backups. It is also about protecting the Storage Accounts. They contain the actual backup data. If deleted, all associated backups are gone. Thus, cloud security architects should be aware of the various configuration options for Storage Accounts and Recovery Services vaults to prevent the deletion or discontinuation of critical backups (Figure 4).

Figure 4: Protecting Azure File Share Backups

On the top are configuration options on the Storage Account level and for the Recovery Services vault. Most important is the immutability feature. When active and locked in, it guarantees that nobody – and really nobody – can delete these backups. It is a brand-new feature in public preview. Second, there is an option to forbid deleting the Storage Account utilizing a delete-lock. It makes deleting Storage Accounts hard to impossible. The different purposes are crucial: immutability is about having a backup when someone tries or succeeds in deleting all or critical data. The lock helps more to prevent the Storage Account deletion, which would bring down applications (even if the data can be restored with an immutable backup). Thus, its purpose is to improve reliability and application uptime.

On the Azure File Share level, Azure provides the soft-delete feature (Figure 5). It has some peculiarities, especially since Azure supports soft-delete for SMB file shares and not for NFS ones. The SMB variant is particularly strong in the Windows, or rather Microsoft, world. The scope for soft-delete is not a single file but complete file shares. If switched on, engineers can restore a file share after its deletion. However, it does not bring back individual files if they are deleted. For this, the backup functionality with periodic backups or ad-hoc snapshots has to be used. So, to conclude: Azure provides various backup-related features, but understanding them in detail is critical to prevent issues if organizations really need them in critical situations.

Figure 5: Azure Storage Account Configurations for File Shares

Jump Hosts, Admin Workstations, and the Azure Bastion Service

The term bastion host is reminiscent of medieval fortifications and has been everyday IT slang since long before clouds became relevant. A bastion host is a server with high exposure to external attacks and, thus, specifically secured and protected. The term covers servers that, due to their functionality, face a high risk of external attacks. DNS servers are a typical example.

The IT community also uses the term bastion hosts for jump hosts. The latter have a very focused purpose. Jump servers provide access to VMs (or good-old-on-prem servers) in a secure and otherwise inaccessible environment, e.g., from employee laptops or the internet.

Jump hosts are critical in the public cloud. In the cloud, admins accessing a VM always come from cloud-external networks – but backend VMs or middleware servers should not be accessible from the internet. Jump hosts are the solution. Admins connect from their laptops to the jump host; only from there can they reach backend VMs, e.g., via SSH or RDP.

Still, jump hosts are not helpful for all admin tasks in the cloud. They are necessary for IaaS workloads, i.e., when admins connect to VMs on the operating system level. That is not necessary for PaaS services such as Cosmos DB, for which customers do not have OS-level access.

Understanding Jump Hosts

Jump hosts promise the impossible: secure the admin access to VMs in the public cloud, which, due to the nature of the cloud, always comes from the internet or a less secure network. They achieve their goal by combining three measures:

Reduce the attack surface of the host. Jump hosts have only one purpose – they are a host to “jump” to the really interesting ones. Thus, many components running on a server can be removed during the hardening, e.g., many unneeded drivers. Fewer components mean fewer components that might have vulnerabilities that attackers can exploit. Furthermore, servers support many protocols that a jump host does not need. Restricting protocols and ports is another measure to reduce the attack surface. A jump host has to access the backend VMs with RDP for Windows and SSH for Linux. All other ports should be blocked. Finally, IT organizations can ensure that admins install urgent patches immediately and first on the bastion hosts before patching any other system.
Restrict the reachability or connectivity of the jump hosts. If the admin laptops are all in a workplace zone, only ingress traffic from such zones should be allowed for the bastion hosts, thereby locking out any attacks originating, e.g., from servers in high-risk countries.
Define a point of control. Security operations centers have to prioritize the events they can investigate. When the operating system signals a “low” risk event, the SOC can investigate it for bastion hosts. However, they might not have enough security analysts to look for such events on ordinary VMs. Thus, bastion hosts imply a network topology with clear entry points, which eases prioritizing and monitoring security events.
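
The first two measures boil down to a simple allow-list: traffic must come from a known zone and target a known port. The following Python sketch illustrates that check with the standard library only. The address range and the function name are hypothetical, and a real deployment would express the rule in firewall or network security group configuration rather than in application code.

```python
import ipaddress

# Assumed values for illustration: one workplace zone and the two admin protocols.
WORKPLACE_ZONES = [ipaddress.ip_network("10.20.0.0/16")]
ALLOWED_PORTS = {22, 3389}  # SSH and RDP


def ingress_allowed(source_ip: str, dest_port: int) -> bool:
    """Allow a connection only if it comes from a workplace zone on an admin port."""
    addr = ipaddress.ip_address(source_ip)
    in_zone = any(addr in net for net in WORKPLACE_ZONES)
    return in_zone and dest_port in ALLOWED_PORTS
```

Everything not explicitly allowed is denied, which mirrors the “block all other ports” approach described above.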

In the end, this unique combination of measures and configurations makes the difference between an ordinary VM and a hardened jump host (hopefully) withstanding the most sophisticated attacks.

Beyond Networking: Admin Workstations and IAM in the Context of Jump Hosts

Jump hosts themselves must be set up adequately, but they are also part of an ecosystem. The integration into the enterprise identity and access management is especially essential. If done poorly, this integration becomes a potential breaking point that attackers could exploit. The leaver process for removing system access from employees who leave the organization – voluntarily or involuntarily – must work for jump hosts as well, and multi-factor authentication for any login to the jump host is essential.

Admin workstations or privileged users’ workstations are an additional concept that improves security with workplace-related measures. It contrasts with today’s work-from-anywhere zeitgeist and has the potential to annoy admins. IT organizations might want to restrict from where admins can log on to jump hosts. An implementation option could be to restrict the countries from which remote workers have access, or even to allow just a few IP addresses of dedicated workstations in selected office buildings. It all depends on the organization’s VMs, their criticality, and the company’s risk appetite.

The Azure Bastion Service

Setting up a bastion host does not require Harry Potter-like magic. Many IT organizations have built and managed them for many, many years. Nevertheless, Azure offers jump hosts as a managed (PaaS) service. Its name: Azure Bastion. Its configuration options are limited, as Figure 1 shows. The tier level (A) is relevant if scalability is essential (i.e., many admins work in parallel), whereas other options relate to authentication implementation variants (B).

Figure 1: Setting up an Azure Bastion Host

Azure Bastion’s pricing is pure horror compared with a single VM’s costs. Such a comparison misses, however, two aspects. First, setting up and maintaining your own bastion host requires time – and engineering hours are not free. Second, applications and VMs can share a single Azure Bastion service instance by peering the bastion’s VNet and the application VNets. Of course, these bastion hosts become even more critical elements in an application landscape because they become a single door to many VMs.

To conclude: jump hosts are an essential pattern for secure access to backend VMs in the cloud. You can configure your own or use a service like Azure Bastion; what matters is a comprehensive and consistent implementation.

Blob and PostgreSQL Backups in Azure: Features & Security

Backup features are core services all public clouds provide. Azure’s characteristic is a confusing backup service portfolio. Two backup technologies coexist: the Azure Backup vault and the Azure Recovery Services vault. Neither is better – the data source you want to back up defines which vault type to use – and there are even services that are backed up in different ways. As a rule of thumb, the Recovery Services vault is the solution for typical IaaS workloads such as Windows and Linux VMs, file shares, and specific databases running on VMs. The Azure Backup vault is strong with Azure PaaS services. Azure starts to hide the complexity by introducing the Azure Backup Center (Figure 1). It is a unified user interface for handling backups in the Azure portal. Still, when architecting the backup solution, understanding the differences is crucial. Even for one vault type, nuances depend on the data.

Figure 1: Vault Types in the Azure Backup Center (Recovery Services vault – 1 and Backup vault – 2)

A typical enterprise workload in the Azure cloud might comprise …

VMs and their disks
Object Storage, i.e., Azure Blobs in Azure Storage Accounts
File Storage in Azure Storage Accounts
DBaaS (Cosmos DB, Azure SQL Managed Instance, Azure SQL)

Then, a Recovery Services vault would take the VMs, including customer-installed and managed databases running on VMs, plus file shares stored in Azure’s Storage Accounts. Blobs and DBaaS data, however, go into a Backup vault, though things turn out to be different than expected if you take a closer look, as we do in the following paragraphs. We discuss how to configure ad-hoc and periodic backups for blobs and DBaaS. And the interesting aspect is: In the most recent forms, the Backup vaults are not or only slightly involved.

Operational Backups for Azure Blobs

The quickest way to back up a single Blob in Azure is to go to the Azure portal, switch to the blob, and select “make a snapshot” – and you are done (Figure 2). Super convenient and super quick, just not doable if you have hundreds of storage accounts and want to be sure that you get a backup every hour.

Figure 2: Snapshot backups for single Blobs

For backup automation, Azure’s most advanced solution for blobs in Azure Storage Accounts is the Operational Backup for Azure Blobs feature. Azure Storage Accounts can have multiple (block blob) containers (Storage Accounts can also store and manage other data: file shares, queues, or tables). The actual blobs reside inside these containers. Figure 3 illustrates these three main blob management layers and how they contribute to securing blob backups.

Figure 3: Azure backup features for Blobs, including protection options

When creating a storage account in the Azure portal, engineers make several backup-related choices, as the options in Figure 4 illustrate:

Point-in-time recovery (PITR): The ability to roll back to any time within the retention period. The maximum value is 360 days (Figure 4, A).
Soft-delete for blobs and containers (B and C). If these features are active, Azure keeps a copy of blobs and containers after their deletion. Engineers could restore them if the deletion was a mistake (or a malicious activity). Soft-delete for blobs allows restoring a single object, but soft-delete on the container level is necessary when someone deletes a container with the included blobs.
Versioning for blobs (D): In contrast to files, object storage does not modify a blob. It replaces a blob with a newer version. Versioning means keeping (some) older versions. Azure can restore an older blob version, e.g., if modifications were incorrect or if ransomware attackers replace objects with encrypted versions.
The version change feed (E) is Azure’s feature to ensure non-repudiation of blob changes by logging any change.
The immutability feature, if switched on, makes it impossible to delete a backup for a defined period. No administrator, hacker, or Microsoft employee can delete backups protected with this feature.

Figure 4: Data Protection options for Storage Accounts
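
To illustrate how versioning and soft-delete complement each other, here is a toy in-memory model in Python. It does not use or mimic the Azure Storage API; the class and method names are invented for this sketch. It shows the two recovery paths: making an older version current again after an unwanted overwrite, and undoing a deletion.

```python
class VersionedStore:
    """Toy model of blob versioning plus soft delete (not the Azure API)."""

    def __init__(self):
        self._versions = {}    # blob name -> list of contents, oldest first
        self._deleted = set()  # names that are currently soft-deleted

    def put(self, name, data):
        # Object storage replaces a blob; with versioning, old content is kept.
        self._versions.setdefault(name, []).append(data)
        self._deleted.discard(name)

    def get(self, name):
        if name in self._deleted or name not in self._versions:
            raise KeyError(name)
        return self._versions[name][-1]

    def delete(self, name):
        if name not in self._versions:
            raise KeyError(name)
        self._deleted.add(name)  # soft delete: the content stays recoverable

    def undelete(self, name):
        self._deleted.discard(name)

    def restore_version(self, name, index):
        """Make an older version current by writing it as the newest version."""
        self.put(name, self._versions[name][index])
```

In this model, a ransomware-style overwrite simply adds a new version, so the last good version remains available through restore_version. The model deliberately ignores retention periods, which in a real service limit how long old versions and deleted objects are kept.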

After creating the Storage Account, engineers can lock in the immutability setting and switch on the already mentioned “operational backup” feature. It comes with many features mentioned in the last paragraph: PITR, versioning, and soft-delete. Its biggest benefit is the integration with the general Backup Center management GUIs (Figure 5).

Figure 5: Activating Operational Backups for Azure Blobs

Protecting the Backups

Network-level shielding and access control, disaster readiness, backup loss prevention – these three topics drive the protection of your backups. Network shielding reflects the protection of the backups and the backup-related services on the network level. The actual backups are never directly accessible for Azure customers and users, only via the portal. Shielding the service for a concrete Azure tenant is possible with private endpoints. They are the method of choice in the Azure (PaaS) world, complemented by role-based access control based on identities and roles managed in a central solution, e.g., the Azure Active Directory. Privileged Identity Management (PIM) is Azure’s feature for even more security for privileged roles and users, valuable not only for classic admins but also for engineers who configure backup settings.

The second big topic, disaster readiness, relates to preparations for surviving large-scale incidents. A backup does not help if located in the same burned-down data center as the VMs. Thus, geo- (or zone-) redundant backups are valuable. Finally, backup loss prevention is about preventing backups from being deleted by successful attackers – or due to operational mistakes of employees. Two features drive this aspect:

Resource Guard to enforce a four-eyes principle before critical backup-related operations
Immutability, i.e., the absolute impossibility that anyone can delete a backup.

Finally, if you encrypt the backup with a customer-managed key, restoring the backup is only possible if the specific key exists – but encryption and key management are big topics. One article is not enough to cover all the relevant aspects.

Figure 6: Storage Account Settings: PITR, Soft Delete, Versioning

Backups for Azure Database for PostgreSQL

PostgreSQL is one of the database-as-a-service offerings from Microsoft. And when looking just at the various PostgreSQL variants, the full complexity of backups in the cloud using Azure-native tools becomes visible. The PostgreSQL variants are single server, flexible server, and Azure Arc. The latter addresses mixed on-prem and cloud workloads, so I focus on the first two. A first glance at the backup features tells you that backup vaults are where the backups are stored – and that immutable backup is an available feature (at least in the preview). But this holds only for the PostgreSQL single-server variant. The flexible server variant is the one you probably choose for newer software architectures. What Azure-native backups offer here differs, as I lay out in the following paragraphs.

Initially, a short remark on how PostgreSQL works internally for storing and managing the data: An Azure PostgreSQL flexible server instance is what one calls a database server in the on-prem world. It is an environment that can host many databases with the actual data. So, we have a three-layered model: flexible server, database, data. Regarding backups, there are many similarities with blob backups but some distinct differences. Most importantly, point-in-time recovery is available with a maximum retention period of 35 days. Azure allows configuring these backups as geo-redundant, in which context Microsoft mentions a recovery-point objective of around 1 hour, though this is not a guaranteed service level agreement. However, there is no built-in solution for long-term backups. Customers have to implement their own backup solution with the help of the PostgreSQL command pg_dump. They must implement processes performing daily exports and cleaning up older backup files.

Figure 7: Configuration options for built-in automatic backups for Azure Databases
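
A customer-built long-term backup based on pg_dump could look roughly like the following Python sketch. It is only an outline under stated assumptions: the file naming scheme, the function names, and the 365-day default are invented for illustration, and credentials are expected to come from the usual PostgreSQL mechanisms such as the PGPASSWORD environment variable or a .pgpass file. The sketch builds a pg_dump call for a custom-format archive and removes exports older than a configurable number of days.

```python
import subprocess
from datetime import datetime, timedelta
from pathlib import Path


def dump_command(host, user, database, target):
    """Build a pg_dump call that writes a custom-format archive to `target`."""
    return ["pg_dump", "--host", host, "--username", user,
            "--format", "custom", "--file", str(target), database]


def expired_dumps(paths, now, keep_days):
    """Return the dump files whose date stamp (name_YYYYMMDD.dump) is too old."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for path in sorted(paths):
        stamp = path.stem.rsplit("_", 1)[-1]  # "appdb_20240131" -> "20240131"
        try:
            taken = datetime.strptime(stamp, "%Y%m%d")
        except ValueError:
            continue  # ignore files that do not follow the naming scheme
        if taken < cutoff:
            expired.append(path)
    return expired


def daily_export(host, user, database, folder, keep_days=365, now=None):
    """Run one export and clean up older files; meant to be scheduled daily."""
    now = now or datetime.now()
    target = Path(folder) / f"{database}_{now:%Y%m%d}.dump"
    subprocess.run(dump_command(host, user, database, target), check=True)
    for old in expired_dumps(Path(folder).glob("*.dump"), now, keep_days):
        old.unlink()
```

A scheduler would call daily_export once per day. Writing the files to immutable blob storage, as discussed later in the text, would require an additional upload step that this sketch leaves out.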

Securing PostgreSQL Flexible Server Backups

The security features for PostgreSQL flexible server backups are much more limited than for blobs, as Figure 8 illustrates. Identity and access management restricts access for users, at least if attackers cannot hack a company’s Active Directory. The continuous backup for point-in-time recovery provides the option to roll back the database server for a defined number of days. The only built-in security feature, however, is a delete lock preventing the deletion of the flexible server. The deletion of the server automatically deletes all backups. Users with a specific role can add and remove such a lock on the server. If an internal user with such a role (or an attacker who has taken over such a user’s account) deletes the lock, they can delete the complete server with all its backups afterward. There is no built-in immutable backup option, though writing backup routines to dump exports to immutable (blob) storage is an option. Otherwise, IT departments really have to trust their Active Directory security measures and their admin and backup employees if they rely on such backups.

Figure 8: Azure backup features for PostgreSQL Flexible Server backups, including protection options

So, what can cloud security architects conclude from Azure’s cloud-native backup features for blobs and PostgreSQL flexible servers? First, many advanced cloud-native backup features help companies to improve their backup implementation. Second, the availability of features depends on the exact PaaS service and service variant. In other words: it looks chaotic and random which backup features an Azure (database or storage) service has. Thus, the third and most important learning is: ensuring a defined backup service level for a complete application landscape in Azure is a big challenge for architects. It requires governance over which cloud services applications can use, and clear, enforceable guidelines on performing and securing these backups.

Backups for Selected Database-as-a-Service Services in Azure and AWS

Database-as-a-Service (DBaaS) is one step for IT departments to focus on software development, configuring and integrating third-party solutions, and maintaining and optimizing application operations rather than spending time with infrastructure and middleware topics. But even DBaaS does not make backups obsolete. Certainly, engineers should not delete production databases, but what if they do it by mistake? Also, DBaaS does not come with a 100% availability guarantee, even though you can minimize risks with redundancy – but not every database might be worth such ongoing costs. So, what are the options?

In the following, we look at backup features for selected DBaaS offerings:

Azure Cosmos DB
Azure SQL Server
Amazon DynamoDB

DBaaS implies that a web service provides database functionality via its API. Engineers and architects can only design backup strategies based on this API. So, what are the concrete features and options for the mentioned services?

Backups for Azure Cosmos DB

The backup options for most Cosmos DB variants – for NoSQL, MongoDB, Apache Cassandra, Table, and Apache Gremlin, excluding PostgreSQL – are the same. For them, the first configuration decision is between continuous and periodic backup. A periodic backup means that at regular points in time, Azure performs a backup (Figure 1, A). Customers can set the periodicity to any value between 1 and 24 hours (Figure 1, B). The retention period can be up to 30 days (Figure 1, C). Azure keeps backups for that long; then Azure deletes them automatically. Alternatively, engineers can configure continuous backups (Figure 1, D). Then, they can roll back to any point in time within the last 7 or 30 days.

Figure 1: Backup Configurations for Azure Cosmos DB

To prepare for more extensive outages, natural disasters, larger electricity blackouts, etc., Azure copies every backup by default to a paired region, such as from Switzerland (North) to Switzerland (West) or from France (Central) to France (South). If configured via CLI, options are geo-, zone-, and locally redundant storage. Restores from the second region are not self-service but require a service ticket and Microsoft support. Thus, larger events might result in a high volume of service requests and heavy delays for the restoration. Geo-redundancy for the service itself (not just for backups) can be necessary for business-critical applications.

Azure Cosmos DB provides a sophisticated, easy-to-use solution for device failures (even the redundancy alone covers most cases) and operational mistakes, i.e., somebody realizes she deleted some data yesterday that she did not intend to delete. There are shortcomings in other use cases:

No support for ad-hoc backups. They are helpful, e.g., before deploying complicated patches that change the data model and data. Continuous backups cover them, but there is no option with periodic backups.
The absolute limit of 30 days retention time can be too short, depending on the risk appetite and the setup. Cyberattackers might be in the system for a while unnoticed. Data manipulations not causing immediate issues might not be covered. Also, operational mistakes go unnoticed if data is only used monthly, quarterly, or yearly for, e.g., dedicated end-of-month, end-of-quarter, or end-of-year processing.

This does not mean that there are no solutions. Exporting data by script or restoring them into a different database instance might work. However, it is not an out-of-the-box feature. It requires customer engineering – or even a third-party backup solution.

Backups for Azure SQL Server

Azure SQL Server is a managed service – or a database-as-a-service – offered by Microsoft. While customers must make storage and server choices (unlike Cosmos DB, which scales up and down as needed), Microsoft performs (and automates) the database administration, including patching.

Azure SQL Server has point-in-time recovery, i.e., continuous backups enabling the restoration of the data of a specific point in time within the last 1 to 35 days. Internally, Azure performs full backups weekly and differential backups, depending on the configuration, every 12 or 24 hours (Figure 2, A). In addition, Azure backs up transaction logs every 10 minutes. If the database crashes, a maximum of 10 minutes of processed transactions is lost. In other words, the Recovery Point Objective (RPO) is 10 minutes.

Figure 2: Backup Configurations for Azure SQL
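
How full, differential, and log backups combine into a point-in-time restore can be sketched in a few lines of Python. This illustrates the general technique, not Azure's internal restore logic; the function name and the data layout are assumptions. For a target time, the sketch picks the latest full backup, the latest differential backup after it, and all log backups after that differential.

```python
from datetime import datetime, timedelta


def restore_chain(backups, target):
    """Select the backups to apply, in order, to restore close to `target`.

    `backups` is a list of (kind, timestamp) pairs with kind in
    {"full", "diff", "log"} (an assumed layout for this sketch).
    """
    usable = sorted((ts, kind) for kind, ts in backups if ts <= target)
    fulls = [ts for ts, kind in usable if kind == "full"]
    if not fulls:
        raise ValueError("no full backup before the target time")
    base = fulls[-1]
    diffs = [ts for ts, kind in usable if kind == "diff" and ts > base]
    start = diffs[-1] if diffs else base
    logs = [ts for ts, kind in usable if kind == "log" and ts > start]

    chain = [("full", base)]
    if diffs:
        chain.append(("diff", diffs[-1]))  # only the newest differential is needed
    chain.extend(("log", ts) for ts in logs)
    return chain
```

The gap between the target time and the last entry of the chain is the data that cannot be recovered from backups alone. With log backups every 10 minutes, that gap stays below 10 minutes, which matches the RPO stated above.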

If engineers do not change the backup policy, backups are geo-redundant (Figure 3), i.e., copied to a paired region, as discussed for Cosmos DB. If customers have to rely on the geo-redundant copy, the RPO is 1 hour instead of 10 minutes.

Figure 3: Backup Redundancy Options for Azure SQL

A significant difference to Cosmos DB is the option to keep backups for much longer. A dedicated configuration option exists for how long to keep weekly, monthly, and yearly backups – and the setting allows one to configure multiple years, not just some weeks.

The Azure documentation states that the restore time is usually “less than 12 hours”, though it might take considerably longer. Business-critical applications cannot rely on having a backup on the other side of the globe. They might have to configure the service itself as geo-redundant.

Backups for AWS DynamoDB

DynamoDB is a NoSQL database-as-a-service offering from AWS. Amazon fully integrated DynamoDB with AWS Backup, AWS’s service for managing backups on an enterprise level. While there might also be legacy options, I focus on this backup functionality. After discussing two Azure services, the idea is to see how AWS approaches backups. And one difference was already mentioned: the full integration into the standard AWS backup solution.

For AWS DynamoDB backups, the following paragraphs cover three aspects:

Continuous backups for Point-in-Time Recovery (PITR)
Ad-hoc Backups
Periodic Backups

If explicitly enabled, point-in-time recovery based on continuous backups (Figure 4, A) is available in case of an issue. The retention time is 35 days, i.e., a restore to any given point within the last 35 days is possible.

Figure 4: Setting for Point-in-Time Recovery (PITR)

For ad-hoc backups, the first statement is: AWS provides the feature for DynamoDB (Figure 5); there is no need for a workaround like in Azure. Engineers can configure some details – when it should start, how long it can take, in which vault to store it, and when to move it to cold storage – the most important being the retention time. When should AWS delete the backup? Here, the service allows choosing several years.

Figure 5: Configuring On-Demand Backups for DynamoDB in AWS

Finally, there is the option to schedule periodic backups. The configuration options are nearly the same as in Figure 5, though engineers must define the periodicity and the backup window for periodic backups. But why periodic backups when there is point-in-time recovery?

The main reason is that the retention time for point-in-time recovery is 35 days, whereas periodic backups can be kept for years. Other, more exotic reasons are possible cost savings or a legacy architectural design from before point-in-time recovery was available.

To conclude: The discussed database-as-a-service offerings from Azure and AWS all come with backup functionality, but the functionality differs for each of them. Cloud and security architects do well to check the company-internal backup requirements and policies and to validate whether and how they can implement them. This validation is necessary per cloud and service.

Application-Consistent Backups for VM Workloads in the Cloud

Application backups are not as simple as in the world of database lectures. Have you ever heard about the ACID properties with “A” representing “atomicity” and “D” durability? After a database commit, everything is on disk. Nothing can get lost. Plus, all commands within the transaction are on disk – or everything is undone. When a database crashes and restarts, its data reflects precisely the effects of all committed transactions. It works as designed if an application relies on exactly one database. Sadly, applications are more complex, as Figure 1 illustrates. They tend to access not only one database but several in parallel, plus write data to files and file shares – and disks attached to VMs.

Figure 1: Real-life Backup Scenarios for Applications and their VMs

Understanding Application-Consistent Backups and their Benefits

While private users can save their data by manually copying all their files to a different hard disk or the cloud, this approach is too simplistic for larger applications, even if we focus only on VM disks. If it is a large disk, files with names starting with “A” might get copied at 5:00 am and those starting with “Z” at 5:05 am. Thus, the “Z-files” could have been changed between 5:00 am and 5:05 am, making the A-files and the Z-files inconsistent. Furthermore, copying open files causes issues, changes might be only in the memory and not written to disk, or there could be pending I/O transactions. Thus, a clean-up of files and their data might be necessary when restarting the application using such file copies. The clean-up can be automated or a manual task for engineers. In any case, it prolongs application outages.

Application consistency overcomes this challenge. The idea is to perform backups such that applications run after restarting a VM without any clean-up actions. Thus, business-critical applications benefit from shorter downtimes, i.e., they can meet tighter Recovery Time Objectives. Plus, the organization benefits in crisis events, aka business continuity management situations. When a company has to evacuate all workloads to a different cloud data center, the engineers can rely on the VMs to restart and applications to come up without manual intervention. The engineers can focus on fixing more complex issues, e.g., related to integrations with other components, rather than the complete IT department being stuck with cleaning up file systems.

The most prominent solution for application-consistent backups is Microsoft’s Windows Volume Shadow Copy Service (VSS). Microsoft products come with it, and your organization’s applications (and those of your third-party software providers) can also implement it for Windows workloads. The exact details of VSS are, however, not so relevant from a cloud security architecture perspective. What matters are the features available for application-consistent backups in AWS, GCP, and Azure.

Application-Consistent Backups in the Cloud

Azure comes with a solution for application-consistent backups for Linux VMs, at least for those deployed with the Azure Resource Manager and not the Service Manager. It is a framework enabling application developers or operations specialists to integrate pre- and post-scripts into Azure’s backup process. Pre-scripts can invoke, for example, APIs of the application to tell the application to finish off all “write” activities. Then, Azure performs the backup copy. Afterward, Azure invokes the post-script, and normal operations continue. For this purpose, a configuration file (Figure 2) must be on all relevant VMs.

Figure 2: Configuration File for Application-Consistent Backups of Linux VMs in Azure. Highlighted is the configuration of the pre- and post-backup scripts. Other settings define parameters and the handling of exceptions and failures.

The pre- and post-scripts and this configuration file are critical from a security perspective. They run with root privileges. Thus, they must be secured to prevent attackers who have gained access to the VM from changing these settings and executing malicious code as root.
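One way to act on this requirement is a periodic check of file ownership and permissions. The following Python sketch is only an illustration and not part of Azure's tooling: the function name `is_locked_down` and the policy it enforces (owned by root, no access bits for group or others) are assumptions, and the paths of the scripts and the configuration file have to be supplied by the reader.

```python
import os
import stat


def is_locked_down(path: str, expected_uid: int = 0) -> bool:
    """Return True if `path` is owned by `expected_uid` (root by default)
    and grants no permissions at all to group or others."""
    info = os.stat(path)
    owned_by_expected_user = info.st_uid == expected_uid
    # S_IRWXG / S_IRWXO cover read, write, and execute for group / others.
    no_group_or_other_access = (info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return owned_by_expected_user and no_group_or_other_access
```

A monitoring job could call this function for each script and for the configuration file and raise an alert whenever it returns `False`.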

The situation for Windows VMs on Azure is much easier compared to the Linux world. By default, all VM backups use the Microsoft VSS service. So, if (and only if!) the applications on the VM implement VSS, all backups are application-consistent without the need for extra VM configurations. If not, the disk backup is not application- but only file-consistent.

Finally, a quick remark on the AWS and the Google Cloud Platform (GCP) features. Both follow the same approach as Azure: pre- and post-scripts for Linux VMs, and Microsoft’s VSS for Windows VMs.

Back to the Big Picture

To conclude: Application-consistent backups reduce downtimes of applications by reducing the work for engineers after crashes or VM evacuations. However, the term application consistency can be misleading. When looking again at Figure 1, it is clear that the consistency between the VM disks and the database backups is not guaranteed. Applications have to cover the challenge that the VM disk backup is from 4:07 am, one database backup is from 4:05, and the second from 4:17. So, even with application-consistent backups, there are still exciting tasks and challenges for engineers in the area of backup and recovery!

Backup Basics: On Backup Types, RTO & RPO

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are essential concepts for business-critical applications. The RTO defines the maximum time an application might be unavailable after a crash – or after a data center burns down (Figure 1). That is the time engineers (or automatic processes) have to start the application on other servers – potentially at a different location – and restore all necessary data from backups.

Figure 1: Understanding RPO, RTO, and Backup Frequencies

The RPO, the Recovery Point Objective, defines the maximum amount of data that might get lost in such an event. It depends on the backup frequency. If a company makes a backup each night at 4 am, the RPO is 24 hours. Then, if the data center burns down at 4:15 am, only the data from the last 15 minutes is lost. If it burns down at 3 am, the data from the last 23 hours is lost.
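The arithmetic behind this example can be sketched in a few lines of Python. The function name `data_loss_window` and the calendar dates are assumptions chosen for illustration; only the times of day come from the text.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Data written after the last backup and before the incident is lost."""
    if incident < last_backup:
        raise ValueError("incident must not precede the last backup")
    return incident - last_backup


backup_interval = timedelta(hours=24)      # one backup per night: RPO of 24 hours
last_backup = datetime(2024, 1, 2, 4, 0)   # backup at 4:00 am (date is arbitrary)

loss_at_0415 = data_loss_window(last_backup, datetime(2024, 1, 2, 4, 15))
loss_at_0300 = data_loss_window(last_backup, datetime(2024, 1, 3, 3, 0))

print(loss_at_0415)  # 0:15:00
print(loss_at_0300)  # 23:00:00
```

In both cases, the actual loss stays below the backup interval, which is why the interval is the worst case and thus the RPO.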

One factor influencing the RTO is the time needed to restore the data from a backup. Network bandwidth and storage are significant factors, but so is the backup type. Full, incremental, and differential backups take different amounts of time to restore.

The starting point is always a full backup. The system writes all data to a secure backup storage (Figure 2). So, why would an architect not always go for a full backup? It is simple: Suppose an application is highly critical. Thus, the RPO is 5 min. Let us further assume that it is a massive database, but not much data changes within five minutes. Then, a full backup every five minutes seems too much. It might even be technically impossible, depending on network bandwidth and the amount of data. Incremental backups can be a better solution for such a scenario.

Figure 2: Understanding Full, Incremental, and Differential Backups

An incremental backup copies only the most recent changes. If the last full backup was at 4 am, an incremental backup at 4:05 am copies the changes from the last five minutes. The next backup at 4:10 am also copies only the changes of the last five minutes, i.e., the changes between 4:05 am and 4:10 am. Incremental backups have one disadvantage compared to full backups: the restoration time might be longer. To restore the situation of 4:10 am, one has to fully restore the data from 4 am, then apply what changed until 4:05 am, and then the incremental changes until 4:10 am.

A middle way between incremental and full backups is the differential backup. A differential backup copies all changes since the last full backup. At 4:05 am, a differential backup writes all changes of the last five minutes; at 4:10 am, all changes between 4:00 and 4:10; and at 4:15 am, all changes between 4:00 and 4:15. Thus, a restore of a differential backup means: restore the full backup and (only) apply the changes of the last differential backup. However, these concepts become transparent to customers in the cloud: the cloud providers implement standard configurations.
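The difference in restore effort can be made concrete with a small Python sketch. The function `restore_chain` and its data layout are assumptions for illustration; the sketch further assumes that a schedule uses either incremental or differential backups between two full backups, not a mix of both.

```python
def restore_chain(backups, target):
    """Return the labels of the backups that must be applied, in order,
    to restore the state at `target`.

    `backups` is a list of (label, kind) tuples in chronological order,
    where kind is "full", "incremental", or "differential".
    """
    labels = [label for label, _ in backups]
    upto = backups[: labels.index(target) + 1]
    # Start from the most recent full backup at or before the target.
    last_full = max(i for i, (_, kind) in enumerate(upto) if kind == "full")
    chain = [upto[last_full][0]]
    tail = upto[last_full + 1:]
    if tail and tail[-1][1] == "differential":
        # A differential backup contains all changes since the full backup.
        chain.append(tail[-1][0])
    else:
        # Incremental backups must all be replayed, one after the other.
        chain.extend(label for label, _ in tail)
    return chain


incremental = [("4:00", "full"), ("4:05", "incremental"), ("4:10", "incremental")]
differential = [("4:00", "full"), ("4:05", "differential"), ("4:10", "differential")]

print(restore_chain(incremental, "4:10"))   # ['4:00', '4:05', '4:10']
print(restore_chain(differential, "4:10"))  # ['4:00', '4:10']
```

With incremental backups, the chain grows with every backup since the last full one, whereas with differential backups it never exceeds two entries.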

The Azure Backup Center, for example, enables customers to configure several backups per day for VMs based on Recovery Services vaults, e.g., every 4 hours. The first one is a full backup; the subsequent backups are incremental (Figure 3). Microsoft has decided. However, these concepts are still crucial for architects and engineers if they implement their own solutions or use third-party software with sophisticated configuration features. Then, they have to optimize and experiment themselves to identify an appropriate balance between …

backup frequency and full versus incremental backups, and
the time needed for restoring a backup, which impacts the RTO.

Figure 3: Predefined options for backup policies regarding incremental and full backup periodicity in Azure

The Hub and Spoke Network Pattern

Ever wondered how large corporations design their networks in the cloud? The hub-and-spoke pattern is probably the most important to understand their on-prem and cloud network designs, no matter whether an IT department runs on GCP, AWS, Azure, or any other cloud provider.

Mesh is the “free love” vision transferred to enterprise network designs, whereas the hub-and-spoke pattern implements more of a harem concept. But to start with the big picture: The focus is not on individual applications. Network design looks at how to organize connectivity and isolation for networks with hundreds or even thousands of applications. These applications serve business needs, such as SAP or self-developed insurance solutions. The design must also consider technical or security-related applications such as Web Application Firewalls, IAM services, messaging middleware, or Data Loss Prevention solutions.

Hub-and-Spoke vs Mesh Network

The first aspect is zoning. IT departments group their resources. VMs belonging to an HR application might be separated from air traffic control systems. They are in two separate zones. Network design patterns define how such zones interact, i.e., which zones have connectivity and can interact directly with which others. When looking at concepts, two patterns are particularly important: the mesh network pattern and the hub-and-spoke pattern (Figure 1).

A hub-and-spoke network is hierarchical. One zone is the hub. It connects with every other zone – the spokes. It is a classic 1-to-n relationship known from harems. Resources in two separate spoke zones always interact via the hub. A mesh network builds on the free-love idea: every zone can interact with every other zone (and zones might forward traffic to other zones in the absence of direct connectivity).
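A short Python sketch illustrates how differently the two patterns scale. It assumes a full mesh, i.e., a direct connection between every pair of zones; the function names and the zone names are hypothetical.

```python
def hub_and_spoke_links(zones: int) -> int:
    """One hub connected to every other zone: n - 1 links."""
    return max(zones - 1, 0)


def full_mesh_links(zones: int) -> int:
    """Every zone connected to every other zone: n * (n - 1) / 2 links."""
    return zones * (zones - 1) // 2


def hub_and_spoke_path(source: str, target: str, hub: str = "hub") -> list:
    """Zones a packet traverses; traffic between two spokes goes via the hub."""
    if source == target:
        return [source]
    if hub in (source, target):
        return [source, target]
    return [source, hub, target]


for n in (5, 20, 100):
    print(n, hub_and_spoke_links(n), full_mesh_links(n))
# 5 4 10
# 20 19 190
# 100 99 4950
print(hub_and_spoke_path("hr", "air-traffic-control"))
# ['hr', 'hub', 'air-traffic-control']
```

The number of connections to manage grows linearly with the number of zones in a hub-and-spoke design but quadratically in a full mesh.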

Why Hub-and-Spoke Patterns are Popular

Mesh networks (and free love) cause a mess if relationships and connectivity are not tightly managed. In contrast, the hierarchical model of hub-and-spoke networks allows for centralized governance and operations of components and eases controlling the network traffic flow.

Routing all traffic to and from the internet through one hub zone eases control and security. In this zone, DDoS protection, the (network-) data-loss-prevention solution, and other critical internet-connectivity-related applications must run. Then, perimeter security is in place for the complete data center.

One dedicated zone with internet connectivity also eases setting up web application firewalls (WAFs), which usually require integration with an IAM solution. Having both in dedicated central zones is much easier than having WAFs in ten zones and IAM solutions in seven.

Centralizing components in a hub is useful beyond internet connectivity: (Azure or on-prem) Active Directories, messaging middleware, application performance monitoring, and many more benefit from centralization. This insight does not mean building one monolithic hub with all centralized applications. Enterprise-scale networks can have several hubs, e.g., one serving as a management zone with monitoring and another for internet connectivity (Figure 2).

Figure 2: Overlapping Hub Networks Example with an Internet-facing External Zone and an Internal Management Zone, e.g., for Monitoring

Grouping Criteria for Zoning

Zones are groups of (cloud) resources, e.g., VMs, isolated from other resources and VMs in other zones. The grouping criteria differ between organizations, companies, and departments. Typical grouping criteria are (Figure 2):

Applications or groups of applications forming one solution
Stages such as Production, Preproduction, Integration, Test, and Development
Teams or departments managing the resources and potentially following different change management processes
Sensitivity of the data of applications

Cloud Features for Implementing Zones

Cloud providers do not offer a “zone” or a “hub-and-spoke network” feature. They provide sophisticated building blocks for structuring workloads and networks. Tenants, Virtual Private Clouds (VPCs), and Subnets are features for organizing networks and resources hierarchically. AWS, Azure, GCP, and all the others provide routing tables for – well – routing. Access Control Lists and Network Security Groups (the exact names differ between the clouds) enable engineers to implement firewalls for blocking or allowing specific network traffic. The clouds come with Internet Gateways, NATs, etc. Everything is ready for you to implement effective and secure network designs. However, whether you do that and how many free-love-like connections you allow is up to you and all other network architects of the world.

Understanding Redundancy for Azure Storage Accounts

PaaS services promise software engineers nothing less than effortless eternal happiness: a few clicks and you have a ready-to-use, secure, and infinitely scalable cloud service, be it an AI, a database, or a super-exotic service. But do cloud providers deliver as promised? Is, for example, redundancy child’s play? Redundancy is much trickier than backups. Backups are about having a copy of an application’s data for emergencies. In contrast, redundancy guarantees that the application runs 7×24. Even major disasters must not impact a solution’s stability.

In the following, we look closely at one crucial Azure service, Azure Storage Accounts. It is Microsoft’s web service for storing data such as objects (blobs) or files. We elaborate on which kinds of disasters or failures the web service copes with – and how transparent it is from an application or operations perspective. Azure provides different options for configuring the recovery behavior of Storage Accounts (Figure 1), which the following paragraphs explain in more detail. Companies need specific guidelines on when to use which one – and writing such a guideline is a crucial cloud (security) architecture task.

Figure 1: Redundancy and Backup Options for Azure Storage Accounts – actual replication mechanisms (1) and read-feature for regional failure (2)

Cloud architects have to make (up to) three redundancy-related configuration decisions:

Should the data be replicated three times within one data center (local redundancy), or should there be one copy in each of three separate data centers (zone redundancy)?
Should there be copies in a data center further away in a secondary region (geo-redundancy)? For regulatory and compliance reasons, geo-redundant copies are usually in the same jurisdiction as the original data. Azure stores, e.g., geo-redundant copies for data in Switzerland-North (greater Zurich area) in the Azure region Switzerland-West around Geneva.
Should applications be allowed to read the geo-redundant copies if the primary region fails (“read access”)?

Figure 2 and Figure 3 illustrate the configuration options for a better understanding. The architecture in Figure 2 visualizes local redundancy. An application reads and writes data stored in one data center (A). The different copies are synchronously replicated and always in sync. If – and only if – the engineers choose the geo-redundant add-on, Azure replicates the data asynchronously to one data center in a secondary region (B). In such a case, the Storage Account can be configured to allow applications to read from the storage account in the secondary region if the first region is unavailable (C).

Figure 2: Azure Storage Account with Local-Redundancy Option

Figure 3 looks very similar. It illustrates zone redundancy. In such a set-up, the three copies are in different data centers in the primary region (A). If geo-redundancy is active, all asynchronous copies in the secondary region are stored in the same single data center (B), as in the previous case in Figure 2. Applications can get read access in case of unavailability of the primary region – as with local redundancy (C).

Figure 3: Azure Storage Account with Zone-Redundancy Option

In the next step, we look closer at three failure scenarios: a failure of a single device, a data center, and an entire region. Which configuration option allows the application to continue to run? How transparent are such incidents for applications? What does the cloud operations team or the applications team have to do?

First, a failure of a single device in a data center has no impact on any application or operational staff. Azure always maintains three up-to-date copies of any data, be it in one or be it in three data centers. These copies are always in sync. Thus, if a device fails, another device in the same or another data center takes over and has precisely the same data. There is neither data loss nor interruption of application availability, nor any necessity for manual interventions.

A failure of a complete data center is transparent for zone-redundant storage accounts. Two other data centers have in-sync copies of the data if one data center burns down or is temporarily unavailable, e.g., due to power issues. Applications can continue to read or write to the Storage Account. No engineer has to intervene. In contrast, a data center failure can have severe consequences if a Storage Account is configured with local redundancy. The data becomes unavailable for read and write operations. If the data center does not come up again, all data is gone.

If an entire region fails due to a large-scale disaster, all Storage Accounts without a geo-redundancy option are temporarily or forever unavailable. Storage Accounts with geo-redundancy have a copy in a different region. In contrast to “normal” local or zone redundancy, geo-redundancy is not transparent. It might require adapting the application code. Plus, the cloud staff has to perform failover and cleanup tasks.

Failover to the secondary region is an explicit decision of the customer. Its cloud staff must initiate the failover. Then, the secondary site becomes the new primary one, though two aspects require attention. First, loss of data might occur. Microsoft has no formal SLA but mentions that the recovery point objective (RPO) is 15 minutes. In other words: data written in the last 15 minutes before the failure might be lost. The application (and the business) must be able to cope with that. Second, Storage Accounts store their data only in one data center after the failover (local redundancy). Engineers have to reconfigure the Storage Accounts to re-establish zone- or geo-redundancy. Eventually, they might want to move the data back to the original primary region.

When an entire region fails, this impacts applications even with the geo-redundancy option and even if the application code continues to run. Applications cannot read or write any data to or from their Storage Account in the primary region. Either the application code handles such circumstances (e.g., by showing users a “temporarily unavailable” page), or the application behaves unpredictably or crashes. Applications can reduce the impact on users with more sophisticated routines. The option to allow applications to read from secondary regions in case of a failure of the primary one enables applications to show users at least the currently available data. Writing changes is not possible – or only to another storage account not impacted; quite some work, but potentially worth the effort for highly business-critical applications.
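The failure scenarios discussed above can be condensed into a small decision function. The following Python sketch is a simplified model, not an Azure API: the class `Redundancy`, the function `availability`, and the failure labels are assumptions for illustration, and the model describes the situation before any customer-initiated failover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redundancy:
    zone_redundant: bool   # copies in three data centers instead of one
    geo_redundant: bool    # asynchronous copy in a secondary region
    read_access: bool      # applications may read the secondary copy


def availability(cfg: Redundancy, failure: str) -> tuple:
    """Return (can_read, can_write) before any customer-initiated failover.

    `failure` is one of "device", "datacenter", "region".
    """
    if failure == "device":
        primary_up = True                  # another in-sync copy takes over
    elif failure == "datacenter":
        primary_up = cfg.zone_redundant    # only zone redundancy survives this
    elif failure == "region":
        primary_up = False
    else:
        raise ValueError(f"unknown failure scope: {failure}")
    secondary_readable = cfg.geo_redundant and cfg.read_access
    return (primary_up or secondary_readable, primary_up)


local_only = Redundancy(zone_redundant=False, geo_redundant=False, read_access=False)
zone_only = Redundancy(zone_redundant=True, geo_redundant=False, read_access=False)
zone_geo_read = Redundancy(zone_redundant=True, geo_redundant=True, read_access=True)

print(availability(local_only, "datacenter"))   # (False, False)
print(availability(zone_only, "datacenter"))    # (True, True)
print(availability(zone_geo_read, "region"))    # (True, False)
```

The last line reflects the read-only situation described above: the application can still show data from the secondary region, but writes fail until the cloud staff initiates a failover.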

To conclude: Azure offers a variety of configuration options for redundancy for Azure Storage Accounts, one of its most used services. Choosing an adequate option is an essential task for cloud architects. They have to balance the costs of the more sophisticated options with the actual availability needs based on the criticality of their applications.

Tutorial: Azure Function, Storage Accounts, and Managed-Identity-based Authentication

Serverless application logic, storage accounts, platform-as-a-service identity and access management: this small tutorial combines state-of-the-art Azure technologies in a four-step demo. The result: a serverless Azure Function invoked via a web browser that returns a welcome text, which the Azure Function reads from an Azure Storage account using Azure Identity for authentication and authorization.

Prerequisites for performing this tutorial are:

An Azure cloud account (we use only free services)
A local Visual Studio installation (free version sufficient)

Setting up the Azure Storage Account backend

Create a resource group named “rg_Demo26a” (Figure 1, A). The resource group collects and stores all the resources we create in the following. Having all resources in a dedicated resource group eases cleaning them up after completing this demo.

Next, create an Azure Storage account “sta26a” (Figure 1, B) in the resource group “rg_Demo26a”. On the “advanced” tab, disable public access and storage account key access and enable default Azure Active Directory authorization (C). Then, advance to the “networking” tab. Here, make sure to have “enable public access” activated (D). This setting means that our storage is directly connected to the Internet and only protected by access management methods. It is a dangerous setting, though it saves us time in the demo. Never do this in any Azure subscription used for anything but playing around.

Figure 1: Creating a Resource Group and an Azure Storage Account

Creating a Storage Account does not automatically allow you to upload or read files. Thus, before uploading any file to the Storage Account, you need to grant yourself “Contributor” rights for the Storage Account. To do so, go to your Storage Account “sta26a”, select “Access Control (IAM)” in the menu to the left (Figure 2, A), and push the “Add role assignment” button (B).

Then, search for roles containing “blob” (Figure 3, A), select the role “Storage Blob Data Contributor” (B), and click “next” (C) to proceed to the next submask. Here, choose “User, group, or service principal” for the “Assign access to” setting (D) and click on “Select members” (E). A new submask appears to the right. It presents the users of this Azure tenant. Select the user you are currently logged in as (F) and have a coffee afterward. It takes a few minutes until the new rights are active. If you proceed too early, the next step fails.

Figure 3: The actual role assignment to an AAD registered user

Now you are ready to upload a file. Create a text file “output.txt” on your laptop with the content “Hello world, hello Klaus!”. Upload this file into our Azure Storage Account.

Switch to the newly created Storage Account and select “upload” (Figure 4, A). Select “Create new” and type in “cont26a” as container name (B). Open the file explorer by clicking on the symbol to the right. Select the file we just created (C). Then, press “Upload” to load the file “output.txt” to the cloud (D). If the upload button is not blue, something went wrong in any of the previous steps.

Figure 4: Creating a container and uploading a file to an Azure Storage Account.

As a last step in the Azure Portal, for now, open the Storage Account “sta26a” and select the storage browser (A). Click on the blob containers (B), then on the container “cont26a” (C). You see the list of files stored there. Click on the three dots at the right of the row “output.txt” (D) and click “Copy URL” (E). Paste the URL into some document or file. You need the string in half an hour. Then, for the next step, switch to Visual Studio.

Figure 5: Retrieving the URL of a file stored in an Azure Storage Account

Creating an Azure Function

First, we create a simple predefined Azure Function and deploy it to the cloud. It does not connect to our Storage Account. The step is more for verifying that the overall setup is correct. In Visual Studio, select “File > New > Project” (Figure 6, A). Visual Studio will propose various templates. Type “Functions” into the search field (B), select “Azure Functions,” and continue (C). If you do not find this template, you can install missing ones after clicking on “Not finding what you are looking for? Install more tools and features” and searching for “Azure development”.

Figure 6: Starting to create an Azure Function project

The next mask asks you for a function name (use “FunctionAlpha”) and a location where to save your files (Figure 7, A). Then comes the “additional information” tab. Select “.NET 6” and “http trigger” as the function you want to get created. Make sure that the other settings are as in the figure (B): select “Use Azure,” deselect “Docker,” and use “Function” as the authorization level. As a result, Visual Studio creates a simple function that returns some output when invoked.

Figure 7: Configuring and creating the http listener example

We now execute the code locally on your laptop. Run the function by clicking on “FunctionAlpha” next to a green triangle in the menu bar at the top (Figure 8, A). A new window opens up, in which the listener process runs. This process waits for incoming http requests. To send a request, copy the URL provided in the window (B) and paste it into a web browser window. You should get the output as shown in the figure (C).

Figure 8: Running the http listener example

We now know that the function works locally on the laptop. Next, we deploy this function to an Azure runtime environment so that the function runs – serverless – in the cloud.

Deploying and Invoking an Azure Function

Azure comes with a feature to structure your functions – you do not create an Azure Function directly in a resource group but collect them in one or more Azure Function Apps. Create one in the Azure portal (Figure 9, A). Use resource group “rg_Demo26a” and type in “FunctionApp26a” as the name for the Function App. Select “Code” as the way of publishing, “.NET” as runtime stack, “6” as the version, and “Windows” as the operating system. On the next mask (B), choose to create a new storage account and type in “stafa26a” as its name. Select “review and create” and start the creation process.

Wait until you see that the resource has been created (C).

Figure 9: Creating an Azure Function App.

To deploy the function to the Azure cloud, select “Build > Publish Selection” (Figure 10, A), then “Azure” as target (B), and “Azure Function App (Windows)” (C). Finally, choose your subscription, then select the resource group “rg_Demo26a” and the recently created Azure Function App “FunctionApp26a” (D). Continue and wait for the (quick) configurations to take place.

You might think you are done. I thought precisely the same several times. However, you have to go back to “Build > Publish Selection” and click the publish button (E).

Figure 10: Deploying an Azure Function prepared in Visual Studio to the Azure Cloud.

Now, switch back to the Azure Portal. In the Azure Function App screen, we choose “Functions” in the menu bar to the left (Figure 11, A). We now see our function with the name “Function1” (B). Click on the name to open the details page, where you can retrieve the web function’s URL (C). Click on “Get Function URL,” copy the link, and paste it into a web browser (D).

Figure 11: Invoking the Azure Function deployed in the Azure Cloud

Connecting the Worlds …

We have created a Storage Account with one text file that our function should display when invoked via an internet browser. Therefore, we modify the function and read the welcome text from the Storage Account.

Open Visual Studio again and replace the Azure Function code with the following (check and update the blob URL if necessary):

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Identity;

namespace FunctionAlpha
{
    public static class Function1
    {
        [FunctionName("Function1")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
            ILogger log)
        {
            log.LogInformation("C# HTTP trigger function processed a request.");

            // Authenticate with the managed identity of the Function App.
            // This only works in Azure, not when running locally.
            ManagedIdentityCredential myCredentials = new ManagedIdentityCredential();

            log.LogInformation("**1**");

            // URL of the blob as copied from the portal:
            // https://<storage account>.blob.core.windows.net/<container>/<file>
            var myBlobUrl = "https://sta26a.blob.core.windows.net/cont26a/output.txt";
            BlobClient bc = new BlobClient(new Uri(myBlobUrl), myCredentials);

            log.LogInformation("**2**");

            // Download the blob content and convert it to a string.
            BlobDownloadResult downloadResult = await bc.DownloadContentAsync();
            string downloadedData = downloadResult.Content.ToString();

            log.LogInformation("**3**");

            return new OkObjectResult("Output text from file in the cloud: " + downloadedData);
        }
    }
}

Save the file, run it locally, and invoke it as before in Figure 8. Now you should get an error message in the browser, and the listener process should display an error message as well (see Figure 12).

Figure 12: Invoking the Azure Function accessing a blob in an Azure Storage account from a local laptop – without proper access rights

Next, deploy the updated Azure Function to the Azure Cloud, following the steps already shown in Figure 10. Then, run the function as illustrated in Figure 11. The result is another error message (HTTP Error 500), as Figure 13 illustrates. The exact text depends on your browser and the browser (language) settings.

Figure 13: Failing invocation of the updated Azure Function, resulting in an HTTP ERROR 500

The reason is simple: the Azure Function still has no read access to the file in the Azure Storage Account. To change this, we first have to switch on the system-assigned managed identity for our Azure Function App “FunctionApp26a”. Go to this Function App and select “Identity” in the menu bar to the left (Figure 14, A), change the status to “on” (B), and save the change (C).

Figure 14: Activating system-assigned managed identity

Now, we do one final reconfiguration and grant the function’s managed identity read access to the storage account. To do so, switch to the storage account “sta26a” (Figure 15, A), open “Access Control (IAM)” (B), and click on “add role assignment” (C).

On the next mask, search for the “Storage Blob Data Reader” role and click on “next” (Figure 16, A). On the next mask (B), select “Managed Identity,” click on “+ Select members,” and select “FunctionApp26a”. Then, click on “Select” and “Review + assign.”

Figure 16: Assigning a role to an Azure Managed Identity

Now copy the Azure Function’s URL again into a web browser. Be sure to retrieve the full URL, not just the beginning. If your URL looks like https://functionapp26a.azurewebsites.net/api/Function1, it will fail. It should look like https://functionapp26a.azurewebsites.net/api/Function1?code=**SOME-STRANGE-CHARACTERS**==. The result should look like Figure 17. If so, you made it. Time to celebrate!

Figure 17: Finally, what you should see at the end of the tutorial. The Azure Function retrieves text from the Azure Storage Account and returns it

PaaS Perimeter Security in Azure and GCP

With all the efficient and innovative Platform-as-a-Service (PaaS) services in the cloud world, a catastrophe is only one click away. An engineer wrongly configures a database or object storage service, and cybercriminals have access to all data from anywhere on the internet. Booz Allen Hamilton, WWE, Verizon Wireless, Accenture, the Pentagon: these are just some prominent companies and organizations that misconfigured their AWS S3 object storage and lost millions of data records. So, how can cloud security architects avert such catastrophes?

The article exemplifies typical but risky configuration options in Microsoft’s Azure and the Google Cloud Platform (GCP). Then, its focus shifts to Azure’s concepts of Private Endpoints and Private Links and to GCP’s VPC Service Controls. Looking at the different features of the two sample public cloud providers allows for a better understanding of the various security risks and mitigation approaches when protecting PaaS services.

Basic PaaS Configurations

When creating PaaS service instances via the GUI, the cloud vendors usually (also) suggest security configuration options with a relaxed security level. Their motto: keep it simple and easy to set up so that every engineer trying out the cloud and a specific service succeeds. But such configurations come with risks, which this section elaborates on in more detail.

In classic data centers, a solution can invoke internal web or microservices only if two conditions are met:

Authentication (and authorization), i.e., only specific applications, solutions, and users can invoke the service after verifying their identity and when they have the needed access rights and roles.
Reachability of the service in the network: Networks are typically divided into different zones, separated by firewalls. Invoking services in different zones requires opening up connections.

PaaS service default configurations usually require authentication but come with limited network-level protection. By default, PaaS services belong to one zone: the worldwide internet, where everybody can reach everything.
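The two conditions can be modeled as a small decision function. The following is a minimal sketch in Python, not any vendor's API: the Caller and Service types, the zone names, and the ANY_ZONE marker are hypothetical and exist only to illustrate the argument.

```python
from dataclasses import dataclass

ANY_ZONE = "*"  # hypothetical marker: reachable from every network zone


@dataclass(frozen=True)
class Caller:
    name: str
    zone: str            # network zone the caller connects from
    authenticated: bool  # identity has been verified
    roles: frozenset     # roles granted to this caller


@dataclass(frozen=True)
class Service:
    name: str
    required_role: str         # role a caller needs to invoke the service
    reachable_from: frozenset  # zones with an open network path to the service


def is_authorized(caller: Caller, service: Service) -> bool:
    """Condition 1: verified identity plus the needed role."""
    return caller.authenticated and service.required_role in caller.roles


def is_reachable(caller: Caller, service: Service) -> bool:
    """Condition 2: a network path exists from the caller's zone."""
    return ANY_ZONE in service.reachable_from or caller.zone in service.reachable_from


def can_invoke(caller: Caller, service: Service) -> bool:
    """A call succeeds only if both conditions hold."""
    return is_authorized(caller, service) and is_reachable(caller, service)


# An attacker on the internet who holds stolen but valid credentials.
attacker = Caller("attacker", "internet", True, frozenset({"reader"}))
internal_api = Service("billing-api", "reader", frozenset({"app-zone"}))
default_paas = Service("paas-database", "reader", frozenset({ANY_ZONE}))

print(can_invoke(attacker, internal_api))  # False: no network path
print(can_invoke(attacker, default_paas))  # True: authentication is the only barrier
```

In this model, a service reachable from everywhere relies on authentication alone, whereas the zoned service still blocks the attacker after the credentials have leaked.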

Figure 1 illustrates the network security configurations when creating instances for two typical GCP PaaS services: Cloud Storage, Google’s object storage service, and managed SQL database instances. When engineers create a database service in GCP, they have to choose between a public and a private IP address (Figure 1, 1). A public IP means that everyone on the internet can connect to this service. If the authentication mechanisms are in place and configured correctly and if attackers cannot get access keys etc., this is a perfectly safe approach. However, it is a challenge to ensure that none of the hundreds or thousands of engineers in a larger IT organization ever makes a critical mistake. And indeed, Google provides the option to restrict access to certain network zones to reduce the risk (Figure 1, 2).

Figure 1: GCP PaaS Services for storing data. Network security settings when creating SQL Server (left) and Cloud Storage instances (right)

When creating a Cloud Storage bucket, engineers must decide whether to enforce the prevention of public access to the data stored in this bucket. The challenge with any object storage, be it in GCP, Azure, AWS, or any other cloud, is the conflicting network security needs of its two primary use cases.

The first use case for object storage is storing classic (internal) application data. An archiving system might store PDF files; a security solution might store video recordings that document who performed which command on a particular system. Such data must not be made available on the internet. The perfect security setting for such an object storage service instance is “forbid all internet access.”

The second use case for object storage is storing and delivering websites, e.g., a web page with an interview text plus incorporated pictures. Everyone on the internet should be able to access the web pages and read the interview. There is no need for network firewalls or authentication mechanisms.

So, there are two highly relevant use cases, and one technology serves both. Usually, that is perfect, just not in this case. Here, it is a security risk. When you mistakenly configure the object storage for a bank’s know-your-customer documents as publicly accessible website storage, you might only notice when your documents appear for sale on the darknet.
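One way to catch such a mix-up is to record the intended use case for each bucket and compare it with the actual setting. The sketch below assumes a hypothetical inventory format: the Bucket type, the purpose labels, and the sample bucket names are invented for illustration and do not correspond to any cloud provider's API.

```python
from dataclasses import dataclass

# Hypothetical labels for the two use cases described above.
INTERNAL_DATA = "internal-data"
PUBLIC_WEBSITE = "public-website"


@dataclass(frozen=True)
class Bucket:
    name: str
    purpose: str                   # intended use case of the bucket
    public_access_prevented: bool  # actual network security setting


def find_exposed_internal_buckets(buckets):
    """Return the names of internal-data buckets that allow public access."""
    return [
        b.name
        for b in buckets
        if b.purpose == INTERNAL_DATA and not b.public_access_prevented
    ]


inventory = [
    Bucket("kyc-documents", INTERNAL_DATA, public_access_prevented=False),
    Bucket("archive-pdfs", INTERNAL_DATA, public_access_prevented=True),
    Bucket("press-interviews", PUBLIC_WEBSITE, public_access_prevented=False),
]

print(find_exposed_internal_buckets(inventory))  # ['kyc-documents']
```

The website bucket is public on purpose and passes the check; only the internal bucket with the wrong setting is reported.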

Public IPs and the risk of misconfiguration are not a GCP specialty. Figure 2 presents a GUI mask for creating a Cosmos DB database service via the Azure portal. Two settings create public endpoints: “all networks” and “public endpoint (selected networks).” Especially in the case of “all networks,” anyone on the internet can reach the Cosmos service instance. In other words, access control misconfigurations can have catastrophic results.

To state the situation clearly (and this applies equally to GCP’s managed SQL databases, Azure Cosmos DB, Azure SQL databases, and many more database services in the cloud): there is no sensible reason why databases should have a public IP. Databases are backend systems. There is no point in reaching them from outside the organization.

The screenshot in Figure 2 has a third option for Cosmos DB: Private Endpoints. The following section looks at the details.
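A simple policy check can encode the rule that databases should not get a public endpoint. This is only a sketch: the strings are the GUI labels quoted in the text, not values of any Azure API, and the function name is hypothetical.

```python
# GUI labels as quoted in the text, lower-cased. They are not API values.
PUBLIC_OPTIONS = {"all networks", "public endpoint (selected networks)"}
PRIVATE_OPTIONS = {"private endpoint"}


def creates_public_endpoint(connectivity_option: str) -> bool:
    """Return True if the chosen connectivity option exposes a public endpoint."""
    normalized = connectivity_option.strip().lower()
    if normalized in PUBLIC_OPTIONS:
        return True
    if normalized in PRIVATE_OPTIONS:
        return False
    # Unknown options are rejected instead of being silently treated as safe.
    raise ValueError(f"unknown connectivity option: {connectivity_option!r}")


for option in ("All networks", "Public endpoint (selected networks)", "Private endpoint"):
    verdict = "reject" if creates_public_endpoint(option) else "accept"
    print(f"{option}: {verdict}")
```

Raising an error for unknown labels is a deliberate choice here: a policy check should fail closed when it meets an option it does not understand.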

Figure 2: Network security options when creating a Cosmos DB

Azure Private Endpoint and Link

Some time ago, Azure introduced Private Endpoints and Private Links. They provide an extra layer of security. Thanks to them, engineers can incorporate PaaS service instances (e.g., Cosmos databases) into a VNet and access PaaS services without going via public IPs. When creating a Cosmos Database Service instance with this option, engineers create a Private Endpoint in their VNet with a VNet-internal, private, non-public IP. The Private Endpoint points to the actual resource, e.g., a Cosmos Database, via a Private Link.
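Whether an address is VNet-internal and non-public can be checked with Python's standard ipaddress module. The sketch below is an illustration under assumptions: the VNet range 10.1.0.0/16 and the sample addresses are invented, and in practice the address to test would come from elsewhere, for example a DNS lookup of the service hostname performed inside the VNet.

```python
import ipaddress


def is_vnet_internal(ip: str, vnet_cidr: str) -> bool:
    """Return True if ip is a private address inside the given VNet range."""
    address = ipaddress.ip_address(ip)
    network = ipaddress.ip_network(vnet_cidr)
    return address.is_private and address in network


VNET_RANGE = "10.1.0.0/16"  # assumed address space of the VNet

print(is_vnet_internal("10.1.0.5", VNET_RANGE))   # True: private and inside the VNet
print(is_vnet_internal("10.2.0.5", VNET_RANGE))   # False: private, but outside the VNet
print(is_vnet_internal("20.50.1.1", VNET_RANGE))  # False: public address
```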

“Private Endpoint” is the third connectivity option in the example of creating a Cosmos service instance (Figure 3). When chosen, engineers can create and add a private endpoint. The Private Endpoint has a name – and the engineer specifies into which VNet Azure shall deploy the Private Endpoint.

Figure 3: Private Endpoint configuration option for Azure Cosmos DB

Figure 4 illustrates what happens in the background in the case of a private endpoint / private link connectivity (B) versus the traditional approach (A).

In addition to configuring Azure Private Endpoints and Private Links, engineers must be aware of and mitigate two more risks. First, Private Links and Private Endpoints are safe in themselves, but the cloud architecture must ensure that no firewalls are open on the resource itself (risk C). The benefit of Private Endpoints is that there is no need to open a firewall to make solutions work and components interact. Thus, forbidding the opening of ports becomes feasible. Second, criminal employees might try to exfiltrate data. They could send company data to a personal Cosmos Database service instance that they own and control as private persons (D). With Private Endpoints and Private Links, cloud security architects can demand that all traffic to storage or database services goes via Azure Private Endpoints. Auditing and analyzing networks for exfiltration paths then becomes much more straightforward.

Figure 4: Classic PaaS Connectivity (A) versus Private Endpoint / Private Link (B). C and D illustrate remaining network security risks for Private Endpoints and Private Links.

Before discussing Google’s really cool feature for PaaS perimeter protection, a final remark about Azure Private Links. Azure Private Link Service enables engineers to make their own services and resources available via Azure Private Endpoints. Thereby, they benefit from the same security features for their code that Microsoft relies on for protecting its Azure PaaS services.

GCP Perimeter Protection with VPC Service Controls

Google’s VPC Service Controls concept enables cloud security architects to build a perimeter-protected network zone composed of IaaS and (!) PaaS resources. A VPC Service Control definition comprises five main elements: projects, services in scope, accessible services, access level, and traffic exceptions.

The GCP projects of a VPC Service Control define the trust perimeter. These projects trust each other and can freely interact as defined by the network configurations. In contrast, applications in other VPC Service Control perimeters must not access the resources if not permitted explicitly.

The GCP Services part of a VPC Service Control definition specifies for which services the perimeter protection of the VPC Service Control applies. That is a crucial setting. If, for example, Cloud Storage is not part of the perimeter definition, engineers can easily transfer data in and out of the VPC Service Control perimeter. A note: Google adds the feature to more and more services, but it might not yet be available for all services a company needs. Thus, disabling unprotectable GCP services can be a necessity.

GCP Accessible Services add another nuance to perimeter definitions. The feature enables engineers to limit which GCP services the resources within the Service Control perimeter can invoke.

As a result of a VPC Service Control definition, services fall into three categories regarding their callability:

Service instances of GCP services under VPC Service Control perimeter protection. They are only accessible from within the perimeter. Plus, they cannot invoke external service instances.
Service instances of GCP services without restriction. VMs within a VPC Service Control can connect to VPC-internal and external service instances. Plus, VPC external resources can (try to) connect to VPC-internal instances of this type.
Services blocked for use within a VPC Service Control (“GCP Accessible Services”) setting.
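The three categories can be expressed as a small classification function. This is a simplified model of the description above, not Google's implementation or API: the Perimeter type, its field names, and the category labels are hypothetical, and the service names are sample values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Perimeter:
    name: str
    projects: frozenset             # projects forming the trust perimeter
    restricted_services: frozenset  # services under perimeter protection
    # Services callable from inside the perimeter; None means "no limit".
    accessible_services: Optional[frozenset] = None


def categorize(service: str, perimeter: Perimeter) -> str:
    """Assign a service to one of the three categories listed above."""
    allowed = perimeter.accessible_services
    if allowed is not None and service not in allowed:
        return "blocked"       # not usable from within the perimeter
    if service in perimeter.restricted_services:
        return "protected"     # only accessible from within the perimeter
    return "unrestricted"      # no perimeter protection applies


perimeter = Perimeter(
    name="finance",
    projects=frozenset({"finance-prod"}),
    restricted_services=frozenset({"storage.googleapis.com"}),
    accessible_services=frozenset({"storage.googleapis.com", "pubsub.googleapis.com"}),
)

print(categorize("storage.googleapis.com", perimeter))   # protected
print(categorize("pubsub.googleapis.com", perimeter))    # unrestricted
print(categorize("sqladmin.googleapis.com", perimeter))  # blocked
```

The "unrestricted" category is the one to watch in a review, because data can move in and out of the perimeter through such services.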

The GCP Access Level is an option for context-aware authentication and authorization. It is a new and upcoming approach requiring a detailed analysis worth a dedicated article.

A last core element is an option for defining exceptions, i.e., adding ingress and egress traffic rules that allow otherwise forbidden traffic into and out of the VPC Service Control perimeter. Thus, VMs from within one service perimeter might get read access to Cloud Storage in a different service perimeter.
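The effect of such exceptions can be sketched as "deny across perimeters unless a rule says otherwise." The model below is a deliberate simplification for illustration: the ExceptionRule type, the perimeter names, and the "read"/"write" actions are hypothetical and do not mirror the actual rule format of VPC Service Controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionRule:
    source_perimeter: str  # where the request comes from
    target_perimeter: str  # where the resource lives
    service: str           # service the exception applies to
    action: str            # "read" or "write" in this simplified model


def is_allowed(source: str, target: str, service: str, action: str, rules) -> bool:
    """Allow traffic inside one perimeter; across perimeters only by exception."""
    if source == target:
        return True
    return ExceptionRule(source, target, service, action) in rules


rules = {
    # VMs in "analytics" may read Cloud Storage data held in "finance".
    ExceptionRule("analytics", "finance", "storage.googleapis.com", "read"),
}

print(is_allowed("analytics", "finance", "storage.googleapis.com", "read", rules))   # True
print(is_allowed("analytics", "finance", "storage.googleapis.com", "write", rules))  # False
print(is_allowed("finance", "finance", "storage.googleapis.com", "write", rules))    # True
```

Because every cross-perimeter flow needs an explicit entry, the set of rules doubles as the complete list of paths an auditor has to review.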

VPC Service Controls in the GCP world and Azure Private Link/Private Endpoint help cloud security architects shield PaaS (and IaaS) service instances from the internet or other internal network zones. Their absoluteness, especially that of the exceptionally rigid VPC Service Controls, is what makes them so valuable. No one has to define, analyze, and manage thousands of rules and exceptions to understand and ensure the overall security posture of a network consisting of IaaS and PaaS resources. Instead, PaaS security becomes (nearly) child’s play.

Cloud Security Architecture – AI and MLOps – Testing

Klaus Haller's Homepage and Publications