Application backups are not as simple as in the world of database lectures. Have you ever heard of the ACID properties, with “A” standing for “atomicity” and “D” for “durability”? After a database commit, everything is on disk. Nothing can get lost. Plus, all commands within the transaction are on disk – or everything is undone. When a database crashes and restarts, its data reflects precisely the effects of all committed transactions. This works as designed if an application relies on exactly one database. Sadly, applications are more complex, as Figure 1 illustrates. They tend to access not just one database but several in parallel, plus write data to files and file shares – and to disks attached to VMs.
Understanding Application-Consistent Backups and their Benefits
While private users can save their data by manually copying all their files to a different hard disk or the cloud, this approach is too simplistic for larger applications, even if we focus only on VM disks. If it is a large disk, files with names starting with “A” might get copied at 5:00 am and those starting with “Z” at 5:05 am. Thus, the “Z-files” could have changed between 5:00 am and 5:05 am, making the A-files and the Z-files inconsistent. Furthermore, copying open files causes issues: changes might still be in memory only and not yet written to disk, or there could be pending I/O transactions. Thus, a clean-up of files and their data might be necessary when restarting the application using such file copies. The clean-up can be automated or a manual task for engineers. In any case, it prolongs application outages.
Application consistency overcomes this challenge. The idea is to perform backups such that applications run after restarting a VM without any clean-up actions. Thus, business-critical applications benefit from lower downtimes, i.e., they can provide better Recovery Time Objectives. Plus, the organization benefits in crisis events, aka business continuity management situations. When a company has to evacuate all workloads to a different cloud data center, the engineers can rely on the VMs to restart and applications to come up without manual intervention. The engineers can focus on fixing more complex issues, e.g., related to integrations with other components, rather than the complete IT department being stuck with cleaning up file systems.
The most prominent solution for application-consistent backups is Microsoft’s Windows Volume Shadow Copy Service (VSS). Microsoft products come with it, and your organization’s applications (and those of your third-party software providers) can also implement it for Windows workloads. The exact details of VSS are, however, not so relevant from a cloud security architecture perspective. What matters are the features available for application-consistent backups in AWS, GCP, and Azure.
Application-Consistent Backups in the Cloud
Azure comes with a solution for application-consistent backups for Linux VMs, at least for those deployed with the Azure Resource Manager and not the Service Manager. It is a framework enabling application developers or operations specialists to integrate pre- and post-scripts into Azure’s backup process. Pre-scripts can invoke, for example, APIs of the application to tell the application to finish off all “write” activities. Then, Azure performs the backup copy. Afterward, Azure invokes the post-script, and normal operations continue. For this purpose, a configuration file (Figure 2) must be on all relevant VMs.
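Figure 2 shows such a configuration file; Microsoft documents it as VMSnapshotScriptPluginConfig.json under /etc/azure/ on the VM. As a rough sketch – the field names follow Microsoft’s documentation, while the script locations are placeholders you have to adapt – it looks roughly like this:

{
    "pluginName": "ScriptRunner",
    "preScriptLocation": "/scripts/pre-backup.sh",
    "postScriptLocation": "/scripts/post-backup.sh",
    "preScriptParams": ["", ""],
    "postScriptParams": ["", ""],
    "preScriptNoOfRetries": 0,
    "postScriptNoOfRetries": 0,
    "timeoutInSeconds": 30,
    "continueBackupOnFailure": true,
    "fsFreezeEnabled": true
}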
The pre- and post-scripts and this configuration file are critical from a security perspective because they run with root privileges. Thus, they must be secured so that an attacker who has gained access to the VM cannot change these settings and execute malicious code as root.
The situation for Windows VMs on Azure is much easier compared to the Linux world. By default, all VM backups use the Microsoft VSS service. So, if (and only if!) the applications on the VM implement VSS, all backups are application-consistent without the need for extra VM configurations. If not, the disk backup is not application- but only file-consistent.
Finally, a quick remark on the AWS and the Google Cloud Platform (GCP) features. Both follow the same approach as Azure: pre- and post-scripts for Linux VMs, and Microsoft’s VSS for Windows VMs.
Back to the Big Picture
To conclude: Application-consistent backups reduce downtimes of applications by reducing the work for engineers after crashes or VM evacuations. However, the term application consistency can be misleading. When looking again at Figure 1, it is clear that consistency between the VM disks and the database backups is not guaranteed. Applications still have to cope with the fact that the VM disk backup is from 4:07 am, one database backup from 4:05 am, and the second from 4:17 am. So, even with application-consistent backups, there are still exciting tasks and challenges for engineers in the area of backup and recovery!
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are essential concepts for business-critical applications. The RTO defines the maximum time an application might be unavailable after a crash – or after a data center burns down (Figure 1). That is the time engineers (or automatic processes) have to start the application on other servers – potentially at a different location – and restore all necessary data from backups.
The RPO, the Recovery Point Objective, defines how much data may be lost at most in such an event. It depends on the backup frequency. If a company makes a backup each night at 4 am, the RPO is 24 hours. Then, if the data center burns down at 4:15 am, only the data from the last 15 minutes is lost. If it burns down at 3 am, the data of 23 hours is lost.
One factor influencing the RTO is the time needed to restore the data from a backup. Network bandwidth and storage are significant factors, but so is the backup type. Full, incremental, and differential backups take differently long to create and to restore.
The starting point is always a full backup. The system writes all data to a secure backup storage (Figure 2). So, why would an architect not always go for a full backup? It is simple: Suppose an application is highly critical. Thus, the RPO is 5 min. Let us further assume that it is a massive database, but not much data changes within five minutes. Then, a full backup every five minutes seems too much. It might even be technically impossible, depending on network bandwidth and the amount of data. Incremental backups can be a better solution for such a scenario.
An incremental backup copies away only the most recent changes. If the last full backup was at 4 am, an incremental backup at 4:05 am copies the changes from the last five minutes. The next backup at 4:10 am again copies only the changes of the last five minutes, i.e., the changes between 4:05 am and 4:10 am. Incremental backups have one disadvantage compared to full backups: the restoration time might be longer. To restore the situation of 4:10 am, one has to fully restore the data from 4 am, then apply what changed until 4:05 am, and then the incremental changes until 4:10 am.
A middle way between incremental and full backups is the differential backup. A differential backup copies away all changes since the last full backup. At 4:05 am, a differential backup writes all changes of the last five minutes; at 4:10 am, all changes between 4:00 and 4:10; and at 4:15, all changes between 4:00 and 4:15. Thus, restoring a differential backup means: restore the full backup and (only) apply the changes of the last differential backup. In the cloud, however, these concepts mostly stay under the hood: the cloud providers implement standard configurations.
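To make the difference in restore chains tangible, here is a small, purely illustrative C# sketch – no backup tooling involved, the timestamps simply mirror the example above:

using System;
using System.Linq;

class RestoreChainDemo
{
    static void Main()
    {
        // Five-minute backups after the 4:00 am full backup; target restore point: 4:15 am
        string[] pointsInTime = { "4:05", "4:10", "4:15" };

        // Incremental strategy: restore the full backup, then apply every increment in order
        Console.WriteLine("Incremental restore: full 4:00 -> " + string.Join(" -> ", pointsInTime));

        // Differential strategy: restore the full backup, then apply only the latest differential
        Console.WriteLine("Differential restore: full 4:00 -> " + pointsInTime.Last());
    }
}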
The Azure Backup Center, for example, enables customers to configure several backups per day for VMs based on Recovery Services Vaults, e.g., every 4 hours. The first one is a full backup; the subsequent backups are incremental (Figure 3). Microsoft has made that decision for its customers. However, these concepts are still crucial for architects and engineers if they implement their own solutions or use 3rd party software with sophisticated configuration features. Then, they have to optimize and experiment themselves to identify an appropriate balance between …
backup frequency and full versus incremental backups, and
the time needed to restore a backup, which impacts the RTO.
Ever wondered how large corporations design their networks in the cloud? The hub-and-spoke pattern is probably the most important one for understanding their on-prem and cloud network designs, no matter whether an IT department runs on GCP, AWS, Azure, or any other cloud provider.
Mesh is the “free love” vision transferred to enterprise network designs, whereas the hub-and-spoke pattern implements more of a harem concept. But to start with the big picture: The focus is not on individual applications. Network design looks at how to organize connectivity and isolation for networks with hundreds or even thousands of applications. These applications serve business needs, such as SAP or self-developed insurance solutions. The design must also consider technical or security-related applications such as Web Application Firewalls, IAM services, Messaging Middleware, or Data Loss Prevention solutions.
Hub-and-Spoke vs Mesh Network
The first aspect is zoning. IT departments group their resources. VMs belonging to an HR application might be separated from air traffic control systems. They are in two separate zones. Network design patterns define how such zones interact, i.e., which zones have connectivity and can interact directly with which others. When looking at concepts, two patterns are particularly important: the mesh network pattern and the hub-and-spoke pattern (Figure 1).
A hub-and-spoke network is hierarchical. One zone is the hub. It connects with every other zone – the spokes. It is a classic 1-to-n relationship known from harems. Resources in two separate spoke zones always interact via the hub. A mesh network builds on the free-love idea: every zone can interact with every other zone (and zones might forward traffic to other zones in the absence of direct connectivity).
Why Hub-and-Spoke Patterns are Popular
Mesh networks (and free love) cause a mess if relationships and connectivity are not tightly managed. In contrast, the hierarchical model of hub-and-spoke networks allows for centralized governance and operations of components and eases controlling the network traffic flow.
Routing all traffic to and from the internet through one hub zone eases control and security. In this zone, DDoS protection, the (network-) data-loss-prevention solution, and other critical internet-connectivity-related applications must run. Then, perimeter security is in place for the complete data center.
One dedicated zone with internet connectivity also eases setting up web application firewalls (WAFs), which usually require integration with an IAM solution. Having both in dedicated central zones is much easier than having WAFs in ten zones and IAM solutions in seven.
Centralizing components in a hub is useful beyond internet connectivity: (Azure or on-prem) Active Directories, messaging middleware, application performance monitoring, and many more benefit from centralization. This insight does not mean building one monolithic hub with all centralized applications. Enterprise-scale networks can have several hubs, e.g., one serving as a management zone with monitoring and another for internet connectivity (Figure 2).
Grouping Criteria for Zoning
Zones are groups of (cloud) resources, e.g., VMs, isolated from other resources and VMs in other zones. The grouping criteria differ between organizations, companies, and departments. Typical grouping criteria are (Figure 2):
Applications or groups of applications forming one solution
Stages such as Production, Preproduction, Integration, Test, and Development.
Teams or departments managing the resources and potentially following different change management processes
Sensitivity of the data of applications
Cloud Features for Implementing Zones
Cloud providers do not offer a “zone” or a “hub-and-spoke network” feature. They provide sophisticated building blocks for structuring workloads and networks. Tenants, Virtual Private Clouds (VPCs), and Subnets are features for organizing networks and resources hierarchically. AWS, Azure, GCP, and all the others provide routing tables for – well – routing. Access Control Lists and Network Security Groups (the exact names differ between the clouds) enable engineers to implement firewalls for blocking or allowing specific network traffic. The clouds come with Internet Gateways, NATs, etc. Everything is ready for you to implement effective and secure network designs. However, whether you do that and how many free-love-like connections you allow is up to you and all other network architects of the world.
PaaS services promise software engineers nothing less than effortless eternal happiness: a few clicks, and you have a ready-to-use, secure, and endlessly scalable cloud service, be it an AI, a database, or a super-exotic service. But do cloud providers deliver as promised? Is, for example, redundancy child’s play? Redundancy is much trickier than backups. Backups are about having a copy of an application’s data for emergencies. In contrast, redundancy guarantees that the application runs 7×24. Even major disasters must not impact a solution’s stability.
In the following, we look closely at one crucial Azure service: Azure Storage Accounts. It is Microsoft’s web service for storing data such as objects – aka blobs – and files. We elaborate on which kinds of disasters or failures the web service copes with – and how transparent this is from an application or operations perspective. Azure provides different options for configuring the recovery behavior of Storage Accounts (Figure 1), which the following paragraphs explain in more detail. Companies need specific guidelines on when to use which one – and writing such a guideline is a crucial cloud (security) architecture task.
Cloud architects have to make (up to) three redundancy-related configuration decisions:
Should the data be replicated three times within one data center (local redundancy), or should there be a copy in three separate data centers (zone redundancy)?
Should there be copies in a data center further away in a secondary region (geo-redundancy)? For regulatory and compliance reasons, geo-redundant copies are usually in the same jurisdiction as the original data. Azure stores, e.g., geo-redundant copies for data in Switzerland-North (greater Zurich area) in the Azure region Switzerland-West around Geneva.
Should applications be allowed to read the geo-redundant copies if the primary region fails (“read access”)?
Figure 2 and Figure 3 illustrate the configuration options for a better understanding. The architecture in Figure 2 visualizes local redundancy. An application reads and writes data stored in one data center (A). The different copies are synchronously replicated and always in sync. If – and only if – the engineers choose the geo-redundant add-on, Azure replicates the data asynchronously to one data center in a secondary region (B). In such a case, the Storage Account can be configured to allow applications to read from the storage account in the secondary region if the first region is unavailable (C).
Figure 3 looks very similar. It illustrates zone redundancy. In such a setup, the three copies are in different data centers in the primary region (A). If geo-redundancy is active, the asynchronous copies in the secondary region are stored in a single data center (B), as in the previous case in Figure 2. Applications can get read access in case of unavailability of the primary region – as with local redundancy (C).
In the next step, we look closer at three failure scenarios: a failure of a single device, a data center, and an entire region. Which configuration option allows the application to continue to run? How transparent are such incidents for applications? What does the cloud operations team or the applications team have to do?
First, a failure of a single device in a data center has no impact on any application or operational staff. Azure always maintains three up-to-date copies of any data, be it in one or be it in three data centers. These copies are always in sync. Thus, if a device fails, another device in the same or another data center takes over and has precisely the same data. There is neither data loss nor interruption of application availability, nor any necessity for manual interventions.
A failure of a complete data center is transparent for zone-redundant storage accounts. Two other data centers have in-sync copies of the data if one data center burns down or is temporarily unavailable, e.g., due to power issues. Applications can continue to read or write to the Storage Account. No engineer has to intervene. In contrast, a data center failure can have severe consequences if a Storage Account is configured with local redundancy. The data becomes unavailable for read and write operations. If the data center does not come up again, all data is gone.
If an entire region fails due to a large-scale disaster, all Storage Accounts without a geo-redundancy option are temporarily or forever unavailable. Storage Accounts with geo-redundancy have a copy in a different region. In contrast to “normal” local or zone redundancy, geo-redundancy is not transparent. It might require application code adaption. Plus, the cloud staff has to perform failover and cleanup tasks.
Failover to the secondary region is an explicit decision of the customer. Its cloud staff must initiate the failover. Then, the secondary site becomes the new primary one, though two aspects require attention. First, data loss might occur. Microsoft has no formal SLA but mentions a recovery point objective (RPO) of 15 minutes. In other words: data written in the last 15 minutes before the failure can be lost. The application (and the business) must be able to cope with that. Second, Storage Accounts store their data only in one data center after the failover (local redundancy). Engineers have to reconfigure the Storage Accounts to re-establish zone- or geo-redundancy. Eventually, they might want to move the data back to the original primary region.
When an entire region fails, this impacts applications even with the geo-redundancy option and even if the application code continues to run. Applications cannot read or write any data to or from their Storage Account in the primary region. Either the application code handles such circumstances (e.g., by showing users a “temporarily unavailable” page), or the application behaves unpredictably or crashes. Applications can reduce the impact on users with more sophisticated routines. The option to allow applications to read from secondary regions in case of a failure of the primary one enables applications to show users at least the currently available data. Writing changes is not possible – or only to another storage account not impacted; quite some work, but potentially worth the effort for highly business-critical applications.
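The read-access option also surfaces directly in the storage SDKs. The following C# sketch – account, container, and blob names are placeholders – shows the general idea for a Storage Account with read-access geo-redundancy, using the Azure.Storage.Blobs and Azure.Identity packages:

using System;
using Azure.Identity;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

class GeoRedundantReadDemo
{
    static void Main()
    {
        // Read-access geo-redundant accounts expose an additional, read-only "-secondary" endpoint
        var options = new BlobClientOptions
        {
            GeoRedundantSecondaryUri = new Uri("https://mystorageacct-secondary.blob.core.windows.net")
        };

        // With this option set, read requests are retried against the secondary region
        // whenever the primary region does not respond.
        var service = new BlobServiceClient(
            new Uri("https://mystorageacct.blob.core.windows.net"),
            new DefaultAzureCredential(),
            options);

        BlobClient blob = service.GetBlobContainerClient("mycontainer").GetBlobClient("data.json");
        BlobDownloadResult result = blob.DownloadContent();  // reads still succeed during a primary outage
        Console.WriteLine(result.Content.ToString());
    }
}

Writes, as described above, still require the explicit failover.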
To conclude: Azure offers a variety of configuration options for redundancy for Azure Storage Accounts, one of their most used services. Choosing an adequate option is an essential task for cloud architects. They have to balance the costs of the more sophisticated options with the actual availability needs based on the criticality of their applications.
Serverless application logic, storage accounts, platform-as-a-service identity and access management: this small tutorial combines state-of-the-art Azure technologies in a four-step demo. The result: a serverless Azure Function invoked via a web browser that returns a welcome text, which the Azure Function reads from an Azure Storage Account using Azure Identity for authentication and authorization.
Prerequisites for performing this tutorial are:
An Azure cloud account (we use only free services)
A local Visual Studio installation (free version sufficient)
Setting up the Azure Storage Account backend
Create a resource group named “rg_Demo26a” (Figure 1, A). The resource group collects and stores all the resources we create in the following. Having all resources in a dedicated resource group eases cleaning them up after completing this demo.
Next, create an Azure Storage account “sta26a” (Figure 1, B) in the resource group “rg_Demo26a”. On the “advanced” tab, disable public access and storage account key access and enable default Azure Active Directory authorization (C). Then, advance to the “networking” tab. Here, make sure to have “enable public access” activated (D). This setting means that our storage is directly connected to the Internet and only protected by access management methods. It is a dangerous setting, though it saves us time in the demo. Never do this in any Azure subscription used for anything but for playing around.
Creating a Storage Account alone does not allow you to upload or read files. Thus, before actually uploading any file to the Storage Account, you need to grant yourself “Contributor” rights for it. Therefore, go to your Storage Account “sta26a”, select “Access Control (IAM)” in the menu to the left (Figure 2, A), and push the “Add role assignment” button (B).
Then, search for roles containing “blob” (Figure 3, A), select the role “Storage Blob Data Contributor” (B), and click “next” (C) to proceed to the next submask. Here, choose “user, group” for the “Assign access to“ setting (D) and click on “Select members” (E). A new submask appears to the right. It presents the users of this Azure tenant. Select the user you are currently logged in as (F) and have a coffee afterward. It takes a few minutes until the new rights are active. If you proceed too early, the next step fails.
Now you are ready to upload a file. Create a text file “output.txt” on your laptop with the content “Hello world, hello Klaus!”. Upload this file into our Azure Storage Account.
Switch to the newly created Storage Account and select “upload” (Figure 4, A). Select “Create new” and type in “cont26a” as the container name (B). Open the file explorer by clicking on the symbol to the right and select the file we just created (C). Then, press “Upload” to load the file “output.txt” to the cloud (D). If the upload button is not blue, something went wrong in one of the previous steps.
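If you prefer code over clicking, the same upload can be scripted. The sketch below uses the Azure.Storage.Blobs and Azure.Identity NuGet packages and assumes that your local login (e.g., via Visual Studio or the Azure CLI) carries the role assignment from the previous step:

using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Blobs;

class UploadDemoFile
{
    static async Task Main()
    {
        // Container "cont26a" in Storage Account "sta26a", as in the Portal steps above
        var container = new BlobContainerClient(
            new Uri("https://sta26a.blob.core.windows.net/cont26a"),
            new DefaultAzureCredential());  // picks up your local Azure login

        await container.CreateIfNotExistsAsync();
        await container.UploadBlobAsync("output.txt",
            BinaryData.FromString("Hello world, hello Klaus!"));
        Console.WriteLine("output.txt uploaded");
    }
}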
As a last step in the Azure Portal, for now, open the Storage Account “sta26a” and select the storage browser (A). Click on the blob containers (B), then on “cont26a” (C). You see the list of files stored there. Click on the three dots at the right of the row “output.txt” (D) and click “Copy URL” (E). Paste the URL into some document or file. You need the string in half an hour. Then, for the next step, switch to Visual Studio.
Creating an Azure Function
First, we create a simple predefined Azure Function and deploy it to the cloud. It does not yet connect to our Storage Account. The step is more for verifying that the overall setup is correct. In Visual Studio, select “File > New > Project” (Figure 5, A). Visual Studio proposes various templates. Type “Functions” into the search field (B), select “Azure Functions,” and continue (C). If you do not find this template, you can install missing ones after clicking on “Not finding what you are looking for? Install more tools and features” and searching for “Azure development”.
The next mask asks you for a function name (use “FunctionAlpha”) and a location where to save your files (Figure 6, A). Then comes the “additional information” tab. Select “.NET 6” and “Http trigger” as the type of function you want to get created. Make sure that the other settings are as in the figure (B): select “Use Azure,” deselect “Docker,” and use “Function” as the authorization level. As a result, Visual Studio creates a simple function that returns some output when invoked.
We now execute the code locally on your laptop. Run the function by clicking on “FunctionAlpha” next to the green triangle in the menu at the top (Figure 6, A). A new window opens up, in which the listener process runs. This process waits for incoming HTTP requests. To send a request, copy the URL provided in the window (B) and paste it into a web browser window. You should get the output as shown in the figure (C).
We now know that the function works locally on the laptop. Next, we deploy this function to an Azure runtime environment so that the function runs – serverless – in the cloud.
Deploying and Invoking an Azure Function
Azure comes with a feature to structure your functions – you do not create an Azure Function directly in a resource group but collect them in one or more Azure Function Apps. Create one in the Azure portal (Figure 8, A). Use the resource group “rg_Demo26a” and type in “FunctionApp26a” as the name for the Function App. Select “Code” as the way of publishing, “.NET” as the runtime stack, “6” as the version, and “Windows” as the operating system. On the next mask (B), choose to create a new storage account and type in “stafa26a” as its name. Select “review and create” and start the creation process.
Wait until you see that the resource has been created (C).
To deploy the function to the Azure cloud, select “Build > Publish Selection” (Figure 8, B), then “Azure” as target (B), and “Azure Function App (Windows)” (C). Finally, choose your subscription, then select the resource group “rg_Demo26a” and the recently created Azure Function App “FunctionApp26a” (D). Continue and wait for the (quick) configurations to take place.
You might think you are done. I thought precisely the same several times. However, you have to go back to “Build > Publish Selection” and click the publish button (E).
Now, switch back to the Azure Portal. In the Azure Function App screen, choose “Functions” in the menu bar to the left (A). We now see our function with the name “Function1” (B). Click on the name to open the details page, where you can retrieve the web function’s URL (C). Click on “Get Function URL,” copy the link, and paste it into a web browser (D).
Connecting the Worlds …
We have created a Storage Account with one text file that our function should display when invoked via an internet browser. We now modify the function to read the welcome text from the Storage Account.
Open Visual Studio again, add the NuGet packages Azure.Identity and Azure.Storage.Blobs (the code below needs them), and replace the Azure Function code with the following (check and update the blob URL if necessary):
public static class Function1
{
    [FunctionName("Function1")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("C# HTTP trigger function processed a request.");
        // Authenticate as the Function App's system-assigned managed identity (fails when running locally)
        ManagedIdentityCredential myCredentials = new ManagedIdentityCredential();
        // Blob URL: Storage Account "sta26a", container "cont26a", file "output.txt"
        var myBlobUrl = "https://sta26a.blob.core.windows.net/cont26a/output.txt";
        BlobClient bc = new BlobClient(new Uri(myBlobUrl), myCredentials);
        BlobDownloadResult downloadResult = await bc.DownloadContentAsync();
        string downloadedData = downloadResult.Content.ToString();
        return new OkObjectResult("Output text from file in the cloud: " + downloadedData);
    }
}
Save the file, run it locally, and invoke it as before in Figure 8. Now you should get an error message in the browser; the listener process should display an error message (see Figure 12).
Next, deploy the updated Azure Function to the Azure Cloud, following the steps already shown in Figure 10. Then, run the function as illustrated in Figure 11. The result is another error message (HTTP Error 500), as Figure 13 illustrates. The exact text depends on your browser and the browser (language) settings.
The reason is simple: the Azure Function still has no read access to the file in the Azure Storage Account. To reach this aim, we have to switch on the system-assigned managed identity for our Azure Function App “FunctionApp26a”. Go to this Function App and select “Identity” in the menu bar to the left (Figure 14, A), change the status to “on” (B), and save the change (C).
Now, we do one final reconfiguration and grant the function’s managed identity read access to the storage account. Therefore, switch to the storage account “sta26a” (Figure 15, A), open “Access Control (IAM)” (B), and click on “add role assignment” (C).
On the next mask, search for the “Storage Blob Data Reader” role and click on “next” (Figure 16, A). On the next mask (B), select “Managed Identity,” click on “+ Select members,” and select “FunctionApp26a”. Then, click on “Select” and “Review + assign.”
Now copy the Azure Function’s URL again into a web browser. Be sure to retrieve the full URL, not just the beginning. If your URL looks like https://functionapp26a.azurewebsites.net/api/Function1, it will fail. It should look like https://functionapp26a.azurewebsites.net/api/Function1?code=**SOME-STRANGE-CHARACTERS**==. The result should look like Figure 17. If so, you made it. Time to celebrate!
With all the efficient and innovative Platform-as-a-Service (PaaS) services in the cloud world, a catastrophe is only one click away. An engineer wrongly configures a database or object storage service, and cybercriminals have access to all data from anywhere on the internet. Booz Allen Hamilton, WWE, Verizon Wireless, Accenture, the Pentagon: these are just some prominent companies and organizations that misconfigured their AWS S3 object storage and lost millions of data records. So, how can cloud security architects avert such catastrophes?
The article exemplifies typical but risky configuration options in Microsoft’s Azure and the Google Cloud Platform (GCP). Then, its focus shifts to Azure’s concepts of Private Endpoints and Private Links and to GCP’s VPC Service Controls. Looking at the different features of the two sample public cloud providers allows for a better understanding of the various security risks and mitigation approaches when protecting PaaS services.
Basic PaaS Configurations
When creating PaaS service instances via the GUI, the cloud vendors usually suggest (also) security configuration options with a relaxed security level. Their motto: simple and easy to set up, so that every engineer trying out the cloud and a specific service succeeds. But such configurations come with risks, which this section elaborates on in more detail.
In classic data centers, a solution can invoke internal web or microservices only if meeting two conditions:
Authentication (and authorization), i.e., only specific applications, solutions, and users can invoke the service after verifying their identity and when they have the needed access rights and roles.
Reachability of the service in the network: Networks are typically divided into different zones, separated by firewalls. Invoking services in different zones requires opening up connections.
PaaS service default configurations usually require authentication but come with limited network-level protection. By default, PaaS services belong to one zone: the worldwide internet, where everybody can reach everything.
Figure 1 illustrates the network security configurations when creating instances of two typical GCP PaaS services: Cloud Storage, Google’s object storage service, and managed SQL database instances. When engineers create a database service in GCP, they have to choose between a public and a private IP address (Figure 1, 1). A public IP means that everyone on the internet can connect to this service. If the authentication mechanisms are in place and configured correctly, and if attackers cannot get access keys etc., this is a perfectly safe approach. However, it is a challenge to ensure that none of the hundreds or thousands of engineers in a larger IT organization ever makes a critical mistake. And indeed, Google provides the option to restrict access to certain network zones to reduce the risk (Figure 1, 2).
When creating a Cloud Storage bucket, engineers must decide whether to enforce the prevention of public access to the data stored in this bucket. The challenge with any object storage – be it in GCP, Azure, AWS, or any other cloud – is the contradicting network security needs of the two primary use cases.
The first use case for object storage is storing classic (internal) application data. An archiving system might store PDF files, a security solution potentially video recordings that document who performed which command on a particular system. Such data must not be made available to the internet. The perfect security setting for such an object storage service instance is “forbid all internet access.”
The second use case for object storage is storing and delivering websites, e.g., a web page with an interview text plus incorporated pictures. Everyone on the internet should be able to access the web pages and read the interview. There is no need for network firewalls or authentication mechanisms.
So, there are two highly relevant use cases, and one technology solves both needs and challenges. Usually, that is perfect – just not in this case. Here, it is a security risk. When you configure the object storage for a bank’s know-your-customer documents by mistake as publicly accessible website storage, you might only notice when your documents appear for sale on the darknet.
Public IPs and the risk of misconfiguration are not a GCP specialty. Figure 2 presents a GUI mask for creating a Cosmos DB database service via the Azure portal. Two settings create public endpoints: “all networks” and “public endpoint (selected networks).” Especially in the case of “all networks,” anyone on the internet can reach the Cosmos service instance. In other words, access control misconfigurations can have catastrophic results.
To point out the situation clearly – and this is the same for GCP’s managed databases, Azure’s Cosmos DB or SQL databases, and many more database services in the cloud – there is no sensible reason why databases should have a public IP. Databases are backend systems. There is no point in reaching them from outside the organization.
The screenshot in Figure 2 has a third option for Cosmos DB: Private Endpoints. The following section looks at the details.
Azure Private Endpoint and Link
Some time ago, Azure introduced Private Endpoints and Private Links. They provide an extra layer of security. Thanks to them, engineers can incorporate PaaS service instances (e.g., Cosmos databases) into a VNet and access PaaS services without going via public IPs. When creating a Cosmos Database Service instance with this option, engineers create a Private Endpoint in their VNet with a VNet-internal, private, non-public IP. The Private Endpoint points to the actual resource, e.g., a Cosmos Database, via a Private Link.
“Private Endpoint” is the third connectivity option in the example of creating a Cosmos service instance (Figure 3). When chosen, engineers can create and add a private endpoint. The Private Endpoint has a name – and the engineer specifies into which VNet Azure shall deploy the Private Endpoint.
Figure 4 illustrates what happens in the background in the case of a private endpoint / private link connectivity (B) versus the traditional approach (A).
In addition to configuring Azure Private Endpoints and Private Links, engineers must be aware of and mitigate two more risks. First, Private Links and Private Endpoints are safe in themselves, but the cloud architecture must ensure that no firewall on the resource itself is opened up (risk C). The benefit is that there is no need to open a firewall to make solutions work and components interact. Thus, forbidding opening ports becomes feasible. Second, criminal employees might try to exfiltrate data. They could send company data to a personal Cosmos Database service instance controlled and owned by themselves as private persons (D). Thanks to Private Endpoints and Private Links, cloud security architects can demand that all traffic to storage or database services goes via Azure Private Endpoints. Auditing and analyzing networks for exfiltration paths then gets much more straightforward.
Before discussing Google’s really cool feature for PaaS perimeter protection, a final remark about Azure Private Links. The Azure Private Link Service enables engineers to make their own services and resources available via Azure Private Endpoints. Thereby, they benefit from the same security features for their code that Microsoft relies on for protecting its Azure PaaS services.
GCP Perimeter Protection with VPC Service Controls
Google’s VPC Service Controls concept enables cloud security architects to build a perimeter-protected network zone composed of IaaS and (!) PaaS resources. A VPC Service Control definition comprises five main elements: projects, services in scope, accessible services, access level, and traffic exceptions.
The GCP projects of a VPC Service Control define the trust perimeter. These projects trust each other and can freely interact as defined by the network configurations. In contrast, applications in other VPC Service Control perimeters must not access the resources if not permitted explicitly.
The GCP Services part of a VPC Service Control definition specifies for which services the perimeter protection of the VPC Service Control applies. That is a crucial setting. If, for example, Cloud Storage is not part of the perimeter definition, engineers can easily transfer data in and out of the VPC Service Control perimeter. A note: Google adds the feature to more and more services, but it might not yet be available for all services a company needs. Thus, disabling unprotectable GCP services can be a necessity.
GCP Accessible Services add another nuance to perimeter definitions. The feature enables engineers to limit which GCP services resources within the Service Control perimeter can invoke.
As a result of a VPC Service Control definition, services – or more precisely, their callability – fall into three categories:
Service instances of GCP services under VPC Service Control perimeter protection. They are only accessible from within the perimeter. Plus, they cannot invoke external service instances.
Service instances of GCP services without restriction. VMs within a VPC Service Control can connect to VPC-internal and external service instances. Plus, VPC external resources can (try to) connect to VPC-internal instances of this type.
Services blocked for use within a VPC Service Control via the “GCP Accessible Services” setting.
The GCP Access Level is an option for context-aware authentication and authorization. It is a new and upcoming approach requiring a detailed analysis worth a dedicated article.
A last core element is the option for defining exceptions, i.e., adding ingress and egress traffic rules that allow otherwise forbidden traffic into and out of the VPC Service Control. Thus, VMs from within one service perimeter might get read access to Cloud Storage in a different service perimeter.
VPC Service Controls in the GCP world and Azure Private Links/Private Endpoints help cloud security architects shield PaaS (and IaaS) service instances from the internet or other internal network zones. Their absoluteness – especially that of the exceptionally rigid VPC Service Controls – is what makes them so valuable. No one has to define, analyze, and manage thousands of rules and exceptions to understand and ensure the overall security posture of a network consisting of IaaS and PaaS resources. Instead, PaaS security becomes (nearly) child’s play.
Commonalities and differences between securing VM-based Workloads on Microsoft Azure and the Google Cloud Platform (GCP)
When moving from an on-premises data center to the cloud, architecting and engineering a secure network is the first challenge. The days of buying hardware and wiring devices are over. Instead, cloud network engineers configure the network with templates and web GUIs. A secure network setup reflects three needs:
Workload structuring and separation, i.e., the high-level network design
Fine-grain connectivity and security controls
Enabling intra- and inter-company interactions between systems, applications, and components
For these three areas, we answer two questions. First, which network architecture features do cloud vendors offer to reach these aims? Second, what are the commonalities and differences between cloud providers such as Microsoft Azure or Google’s GCP? The latter helps architects understand the underlying principles rather than just product features.
The High-Level Network Design for Clouds
Four layers structure the cloud network security (Figure 1), with the second-highest being the most intuitive to understand. This second-highest layer corresponds to classic, simple on-prem networks. An IP range constitutes a network or network zone and gets a name and identifier. It is a moment of creation: let there be a network with the IP range 10.0.0.0 to 10.15.255.255 and call it “production environment headquarters.” In general, network architects should choose addresses from the private IP range, i.e., from 10.0.0.0 to 10.255.255.255, 172.16.0.0 to 172.31.255.255, or 192.168.0.0 to 192.168.255.255.
Zoning concepts typically reflect one or more of the following dimensions to structure the network landscape and workloads:
Geographical locations: Germany, Switzerland, Singapore, and China – or New York, Boston, San Francisco
Stages, for example, production, preproduction, test, and development
Business lines, e.g., investment banking, retail banking, corporate banking, private banking, or wealth management
Data sensitivity classes such as secret, confidential, internal, or public
IT departments subdivide network zones into subnets, each taking over a subset of the zone’s IP range. A zone with the IP range 10.0.0.0 to 10.15.255.255, for example, could have two subnets: 10.0.0.0 to 10.0.255.255 and 10.4.0.0 to 10.4.63.255. Subnets allow for finer granular structuring.
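In CIDR notation, the zone above is 10.0.0.0/12, and the two subnets are 10.0.0.0/16 and 10.4.0.0/18. The following small C# sketch – purely illustrative, no cloud SDK involved – checks that a planned subnet really falls inside its parent zone’s range:

using System;
using System.Linq;
using System.Net;

class CidrContainmentCheck
{
    // Converts an IPv4 address into a 32-bit unsigned integer (most significant byte first)
    static uint ToUInt(string ip) =>
        IPAddress.Parse(ip).GetAddressBytes().Aggregate(0u, (acc, b) => (acc << 8) | b);

    // Splits "10.0.0.0/12" into the address part and the prefix length
    static (string Address, int Prefix) Split(string cidr)
    {
        var parts = cidr.Split('/');
        return (parts[0], int.Parse(parts[1]));
    }

    // True if the child CIDR block lies completely within the parent CIDR block
    static bool Contains(string parentCidr, string childCidr)
    {
        var parent = Split(parentCidr);
        var child = Split(childCidr);
        if (child.Prefix < parent.Prefix) return false;               // the child must not be larger
        uint mask = parent.Prefix == 0 ? 0 : uint.MaxValue << (32 - parent.Prefix);
        return (ToUInt(parent.Address) & mask) == (ToUInt(child.Address) & mask);
    }

    static void Main()
    {
        Console.WriteLine(Contains("10.0.0.0/12", "10.0.0.0/16"));   // True
        Console.WriteLine(Contains("10.0.0.0/12", "10.4.0.0/18"));   // True
        Console.WriteLine(Contains("10.0.0.0/12", "10.16.0.0/16"));  // False: outside 10.0.0.0-10.15.255.255
    }
}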
Zoning is a complex task – and that is why cloud security architects have to support the design process. Should all VMs of an application reside in one subnet – or all VMs belonging to the same application or solution cluster? Should frontend and backend components reside in one subnet, or should they be in entirely different zones? Security architecture is one (important) stakeholder. Network architecture and business and enterprise architecture have a say as well.
Implementing High-Level Zoning Concepts in Azure
In Azure, VNets and Subnets are the features for implementing zones and subnets. A VNet’s IP range is coherent. IP addresses within a VNet are unique, though different VNets can have overlapping IP addresses. Within VNet A with all its subnets, there can be only one VM with IP 10.10.10.10. However, if an IT department has two VNets Z1 and Z2, both can contain one VM with IP 10.10.10.10. Furthermore, all resources of a VNet, such as VMs, reside in one region.
A fundamental security aspect is the connectivity between VNets and Subnets. Which VMs can talk with each other by default, which ones not, and which kind of shielding can engineers implement? Azure isolates VNets with their subnets and VMs from each other. Without explicit configuration, a VM in one VNet cannot connect to a VM in a different VNet. In contrast, VMs within a VNet can contact each other by default, even if residing in different Subnets. So, placing VMs in two subnets of the same VNet does not provide any extra level of security without additional (fine-granular level) measures we discuss later.
Implementing High-Level Zoning Concepts in GCP
VPC Networks and Subnets are GCP’s corresponding concepts for structuring a network and its IP range. VMs placed in a VPC Network, whether in the same or different GCP Subnets, can reach every other VM within the VPC. GCP allows VPCs to have overlapping IP ranges, as Figure 2 illustrates. In the example, there is one VPC Network with the IP address range 10.0.0.0/16 and one with 10.0.0.0/20, which obviously overlap.
A GCP Subnet is a feature preventing chaos in large VPC Networks. A Subnet takes over parts of the IP range of the VPC Network to which it belongs. All IPs are unique within a Subnet, implying that Subnet IP ranges do not overlap. In GCP, Subnets also have a geographical aspect. Resources within a GCP Subnet reside in one geographic GCP region.
The Management Layers
The need for a sophisticated management layer for organizing hundreds or more VNets or VPC Networks is evident when looking at the companies and customers the cloud vendors target. They do not want (only) small companies or startups paying for 5 or 10 VMs. They want the big fish as well: corporations listed in the Swiss Market Index, the Eurostoxx 50, etc. These companies need thousands or tens of thousands of VMs. Two layers – zones such as VNets or VPCs plus Subnets – are insufficient for large clouds. Thus, the cloud providers introduced additional management layers on top.
When trying out the clouds for the first time, many get the impression that accounts are the highest ordering principle, typically represented by an email address and an additional name or identifier. But the cloud providers offer features for combining accounts: GCP Organizations in the Google world and Azure Management Groups for Microsoft. They help with governance, billing, and identities. These features also help companies for which the move to the cloud was a decentralized (chaotic) grass-roots movement – or whose networks and cloud infrastructures evolved over the years due to mergers and acquisitions.
Below an Azure Subscription, aka an account, Microsoft provides the concept of Resource Groups, into which all VMs, VNets, and all other components are placed. Google has the concept of GCP Projects, which corresponds to Azure Resource Groups. But there is one big difference: Google’s idea of GCP Folders, a tree-like, flexible structure with multiple layers and nesting. GCP Folders help in modeling and structuring workloads based on various dimensions such as business units, geographic regions, or stages with ease (Figure 3). In contrast, Azure has a rigid three-level model: organization, subscription, resource group.
The management layers typically provide options to enforce and monitor configurations (“policies”). However, here we focus on directly relevant network security configuration options. In this context, GCP has one related feature: Hierarchical Firewall Policies. GCP allows associating organizations and folders with such policies. They enforce specific firewall rules for all folders and VMs lower in the hierarchy (or delegate decisions to lower levels) and can restrict the applicability of rules to specific VMs by specifying target networks or service accounts.
Fine-Granular Access Control for Azure
Microsoft and Google follow different philosophies regarding fine-granular access control (and firewalls) within a VNet or VPC Network. Azure provides more – and more rigid – concepts making it essential for engineers to understand the interplay. GCP has fewer, though partially more general or flexible features.
The Azure World
Azure enables architects to secure networks with three concepts: network security groups, application security groups, and firewalls.
A network security group (NSG) is a firewall attached to Azure Subnets or individual VMs (specifically: a VM’s network interface), enabling engineers to control connectivity even within Azure Subnets. An NSG defines which IP addresses outside the Azure Subnet can reach which inside IP address – and vice versa. The rules can apply to individual IPs, groups of IPs, or the complete Subnet. If VMs serve different purposes – frontend, backend, or database servers – NSGs allow configuring their exposure to other subnets or the global internet differently, based on exact necessities and the risk appetite.
Application Security Groups (ASGs) are a refinement of NSGs and overcome their biggest downside: IP-based rules. If IP addresses change or the number of frontend servers handling customer requests increases, all relevant rules in the NSG rule sets require modification.
An ASG is a set of VMs with a dedicated tag. They are not (!) identified based on an IP. Instead of opening the HTTPS port to 10.5.10.26 in a pure NSG setup, ASGs allow opening the HTTPS port for all VMs of the subnet belonging to the ASG “asg-frontend-vms”. The ASG might contain a VM with the IP 10.5.10.26, but any other VM within the subnet would be governed by the same rules as soon as, and as long as, the VM is part of the ASG.
Figure 4 provides an example of an ASG – “asg-frontend-api-servers” – on the left. The sample VM on the right belongs to precisely one ASG: “asg-frontend-api-servers”.
An ASG impacts the VMs associated with it once the ASG becomes part of an NSG. Suppose an NSG allows HTTPS traffic to a VM handling incoming API requests. Then, rather than explicitly listing IPs in the NSG (and updating them when starting or stopping VMs), the engineers create an ASG and add the relevant VMs to this ASG.
In addition to VNets, Subnets, Network Security Groups, and Application Security Groups, Azure also allows configuring a dedicated security service, Azure Firewall. Such a firewall helps block unwanted traffic. It is a component typically placed at the perimeter between company VNets and the global internet. In contrast, VNets, Subnets, NSGs, and ASGs help control the company-internal network traffic.
The various components and features might be overwhelming when looked at for the first time – and might seem to overlap. That is one way to see it. The other way to look at all these features is that managers and architects can achieve the same goal with overlapping features – and thus share responsibilities between teams more efficiently. A high level of network security when going live is necessary, but it is not enough. IT departments have to manage and maintain all network components to stay on the same security level in the years to come. More sophisticated and partially overlapping features make the work distribution more manageable.
Fine-Granular Access Control for GCP
NSGs on the Subnet and VM level, together with ASGs – Azure provides various options for fine-granular connectivity configuration. In contrast, GCP provides firewalls (only) on the VPC level. Rules applying only to one VM are part of the VPC firewall rulesets. This seemingly minor technical detail impacts network and firewall processes in IT organizations. Rules on the VPC level favor central firewall management, making it more challenging to delegate (some) firewall rule decisions to individual application teams.
Figure 6 illustrates the options for defining VPC Firewall Rules. A rule can apply to all VMs of a VPC Network. It can apply to specifically tagged VMs (like the ASG concept in Azure) or to VMs with specific service accounts. The latter is a concept not found in Azure. When defining firewall rules, architects can further restrict where the network traffic may come from, based on IP ranges, source tags, or a service account within the same VPC.
To sum things up: GCP and Azure provide different concepts for firewall rules and connectivity. The result: different setups and organizational approaches. Just one final warning for GCP: The GCP portal presents default rules for opening SSH, ICMP, or even all ingress traffic. These rules simplify the first steps in GCP for newbies – and might result in risky configurations.
By default, VMs in different VNets and VPC Networks cannot communicate. However, there are solutions for inter-zone communication – and they do not require traffic via the internet: Azure VNet Peering and Google Cloud VPC Network Peering.
Suppose one zone has the IP range 10.0.0.0 to 10.0.255.255 and a second one 10.4.0.0 to 10.4.0.255. Once peered, their VMs can interact easily: no colliding IP addresses cause issues when determining the actual target VM for traffic. However, why should cloud architects peer zones at all, thereby allowing communication between VMs placed in two zones that were meant to isolate the workloads?
Peering is not beneficial for two-zone networks but for networks with many. GCP, for example, allows peering VPC Networks of the same or different GCP Projects. The VPC Networks can even belong to different GCP Organizations. VMs in peered VPC Networks can communicate with each other if not prevented by firewall rules. The twist is the non-transitive nature of peering. Suppose three VMs – VM1, VM2, and VM3 – are in three different VPC Networks: Z1, Z2, and Z3. Peering Z1 and Z2 plus Z2 and Z3 enables the communication between VM1 and VM2 and between VM2 and VM3, but not between VM1 and VM3. Non-transitive peering is vital when setting up typical hub-and-spoke network designs (Figure 7).
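Because reachability follows direct peerings only, the hub remains the single mediator between spokes. A toy C# sketch – plain data structures, no cloud SDK – illustrates this non-transitive behavior:

using System;
using System.Collections.Generic;

class PeeringDemo
{
    static void Main()
    {
        // Direct peerings only - peering is not transitive
        var peerings = new HashSet<(string, string)>
        {
            ("Z1", "Z2"), ("Z2", "Z1"),   // hub Z2 peered with spoke Z1
            ("Z2", "Z3"), ("Z3", "Z2")    // hub Z2 peered with spoke Z3
        };

        bool CanTalk(string a, string b) => peerings.Contains((a, b));

        Console.WriteLine(CanTalk("Z1", "Z2"));  // True
        Console.WriteLine(CanTalk("Z2", "Z3"));  // True
        Console.WriteLine(CanTalk("Z1", "Z3"));  // False: no direct peering, traffic must go via the hub
    }
}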
VNet Peering in Azure comes with different features and two main configuration options:
Enabling inbound and/or outbound traffic separately – not just allowing or disabling bidirectional traffic as in GCP
Restricting or approving traffic not originating in the peered VNet, also a feature not available in GCP
Peering is a simple yet powerful concept. It helps enable and control company-internal network communications, especially for setting up zones hosting communication middleware or management components.
Connecting VMs and the Internet
Public IPs for VMs are evil; this is the first message a security architect should hammer into all engineers’ minds. Vulnerabilities on the VM or application layers or simple misconfigurations are potential entry doors for attackers. However, locking up all VMs completely is also not possible. Admins cannot plug keyboards into VMs to administer them; they come in via the network as well.
The solution? Intermediate components or services that forward only parts of the network traffic (e.g., only TCP while blocking UDP traffic) and potentially even inspect the traffic. In particular, the following components and services are relevant:
Network Address Translation (NAT) Services such as the GCP Cloud NAT or Azure Virtual Network NAT. They allow VMs without public IPs to invoke services on the internet without exposing the VMs to external attackers.
Load Balancers such as the Microsoft Azure Load Balancer or Google’s Cloud Load Balancer. Their primary purpose is to distribute high loads to multiple backend VMs, hiding this fact from the invoking applications. At the same time, a load balancer reduces the attack surface, first by preventing direct access to VMs and, second, by limiting the traffic that gets through, e.g., to the HTTP(S) protocol.
Web Application Firewalls restrict even the content of http(s) requests, preventing, e.g., attacks with malformed invocations.
Azure Bastion allows admins to connect to and manage VMs without exposing them with a public IP. It works as follows: administrators connect first to the Azure portal. From there, the bastion host provides access to the VMs. A bastion host is a hardened server, reducing the likelihood that (unneeded) components introduce security issues.
These services are worth more profound analysis. Plus, for some cases, 3rd party solutions are an option, not only the ones from the cloud vendors. But all this is another long story, as is protecting VMs and their workloads against malware and other attacks.
To conclude, Microsoft Azure and Google’s GCP provide similar yet different features for creating and securing networks. Engineers aiming at getting applications to run on VMs benefit from their similarity. In contrast, security architects have a different perspective. They do not want to open up network connections. They must close the obvious and hidden holes in the network design. For them, it is essential to understand all the little features and details where GCP and Azure – or any other public cloud – differ. It is less like searching for Easter eggs on a sunny day in spring and more like isolating and avoiding mines in challenging terrain or on a stormy sea.