Tag Archives: Backups

Azure Backup Functionality for IaaS Workloads

VMs and Fileshares are fundamental building blocks for IaaS workloads in on-prem or cloud environments such as Azure. So, which features does Azure provide for companies to back up their IaaS-related data and components? And how does Azure help prevent undesired manipulations or deletions of such backups? To answer these questions, this article elaborates on the interplay between, first, Azure’s VM and file share services and their configurations with, second, Azure Recovery Services vaults for providing, managing, and maintaining VM and file share backups.

An initial remark before going into details: geo- and zone-redundancy are related but different concepts. They reduce the risk of data loss due to the unavailability or destruction of hardware components and data center incidents. They do not allow restoring earlier versions of the data and backups after erroneous or malicious data manipulation. “Going back in time” is the unique selling proposition of backups. Still, keeping backups in various geographic locations helps, e.g., if a larger electricity blackout brings down data centers for days in a larger geographic area.

Backup Features for VMs in Azure

Azure’s solution for backups of IaaS cloud workloads is the Recovery Services vault (RSV). The configuration for VM backups in the portal is straightforward. With the “enhanced backup” feature, Azure can back up VMs as often as every 4 hours (Figure 1, 1). There is an option to keep backups close to allow for a quick restore (2). Most importantly, and not available for all types of data and services in the cloud, Azure provides a long-term storage option to keep one backup per week, month, or year for months or even years (3). The final configuration option in the Azure portal is the list of VMs in scope for the backup – then, the configuration is complete.
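The combination of frequent backups and long-term retention can be illustrated with a small Python sketch. It is not Azure's implementation: the function name `select_retained`, the retention windows, and the rule "keep the first backup of each week, month, and year" are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta


def select_retained(backups, now, keep_all_days=7, weekly_weeks=12,
                    monthly_months=12, yearly_years=5):
    """Return the sorted backup timestamps a tiered policy would keep.

    Hypothetical rules: keep every backup from the last `keep_all_days`,
    plus the first backup of each ISO week, calendar month, and calendar
    year while it is younger than the respective long-term window.
    """
    windows = {
        "week": timedelta(weeks=weekly_weeks),
        "month": timedelta(days=31 * monthly_months),
        "year": timedelta(days=366 * yearly_years),
    }
    seen = {"week": set(), "month": set(), "year": set()}
    retained = set()
    for ts in sorted(backups):
        age = now - ts
        if age <= timedelta(days=keep_all_days):
            retained.add(ts)
        keys = {
            "week": tuple(ts.isocalendar())[:2],
            "month": (ts.year, ts.month),
            "year": ts.year,
        }
        for period, key in keys.items():
            if key not in seen[period]:
                seen[period].add(key)  # first backup of this period
                if age <= windows[period]:
                    retained.add(ts)
    return sorted(retained)
```

With a backup every four hours, such a policy keeps only a small fraction of all backups while still covering long time spans.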

Figure 1: Configuring VM Backups in the Azure Portal

Protecting and Securing VM Backups in Azure

Azure has various built-in features for preventing accidental or intentional deletion of backups. The most radical is the “immutable” option (Figure 2, 1). If switched on (and locked in), backups cannot be deleted before the retention period has expired. This immutability feature is a lifesaver if ransomware attackers delete or encrypt critical data. A second feature, multi-user authorization (2), enables IT organizations to demand that a second person approves critical activities such as vault deletion operations or modifications of backup policies. It benefits organizations by preventing severe misconfigurations resulting in the loss or unavailability of current or future backups, whether by mistake or on purpose. To put it differently: Immutable backups help rebuild your data center after really severe incidents. Multi-user authorization helps prevent such a mess from happening in the first place and ensures that your backups exist.

Finally, the soft delete setting enables rolling back deletion operations (Figure 2, 3 and 4). In the context of VM backups in RSVs, the feature is especially beneficial to restore the status quo ante after smaller application management or engineering mistakes. If application managers notice something was deleted by mistake some time ago, they can easily restore it. This is helpful for operational mistakes but only of limited value against ransomware attacks. Engineers can circumvent the soft delete feature – and even Microsoft documents how to delete all data forever, even if soft delete is active. A closing remark for VM backups: These configuration options apply to VMs and their backups, though the configuration takes place via Recovery Services vaults.
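How the three protection features interact can be shown with a toy model in Python. It is a simplification under stated assumptions, not Azure's logic: the classes `VaultSettings` and `Backup`, the 14-day grace period, and the order of the checks are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class VaultSettings:
    immutable_locked: bool = False       # immutability switched on and locked
    needs_second_approver: bool = False  # second person must approve
    soft_delete_days: int = 14           # hypothetical grace period


@dataclass
class Backup:
    expires: datetime                    # end of the retention period
    deleted_at: Optional[datetime] = None


def request_delete(vault, backup, now, requester, approver=None):
    """Decide what happens to a delete request in this toy model."""
    if vault.immutable_locked and now < backup.expires:
        return "denied: immutable until retention expires"
    if vault.needs_second_approver and approver in (None, requester):
        return "denied: second approver required"
    if vault.soft_delete_days > 0:
        backup.deleted_at = now          # kept for a grace period
        return "soft-deleted"
    return "deleted"


def undelete(vault, backup, now):
    """Roll back a soft delete while the grace period is still running."""
    if backup.deleted_at is None:
        return False
    if now - backup.deleted_at > timedelta(days=vault.soft_delete_days):
        return False
    backup.deleted_at = None
    return True
```

The model makes the different purposes visible: immutability blocks the deletion outright, the second approver blocks a single person acting alone, and soft delete only buys time to notice a mistake.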

Azure Fileshare Backups

File shares are not a CISO’s darling, but the technology has existed for decades and will probably continue to live for some more years. File shares enable uncomplicated interactions between users themselves, applications themselves, and between users and applications. It might be an often-redundant technology, but the ability to use file shares is a must in any context of legacy applications.

Fileshare Backups in Azure

While (in the portal) the relevant configuration options for Azure VM backups are on the Recovery Services vault, the situation is different for file shares. Many backup-related configurations take place on the Azure Storage Accounts, which contain the file shares. And a warning for those who understand backing up Azure Blobs, which are also stored in Azure Storage Accounts: backups for blobs (see “Blob and PostgreSQL Backups in Azure”) differ from the ones for file shares.

Azure supports ad-hoc backups (just click “add snapshot” in the portal’s snapshots mask) and periodic backups (Figure 3). Configuration options for periodic backups are the frequency (every four hours or less often) and how long backups are kept. Azure allows configuring retention periods of several years.

Protecting and Securing Fileshare Backups in Azure

Protecting file share backups is delicate because it is not just about protecting the backups. It is also about protecting the Storage Accounts. They contain the actual backup data. If deleted, all associated backups are gone. Thus, cloud security architects should be aware of the various configuration options for Storage Accounts and Recovery Services vaults to prevent the deletion or discontinuation of critical backups (Figure 4).

Figure 4: Protecting Azure File Share Backups

On the top are configuration options on the Storage Account level and for the Recovery Services vault. Most important is the immutability feature. When active and locked in, it guarantees that nobody – and really nobody – can delete these backups. It is a brand-new feature in public preview. Second, there is an option to forbid deleting the Storage Account utilizing a delete-lock. It makes deleting Storage Accounts hard to impossible. The different purposes are crucial: immutability is about having a backup when someone tries or succeeds in deleting all or critical data. The lock helps more to prevent the Storage Account deletion, which would bring down applications (even if the data can be restored with an immutable backup). Thus, its purpose is to improve reliability and application uptime.

On the Azure File Share level, Azure provides the soft-delete feature (Figure 5). It has some peculiarities, especially since Azure supports soft-delete for SMB file shares and not for NFS ones. The SMB variant is particularly strong in the Windows (or, more generally, Microsoft) world. The scope for soft-delete is not a single file but complete file shares. If switched on, engineers can restore a file share after its deletion. However, it does not bring back individual files if they are deleted. For this, the backup functionality with periodic backups or ad-hoc snapshots has to be used. So, to conclude: Azure provides various backup-related features, but understanding them in detail is critical to prevent issues if organizations really need them in critical situations.

Figure 5: Azure Storage Account Configurations for File Shares

Blob and PostgreSQL Backups in Azure: Features & Security

Backup features are core services all public clouds provide. Azure’s characteristic is a confusing backup service portfolio. Two backup technologies coexist: the Backup vault and the Recovery Services vault. Neither is better: the data source you want to back up defines which vault type to use – and there are even services that can be backed up in different ways. As a rule of thumb, the Recovery Services vault is the solution for typical IaaS workloads such as Windows and Linux VMs, file shares, and specific databases running on VMs. The Backup vault is strong with Azure PaaS services. Azure starts to hide the complexity by introducing the Azure Backup Center (Figure 1). It is a unified user interface for handling backups in the Azure portal. Still, when architecting the backup solution, understanding the differences is crucial. Even for one vault type, nuances depend on the data.

Figure 1: Vault Types in the Azure Backup Center (Recovery Services vault – 1 and Backup vault – 2)

A typical enterprise workload in the Azure cloud might comprise …

VMs and their disks
Object Storage, i.e., Azure Blobs in Azure Storage Accounts
File Storage in Azure Storage Accounts
DBaaS (Cosmos DB, Azure SQL Managed Instance, Azure SQL)

Then, a Recovery Services vault would take the VMs, including customer-installed and managed databases on VMs, plus file systems stored in Azure’s Storage Accounts. Blobs and DBaaS data, however, go into a Backup vault, though things turn out to be different than expected if you take a closer look as we do in the following paragraphs. We discuss how to configure ad-hoc and periodic backups for blobs and DBaaS. And the interesting aspect is: In the most recent forms, the Backup vaults are not or only slightly involved.

Operational Backups for Azure Blobs

The quickest way to back up a single Blob in Azure is to go to the Azure portal, switch to the blob, and select “make a snapshot” – and you are done (Figure 2). Super convenient and super quick, just not doable if you have hundreds of storage accounts and want to be sure that you get a backup every hour.

Figure 2: Snapshot backups for single Blobs

For backup automation, Azure’s most advanced solution for blobs in Azure Storage Accounts is the Operational Backup for Azure Blobs feature. Azure Storage Accounts can have multiple (block blob) containers (Storage Accounts can also store and manage other data: file shares, queues, or tables). The actual blobs reside inside these containers. Figure 3 illustrates these three main blob management layers and how they contribute to securing blob backups.

Figure 3: Azure backup features for Blobs, including protection options

When creating a storage account in the Azure portal, engineers make several backup-related choices, as the options in Figure 4 illustrate:

Point-in-time recovery (PITR): The ability to roll back to any time within the retention period. The maximum value is 360 days (Figure 4, A).
Soft-delete for blobs and containers (B and C). If these features are active, Azure keeps a copy of blobs and containers after their deletion. Engineers could restore them if the deletion was a mistake (or a malicious activity). Soft-delete for blobs allows restoring a single object, but soft-delete on the container level is necessary when someone deletes a container with the included blobs.
Versioning for blobs (D): In contrast to files, object storage does not modify a blob. It replaces a blob with a newer version. Versioning means keeping (some) older versions. Azure can restore an older blob version, e.g., if modifications were incorrect or if ransomware attackers replace objects with encrypted versions.
The version change feed (E) is Azure’s feature to ensure non-repudiation of blob changes by logging any change.
The immutability feature, if switched on, makes it impossible to delete a backup for a defined period. No administrator, hacker, or Microsoft employee can delete backups protected with this feature.
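The difference between versioning and soft-delete from the list above can be sketched with a small in-memory model in Python. It only illustrates the concepts; the class `VersionedContainer` and its methods are hypothetical and do not correspond to the Azure Storage API.

```python
class VersionedContainer:
    """Toy in-memory container with blob versioning and soft delete."""

    def __init__(self):
        self._versions = {}    # blob name -> list of contents, oldest first
        self._deleted = set()  # names that are currently soft-deleted

    def put(self, name, data):
        # A write never changes an old version; it appends a new one.
        self._versions.setdefault(name, []).append(data)
        self._deleted.discard(name)

    def get(self, name):
        if name in self._deleted or name not in self._versions:
            raise KeyError(name)
        return self._versions[name][-1]

    def delete(self, name):
        # Soft delete: hide the blob but keep all of its versions.
        if name not in self._versions:
            raise KeyError(name)
        self._deleted.add(name)

    def undelete(self, name):
        self._deleted.discard(name)

    def restore_version(self, name, index):
        # Make an older version current again by writing it as a new version.
        self.put(name, self._versions[name][index])
```

If an attacker overwrites a blob with an encrypted copy, `restore_version` brings back the earlier content; if the blob is deleted, `undelete` makes it visible again.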

Figure 4: Data Protection options for Storage Accounts

After creating the Storage Account, engineers can lock in the immutability setting and switch on the already mentioned “operational backup” feature. It comes with many features mentioned in the last paragraph: PITR, versioning, and soft-delete. Its biggest benefit is the integration with the general Backup Center management GUIs (Figure 5).

Figure 5: Activating Operational Backups for Azure Blobs

Protecting the Backups

Network-level shielding and access control, disaster readiness, backup loss prevention – these three topics drive the protection of your backups. Network shielding reflects the protection of the backups and the backup-related services on the network level. The actual backups are never directly accessible for Azure customers and users, only via the portal. Shielding the service for a concrete Azure tenant is possible with private endpoints. They are the method of choice in the Azure (PaaS) world, complemented by role-based access control based on identities and roles managed in a central solution, e.g., the Azure Active Directory. Privileged Identity Management (PIM) is Azure’s feature for even more security for privileged roles and users, valuable not only for classic admins but also for engineers who configure backup settings.

The second big topic, disaster readiness, relates to preparations for surviving large-scale incidents. A backup does not help if located in the same burned-down data center as the VMs. Thus, geo- (or zone-) redundant backups are valuable. Finally, backup loss prevention is about preventing backups from being deleted by successful attackers – or due to operational mistakes of employees. Two features drive this aspect:

Resource Guard to enforce a four-eyes principle before critical backup-related operations
Immutability, i.e., the absolute impossibility that anyone can delete a backup.

Finally, if you encrypt the backup with a customer-managed key, restoring the backup is only possible if the specific key exists – but encryption and key management are big topics. One article is not enough to cover all the relevant aspects.

Figure 6: Storage Account Settings: PITR, Soft Delete, Versioning

Backups for Azure Database for PostgreSQL

PostgreSQL is one of the database-as-a-service offerings from Microsoft. And when looking just at the various PostgreSQL variants, the full complexity of backups in the cloud using Azure-native tools becomes visible. The PostgreSQL variants are single server, flexible server, and Azure Arc. The latter addresses mixed on-prem and cloud workloads, so I focus on the first two. A first glance at the backup features tells you that backup vaults are where the backups are stored – and that immutable backup is an available feature (at least in the preview). But this holds only for the PostgreSQL single-server variant. The flexible server variant is the one you probably choose for newer software architectures. What Azure-native backups offer here differs, as I lay out in the following paragraphs.

First, a short remark on how PostgreSQL stores and manages the data internally: An Azure PostgreSQL flexible server instance is what one calls a database server in the on-prem world. It is an environment that can host many databases with the actual data. So, we have a three-layered model: flexible server, database, data. Regarding backups, there are many similarities with blob backups but some distinct differences. Most importantly, point-in-time recovery is available with a maximum retention period of 35 days. Azure allows configuring these backups as geo-redundant, in which context Microsoft mentions a recovery-point objective of around 1 hour, though this is not a guaranteed service level agreement. However, there is no built-in solution for long-term backups. Customers have to implement their backup solution with the help of the PostgreSQL command pg_dump. They must implement processes performing daily exports and cleaning up older backup files.
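Such a home-grown export process could look roughly like the following Python sketch. It is a minimal illustration under assumptions: the host, user, and directory names are placeholders, the function names are hypothetical, and the command is only built, not executed. The `pg_dump` options used (`--host`, `--username`, `--dbname`, `--format=custom`, `--file`) are standard options of the tool.

```python
import time
from datetime import date
from pathlib import Path


def build_pg_dump_command(host, user, dbname, out_dir, day):
    """Build the pg_dump call for one daily export (custom archive format)."""
    out_file = Path(out_dir) / f"{dbname}-{day.isoformat()}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--dbname", dbname,
        "--format=custom",
        "--file", str(out_file),
    ]


def prune_old_dumps(out_dir, keep_days, now=None):
    """Delete dump files older than `keep_days`; return the removed names."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    removed = []
    for path in sorted(Path(out_dir).glob("*.dump")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

A scheduler would run the command once a day, for example with `subprocess.run(cmd, check=True)`, and call `prune_old_dumps` afterwards. Credentials would typically come from the `PGPASSWORD` environment variable or a `.pgpass` file rather than from the script.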

Figure 7: Configuration options for built-in automatic backups for Azure Databases

Securing PostgreSQL Flexible Server Backups

The security features for PostgreSQL flexible server backups are much more limited than for blobs, as Figure 8 illustrates. Identity and access management restrict access for users, at least if attackers cannot hack a company’s Active Directory. The continuous backup for point-in-time recovery provides the option to roll back the database server for a defined number of days. The only built-in security feature, however, is a delete lock preventing the deletion of the flexible server. The deletion of the server automatically deletes all backups. Adding and removing a lock on the server are actions that users with a specific role can perform. If an internal user with such a role (or an attacker who has taken over the account of such a user) deletes the lock, they can delete the complete server with all its backups afterward. There is no built-in immutable backup option, though writing backup routines to dump exports to immutable (blob) storage is an option. Otherwise, IT departments really have to trust their Active Directory security measures and their admin and backup employees if they rely on such backups.

Figure 8: Azure backup features for PostgreSQL Flexible Server backups, including protection options

So, what is the conclusion for cloud security architects from what we can learn from Azure cloud-native features for backups for blobs and PostgreSQL flexible servers? First, many advanced cloud-native backup features help companies to improve their backup implementation. Second, the availability of features depends on the exact PaaS service and service variant. In other words: it looks chaotic and random, which backup features an Azure (database or storage) service has. Thus, the third and most important learning is: ensuring a defined backup service level for a complete application landscape in Azure is a big challenge for architects. It requires governance, which cloud services applications can use, and clear, enforceable guidelines on performing and securing these backups.

Backup Strategies and Concepts in the Public Clouds

When moving their workload to the cloud, companies must rethink their backup concepts and strategies. Reuse existing on-premise backup solutions, switch to cloud-native backup features, rely on new cloud-ready 3rd-party software, or even contract a backup service provider? The world is full of opportunities. The key is understanding what remains the same in the cloud – and what changes.

Formulating the Requirements

The cloud does not change any backup requirements; cloud providers only complicate or simplify backup implementations. The three well-known dimensions of functional requirements for backups stay the same:

Recovery Time Objective (RTO), i.e., the duration needed to restore the data from a backup
Recovery Point Objective (RPO), i.e., the maximum amount of data that is lost if a backup has to be restored
Backup location, i.e., the physical location where the backup resides.
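The first two dimensions can be made concrete with a short Python sketch. It assumes that the worst-case data loss equals the largest gap between two consecutive backups; the function names and the idea of comparing a measured restore duration against the RTO are illustrative choices, not a standard API.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, window_end):
    """Largest gap between consecutive backups, including the gap from the
    last backup to `window_end`. An incident at the end of that gap loses
    this much data. Expects at least one backup time."""
    times = sorted(backup_times) + [window_end]
    return max(later - earlier for earlier, later in zip(times, times[1:]))


def meets_objectives(backup_times, window_end, rpo, measured_restore, rto):
    """Check the observed backup gaps and restore duration against RPO/RTO."""
    return (worst_case_data_loss(backup_times, window_end) <= rpo
            and measured_restore <= rto)
```

A single skipped backup run doubles the worst-case data loss, which is why monitoring the actual backup times matters as much as the configured schedule.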

The location aspect covers two dimensions. The first relates to data residency: In which jurisdictions are you allowed to store your backups, and where do you have to keep a copy? In highly regulated sectors such as health or financial industries, regulators push companies and organizations to store the data in the regulator’s sphere of influence, aka the local country. Data protection laws can also restrict data transfers – including backup data – to other countries.

The second dimension of location reflects how far away geographically one keeps backups. A backup two kilometers away from the primary data center survives if the primary data center burns down. It does not help if a devastating flood destroys all buildings in a valley. So, clarifying which disasters your backup should survive is essential, too.

Finally, there is one aspect every backup product owner might want to stay away from, and that is archiving. Companies tend to move old data out of their operational systems and into an archive. Archiving is about understanding which data should be kept for how long and when data can be deleted, and about ensuring that data stored in old file formats remains accessible for ten years. Archiving is essential but orthogonal to ensuring that the organization has data backups to restore business operations after a blackout or earthquake.

Manageability of Backup Requirements

When hundreds or thousands of applications try to optimize the backup parameters to suit their needs optimally, just the discussions cost a fortune, plus there is the risk that nobody notices that some critical applications make wrong choices. Thus, larger IT organizations group applications with similar needs and relevance. “Top-10” applications or “Gold” versus “Silver” versus “Bronze” are sample names for service levels – for backups and beyond. A few simple categories ease understanding whether an application has the correct service level. One should never forget the cost impacts backups and service levels generally have. Maybe it is worth investing in having an online shop back operational within minutes. However, it is a waste of money to do the same for a Christmas card application allowing sales staff to order cards every year in the second week of November.
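A handful of service levels can be captured in a few lines of Python. The tier names follow the text above; all concrete values and the helper `cheapest_sufficient_tier` are hypothetical and serve only to show how an application's requirements map to a category.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceLevel:
    rpo: timedelta
    rto: timedelta
    retention_days: int


# Hypothetical tiers and values, for illustration only.
SERVICE_LEVELS = {
    "gold": ServiceLevel(timedelta(minutes=15), timedelta(hours=1), 3650),
    "silver": ServiceLevel(timedelta(hours=4), timedelta(hours=8), 365),
    "bronze": ServiceLevel(timedelta(hours=24), timedelta(days=3), 30),
}


def cheapest_sufficient_tier(required_rpo, required_rto):
    """Return the least demanding tier that still meets both requirements."""
    for name in ("bronze", "silver", "gold"):
        level = SERVICE_LEVELS[name]
        if level.rpo <= required_rpo and level.rto <= required_rto:
            return name
    return None  # no tier is strict enough; needs an individual solution
```

In this model, the Christmas card application ends up in the cheapest tier, while the online shop needs the most expensive one.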

The new Challenge in the Cloud: Coverage

When companies only have IaaS workloads in the cloud, nothing changes. Backing up Linux or Windows VMs and DB2 or SQL servers does not change in the cloud. The challenge comes from platform-as-a-service (PaaS) services: Object storage is the new standard storage. Cloud providers provide file shares, serverless functions, and database-as-a-service services. The technical variety of sources to be backed up is higher than ever. And to add additional complexity: most companies rely not on just one cloud provider but on two and more, which all have slightly varying backup functionalities.

Finally, there are software-as-a-service solutions such as Google Workspace, O365, Salesforce, and many less-known ones. Companies must validate for each whether a service has a feature to export and externally backup the data and which kind of backups these vendors provide. A challenging task – and more than one company has to decide whether to accept insufficient backups or migrate to a different SaaS provider.

Figure 1: Backups in the on-premise world versus backup scope in the public cloud

Innovative Cloud-native Backup Features

In the on-premise world, the 3-2-1 pattern was the golden backup rule: three backups, two media types, and one copy kept outside your organization. However, the cloud providers make two recent innovations a commodity in their cloud-native backup features: continuous backup and geo-redundancy. Cloud backup features are a big step forward for many cloud customers compared to their on-premise backups. First, storing backups on the other side of the globe requires, nowadays, just a click on the cloud portal. So, even the smallest company can afford it in the cloud. Second, continuous backups are available for many cloud services, allowing for point-in-time recovery. Do you want to restore the state yesterday at 21:23? Click here, and you are done – no need for the 3-2-1 pattern anymore.
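Conceptually, point-in-time recovery combines a base snapshot with a log of later changes. The following Python sketch shows the idea on a simple key-value state; it is not how any cloud provider implements the feature, and the function name and the log format are assumptions.

```python
from datetime import datetime


def restore_to_point_in_time(snapshot, snapshot_time, change_log, target_time):
    """Rebuild the state at `target_time` from a base snapshot and a log.

    `change_log` is a list of (timestamp, key, value) tuples; a value of
    None records a deletion. Entries after `target_time` are ignored.
    """
    if target_time < snapshot_time:
        raise ValueError("target lies before the base snapshot")
    state = dict(snapshot)
    for ts, key, value in sorted(change_log, key=lambda entry: entry[0]):
        if ts <= snapshot_time or ts > target_time:
            continue
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

Restoring "yesterday at 21:23" then means replaying all logged changes up to that timestamp on top of the last snapshot taken before it.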

In this new world, simple backup tests are obsolete, too. Performing a backup of a VM and trying to see whether the restore works? Not needed anymore thanks to the superior integration of cloud-native backups in the cloud eco-system. Thus, companies can focus on testing the real challenges: restoring complete solutions consisting of a mix of database types and IaaS and PaaS services. Just say “hello” to the future of backups in the clouds!

AWS Backup for IaaS Workloads: The Basics

Use Cases & Technologies

Cloud providers market their platform-as-a-service features, e.g., for AI or IoT. In contrast, many companies currently focus on moving their existing Windows and Linux applications into the cloud. “Lift-and-shift” is the motto. Rearchitecting is a task for later. As you might remember, neither the rise of C nor Java caused the immediate death of COBOL applications. Thus, the ability to back up IaaS workloads is one of the most crucial topics when moving to the cloud – and AWS Backup is Amazon’s cloud-native solution for backups. How does it help?

What does the workload look like, and which use cases should be supported? With the focus on IaaS, the VMs with their disks, file storage, and object storage are in place. The latter is a web service, but object storage is the new main storage variant in the cloud for new components configured or developed in a public cloud environment.

AWS Backup supports a long list of AWS services, including the following IaaS-related ones:

EC2, AWS’s term for VMs
Amazon Elastic Block Store (Amazon EBS), aka block storage for EC2 instances
Amazon Elastic File System (Amazon EFS), a file system solution working with EC2 with Linux operating system, a technology typically needed by many traditional applications
Amazon FSx supporting various file systems, including Windows File Server or NetApp ONTAP
Amazon Simple Storage Service (Amazon S3) as the object storage service

So, this blog post covers how to organize backups for IaaS workloads running on Windows and Linux EC2 instances with attached EBS volumes on which applications run relying on EFS and FSx file systems for classic file storage, plus incorporating S3 buckets as the new dominant storage type in clouds. Figure 1 visualizes this.

Typical technical use cases for backup solutions are:

Ad-hoc Backups, i.e., engineers make a manual copy before complex, risky changes
Periodic Backups to prepare if something goes wrong and one has to restore a recent or long-ago state
Continuous backups for Point-in-Time Recovery (PITR) as a solution covering both use cases above

AWS links to the solutions for 1 and 2 directly on the dashboard page for AWS Backup in the AWS portal (Figure 2), whereas number 3 is part of the configuration options for 2.

A fourth aspect is geo-redundancy, important for disaster recovery events, such as the loss of complete data centers or region-wide blackouts (4). Finally, we look at Frameworks, a governance and compliance tool (5).

AWS Backup Plan

Setting up an AWS Backup Plan in AWS Backup means configuring one or more backup rules (Figure 3, 1) and one or more resource assignments (Figure 3, 2).

There are many ways to configure the various settings for backup rules. Thus, we focus on the settings for two scenarios. The first one is for periodic backups that have to be stored for years and which should be safe even in case of larger catastrophes. Thus, the configuration shown in Figure 4 sets the interval to four hours, whereas the retention time is 10 years, and a copy is stored in a different region than the original backup.

Figure 4: AWS Backup Plan for periodic backups with long retention and copy to another region
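Expressed as data, the first scenario could look like the following Python sketch. The field names follow my understanding of the backup plan structure accepted by the AWS Backup API; the rule name, vault name, and ARN are placeholders, and the helper functions are hypothetical.

```python
def periodic_backup_rule(vault_name, copy_vault_arn, retention_days=3650):
    """Build one rule: every 4 hours, long retention, cross-region copy."""
    lifecycle = {"DeleteAfterDays": retention_days}
    return {
        "RuleName": "every-4h-long-retention",        # placeholder name
        "TargetBackupVaultName": vault_name,
        "ScheduleExpression": "cron(0 0/4 * * ? *)",  # minute 0, every 4th hour
        "Lifecycle": lifecycle,
        "CopyActions": [
            {
                "DestinationBackupVaultArn": copy_vault_arn,
                "Lifecycle": dict(lifecycle),
            },
        ],
    }


def backup_plan(name, rules):
    """Wrap rules into the structure of a backup plan document."""
    return {"BackupPlanName": name, "Rules": list(rules)}
```

With boto3 installed and credentials configured, such a document could presumably be passed as the `BackupPlan` argument of `boto3.client("backup").create_backup_plan(...)`; the sketch itself makes no AWS calls.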

The second scenario is about backups needed for misconfigurations in operations. A wrong patch is deployed, an admin – by mistake – drops production data, etc. For this scenario, a continuous backup is perfect, allowing for point-in-time recovery (Figure 5). Just a warning: the fine print states that the continuous backup is available only for three AWS services: RDS, S3, and SAP Hana on EC2. For other services, the “hourly” periodicity counts. So, when looking at the IaaS services in AWS mentioned in the beginning, PITR is not available for EBS Volumes and EFS File Systems which are assigned to EC2 instances.

Figure 5: AWS Backup Rule with continuous backup for Point-in-Time Recovery (PITR)
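The second scenario and its fine print can be sketched in Python as follows. The set of PITR-capable services is taken from the text above. The 35-day upper limit for continuous backups is an assumption based on my reading of the AWS documentation and should be verified; the function names are hypothetical.

```python
# Services for which the fine print allows continuous backup (from the text).
PITR_CAPABLE = {"RDS", "S3", "SAP HANA on EC2"}

# Assumption: upper retention limit for continuous backups; verify in AWS docs.
MAX_CONTINUOUS_RETENTION_DAYS = 35


def continuous_rule(vault_name, retention_days):
    """Build a backup rule with continuous backup switched on."""
    if not 1 <= retention_days <= MAX_CONTINUOUS_RETENTION_DAYS:
        raise ValueError("retention outside the assumed 1..35 day range")
    return {
        "RuleName": "continuous-pitr",  # placeholder name
        "TargetBackupVaultName": vault_name,
        "EnableContinuousBackup": True,
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }


def recovery_mode(resource_type):
    """How a resource type is protected when assigned to such a rule."""
    if resource_type in PITR_CAPABLE:
        return "point-in-time"
    return "periodic snapshots"
```

Calling `recovery_mode("EBS")` or `recovery_mode("EFS")` returns "periodic snapshots", which mirrors the warning above: a continuous rule does not give these resources point-in-time recovery.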

The resource assignment defines what is in scope for the backup. It allows filtering based on the resource type (e.g., EBS, S3) and tags assigned to resources (Figure 6). Once resource assignments and backup rules are defined, everything is ready, and AWS performs the (next) backup as specified in the rule.

Figure 6: Resource assignments in AWS Backup
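The filtering logic behind a resource assignment can be sketched in plain Python. The inventory format and the function names are hypothetical, and the sketch assumes that a resource must carry all listed tags; how AWS Backup combines several tag conditions depends on how the assignment is defined.

```python
def in_scope(resource, resource_types, required_tags):
    """True if the resource has one of the types and carries all the tags."""
    if resource["type"] not in resource_types:
        return False
    tags = resource.get("tags", {})
    return all(tags.get(key) == value for key, value in required_tags.items())


def select_resources(inventory, resource_types, required_tags):
    """Return the ids of all resources that fall under the assignment."""
    return [r["id"] for r in inventory
            if in_scope(r, resource_types, required_tags)]
```

The advantage of tag-based assignments shows up over time: a new volume tagged like its siblings is picked up by the next backup run without anyone editing the plan.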

Frameworks for AWS Backup

Backups help bring up an application or a complete data center again when something goes severely wrong. However, if backups are missing in such an emergency, this is not just bad luck but a result of insufficient organization, governance, and oversight. AWS helps to prevent such situations by providing Frameworks for AWS Backup. They define the expected state (or configurations) for backups, helping to identify a gap between how the world should be and how it is. It is a pretty exhaustive list, as Figure 7 illustrates.

Figure 7: Configuration options for AWS Backup Frameworks in the AWS Portal
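The idea of comparing an expected state with the actual one can be illustrated with a few lines of Python. The control names and the resource format below are hypothetical; the real controls are predefined by AWS and configured in the portal.

```python
def evaluate_framework(controls, resources):
    """Return {control name: [ids of non-compliant resources]}."""
    findings = {}
    for name, check in controls.items():
        failed = [r["id"] for r in resources if not check(r)]
        if failed:
            findings[name] = failed
    return findings


# Hypothetical controls, each a predicate over one resource description.
CONTROLS = {
    "covered-by-backup-plan": lambda r: r.get("backup_plan") is not None,
    "retention-at-least-30-days": lambda r: r.get("retention_days", 0) >= 30,
    "backup-at-least-daily": lambda r: r.get("interval_hours", float("inf")) <= 24,
}
```

An empty result means that every resource meets every control; anything else is the gap between how the world should be and how it is.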

On-Demand Backups

On-demand backups are easy to configure and start in AWS: Choose the resource type and a concrete instance, make sure that the retention period is set correctly (remember: too many unnecessary backups require a lot of storage), and maybe define the vault to be used for storing the backup, and you are done (Figure 8).

The most notable aspect – besides that there is a single GUI for creating on-demand backups for all relevant services – is that one can back up just one instance of one resource. Sufficient for small applications, but for anything slightly complex, the tag-based definition of what is part of a backup (as known from the periodic backups) is much more efficient.

Figure 8: On-demand Backups in AWS Backup

Additional Backup Services in AWS

AWS unites backup features for all resource types in the single AWS Backup GUI. Still, there are other options to make backups. Amazon Data Lifecycle Manager can back up EBS Volumes. The EFS-to-EFS backup solution is an option for file systems. And S3 has a versioning feature that allows restoring previous versions of an object; versioning even has to be activated for backups. However, from a governance perspective, having all backups in one place – AWS Backup or a 3rd-party solution – eases the governance. And do not forget: AWS Backup comes with this clever and easy-to-use feature for governance, AWS Backup Frameworks. Use it!

Backups for Selected Database-as-a-Service Services in Azure and AWS

Database-as-a-Service (DBaaS) is one step for IT departments toward focusing on software development, configuring and integrating 3rd-party solutions, and maintaining and optimizing application operations rather than spending time with infrastructure and middleware topics. But even DBaaS does not make backups obsolete. Certainly, engineers should not drop production databases, but what if they do it by mistake? Also, DBaaS does not come with a 100% availability guarantee, even though you can minimize risks with redundancy – but not every database might be worth such ongoing costs. So, what are the options?

In the following, we look at backup features for selected DBaaS offerings:

Azure Cosmos DB
Azure SQL Server
Amazon DynamoDB

DBaaS implies that a web service provides database functionality via its API. Engineers and architects can only design backup strategies based on this API. So, what are the concrete features and options for the mentioned services?

Backups for Azure Cosmos DB

The backup options for most Cosmos DB variants – for NoSQL, MongoDB, Apache Cassandra, Table, and Apache Gremlin, excluding PostgreSQL – are the same. For them, the first configuration decision is between continuous and periodic backup. A periodic backup means that at regular points in time, Azure performs a backup (Figure 1, A). Customers can set the periodicity to any value between 1 and 24 hours (Figure 1, B). The retention period can be up to 30 days (Figure 1, C). Azure keeps backups for that long; then Azure deletes them automatically. Alternatively, engineers can configure continuous backups (Figure 1, D). Then, they can roll back to any point in time within the last 7 or 30 days.

Figure 1: Backup Configurations for Azure Cosmos DB
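The configuration choices described above can be summarized as a small validation function in Python. The limits are the ones stated in the text; the function name and the returned dictionary are hypothetical and not part of any Azure SDK.

```python
def validate_cosmos_backup(mode, interval_hours=None, retention_days=None):
    """Validate a backup choice against the limits stated in the text."""
    if mode == "periodic":
        if interval_hours is None or not 1 <= interval_hours <= 24:
            raise ValueError("periodic interval must be between 1 and 24 hours")
        if retention_days is None or not 0 < retention_days <= 30:
            raise ValueError("periodic retention can be at most 30 days")
    elif mode == "continuous":
        if retention_days not in (7, 30):
            raise ValueError("continuous backup covers the last 7 or 30 days")
        interval_hours = None  # not applicable for continuous backups
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "mode": mode,
        "interval_hours": interval_hours,
        "retention_days": retention_days,
    }
```

A request for a 90-day retention fails in both modes, which makes the hard 30-day ceiling of the built-in backups explicit.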

To prepare for more extensive outages, natural disasters, larger electricity blackouts, etc., Azure copies every backup by default to a paired region, such as from Switzerland (North) to Switzerland (West) or from France (Central) to France (South). If configured via CLI, options are geo-, zone-, and locally redundant storage. Restores from the second region are not self-service but require a service ticket and Microsoft support. Thus, larger events might result in a high volume of service requests and heavy delays for the restoration. Geo-redundancy for the service itself (not just for backups) can be necessary for business-critical applications.

Azure Cosmos DB provides a sophisticated, easy-to-use solution for device failures (even the redundancy alone covers most cases) and operational mistakes, i.e., somebody realizes she deleted some data yesterday she did not intend to. There are shortcomings in other use cases:

No support for ad-hoc backups. They are helpful, e.g., before deploying complicated patches that change the data model and data. Continuous backups cover this need, but periodic backups offer no comparable option.
The absolute limit of 30 days retention time can be too short, depending on the risk appetite and the setup. Cyberattackers might be in the system for a while unnoticed. Data manipulations not causing immediate issues might not be covered. Also, operational mistakes go unnoticed if data is only used monthly, quarterly, or yearly for, e.g., dedicated end-of-month, end-of-quarter, or end-of-year processing.

This does not mean that there are no solutions. Exporting data by script or restoring it into a different database instance might work. However, neither is an out-of-the-box feature. They require engineering on the customer side or even a third-party backup solution.
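The retention concern from the list above can be made concrete with a small calculation. The following is a minimal sketch, not an Azure API: the function name and parameters are hypothetical, and it assumes that backups are dense enough (e.g., continuous) that a clean backup exists right before the corruption.

```python
from datetime import datetime, timedelta


def clean_backup_available(corrupted_at: datetime,
                           detected_at: datetime,
                           retention_days: int) -> bool:
    """Return True if a backup from before the corruption still exists.

    Assumption: the provider deletes every backup older than
    `retention_days`, counted back from the moment of detection.
    """
    oldest_kept = detected_at - timedelta(days=retention_days)
    return corrupted_at > oldest_kept


corrupted = datetime(2024, 1, 1)
# Detected after 19 days: a clean backup is still within the 30-day window.
print(clean_backup_available(corrupted, datetime(2024, 1, 20), 30))  # True
# Detected after 45 days: all clean backups have already been deleted.
print(clean_backup_available(corrupted, datetime(2024, 2, 15), 30))  # False
```

The second call illustrates the end-of-month or end-of-quarter case: if nobody looks at the data for more than 30 days, the last clean backup is gone by the time the mistake surfaces.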

Backups for Azure SQL Server

Azure SQL Server is a managed service – or a database-as-a-service – offered by Microsoft. While customers must make storage and server choices (unlike Cosmos, which scales up and down as needed), Microsoft performs (and automates) the database administration, including patching.

Azure SQL Server has point-in-time recovery, i.e., continuous backups enabling the restoration of the data of a specific point in time within the last 1 to 35 days. Internally, Azure performs full backups weekly and differential backups, depending on the configuration, every 12 or 24 hours (Figure 2, A). In addition, Azure backs up the transaction logs every 10 minutes. If the database crashes, a maximum of 10 minutes of processed transactions are lost. In other words, the Recovery Point Objective (RPO) is 10 minutes.

Figure 2: Backup Configurations for Azure SQL

If engineers do not change the backup policy, backups are geo-redundant (Figure 3), i.e., copied to a paired region, as discussed for Cosmos DB. If customers have to rely on the geo-redundant copy, the RPO is 1 hour instead of 10 minutes.

Figure 3: Backup Redundancy Options for Azure SQL

A significant difference to Cosmos DB is the option to keep backups for much longer. A dedicated configuration option exists for how long to keep a weekly, monthly, and yearly backup – and the setting allows one to configure multiple years, not just some weeks.

The Azure documentation states that the restore time is usually “less than 12 hours”, though it might take considerably longer. Business-critical applications cannot rely on having a backup on the other side of the globe. They might have to configure the service itself as geo-redundant.

Backups for AWS Dynamo DB

Dynamo DB is a NoSQL database-as-a-service offering from AWS. Amazon fully integrated Dynamo DB with AWS Backup, AWS’s service for managing backups on an enterprise level. While there might also be legacy options, I focus on this backup functionality. After discussing two Azure services, the idea is to see how AWS approaches backups. And one difference was already mentioned: the full integration into the standard AWS backup solution.

For AWS Dynamo DB backups, the following paragraphs cover three aspects:

Continuous backups for Point-in-Time Recovery (PITR)
Ad-hoc Backups
Periodic Backups

If explicitly enabled, point-in-time recovery based on continuous backups (Figure 4, A) is available in case of an issue. The retention time is 35 days, i.e., a restore to any given point within the last 35 days is possible.

Figure 4: Setting for Point-in-Time Recovery (PITR)

For ad-hoc backups, the first statement is: AWS provides the feature for Dynamo DB (Figure 5); there is no need for a workaround like in Azure. Engineers can configure some details – when it should start, how long it can take, in which vault to store it, and when to move it to cold storage – the most important being the retention time. When should AWS delete the backup? Here, the service allows choosing several years.

Figure 5: Configuring On-Demand Backups for Dynamo DB in AWS

Finally, there is the option to schedule periodic backups. The configuration options are nearly the same as in Figure 5, though engineers must define the periodicity and the backup window for periodic backups. But why periodic backups when there is point-in-time recovery?

The main reason is that the retention time of point-in-time recovery is 35 days, whereas periodic backups can be kept for years. Other, more exotic reasons are potential cost savings or a legacy architectural design from before point-in-time recovery was available.

To conclude: The discussed database-as-a-service offerings from Azure and AWS all come with backup functionality, but the features differ for each of them. Cloud and security architects do well to check the company-internal backup requirements and policies and to validate whether and how they can implement them. This validation is necessary per cloud and service.

Application-Consistent Backups for VM Workloads in the Cloud

Application backups are not as simple as in the world of database lectures. Have you ever heard about the ACID properties with “A” representing “atomicity” and “D” durability? After a database commit, everything is on disk. Nothing can get lost. Plus, all commands within the transaction are on disk – or everything is undone. When a database crashes and restarts, its data reflects precisely the effects of all committed transactions. It works as designed if an application relies on exactly one database. Sadly, applications are more complex, as Figure 1 illustrates. They tend to access not only one database but several in parallel, plus write data to files and file shares – and disks attached to VMs.

Figure 1: Real-life Backup Scenarios for Applications and their VMs

Understanding Application-Consistent Backups and their Benefits

While private users can save their data by manually copying all their files to a different hard disk or the cloud, this approach is too simplistic for larger applications, even if we focus only on VM disks. If it is a large disk, files with names starting with “A” might get copied at 5:00 am and those starting with “Z” at 5:05 am. Thus, the “Z-files” could have been changed between 5:00 am and 5:05 am, making the A-files and the Z-files inconsistent. Furthermore, copying open files causes issues: changes might be only in memory and not yet written to disk, or there could be pending I/O transactions. Thus, a clean-up of files and their data might be necessary when restarting the application using such file copies. The clean-up can be automated or a manual task for engineers. In any case, it prolongs application outages.
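The A-files versus Z-files problem can be reproduced in a few lines. The sketch below is purely illustrative: the file names, the amounts, and the invariant (both files together always hold 100 units) are made up for the example.

```python
def sequential_copy_with_concurrent_write():
    """Copy two 'files' one after the other while the application writes."""
    # The application keeps both files consistent: together they hold 100.
    disk = {"a_accounts.dat": 60, "z_accounts.dat": 40}
    backup = {}

    # 5:00 am: the backup copies the first file.
    backup["a_accounts.dat"] = disk["a_accounts.dat"]

    # 5:02 am: the application moves 10 units from the A-file to the Z-file.
    disk["a_accounts.dat"] -= 10
    disk["z_accounts.dat"] += 10

    # 5:05 am: the backup copies the second file.
    backup["z_accounts.dat"] = disk["z_accounts.dat"]
    return disk, backup


disk, backup = sequential_copy_with_concurrent_write()
print(sum(disk.values()))    # 100, the live data is consistent
print(sum(backup.values()))  # 110, the copy mixes two points in time
```

The backup contains the A-file from before the transfer and the Z-file from after it, so it describes a state that never existed on the live disk.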

Application consistency overcomes this challenge. The idea is to perform backups such that applications run after restarting a VM without any clean-up actions. Thus, business-critical applications benefit from lower downtimes, i.e., they can provide better Recovery Time Objectives. Plus, the organization benefits in crisis events, aka business continuity management situations. When a company has to evacuate all workloads to a different cloud data center, the engineers can rely on the VMs to restart and applications to come up without manual intervention. The engineers can focus on fixing more complex issues, e.g., related to integrations with other components, rather than the complete IT department being stuck with cleaning up file systems.

The most prominent solution for application-consistent backups is Microsoft’s Windows Volume Shadow Copy Service (VSS). Microsoft products come with it, and your organization’s applications (and your third-party software provider) can also implement it for Windows workloads. The exact details of VSS are, however, not so relevant from a cloud security architecture perspective. What matters is the available features for application-consistent backups in AWS, GCP, and Azure.

Application-Consistent Backups in the Cloud

Azure comes with a solution for application-consistent backups for Linux VMs, at least for those deployed with the Azure Resource Manager and not the Service Manager. It is a framework enabling application developers or operations specialists to integrate pre- and post-scripts into Azure’s backup process. Pre-scripts can invoke, for example, APIs of the application to tell the application to finish off all “write” activities. Then, Azure performs the backup copy. Afterward, Azure invokes the post-script, and normal operations continue. For this purpose, a configuration file (Figure 2) must be on all relevant VMs.

Figure 2: Configuration File for Application Consistent Backups of Linux VMs in Azure. Highlighted are the configuration of pre- and post-backup scripts. Other settings are for defining parameters and handling exceptions and failures.
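The sequence of pre-script, backup copy, and post-script can be sketched as follows. This is not Azure's implementation; the function names are hypothetical, and the choice to always run the post-script after a started backup is an assumption made for this illustration.

```python
from typing import Callable, List


def run_backup(pre_script: Callable[[], None],
               take_snapshot: Callable[[], str],
               post_script: Callable[[], None]) -> str:
    """Run pre-script, snapshot, and post-script in this order.

    The post-script runs even if the snapshot fails, so the application
    is not left in its paused state.
    """
    pre_script()
    try:
        return take_snapshot()
    finally:
        post_script()


events: List[str] = []


def pause_writes() -> None:
    events.append("pre: pause writes")


def snapshot() -> str:
    events.append("snapshot")
    return "snap-001"  # made-up identifier


def resume_writes() -> None:
    events.append("post: resume writes")


snapshot_id = run_backup(pause_writes, snapshot, resume_writes)
print(events)
# ['pre: pause writes', 'snapshot', 'post: resume writes']
```

If the pre-script itself fails, this sketch aborts without taking a snapshot. Whether a real backup should continue in that case is a policy decision that belongs in the exception-handling settings mentioned in the caption above.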

The pre- and post-scripts and this configuration file are critical from a security perspective. They run with root privileges. Thus, they must be secured to prevent attackers who have gained access to the VM from changing these settings and executing malicious code as root.
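One way to verify such a lockdown is to check owner and permission bits of the scripts and the configuration file. The following is a minimal sketch for Linux; the function name is hypothetical, and the rule "owned by the expected user, no access for group or others" is an assumption about a reasonable hardening, not a quote from the Azure documentation.

```python
import os
import stat
import tempfile


def is_locked_down(path: str, expected_uid: int = 0) -> bool:
    """True if `path` belongs to `expected_uid` (default: root) and
    grants no permissions at all to group or others."""
    info = os.stat(path)
    no_group_or_other = (info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return info.st_uid == expected_uid and no_group_or_other


# Demonstration with a temporary file owned by the current user,
# because a test file owned by root would require root privileges.
with tempfile.NamedTemporaryFile() as tmp:
    me = os.geteuid()
    os.chmod(tmp.name, 0o700)   # owner only
    strict = is_locked_down(tmp.name, expected_uid=me)
    os.chmod(tmp.name, 0o755)   # readable and executable by everybody
    loose = is_locked_down(tmp.name, expected_uid=me)

print(strict, loose)  # True False
```

On a real VM, the check would run against the actual script and configuration paths with the default `expected_uid=0`.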

The situation for Windows VMs on Azure is much easier compared to the Linux world. By default, all VM backups use the Microsoft VSS service. So, if (and only if!) the applications on the VM implement VSS, all backups are application-consistent without the need for extra VM configurations. If not, the disk backup is not application- but only file-consistent.

Finally, a quick remark on the AWS and the Google Cloud Platform (GCP) features. Both follow the same approach as Azure: pre- and post-scripts for Linux VMs, and Microsoft’s VSS for Windows VMs.

Back to the Big Picture

To conclude: Application-consistent backups reduce downtimes of applications by reducing the work for engineers after crashes or VM evacuations. However, the term application consistency can be misleading. When looking again at Figure 1, it is clear that the consistency between the VM disks and the database backups is not guaranteed. Applications have to cover the challenge that the VM disk backup is from 4:07 am, one database backup is from 4:05, and the second from 4:17. So, even with application-consistent backups, there are still exciting tasks and challenges for engineers in the area of backup and recovery!

Backup Basics: On Backup Types, RTO & RPO

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are essential concepts for business-critical applications. The RTO defines the maximum time an application might be unavailable after a crash – or after a data center burns down (Figure 1). That is the time engineers (or automatic processes) have to start the application on other servers – potentially at a different location – and restore all necessary data from backups.

Figure 1: Understanding RPO, RTO, and Backup Frequencies

The Recovery Point Objective (RPO) defines how much data might get lost at most in such an event. It depends on the backup frequency. If a company makes a backup each night at 4 am, the RPO is 24 hours. Then, if the data center burns down at 4:15 am, only the data from the last 15 minutes is lost. If it burns down at 3 am, the data from 23 hours is lost.
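The two examples follow from a simple subtraction: the time of the failure minus the time of the last backup. Here is a minimal sketch, assuming exactly one backup per day at a fixed hour; the function name and the dates are chosen for illustration only.

```python
from datetime import datetime, timedelta


def data_loss_window(failure: datetime, backup_hour: int = 4) -> timedelta:
    """Time between the most recent daily backup and the failure."""
    last_backup = failure.replace(hour=backup_hour, minute=0,
                                  second=0, microsecond=0)
    if last_backup > failure:
        # Today's backup has not run yet, so yesterday's is the latest.
        last_backup -= timedelta(days=1)
    return failure - last_backup


print(data_loss_window(datetime(2024, 5, 2, 4, 15)))  # 0:15:00
print(data_loss_window(datetime(2024, 5, 2, 3, 0)))   # 23:00:00
```

The actual loss varies with the time of the failure, but it never reaches more than the backup interval of 24 hours, which is why the interval defines the RPO.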

One factor influencing the RTO is the time needed to restore the data from a backup. Network bandwidth and storage are significant factors, but so is the backup type. Restoring from full, incremental, and differential backups takes different amounts of time.

The starting point is always a full backup. The system writes all data to a secure backup storage (Figure 2). So, why would an architect not always go for a full backup? It is simple: Suppose an application is highly critical. Thus, the RPO is 5 min. Let us further assume that it is a massive database, but not much data changes within five minutes. Then, a full backup every five minutes seems too much. It might even be technically impossible, depending on network bandwidth and the amount of data. Incremental backups can be a better solution for such a scenario.
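A back-of-the-envelope calculation shows why a full backup every five minutes can be technically impossible. The numbers below (a 2 TB database and 10 Gbit/s of sustained throughput) are hypothetical, and the sketch ignores compression, protocol overhead, and storage speed.

```python
def full_backup_duration_minutes(data_gb: float,
                                 throughput_gbit_s: float) -> float:
    """Minutes needed to transfer `data_gb` gigabytes at
    `throughput_gbit_s` gigabits per second."""
    seconds = data_gb * 8 / throughput_gbit_s  # 8 bits per byte
    return seconds / 60


duration = full_backup_duration_minutes(data_gb=2000, throughput_gbit_s=10)
print(round(duration, 1))  # 26.7
print(duration <= 5)       # False, the backup does not fit into the RPO window
```

Under these assumptions, a single full backup takes more than five times longer than the five-minute interval allows.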

Figure 2: Understanding Full, Incremental, and Differential Backups

An incremental backup copies away only the most recent changes. If the last full backup was at 4 am, an incremental backup at 4:05 am copies the changes from the last five minutes. The next backup at 4:10 am copies, as well, only the changes of the last five minutes, i.e., the changes between 4:05 am and 4:10 am. Incremental backups have one disadvantage compared to full backups: the restoration time might be longer. To restore the situation of 4:10 am, one has to fully restore the data from 4 am, then apply what changed until 4:05 am, and then the incremental changes until 4:10 am.

A middle way between incremental and full backups is the differential backup. A differential backup copies away all changes since the last full backup. At 4:05 am, a differential backup writes all changes of the last five minutes; at 4:10 am, all changes between 4:00 and 4:10, and at 4:15, all changes between 4:00 and 4:15. Thus, a restore of a differential backup means: restore the full backup and (only) apply the changes of the last differential backup. However, in the cloud, these concepts become largely invisible to customers: the cloud providers implement standard configurations.
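The difference in restore effort between the two backup types can be sketched as follows. The class and function names are hypothetical, and the sketch assumes that a schedule uses either incremental or differential backups between two full backups, not a mix of both.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backup:
    time: str  # "HH:MM", zero-padded so that string order equals time order
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: List[Backup], target: str) -> List[Backup]:
    """Return the backups to apply, in order, to restore the state at `target`."""
    candidates = sorted((b for b in backups if b.time <= target),
                        key=lambda b: b.time)
    # Start from the most recent full backup (raises ValueError if none exists).
    last_full = max(i for i, b in enumerate(candidates) if b.kind == "full")
    later = candidates[last_full + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        # Only the latest differential is needed on top of the full backup.
        return [candidates[last_full], differentials[-1]]
    # Incrementals must all be applied, one after the other.
    return [candidates[last_full]] + [b for b in later if b.kind == "incremental"]


incremental_plan = [Backup("04:00", "full"),
                    Backup("04:05", "incremental"),
                    Backup("04:10", "incremental")]
differential_plan = [Backup("04:00", "full"),
                     Backup("04:05", "differential"),
                     Backup("04:10", "differential")]

print([b.time for b in restore_chain(incremental_plan, "04:10")])
# ['04:00', '04:05', '04:10']
print([b.time for b in restore_chain(differential_plan, "04:10")])
# ['04:00', '04:10']
```

With incremental backups, the chain grows with every backup since the last full one; with differential backups, it always has at most two elements, at the price of larger backups.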

The Azure Backup Center, for example, enables customers to configure several backups per day for VMs based on Recovery Services vaults, e.g., every 4 hours. The first one is a full backup; the subsequent backups are incremental (Figure 3). Microsoft has made this decision for its customers. However, these concepts are still crucial for architects and engineers if they implement their own solutions or use third-party software with sophisticated configuration features. Then, they have to optimize and experiment themselves to identify an appropriate balance between …

backup frequency and full versus incremental backups, and
the time needed to restore a backup, which affects the RTO.


Klaus Haller's Homepage and Publications