The shift toward online digital conferences has prompted Microsoft to reconvene Ignite about six months early this year. Across the data and analytics announcements, the overriding theme is extending the reach of the Azure data platform portfolio.

For data and analytics, the headlines on this go-round include a new Azure Managed Instance for Apache Cassandra; support for a MongoDB 4.0 API in Azure Cosmos DB; the general availability of Azure Synapse Link for Cosmos DB; and some enhancements to the Azure Cache for Redis offering. Microsoft is also introducing new tools for data warehouse users to automate their migration to Azure Synapse Analytics. On the hybrid cloud front, there are several announcements for the software-defined hybrid platform Azure Arc, including support for Kubernetes (K8s) and the addition of Azure Machine Learning to the small but growing stable of Azure services available on Arc.

We’re splitting data and analytics coverage into two parts. We’ll focus on the data platforms and the K8s support on Arc, while Big on Data bro Andrew Brust will turn the spotlight on Power BI, Azure Purview, and Azure Machine Learning.

Now let’s get down to business.

ADDING ANOTHER CASSANDRA CLOUD PATH

Microsoft is announcing the preview of a new lift-and-shift option for Cassandra customers: Azure Managed Instance for Apache Cassandra. It mirrors what Azure SQL Managed Instance offers SQL Server customers: it is designed to replicate the customer’s environment as a single-tenant implementation, but delivered as a partially managed cloud service in which Azure picks up server provisioning, software maintenance, and automatic backups.

Managed Instance joins Azure Cosmos DB in presenting a second path for Cassandra users. The two are very different services, although Microsoft is also providing a migration path that could allow Managed Instance to act as a steppingstone to Cosmos DB if the customer wants it. There is a baseline similarity, however, as both support multi-AZ and multi-region deployment.

The differences start at the storage engine: Managed Instance is a pure implementation of Apache Cassandra, whereas Cosmos DB has its own canonical storage engine that supports a compatible implementation via API, in a manner akin to how AWS delivers Amazon Keyspaces. In fact, Cassandra is one of many data models available through Cosmos DB, where Microsoft offers a selection of APIs. Consistency is another differentiator: in Managed Instance, customers set consistency the way they would with the Cassandra tooling that they already use, whereas in Cosmos DB, there are five preset consistency options. And of course, there is the deployment environment: designed to replicate the customer’s on-premises environment, Managed Instance is a single-tenant (or bare metal) implementation, whereas Cosmos DB is multi-tenant. There are other subtle differences as well.
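
To make the consistency contrast concrete, here is a minimal sketch in Python. It is not Microsoft's or Apache's code and calls no real driver; the class and function names are hypothetical. It illustrates the standard replica-overlap rule that Cassandra users reason about when they tune consistency per operation (a read reflects the latest acknowledged write when read replicas plus write replicas exceed the replication factor), alongside the five named levels that Cosmos DB documents.

```python
from dataclasses import dataclass

# The five consistency levels Cosmos DB documents, strongest to weakest.
COSMOS_DB_CONSISTENCY_LEVELS = (
    "Strong",
    "Bounded staleness",
    "Session",
    "Consistent prefix",
    "Eventual",
)


@dataclass(frozen=True)
class CassandraSetting:
    """Hypothetical holder for a Cassandra-style tunable consistency choice."""
    replication_factor: int  # replicas per partition (N)
    write_replicas: int      # replicas that must acknowledge a write (W)
    read_replicas: int       # replicas that must answer a read (R)

    def reads_see_latest_write(self) -> bool:
        # The read set and write set are guaranteed to overlap when R + W > N.
        return self.read_replicas + self.write_replicas > self.replication_factor


def quorum(replication_factor: int) -> int:
    """Majority of replicas, as in Cassandra's QUORUM level."""
    return replication_factor // 2 + 1


rf = 3
quorum_both = CassandraSetting(rf, quorum(rf), quorum(rf))
one_both = CassandraSetting(rf, write_replicas=1, read_replicas=1)

print(quorum_both.reads_see_latest_write())  # True: 2 + 2 > 3
print(one_both.reads_see_latest_write())     # False: 1 + 1 <= 3
```

The point of the sketch is the difference in shape: the Cassandra side is arithmetic the operator chooses per workload, while the Cosmos DB side is a pick from a fixed menu.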

As noted, Managed Instance is designed for customers who either want to take advantage of the cloud to simplify running Cassandra or need a waystation for moving their implementation to Cosmos DB. For the latter scenario, they can use a managed replication connector to replicate Cassandra data into Cosmos DB.

The addition of Managed Instance is the latest example of the growing richness of Cassandra cloud services, which is a very recent phenomenon. Although Apache Cassandra has been one of the most popular databases as ranked by DB-Engines, until the past year it had not gotten much love in the cloud. Apache Cassandra was known as a highly robust, scalable, write-centric operational database suited for global deployments. But it was not the easiest of databases to set up and operate, because building and running a distributed database at scale is hard. There’s a good reason why distributed databases have either been the domain of tech giants with their in-depth IT resources or been delivered as managed services by cloud providers. But when it came to DBaaS, in the early days of cloud distributed databases, DynamoDB was the primary game in town.

The first shots came from niche players like Aiven and Instaclustr, which were first to deliver managed cloud services for Apache Cassandra. They were followed by Microsoft and AWS, which provided Cassandra support in managed DBaaS services via API with Azure Cosmos DB and Amazon Keyspaces, respectively. After all that, DataStax unleashed Astra, which includes Apache Cassandra as part of an implementation of DataStax Enterprise, and which just became available as serverless. Microsoft’s addition of Managed Instance reinforces the diversity of Cassandra choices: single- and multi-tenant, server-based and serverless deployment, and native vs. API-compatible engine.

MONGODB 4.0 API AND ANALYTICS SUPPORT FOR COSMOS DB

Speaking of Azure Cosmos DB, Microsoft is announcing a string of updates for its multimodel database focusing largely on cross-platform connectivity and security. The highlight is the release of a new MongoDB 4.0-based API. It was cleanroom-engineered by Microsoft and does not include any MongoDB software.

With the 4.0 API, Azure Cosmos DB narrows the feature gap with MongoDB Atlas, which until now has been the only cloud DBaaS with this support. That translates to features such as multi-document transaction support, retryable writes, and new aggregation operators. The idea is to make Cosmos DB a more viable alternative to MongoDB Atlas for customers looking to move their current MongoDB deployments to a managed DBaaS cloud service.

Normally, the announcement of a new API version would not stir notice, but this one is a bit exceptional. Microsoft’s announcement marks the first time that we have seen a 4.x generation MongoDB API available from a third party. Until now, the hurdle has been licensing. Starting with 4.0, MongoDB curtailed the use of the API with the Server Side Public License (SSPL), which prohibited use by other cloud providers unless they bought a commercial license. The SSPL ushered in a wave of reckoning among open source database providers concerned about cloud providers profiting off their IP, which we have covered in these pages ad infinitum.

The practical result of these more restrictive licenses has been a forking of open source database implementations. The creator carries the latest features, with third parties taking one of several paths: do without the latest capabilities; kick off their own competing open-source projects; or conduct cleanroom engineering. The last of these is the path that Microsoft has taken with Cosmos DB, and as of now, it is the first third party to support the MongoDB 4.0 API.

Other Azure Cosmos DB announcements include the general availability of Azure Synapse Link for Cosmos DB, which we covered when the preview was unveiled. It uses a change-data-capture (CDC)-like mechanism to intercept and replicate Cosmos DB updates to Synapse, where the data can be used for analytics; it is done without impacting Cosmos DB performance.
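
As a rough illustration of what a CDC-like mechanism means, the Python sketch below replays a change log into a separate analytical copy. It is a toy under stated assumptions, not how Synapse Link is actually implemented: the OperationalStore and AnalyticalCopy classes and their methods are hypothetical names invented for this example. The idea it shows is that the analytical side consumes only the recorded changes since its last sync instead of querying the operational data directly.

```python
from dataclasses import dataclass, field


@dataclass
class OperationalStore:
    """Hypothetical stand-in for the transactional database."""
    rows: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)

    def upsert(self, key, value):
        self.rows[key] = value
        self.change_log.append(("upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.change_log.append(("delete", key, None))


@dataclass
class AnalyticalCopy:
    """Hypothetical stand-in for the analytics-side replica."""
    rows: dict = field(default_factory=dict)
    applied: int = 0  # number of change-log entries already replayed

    def sync(self, source: OperationalStore) -> None:
        # Replay only the entries not yet applied; source.rows is never scanned.
        for op, key, value in source.change_log[self.applied:]:
            if op == "upsert":
                self.rows[key] = value
            else:
                self.rows.pop(key, None)
        self.applied = len(source.change_log)


store = OperationalStore()
copy = AnalyticalCopy()

store.upsert("order-1", {"total": 40})
store.upsert("order-2", {"total": 15})
copy.sync(store)

store.delete("order-1")
store.upsert("order-2", {"total": 20})
copy.sync(store)

print(copy.rows)  # {'order-2': {'total': 20}}
```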

This is a logical next step for cloud providers as they build more synergies across their broadening database portfolios; they are, after all, supposed to help enterprises break down silos to fulfill their promise of operational simplification. For instance, late last year, we spotlighted AWS Glue Elastic Views, which also takes a CDC-like approach to stream updates from source to target, including a mix of relational and nonrelational sources (e.g., DynamoDB) and targets (e.g., Redshift).

Incremental enhancements include previews of several features to harden and secure Cosmos DB to levels well-established in the transaction database world. They include continuous database backup and point-in-time restores, and role-based access.

DOUBLING DOWN ON REDIS ENTERPRISE

Like AWS and Google, Microsoft has long offered a basic implementation of the open-source Redis in-memory data store as a cache. But, like Google, Microsoft has also partnered with Redis Labs to offer a jointly branded and supported service for Redis Enterprise, a fuller-featured offering that can be used as a multi-model database or message broker, in addition to its core caching function. The Enterprise and Enterprise Flash tiers of Azure Cache for Redis that are jointly offered with Redis Labs are now generally available and come with enhancements such as real-time search, time-series data management, and support for cache sizes up to 10x larger than the standard Azure Cache service. Additionally, active geo-replication has been announced for preview.

A NEW ONRAMP FOR AZURE SYNAPSE ANALYTICS

Cloud providers want to re-platform your database, so it’s not surprising that each offers tools for key tasks such as schema conversion. Microsoft already offers Azure Database Migration Service, which handles that task. Some are more than happy to perform emulations so you can keep your SQL, but run it on their platform. For instance, late last year, AWS unveiled a preview of Babelfish for Aurora PostgreSQL, which will let SQL Server customers continue using T-SQL by running it through translation on Amazon Aurora PostgreSQL.

Now Microsoft is taking the opposite tack, providing a tool for automating the conversion of different flavors of SQL and schema to help you migrate from rival data warehouses to Azure Synapse Analytics. Now in preview, Azure Synapse Pathway takes SQL code from Netezza, SQL Server, and Snowflake (and soon, Teradata, BigQuery, Redshift, and others) and automatically converts it to T-SQL. Microsoft claims Azure Synapse Pathway could translate over 100,000 lines of code in minutes. It has not made claims (yet) about degree of SQL coverage, so we don’t know how much functional coverage it will have, but we assume that coming out of the box, it should automate the translation of the most common SQL calls.
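
For a sense of what automated dialect translation involves, here is a deliberately tiny Python sketch. It is not Azure Synapse Pathway and makes no claim about how that tool works; the rule list and the to_tsql function are hypothetical. It assumes a single, simple SELECT statement and rewrites three constructs common in other warehouse dialects (NVL, CURRENT_DATE, and a trailing LIMIT) into T-SQL equivalents. A real translator would parse the SQL rather than pattern-match it, since regular expressions cannot safely handle string literals, subqueries, or nested statements.

```python
import re

# Hypothetical rewrite rules: (source-dialect pattern, T-SQL replacement).
_FUNCTION_RULES = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
    (re.compile(r"\bCURRENT_DATE\b(?!\s*\()", re.IGNORECASE),
     "CAST(GETDATE() AS DATE)"),
]

# Matches one simple statement of the form SELECT [DISTINCT] ... LIMIT n
_TRAILING_LIMIT = re.compile(
    r"^\s*SELECT\s+(?P<distinct>DISTINCT\s+)?(?P<rest>.*?)"
    r"\s+LIMIT\s+(?P<n>\d+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def to_tsql(sql: str) -> str:
    """Toy translation of a single simple SELECT statement to T-SQL."""
    for pattern, replacement in _FUNCTION_RULES:
        sql = pattern.sub(replacement, sql)
    match = _TRAILING_LIMIT.match(sql)
    if match:
        # T-SQL uses TOP after SELECT (and after DISTINCT) instead of LIMIT.
        distinct = "DISTINCT " if match.group("distinct") else ""
        sql = f"SELECT {distinct}TOP {match.group('n')} {match.group('rest')}"
    return sql


source = "SELECT NVL(name, 'n/a') FROM customers ORDER BY name LIMIT 5"
print(to_tsql(source))
# SELECT TOP 5 COALESCE(name, 'n/a') FROM customers ORDER BY name
```

Even this handful of rules hints at why coverage claims matter: each dialect has hundreds of such constructs, and the long tail is where automated conversion tends to need manual review.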

KUBERNETES FRENZY ON AZURE ARC

This week, Microsoft has tied up another loose end with Azure Arc, its software-defined hybrid cloud platform, announcing the GA release of Azure Arc-enabled Kubernetes. K8s has been a supported feature of Arc, but unlike Red Hat OpenShift or Google Cloud Anthos, Arc has not been solely defined by K8s, or by any specific K8s implementation for that matter. Azure Arc supports K8s, but also supports more traditional regimes such as physical and virtual servers, both Linux and Windows, running on Hyper-V or VMware.

The significance of K8s support is that it simplifies autoscaling and security in any cloud environment. It does so through standard approaches to orchestrating all the tasks necessary for operating a cloud-native environment that runs on containers, with data and application functions refactored as microservices. All Azure cloud services that currently run on Arc, including Azure Data Services and, as Andrew has covered, Azure Machine Learning, require a K8s environment.

This week’s GA announcement for K8s on Arc clears the way for Azure services to in turn become production-ready on the hybrid platform. That’s an obvious first step, but for now, organizations seeking to take advantage of Azure Arc-enabled Kubernetes must have internal knowledge of how to create and onboard K8s clusters. We’re looking forward to the day when Microsoft buries this in a black box, clearing the way for IT organizations that may not have K8s experts to run these services within their own data centers, or conceivably, some other public cloud.

Disclosure: AWS, DataStax, and Microsoft are dbInsight clients.