Updated: Jul 5
The Data + AI Summit is an annual conference hosted by Databricks, with hundreds of sessions. Here are the top seven announcements that stirred the audience: a quick recap of what's new in Databricks.
If you are a fan of open-source Apache Spark and Delta Lake but don't want the hassle of managing the platform, Databricks fills that gap as a managed service. While Databricks serves many personas, data processing and machine learning are the heart of the platform, so it makes sense for them to host the Data + AI Summit.
This year's summit was hybrid: part in-person, part virtual. Interestingly, all the sessions were available to watch on-demand the following day. While I didn't get a chance to see all 100+ sessions, I did focus on what matters. So here are some of the announcements that caught my interest.
Spark Streaming Gets Faster
If you are already using Spark Streaming, get ready for an upgrade. Karthik, Director of Streaming at Databricks, illustrated how they intend to make Spark Streaming even faster. They aim to double the speed of streaming and to ship enhancements such as arbitrary stateful processing. I'm a little disappointed that they didn't include Azure connectors, but I suppose we can make it work with Azure Functions or Logic Apps.
Spark Connect
Reynold Xin (@rxin), the cofounder of Databricks, announced Spark Connect, a major upgrade to Apache Spark that exposes its features through a thin client API. Think of it this way: you don't have to install Spark on your local machine; instead, you just install the thin client and perform all Spark functions remotely. It's a game changer, especially where enterprises are building apps on edge devices, which are limited by storage and compute. I am curious to see how organizations will leverage this feature.
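The client-server split can be sketched in plain Python. This is a toy illustration of the thin-client idea, not Spark Connect's actual API or wire protocol: the lightweight client only builds and serializes a logical plan, and a remote server (standing in for the cluster) does the heavy execution.

```python
# Toy illustration of a thin client: build a plan locally, execute remotely.
# (Hypothetical sketch, not the real Spark Connect protocol.)
import json

class ThinClient:
    """Builds a logical plan locally; no heavy engine is installed."""
    def __init__(self):
        self.plan = []

    def filter(self, column, value):
        self.plan.append({"op": "filter", "col": column, "val": value})
        return self

    def select(self, *columns):
        self.plan.append({"op": "select", "cols": list(columns)})
        return self

    def to_wire(self):
        # Serialize the plan to send over the network.
        return json.dumps(self.plan)

class RemoteServer:
    """Stands in for the cluster that actually executes the plan."""
    def __init__(self, rows):
        self.rows = rows

    def execute(self, wire_plan):
        rows = self.rows
        for step in json.loads(wire_plan):
            if step["op"] == "filter":
                rows = [r for r in rows if r[step["col"]] == step["val"]]
            elif step["op"] == "select":
                rows = [{c: r[c] for c in step["cols"]} for r in rows]
        return rows

client = ThinClient().filter("country", "US").select("name")
server = RemoteServer([
    {"name": "a", "country": "US"},
    {"name": "b", "country": "DE"},
])
print(server.execute(client.to_wire()))  # [{'name': 'a'}]
```

The point of the sketch is the asymmetry: the client is a few kilobytes of plan-building code, while all storage and compute live on the server side, which is exactly what makes the model attractive for constrained edge devices.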
Delta Lake 2.0
Simply put, Delta Lake manages versions of data snapshots. Here is an example: let's say a nightly job takes a full snapshot of table A, which is 1 TB, and drops it into a data lake. In a year, that's 365 TB, plus a ton of compute to compare the "day 1" file with the "day 100" file to identify inserts, deletes and changes. To make it even more complex, there could be hundreds of such tables.
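That brute-force comparison might look like this in miniature (a hypothetical plain-Python sketch; a real job would run this in Spark over terabytes, re-scanning both full snapshots every time):

```python
# Hypothetical sketch: diff two full snapshots of table A, keyed by id.
# Every comparison must re-read both complete files.
day_1 = {1: "alice", 2: "bob", 3: "carol"}
day_100 = {2: "bob", 3: "carla", 4: "dave"}

inserts = {k: v for k, v in day_100.items() if k not in day_1}
deletes = {k: v for k, v in day_1.items() if k not in day_100}
changes = {k: v for k, v in day_100.items()
           if k in day_1 and day_1[k] != v}

print(inserts)  # {4: 'dave'}
print(deletes)  # {1: 'alice'}
print(changes)  # {3: 'carla'}
```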
Delta Lake simplifies all this. Behind the scenes, it maintains versions, which drastically reduces storage and gives you the ability to query any version, like this:
spark.sql("SELECT * FROM default.table_a TIMESTAMP AS OF '2022-06-29 00:37:58'")
But many of these features were proprietary to Databricks. That changed with the introduction of 2.0: Delta Lake and the entire project are now open source. Here is a list of all the features/enhancements available in Delta 2.0.
Unity Catalog is Generally Available
Unity Catalog is a centralized governance and data-lineage tool offered by Databricks for Spark jobs. I have heard a lot about it from Solution Architects, but seeing it in action intrigues me even more. I have my doubts, though:
Governance is an enterprise responsibility; I don't believe this solves the enterprise governance problem.
Why haven't they used an open-source product like Apache Atlas to build it?
Databricks Marketplace
If you are aware of data platforms like data.world, Databricks Marketplace is similar. Think of a situation where you need data that is not available in your organization. Wouldn't it be easy if you could just pull it out of thin air? That is exactly what Databricks Marketplace offers. I need to explore more about this marketplace: how it's curated and how to monetize it :)
Marketplaces run into two key issues as they grow:
How to trust a product (aka dataset)?
How to pick the right product from hundreds of similar data products, especially as the marketplace matures and vendors compete?
Time will tell how this will be solved.
Upgrade to Delta Live Tables
Delta Live Tables, or DLT, is data and data infrastructure as code: you manage data with code. Data scientists write SQL, and everything else around the code is taken care of by the platform. They don't have to build the infrastructure or even know Spark.
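The declarative idea can be sketched as a toy in plain Python. This is not the real DLT API (actual pipelines use SQL `CREATE LIVE TABLE` statements or the Python `@dlt.table` decorator); the point is that you declare *what* each table is and which tables it depends on, and the framework works out the execution order.

```python
# Toy sketch of declarative pipelines: register table definitions,
# then let the runner resolve dependency order and execute them.
# (Hypothetical illustration, not the actual DLT framework.)
TABLES = {}
RESULTS = {}

def table(*, depends_on=()):
    def register(fn):
        TABLES[fn.__name__] = (fn, depends_on)
        return fn
    return register

@table()
def raw_orders():
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]

@table(depends_on=("raw_orders",))
def clean_orders():
    # Downstream table reads the already-materialized upstream result.
    return [r for r in RESULTS["raw_orders"] if r["amount"] > 0]

def run_pipeline():
    RESULTS.clear()
    done = set()
    while len(done) < len(TABLES):
        for name, (fn, deps) in TABLES.items():
            if name not in done and all(d in done for d in deps):
                RESULTS[name] = fn()  # run only when dependencies are ready
                done.add(name)
    return RESULTS

print(run_pipeline()["clean_orders"])  # [{'id': 1, 'amount': 10}]
```

Notice that the author of `clean_orders` wrote only a filter; ordering, orchestration and materialization are handled by the runner, which is the division of labor DLT promises.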
DLT is getting several enhancements, including new performance optimizations, Enhanced Autoscaling and change data capture (CDC). CDC makes the platform compatible with slowly changing dimensions, allowing them to be updated incrementally, rather than rebuilt from scratch, when dimensional hierarchies change.
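To illustrate what incremental CDC buys you, here is a hypothetical sketch of applying a feed of change events to a dimension table as in-place upserts and deletes, instead of reloading the whole table from a fresh snapshot:

```python
# Hypothetical sketch: apply CDC events (insert/update/delete) to a
# dimension table keyed by id; only changed rows are touched.
dim_customer = {
    1: {"name": "alice", "city": "Austin"},
    2: {"name": "bob", "city": "Boston"},
}

cdc_events = [
    {"op": "update", "id": 2, "row": {"name": "bob", "city": "Berlin"}},
    {"op": "insert", "id": 3, "row": {"name": "carol", "city": "Cairo"}},
    {"op": "delete", "id": 1},
]

for event in cdc_events:
    if event["op"] == "delete":
        dim_customer.pop(event["id"], None)
    else:
        # Insert and update are both upserts against the key.
        dim_customer[event["id"]] = event["row"]

print(sorted(dim_customer))  # [2, 3]
```

Three events touched three rows; the other rows of the dimension (and everything downstream of them) were left alone, which is the cost saving versus a from-scratch rebuild.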
Upgrades to MLflow
The open-source project MLflow forms the backbone of Databricks’ MLOps capabilities. Although proprietary components, including the Databricks feature store, exist, the MLflow-based functionality covers the execution and management of machine learning experiments, as well as a model repository with versioning. Databricks announced MLflow 2.0, which adds a major new feature called Pipelines: templates for setting up ML applications so that everything is ready for productionalization, monitoring, testing and deployment.
The templates — based on code files and Git-based version control — are customizable and allow monitoring hooks to be inserted. Although based on source code files, developers can interact with Pipelines from notebooks, providing a good deal of flexibility. Adding Pipelines should be a boon to the industry, as numerous companies, including all three major cloud providers, have either adopted MLflow as a standard or documented how to use it with their platforms.
I believe Databricks is a great platform, and they have made solid updates to it given their growth and funding. Whether you use Databricks, Apache Spark or Delta Lake, the new upgrades are worth leveraging, or at least exploring.
I am passionate about data and have been working with and exploring data tools for over 16 years now. Data technology amazes me, and I continue learning every day. You may or may not use Databricks, but I am certain that you use data in one way or another. These technologies might come in handy for you someday. I hope you learned something from this post.
If you are curious to learn more about Databricks and how your team/organization will benefit from it, feel free to reach out. If you are new to Data Engineering here is an article on How to become a Data Engineer. Happy to help!
Link to the on-demand conference: