In recent years, Databricks has stirred up a lot of buzz in the technology space. If you have asked yourself, “Why would I use Databricks?”, you are not alone: many organizations have been directed to leverage the toolset as the new “Modern Data Warehousing” solution in the Azure arsenal. Because several of our clients have chosen to get on board with Databricks as a faster, easier alternative to traditional data warehousing, we wanted to take a look at the key use cases for Databricks, both as recommended by Databricks itself and as we are seeing them play out in real-world scenarios. If you are new to Databricks, this article is targeted at folks with a technology background in data who are just getting familiar with the product and the different ways it can be leveraged as part of your modern data estate.
Please note that our organization (Bridgeworks Consulting Group) does offer implementation services leveraging Databricks and other Azure tools. Although we compare the toolset to other Azure products, this should not be considered an exhaustive analysis of comparable competitor or non-Microsoft tools on the market. This article speaks only to where Databricks could fit into your existing Microsoft/Azure-based suite of tools as part of a Modern Data Platform.
Now that we have covered our goals and how Bridgeworks fits into this ecosystem, let us dive into Databricks and its advertised use cases. If you browse to Databricks’ home page and then to solutions by use case, you will find the main use cases Databricks identifies as the best ways to leverage their product offering:
- Just-in-Time Data Warehousing
- Machine Learning and Graph Processing
- Cybersecurity Analytics
- Deep Learning
- IoT Analytics
- GDPR Data Subject Requests
At a headline level, when you read through these use cases on the website, a few key points emerge.
- First, Databricks is targeted at companies that are beginning to acquire larger amounts of data than they have handled in the past, and that are starting to experience challenges processing these large volumes of data with traditional data techniques.
- Second, these use cases do not tightly align with common/traditional data warehousing terms such as Extract, Transform and Load (ETL), Data Marts, Operational Reporting, etc.
- Third, this technology ships with capabilities that are considered “required” in today’s world, such as GDPR (General Data Protection Regulation) compliance. Such compliance is now expected of “newer” technologies and is becoming a huge focus of data solutions.
If you are data-savvy, I am sure you can draw other inferences from this list, but for the purposes of this article, we will focus on these areas of Databricks and how it differs from your current or past data shop.
Now let’s take a deeper dive into Machine Learning and Graph Processing, Cybersecurity Analytics, Deep Learning and IoT Analytics with Databricks. You may think that examining each of these areas in detail would require another 50-100 pages in this article, and you would be right. However, if we focus on what all these things have in common, you will notice something very important about Databricks: all of these capabilities are enabled by, or require, large amounts of data (i.e., Big Data). With that in mind, our Databricks use cases begin to take better shape:
- Machine Learning and Graph Processing take advantage of the underlying Spark technologies to build, tune and deploy solutions that typically leverage large amounts of information, both in terms of growth and historical data, to create advanced models that identify trends and provide advanced analytics and visualizations.
- High Volume/Data Size Scaling: By leveraging the capabilities built into Azure to dynamically scale clusters, storage and compute, Databricks can be easily extended without the heavy lifting you might be accustomed to in an Infrastructure as a Service (IaaS) offering. Most of the scaling is configurable and automated within Azure Databricks clusters and works seamlessly to provide uninterrupted service during peak times.
- Implementation: Developers can also extend Machine Learning capabilities using a host of languages, including Python, R, Scala and SQL, which makes adoption of solutions within Databricks notebooks easier for a wide range of developer groups.
- The Cybersecurity Analytics offering provides a platform that can process petabytes of real-time streaming and historical threat data to provide threat analysis and review. Leveraging the machine learning tools mentioned above, security teams and developers can create models that help identify threats and adapt to new situations regularly.
- Deep Learning focuses on leveraging your libraries of images, text documents and voice recordings to provide a host of solution possibilities, both for your reporting and analytical needs and for potential transactional system applications. Databricks brings these technologies into a single platform to provide image classification, object detection (e.g., facial recognition) and natural language processing. This lets developers implement valuable solutions that derive business insights from what is typically the most “unused” set of information within an organization.
- IoT Analytics has become a hot topic in the past few years. Connecting to devices such as drones, trucking fleets, building sensors, video feeds and more has become the norm in today’s world. For organizations that have been able to tap into and consolidate this information, a wealth of new capabilities has been unlocked. Databricks can plug into real-time streaming data from IoT devices and provide a scalable solution with actionable insights for business needs.
- GDPR Data Subject Requests, at a headline level, allow individuals in the European Union to request what personal data has been collected about them and how that data is being used. They can also request to have that data changed, restricted or even erased. Databricks provides a tool (Databricks Delta) to sift through huge amounts of data, quickly find the related information, and then perform transactional operations on the search results, such as Delete, Update or Merge. Note: this does require that you first ingest the data you want to search into what are called Delta Tables, and you may still need additional downstream operations against the original data source after the fact.
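The Delete, Update and Merge operations mentioned above correspond to standard Delta Lake SQL commands that you can run from a Databricks notebook. As a minimal sketch (the table `customer_events`, the staging table `gdpr_requests`, and the column names are hypothetical, invented purely for illustration), a data subject request might look like:

```sql
-- Erase all records for one data subject ("right to erasure").
DELETE FROM customer_events WHERE customer_id = 'C-1042';

-- Or rectify a data subject's records in place.
UPDATE customer_events
SET email = 'redacted@example.com'
WHERE customer_id = 'C-1042';

-- MERGE can apply a whole batch of subject requests in one pass,
-- assuming a staging table of pending requests.
MERGE INTO customer_events AS t
USING gdpr_requests AS r
  ON t.customer_id = r.customer_id
WHEN MATCHED AND r.action = 'erase' THEN DELETE;
```

Keep in mind that these commands operate on the Delta table’s current version; physically removing the deleted records from the underlying storage files additionally requires Delta’s `VACUUM` operation once the retention window allows it.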
If you have been following so far, I want to point out some key terms from above to remember:
- Large Amounts of Information
- Real-Time Streaming
- Historical Data (Petabytes)
- Unstructured Data (e.g., text files, voice recordings, images, etc.)
If you are a traditional data warehousing specialist, you might be saying, “Well, what happened to the ‘Just-in-Time Data Warehousing’ you mentioned above? Surely that has to do with my current environment architecture and structured (relational database) data, right?” When you look at the Just-in-Time Data Warehousing offering, what you will see is a restatement of many of the tools already mentioned for handling large-scale data (e.g., distributed cluster scaling on demand, scaling of compute and storage separately, support for multiple developer languages like R, Scala, Python and SQL, machine learning, graph processing, etc.). These capabilities and technologies are typically associated with a big data environment.
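To make the “distributed cluster scaling on demand” point concrete: in Azure Databricks this is usually just configuration, not code. A sketch of a cluster definition with autoscaling enabled, using the field names from the Databricks Clusters API (the name, node type and runtime version shown are placeholder values, not recommendations), looks roughly like:

```json
{
  "cluster_name": "example-autoscale-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30
}
```

With a configuration like this, the cluster grows toward `max_workers` under load and shrinks back (or terminates after idling) on its own, which is the “no heavy lifting” contrast with managing IaaS virtual machines yourself.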
So, what gives? Does this mean that the traditional relational data warehouse is no longer viable and that we all need to move to a Delta Lake or Lakehouse architecture? Some schools of thought answer that question with a resounding “Yes”. However, when you read the documentation and investigate further, the software companies are fairly clear about when to use these “new technologies” (see the key terms above). Why, then, are we seeing a host of clients that are confused and trying to leverage Databricks to answer all their questions, even when they are only analyzing thousands of records from a relational database? What we are seeing with our clients is that they have been led to believe that the traditional data warehouse is being phased out and that leveraging new technologies like Databricks is the “new” and “more efficient” way to build a modern data platform.
As with any “newer” toolset, there is always confusion, but what is happening with Databricks appears to go beyond the usual learning curve of new tools and terminology. In our experience, a number of organizations are being sold on the Lakehouse or Delta Lake architectures without much focus on whether they need them or whether they will solve their problems. In some cases, when incorrectly applied, we have seen these architectures produce a net result that is worse than the existing solutions they replaced.
In the next article in this series, we will explore where Databricks fits into an organization that might not have petabytes of data, machine learning, streaming data or IoT use cases. We will also explore what developer training, retooling and adoption timelines to expect if you plan to pick up a toolset like Databricks in a traditional relational database shop. Finally, we will compare and contrast Databricks with Microsoft’s Synapse release to determine when its capabilities may also be a viable alternative.