In the first part of this blog series, I discussed the traditional data warehouse and some of the issues that can occur in the period following the initial project.
I concluded by outlining the 5 common challenges that I hear from customers.
This blog will focus on the fourth issue: OLD TECHNOLOGY
In one of the previous post, we discussed the topic of complex architecture for traditional data warehouse.
For this post, the focus will be on technology components that come together to support these architectures and also explore some of the difficulties experienced over time.
Let's refer back to the conceptual architecture for the traditional warehouse presented in the previous post.
Hardware - the platform customers use are typically wintel-based servers and hosted on-premise. These hardware are generally commodity items that are a combination of processor(s), memory, storage and networking equipment.
Software - this article will focus on software used to manage the ETL process and the subsequent storage and optimisation of the data.
This allows us to look at the architecture above in a slightly different way. Please also note that the blog series is discussing about the traditional data warehouse and hence the Front End Analytics layer will not be deliberated at this point.
The diagram above illustrates a typical ETL process. For customers that are using the custom RDBMS SQL code or ETL tools bundled with the RDBMS, the ETL process and data will reside together in the Store and Optimise layer.
So where are the problems?
This combination of hardware and software may have worked well when the original data warehouses were being built and deployed. However due to the increase of data volume and change of requirements over the years, cracks have begun to show.
Typically when discussing about hardware, the focus will be on processors, memory, storage and networking. Each of these items has seen varying degrees of innovation that have resulted in more powerful computing platforms being made available for increasingly lower costs.
Processors - Based on Moore's law, the number of transistors in a dense integrated circuit doubles approximately every two years. It should not be a surprise that the amount of computer power available has increased exponentially in recent times.
By Wgsimon - Own work, CC-BY-SA
Much of the more recent hardware performance gains are realised through multi-core and parallel multi-processor systems. The challenge here is that a processor is a single component of an integrated server so it is unlikely to be changed without upgrading the entire platform.
Memory - as is the case with the processor, the speed of the server's memory and in particular the amount in the system often constrains the effectiveness of the data warehouse platform. Modern server platforms include up to data processors that are able to address larger volumes of memory at a much faster rate. This can be highly beneficial during ETL and query cycles.
Storage - if on board memory is crucial in ensuring that the processor can access the data it requires as quickly as possible, then the underlying disk based storage tier cannot be ignored. In fact, many single server data warehouse implementations use basic HDD arrays that struggle to keep up with the demands placed on them by increasing data volumes and complexity of user queries.
Networking - the basic I/O specifications of many iteration 1 data warehouse implementation means that their storage and networking layers have long since passed their optimal state. Networking performance is critical at 2 points within the data warehouse cycle.
- Moving data from source systems into the data warehouse
- Moving results sets from the data warehouse to the BI tool
It seems obvious, yet we still see architectures where either older or slower networking standards are being used. Or even worse, the network routing between the data warehouse and the source systems and BI systems are sub-optimal. As a result introduce unnecessary bottlenecks and delays.
The reality is that you won't be able to apply these advantages to your current data warehouse unless it is migrated onto a completely new set of hardware.
"Unfortunately, I regularly speak to customers who are struggling with their data warehouse performance as a result of still using the same server they implemented 5 years ago."
However, hardware alone is not the answer, although it may well be the problem.
New hardware on its own will most certainly deliver an increase in performance but it's unlikely to address the wider concerns covered in the rest of this blog series. Concerns relating, for instance, to the need to ingest and prepare data of differing variety, volume and velocity.
In other words, you also have to ensure that your software is optimised to address the new requirements but also take advantage of modern hardware platforms.
Extract Transform & Load (ETL)
- If you use an ETL Tool or Engine, is it supported on the latest OS and hardware platforms?
- Is your ETL process multi-threaded and able to scale up and out in order to take advantage of hardware innovations?
- Are you able to run multiple parallel queries against your source systems in order to obtain the basic data you require, or are you limited to sequential queries?
- Does your ETL method/ tool include native change data capture?
- Is your ETL solution able to retrieve data from both structured and unstructured sources?
- Can you consume web services and OData as easily as flat files and relational databases?
- Can your ETL process seamlessly work with both streaming and batch data
Store and Optimise
- Are you using a traditional RDBMS platform or one that is optimised for analytic workloads?
- Have you had to create multiple aggregate tables that pre-calculate answers to deliver the performance you require?
- Are you constantly having to tune and update indexes in order to match the performance demands of your users?
- Do you find that your data warehouse regularly swaps volumes of data to disk in order to resolve user queries?
- Are you forced to store multiple copies of your data in order to improve performance, for example OLAP cubes?
- Can your data warehouse platform scale out across commodity hardware to improve performance?
- Can you store both structured and unstructured data within the same repository or seamlessly co-exist and leverage other data stores such as Hadoop?
- Have you had to archive data from your data warehouse in order to maintain performance levels or to remain within hardware limitations?
- Are you able to compress the data you hold, so you can hold more, without impacting on performance?
- Does your data warehouse platform include native analytics libraries for delivering advanced capabilities such as; geo-coding and geo-spatial processing, predictive algorithms (R for example) and embedded data quality tools?
- Is your data warehouse platform able to support open standards for querying such as ODBC, JDBC, MDX, OData, etc?
To be clear, many of the requirements that we discussed, particularly in the blog on COMPLEX ARCHITECTURE will never be effectively addressed in a cost effective, manageable fashion without significant changes to legacy hardware and software technologies.
So what's the answer?
Is it possible to deliver a data warehouse solution that is designed from the ground up to run optimally on modern hardware platforms? A solution that takes advantage of and transform Intels chip development efforts in high performance computing?
Is it possible to take the core of your traditional data warehouse solution and re-model it in a cost effective way to work within a modern platform based on a tight synergy of hardware and software?
Can you minimise the challenges of data explosion and duplication within your data warehouse without compromising performance?
The answer: Yes (Of course…).
The Modern Data Platform
The Modern Data Platform delivers on all the requirements for a next generation data warehouse. Enabling organisations to radically simplify their existing legacy or overly complex solutions in order to lower running costs, improve agility and gain breakthrough performance that delivers real business value.
Remember to follow the rest of the posts in this blog series where we will explore in detail the common challenges of traditional data warehousing.