The Cloud, Data Warehousing and Snowflake: all my old ideas shot to ribbons!
For those of us brought up on traditional data warehousing, the promise of the Cloud is staggering: it offers speed, scale and cost savings. It’s hard to think what is not to like. But dig a bit deeper and you find that many offerings, although cheaper, faster and more scalable than what you are used to simply because they are in the Cloud, are still failing to really get the most out of it. What is more, we are not talking about marginal gains; we are talking about really significant gains!
When people look at a data warehouse they are looking for a number of critical characteristics. The first is scaling. The growth of data is staggering, and even if storage is comparatively cheap you want to minimise storage costs, pay for what you use, and pay only when you use it. You want to match your resources very closely to your usage, and have the assurance that you can scale on demand when you need to. Secondly, you are looking for performance. If you are trying to develop a data culture, where people make decisions based on facts and not intuition, you need to be able to answer their questions in real time, and not interrupt their train of thought whilst they wait for an answer. Equally, we now demand that the data we look at is not a snapshot of the past but reflects the here and now. Thirdly, we are no longer interested only in structured data; we want to handle structured and semi-structured data, so that all data can be ingested without massive engineering overheads. That means bringing in data in XML and JSON files to accommodate things like website data, social media data, documents and so forth. Fourthly, you want the data to be accessible to a wide range of tools, to enable things like Self Service BI, advanced analytics via Spark, and so on.
Personally, I have been talking about these things as a goal for a long time now, and have usually been forced to compromise when looking at what is available. But last week I attended a Looker event in London and came across a host of products that have been on my radar but which I had yet to look at in detail. One of those is Snowflake. Snowflake is one of the first data warehouse platforms I have seen that has been designed from the bottom up for the Cloud; it’s not a ported and compromised solution.
To put Snowflake into context, the company was founded in 2012 by engineers out of Oracle. In 2014 they had their first paying customers, and the product went on general availability in 2016. So it’s new, but don’t think that means they are still just new kids on the block trying to grab attention. As of January 2018 they have 1,200 customers and $473 million in funding. That has brought in customers like Capital One, Sony, Adobe and Experian.
So, looking at Snowflake for the first time, it has a number of features that are very different. First, the architecture: it’s neither a shared-disk nor a shared-nothing architecture. It’s a multi-cluster, shared-data model, with centralised scale-out storage and multiple independent compute clusters. It’s a data warehouse as a service, so there are no servers to manage, no software to install, and no indexes to maintain and optimise. That entire overhead is taken care of!
The second thing to understand is their virtual warehouse concept. The data is distributed across a proprietary S3 cluster (it’s built on AWS at present, and will be on Azure later this year). The customer pays for the underlying storage in S3 and for the compute capacity of the virtual warehouse. Warehouses can be scaled for the work required, and can be paused at any time for cost efficiency, so you can switch things off overnight with no impact on the underlying data. Changes to the configuration can be made at any time and are applied in seconds. Workloads are kept separate and each is sized accurately, so it is all very controllable and fine-grained.
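To make the "storage plus compute, and only while it runs" idea concrete, here is a back-of-the-envelope sketch in Python. It is only an illustration of the arithmetic: the class, the function and every rate in it are my own made-up assumptions, not Snowflake's price list or API.

```python
from dataclasses import dataclass


@dataclass
class ToyWarehouse:
    """A hypothetical virtual warehouse, described only by how it is billed."""
    name: str
    credits_per_hour: float        # assumed compute rate while running
    hours_running_per_day: float   # a paused warehouse accrues no compute cost


def monthly_cost(storage_tb, storage_rate_per_tb, warehouses, credit_price, days=30):
    """Storage is billed on volume; compute is billed per warehouse, only while running.

    All rates are placeholders chosen for the example.
    """
    storage = storage_tb * storage_rate_per_tb
    compute = sum(
        w.credits_per_hour * w.hours_running_per_day * days * credit_price
        for w in warehouses
    )
    return storage + compute
```

With invented figures of 10 TB at 20 per TB, a credit price of 3, and one warehouse burning 2 credits an hour, running around the clock comes to 4,520 a month, while pausing it for all but 10 hours a day comes to 2,000. The storage line is identical in both cases, which is the point: pausing compute does not touch the data.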
The third concept is how the data is shared. We are used to separating development, staging and production environments, and we are accustomed to moving and copying data between them. Snowflake allows users to make “Zero Copy” clones instantly. This is potentially a significant cost saving, and it is fast because there are no writes involved. The architecture provides built-in data resilience and recovery: “Time Travel” enables access back through 90 days via a simple query, with roll back and recovery. So out of the box it’s secure, governed and resilient, and it is certified for both PCI and HIPAA (the standard for ensuring health care data is protected).
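I have not seen Snowflake's internals, so the following is only a toy model of why a clone can be instant and why older versions can remain queryable: if data blocks are never modified in place, a clone is just a second list of pointers, and history is just the earlier pointer lists. The class and method names are hypothetical and are not Snowflake's.

```python
class ToyTable:
    """A table modelled as a list of references to immutable blocks of rows."""

    def __init__(self, blocks=()):
        self.blocks = list(blocks)             # references to blocks, never the rows themselves
        self.versions = [tuple(self.blocks)]   # version 0 is the state at creation

    def append_rows(self, rows):
        """Write a new immutable block and record the resulting version."""
        self.blocks.append(tuple(rows))
        self.versions.append(tuple(self.blocks))

    def clone(self):
        """Copy only the block references; no row data is copied or written."""
        return ToyTable(self.blocks)

    def rows(self):
        return [row for block in self.blocks for row in block]

    def as_of(self, version):
        """Return the rows as they stood after `version` writes."""
        return [row for block in self.versions[version] for row in block]
```

In this model a development clone shares every existing block with production, and new storage is only used for blocks written after the clone was taken. Writes to the clone never show up in production, and an earlier state of either table can be read back by version number.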
Fourthly, Snowflake ingests data in its native format and imposes a schema on read, so semi-structured data is queried using SQL alongside everything else. All of the data is in the one database.
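Schema on read is easier to see than to describe. In Snowflake the query would be written in SQL; the sketch below uses plain Python and the standard json module purely to show the principle, with invented sample records and hypothetical helper names. The raw documents are stored untouched, and the "columns" are only decided when the query runs.

```python
import json

# Raw events stored exactly as they arrived; note the two records differ in shape.
RAW_EVENTS = [
    '{"user": {"id": 1, "country": "UK"}, "page": "/home"}',
    '{"user": {"id": 2}, "page": "/pricing", "referrer": "twitter"}',
]


def get_path(document, path):
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select(raw_rows, columns):
    """Apply a schema at query time: `columns` maps output names to dotted paths."""
    for raw in raw_rows:
        document = json.loads(raw)
        yield {name: get_path(document, path) for name, path in columns.items()}
```

Asking for `{"user_id": "user.id", "country": "user.country"}` returns a row for both events, with the missing country in the second simply coming back as None. Nothing had to be remodelled or reloaded when the second record turned up with a different shape.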
With all of this you are only paying for what you use. So we have a fairly ideal set of attributes: something that is faster, both to load data and to analyse it, and notably cheaper and simpler. The saving in management and tuning must be significant. All of this is achieved without compromise, which is really impressive.