Opinionated musings on data lakes, lake houses, warehouses, and SQL engines
A SQL engine is a bit of software that cranks through data to answer a question formulated in, not surprisingly, SQL. In general terms, the computer science needed to execute SQL isn't that complicated: you need to be able to sort stuff, hash stuff, write loops to iterate over data, apply predicates and perform simple arithmetic. Simple aggregations, joins and window functions are all built on top of such basic primitives. Basic SQL is well defined and pretty easy to parse. Generally, most kids with a good computer science degree can write a SQL engine that works in the happiest of times; just look at the source code for PostgreSQL's executor, for example: so simple, yet so capable. There's a reason in-memory engines have been so popular: they're incredibly easy to write. It turns out they're also inappropriate for analytic workloads, because it's unaffordable to have enough memory to hold all the data and intermediate results needed to run huge numbers of concurrent queries (more detail on this in a subsequent article).
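To make that concrete, here's a minimal sketch in Python (purely illustrative, with made-up table and column names, and nothing like the vectorised, compiled code a real engine runs) showing how a hash aggregation and a hash join fall straight out of those primitives:

```python
from collections import defaultdict

# Rows are plain dicts; a "table" is just a list of them.
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 25.0},
    {"customer_id": 1, "amount": 5.0},
]
customers = [
    {"customer_id": 1, "name": "Ethel"},
    {"customer_id": 2, "name": "Arthur"},
]

# SELECT customer_id, SUM(amount) FROM orders WHERE amount > 6 GROUP BY customer_id
def hash_aggregate(rows):
    groups = defaultdict(float)                 # hash table keyed by the GROUP BY column
    for row in rows:                            # loop over the data
        if row["amount"] > 6:                   # apply the predicate
            groups[row["customer_id"]] += row["amount"]   # simple arithmetic
    return groups

# SELECT c.name, o.amount FROM orders o JOIN customers c USING (customer_id)
def hash_join(build_rows, probe_rows):
    build = {row["customer_id"]: row for row in build_rows}  # build side: hash it
    for row in probe_rows:                                    # probe side: loop and look up
        match = build.get(row["customer_id"])
        if match is not None:
            yield {"name": match["name"], "amount": row["amount"]}

print(dict(hash_aggregate(orders)))
print(list(hash_join(customers, orders)))
```

A real executor does the same work over compressed column batches with vectorised, compiled code rather than Python dicts, which is where the actual engineering lives.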
SQL tech is a fashion: trends come and go. Storage engines, too. We've been bombarded with a never-ending onslaught of them for our data lakes and "lake houses." In that time we've had Hive, Hive+Tez, Hive+Spark, Presto, Drill, HAWQ, Shark, Impala and Spark SQL. We've got ClickHouse out of Russia and Apache Doris out of China (they truly couldn't come up with a better name; might as well have called it Ethel, Arthur or Gertrude). We've had the Kudu storage engine, Delta Lake and the Delta Engine (I guess alpha through gamma didn't work). There are 28 table engines in ClickHouse, and a list of Hadoop file formats like ORC, Parquet, Avro, SequenceFile and RCFile. Thank heaven at least Parquet has become pretty standard as the new data interchange format, so we can largely forget about the rest of them.
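The nice thing about Parquet winning the interchange battle is that round-tripping data between tools becomes boring. A tiny sketch using pyarrow (file and column names are invented for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One system writes its output as Parquet...
table = pa.table({"customer_id": [1, 2, 3], "amount": [10.0, 25.0, 5.0]})
pq.write_table(table, "orders.parquet")

# ...and anything else in the ecosystem (Spark, Presto, DuckDB, pandas, a
# warehouse bulk loader) can read the same file without caring who wrote it.
print(pq.read_table("orders.parquet").schema)
```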
A standalone SQL engine is simple. If the SQL engine had a job, it would be a customer service agent answering the phone, one call at a time, without a HOLD button, forwarding each caller to the requested extension. The data warehouse, however, is all about concurrent magic. The agent goes home, cooks a sausage dinner, feeds the dog, washes the clothes and handles phone calls concurrently with dozens of people on hold... without dropping a single caller, without losing the phone in the washer, rocking the sausages to sleep or baking the dog. The data warehouse juggles, is 100% dependable, and always gets you through just in time regardless of whatever else is going on.
You can't build a data warehouse out of completely independent, discrete parts and have it just work. If you tried to build an electric car out of laptop batteries, the engine from a bus, seats from the cinema and electrical components from Home Depot or B&Q, even if you made it work you'd have an unreliable piece of junk the minute you tried to actually drive it. Vehicles need to be purposely designed as a whole, as do mobile phones, laptops, skyscrapers, bridges, lunar landers or any other complex piece of engineering. The parts must fit together seamlessly, and, in the case of databases, it normally takes the focus of an entire company to make that happen.
Inexperienced people who want to build a fast SQL engine for the first time start by reading academic papers about new ways to optimise things (normally referencing a couple of TPC-H queries). They take that couple of TPC-H queries, run them, make them go fast and declare victory. Anonymous benchmarks go up on websites, the hype starts, and a new SQL engine is born. I know this is what happens because, during my time as an inexperienced database developer, I did this too, and I've done due diligence on plenty of other startups doing the same. Some of them even write the stuff in Java.
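To be fair, the ritual has never been easier to reproduce. A sketch of it using DuckDB's bundled tpch extension (scale factor and query number picked arbitrarily); the point is that making a couple of these queries go fast on one box is table stakes, not a data warehouse:

```python
import duckdb

con = duckdb.connect()

# Generate a tiny TPC-H data set and run query 1. Being fast here says nothing
# about surviving thousands of concurrent, badly behaved queries.
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 0.1)")
print(con.execute("PRAGMA tpch(1)").fetchdf().head())
```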
Now Databricks is having a go with its "Delta Engine": impressive videos of computer scientists talking about amazing things, filled with all the buzzwords, from a company trying to build an enterprise-grade data warehouse for the first time. Databricks' tech is absolutely brilliant for programmers and data scientists, but unsuitable as the foundation of a data warehouse. It's the same thing that happened with Impala, Kudu and all the others. It's why Snowflake storms ahead and grows as a business, and why so much old technology is still entrenched in this market: Snowflake works. It's taken Google over a decade to get BigQuery to the point where it runs well enough for a lot of workloads, it's why Amazon had to license technology to build Redshift, and Microsoft still struggles to build a good parallel analytic database. However, Snowflake is trying to move into Databricks' world, and Databricks is trying to move into Snowflake's world.
We can see how hard this is from the recent benchmark wars between Snowflake and Databricks. They run benchmarks on bloody enormous 256-node clusters that burn money faster than a Hummer climbing the French Alps, and probably pollute the environment nearly as much. With a concurrency level of 4. That's not a typo: 4 (four). That's 256 servers serving 4 people, 64 servers each. Does this make a credible data warehouse? Nope.
Let's talk about some of the stuff that happens all the time in the real world, and why this is hard:
- Hundreds or thousands of users run concurrent queries. Databricks wedges for hours and finally recovers. Snowflake backs up but, providing you have configured it to automatically spend more money, spins up more virtual stuff and gets the job done. Yellowbrick just runs them all.
- At the same time, some users submit badly written queries that run forever. Snowflake charges even more money to get the job done. Databricks wedges, again. Yellowbrick separates out the bad queries, penalty boxes them and makes sure they don't get in the way of the good stuff (a sketch of that idea follows this list).
- At the same time, a query 10 screenfuls long with 90-something joins across wide tables gets submitted. Snowflake spends 30 minutes compiling. Databricks errors out. Yellowbrick runs it.
- At the same time, some users submit queries that blow up data sets and spill like crazy. Databricks has random heap memory issues. Snowflake (this time, without charging more money) and Yellowbrick just take their time and run the query.
- At the same time, ad-hoc queries are submitted that miss data in the cache. Snowflake goes slow, Databricks goes even slower, and Yellowbrick runs almost as fast as if the data were cached.
- At the same time, some other workload changes and churns 30% of the data per day. Databricks grinds to a halt, the others run just fine.
- While that's happening, everything is highly available and all your data is being asynchronously replicated to another cloud or site to make sure business continues in the event of a disaster.
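The penalty-box idea a few bullets up is easy to sketch and brutally hard to make work under real load. Here's a toy Python scheduler (the two-lane design, the names and the time budget are all invented for illustration, not how any of these products actually work) that routes expensive queries to a low-priority lane so the short ones keep flowing:

```python
import queue
import threading
import time

FAST_BUDGET_S = 0.5          # made-up time budget for the "good" lane

fast_lane = queue.Queue()    # short, well-behaved queries
penalty_box = queue.Queue()  # long-running offenders, served at lower priority

def submit(query_id, estimated_cost_s):
    """Route a query based on a (hypothetical) cost estimate."""
    lane = fast_lane if estimated_cost_s <= FAST_BUDGET_S else penalty_box
    lane.put((query_id, estimated_cost_s))

def worker():
    while True:
        # Always drain the fast lane first; only touch the penalty box when idle.
        try:
            query_id, cost = fast_lane.get(timeout=0.05)
        except queue.Empty:
            try:
                query_id, cost = penalty_box.get_nowait()
            except queue.Empty:
                return
        time.sleep(cost)     # stand-in for actually executing the query
        print(f"finished {query_id} ({cost:.1f}s)")

submit("dashboard_tile", 0.1)
submit("ninety_join_monster", 2.0)
submit("another_quick_lookup", 0.1)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a real warehouse the hard part is everything around this: estimating cost before anything runs, demoting a query mid-flight, and keeping memory fair while you do it.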
Combine this with the need to run complex ELT (data transformation in place) with correct transaction semantics, run legacy stored procedures, deal with flaky hardware stuttering, deal with network errors and failure, deal with software crashes, deal with data centre outages and internet outages. Optimise and tune this for different cloud instances and platforms. Combine this with a gigantic ecosystem that has to be supported, from BusinessObjects to Tableau, Informatica to Kafka, Golden Gate to Qlik and all the rest. Then make it run fast, from buggy clients with crap network connections where people randomly cancel stuff... and you'll start to get a feel for what's hard about making this stuff work in the real world. Running fast with the buzzwords is the easy bit.
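The "ELT with correct transaction semantics" bit alone is worth spelling out: the in-place transformation has to land completely or not at all, even when something dies halfway through. A bare-bones sketch of the pattern (using DuckDB purely because it's easy to run locally; table names are invented):

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS staging_orders(customer_id INT, amount DOUBLE)")
con.execute("CREATE TABLE IF NOT EXISTS fact_orders(customer_id INT, total DOUBLE)")
con.execute("INSERT INTO staging_orders VALUES (1, 10.0), (2, 25.0), (1, 5.0)")

# Transform in place: either the whole reshuffle lands, or none of it does.
try:
    con.execute("BEGIN TRANSACTION")
    con.execute("""
        INSERT INTO fact_orders
        SELECT customer_id, SUM(amount) FROM staging_orders GROUP BY customer_id
    """)
    con.execute("DELETE FROM staging_orders")
    con.execute("COMMIT")
except Exception:
    con.execute("ROLLBACK")  # a half-applied transformation is worse than a failed one
    raise
```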
From a distance it's actually easy to see what's wrong with architectures like SQL on Hadoop, lake houses and Delta Engine/Lake. The whole dream of querying data in arbitrary formats stored in different places, and doing so efficiently, is really two competing goals that can never be reconciled. Snowflake and Yellowbrick persist data in the same format it's queried in: optimally arranged for finding data fast and for vectorised compute, good for writing and updating, and incrementally pageable into the cache. The lake houses store data in one format but cache a different, more optimal format (an admission that Parquet is crap for caching, finding and computing on data fast), yet they still won't give up on the 'data lake' vision. That means that, for it to work well, your cache has to be big enough to hold all the data you actually want to query. Presto does well querying data in different formats in different places, but it does so slowly and doesn't pretend to be an interactive, ad-hoc data warehouse that can handle all of the difficult stuff above.
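A crude way to feel that trade-off on a laptop: time the same aggregation against a Parquet file versus the data loaded once into an engine's native table. The sketch below (made-up file name, DuckDB used only as a convenient stand-in) proves nothing about a distributed system, but it shows the shape of the argument, and why the cached copy only helps if it actually fits:

```python
import time
import duckdb

con = duckdb.connect()

# Pretend "lake" data: a Parquet file on disk (path is illustrative).
con.execute("""
    COPY (SELECT range AS id, random() AS amount FROM range(5000000))
    TO 'lake_orders.parquet' (FORMAT parquet)
""")

# Option 1: query the lake format directly, every time.
t0 = time.time()
con.execute("SELECT sum(amount) FROM 'lake_orders.parquet'").fetchall()
parquet_s = time.time() - t0

# Option 2: cache it once into the engine's native format, then query that.
con.execute("CREATE TABLE cached_orders AS SELECT * FROM 'lake_orders.parquet'")
t0 = time.time()
con.execute("SELECT sum(amount) FROM cached_orders").fetchall()
native_s = time.time() - t0

print(f"parquet scan: {parquet_s:.3f}s, native table: {native_s:.3f}s")
```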
A SQL engine, a data lake or a lake house will never be a true data warehouse. There are, of course, plenty of benefits in the other direction as well: the level of flexibility offered by a general compute platform like Databricks is unmatched by any database. However, my personal opinion is that these are two different worlds, and as Databricks "moves left" and Snowflake "moves right", trying to overlap and compete more, neither will be particularly successful in occupying much of the other's turf.
Databricks will continue to do incredibly well in the data-science-driven world and in making life much better for the people burned by Hadoop. Snowflake will continue to grow in data warehousing, which is the bigger and more valuable market. And of course our dear Yellowbrick Data will continue to grow incredibly well too, taking business from IBM, Oracle, Snowflake, Microsoft, Teradata and all the other players as customers look for something more modern.