A data warehouse for data apps

We've seen more and more customers building data apps: applications that put real-time analytics in the hands of consumers and business customers at scale, presenting insights from very large data sets (typically tens to hundreds of terabytes) through easy-to-use web and mobile apps.

The most obvious example of a data app may be something like Google Analytics and Search Console for analysing your web site's performance. In the consumer world, Intuit Mint, Credit Karma or financial apps like those from American Express let us search, analyse and drill down into our spending habits. In the B2B domain, many apps fit the mould, like those from our customer Symphony Retail AI to analyse customer and supplier data for packaged goods, LexisNexis to protect most internet purchases against fraudulent identities, TEOCO to help telecom carriers with revenue assurance, or Nielsen and NCSolutions to understand consumer behaviour. We've also got healthcare apps for recommending provider assignments for patients, telemetry analysis apps for various types of devices, and even government apps that help find criminals.

Characteristics of data apps

Data applications' needs are different from those of traditional enterprise data warehouses, which tend to be financially focused platforms whose purchasing criteria include compatibility with ETL and BI tools, batch reporting and the like. Here are some traits of data apps:

  • The apps are built by software developers (rather than IT/BI staff) and normally born in the cloud.
  • The apps offer a mobile or web user interface, along with an API equivalent, that runs SQL queries against a back-end database to deliver insights with customisable, explorable criteria, frequently presented graphically and most often augmented with machine-learned predictive models.
  • They typically run on Kubernetes with Cloud Native architecture.
  • The data volumes are large; in our customer base, from single-digit terabytes through to several petabytes, certainly more than traditional scale-up databases can handle.
  • The data velocity tends to be fairly high with a need to have the most recent data available quickly. For example, if I'm analysing payment fraud, I need the data from the most recent seconds, not last week.
  • The applications require interactive responsiveness for common queries – in a couple of seconds or less; it's not OK to ask for an insight then wait 20 minutes for it to show up.
  • The workloads tend to have an ad-hoc component to let users slice and dice the data different ways, from running giant aggregates to browsing records or searching for needles in haystacks (a minimal query sketch follows this list).
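
To make those last two points concrete, here's a rough sketch of the kind of parameterized slice-and-dice aggregate an app back end might issue. It assumes a PostgreSQL-compatible warehouse reachable through the standard psycopg2 driver; the table and column names (user_events, region, event_ts, spend) and the DSN environment variable are purely illustrative.

```python
# Minimal sketch: an app back end running an interactive "slice and dice"
# aggregate against a PostgreSQL-compatible data warehouse.
# Assumptions: psycopg2 driver, illustrative table/column names, DSN in an env var.
import os
import psycopg2

def spend_by_region(start_ts, end_ts, min_spend=0):
    """Return total spend per region for the requested time window."""
    conn = psycopg2.connect(os.environ["WAREHOUSE_DSN"])
    try:
        with conn.cursor() as cur:
            # Keep it interactive: give up rather than hog resources
            # (assumes the PostgreSQL statement_timeout setting is honoured).
            cur.execute("SET statement_timeout = '5s'")
            cur.execute(
                """
                SELECT region, SUM(spend) AS total_spend
                FROM user_events
                WHERE event_ts >= %s AND event_ts < %s AND spend >= %s
                GROUP BY region
                ORDER BY total_spend DESC
                """,
                (start_ts, end_ts, min_spend),
            )
            return cur.fetchall()
    finally:
        conn.close()
```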

What data apps need from their database

Here are some things to think about when deciding on the best database to back your application:

Economies of scale: As you become more successful and add users, queries, services and concurrency, and data volumes grow, your database expenditure must increase more slowly than the workload it supports.

Run alongside, and scale with, your app: This means being based on Kubernetes to leverage the strengths of Cloud Native architecture for controlling scale, cost and consumption, and not having to pay egress charges to get data in or results out.

Run next to your data: Can the database run and answer queries next to the data at rest? At the edge? In your choice of cloud provider, in your own cloud account? Or in a co-lo at scale? If you have to ship all the data off to a centralised service provider, they will mark up all the supporting infrastructure and sell it back to you, pocketing your revenue and eating away at your margin.

Data velocity: Can the database ingest and analyse real-time data without the app having to worry about micro-batching, unifying queries across OLTP or OLAP databases, or other unnatural acts?

Interoperability and avoiding lock-in: Is the database compatible with other open source databases, so that if one product doesn't meet the app's needs it can be migrated to a different one? This applies to everything from SQL dialect to connections to authentication to data ingest from tools like Kafka or Pulsar.

Resource isolation: Users need fast answers that meet SLAs no matter how much ETL or data wrangling is going on, so large, important consumers of data should get isolated resources. This means support for multiple compute clusters (a.k.a. virtual warehouses) with separate storage and compute.
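
To picture what resource isolation looks like from the application side, here's a minimal sketch in which each class of workload talks to its own compute cluster endpoint. The environment variables and the ETL/interactive split are hypothetical; how clusters are actually provisioned and addressed is product-specific.

```python
# Conceptual sketch: point each class of workload at its own compute cluster,
# so batch ETL and interactive app queries never compete for the same resources.
# The environment variables / cluster endpoints are hypothetical placeholders.
import os
import psycopg2

CLUSTER_DSNS = {
    "etl": os.environ.get("ETL_CLUSTER_DSN"),                  # data wrangling, batch loads
    "interactive": os.environ.get("INTERACTIVE_CLUSTER_DSN"),  # dashboards, app queries
}

def run_query(workload, sql, params=None):
    """Execute a statement on the compute cluster dedicated to this workload."""
    conn = psycopg2.connect(CLUSTER_DSNS[workload])
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall() if cur.description else None
        conn.commit()
        return rows
    finally:
        conn.close()

# Heavy transformation runs on the ETL cluster...
# run_query("etl", "INSERT INTO facts SELECT ... FROM staging")
# ...while user-facing queries stay snappy on their own cluster:
# run_query("interactive", "SELECT ... FROM facts WHERE ...")
```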

Workload management: Computationally expensive (or badly written!) queries must not hog compute resources and get in the way of interactive response times.

Elasticity: Compute capacity must be scalable to meet demand without incurring downtime.

How the Yellowbrick Data Warehouse meets these needs

At Yellowbrick we've created a Kubernetes-based data warehouse that runs inside your own cloud VPC, with separate storage and compute and multiple clusters, all managed through SQL. Because the data warehouse runs in your own cloud account, you pay your own object storage and compute instance costs; we only bill for the compute vCPUs used by our software, a unit cost that decreases with scale and is predictable with subscriptions.

Since Yellowbrick is designed for data apps, it's PostgreSQL compatible and comes with open source connectors for Kafka and other real-time streaming data ingest and CDC tools. Thanks to our heritage in high-end enterprise data warehousing, we have workload management capabilities that keep truly ad-hoc environments fair to interactive users, along with features like asynchronous replication for disaster recovery.
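
As a rough illustration of the streaming side, the sketch below shows an app publishing events to a Kafka topic that a separately configured connector would then stream into the warehouse. The broker address, topic name and event shape are assumptions; it uses the open source kafka-python client.

```python
# Sketch: the app publishes events to a Kafka topic; a separately configured
# connector then streams them into the warehouse so they're queryable in seconds.
# Broker address, topic name and event shape are illustrative assumptions.
import json
import time
from kafka import KafkaProducer  # open source kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_payment(account_id, amount, merchant):
    """Emit a payment event for near-real-time analysis downstream."""
    producer.send("payments", {
        "account_id": account_id,
        "amount": amount,
        "merchant": merchant,
        "ts": time.time(),
    })

record_payment("acct-42", 129.99, "example-merchant")
producer.flush()  # make sure the event has actually left the app
```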

Oh, and we run in co-lo and on-premises as well as the public cloud if you need.

How we're different

[Comparison table: Yellowbrick vs. BigQuery, MS Synapse, Redshift and Snowflake across the categories: managed with K8S; your VPC next to your data; pay your own cloud infrastructure; economies of scale; full SQL-driven elasticity; supports real-time ingest; PostgreSQL compatibility; on-premises/colo support.]

A great reference architecture

[Figure: A reference architecture for data apps]

This is how our customers run data applications. A Yellowbrick Data Warehouse instance is provisioned in a VPC in their cloud account. They pay all their infrastructure costs for running it; Yellowbrick stores data on cloud object storage and has a choice of compute instances available for the "clusters" that run queries.

At least one load cluster is present for dealing with data ingest, and sometimes a second cluster for integrating ETL from traditional databases or performing ELT ("data wrangling"). The load cluster – and the ETL cluster, if present – continually inserts, deletes, merges, runs CTAS statements and transforms data as needed, which is then persisted in the object storage.

Multiple additional query clusters are provisioned for running app queries against the data. Each compute cluster caches data locally for performance and sees the latest available written data consistently (ACID) and in real time. The compute clusters are typically small, on the order of a handful of nodes, but each one forms the unit of scale. To cope with increasing demand, more compute clusters are spun up, and when demand wanes, they can be spun down to save money, with no user impact. A cluster load balancer rotates query load between the compute clusters, automatically choosing the least busy one. Within the clusters, workload management makes sure long-running queries don't block short, interactive ones.
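
As a conceptual illustration only (not how the built-in balancer is implemented), the sketch below shows the "route to the least busy cluster" idea: track in-flight queries per cluster and send each new query to the quietest one. The cluster endpoint names are hypothetical.

```python
# Toy illustration of routing to the least busy cluster: the concept only,
# not how the product's built-in cluster load balancer is implemented.
from collections import Counter

class ClusterBalancer:
    def __init__(self, cluster_endpoints):
        self._endpoints = list(cluster_endpoints)  # e.g. DNS names of query clusters
        self._in_flight = Counter()                # queries currently running per cluster

    def acquire(self):
        """Pick the cluster with the fewest in-flight queries."""
        endpoint = min(self._endpoints, key=lambda e: self._in_flight[e])
        self._in_flight[endpoint] += 1
        return endpoint

    def release(self, endpoint):
        self._in_flight[endpoint] -= 1

balancer = ClusterBalancer(["query-cluster-1", "query-cluster-2", "query-cluster-3"])
target = balancer.acquire()   # run the app's query against `target`, then:
balancer.release(target)
```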

Some customers have dedicated compute clusters for batch and critical path processing in addition to scalable ad-hoc query clusters. Yellowbrick supports thousands of compute nodes and clusters.

Conclusion

Data applications need a data warehouse that runs alongside the application, scales predictably, and offers economies of scale to reward a growing application business with higher profit margins. Such a modern data warehouse must be elastic, with separate storage and compute, be capable of handling real-time data, and be based on a Kubernetes Cloud Native architecture that can run anywhere.

Only the Yellowbrick Data Warehouse fits the bill, and it goes further with enterprise-class features like asynchronous replication for disaster recovery, SQL stored procedures, and a huge ecosystem of certified, interoperable tools.