GCP Ingestion Architecture

This document specifies the architecture for GCP Ingestion as a whole.

Architecture Diagram

See diagram.mmd for the source of the architecture diagram.

Architecture Components

Ingestion Edge

Landfill Sink

Decoder

Republisher

BigQuery Sink

Dataset Sink

Notes

PubSub stores unacknowledged messages for 7 days. Any PubSub subscription more than 7 days behind requires a backfill.
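
As a rough illustration (not part of the pipeline itself), the sketch below uses the Cloud Monitoring Python client to check a subscription's oldest_unacked_message_age metric against the 7-day retention window; the project and subscription names are placeholders, not the real deployment.

```python
# Sketch: warn when a subscription's backlog age approaches the 7-day
# PubSub retention limit, using the oldest_unacked_message_age metric.
# Project and subscription IDs below are placeholders.
import time

from google.cloud import monitoring_v3

PROJECT = "my-ingestion-project"        # placeholder
SUBSCRIPTION = "decoded-bigquery-sink"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 300},  # last 5 minutes
    }
)
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": (
            'metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
            f'AND resource.labels.subscription_id="{SUBSCRIPTION}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

SEVEN_DAYS = 7 * 24 * 3600
for series in results:
    latest_age = series.points[0].value.int64_value  # seconds, newest point first
    if latest_age > SEVEN_DAYS * 0.8:
        print(f"{SUBSCRIPTION} backlog age {latest_age}s is nearing the 7-day limit")
```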

Dataflow will extend ack deadlines indefinitely when consuming messages, and will not ack messages until they are processed by an output or GroupByKey transform.

Dataflow jobs achieve at-least-once delivery by avoiding GroupByKey transforms, so that messages are only acked after they reach an output, and by not falling more than 7 days behind in processing.
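
The sketch below, written with the Apache Beam Python SDK purely for illustration, shows the shape of such a job: it reads from a PubSub subscription and writes to BigQuery with no GroupByKey in the user code, so messages are acked only once they reach the output. Topic, table, and project names are placeholders, and the real jobs may be structured differently.

```python
# Sketch: a streaming pipeline with no GroupByKey between the PubSub
# source and the BigQuery output. All names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, project="my-ingestion-project")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadRaw" >> beam.io.ReadFromPubSub(
            subscription="projects/my-ingestion-project/subscriptions/raw-decoder")
        | "Decode" >> beam.Map(json.loads)  # stand-in for real decoding/validation
        | "WriteDecoded" >> beam.io.WriteToBigQuery(
            "my-ingestion-project:telemetry.main_v4",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```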

Design Decisions

Kubernetes Engine and PubSub

Kubernetes Engine is a scalable, managed service based on an industry standard. PubSub is a simple, scalable, managed service. By comparison, running a compute instance group instead of Kubernetes Engine, or Kafka instead of PubSub, would require more operational overhead and engineering effort to maintain.

Different topics for "raw" and "validated" data

We don't want to repeat validation logic for every consumer when multiple consumers read the same data. Raw data can be sent to a single topic to keep the edge service simple, and validated data can then be republished to topics split by docType and other attributes, allowing consumers to subscribe only to the sets of data they need.
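
For illustration, the sketch below republishes a validated message to a hypothetical per-docType topic, carrying the docType and related attributes on the message; the topic naming scheme and attribute names are assumptions, not the exact production layout.

```python
# Sketch: republish a validated message to a per-docType topic so that
# downstream consumers can subscribe to just the data they need.
# Topic naming and attribute names here are placeholders.
from google.cloud import pubsub_v1

PROJECT = "my-ingestion-project"  # placeholder
publisher = pubsub_v1.PublisherClient()


def republish_validated(data: bytes, namespace: str, doc_type: str, doc_version: str) -> str:
    topic = publisher.topic_path(
        PROJECT, f"decoded-{namespace}-{doc_type}-v{doc_version}")
    future = publisher.publish(
        topic,
        data,
        document_namespace=namespace,  # attributes travel with the message
        document_type=doc_type,
        document_version=doc_version,
    )
    return future.result()  # message ID once PubSub accepts the publish
```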

BigQuery

BigQuery provides a simple, scalable, managed service for executing SQL queries over arbitrarily large or small amounts of data, with built-in schema validation, HyperLogLog functions, UDF support, and destination tables (sometimes called materialized views) for minimizing the cost and latency of derived tables. Alternatives (such as Presto) would require more operational overhead and engineering effort to maintain, while generally being less featureful.
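
As a rough example of those features, the sketch below uses the BigQuery Python client to run a query that combines an inline SQL UDF with HyperLogLog functions; the dataset, table, and column names are placeholders.

```python
# Sketch: run a query that defines a temporary SQL UDF and uses
# HyperLogLog functions for an approximate distinct count.
# Project, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-ingestion-project")  # placeholder

sql = """
CREATE TEMP FUNCTION normalize_channel(channel STRING) AS (
  IF(channel IN ('release', 'beta', 'nightly'), channel, 'Other')
);

SELECT
  normalize_channel(channel) AS channel,
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(client_id)) AS approx_client_count
FROM `my-ingestion-project.telemetry.main_v4`
WHERE DATE(submission_timestamp) = '2023-01-01'
GROUP BY 1
"""

for row in client.query(sql).result():
    print(row.channel, row.approx_client_count)
```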

Save messages as newline delimited JSON

One of the primary challenges of building a real-world data pipeline is anticipating and adapting to changes in the schemas of messages flowing through the system. Strong schemas and structured data give us many usability and performance benefits, but changes to the schema at one point in the pipeline can lead to processing errors or dropped data further down the pipeline.

Saving messages as newline delimited JSON allows us to gracefully handle new fields added upstream without needing to specify those fields completely before they are stored. New columns can be added to a table's schema, and their values can then be restored from the stored JSON via a BigQuery load operation.
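
A backfill of that kind might look like the following sketch, which reloads stored newline delimited JSON from Cloud Storage into a single table partition while allowing new columns to be added; the bucket, table, and partition shown are placeholders.

```python
# Sketch: reload newline delimited JSON to populate a newly added column.
# Bucket, table, and partition are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-ingestion-project")  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # Overwrite just this partition so the backfill is idempotent.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    # Allow the load to add columns that appear in the JSON but are
    # missing from the destination table's current schema.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    ignore_unknown_values=True,  # skip fields that still have no column
)

load_job = client.load_table_from_uri(
    "gs://my-landfill-bucket/telemetry/main/2023-01-01/*.ndjson",  # placeholder path
    "my-ingestion-project.telemetry.main_v4$20230101",             # placeholder partition
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```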

Use destination tables

For complex queries that are calculated over time-based windows of data, using destination tables allows us to save time and cost by only querying each new window of data once.
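
For example, a daily window of a derived table might be materialized with a query job that writes its results to a destination table partition, as in the sketch below; the table, query, and date are placeholders.

```python
# Sketch: compute one day's window once and write it to a destination
# table partition. Table, query, and date are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-ingestion-project")  # placeholder

job_config = bigquery.QueryJobConfig(
    destination="my-ingestion-project.derived.daily_active_clients_v1$20230101",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # re-runs are idempotent
)

sql = """
SELECT
  DATE(submission_timestamp) AS submission_date,
  COUNT(DISTINCT client_id) AS active_clients
FROM `my-ingestion-project.telemetry.main_v4`
WHERE DATE(submission_timestamp) = '2023-01-01'  -- only the new window
GROUP BY submission_date
"""

client.query(sql, job_config=job_config).result()
```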

Use views for user-facing data

Views we create in BigQuery can be a stable interface for users while we potentially change versions or implementations of a pipeline behind the scenes. If we wanted to rewrite a materialized view, for example, we might run the new and old definitions in parallel, writing to separate tables; when we’re comfortable that the new implementation is stable, we could cut users over to the new implementation by simply changing the definition of the user-facing view.
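
A cutover of that kind could be as small as updating the view definition, as in the sketch below; the dataset and table names are placeholders.

```python
# Sketch: repoint a stable, user-facing view at a new underlying table
# once the new implementation is trusted. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-ingestion-project")  # placeholder

view = client.get_table("my-ingestion-project.telemetry.clients_daily")  # user-facing view
view.view_query = """
SELECT * FROM `my-ingestion-project.derived.clients_daily_v7`
"""
client.update_table(view, ["view_query"])
```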

Known Issues

Further Reading

Differences from AWS