BigQuery Sink Service Specification

This document specifies the behavior of the service that delivers decoded messages into BigQuery.

Data Flow

Consume messages from a PubSub topic or Cloud Storage location and insert them into BigQuery. Send errors to another PubSub topic or Cloud Storage location.

Implementation

Execute this as an Apache Beam job.

Configuration

Require configuration for:

Accept optional configuration for:

Coerce Types

Reprocess the JSON payload in each message to match the schema of the destination table found in BigQuery as codified by the jsonschema-transpiler.

Support the following logical transformations:

Accumulate Unknown Values As additional_properties

Accumulate values that are not present in the destination BigQuery table schema and inject as a JSON string into the payload as additional_properties. This should make it possible to backfill a new column by using JSON operators in the case that a new field was added to a ping in the client before being added to the relevant JSON schema.

Unexpected fields should never cause the message to fail insertion.

Errors

Send all messages that trigger an error described below to the error output.

Handle any exceptions when routing and decoding messages by returning them in a separate PCollection. We detect messages that are too large to send to BigQuery and route them to error output by raising a PayloadTooLarge exception.

Errors when writing to BigQuery via streaming inserts are returned as a PCollection via the getFailedInserts method. Use InsertRetryPolicy.retryTransientErrors when writing to BigQuery so that retries are handled automatically and all errors returned are non-transient.

Error Message Schema

Always include the error attributes specified in the Decoded Error Message Schema.

Encode errors received as type TableRow as JSON in the payload of a PubsubMessage, and add error attributes.

Do not modify errors received as type PubsubMessage except to add error attributes.

Other Considerations

Message Acks

Acknowledge messages in the PubSub topic subscription only after successful delivery to an output. Only deliver messages to a single output.