Landfill Service Specification
This document specifies the behavior of the service that batches raw messages into long-term storage.
Data Flow
Consume messages from a Google Cloud PubSub topic and write them in batches to Google Cloud Storage. Split batches by time window, based on when the messages were retrieved from PubSub. Additionally, split a batch when it reaches a certain size, if possible.
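The batching rules above can be sketched in plain Python, independent of Beam. This is an illustration only, not the service's implementation: the names `window_start`, `batch_messages`, `window_seconds`, and `max_batch_bytes` are hypothetical, and fixed (non-overlapping) windows are an assumption. A message larger than the size limit still gets a batch of its own, which is one reading of "if possible".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_start(received_at: float, window_seconds: int) -> int:
    """Return the start (epoch seconds) of the fixed window containing received_at."""
    return int(received_at // window_seconds) * window_seconds


def batch_messages(
    messages: Iterable[Tuple[float, bytes]],
    window_seconds: int,
    max_batch_bytes: int,
) -> Dict[int, List[List[bytes]]]:
    """Group (received_at, payload) pairs by window, then split each window by size.

    Hypothetical helper: received_at is the time the message was retrieved
    from PubSub, not the time it was published.
    """
    batches: Dict[int, List[List[bytes]]] = defaultdict(list)
    current_size: Dict[int, int] = {}  # bytes in the newest batch of each window

    for received_at, payload in messages:
        start = window_start(received_at, window_seconds)
        window_batches = batches[start]
        # Open a new batch for the first message in a window, or when adding
        # this payload would push the newest batch past the size limit.
        if not window_batches or current_size[start] + len(payload) > max_batch_bytes:
            window_batches.append([])
            current_size[start] = 0
        window_batches[-1].append(payload)
        current_size[start] += len(payload)

    return dict(batches)
```

With 60-second windows and an 8-byte limit, three 4-byte messages retrieved in the same minute produce two batches for that window, while a message retrieved in the next minute lands in a separate window.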
Implementation
Execute this as an Apache Beam job.
Latency
Accept a configuration option for the batch window size. Deliver batches to Cloud Storage within 5 minutes of the batch window closing.
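As a worked example of the latency requirement, the following sketch computes the latest acceptable delivery time for a window. The helper names `delivery_deadline` and `delivered_on_time` are hypothetical, and treating the 5-minute bound as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Maximum delay between a window closing and its batches landing in Cloud Storage.
MAX_DELIVERY_DELAY = timedelta(minutes=5)


def delivery_deadline(window_start: datetime, window_size: timedelta) -> datetime:
    """Latest acceptable delivery time: window close plus the allowed delay."""
    window_close = window_start + window_size
    return window_close + MAX_DELIVERY_DELAY


def delivered_on_time(
    window_start: datetime, window_size: timedelta, delivered_at: datetime
) -> bool:
    """True if the batch was delivered no later than the deadline (inclusive)."""
    return delivered_at <= delivery_deadline(window_start, window_size)


# Example: a 10-minute window opening at midnight UTC closes at 00:10,
# so its batches must be delivered by 00:15.
start = datetime(2020, 1, 1, 0, 0, tzinfo=timezone.utc)
deadline = delivery_deadline(start, timedelta(minutes=10))
```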
Other Considerations
Message Acks
Only acknowledge messages in the PubSub topic subscription after they have been delivered to Cloud Storage. This is the default behavior for Beam as long as no shuffle operations are performed. For very long windows, Beam should automatically extend the ack deadline of undelivered messages.
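The ordering requirement (write first, acknowledge second) can be illustrated with a small sketch. It does not show how Beam manages acknowledgements internally; `deliver_then_ack`, `write_batch`, and `ack` are hypothetical stand-ins for the storage write and the PubSub acknowledgement call.

```python
from typing import Callable, List, Sequence, Tuple


def deliver_then_ack(
    batch: Sequence[Tuple[str, bytes]],
    write_batch: Callable[[List[bytes]], None],
    ack: Callable[[List[str]], None],
) -> bool:
    """Write a batch of (ack_id, payload) pairs, acknowledging only on success.

    Returns True if the batch was written and acknowledged, False if the
    write failed and the messages were left unacknowledged for redelivery.
    """
    payloads = [payload for _, payload in batch]
    try:
        write_batch(payloads)
    except Exception:
        # The write failed, so nothing is acknowledged and the messages
        # remain available for redelivery.
        return False
    # Acknowledge only after the write has returned successfully.
    ack([ack_id for ack_id, _ in batch])
    return True
```

Because acknowledgement happens strictly after the write returns, a failure during delivery may lead to duplicate writes on retry but not to lost messages.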