Ingestion Testing Workflow

The ingestion-beam handles the flow of documents from the edge into various sinks. You may want to stand up a small testing instance to validate how the various components integrate.

Figure (diagrams/workflow.mmd): An overview of the various components necessary to query BigQuery against data from a PubSub subscription.

Setting up the GCP project

Read through whd/gcp-quickstart for details about the sandbox environment that is provided by data operations.
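
For orientation, here is a rough sketch of the prerequisites the commands below assume are in place. The project id and service-account name are hypothetical placeholders; the quickstart above is the authoritative guide.

# Illustrative setup only; substitute your own sandbox project id.
gcloud config set project my-sandbox-project
PROJECT=$(gcloud config get-value project)
# Scratch bucket named after the project, used for input, temp, and error output below.
gsutil mb "gs://$PROJECT"
# Hypothetical service account; grant it Dataflow and BigQuery roles as appropriate.
gcloud iam service-accounts create dataflow-test
gcloud iam service-accounts keys create keys.json \
    --iam-account "dataflow-test@$PROJECT.iam.gserviceaccount.com"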

Bootstrapping schemas from mozilla-pipeline-schemas
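
The BigQuery sink below writes to tables named after each document's namespace, type, and version, so matching tables must exist before running a job. As a minimal sketch, assuming the mozilla-pipeline-schemas repository publishes BigQuery table schemas as .bq files laid out as schemas/<namespace>/<doctype>/<doctype>.<version>.bq on its generated-schemas branch (verify the branch and layout against the repository itself), the tables can be bootstrapped like this:

# Hedged sketch: create one BigQuery table per schema in mozilla-pipeline-schemas.
git clone --depth 1 --branch generated-schemas \
    https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
for schema in schemas/*/*/*.bq; do
    namespace=$(echo "$schema" | cut -d/ -f2)
    base=$(basename "$schema" .bq)          # e.g. main.4
    table="${base%.*}_v${base##*.}"         # e.g. main_v4, matching the --output template below
    bq mk "$namespace" 2>/dev/null || true  # create the dataset if it does not already exist
    bq mk --table "$namespace.$table" "$schema"
done
cd ..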

Building the project

Follow the instructions in the project README. Here is a quick reference for running a job over a set of files in GCS.

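# Authenticate with the service-account key created during project setup.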
export GOOGLE_APPLICATION_CREDENTIALS=keys.json
PROJECT=$(gcloud config get-value project)
BUCKET="gs://$PROJECT"

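# Run the BigQuery sink over every newline-delimited JSON file under $BUCKET/data.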
path="$BUCKET/data/*.ndjson"
mvn compile exec:java -Dexec.args="\
    --runner=Dataflow \
    --project=$PROJECT \
    --autoscalingAlgorithm=NONE \
    --workerMachineType=n1-standard-1 \
    --numWorkers=1 \
    --gcpTempLocation=$BUCKET/tmp \
    --inputFileFormat=json \
    --inputType=file \
    --input=$path \
    --outputType=bigquery \
    --output=$PROJECT:\${document_namespace}.\${document_type}_v\${document_version} \
    --bqWriteMethod=file_loads \
    --tempLocation=$BUCKET/temp/bq-loads \
    --errorOutputType=file \
    --errorOutput=$BUCKET/error/ \
"