<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Saoussen’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://saoussenchaabnia.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png</url><title>Saoussen’s Substack</title><link>https://saoussenchaabnia.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 09 Jun 2026 01:13:23 GMT</lastBuildDate><atom:link href="https://saoussenchaabnia.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Saoussen CHAABNIA]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[saoussenchaabnia@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[saoussenchaabnia@substack.com]]></itunes:email><itunes:name><![CDATA[Saoussen CHAABNIA]]></itunes:name></itunes:owner><itunes:author><![CDATA[Saoussen CHAABNIA]]></itunes:author><googleplay:owner><![CDATA[saoussenchaabnia@substack.com]]></googleplay:owner><googleplay:email><![CDATA[saoussenchaabnia@substack.com]]></googleplay:email><googleplay:author><![CDATA[Saoussen CHAABNIA]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Production-Ready MLOps on GCP Part 8: Model Monitoring & Continuous Training]]></title><description><![CDATA[Part 8 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-e8f</link><guid 
isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-e8f</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:16:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a> (You are here)</p></li></ul><h2><strong>Introduction</strong></h2><p>In the previous article, we automated the entire development workflow with CI/CD. But production ML has one more critical challenge: <strong>models degrade over time</strong>.</p><p>Your model was trained on January data. It&#8217;s now November. User behavior has changed. Payment methods shifted. Routes evolved. Your model&#8217;s accuracy is silently degrading.</p><p>This final article covers:</p><ul><li><p>Event-driven continuous training (automatic retraining)</p></li><li><p>Scheduled retraining patterns</p></li><li><p>Production orchestration patterns</p></li><li><p>Observability and debugging</p></li><li><p>Cost management</p></li><li><p>Responding to model degradation</p></li></ul><p>By the end, you&#8217;ll know how to keep models fresh and accurate in production.</p><h2><strong>The Model Degradation Problem</strong></h2><p><strong>Scenario</strong>: Your taxi fare prediction model</p><pre><code>January (training data):
  - Average trip: 5.2 miles
  - Payment: 60% credit, 40% cash
  - Peak hour: 8 AM
  - Model RMSE: 2.5
November (production data):
  - Average trip: 7.8 miles (+50%!)
  - Payment: 75% credit, 25% cash
  - Peak hour: 9 AM
  - Model RMSE: 3.8 (+52% worse!)</code></pre><p><strong>Without monitoring</strong>: You don&#8217;t notice until customers complain. <strong>With monitoring</strong>: Automatic alerts + retraining.</p><h2><strong>Event-Driven Continuous Training</strong></h2><p><strong>Goal</strong>: New data arrives &#8594; automatically retrain model.</p><h3><strong>Cloud Run Function Trigger</strong></h3><pre><code># Cloud Run Function (simplified)
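# Hypothetical helper, not part of the original project: a minimal sketch of how
# trigger_training_pipeline (called below) might submit a run with the Vertex AI SDK.
# The project, region, pipeline root, display name and the "use_latest_data"
# pipeline parameter are assumptions for illustration only.
def trigger_training_pipeline(template_path, enable_caching, use_latest_data,
                              project="my-project", location="us-central1",
                              pipeline_root="gs://my-project-pl-root"):
    # Imported lazily; requires the google-cloud-aiplatform package
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    job = aiplatform.PipelineJob(
        display_name="event-triggered-training",
        template_path=template_path,
        pipeline_root=pipeline_root,
        parameter_values={"use_latest_data": use_latest_data},
        enable_caching=enable_caching,
    )
    job.submit()  # returns without waiting for the pipeline to finish
    return job
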
def mlops_entrypoint(event, context):
    """Triggered when new data arrives in BigQuery."""
    # Parse event
    dataset_id = event['protoPayload']['resourceName']
    # Check if significant new data
    if should_retrain(dataset_id):
        # Trigger training pipeline
        trigger_training_pipeline(
            template_path="gs://.../taxifare-training-pipeline:latest",
            enable_caching=False,
            use_latest_data=True
        )
    return "OK"</code></pre><h3><strong>Trigger Conditions</strong></h3><pre><code>def should_retrain(dataset_id):
    # Option 1: Time-based
    if hours_since_last_training() &gt; 24:
        return True
    # Option 2: Data volume-based
    if new_rows_since_last_training() &gt; 100000:
        return True
    # Option 3: Performance-based (requires ground truth)
    if recent_rmse() &gt; champion_rmse * 1.1:
        return True
    return False</code></pre><h3><strong>Event Flow</strong></h3><pre><code>New BigQuery Rows
       &#8595;
Pub/Sub Message
       &#8595;
Cloud Run Function
       &#8595;
(Decision: Should retrain?)
       &#8595;
Trigger Training Pipeline
       &#8595;
Train &#8594; Evaluate &#8594; Compare &#8594; Upload (if better)
       &#8595;
Pub/Sub Notification (pipeline complete)
       &#8595;
Cloud Run Function
       &#8595;
Trigger Prediction Pipeline (use new model)</code></pre><h3><strong>Configuration</strong></h3><pre><code># Terraform: Event trigger setup
resource "google_eventarc_trigger" "bigquery_insert_trigger" {
  name     = "bigquery-data-insert"
  location = var.region
  matching_criteria {
    attribute = "type"
    value     = "google.cloud.bigquery.dataset.v1.dataInserted"
  }
  destination {
    cloud_run_service {
      service = google_cloud_run_service.mlops_trigger.name
      region  = var.region
    }
  }
}</code></pre><h2><strong>Scheduled Retraining</strong></h2><p>For predictable retraining (e.g., weekly):</p><h2><strong>Vertex AI Pipeline Schedule Setup</strong></h2><pre><code># Create weekly training schedule using Vertex AI Pipeline Schedules
poetry run python -m pipelines.utils.schedule_pipeline \
  --pipeline_type=training \
  --template_path=https://us-central1-kfp.pkg.dev/my-project/mlops-pipeline-repo/taxifare-training-pipeline/latest \
  --pipeline_root=gs://my-project-pl-root \
  --display_name=prod-training-pipeline \
  --schedule_name=prod-training-schedule \
  --cron="0 2 * * 0" \
  --enable_caching=false \
  --use_latest_data=true</code></pre><h3><strong>Common Schedules</strong></h3><pre><code># Daily at 2 AM
--schedule="0 2 * * *"
# Weekly on Sunday at 2 AM
--schedule="0 2 * * 0"
# Monthly on 1st at 2 AM
--schedule="0 2 1 * *"
# Every 6 hours
--schedule="0 */6 * * *"</code></pre><h3><strong>Scheduled Pipeline Parameters</strong></h3><pre><code># Schedule with parameters
poetry run python -m pipelines.utils.schedule_pipeline \
  --project=my-prod-project \
  --location=us-central1 \
  --pipeline_template_path=gs://.../training:latest \
  --schedule="0 2 * * 0" \
  --parameters='{
    "use_latest_data": true,
    "enable_caching": false,
    "model_name": "taxi-traffic-model"
  }'</code></pre><h2><strong>Production Orchestration Patterns</strong></h2><h3><strong>Pattern 1: Scheduled Training &#8594; Automatic Prediction</strong></h3><pre><code>Vertex AI Pipeline Schedule (weekly)
    &#8595;
Training Pipeline
    &#8595;
(If new champion)
    &#8595;
Trigger Prediction Pipeline
    &#8595;
Generate predictions for next week</code></pre><p><strong>Use case</strong>: Weekly batch predictions for business planning</p><p><strong>Implementation</strong>: This pattern is achieved through the Cloud Run Function that listens for training pipeline completion events via Pub/Sub, then triggers the prediction pipeline if a new champion model was uploaded.</p><h3><strong>Pattern 2: New Data &#8594; Continuous Training</strong></h3><pre><code>New Data Arrives (hourly)
    &#8595;
Cloud Run Function
    &#8595;
(Check: Enough new data?)
    &#8595;
Training Pipeline
    &#8595;
(Champion/Challenger comparison)
    &#8595;
Model Registry (update champion if better)</code></pre><p><strong>Use case</strong>: Always have the freshest model</p><p><strong>Implementation</strong>: The Cloud Run Function (<code>terraform/modules/cloudrunfunction/src/main.py</code>) triggers pipelines based on Pub/Sub events:</p><pre><code>@functions_framework.cloud_event
def mlops_entrypoint(event):
    pipeline_config = os.getenv("PIPELINE_CONFIG")
    pipeline_config_dict = json.loads(pipeline_config)
    submit_pipeline_job(pipeline_config_dict)</code></pre><p>The function reads configuration from environment variables and submits the appropriate pipeline job (training or prediction) to Vertex AI.</p><h3><strong>Pattern 3: Event-Driven Training via Cloud Run Function</strong></h3><p>The Cloud Run Function can be triggered by various events (Pub/Sub, Cloud Storage, etc.) to automatically start training or prediction pipelines. The actual trigger mechanism is configured in Terraform (<code>terraform/modules/cloudrunfunction/</code>) and the function logic handles pipeline submission to Vertex AI.</p><h2><strong>Observability and Debugging</strong></h2><h2><strong>Key Metrics to Monitor</strong></h2><p><strong>1. Model Performance</strong>: View in Vertex AI Model Registry:</p><ul><li><p>RMSE, MAE, MAPE, MSLE metrics for each model version</p></li><li><p>Comparison between champion and challenger models</p></li><li><p>Evaluation results from test set</p></li></ul><p><strong>2. Data Skew</strong>: Monitored by the <code>model_batch_predict_op</code> component in the prediction pipeline:</p><ul><li><p>Training vs. serving feature distributions</p></li><li><p>Skew detection thresholds configured per feature</p></li><li><p>Automatic email alerts when skew exceeds threshold</p></li><li><p>Metrics logged to Cloud Logging</p></li></ul><p><strong>3. Pipeline Execution</strong>: Track in Vertex AI Pipelines console:</p><ul><li><p>Pipeline success/failure rates</p></li><li><p>Component execution times</p></li><li><p>Resource utilization</p></li><li><p>Error logs and stack traces</p></li></ul><p><strong>4. Training Frequency</strong>: Monitor via Vertex AI Pipeline Schedules:</p><ul><li><p>Scheduled run frequency (hourly, daily, weekly)</p></li><li><p>Manual vs. 
automatic triggers</p></li><li><p>Champion model update frequency</p></li></ul><h2><strong>Cloud Logging Queries</strong></h2><p><strong>Find training triggers</strong>:</p><pre><code>resource.type="cloud_run_revision"
resource.labels.service_name="mlops-trigger"
jsonPayload.message="Triggering training pipeline"</code></pre><p><strong>Find champion promotions</strong>:</p><pre><code>resource.type="aiplatform.googleapis.com/PipelineJob"
jsonPayload.message=~"Challenger wins"</code></pre><p><strong>Find skew detections</strong>:</p><pre><code>resource.type="aiplatform.googleapis.com/BatchPredictionJob"
jsonPayload.skew_detected=true</code></pre><h2><strong>Dashboards</strong></h2><p>Create Cloud Monitoring dashboards to visualize:</p><ul><li><p>Vertex AI Pipeline execution success rates and durations</p></li><li><p>Model evaluation metrics from the Model Registry</p></li><li><p>Cloud Run Function invocation counts and errors</p></li><li><p>BigQuery job statistics for data processing steps</p></li><li><p>Skew detection alerts from batch prediction jobs</p></li></ul><h2><strong>Responding to Model Degradation</strong></h2><h3><strong>Alert &#8594; Investigate &#8594; Retrain Workflow</strong></h3><p><strong>1. Receive Alert</strong>: When the prediction pipeline&#8217;s <code>model_batch_predict_op</code> component detects data skew, it sends an email alert configured in the component parameters.</p><p><strong>2. Investigate</strong>:</p><ul><li><p>Check Vertex AI Pipelines console for skew detection details</p></li><li><p>Review Cloud Logging for skew metrics and feature distributions</p></li><li><p>Compare recent prediction data against training dataset in BigQuery</p></li></ul><p><strong>3. Trigger Retraining</strong>: Manually trigger the training pipeline with latest data:</p><pre><code>cd pipelines
poetry run python -m pipelines.utils.trigger_pipeline \
  --template_path=./taxifare-training-pipeline.yaml \
  --display_name=manual-retrain-pipeline \
  --enable_caching=false \
  --use_latest_data=true</code></pre><p>Or use the Makefile shortcut:</p><pre><code>make training build=false enable_caching=false use_latest_data=true</code></pre><p><strong>4. Validate Improvement</strong>:</p><ul><li><p>Check Vertex AI Model Registry for new model metrics</p></li><li><p>Compare RMSE between old champion and new model</p></li><li><p>The pipeline&#8217;s champion/challenger logic automatically promotes better models</p></li></ul><h2><strong>Optimization Strategies</strong></h2><p><strong>1. Use Pipeline Caching</strong>:</p><pre><code># Enable caching for preprocessing steps that don&#8217;t change
make training enable_caching=true</code></pre><p><strong>2. Adjust Training Schedule</strong>: Configure pipeline schedules based on data velocity:</p><ul><li><p>High-frequency data: Daily training</p></li><li><p>Stable data: Weekly training</p></li><li><p>Monitor skew alerts to determine optimal frequency</p></li></ul><p><strong>3. Right-Size Training Resources</strong>: Configure machine types in <code>get_workerpool_spec_op</code> component based on dataset size and model complexity.</p><p><strong>4. Clean Up Old Artifacts</strong>: Regularly manage artifacts in:</p><ul><li><p>Artifact Registry (old pipeline versions and Docker images)</p></li><li><p>Vertex AI Model Registry (non-champion model versions)</p></li><li><p>Cloud Storage (old pipeline artifacts and outputs)</p></li></ul><p><strong>5. Optimize BigQuery Costs</strong>: The preprocessing SQL queries (<code>ingest.sql</code>, <code>ingest_pred.sql</code>) are optimized to:</p><ul><li><p>Filter data early in the query</p></li><li><p>Use partitioning when available</p></li><li><p>Limit data scanned with timestamps</p></li></ul><h2><strong>Best Practices</strong></h2><h3><strong>1. Always Version Everything</strong></h3><p>The system automatically versions:</p><ul><li><p><strong>Models</strong>: Stored in Vertex AI Model Registry with version numbers</p></li><li><p><strong>Pipelines</strong>: Tagged in Artifact Registry (e.g., <code>v1.2.3</code>, <code>latest</code>)</p></li><li><p><strong>Docker images</strong>: Tagged in Artifact Registry matching pipeline versions</p></li><li><p><strong>Training data</strong>: Timestamped via <code>timestamp</code> parameter in pipeline runs</p></li></ul><h3><strong>2. 
Use Champion/Challenger Pattern</strong></h3><p>Implemented in the <code>upload_best_model_op</code> component:</p><ul><li><p>New models are only promoted if they beat the current champion</p></li><li><p>RMSE comparison happens automatically during training pipeline</p></li><li><p>All models are preserved in registry for rollback capability</p></li></ul><h3><strong>3. Monitor Before Optimizing</strong></h3><pre><code>1. Deploy with monitoring
2. Observe for 1 week
3. Identify bottlenecks
4. Optimize selectively
5. Measure improvement</code></pre><h3><strong>4. Set Up Alerts Thoughtfully</strong></h3><pre><code># Bad: Alert on every small change
if rmse &gt; baseline_rmse * 1.01:
    alert()
# Good: Alert on sustained degradation
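# Hypothetical sketch, not from the original project: one way rolling_avg_rmse
# could work, assuming one RMSE value per day is stored in a dict keyed by date.
from datetime import date, timedelta

DAILY_RMSE = {}  # assumed store, e.g. {date(2024, 1, 15): 2.4}, filled by an evaluation job

def rolling_avg_rmse(days=7, today=None, history=None):
    """Average the daily RMSE values recorded over the last `days` days."""
    history = DAILY_RMSE if history is None else history
    today = today or date.today()
    recent_days = [today - timedelta(days=offset) for offset in range(days)]
    window = [history[day] for day in recent_days if day in history]
    if not window:
        raise ValueError("no RMSE values recorded in the window")
    return sum(window) / len(window)

# Averaging over a week means a single noisy day does not trigger an alert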
if rolling_avg_rmse(days=7) &gt; baseline_rmse * 1.15:
    alert()</code></pre><h3><strong>5. Document Retraining Decisions</strong></h3><pre><code>## Retraining Log
### 2024-01-15
- Trigger: Scheduled weekly retrain
- Data: 2024-01-08 to 2024-01-15
- Result: New model RMSE 2.3 (vs champion 2.5) &#8594; Promoted
- Notes: Improved accuracy on credit card payments
### 2024-01-22
- Trigger: Accuracy degradation alert
- Data: 2024-01-15 to 2024-01-22
- Result: New model RMSE 2.8 (vs champion 2.3) &#8594; Not promoted
- Notes: Holiday data anomaly, monitoring</code></pre><h2><strong>Conclusion: Your Complete MLOps System</strong></h2><p>You&#8217;ve now built a <strong>complete, production-ready MLOps system</strong> across 8 articles:</p><ol><li><p><strong>Architecture</strong> &#8594; Multi-environment design</p></li><li><p><strong>Developer Experience</strong> &#8594; Productive workflows</p></li><li><p><strong>Infrastructure</strong> &#8594; Automated with Terraform</p></li><li><p><strong>Components</strong> &#8594; Modular and reusable</p></li><li><p><strong>Training</strong> &#8594; Sophisticated pipeline</p></li><li><p><strong>Prediction</strong> &#8594; Scalable inference</p></li><li><p><strong>CI/CD</strong> &#8594; Complete automation</p></li><li><p><strong>Operations</strong> &#8594; Continuous improvement</p></li></ol><p>Your system now:</p><ul><li><p>Trains models automatically</p></li><li><p>Deploys only better models</p></li><li><p>Generates predictions at scale</p></li><li><p>Monitors for degradation</p></li><li><p>Retrains when needed</p></li><li><p>Maintains itself</p></li></ul><p><strong>What&#8217;s next?</strong></p><ul><li><p>Implement it for your use case</p></li><li><p>Customize for your data</p></li><li><p>Extend with new features</p></li><li><p>Share learnings with the community</p></li></ul><p>Thank you for following this comprehensive series!</p><p>Now go build amazing, self-maintaining ML systems! &#128640;&#127881;</p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 7: CI/CD for ML]]></title><description><![CDATA[Part 7 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-9c6</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-9c6</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:14:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y6GF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a> (You are here)</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>In the previous article, we built a sophisticated training pipeline that goes from raw data to a production-ready model in Vertex AI Model Registry. But there&#8217;s a problem: if you&#8217;re manually building containers, compiling pipelines, and deploying infrastructure whenever code changes, you&#8217;ll spend more time on plumbing than on improving your models.</p><p>This is where <strong>CI/CD (Continuous Integration / Continuous Deployment)</strong> transforms MLOps from manual and error-prone to automated and reliable.</p><p>But ML CI/CD isn&#8217;t just copying traditional software CI/CD. 
ML systems have unique challenges:</p><ul><li><p><strong>Long-running jobs</strong>: Training can take hours (not seconds like unit tests)</p></li><li><p><strong>Non-determinism</strong>: Models trained on same data can differ slightly</p></li><li><p><strong>Multi-artifact deployments</strong>: Code + data + models + infrastructure</p></li><li><p><strong>Multiple environments</strong>: Dev, test, prod with different data</p></li><li><p><strong>Integration testing</strong>: Need actual cloud resources (expensive!)</p></li></ul><p>In this article, we&#8217;ll explore:</p><ul><li><p>Our 6 Cloud Build CI/CD pipelines</p></li><li><p>Testing strategies for ML (unit, integration, E2E)</p></li><li><p>Infrastructure automation with Terraform</p></li><li><p>Release management and versioning</p></li><li><p>Development workflow from PR to production</p></li></ul><p>By the end, you&#8217;ll understand how to build CI/CD that makes deploying ML as reliable as deploying traditional software.</p><h2><strong>CI/CD Architecture Overview</strong></h2><p>Our CI/CD runs entirely in an <strong>admin GCP project</strong> separate from dev/test/prod:</p><pre><code>GitHub Repository
       |
       | (webhook on PR/merge/tag)
       &#8595;
Admin Project - Cloud Build
       |
       &#9500;&#9472;&#9472;&gt; PR Checks (on pull request)
       &#9500;&#9472;&#9472;&gt; E2E Tests (on /gcbrun comment)
       &#9500;&#9472;&#9472;&gt; Terraform Plan (on PR affecting terraform/)
       &#9500;&#9472;&#9472;&gt; Terraform Apply (on merge to main)
       &#9500;&#9472;&#9472;&gt; Release (on git tag)
       &#9492;&#9472;&#9472;&gt; Schedule Pipelines (manual trigger)
       |
       &#8595;
Deploy to: Dev / Test / Prod Projects</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y6GF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y6GF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 424w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 848w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 1272w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y6GF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png" width="784" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!y6GF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 424w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 848w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 1272w, https://substackcdn.com/image/fetch/$s_!y6GF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ffe58c1-beb0-4828-9135-72bca634cb34_784x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Complete CI/CD workflow from code commit to production deployment</p><p><strong>Why a separate admin project?</strong></p><ul><li><p><strong>Security</strong>: Cloud Build has permissions to deploy to all environments</p></li><li><p><strong>Isolation</strong>: CI/CD failures don&#8217;t affect production workloads</p></li><li><p><strong>Auditing</strong>: All deployments tracked in one place</p></li><li><p><strong>Cost tracking</strong>: Separate billing for CI/CD</p></li></ul><h2><strong>The 6 Cloud Build Pipelines</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tnf5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Tnf5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 424w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 848w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 1272w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tnf5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png" width="788" height="366" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!Tnf5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 424w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 848w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 1272w, https://substackcdn.com/image/fetch/$s_!Tnf5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f16af13-010e-437d-ad59-6cafca87fd9d_788x366.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>1. PR Checks (</strong><code>pr-checks.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: Every pull request <strong>Purpose</strong>: Fast feedback on code quality</p><pre><code>steps:
  - name: python:3.10.14
    entrypoint: bash  # the -c script below needs a shell entrypoint
    args:
      - -c
      - |
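        # Assumption (not in the original article): abort on the first failing
        # command so an early failure is not masked by a later success.
        set -e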
        # Install Poetry
        curl -sSL https://install.python-poetry.org | python3 -
        export PATH=&quot;/builder/home/.local/bin:$$PATH&quot;
        # Install dependencies
        make install
        # Git init for pre-commit
        git init &amp;&amp; git add .
        # Compile pipelines (validates syntax)
        make compile pipeline=training
        make compile pipeline=prediction
        # Run unit tests
        make test-components
        make test-pipelines
timeout: 5400s  # 90 minutes</code></pre><p><strong>What gets checked</strong>:</p><ul><li><p>Code quality: flake8, black, ruff (via pre-commit hooks)</p></li><li><p>Pipeline syntax: Can KFP compile the pipelines?</p></li><li><p>Component logic: Do unit tests pass?</p></li><li><p>Pipeline logic: Do pipeline tests pass?</p></li></ul><p><strong>Example failure</strong>:</p><pre><code>&#10060; flake8: line too long (E501)
   File: components/src/components/upload_best_model_op.py
   Line 45: from google.cloud.aiplatform_v1 import ModelEvaluation, ModelServiceClient</code></pre><p>This catches issues <strong>before they&#8217;re merged</strong>, saving time and preventing broken main branches.</p><p><strong>Developer experience</strong>:</p><ol><li><p>Open PR</p></li><li><p>Cloud Build automatically runs checks</p></li><li><p>Results appear as GitHub check (&#9989; or &#10060;)</p></li><li><p>Fix issues if needed</p></li><li><p>Merge when all checks pass</p></li></ol><h3><strong>2. E2E Tests (</strong><code>e2e-test.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: Comment <code>/gcbrun</code> on PR <strong>Purpose</strong>: Validate that pipelines actually run in Vertex AI</p><pre><code>steps:
  # Build training container
  - id: build-training-image
    name: gcr.io/cloud-builders/docker
    dir: model
    args: [
      &#39;build&#39;,
      &#39;-t&#39;, &#39;${_TEST_VERTEX_LOCATION}-docker.pkg.dev/...&#39;,
      &#39;.&#39;
    ]
  # Push to Artifact Registry
  - id: push-training-image
    name: gcr.io/cloud-builders/docker
    args: [&#39;push&#39;, &#39;...&#39;]
  # Run E2E tests
  - id: e2e-tests
    name: python:3.10.14
    entrypoint: bash  # the -c script below needs a shell entrypoint
    args:
      - -c
      - |
        make install
        export TRAINING_IMAGE=...
        make e2e-tests pipeline=training
        make e2e-tests pipeline=prediction
timeout: 18000s  # 5 hours (allows pipelines to run)</code></pre><p><strong>What gets tested</strong>:</p><ol><li><p><strong>Build</strong>: Can the training container build successfully?</p></li><li><p><strong>Training pipeline</strong>: Does it run end-to-end in Vertex AI?</p></li><li><p><strong>Prediction pipeline</strong>: Does it produce predictions?</p></li><li><p><strong>Artifacts</strong>: Are models uploaded? Predictions generated?</p></li></ol><p><strong>Why manual trigger?</strong></p><ul><li><p>E2E tests are expensive (VM costs, BigQuery, etc.)</p></li><li><p>E2E tests take hours</p></li><li><p>Not every PR needs E2E testing</p></li><li><p>Developer decides when to run</p></li></ul><p><strong>When to run E2E tests</strong>:</p><ul><li><p>&#9989; Before merging major changes</p></li><li><p>&#9989; After refactoring pipeline logic</p></li><li><p>&#9989; When adding new components</p></li><li><p>&#10060; For documentation-only changes</p></li><li><p>&#10060; For minor bug fixes</p></li></ul><h3><strong>3. Terraform Plan (</strong><code>terraform-plan.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: PR that modifies <code>terraform/</code> files <strong>Purpose</strong>: Preview infrastructure changes before applying</p><pre><code>steps:
  - name: hashicorp/terraform
    args:
      - init
      - -backend-config=bucket=${_TFSTATE_BUCKET}
  - name: hashicorp/terraform
    args:
      - plan
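      # Assumption (not in the original article): -input=false keeps Terraform
      # from waiting on an interactive prompt inside a CI build.
      - -input=false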
      - -out=tfplan
  - name: hashicorp/terraform
    args:
      - show
      - tfplan</code></pre><p><strong>Example output</strong>:</p><pre><code>Terraform will perform the following actions:
  # google_storage_bucket.new_bucket will be created
  + resource &quot;google_storage_bucket&quot; &quot;new_bucket&quot; {
      + name     = &quot;my-project-new-bucket&quot;
      + location = &quot;us-central1&quot;
    }
  # google_service_account_iam_member.new_permission will be created
  + resource &quot;google_service_account_iam_member&quot; &quot;new_permission&quot; {
      + role               = &quot;roles/storage.objectViewer&quot;
      + service_account_id = &quot;projects/.../serviceAccounts/vertex-pipelines@...&quot;
    }
Plan: 2 to add, 0 to change, 0 to destroy.</code></pre><p><strong>Why this matters</strong>:</p><ul><li><p>Prevents accidental deletions</p></li><li><p>Makes infrastructure changes visible to reviewers</p></li><li><p>Enables discussion before changes are applied</p></li><li><p>Catches Terraform syntax errors</p></li></ul><p><strong>Separate triggers for each environment</strong>:</p><ul><li><p><code>terraform-plan-dev.yaml</code></p></li><li><p><code>terraform-plan-test.yaml</code></p></li><li><p><code>terraform-plan-prod.yaml</code></p></li></ul><p>This allows environment-specific infrastructure changes.</p><h3><strong>4. Terraform Apply (</strong><code>terraform-apply.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: Merge to <code>main</code> branch <strong>Purpose</strong>: Actually deploy infrastructure changes</p><pre><code>steps:
  - name: hashicorp/terraform
    args:
      - init
      - -backend-config=bucket=${_TFSTATE_BUCKET}
  - name: hashicorp/terraform
    args:
      - apply
      - -auto-approve</code></pre><p><strong>Deployment order</strong>:</p><ol><li><p>Dev environment (lowest risk)</p></li><li><p>Test environment (validate before prod)</p></li><li><p>Prod environment (final deployment)</p></li></ol><p><strong>Safety mechanisms</strong>:</p><ul><li><p>Terraform plan must have been reviewed in PR</p></li><li><p>State is locked in GCS (prevents concurrent applies)</p></li><li><p>Separate triggers prevent accidental prod deployment</p></li><li><p>Cloud Build logs create audit trail</p></li></ul><h3><strong>5. Release (</strong><code>release.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: Git tag (e.g., <code>v1.2.3</code>) <strong>Purpose</strong>: Build and push versioned artifacts to all environments</p><pre><code>steps:
  # Build Docker image
  - id: build-container-images
    name: gcr.io/cloud-builders/docker
    entrypoint: bash  # the -c script below needs a shell entrypoint
    args:
      - -c
      - |
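        # Assumption (not in the original article): stop the loop as soon as a
        # build, tag, or push fails, so a partial release is reported as failed.
        set -e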
        docker build -t ${_IMAGE_NAME}:latest .
        for proj in ${_DESTINATION_PROJECTS} ; do
          docker tag ${_IMAGE_NAME}:latest \
            .../${proj}/mlops-docker-repo/${_IMAGE_NAME}:${TAG_NAME}
          docker push \
            .../${proj}/mlops-docker-repo/${_IMAGE_NAME}:${TAG_NAME}
        done
  # Compile and upload pipelines
  - id: compile-and-publish-pipelines
    name: python:3.10.14
    entrypoint: bash  # the -c script below needs a shell entrypoint
    args:
      - -c
      - |
        make install
        for proj in ${_DESTINATION_PROJECTS} ; do
          export TRAINING_IMAGE=.../${proj}/.../training:${TAG_NAME}
          make compile pipeline=training
          make compile pipeline=prediction
          # Upload to Artifact Registry
          poetry run python -m pipelines.utils.upload_pipeline \
            --template_path=taxifare-training-pipeline.yaml \
            --tag=latest \
            --tag=${TAG_NAME}
        done
timeout: 1800s  # 30 minutes</code></pre><p><strong>Artifacts created</strong> (for each environment):</p><ol><li><p><strong>Docker image</strong>: <code>training:v1.2.3</code></p></li><li><p><strong>Training pipeline</strong>: <code>taxifare-training-pipeline:v1.2.3</code></p></li><li><p><strong>Prediction pipeline</strong>: <code>taxifare-prediction-pipeline:v1.2.3</code></p></li></ol><p><strong>Tagging strategy</strong>:</p><ul><li><p><code>latest</code>: Always points to most recent release</p></li><li><p><code>v1.2.3</code>: Specific version for rollback</p></li></ul><p><strong>Release workflow</strong>:</p><pre><code># Create and push git tag
git tag -a v1.2.3 -m &quot;Release 1.2.3: Improved model accuracy&quot;
git push origin v1.2.3
# Cloud Build automatically:
# 1. Builds Docker images
# 2. Compiles pipelines
# 3. Pushes to all environments (dev/test/prod)</code></pre><h3><strong>6. Schedule Pipelines (</strong><code>schedule-pipelines.yaml</code><strong>)</strong></h3><p><strong>Trigger</strong>: Manual <strong>Purpose</strong>: Create Vertex AI Pipeline Schedules for periodic retraining</p><pre><code>steps:
  - name: python:3.10.14
    entrypoint: bash  # the -c script below needs a shell entrypoint
    args:
      - -c
      - |
        poetry run python -m pipelines.utils.schedule_pipeline \
          --project=${_VERTEX_PROJECT_ID} \
          --location=${_VERTEX_LOCATION} \
          --pipeline_template_path=${_TRAINING_TEMPLATE_PATH} \
          --schedule=&quot;0 2 * * 0&quot;  # Every Sunday at 2 AM</code></pre><p><strong>Use cases</strong>:</p><ul><li><p>Weekly model retraining in production</p></li><li><p>Daily retraining in test environment</p></li><li><p>Monthly full data refresh</p></li></ul><h2><strong>Testing Strategies for ML</strong></h2><p>ML testing requires multiple levels:</p><h3><strong>Level 1: Unit Tests</strong></h3><p>Test individual component logic in isolation:</p><pre><code># tests/test_upload_best_model_op.py
def test_champion_wins(mock_model_class, tmp_path):
    &quot;&quot;&quot;Test that champion is preserved when it&#39;s better.&quot;&quot;&quot;
    # Mock champion with RMSE=0.8
    mock_champion = create_mock_model(rmse=0.8)
    mock_model_class.list.return_value = [mock_champion]
    # Create challenger with worse RMSE=0.9
    challenger_metrics = {&quot;rmse&quot;: 0.9}
    # Call component function
    upload_model(
        model_eval_metrics=challenger_metrics,
        eval_metric=&quot;rmse&quot;,
        eval_lower_is_better=True,
        # ...
    )
    # Assert challenger uploaded but NOT as default
    mock_model_class.upload.assert_called_once_with(
        is_default_version=False  # Champion preserved!
    )</code></pre><p><strong>Benefits</strong>:</p><ul><li><p>Fast (milliseconds)</p></li><li><p>Free (no cloud resources)</p></li><li><p>Run on every commit</p></li></ul><h3><strong>Level 2: Pipeline Compilation Tests</strong></h3><p>Ensure pipelines can compile to YAML:</p><pre><code>def test_training_pipeline_compiles():
    &quot;&quot;&quot;Validate training pipeline compiles without errors.&quot;&quot;&quot;
    from kfp import compiler
    from pipelines.training import pipeline
    compiler.Compiler().compile(
        pipeline_func=pipeline,
        package_path=&quot;test_training_pipeline.yaml&quot;
    )
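    # Assumption (not in the original article): a small extra check that the
    # compiler actually wrote a non-empty file at package_path.
    import os
    assert os.path.getsize(&quot;test_training_pipeline.yaml&quot;) &gt; 0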
    # If this doesn&#8217;t raise, compilation succeeded</code></pre><p><strong>Catches</strong>:</p><ul><li><p>Syntax errors in pipeline definition</p></li><li><p>Missing component imports</p></li><li><p>Incorrect input/output connections</p></li></ul><h3><strong>Level 3: End-to-End Tests</strong></h3><p>Run actual pipelines in dev environment:</p><pre><code># Build training container
docker build -t us-central1-docker.pkg.dev/my-dev-project/mlops-docker-repo/training:test ./model
docker push us-central1-docker.pkg.dev/my-dev-project/mlops-docker-repo/training:test
# Run training pipeline E2E
poetry run python -m pipelines.utils.run_pipeline \
  --pipeline=training \
  --project=my-dev-project \
  --location=us-central1 \
  --enable_caching=false
# Verify outputs
# - Model exists in Model Registry?
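#   Sketch (assumption, not in the original article): list registered models
#   in the dev project and look for the newly uploaded one.
gcloud ai models list --project=my-dev-project --region=us-central1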
# - Evaluation metrics logged?
# - Champion comparison executed?</code></pre><p><strong>What E2E tests catch</strong>:</p><ul><li><p>IAM permission issues</p></li><li><p>API enablement problems</p></li><li><p>Resource quota limits</p></li><li><p>Real data quality issues</p></li><li><p>Training convergence problems</p></li></ul><h2><strong>Pre-commit Hooks: Local Quality Gates</strong></h2><p>Before code even reaches CI, pre-commit hooks catch issues:</p><pre><code># .pre-commit-config.yaml
repos:
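  # pre-commit requires a rev for every repo; the pins below are example
  # values (an assumption), so use the versions your project standardizes on.
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
        args: [--max-line-length=100]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff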
        args: [--fix]</code></pre><p><strong>Developer workflow</strong>:</p><pre><code># Install hooks once
cd pipelines &amp;&amp; poetry run pre-commit install
# Hooks run automatically on git commit
git add components/src/components/my_component.py
git commit -m &quot;Add new component&quot;
# Pre-commit runs:
#   - Removes trailing whitespace
#   - Formats code with black
#   - Checks code quality with flake8
#   - Auto-fixes issues with ruff
# Commit proceeds only if all hooks pass</code></pre><p><strong>Benefits</strong>:</p><ul><li><p>Instant feedback (don&#8217;t wait for CI)</p></li><li><p>Consistent code style across team</p></li><li><p>Catches common issues before PR</p></li></ul><h2><strong>Complete Workflow: From Code to Production</strong></h2><p>Let&#8217;s walk through a complete development cycle, from feature development to production deployment:</p><h3><strong>1. Feature Development</strong></h3><pre><code># Create feature branch
git checkout -b feature/improve-preprocessing
# Make changes
vim pipelines/src/pipelines/queries/ingest.sql
# Run tests locally
make test-pipelines
# Commit (pre-commit hooks run)
git add pipelines/
git commit -m &quot;Improve feature engineering in preprocessing&quot;</code></pre><h3><strong>2. Pull Request</strong></h3><pre><code># Push branch
git push origin feature/improve-preprocessing
# Open PR on GitHub
gh pr create --title &quot;Improve preprocessing&quot; --body &quot;Add speed features&quot;</code></pre><p><strong>Automatic triggers</strong>:</p><ul><li><p>PR Checks: Linting, tests, compilation</p></li><li><p>Terraform Plan (if infrastructure changed)</p></li></ul><h3><strong>3. Code Review</strong></h3><p>Reviewer sees:</p><ul><li><p>Code changes</p></li><li><p>PR check results (all passing)</p></li><li><p>Terraform plan (if applicable)</p></li></ul><p>Reviewer can request:</p><pre><code>Could you run E2E tests to validate this works end-to-end?
Comment /gcbrun to trigger</code></pre><p>Developer comments <code>/gcbrun</code> &#8594; E2E tests run</p><h3><strong>4. Merge</strong></h3><p>Once approved and checks pass:</p><pre><code>gh pr merge --squash</code></pre><p><strong>Automatic triggers</strong>:</p><ul><li><p>Terraform Apply (if infrastructure changed)</p></li><li><p>Deploy to dev environment</p></li></ul><h3><strong>5. Release</strong></h3><p>When ready for test/prod:</p><pre><code># Create release tag
git tag -a v1.3.0 -m &quot;Release 1.3.0: Improved preprocessing&quot;
git push origin v1.3.0</code></pre><p><strong>Automatic triggers</strong>:</p><ul><li><p>Build Docker images for all environments</p></li><li><p>Compile and upload pipelines</p></li><li><p>Tag with v1.3.0 and latest</p></li></ul><h3><strong>6. Deploy to Test Environment</strong></h3><p>After the release is complete, manually create pipeline schedules in the test then prod environments:</p><p><strong>Option A: Via Cloud Build UI</strong></p><ol><li><p>Go to Cloud Build &#8594; Triggers in admin project</p></li><li><p>Find <code>schedule-pipelines</code> trigger &#8594; Click &#8220;Run&#8221;</p></li><li><p>Provide substitutions</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQqm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQqm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 424w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 848w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 1272w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nQqm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png" width="732" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/546d7830-ce16-4712-9577-421bd7d81213_732x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:732,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nQqm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 424w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 848w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 1272w, https://substackcdn.com/image/fetch/$s_!nQqm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546d7830-ce16-4712-9577-421bd7d81213_732x690.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gy3F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gy3F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 424w, 
https://substackcdn.com/image/fetch/$s_!Gy3F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 848w, https://substackcdn.com/image/fetch/$s_!Gy3F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 1272w, https://substackcdn.com/image/fetch/$s_!Gy3F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gy3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png" width="698" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:698,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Gy3F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 424w, 
https://substackcdn.com/image/fetch/$s_!Gy3F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 848w, https://substackcdn.com/image/fetch/$s_!Gy3F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 1272w, https://substackcdn.com/image/fetch/$s_!Gy3F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54455f1c-6844-4bee-9d7e-593dbda5f00e_698x686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Option B: Via gcloud</strong></p><pre><code>gcloud builds submit \
  --config=cloudbuild/schedule-pipelines.yaml \
  --project=admin-project \
  --substitutions=_ENV=test,_TRAINING_TAG_NAME=v1.3.0,...</code></pre><p>This creates two Vertex AI Pipeline Schedules in the test project:</p><ul><li><p><code>test-training-schedule</code> (runs hourly)</p></li><li><p><code>test-prediction-schedule</code> (runs daily)</p></li></ul><p>Verify: Vertex AI Console &#8594; Pipelines &#8594; Schedules</p><h3><strong>7. Deploy to Production</strong></h3><p>Once validated in test, repeat for production:</p><pre><code>gcloud builds submit \
  --config=cloudbuild/schedule-pipelines.yaml \
  --project=admin-project \
  --substitutions=_ENV=prod,_TRAINING_TAG_NAME=v1.3.0,...</code></pre><p>Creates <code>prod-training-schedule</code> and <code>prod-prediction-schedule</code></p><p><strong>Important</strong>: Schedules are created via Vertex AI&#8217;s PipelineJobSchedule API (not Cloud Scheduler), executed by <code>pipelines/src/pipelines/utils/schedule_pipeline.py</code></p><h2><strong>Event-Driven Execution (Optional)</strong></h2><p>For continuous training triggered by new data, deploy the Cloud Run Function via Terraform:</p><pre><code># In terraform/environments/prod/main.tf
module &quot;cloudrunfunction&quot; {
  source = &quot;../../modules/cloudrunfunction&quot;
  pipeline_config = {
    type                     = &quot;training&quot;
    training_template_path   = &quot;https://.../taxifare-training-pipeline/latest&quot;
    prediction_template_path = &quot;https://.../taxifare-batch-prediction-pipeline/latest&quot;
    # ... other config
  }
  dataset_id = &quot;chicago_taxi_trips&quot;
  table_id   = &quot;taxi_trips&quot;
}</code></pre><p>The function (<code>terraform/modules/cloudrunfunction/src/main.py</code>) triggers pipelines when new data is inserted into BigQuery, providing an alternative to scheduled runs.</p><h2><strong>Artifact Management</strong></h2><h3><strong>Docker Image Versioning</strong></h3><pre><code>us-central1-docker.pkg.dev/my-project/mlops-docker-repo/training:
&#9500;&#9472;&#9472; latest           (points to v1.3.0)
&#9500;&#9472;&#9472; v1.3.0           (current release)
&#9500;&#9472;&#9472; v1.2.3           (previous release)
&#9500;&#9472;&#9472; v1.2.2           (older release)
&#9492;&#9472;&#9472; abc123f          (commit SHAs for testing)</code></pre><p><strong>Tagging strategy</strong>:</p><ul><li><p><code>latest</code>: Production deployments pull this</p></li><li><p><code>v1.2.3</code>: Specific version for reproducibility</p></li><li><p>Commit SHA: E2E testing during PR</p></li></ul><h3><strong>Pipeline Versioning</strong></h3><p>Same strategy for compiled KFP pipelines:</p><pre><code>mlops-pipeline-repo/taxifare-training-pipeline:
&#9500;&#9472;&#9472; latest
&#9500;&#9472;&#9472; v1.3.0
&#9500;&#9472;&#9472; v1.2.3</code></pre><p><strong>Rollback scenario</strong>:</p><pre><code># Something wrong with v1.3.0?
# Submit pipeline with older version
poetry run python -m pipelines.utils.run_pipeline \
  --template_path=https://.../taxifare-training-pipeline:v1.2.3</code></pre><h2><strong>Best Practices</strong></h2><h3><strong>1. Fail Fast</strong></h3><p>Order steps from fastest to slowest:</p><pre><code>steps:
  - Lint (5s)              # Fail here if code quality issues
  - Unit tests (30s)       # Fail here if logic broken
  - Compile (1min)         # Fail here if syntax errors
  - E2E tests (1hr)        # Only run if everything else passes</code></pre><h3><strong>2. Make CI/CD Logs Searchable</strong></h3><pre><code># Structured logging
logging.info(f&quot;component=upload_model status=success model_id={model_id}&quot;)</code></pre><p>Cloud Logging query:</p><pre><code>resource.type=&quot;build&quot;
textPayload:&quot;component=upload_model&quot;
textPayload:&quot;status=success&quot;</code></pre><h3><strong>3. Separate Admin Project</strong></h3><p>Never run CI/CD in the production project:</p><ul><li><p>Security isolation</p></li><li><p>Failure isolation</p></li><li><p>Cost tracking</p></li></ul><h3><strong>4. Use Substitution Variables</strong></h3><pre><code>substitutions:
  _VERTEX_PROJECT_ID: my-project
  _VERTEX_LOCATION: us-central1
# Easy to update, no hardcoded values</code></pre><h3><strong>5. Test Terraform in Dev First</strong></h3><p>Sequence:</p><ol><li><p>Terraform plan/apply in dev</p></li><li><p>Validate resources created</p></li><li><p>Then apply to test</p></li><li><p>Finally apply to prod</p></li></ol><h2><strong>Conclusion</strong></h2><p>CI/CD transforms ML development from manual and error-prone to automated and reliable:</p><ul><li><p><strong>PR Checks</strong>: Catch issues before merge (90s feedback)</p></li><li><p><strong>E2E Tests</strong>: Validate end-to-end functionality (optional, expensive)</p></li><li><p><strong>Terraform Automation</strong>: Infrastructure as code with preview/apply</p></li><li><p><strong>Release Management</strong>: Versioned artifacts across all environments</p></li><li><p><strong>Schedule Pipelines</strong>: Automated retraining setup</p></li></ul><p>With this CI/CD system:</p><ul><li><p>Every commit is tested</p></li><li><p>Every infrastructure change is reviewed</p></li><li><p>Every release is versioned</p></li><li><p>Production deployments are repeatable</p></li></ul><p>In the next article, we&#8217;ll explore production deployment: running predictions, monitoring models, and handling the full production lifecycle.</p><p><strong>Key Takeaways:</strong></p><ul><li><p>Separate admin project for CI/CD security and isolation</p></li><li><p>Six pipelines cover full lifecycle: checks, tests, infrastructure, releases</p></li><li><p>Multi-level testing: unit (fast), compilation (syntax), E2E (real)</p></li><li><p>Artifact versioning enables reproducibility and rollback</p></li><li><p>Pre-commit hooks catch issues before CI</p></li><li><p>Fail fast: run cheapest validations first</p></li></ul><p><strong>Next in Series</strong>: Production ML Deployment: Batch Predictions &amp; Monitoring</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>CI/CD 
Code</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/cloudbuild">Cloud Build configs</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/.pre-commit-config.yaml">Pre-commit config</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/Makefile">Makefile</a></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 6: Prediction Pipeline (From Champion Model to Batch Predictions)]]></title><description><![CDATA[Part 6 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-a6c</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-a6c</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Sun, 08 Feb 2026 16:32:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jfXe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline (You are here)</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>In the <a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">previous article</a>, we built a sophisticated training pipeline that takes raw data and produces a champion model in Vertex 
AI Model Registry. Now it&#8217;s time to <strong>put that model to work</strong> by generating predictions at scale.</p><p>Running predictions in production requires:</p><ul><li><p><strong>Finding the right model</strong>: Always use the current champion</p></li><li><p><strong>Preprocessing consistency</strong>: Apply the same transformations as training</p></li><li><p><strong>Scalability</strong>: Handle millions of predictions efficiently</p></li><li><p><strong>Monitoring</strong>: Detect when data distributions shift</p></li><li><p><strong>Reliability</strong>: Fail fast and fail clearly</p></li></ul><p>In this article, we&#8217;ll explore:</p><ul><li><p>Prediction pipeline architecture and design</p></li><li><p>Complete code walkthrough</p></li><li><p>Batch prediction with BigQuery</p></li><li><p>Model monitoring and skew detection</p></li><li><p>Running predictions in different scenarios</p></li></ul><p>By the end, you&#8217;ll understand how to build a prediction pipeline that reliably serves your trained models.</p><h2><strong>Prediction Pipeline Architecture</strong></h2><p>Our prediction pipeline is simpler than training but equally critical:</p><pre><code>1. Lookup Champion Model (Model Registry)
         &#8595;
2. Preprocess Data (BigQuery SQL - same as training)
         &#8595;
3. Batch Prediction (BigQuery &#8594; BigQuery)
         &#8595;
4. Monitor for Skew (Training-serving skew detection)
         &#8595;
5. Alert on Issues (Email alerts if skew detected)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ILRw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ILRw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 424w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 848w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 1272w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ILRw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png" width="784" height="242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a277288f-e931-41e2-9398-65755b63697c_784x242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:242,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ILRw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 424w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 848w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 1272w, https://substackcdn.com/image/fetch/$s_!ILRw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa277288f-e931-41e2-9398-65755b63697c_784x242.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Key design decisions</strong>:</p><ul><li><p><strong>BigQuery &#8594; BigQuery</strong>: Input and output both in BigQuery for seamless integration</p></li><li><p><strong>Same preprocessing</strong>: SQL preprocessing identical to training (consistency)</p></li><li><p><strong>Built-in monitoring</strong>: Vertex AI automatically compares to training data</p></li><li><p><strong>Scalable</strong>: Horizontal scaling with multiple replicas</p></li></ul><h3><strong>Step 1: Lookup Champion Model</strong></h3><p>The prediction pipeline starts by finding the current production model:</p><pre><code>champion_model = lookup_model_op(
    model_name="taxi-traffic-model",
    location=location,
    project=project,
    fail_on_model_not_found=True,  # Must exist for predictions!
).set_display_name("Look up champion model")</code></pre><p><strong>What happens</strong>:</p><ol><li><p>Query the Vertex AI Model Registry for models with display name <code>taxi-traffic-model</code></p></li><li><p>Filter for the <strong>default version</strong> (the champion)</p></li><li><p>Extract the model URI and training dataset metadata</p></li><li><p>Pass both to the batch prediction step</p></li></ol><p><strong>Critical</strong>: Setting <code>fail_on_model_not_found=True</code> ensures the pipeline fails fast if no model exists, preventing silent failures.</p><h3><strong>Step 2: Data Preprocessing</strong></h3><p><strong>Goal</strong>: Transform raw prediction data into features that match the training format.</p><pre><code>prep_query = generate_query(
    input_file=queries_folder / "ingest_pred.sql",
    source=bq_source_uri,
    dataset=f"{project}.{dataset}",
    table_="prep_prediction_table",
    start_timestamp=timestamp,
    use_latest_data=use_latest_data,
)

prep_op = BigqueryQueryJobOp(
    project=project,
    location="US",
    query=prep_query,
).set_display_name("Ingest &amp; preprocess data")</code></pre><h3><strong>Why Same SQL as Training?</strong></h3><p><strong>Training preprocessing</strong>:</p><pre><code>SELECT
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
  EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
  trip_miles,
  trip_seconds,
  SAFE_DIVIDE(trip_miles, trip_seconds) * 3600 AS trip_distance,  -- miles/second * 3600 = miles per hour (a speed, despite the column name)
  company,
  payment_type,
  fare AS total_fare  -- Label for training
FROM ...</code></pre><p><strong>Prediction preprocessing</strong>:</p><pre><code>SELECT
  EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
  EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
  trip_miles,
  trip_seconds,
  SAFE_DIVIDE(trip_miles, trip_seconds) * 3600 AS trip_distance,  -- miles/second * 3600 = miles per hour (a speed, despite the column name)
  company,
  payment_type
  -- NO label column (we&#8217;re predicting it!)
FROM ...</code></pre><p><strong>Critical for consistency</strong>: If preprocessing differs between training and prediction, the model receives features it was never trained on: it either fails outright or, worse, silently returns unreliable predictions.</p><h3><strong>Step 3: Batch Prediction</strong></h3><p><strong>Goal</strong>: Generate predictions for thousands to millions of rows.</p><pre><code>model_batch_predict_op(
    model=champion_model.outputs["model"],
    job_display_name="taxi-fare-predict-job",
    location=location,
    project=project,

    # Input: BigQuery table
    source_uri=f"bq://{project}.{dataset}.prep_prediction_table",
    source_format="bigquery",
    # Output: BigQuery table
    destination_uri=f"bq://{project}.{dataset}",
    destination_format="bigquery",
    # Resource configuration
    machine_type="n2-standard-4",
    starting_replica_count=3,
    max_replica_count=10,
    # Monitoring configuration
    monitoring_training_dataset=champion_model.outputs["training_dataset"],
    monitoring_alert_email_addresses=["team@example.com"],
    monitoring_skew_config={"defaultSkewThreshold": {"value": 0.001}},
).after(prep_op).set_display_name("Run prediction job")</code></pre><h3><strong>Batch Prediction Workflow</strong></h3><ol><li><p><strong>Job Submission</strong>: Vertex AI creates a batch prediction job</p></li><li><p><strong>Resource Allocation</strong>: Provisions 3&#8211;10 VMs (based on data size)</p></li><li><p><strong>Model Loading</strong>: Loads SavedModel on each VM</p></li><li><p><strong>Parallel Processing</strong>: Each VM processes a partition of the data</p></li><li><p><strong>Predictions</strong>: Each row gets a prediction</p></li><li><p><strong>Output</strong>: Writes predictions to BigQuery</p></li></ol><p><strong>Output table structure</strong>:</p><pre><code>SELECT * FROM `my-project.taxi_trips_dataset.predictions_2024_01_15`
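</code></pre><p>Step 4 of the workflow above says that each VM processes a partition of the data. Vertex AI performs that split internally and its strategy is not described here, so the following is only a toy sketch of the idea: it divides row indices into near-equal contiguous ranges, one per replica. The helper name <code>partition_rows</code> and the contiguous-range scheme are assumptions made for illustration; they are not part of Vertex AI or of the repository. Ranges that differ by at most one row keep any single replica from becoming a straggler.</p><pre><code>def partition_rows(n_rows, n_replicas):
    """Split row indices [0, n_rows) into contiguous (start, end) ranges.

    Hypothetical helper for illustration only. Assumes n_replicas is a
    positive integer and n_rows is a non-negative integer.
    """
    # Every replica gets `base` rows; the first `extra` replicas get one more.
    base, extra = divmod(n_rows, n_replicas)
    sizes = [base + 1] * extra + [base] * (n_replicas - extra)

    ranges = []
    start = 0
    for size in sizes:
        # `end` is exclusive, so consecutive ranges never overlap.
        ranges.append((start, start + size))
        start += size
    return ranges


# 10 rows across the 3 starting replicas used in this article
print(partition_rows(10, 3))  # [(0, 4), (4, 7), (7, 10)]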
</code></pre><h3><strong>Horizontal Scaling</strong></h3><pre><code>starting_replica_count=3,   # Start with 3 VMs
max_replica_count=10,       # Scale up to 10 if needed</code></pre><p><strong>How scaling works</strong> (a rough guide; actual replica counts depend on row size, model latency, and machine type):</p><ul><li><p>Small dataset (&lt; 10k rows): the 3 starting VMs are sufficient</p></li><li><p>Medium dataset (~100k rows): scales to around 5 VMs</p></li><li><p>Large dataset (1M+ rows): scales to the 10-VM maximum</p></li></ul><p><strong>Cost optimization</strong>:</p><ul><li><p>Use <code>n2-standard-2</code> for small datasets</p></li><li><p>Use <code>n2-standard-4</code> for medium datasets</p></li><li><p>Use <code>n2-standard-8</code> for large datasets</p></li></ul><h3><strong>Step 4: Model Monitoring and Skew Detection</strong></h3><p>Over time, <strong>data distributions shift</strong>:</p><p><strong>Example scenario</strong>:</p><pre><code>Training data (Jan-Mar 2024):
  - Average trip: 5.2 miles
  - Payment: 60% credit card, 40% cash
  - Peak hour: 8 AM

Production data (Nov 2024):
  - Average trip: 7.8 miles  &#8592; Shift!
  - Payment: 75% credit card, 25% cash  &#8592; Shift!
  - Peak hour: 9 AM  &#8592; Shift!</code></pre><p>When input distributions shift away from what the model saw during training, accuracy tends to degrade. <strong>Model monitoring</strong> catches this early.</p><h3><strong>Training-Serving Skew Detection</strong></h3><p>Vertex AI automatically compares:</p><ul><li><p><strong>Training data distribution</strong> (saved during training)</p></li><li><p><strong>Prediction data distribution</strong> (from batch prediction)</p></li></ul><p><strong>Skew thresholds</strong>:</p><pre><code>monitoring_skew_config={
    "defaultSkewThreshold": {"value": 0.001},
    # Or per-feature thresholds:
    # "skewThresholds": {
    #     "payment_type": {"value": 0.005},
    #     "trip_distance": {"value": 0.01},
    # }
}</code></pre><p><strong>How skew is calculated</strong>:</p><p>For categorical features (e.g., <code>payment_type</code>):</p><pre><code>Skew = L-infinity distance between distributions

Training: {cash: 0.4, credit: 0.6}
Prediction: {cash: 0.25, credit: 0.75}

Skew = max(|0.4-0.25|, |0.6-0.75|) = max(0.15, 0.15) = 0.15</code></pre><p>For numerical features (e.g., <code>trip_distance</code>), Vertex AI Model Monitoring uses Jensen-Shannon divergence rather than L-infinity distance.</p><p>If the skew exceeds the threshold (here 0.15 &gt; 0.001), an alert is triggered.</p><h3><strong>Alert Configuration</strong></h3><pre><code>monitoring_alert_email_addresses=["ml-team@example.com"],
notification_channels=[
    "projects/my-project/notificationChannels/email-channel",
    "projects/my-project/notificationChannels/slack-channel",
]</code></pre><p><strong>Alert email example</strong>:</p><pre><code>Subject: Model Monitoring Alert - Skew Detected
Model: taxi-traffic-model (v5)
Feature: payment_type
Skew: 0.15 (threshold: 0.001)
Training distribution:
  cash: 40%
  credit: 60%
Prediction distribution:
  cash: 25%
  credit: 75%
Recommended action: Retrain model with recent data.
View details: https://console.cloud.google.com/vertex-ai/...</code></pre><h2><strong>Complete Prediction Pipeline Code</strong></h2><p>Now let&#8217;s see how it all fits together:</p><pre><code>from kfp import compiler, dsl
from components import lookup_model_op, model_batch_predict_op
from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp
from pipelines.utils.query import generate_query
import pathlib

# Monitoring configuration
ALERT_EMAILS = ["ml-team@example.com"]
NOTIFICATION_CHANNELS = []
SKEW_THRESHOLDS = {"defaultSkewThreshold": {"value": 0.001}}


@dsl.pipeline(name="taxifare-batch-prediction-pipeline")
def pipeline(
    project: str,
    location: str,
    bq_location: str,
    bq_source_uri: str = "bigquery-public-data.chicago_taxi_trips.taxi_trips",
    dataset: str = "taxi_trips_dataset",
    timestamp: str = "2022-12-01 00:00:00",
    use_latest_data: bool = True,
    model_name: str = "taxi-traffic-model",
    machine_type: str = "n2-standard-4",
    min_replicas: int = 3,
    max_replicas: int = 10,
):
    """
    Prediction pipeline which:
     1. Looks up the default model version (champion)
     2. Preprocesses data using BigQuery SQL
     3. Runs batch prediction job (BigQuery &#8594; BigQuery)
     4. Monitors for training-serving skew
    Args:
        project: GCP project ID
        location: Vertex AI location (e.g., us-central1)
        bq_location: BigQuery location (e.g., US)
        bq_source_uri: Source BigQuery table
        dataset: Dataset for staging tables
        timestamp: Optional fixed timestamp for predictions
        use_latest_data: Whether to use latest data (default: True)
        model_name: Model display name in registry
        machine_type: VM type for batch prediction
        min_replicas: Minimum number of prediction workers
        max_replicas: Maximum number of prediction workers
    """
    queries_folder = pathlib.Path(__file__).parent / "queries"
    # Preprocess data with the same feature engineering as training
    prep_query = generate_query(
        input_file=queries_folder / "ingest_pred.sql",
        source=bq_source_uri,
        dataset=f"{project}.{dataset}",
        table_="prep_prediction_table",
        start_timestamp=timestamp,
        use_latest_data=use_latest_data,
    )
    prep_op = BigqueryQueryJobOp(
        project=project,
        location=bq_location,  # use the pipeline parameter rather than a hardcoded "US"
        query=prep_query,
    ).set_display_name("Ingest &amp; preprocess data")
    # Look up the champion model from the registry (independent of preprocessing)
    champion_model = lookup_model_op(
        model_name=model_name,
        location=location,
        project=project,
        fail_on_model_not_found=True,  # Must exist!
    ).set_display_name("Look up champion model")
    # Run batch prediction with monitoring once preprocessing has finished
    model_batch_predict_op(
        model=champion_model.outputs["model"],
        job_display_name="taxi-fare-predict-job",
        location=location,
        project=project,
        # Input/Output configuration (BigQuery &#8594; BigQuery)
        source_uri=f"bq://{project}.{dataset}.prep_prediction_table",
        destination_uri=f"bq://{project}.{dataset}",
        source_format="bigquery",
        destination_format="bigquery",
        # Instance configuration
        instance_config={"instanceType": "object"},
        # Resource configuration (horizontal scaling)
        machine_type=machine_type,
        starting_replica_count=min_replicas,
        max_replica_count=max_replicas,
        # Monitoring configuration
        monitoring_training_dataset=champion_model.outputs["training_dataset"],
        monitoring_alert_email_addresses=ALERT_EMAILS,
        notification_channels=NOTIFICATION_CHANNELS,
        monitoring_skew_config=SKEW_THRESHOLDS,
    ).after(prep_op).set_display_name("Run prediction job")

if __name__ == "__main__":
    compiler.Compiler().compile(
        pipeline_func=pipeline,
        package_path="taxifare-prediction-pipeline.yaml"
    )</code></pre><h2><strong>Pipeline Execution DAG on Vertex AI Pipelines</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jfXe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jfXe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 424w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 848w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 1272w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jfXe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png" width="788" height="345" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88979c44-9858-44a2-bc24-c13241b8d382_788x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!jfXe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 424w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 848w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 1272w, https://substackcdn.com/image/fetch/$s_!jfXe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88979c44-9858-44a2-bc24-c13241b8d382_788x345.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Key Design Decisions</strong></h2><p><strong>1. Simple Linear Flow</strong> Unlike the training pipeline with its complex DAG, the prediction pipeline is deliberately simple:</p><ul><li><p>No fan-out: model lookup and preprocessing are the only independent steps, and both feed a single prediction job</p></li><li><p>No conditional logic</p></li><li><p>Fail fast if any step fails</p></li></ul><p><strong>2. Preprocessing Consistency</strong></p><pre><code># Same feature engineering as the training query (ingest.sql), minus the label
prep_query = generate_query(
    input_file=queries_folder / "ingest_pred.sql",
    # ...
)</code></pre><p>The <code>ingest_pred.sql</code> query applies the same feature engineering as <code>ingest.sql</code> (training), just without the label column.</p><p><strong>3. Dynamic Champion Lookup</strong></p><pre><code>champion_model = lookup_model_op(
    model_name=model_name,
    fail_on_model_not_found=True,
)</code></pre><p>Never hardcode model versions. Always resolve the current champion at run time.</p><p><strong>4. Built-in Monitoring</strong></p><pre><code>monitoring_training_dataset=champion_model.outputs["training_dataset"],</code></pre><p>The training dataset metadata (saved during training) is passed to the batch prediction job as the baseline for skew detection.</p><p><strong>5. Scalability by Default</strong></p><pre><code>min_replicas=3,
max_replicas=10,</code></pre><p>Automatically scales based on data volume:</p><ul><li><p>Small dataset: Uses 3 replicas</p></li><li><p>Large dataset: Scales up to 10 replicas</p></li></ul><h2><strong>Running the Pipeline</strong></h2><h3><strong>Compile</strong></h3><pre><code>make compile pipeline=prediction</code></pre><h3><strong>Run in Different Scenarios</strong></h3><p><strong>Production run</strong> (latest data):</p><pre><code>make prediction enable_caching=false use_latest_data=true</code></pre><p>Or using the Python utility:</p><pre><code>poetry run python -m pipelines.utils.run_pipeline \
  --pipeline=prediction \
  --project=my-prod-project \
  --use_latest_data=true \
  --enable_caching=false</code></pre><p><strong>Backfill run</strong> (historical data):</p><pre><code>poetry run python -m pipelines.utils.run_pipeline \
  --pipeline=prediction \
  --project=my-prod-project \
  --timestamp="2024-12-01 00:00:00" \
  --use_latest_data=false</code></pre><p><strong>Testing run</strong> (small dataset):</p><pre><code>poetry run python -m pipelines.utils.run_pipeline \
  --pipeline=prediction \
  --project=my-dev-project \
  --machine_type="n2-standard-2" \
  --min_replicas=1 \
  --max_replicas=1</code></pre><h3><strong>Expected Output</strong></h3><pre><code>Pipeline submitted: projects/123/locations/us-central1/pipelineJobs/prediction-20250113-142536

View in Vertex AI:
https://console.cloud.google.com/vertex-ai/pipelines/runs/prediction-20250113-142536</code></pre><h3><strong>Prediction Output Format</strong></h3><p>The batch prediction creates a BigQuery table:</p><pre><code>-- View predictions
SELECT * FROM `my-project.taxi_trips_dataset.predictions_20250113_142536`
LIMIT 10;</code></pre><h3><strong>Using Predictions</strong></h3><p><strong>Join with actuals</strong> (for accuracy measurement):</p><pre><code>SELECT
  p.trip_id,
  p.predicted_total_fare,
  a.actual_fare,
  ABS(p.predicted_total_fare - a.actual_fare) AS error,
  ABS(p.predicted_total_fare - a.actual_fare) / a.actual_fare AS pct_error
FROM predictions_20250113_142536 p
JOIN actual_fares a ON p.trip_id = a.trip_id
WHERE a.actual_fare &gt; 0
ORDER BY pct_error DESC
LIMIT 100;</code></pre><p><strong>Export for business use</strong>:</p><pre><code>-- Export to Google Sheets or Looker Studio (formerly Data Studio)
SELECT
  trip_id,
  predicted_total_fare,
  CASE
    WHEN predicted_total_fare &lt; 10 THEN 'Low'
    WHEN predicted_total_fare &lt; 25 THEN 'Medium'
    ELSE 'High'
  END AS fare_category
FROM predictions_20250113_142536;</code></pre><h2><strong>Best Practices</strong></h2><h3><strong>1. Always Use Champion Model</strong></h3><pre><code># Good: Lookup champion dynamically
champion = lookup_model_op(model_name="taxi-traffic-model")
# Bad: Hardcode model version
model_uri = "projects/.../models/123456/versions/1"</code></pre><p>Dynamic lookup ensures you always use the latest approved model.</p><h3><strong>2. Monitor Everything</strong></h3><p>Enable monitoring on all prediction jobs:</p><pre><code>monitoring_training_dataset=champion_model.outputs["training_dataset"],
monitoring_skew_config=SKEW_THRESHOLDS,</code></pre><h3><strong>3. Test Predictions in Dev First</strong></h3><pre><code># Test prediction pipeline in dev
make prediction enable_caching=false
# Verify predictions look reasonable
bq --project_id=my-dev-project query "
  SELECT prediction, trip_miles
  FROM predictions_table
  ORDER BY RAND()
  LIMIT 10
"
# Only then run in prod</code></pre><h3><strong>4. Version Prediction Outputs</strong></h3><pre><code># Include timestamp in output table
destination_uri=f"bq://{project}.{dataset}.predictions_{date}"</code></pre><p>Enables:</p><ul><li><p>A/B testing between model versions</p></li><li><p>Historical prediction analysis</p></li><li><p>Rollback if needed</p></li></ul><h3><strong>5. Ground Truth Collection</strong></h3><p>Collect actual outcomes to measure real accuracy:</p><pre><code>-- Join predictions with actual fares (collected later)
SELECT
  p.prediction,
  a.actual_fare,
  ABS(p.prediction - a.actual_fare) AS error
FROM predictions p
JOIN actual_fares a ON p.trip_id = a.trip_id</code></pre><p>Use this to:</p><ul><li><p>Track accuracy over time</p></li><li><p>Trigger retraining when accuracy drops</p></li><li><p>Validate champion/challenger comparisons</p></li></ul><h2><strong>Conclusion</strong></h2><p>Building a production prediction pipeline requires:</p><ul><li><p><strong>Champion model lookup</strong>: Always use the current best model</p></li><li><p><strong>Preprocessing consistency</strong>: Exact same transformations as training</p></li><li><p><strong>Batch predictions at scale</strong>: Horizontal scaling with BigQuery</p></li><li><p><strong>Model monitoring</strong>: Automatic skew detection</p></li><li><p><strong>Alerting</strong>: Notify when issues arise</p></li><li><p><strong>Cost optimization</strong>: Right-size resources</p></li></ul><p>The prediction pipeline built in this article:</p><ul><li><p>Always uses the champion model</p></li><li><p>Scales horizontally for large datasets</p></li><li><p>Monitors for data drift automatically</p></li><li><p>Alerts when issues arise</p></li><li><p>Integrates with the training pipeline</p></li></ul><p>In the next article, we&#8217;ll automate everything with CI/CD and explore production operations, including continuous training, scheduled retraining, and observability.</p><p><strong>Key Takeaways:</strong></p><ul><li><p>Prediction preprocessing must match training preprocessing exactly</p></li><li><p>Always look up the champion model dynamically (never hardcode versions)</p></li><li><p>Batch predictions scale horizontally for millions of rows</p></li><li><p>Model monitoring detects training-serving skew automatically</p></li><li><p>Test predictions in dev before running in production</p></li><li><p>Version prediction outputs for analysis and rollback</p></li></ul><p><strong>Next in Series</strong>: CI/CD &amp; Production Operations</p><p><strong>GitHub Repository</strong>: <a 
href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>Prediction Pipeline Code</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/pipelines/src/pipelines/prediction.py">Prediction pipeline</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/pipelines/src/pipelines/prediction.py#L14-L18">Monitoring configuration</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/pipelines/src/pipelines/queries">SQL queries</a></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack: Part 6]]></title><description><![CDATA[Deploying to Cloud: Cloud Run and Vertex AI Agent Engine]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-b08</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-b08</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Wed, 04 Feb 2026 10:10:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b5tc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Production Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p>Part 1: From Monolithic AI to Distributed Intelligence: Building Your First Multi-Agent System</p></li><li><p>Part 2: Making Agents Talk: 
Agent-to-Agent (A2A) Protocol Deep Dive</p></li><li><p>Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</p></li><li><p>Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit Problem</p></li><li><p>Part 5: External Tool Integration via Model Context Protocol (MCP)</p></li><li><p><strong>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</strong> &#8592; You are here</p></li></ul><h2><strong>Welcome Back!</strong></h2><p>In Part 5, we integrated external tools via MCP. Now we have a complete multi-agent system running locally.</p><p>It&#8217;s time to <strong>deploy to the cloud</strong>!</p><p>In this article, we&#8217;ll deploy:</p><ul><li><p>5 specialist agents &#8594; <strong>Cloud Run</strong> (containerized, auto-scaling)</p></li><li><p>Creative Director orchestrator &#8594; <strong>Vertex AI Agent Engine</strong> (managed runtime)</p></li></ul><p>We&#8217;ll also leverage:</p><ul><li><p><strong>Parallel deployment</strong> (3x faster)</p></li><li><p><strong>Two-stage A2A configuration</strong></p></li><li><p><strong>Automated URL collection</strong></p></li></ul><p>Let&#8217;s ship it!</p><h2><strong>Deployment Architecture Overview</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b5tc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b5tc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 424w, 
https://substackcdn.com/image/fetch/$s_!b5tc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 848w, https://substackcdn.com/image/fetch/$s_!b5tc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 1272w, https://substackcdn.com/image/fetch/$s_!b5tc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b5tc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png" width="784" height="355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:355,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!b5tc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 424w, 
https://substackcdn.com/image/fetch/$s_!b5tc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 848w, https://substackcdn.com/image/fetch/$s_!b5tc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 1272w, https://substackcdn.com/image/fetch/$s_!b5tc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe86db7a-2abc-45c8-be5d-79cc74395573_784x355.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3><strong>Why This Architecture?</strong></h3><p><strong>Specialists on Cloud Run</strong>:</p><ul><li><p>Independent scaling (scale copywriter separately)</p></li><li><p>Containerized (full control over environment)</p></li><li><p>Auto-scaling (0&#8211;100 instances)</p></li><li><p>Cost-efficient (pay only when running)</p></li></ul><p><strong>Orchestrator on Agent Engine</strong>:</p><ul><li><p>Managed runtime (no container maintenance)</p></li><li><p>Integrated with Vertex AI</p></li><li><p>Built-in monitoring</p></li></ul><h2><strong>Prerequisites</strong></h2><h3><strong>1. Google Cloud Project Setup</strong></h3><pre><code># Install gcloud CLI
# macOS:
brew install google-cloud-sdk
# Linux:
curl https://sdk.cloud.google.com | bash
# Verify
gcloud --version
# Login and set project
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
# Enable required APIs
gcloud services enable \
    run.googleapis.com \
    aiplatform.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com</code></pre><h3><strong>2. Environment Variables</strong></h3><p>Create <code>.env</code> file:</p><pre><code># Google Cloud
PROJECT_ID=your-gcp-project-id
REGION=us-central1
# Gemini API
GOOGLE_API_KEY=your-gemini-api-key
# Notion (optional)
NOTION_API_KEY=your-notion-token
NOTION_DATABASE_ID=your-projects-db-id
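TASKS_DATABASE_ID=your-tasks-db-id</code></pre><h3><strong>3. Service Accounts Setup</strong></h3><p><strong>No setup needed!</strong> Unless you pass <code>--service-account</code>, Cloud Run runs each service as the default Compute Engine service account, which typically already has the permissions this project needs.</p><p>This simplifies deployment because you don&#8217;t have to create custom service accounts. Note that the deployment script later in this article passes a per-agent account (<code>&lt;name&gt;-sa</code>); create those accounts first, or drop the flag to rely on the default.</p><h2><strong>Creating Dockerfiles for Specialist Agents</strong></h2><h3><strong>Standard Agent Dockerfile</strong></h3><pre><code># agents/brand_strategist/Dockerfile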
FROM python:3.12-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update &amp;&amp; apt-get install -y \
    gcc \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*
# Install uv for faster dependency installation
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Copy requirements and install
COPY requirements.txt .
RUN uv pip install --system --no-cache -r requirements.txt
# Copy agent code
COPY agent.py .
# Create non-root user for security
RUN useradd -m -u 1000 appuser &amp;&amp; chown -R appuser:appuser /app
USER appuser
# Environment
ENV PYTHONUNBUFFERED=1
ENV PORT=8080
ENV HOST=0.0.0.0
EXPOSE 8080
# Run A2A server
CMD ["python", "agent.py"]</code></pre><h3><strong>Project Manager Dockerfile (with Node.js for MCP)</strong></h3><pre><code># agents/project_manager/Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install Node.js for Notion MCP server
RUN apt-get update &amp;&amp; apt-get install -y \
    nodejs \
    npm \
    gcc \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*
# Verify Node.js
RUN node --version &amp;&amp; npm --version
# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Install Python dependencies
COPY requirements.txt .
RUN uv pip install --system --no-cache -r requirements.txt
# Copy agent code
COPY agent.py .
# Create non-root user
RUN useradd -m -u 1000 appuser &amp;&amp; chown -R appuser:appuser /app
USER appuser
# Environment
ENV PYTHONUNBUFFERED=1
ENV PORT=8080
ENV HOST=0.0.0.0
EXPOSE 8080
CMD ["python", "agent.py"]</code></pre><h2><strong>Parallel Deployment (3x Faster!)</strong></h2><h3><strong>The Problem: Sequential Deployment</strong></h3><pre><code># Old approach (SLOW - sequential)
# Deploy each agent one by one
# Total: 15 minutes! &#10060;</code></pre><h3><strong>The Solution: Async Parallel Deployment</strong></h3><pre><code># deploy/deploy_all_specialists.py
import asyncio
import os
from pathlib import Path
from typing import Dict, List
AGENTS = [
    {"name": "brand-strategist", "dir": "brand_strategist"},
    {"name": "copywriter", "dir": "copywriter"},
    {"name": "designer", "dir": "designer"},
    {"name": "critic", "dir": "critic"},
    {"name": "project-manager", "dir": "project_manager"},
]

async def deploy_single_agent(
    agent_config: Dict,
    project_id: str,
    region: str
) -&gt; str:
    """Deploy a single agent to Cloud Run"""
    name = agent_config["name"]
    agent_dir = agent_config["dir"]
    service_account = f"{name}-sa"
    print(f"&#128640; Deploying {name}...")
    agent_path = Path(__file__).parent.parent / agent_dir
    sa_email = f"{service_account}@{project_id}.iam.gserviceaccount.com"
    # Build environment variables
    env_vars = (
        "GOOGLE_GENAI_USE_VERTEXAI=true,"
        f"GOOGLE_CLOUD_PROJECT={project_id},"
        f"GOOGLE_CLOUD_LOCATION={region}"
    )
    # Add Notion credentials for project-manager
    if name == "project-manager":
        notion_api_key = os.getenv("NOTION_API_KEY")
        notion_db_id = os.getenv("NOTION_DATABASE_ID")
        if notion_api_key and notion_db_id:
            env_vars += f",NOTION_API_KEY={notion_api_key},NOTION_DATABASE_ID={notion_db_id}"
    # Deploy command
    cmd = [
        "gcloud", "run", "deploy", name,
        "--source=.",
        "--port=8080",
        "--platform=managed",
        f"--region={region}",
        f"--project={project_id}",
        f"--service-account={sa_email}",
        "--no-allow-unauthenticated",
        f"--set-env-vars={env_vars}",
        "--memory=1Gi",
        "--cpu=1",
        "--timeout=300",
        "--max-instances=10",
        "--min-instances=0",
        "--quiet"
    ]
    # Run deployment asynchronously
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        cwd=agent_path
    )
    stdout, stderr = await process.communicate()
    if process.returncode != 0:
        print(f"&#10060; Failed to deploy {name}: {stderr.decode()}")
        return None
    print(f"&#10003; {name} deployed successfully")
    # Get service URL
    url = await get_service_url(name, project_id, region)
    return url
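
# --- Sketch (assumption): get_service_url() is called above but is not shown in
# this article. One plausible version shells out to `gcloud run services describe`.
# build_describe_cmd is a hypothetical helper name, not taken from the original repo.
import asyncio  # already imported at the top; repeated so the sketch stands alone


def build_describe_cmd(name: str, project_id: str, region: str):
    """Build the gcloud command that prints a Cloud Run service's URL."""
    return [
        "gcloud", "run", "services", "describe", name,
        "--platform=managed",
        f"--region={region}",
        f"--project={project_id}",
        "--format=value(status.url)",
    ]


async def get_service_url(name: str, project_id: str, region: str):
    """Return the deployed service URL, or None if the lookup fails."""
    process = await asyncio.create_subprocess_exec(
        *build_describe_cmd(name, project_id, region),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await process.communicate()
    if process.returncode != 0:
        return None
    # An empty response is treated the same as a failed lookup
    return stdout.decode().strip() or None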

async def deploy_all_agents(project_id: str, region: str) -&gt; Dict[str, str]:
    """Deploy all agents in parallel and collect URLs"""
    print("\n" + "=" * 70)
    print("Deploying all specialist agents to Cloud Run (in parallel)")
    print("=" * 70 + "\n")
    # Deploy all agents in parallel using asyncio.gather
    tasks = [
        deploy_single_agent(agent, project_id, region)
        for agent in AGENTS
    ]
    results = await asyncio.gather(*tasks)
    # Build URL mapping (failed deployments return None and are skipped)
    agent_urls = {}
    for agent, url in zip(AGENTS, results):
        if url:
            agent_urls[agent["name"]] = url
    print("\n" + "=" * 70)
    print(f"&#10003; Deployment complete! {len(agent_urls)}/{len(AGENTS)} agents deployed")
    print("=" * 70)
    return agent_urls</code></pre><p><strong>Speed comparison</strong>:</p><ul><li><p>Sequential: 5 agents &#215; 3 min = <strong>15 minutes</strong></p></li><li><p>Parallel: ~<strong>5 minutes</strong></p></li><li><p><strong>3x faster!</strong></p></li></ul><h2><strong>Two-Stage A2A Configuration</strong></h2><p>Remember our dual configuration from Part 3? Here&#8217;s how it works in deployment:</p><h3><strong>Stage 1: Initial Deployment</strong></h3><pre><code># Deploy with basic environment variables
gcloud run deploy brand-strategist \
    --source=. \
    --set-env-vars=GOOGLE_CLOUD_PROJECT=...,... \
    --region=us-central1
# Service is deployed!
# But agent card still shows placeholder URL</code></pre><h3><strong>Stage 2: Update A2A Configuration</strong></h3><pre><code>async def update_agent_a2a_config(
    service_name: str,
    url: str,
    project_id: str,
    region: str
) -&gt; None:
    """Update deployed agent with PUBLIC_HOST, PUBLIC_PORT, PROTOCOL"""
    # Extract PUBLIC_HOST from URL
    # URL: https://brand-strategist-xxx.us-central1.run.app
    public_host = url.replace("https://", "").replace("http://", "").split("/")[0]
    print(f"   Updating A2A config for {service_name}...")
    # Build environment variables update
    env_vars_update = f"PUBLIC_HOST={public_host},PUBLIC_PORT=443,PROTOCOL=https"
    # Add Notion credentials for project-manager
    if service_name == "project-manager":
        notion_api_key = os.getenv("NOTION_API_KEY")
        if notion_api_key:
            env_vars_update += f",NOTION_API_KEY={notion_api_key}"
    cmd = [
        "gcloud", "run", "services", "update", service_name,
        "--platform=managed",
        f"--region={region}",
        f"--project={project_id}",
        f"--update-env-vars={env_vars_update}",
        "--quiet"
    ]
    process = await asyncio.create_subprocess_exec(*cmd)
    await process.wait()
    if process.returncode == 0:
        print(f"   &#10003; A2A config updated for {service_name}")
    else:
        print(f"   Warning: Could not update A2A config for {service_name}")</code></pre><p><strong>Now the agent card shows the correct URL</strong>:</p><pre><code>{
  "name": "brand_strategist",
  "rpc_url": "https://brand-strategist-xxx.us-central1.run.app:443"
}</code></pre><p>Perfect for the orchestrator to discover!</p><h2><strong>Deploying the Orchestrator to Agent Engine</strong></h2><h3><strong>Step 1: Prepare Agent Code</strong></h3><pre><code># agents/creative_director/agent.py
# Agent creation code from Part 4
# Returns App (with context compaction)
root_agent = create_creative_director()
# That&#8217;s it! Agent Engine handles the rest</code></pre><h3><strong>Step 2: Deploy to Agent Engine</strong></h3><pre><code># deploy/deploy_orchestrator.py
import os
from pathlib import Path
from typing import Dict

from google.cloud import aiplatform
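# --- Sketch (assumption, not part of the original script): fail fast when a
# specialist URL is missing. deploy_single_agent returns None on failure, so
# agent_urls may lack entries. The names mirror the AGENTS list shown earlier;
# missing_agent_urls is a hypothetical helper.
REQUIRED_AGENTS = (
    "brand-strategist", "copywriter", "designer", "critic", "project-manager",
)


def missing_agent_urls(agent_urls, required=REQUIRED_AGENTS):
    """Return the required agent names that have no URL in agent_urls."""
    return [name for name in required if not agent_urls.get(name)]


# Possible use at the top of deploy_orchestrator():
#     missing = missing_agent_urls(agent_urls)
#     if missing:
#         raise SystemExit(f"Missing agent URLs: {', '.join(missing)}")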
def deploy_orchestrator(agent_urls: Dict[str, str], project_id: str, region: str):
    """Deploy Creative Director to Vertex AI Agent Engine"""
    print("\n" + "=" * 70)
    print("Deploying Creative Director to Vertex AI Agent Engine")
    print("=" * 70)
    # Initialize Vertex AI
    aiplatform.init(project=project_id, location=region)
    # Prepare environment variables with agent URLs
    env_vars = {
        "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),
        "STRATEGIST_AGENT_URL": agent_urls.get("brand-strategist"),
        "COPYWRITER_AGENT_URL": agent_urls.get("copywriter"),
        "DESIGNER_AGENT_URL": agent_urls.get("designer"),
        "CRITIC_AGENT_URL": agent_urls.get("critic"),
        "PM_AGENT_URL": agent_urls.get("project-manager"),
    }
    print("\n&#128203; Environment variables:")
    for key, value in env_vars.items():
        if "API_KEY" not in key:
            print(f"   {key}={value}")
    # Read requirements
    requirements = ["google-adk", "google-genai", "python-dotenv"]
    # Deploy to Agent Engine
    print("\n&#128640; Deploying to Agent Engine...")
    # NOTE: the SDK entry point depends on your google-cloud-aiplatform version;
    # recent releases expose Agent Engine through vertexai.agent_engines.create(...).
    reasoning_engine = aiplatform.ReasoningEngine.create(
        reasoning_engine={
            "agent_file": "agent.py",
            "agent_name": "root_agent",  # Name of variable in agent.py
            "requirements": requirements
        },
        display_name="creative-director-orchestrator",
        description="Creative Director orchestrator for AI Creative Studio",
        requirements=requirements,
        extra_packages=[Path("agents/creative_director")],
        env_vars=env_vars
    )
    resource_name = reasoning_engine.resource_name
    print("\n&#9989; Orchestrator deployed!")
    print(f"   Resource name: {resource_name}")
    print("\n&#128161; Save this to .env:")
    print(f"   AGENT_ENGINE_RESOURCE_NAME={resource_name}")
    return resource_name</code></pre><p><strong>Key points</strong>:</p><ul><li><p>Deploys <code>agent.py</code> with <code>root_agent</code> variable</p></li><li><p>Sets all agent URLs in environment variables</p></li><li><p>Orchestrator discovers agents at runtime!</p></li></ul><h2><strong>One-Command Deployment</strong></h2><h3><strong>The Complete Deployment Script</strong></h3><pre><code>#!/bin/bash
# deploy/deploy_complete_system.sh
set -e
echo "======================================================================"
echo "   AI Creative Studio - Complete System Deployment"
echo "======================================================================"
# Load environment
if [ ! -f .env ]; then
    echo "&#10060; Error: .env file not found"
    exit 1
fi
source .env
echo ""
echo "&#128203; Configuration:"
echo "   Project: $PROJECT_ID"
echo "   Region: $REGION"
echo ""
# Step 1: Deploy all specialist agents in parallel
echo "Step 1/2: Deploying specialist agents to Cloud Run (parallel)..."
# Test the command directly: with `set -e`, a separate `$?` check would never run on failure
if ! python3 deploy_all_specialists.py; then
    echo "&#10060; Specialist deployment failed"
    exit 1
fi
# Step 2: Deploy orchestrator
echo ""
echo "Step 2/2: Deploying orchestrator to Vertex AI Agent Engine..."
if ! python3 deploy_orchestrator.py --action deploy; then
    echo "&#10060; Orchestrator deployment failed"
    exit 1
fi
echo ""
echo "======================================================================"
echo "   &#9989; Complete System Deployed Successfully!"
echo "======================================================================"
echo ""
echo "&#129514; Test your system:"
echo "   python3 test_orchestrator.py"
echo ""</code></pre><p><strong>Run It!</strong></p><pre><code>cd deploy
chmod +x deploy_complete_system.sh
./deploy_complete_system.sh</code></pre><p><strong>Output</strong></p><pre><code>======================================================================
   AI Creative Studio - Complete System Deployment
======================================================================
&#128203; Configuration:
   Project: my-project-123
   Region: us-central1
Step 1/2: Deploying specialist agents to Cloud Run (parallel)...
======================================================================
Deploying all specialist agents to Cloud Run (in parallel)
======================================================================
&#128640; Deploying brand-strategist...
&#128640; Deploying copywriter...
&#128640; Deploying designer...
&#128640; Deploying critic...
&#128640; Deploying project-manager...
&#10003; brand-strategist deployed successfully
   URL: https://brand-strategist-xxx.us-central1.run.app
   Updating A2A config for brand-strategist...
   &#10003; A2A config updated
&#10003; copywriter deployed successfully
   URL: https://copywriter-xxx.us-central1.run.app
   Updating A2A config for copywriter...
   &#10003; A2A config updated
... (rest of agents)
======================================================================
&#10003; Deployment complete! 5/5 agents deployed
======================================================================
Step 2/2: Deploying orchestrator to Vertex AI Agent Engine...
======================================================================
Deploying Creative Director to Vertex AI Agent Engine
======================================================================
&#128203; Environment variables:
   STRATEGIST_AGENT_URL=https://brand-strategist-xxx.us-central1.run.app
   COPYWRITER_AGENT_URL=https://copywriter-xxx.us-central1.run.app
   DESIGNER_AGENT_URL=https://designer-xxx.us-central1.run.app
   CRITIC_AGENT_URL=https://critic-xxx.us-central1.run.app
   PM_AGENT_URL=https://project-manager-xxx.us-central1.run.app
&#128640; Deploying to Agent Engine...
&#9989; Orchestrator deployed!
   Resource name: projects/123/locations/us-central1/reasoningEngines/456
&#128161; Save this to .env:
   AGENT_ENGINE_RESOURCE_NAME=projects/123/locations/us-central1/reasoningEngines/456
======================================================================
   &#9989; Complete System Deployed Successfully!
======================================================================
&#129514; Test your system:
   python3 test_orchestrator.py
Total deployment time: ~7 minutes</code></pre><h2><strong>Testing the Deployed System</strong></h2><h3><strong>Test Script</strong></h3><pre><code># test_orchestrator.py
from google.cloud import aiplatform
import os
from dotenv import load_dotenv
load_dotenv()
# Initialize
project_id = os.getenv("PROJECT_ID")
region = os.getenv("REGION")
resource_name = os.getenv("AGENT_ENGINE_RESOURCE_NAME")
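# --- Sketch (assumption): sanity-check the resource name copied into .env before
# calling the service. The pattern follows the deployment output shown earlier
# (projects/.../locations/.../reasoningEngines/...); the helper name is hypothetical.
import os  # already imported above; repeated so the sketch stands alone
import re

RESOURCE_NAME_PATTERN = r"projects/[^/]+/locations/[^/]+/reasoningEngines/[^/]+"


def looks_like_agent_engine_resource(name):
    """Return True if name has the projects/.../reasoningEngines/... shape."""
    return bool(name) and re.fullmatch(RESOURCE_NAME_PATTERN, name) is not None


if not looks_like_agent_engine_resource(os.getenv("AGENT_ENGINE_RESOURCE_NAME")):
    print("Warning: AGENT_ENGINE_RESOURCE_NAME is missing or malformed; check .env")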
aiplatform.init(project=project_id, location=region)
# Load the deployed orchestrator
reasoning_engine = aiplatform.ReasoningEngine(resource_name)
# Test with a simple request
brief = "Research the market for eco-friendly smart water bottles"
print("&#128203; Testing deployed orchestrator\n")
print(f"Brief: {brief}\n")
print("Response:")
response = reasoning_engine.query(input=brief)
print(response["output"])
print("\n&#9989; Deployed system is working!")</code></pre><h3><strong>Run Test</strong></h3><pre><code>python test_orchestrator.py</code></pre><h2><strong>Monitoring and Logs</strong></h2><h3><strong>View Orchestrator Logs</strong></h3><pre><code># Fetch logs from Agent Engine
gcloud logging read \
    'resource.type="aiplatform.googleapis.com/ReasoningEngine"' \
    --limit=50 \
    --format=json</code></pre><h3><strong>View Agent Logs</strong></h3><pre><code># Brand Strategist logs
gcloud run services logs read brand-strategist \
    --region=us-central1 \
    --limit=50</code></pre><h3><strong>Cloud Run Dashboard</strong></h3><pre><code># Open Cloud Run console
# gcloud has no command for this; open the URL below in a browser:
# https://console.cloud.google.com/run</code></pre><p>View:</p><ul><li><p>Request counts</p></li><li><p>Response times</p></li><li><p>Error rates</p></li><li><p>Instance scaling</p></li></ul><h2><strong>Monitoring and Debugging Your Deployed System</strong></h2><p>Now that your system is deployed, here are quick tips for observability:</p><h3><strong>Built-in Observability</strong></h3><p><strong>ADK Logging Plugin</strong> (already enabled in code):</p><ul><li><p>Automatically logs all LLM calls, tool executions, and token usage</p></li><li><p>No custom configuration needed</p></li></ul><p><strong>Cloud Logging</strong> (automatic):</p><pre><code># View orchestrator logs
gcloud logging read \
  'resource.type="aiplatform.googleapis.com/ReasoningEngine"' \
  --limit=100 --project=YOUR_PROJECT_ID</code></pre><pre><code># View specialist agent logs
gcloud logging read \
  'resource.type="cloud_run_revision" AND
   resource.labels.service_name="brand-strategist"' \
  --limit=100 --project=YOUR_PROJECT_ID</code></pre><p><strong>A2A Inspector</strong> (for testing agents):</p><ul><li><p>Install: <a href="https://github.com/a2aproject/a2a-inspector">https://github.com/a2aproject/a2a-inspector</a></p></li><li><p>Connect to your Cloud Run agent URLs</p></li><li><p>Test queries and view JSON-RPC messages</p></li></ul><h3><strong>Quick Debugging Commands</strong></h3><pre><code># Tail orchestrator logs in real-time
gcloud logging tail \
  'resource.type="aiplatform.googleapis.com/ReasoningEngine"' \
  --project=YOUR_PROJECT_ID
# Check for errors in specialist agents
gcloud logging read \
  'resource.type="cloud_run_revision" AND severity&gt;=ERROR' \
  --limit=50 --project=YOUR_PROJECT_ID
# Inspect the Cloud Run service (configuration, URL, latest revision)
gcloud run services describe brand-strategist \
  --platform managed --region us-central1</code></pre><p>For comprehensive monitoring, set up Cloud Monitoring dashboards and log-based alerts through the Google Cloud Console.</p><h2><strong>Visual Tour: Your Deployed System in Action </strong></h2><h3><strong>Specialists Deployed to Cloud Run</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O7Av!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O7Av!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 424w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 848w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O7Av!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg" width="788" height="250" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!O7Av!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 424w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 848w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!O7Av!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fbfe60e-9860-49e0-9aef-cb39c8cf33fe_788x250.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Navigate to Cloud Run in Google Cloud Console. 
You should see all 5 specialist agents deployed as independent services:</p><p>&#9989; brand-strategist &#8212; Ready to research markets<br>&#9989; copywriter &#8212; Ready to write compelling copy<br>&#9989; designer &#8212; Ready to create visual concepts<br>&#9989; critic &#8212; Ready to review and provide feedback<br>&#9989; project-manager &#8212; Ready to organize tasks in Notion</p><p>Key indicators:<br>&#8212; Green checkmarks = healthy and running<br>&#8212; Each service has its own URL (the A2A endpoint)<br>&#8212; Auto-scaling configured (0 to 10 instances)<br>&#8212; Currently scaled to zero (no idle costs!)</p><h3><strong>Orchestrator Deployed to Agent Engine</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TeSE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TeSE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TeSE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TeSE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!TeSE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TeSE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg" width="788" height="347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TeSE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TeSE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TeSE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!TeSE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30aec2bc-d547-4435-a8ca-8be67b69b605_788x347.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Navigate to <strong>Vertex AI</strong> &gt; <strong>Agent Engine</strong> in Google Cloud Console. You should see:</p><ul><li><p>&#128203; Display name: Creative Director</p></li></ul><h3><strong>Live Execution in Agent Engine Playground</strong></h3><p>Click on the Creative Director then go into the &#8220;Playground&#8221; Tab. 
A session will be created for you. Enter a prompt!</p><p><em><strong>The execution flow is visible in the playground, as in the demo.</strong></em></p><p>Thank you for following this series!</p><p>If you built something with these patterns, I&#8217;d love to hear about it. Share your projects, questions, and improvements.</p><p>Happy building! &#128640;</p><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 5: Training Pipeline Deep Dive]]></title><description><![CDATA[Part 5 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-022</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-022</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 10:12:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gRl5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a> (You are here)</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>In the <a href="https://medium.com/google-cloud/production-ready-mlops-on-gcp-part-5-training-pipeline-deep-dive-9850323a824d#">previous article</a>, we built a library of reusable Kubeflow Pipeline components &#8212; modular building blocks like <code>extract_table_to_gcs_op</code> and 
<code>upload_best_model_op</code>. Now comes the payoff: assembling these components into a complete, production-ready training pipeline.</p><p>But here&#8217;s what makes this challenging: a production training pipeline isn&#8217;t just &#8220;train a model and save it.&#8221; It needs to:</p><ul><li><p><strong>Preprocess data at scale</strong> using BigQuery</p></li><li><p><strong>Split data reproducibly</strong> so experiments are comparable</p></li><li><p><strong>Tune hyperparameters</strong> automatically to find the best configuration</p></li><li><p><strong>Train models</strong> in custom containers with full control</p></li><li><p><strong>Evaluate rigorously</strong> on held-out test data</p></li><li><p><strong>Compare with the champion</strong> to prevent degraded models from deploying</p></li><li><p><strong>Version and register</strong> models with complete lineage</p></li></ul><p>All while being:</p><ul><li><p><strong>Automated</strong>: No manual steps</p></li><li><p><strong>Reproducible</strong>: Same inputs &#8594; same outputs</p></li><li><p><strong>Observable</strong>: Full logging and monitoring</p></li><li><p><strong>Testable</strong>: Validated before production</p></li></ul><p>In this article, we&#8217;ll dissect our production training pipeline from end to end, exploring:</p><ul><li><p>Data preprocessing with BigQuery SQL</p></li><li><p>Repeatable data splitting strategies</p></li><li><p>Hyperparameter tuning with Vertex AI</p></li><li><p>Custom TensorFlow training containers</p></li><li><p>Champion/Challenger model comparison</p></li><li><p>Complete pipeline orchestration</p></li></ul><p>By the end, you&#8217;ll understand how all the pieces fit together to create a pipeline that reliably produces production-quality models.</p><h2><strong>Training Pipeline Architecture</strong></h2><p>Our training pipeline executes 8 major steps:</p><pre><code>1. Data Preprocessing (BigQuery SQL)
         &#8595;
2. Data Splitting (80/10/10 train/val/test)
         &#8595;
3. Data Extraction (BigQuery &#8594; GCS CSV)
         &#8595;
4. Hyperparameter Tuning (6 trials, 2 parallel)
         &#8595;
5. Model Training (Custom TensorFlow container)
         &#8595;
6. Model Evaluation (Test set metrics)
         &#8595;
7. Champion/Challenger Comparison (RMSE-based)
         &#8595;
8. Model Upload to Registry (if better than champion)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P3oG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P3oG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 424w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 848w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 1272w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P3oG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png" width="784" height="121" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:121,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!P3oG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 424w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 848w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 1272w, https://substackcdn.com/image/fetch/$s_!P3oG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdabcbda8-3d07-40e3-8daf-797c24a9bc05_784x121.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Each step is a component (or set of components) that we explored in Article 4. 
The magic is in how they&#8217;re orchestrated.</p><h2><strong>Step 1: Data Preprocessing with BigQuery</strong></h2><p><strong>Goal</strong>: Transform raw Chicago taxi trip data into features ready for model training.</p><h3><strong>Why BigQuery for Preprocessing?</strong></h3><p>You might wonder: why not preprocess in Python (pandas/PySpark)? Several reasons:</p><ol><li><p><strong>Scale</strong>: BigQuery processes terabytes without any cluster management; pandas is limited by a single machine&#8217;s memory</p></li><li><p><strong>Speed</strong>: SQL on BigQuery is typically faster than single-machine Python for large aggregations</p></li><li><p><strong>Cost</strong>: Storage and compute are separated, so you don&#8217;t pay for idle infrastructure</p></li><li><p><strong>Simplicity</strong>: SQL is declarative and familiar to data teams</p></li><li><p><strong>Versioning</strong>: SQL queries in Git are easier to review than Spark DAGs</p></li></ol><h3><strong>Preprocessing Query (Simplified)</strong></h3><pre><code>-- ingest.sql
CREATE OR REPLACE TABLE `{dataset}.{table_}` AS (
  SELECT
    -- Temporal features
    EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dayofweek,
    EXTRACT(HOUR FROM trip_start_timestamp) AS hourofday,
    -- Trip characteristics
    trip_miles,
    trip_seconds,
    -- Average speed in miles/hour (stored under the column name trip_distance)
    SAFE_DIVIDE(trip_miles, trip_seconds) * 3600 AS trip_distance,
    -- Categorical features
    company,
    payment_type,
    -- Label (target)
    fare AS total_fare
  FROM `{source}`
  WHERE
    -- Data quality filters
    trip_start_timestamp IS NOT NULL
    AND trip_miles &gt; 0
    AND trip_seconds &gt; 0
    AND fare &gt; 0
    -- Timeframe filter
    {timestamp_filter}
)</code></pre><h3><strong>Key Preprocessing Decisions</strong></h3><p><strong>1. Feature Engineering in SQL</strong></p><pre><code>SAFE_DIVIDE(trip_miles, trip_seconds) * 3600 AS trip_distance</code></pre><p>We create derived features (like speed in miles/hour) directly in SQL rather than in training code. This ensures:</p><ul><li><p><strong>Training-serving consistency</strong>: Same SQL runs for training and prediction</p></li><li><p><strong>Clarity</strong>: Feature logic is explicit and reviewable</p></li><li><p><strong>Performance</strong>: BigQuery optimizes SQL execution</p></li></ul><p><strong>2. Data Quality Filters</strong></p><pre><code>WHERE trip_miles &gt; 0 AND trip_seconds &gt; 0 AND fare &gt; 0</code></pre><p>Filtering bad data at the source prevents:</p><ul><li><p>NaN/Inf values that crash training</p></li><li><p>Outliers that distort model learning</p></li><li><p>Invalid records that waste compute</p></li></ul><p><strong>3. Temporal Consistency</strong></p><pre><code>{timestamp_filter}</code></pre><p>The pipeline supports two modes:</p><ul><li><p><strong>Latest data</strong>: Dynamically selects most recent 2&#8211;3 months</p></li><li><p><strong>Fixed timestamp</strong>: Uses data from a specific time period</p></li></ul><p>This enables:</p><ul><li><p><strong>Production</strong>: Always train on fresh data</p></li><li><p><strong>Development</strong>: Reproducible experiments with fixed data</p></li></ul><h3><strong>Pipeline Code: Preprocessing Step</strong></h3><pre><code>from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp
from pipelines.utils.query import generate_query
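
# Illustrative sketch only (an assumption, not the repository's code): the
# helper below mimics what generate_query is described as doing, treating the
# template as plain str.format placeholders. Its name and the demo table name
# are hypothetical.
def render_sql_template_sketch(template_text, **params):
    """Fill {placeholder} fields in a SQL template string."""
    return template_text.format(**params)


_demo_sql = render_sql_template_sketch(
    "SELECT fare FROM `{source}` WHERE fare IS NOT NULL {timestamp_filter}",
    source="my-project.my_dataset.trips",  # hypothetical table
    timestamp_filter="AND trip_miles IS NOT NULL",
)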

# Generate preprocessing SQL with template substitution
prep_query = generate_query(
    input_file=queries_folder / "ingest.sql",
    source=bq_source_uri,
    location=bq_location,
    dataset=f"{project}.{dataset}",
    table_=preprocessed_table,
    label=label,
    start_timestamp=timestamp,
    use_latest_data=use_latest_data,
)

# Execute preprocessing as a pipeline step
prep_op = BigqueryQueryJobOp(
    project=project,
    location="US",
    query=prep_query,
).set_display_name("Ingest &amp; preprocess data")</code></pre><p><strong>What happens</strong>:</p><ol><li><p><code>generate_query()</code> loads the SQL template and substitutes parameters</p></li><li><p><code>BigqueryQueryJobOp</code> executes the query in BigQuery</p></li><li><p>Results are written to <code>{project}.{dataset}.preprocessed_data</code></p></li><li><p>Subsequent steps read from this table</p></li></ol><h2><strong>Step 2: Repeatable Data Splitting</strong></h2><p><strong>Goal</strong>: Split data into train (80%), validation (10%), and test (10%) sets in a deterministic, reproducible way.</p><h3><strong>The Challenge of Reproducibility</strong></h3><p>Random splits are problematic:</p><pre><code># BAD: Different split every run
train, test = random_split(data, [0.8, 0.2])</code></pre><p><strong>Problems</strong>:</p><ul><li><p>Can&#8217;t reproduce experiments</p></li><li><p>Hyperparameter tuning results aren&#8217;t comparable</p></li><li><p>Can&#8217;t debug models trained weeks ago</p></li></ul><h3><strong>Our Solution: Hash-Based Deterministic Splitting</strong></h3><pre><code>-- repeatable_splitting.sql
CREATE OR REPLACE TABLE `{destination_table}` AS (
  SELECT * FROM `{source_dataset}.{source_table}`
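  -- MOD before ABS: ABS(FARM_FINGERPRINT(...)) raises an overflow error if the hash is the minimum INT64
  WHERE ABS(MOD(FARM_FINGERPRINT(CAST(unique_key AS STRING)), {num_lots})) IN {lots}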
)</code></pre><p><strong>How it works</strong>:</p><ol><li><p><strong>Hash the unique key</strong>: <code>FARM_FINGERPRINT(unique_key)</code> produces a consistent hash</p></li><li><p><strong>Modulo operation</strong>: <code>MOD(..., 10)</code> assigns each row to a bucket (0-9)</p></li><li><p><strong>Select buckets</strong>: Buckets 0&#8211;7 = train, 8 = validation, 9 = test</p></li></ol><p><strong>Benefits</strong>:</p><ul><li><p><strong>Deterministic</strong>: Same row always goes to same split</p></li><li><p><strong>Balanced</strong>: Hash distributes rows uniformly</p></li><li><p><strong>Reproducible</strong>: Re-running uses identical splits</p></li><li><p><strong>Efficient</strong>: Computed in BigQuery, not in application code</p></li></ul><h3><strong>Pipeline Code: Data Splitting</strong></h3><pre><code># Train split (buckets 0-7, 80% of data)
split_train_query = generate_query(
    input_file=queries_folder / "repeatable_splitting.sql",
    source_dataset=f"{project}.{dataset}",
    source_table=preprocessed_table,
    num_lots=10,
    lots=tuple(range(8)),  # (0, 1, 2, 3, 4, 5, 6, 7)
)
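
# Illustrative sketch only: the bucketing rule from repeatable_splitting.sql
# in plain Python. hashlib.sha256 stands in for BigQuery's FARM_FINGERPRINT
# (an assumption made for illustration; the two hashes assign different bucket
# numbers). The function name is hypothetical.
import hashlib


def assign_split_sketch(unique_key, num_lots=10):
    """Map a row key to "train", "valid", or "test" deterministically."""
    digest = hashlib.sha256(str(unique_key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_lots
    if bucket in range(8):  # buckets 0-7
        return "train"
    return "valid" if bucket == 8 else "test"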

split_train_data = BigqueryQueryJobOp(
    project=project,
    location=bq_location,
    query=split_train_query,
).after(prep_op).set_display_name("Split train data")

# Validation split (bucket 8, 10% of data)
split_valid_query = generate_query(
    input_file=queries_folder / "repeatable_splitting.sql",
    source_dataset=f"{project}.{dataset}",
    source_table=preprocessed_table,
    num_lots=10,
    lots="(8)",
)
split_valid_data = BigqueryQueryJobOp(
    project=project,
    location=bq_location,
    query=split_valid_query,
).after(prep_op).set_display_name("Split valid data")

# Test split (bucket 9, 10% of data)
split_test_query = generate_query(
    input_file=queries_folder / "repeatable_splitting.sql",
    source_dataset=f"{project}.{dataset}",
    source_table=preprocessed_table,
    num_lots=10,
    lots="(9)",
)
split_test_data = BigqueryQueryJobOp(
    project=project,
    location=bq_location,
    query=split_test_query,
).after(prep_op).set_display_name("Split test data")</code></pre><p><strong>Dependency management</strong>: All three splits depend on <code>prep_op</code> via <code>.after(prep_op)</code>, ensuring preprocessing completes first. Because the splits are independent of one another, they run <strong>in parallel</strong>.</p><h2><strong>Step 3: Data Extraction to Cloud Storage</strong></h2><p><strong>Goal</strong>: Export BigQuery tables to GCS as CSV files that TensorFlow can read.</p><h3><strong>Why Export to GCS?</strong></h3><p>TensorFlow&#8217;s <code>tf.data.experimental.make_csv_dataset()</code> reads from files, not BigQuery directly. We need to bridge this gap.</p><pre><code># Extract training data
train_dataset = (
    extract_table_to_gcs_op(
        bq_table=split_train_data.outputs["destination_table"]
    )
    .after(split_train_data)
    .set_display_name("Extract training data from BigQuery to GCS")
)

# Extract validation data
valid_dataset = (
    extract_table_to_gcs_op(
        bq_table=split_valid_data.outputs["destination_table"]
    )
    .after(split_valid_data)
    .set_display_name("Extract validation data from BigQuery to GCS")
)

# Extract test data
test_dataset = (
    extract_table_to_gcs_op(
        bq_table=split_test_data.outputs["destination_table"]
    )
    .after(split_test_data)
    .set_display_name("Extract test data from BigQuery to GCS")
)</code></pre><p><strong>What happens</strong>:</p><ol><li><p>Each BigQuery table is exported to a GCS URI (e.g., <code>gs://bucket/train/*.csv</code>)</p></li><li><p>The <code>extract_table_to_gcs_op</code> component handles the export job</p></li><li><p>Output artifacts (<code>train_dataset</code>, <code>valid_dataset</code>, <code>test_dataset</code>) are passed to training</p></li></ol><p><strong>Pro tip</strong>: GCS paths are automatically generated by KFP based on the pipeline run ID, ensuring each run has isolated data.</p><h2><strong>Step 4: Hyperparameter Tuning with Vertex AI</strong></h2><p><strong>Goal</strong>: Automatically find the best learning rate and batch size for our model.</p><h3><strong>Hyperparameter Search Space</strong></h3><p>We define which hyperparameters to tune and their ranges:</p><pre><code>from google.cloud.aiplatform import hyperparameter_tuning as hpt

PARAMETER_SPEC = {
    "learning-rate": hpt.DoubleParameterSpec(
        min=0.0001,
        max=1,
        scale="log"  # Search logarithmically
    ),
    "batch-size": hpt.DiscreteParameterSpec(
        values=[128, 256, 512],
        scale="linear"
    ),
}
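
# Illustrative sketch only: what "log" scale means for the learning-rate range.
# This is not Vertex AI's search algorithm; it only shows that sampling
# uniformly in log10 space spreads draws evenly across the decades between
# min and max. The function name is hypothetical.
import math
import random


def sample_log_uniform_sketch(low, high, rng):
    """Draw one value from [low, high], uniform in log10 space."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


_demo_rates = [sample_log_uniform_sketch(0.0001, 1, random.Random(s)) for s in range(5)]
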
METRIC_SPEC = {
    "val_root_mean_squared_error": "minimize"
}</code></pre><p><strong>Design choices</strong>:</p><ul><li><p><strong>Log scale for learning rate</strong>: Search across orders of magnitude (0.0001, 0.001, 0.01, 0.1, 1) rather than linearly</p></li><li><p><strong>Discrete batch sizes</strong>: Only try powers of 2 for memory efficiency</p></li><li><p><strong>Validation RMSE</strong>: Optimize for generalization, not training loss</p></li></ul><h3><strong>Hyperparameter Tuning Workflow</strong></h3><pre><code>from google_cloud_pipeline_components.v1.hyperparameter_tuning_job import (
    HyperparameterTuningJobRunOp,
    serialize_metrics,
    serialize_parameters,
)

# 1. Prepare args for hyperparameter tuning
args = dict(
    train_data=train_dataset.outputs["dataset"],
    valid_data=valid_dataset.outputs["dataset"],
    test_data=test_dataset.outputs["dataset"],
    hypertune=True,  # Enable hyperparameter tuning mode
)

hypertune_args_step = get_training_args_dict_op(**args).set_display_name(
    "Get-Hypertune-Args"
)

# 2. Configure worker pool for tuning trials
hypertune_worker_pool_specs_step = get_workerpool_spec_op(
    worker_pool_specs=WORKER_POOL_SPECS,
    args=hypertune_args_step.output,
).set_display_name("Get-Hypertune-Worker-Pool-Spec")

# 3. Run hyperparameter tuning job
hypertune_step = HyperparameterTuningJobRunOp(
    display_name="hypertune-job",
    project=project,
    location=location,
    worker_pool_specs=hypertune_worker_pool_specs_step.output,
    study_spec_metrics=serialize_metrics(METRIC_SPEC),
    study_spec_parameters=serialize_parameters(PARAMETER_SPEC),
    max_trial_count=6,           # Try 6 different combinations
    parallel_trial_count=2,      # Run 2 trials simultaneously
    base_output_directory=f"{base_output_dir}/hypertune-job",
).set_display_name("Hypertune-Job")

# 4. Extract best hyperparameters
hypertune_results_step = get_hyperparameter_tuning_results_op(
    project=project,
    location=location,
    job_resource=hypertune_step.output,
    study_spec_metrics=serialize_metrics(METRIC_SPEC),
).set_display_name("Get-Hypertune-Results")</code></pre><h3><strong>What Happens During Hyperparameter Tuning?</strong></h3><ol><li><p><strong>Trial Spawning</strong>: Vertex AI launches trials two at a time (<code>parallel_trial_count=2</code>), each with different hyperparameters</p></li><li><p><strong>Training</strong>: Each trial trains the model on the training set and validates it on the validation set</p></li><li><p><strong>Metric Reporting</strong>: Each trial reports <code>val_root_mean_squared_error</code> to Vertex AI</p></li><li><p><strong>Algorithm</strong>: Vertex AI uses Bayesian optimization to choose the next trials based on earlier results</p></li><li><p><strong>Best Selection</strong>: After all 6 trials, the trial with the lowest validation RMSE provides the best hyperparameters</p></li></ol><p><strong>Example Trial Results</strong>:</p><pre><code>Trial 1: learning_rate=0.001, batch_size=128 &#8594; val_RMSE=3.2
Trial 2: learning_rate=0.01,  batch_size=256 &#8594; val_RMSE=2.9  &#8592; Best so far
Trial 3: learning_rate=0.1,   batch_size=512 &#8594; val_RMSE=4.5
Trial 4: learning_rate=0.005, batch_size=256 &#8594; val_RMSE=2.7  &#8592; New best!
Trial 5: learning_rate=0.003, batch_size=256 &#8594; val_RMSE=2.8
Trial 6: learning_rate=0.007, batch_size=256 &#8594; val_RMSE=2.75

Best: learning_rate=0.005, batch_size=256, val_RMSE=2.7</code></pre><p>The <code>hypertune_results_step</code> extracts these best hyperparameters for final training.</p><h2><strong>Step 5: Custom TensorFlow Training Container</strong></h2><p><strong>Goal</strong>: Train a TensorFlow DNN model with the best hyperparameters using a custom container.</p><h3><strong>Why Custom Containers?</strong></h3><p>Vertex AI provides pre-built training containers, but we use a custom one because:</p><ul><li><p><strong>Full control</strong>: Install exact dependencies we need</p></li><li><p><strong>Custom preprocessing</strong>: TensorFlow layers for feature encoding</p></li><li><p><strong>Hyperparameter integration</strong>: Pass tuned hyperparameters to training script</p></li><li><p><strong>Model architecture</strong>: Implement custom DNN structure</p></li><li><p><strong>Artifact management</strong>: Save model, metrics, and metadata exactly how we want</p></li></ul><h3><strong>Training Container Structure</strong></h3><pre><code>model/
&#9500;&#9472;&#9472; Dockerfile                 # Container definition
&#9500;&#9472;&#9472; requirements.txt           # Python dependencies
&#9492;&#9472;&#9472; trainer/
    &#9500;&#9472;&#9472; __init__.py
    &#9500;&#9472;&#9472; task.py               # Entry point (argument parsing)
    &#9492;&#9472;&#9472; model.py              # Model definition and training logic</code></pre><h3><strong>Model Architecture (model.py)</strong></h3><p>Our model is a <strong>DNN with preprocessing layers</strong> built into the graph:</p><pre><code>def build_and_compile_model(dataset, model_params):
    # Numeric features (normalize)
    NUM_COLS = ["dayofweek", "hourofday", "trip_distance", "trip_miles", "trip_seconds"]

    # Ordinal categorical (integer encoding)
    ORD_COLS = ["company"]
    # One-hot categorical (one-hot encoding)
    OHE_COLS = ["payment_type"]
    # Create input layers
    num_ins = {name: Input(shape=(), name=name, dtype=tf.float32) for name in NUM_COLS}
    ord_ins = {name: Input(shape=(), name=name, dtype=tf.string) for name in ORD_COLS}
    cat_ins = {name: Input(shape=(), name=name, dtype=tf.string) for name in OHE_COLS}
    all_ins = {**num_ins, **ord_ins, **cat_ins}
    # Preprocessing layers (learned from training data)
    num_encoded = [normalization(name, dataset)(num_ins[name]) for name in NUM_COLS]
    ord_encoded = [str_lookup(name, dataset, "int")(ord_ins[name]) for name in ORD_COLS]
    ohe_encoded = [str_lookup(name, dataset, "one_hot")(cat_ins[name]) for name in OHE_COLS]
    # Concatenate all features
    x = Concatenate()(num_encoded + ord_encoded + ohe_encoded)
    # Hidden layers
    for units, activation in model_params["hidden_units"]:
        x = Dense(units, activation=activation)(x)
    # Output layer (regression)
    output = Dense(1, name="output", activation="linear")(x)
    # Build model
    model = Model(inputs=all_ins, outputs=output, name="nn_model")
    # Compile with optimizer and metrics
    optimizer = optimizers.get(model_params["optimizer"])
    optimizer.learning_rate = model_params["learning_rate"]
    model.compile(
        loss=model_params["loss_fn"],
        optimizer=optimizer,
        metrics=[
            tf.keras.metrics.RootMeanSquaredError(name="root_mean_squared_error"),
            tf.keras.metrics.MeanAbsoluteError(name="mean_absolute_error"),
            tf.keras.metrics.MeanAbsolutePercentageError(name="mean_absolute_percentage_error"),
            tf.keras.metrics.MeanSquaredLogarithmicError(name="mean_squared_logarithmic_error"),
        ],
    )
    return model</code></pre><p><strong>Key features</strong>:</p><ol><li><p><strong>Preprocessing in the Model</strong>:</p></li></ol><ul><li><p><code>Normalization</code> layer: learns mean/std from training data</p></li><li><p><code>StringLookup</code> layers: learn vocabularies for categorical features</p></li><li><p>These layers are <strong>saved with the model</strong> &#8594; no separate preprocessing needed at inference</p></li></ul><p><strong>2. Multiple Metrics</strong>:</p><ul><li><p>RMSE (primary metric for champion/challenger comparison)</p></li><li><p>MAE, MAPE, MSLE (additional evaluation metrics)</p></li></ul><p><strong>3. Configurable Architecture</strong>:</p><ul><li><p>Hidden units, optimizer, learning rate all passed as parameters</p></li><li><p>Easy to experiment without changing code</p></li></ul><h3><strong>Training Execution</strong></h3><pre><code># Prepare args for final training (not hypertuning)
args.update(dict(hypertune=False))

training_args_step = get_training_args_dict_op(**args).set_display_name(
    "Get-Training-Args"
)
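# --- Illustrative sketch (assumptions, not the repository's code) ---
# WORKER_POOL_SPECS is not shown in this article. A minimal value in the
# Vertex AI CustomJob worker pool format could look like the list below; the
# image URI is a placeholder. merge_args_into_spec is a hypothetical helper
# that shows one way get_workerpool_spec_op might turn the training args and
# the tuned hyperparameters into "--flag=value" arguments for task.py.
import copy

WORKER_POOL_SPECS = [
    {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "REGION-docker.pkg.dev/PROJECT/REPO/trainer:latest",  # placeholder
            "args": [],
        },
    }
]


def merge_args_into_spec(worker_pool_specs, hyperparams, args):
    """Return a copy of the specs with args and hyperparams appended as flags."""
    specs = copy.deepcopy(worker_pool_specs)  # leave the template untouched
    merged = {**args, **hyperparams}  # tuned values override defaults
    flags = [f"--{key.replace('_', '-')}={value}" for key, value in merged.items()]
    for spec in specs:
        spec["container_spec"]["args"] = spec["container_spec"].get("args", []) + flags
    return specs
# --- End of sketch ---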
# Configure worker pool with best hyperparameters
training_worker_pool_specs_step = get_workerpool_spec_op(
    worker_pool_specs=WORKER_POOL_SPECS,
    hyperparams=hypertune_results_step.output,  # Use best hyperparameters!
    args=training_args_step.output,
).set_display_name("Get-Training-Worker-Pool-Spec")
# Launch custom training job
custom_job_task = CustomTrainingJobOp(
    project=project,
    display_name=training_job_display_name,
    worker_pool_specs=training_worker_pool_specs_step.output,
    base_output_directory=f"{base_output_dir}/training-job",
    location=location,
)</code></pre><p><strong>What happens</strong>:</p><ol><li><p>Vertex AI provisions a VM with the specified machine type (<code>n1-standard-4</code>)</p></li><li><p>It pulls the custom training container from Artifact Registry</p></li><li><p>It runs the training script with the hyperparameters from the tuning step</p></li><li><p>The model trains on the training data and validates on the validation data</p></li><li><p>The script saves the trained model to GCS as a TensorFlow SavedModel</p></li></ol><h3><strong>Training Script Arguments (task.py)</strong></h3><pre><code>import argparse
import os

# Command-line arguments supplied by the worker pool spec
parser = argparse.ArgumentParser()
parser.add_argument("--train-data", required=True, help="Path to training CSV")
parser.add_argument("--valid-data", required=True, help="Path to validation CSV")
parser.add_argument("--test-data", required=True, help="Path to test CSV")
parser.add_argument("--model-dir", default=os.getenv("AIP_MODEL_DIR"), help="Model output directory")
parser.add_argument("--learning-rate", type=float, default=0.001)
parser.add_argument("--batch-size", type=int, default=100)
parser.add_argument("--epochs", type=int, default=10)</code></pre><p>These arguments are populated by the worker pool spec, which includes the best hyperparameters.</p><h2><strong>Step 6: Model Evaluation on Test Set</strong></h2><p><strong>Goal</strong>: Evaluate the trained model on held-out test data to get unbiased performance metrics.</p><pre><code># Extract training results (model + metrics)
training_results_step = get_custom_job_results_op(
    project=project,
    location=location,
    job_resource=custom_job_task.output
).set_display_name("Get-Training-Results")</code></pre><p>The <code>get_custom_job_results_op</code> component:</p><ol><li><p>Reads the SavedModel from GCS</p></li><li><p>Loads the test dataset</p></li><li><p>Evaluates the model: <code>model.evaluate(test_data)</code></p></li><li><p>Extracts metrics: RMSE, MAE, MAPE, MSLE</p></li><li><p>Writes metrics to a JSON file artifact</p></li></ol><p><strong>Example metrics.json</strong>:</p><pre><code>{
  "problemType": "regression",
  "rootMeanSquaredError": 2.7,
  "meanAbsoluteError": 1.9,
  "meanAbsolutePercentageError": 12.5,
  "meanSquaredLogarithmicError": 0.08
}</code></pre><p>These metrics are passed to the champion/challenger comparison step.</p><h2><strong>Step 7: Champion/Challenger Comparison</strong></h2><p><strong>Goal</strong>: Promote the new model to production only if it outperforms the current champion.</p><p>This is implemented by the <code>upload_best_model_op</code> component (see Article 4 for details):</p><pre><code>upload_best_model_op(
    project=project,
    location=location,
    model=training_results_step.outputs["model"],
    model_eval_metrics=training_results_step.outputs["metrics"],
    test_data=test_dataset.outputs["dataset"],
    eval_metric="rootMeanSquaredError",
    eval_lower_is_better=True,
    serving_container_image=PREDICTION_IMAGE,
    model_name=model_name,
    model_description="Predict price of a taxi trip.",
    pipeline_job_id="{{$.pipeline_job_name}}",
).set_display_name("Upload model")</code></pre><p><strong>Comparison logic</strong>:</p><ol><li><p><strong>Look up champion</strong>: Query Vertex AI Model Registry for the current default model</p></li><li><p><strong>Get champion metrics</strong>: Read the evaluation metrics from the champion model</p></li><li><p><strong>Compare</strong>:</p></li></ol><pre><code>challenger_wins = challenger_rmse &lt; champion_rmse</code></pre><p>4. <strong>Upload</strong>:</p><ul><li><p>If challenger wins: Upload as <code>is_default_version=True</code> (becomes new champion)</p></li><li><p>If champion wins: Upload as <code>is_default_version=False</code> (versioned but not default)</p></li></ul><p><strong>Example scenario</strong>:</p><pre><code>Current Champion: RMSE = 3.1
New Model: RMSE = 2.7
&#8594; Challenger wins! (2.7 &lt; 3.1)
&#8594; Upload as default version
&#8594; New model becomes champion</code></pre><h2><strong>Step 8: Model Upload and Registry Management</strong></h2><p>The model is uploaded to Vertex AI Model Registry with:</p><ul><li><p><strong>Display name</strong>: <code>taxi-traffic-model</code></p></li><li><p><strong>Version</strong>: Auto-incremented (v1, v2, v3, &#8230;)</p></li><li><p><strong>Default flag</strong>: Set based on champion/challenger comparison</p></li><li><p><strong>Evaluation metrics</strong>: Imported and visible in Vertex AI UI</p></li><li><p><strong>Lineage</strong>: Linked to training pipeline run, datasets used</p></li><li><p><strong>Serving container</strong>: Specified for deployment</p></li></ul><p><strong>Registry view after upload</strong>:</p><pre><code>taxi-traffic-model
&#9500;&#9472;&#9472; v1 (RMSE: 3.5) - created 2 weeks ago
&#9500;&#9472;&#9472; v2 (RMSE: 3.1) - created 1 week ago [CHAMPION]
&#9492;&#9472;&#9472; v3 (RMSE: 2.7) - created today [NEW CHAMPION]</code></pre><h2><strong>Complete Pipeline Code Walkthrough</strong></h2><p>Here&#8217;s the full pipeline definition (simplified for clarity):</p><pre><code>from kfp import compiler, dsl
# Prebuilt ops used below, from Google Cloud Pipeline Components (v1 namespace)
from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp
from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp
from google_cloud_pipeline_components.v1.hyperparameter_tuning_job import HyperparameterTuningJobRunOp

@dsl.pipeline(name="taxifare-training-pipeline")
def pipeline(
    project: str,
    location: str,
    model_name: str = "taxi-traffic-model",
):
    # Step 1: Preprocess data
    prep_op = BigqueryQueryJobOp(
        project=project,
        location="US",
        query=prep_query,
    ).set_display_name("Ingest &amp; preprocess data")
    # Step 2: Split data (80/10/10)
    split_train = BigqueryQueryJobOp(...).after(prep_op)
    split_valid = BigqueryQueryJobOp(...).after(prep_op)
    split_test = BigqueryQueryJobOp(...).after(prep_op)
    # Step 3: Extract to GCS
    train_dataset = extract_table_to_gcs_op(...).after(split_train)
    valid_dataset = extract_table_to_gcs_op(...).after(split_valid)
    test_dataset = extract_table_to_gcs_op(...).after(split_test)
    # Step 4: Hyperparameter tuning
    hypertune_step = HyperparameterTuningJobRunOp(...)
    hypertune_results = get_hyperparameter_tuning_results_op(...)
    # Step 5: Train with best hyperparameters
    custom_job = CustomTrainingJobOp(
        worker_pool_specs=training_worker_pool_specs_step.output
    )
    # Step 6: Evaluate
    training_results = get_custom_job_results_op(...)
    # Step 7 &amp; 8: Champion/Challenger comparison and upload
    upload_best_model_op(
        model=training_results.outputs["model"],
        model_eval_metrics=training_results.outputs["metrics"],
        eval_metric="rootMeanSquaredError",
        eval_lower_is_better=True,
        model_name=model_name,
    )</code></pre><p><strong>DAG visualization </strong>on Vertex AI pipeline:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gRl5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gRl5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 424w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 848w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 1272w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gRl5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png" width="788" height="373" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92d8939a-2154-4436-b077-c4019d63a949_788x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:373,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gRl5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 424w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 848w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 1272w, https://substackcdn.com/image/fetch/$s_!gRl5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8939a-2154-4436-b077-c4019d63a949_788x373.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ljCP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ljCP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 424w, https://substackcdn.com/image/fetch/$s_!ljCP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 848w, 
https://substackcdn.com/image/fetch/$s_!ljCP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 1272w, https://substackcdn.com/image/fetch/$s_!ljCP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ljCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png" width="788" height="321" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:321,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ljCP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 424w, https://substackcdn.com/image/fetch/$s_!ljCP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 848w, 
https://substackcdn.com/image/fetch/$s_!ljCP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 1272w, https://substackcdn.com/image/fetch/$s_!ljCP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50d1ea5f-313e-4ba6-906f-7ff2e047809d_788x321.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Pipeline Observability and Debugging</strong></h2><h3><strong>Vertex AI Pipeline UI</strong></h3><p>When you run the pipeline, 
Vertex AI provides a rich UI:</p><ol><li><p><strong>DAG visualization</strong>: See all steps and their dependencies</p></li><li><p><strong>Step-by-step logs</strong>: Click any component to view logs</p></li><li><p><strong>Artifact tracking</strong>: See inputs/outputs of each step</p></li><li><p><strong>Lineage graph</strong>: Trace data from source to model</p></li><li><p><strong>Execution timeline</strong>: Identify bottlenecks</p></li></ol><h3><strong>Key Metrics to Monitor</strong></h3><p><strong>During hyperparameter tuning</strong>:</p><ul><li><p>Trials completed vs total</p></li><li><p>Best validation RMSE so far</p></li><li><p>Trial execution time</p></li></ul><p><strong>During training</strong>:</p><ul><li><p>Training loss curve</p></li><li><p>Validation RMSE per epoch</p></li><li><p>Training duration</p></li></ul><p><strong>After upload</strong>:</p><ul><li><p>Champion vs challenger RMSE</p></li><li><p>Model version number</p></li><li><p>Upload success/failure</p></li></ul><h2><strong>Production Considerations</strong></h2><h3><strong>Caching for Faster Iterations</strong></h3><p>KFP supports caching &#8212; if inputs haven&#8217;t changed, skip execution and reuse previous outputs:</p><pre><code># Enable caching for expensive operations
prep_op.set_caching_options(True)
split_train.set_caching_options(True)</code></pre><p><strong>When to cache</strong>:</p><ul><li><p>Data preprocessing (if source data hasn&#8217;t changed)</p></li><li><p>Data splitting (deterministic, always same result)</p></li></ul><p><strong>When NOT to cache</strong>:</p><ul><li><p>Hyperparameter tuning (want fresh trials)</p></li><li><p>Training (want latest model)</p></li></ul><h2><strong>Conclusion</strong></h2><p>We&#8217;ve dissected a production training pipeline from raw data to deployed model. The key elements that make it production-ready:</p><ul><li><p><strong>BigQuery preprocessing</strong>: Scalable SQL-based feature engineering</p></li><li><p><strong>Repeatable splitting</strong>: Hash-based deterministic train/val/test splits</p></li><li><p><strong>Hyperparameter tuning</strong>: Automatic optimization with Vertex AI</p></li><li><p><strong>Custom containers</strong>: Full control over training environment</p></li><li><p><strong>Rigorous evaluation</strong>: Test set metrics for unbiased assessment</p></li><li><p><strong>Champion/Challenger</strong>: Quality gate preventing degraded models</p></li><li><p><strong>Model registry</strong>: Versioning, lineage, and governance</p></li></ul><p>Each component is modular, testable, and reusable. The DAG clearly shows dependencies. 
Observability is built-in at every step.</p><p>In the next article, we&#8217;ll explore how CI/CD automates this entire workflow &#8212; from code commit to production deployment &#8212; ensuring every change is tested, validated, and deployed safely.</p><p><strong>Key Takeaways:</strong></p><ul><li><p>Preprocess at scale with BigQuery SQL for speed and simplicity</p></li><li><p>Use hash-based splitting for deterministic, reproducible experiments</p></li><li><p>Automate hyperparameter tuning to find optimal configurations</p></li><li><p>Build preprocessing into TensorFlow models for training-serving consistency</p></li><li><p>Implement Champion/Challenger pattern to protect production quality</p></li><li><p>Track lineage from raw data through trained models in Vertex AI</p></li></ul><p><strong>Next in Series</strong>: CI/CD for ML: Automating from Code to Production</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>Pipeline Code</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/pipelines/src/pipelines/training.py">Training pipeline</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/pipelines/src/pipelines/queries">SQL queries</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/model">Model code</a></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 4: Building Reusable Kubeflow Pipeline Components]]></title><description><![CDATA[Part 4 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-8ac</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-8ac</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 10:12:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a> (You are here)</p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>In the <a href="https://medium.com/google-cloud/production-ready-mlops-on-gcp-part-4-building-reusable-kubeflow-pipeline-components-40cccfecd16d#">previous article</a>, we built the infrastructure foundation for our MLOps system using Terraform. Our Vertex AI environment is provisioned, our service accounts have the right permissions, and our artifact registries are ready. Now comes the exciting part: building the ML workflows themselves.</p><p>But here&#8217;s the thing: if you approach ML pipelines the way most teams do &#8212; writing monolithic scripts that do everything &#8212; you&#8217;ll end up with code that&#8217;s hard to test, impossible to reuse, and a nightmare to debug. Every pipeline becomes a snowflake, and maintaining them becomes a full-time job.</p><p>The solution? 
<strong>Reusable Kubeflow Pipeline (KFP) components</strong>: modular, composable building blocks that can be mixed and matched to create any ML workflow you need.</p><p>In this article, we&#8217;ll explore:</p><ul><li><p>What makes a good pipeline component</p></li><li><p>How to design components following the single responsibility principle</p></li><li><p>Deep dives into 4 critical components from our system</p></li><li><p>Testing strategies for ML components</p></li><li><p>Best practices for component development</p></li></ul><p>By the end, you&#8217;ll understand how to build a component library that makes creating production ML pipelines as simple as connecting LEGO blocks.</p><h2><strong>What Are Kubeflow Pipeline Components?</strong></h2><p>Think of a KFP component as a <strong>function with superpowers</strong>. It&#8217;s a self-contained piece of code that:</p><ul><li><p>Performs one specific task (e.g., &#8220;export BigQuery table to GCS&#8221;)</p></li><li><p>Declares its inputs and outputs explicitly</p></li><li><p>Runs in its own containerized environment</p></li><li><p>Can be reused across multiple pipelines</p></li><li><p>Is independently testable</p></li></ul><h3><strong>A Simple Example</strong></h3><p>Here&#8217;s what a basic KFP component looks like:</p><pre><code>from kfp.dsl import component, Output, Dataset

@component(
    base_image="python:3.10",
    packages_to_install=["pandas==2.0.0"]
)
def process_data_op(
    input_path: str,
    output_dataset: Output[Dataset],
    filter_threshold: float = 0.5
) -&gt; None:
    """Process data and save to output."""
    import pandas as pd
    # Load data
    df = pd.read_csv(input_path)
    # Apply transformation
    df_filtered = df[df['score'] &gt; filter_threshold]
    # Save result
    df_filtered.to_csv(output_dataset.path, index=False)</code></pre><p><strong>What makes this a component?</strong></p><ol><li><p><code>@component</code><strong> decorator</strong>: Tells KFP this is a reusable component</p></li><li><p><code>base_image</code>: Specifies the Docker image to run in</p></li><li><p><code>packages_to_install</code>: Auto-installs dependencies</p></li><li><p><strong>Type annotations</strong>: <code>Output[Dataset]</code> tells KFP this produces a dataset artifact</p></li><li><p><strong>Self-contained logic</strong>: Everything needed to run is inside the function</p></li></ol><h3><strong>Python Function-Based vs Containerized Components</strong></h3><p>KFP supports two component types:</p><p><strong>Python function-based components</strong> (what we use):</p><ul><li><p>Define components as Python functions</p></li><li><p>KFP automatically containerizes them</p></li><li><p>Easy to write and test</p></li><li><p>Perfect for most use cases</p></li></ul><p><strong>Containerized components</strong>:</p><ul><li><p>You build the Docker image yourself</p></li><li><p>Maximum control over the environment</p></li><li><p>Necessary for complex dependencies or non-Python code</p></li></ul><p>We use <strong>function-based components</strong> because they offer the best balance of simplicity and power.</p><h3><strong>Our Component Library: The Building Blocks</strong></h3><p>Our reference implementation includes 8 reusable components:</p><ol><li><p><code>extract_table_to_gcs_op</code> - Export BigQuery tables to Cloud Storage</p></li><li><p><code>get_training_args_dict_op</code> - Build training configuration dictionaries</p></li><li><p><code>get_workerpool_spec_op</code> - Configure distributed training worker pools</p></li><li><p><code>get_hyperparameter_tuning_results_op</code> - Parse hyperparameter tuning results</p></li><li><p><code>get_custom_job_results_op</code> - Extract metrics from training jobs</p></li><li><p><code>lookup_model_op</code> - 
Find models in Vertex AI Model Registry</p></li><li><p><code>upload_best_model_op</code> - Champion/Challenger model comparison and upload</p></li><li><p><code>model_batch_predict_op</code> - Execute batch predictions with monitoring</p></li></ol><p>Each component follows the <strong>single responsibility principle</strong> &#8212; it does one thing and does it well. Let&#8217;s dive deep into four critical ones.</p><h2><strong>Component Architecture</strong></h2><p>Here&#8217;s how our 8 reusable components interact with pipelines and GCP services:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Mll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Mll!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 424w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 848w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 1272w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5Mll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png" width="784" height="141" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3594d328-47d1-482b-91d2-a4045805490d_784x141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:141,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5Mll!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 424w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 848w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 1272w, https://substackcdn.com/image/fetch/$s_!5Mll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3594d328-47d1-482b-91d2-a4045805490d_784x141.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2><strong>Component Deep Dive 1: extract_table_to_gcs_op</strong></h2><p><strong>Purpose</strong>: Export a 
BigQuery table to Cloud Storage in CSV format.</p><p><strong>Why it exists</strong>: Many ML frameworks expect data in files (CSV, TFRecord) rather than directly from BigQuery. This component bridges that gap.</p><h3><strong>Implementation</strong></h3><pre><code>from kfp.dsl import Dataset, Artifact, component, Input, Output

@component(
    base_image="python:3.10.14",
    packages_to_install=["google-cloud-bigquery==3.24.0"]
)
def extract_table_to_gcs_op(
    bq_table: Input[Artifact],
    dataset: Output[Dataset],
    location: str = "US",
) -&gt; None:
    """Extract a BigQuery table into Google Cloud Storage."""
    import google.cloud.bigquery as bq
    # Extract table metadata from input artifact
    project_id = bq_table.metadata["projectId"]
    dataset_id = bq_table.metadata["datasetId"]
    table_id = bq_table.metadata["tableId"]
    # Construct full table ID
    full_table_id = f"{project_id}.{dataset_id}.{table_id}"
    table = bq.table.Table(table_ref=full_table_id)
    # Initialize BigQuery client
    client = bq.client.Client(project=project_id, location=location)
    # Submit extract job to GCS
    extract_job = client.extract_table(table, dataset.uri)
    # Wait for completion
    extract_job.result()</code></pre><h3><strong>Key Design Decisions</strong></h3><p><strong>1. Artifact-Based Input</strong></p><pre><code>bq_table: Input[Artifact]</code></pre><p>The component receives table information as an artifact with metadata, not raw strings. This enables:</p><ul><li><p><strong>Lineage tracking</strong>: Vertex AI knows which table produced which dataset</p></li><li><p><strong>Type safety</strong>: Can&#8217;t accidentally pass wrong data</p></li><li><p><strong>Metadata preservation</strong>: Project ID, dataset ID, table ID travel together</p></li></ul><p><strong>2. Output as Dataset</strong></p><pre><code>dataset: Output[Dataset]</code></pre><p>The output is typed as a <code>Dataset</code>, which:</p><ul><li><p>Creates a GCS URI automatically (<code>dataset.uri</code>)</p></li><li><p>Registers the dataset in Vertex AI Metadata Store</p></li><li><p>Enables downstream components to reference it</p></li></ul><p><strong>3. Explicit Location</strong></p><pre><code>location: str = "US"</code></pre><p>BigQuery location matters for data residency and performance. Making it explicit prevents subtle bugs, such as a job that fails with a not-found error because it ran in a different location than the dataset.</p><h3><strong>Usage in a Pipeline</strong></h3><pre><code>from kfp import dsl

@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    # Previous step creates bq_table artifact
    preprocess_task = preprocess_data_op(...)
    # Extract to GCS
    extract_task = extract_table_to_gcs_op(
        bq_table=preprocess_task.outputs["bq_table"],
        location="US"
    )
    # Next step uses the dataset
    train_task = train_model_op(
        training_data=extract_task.outputs["dataset"]
    )</code></pre><p>The output of one component becomes the input of another &#8212; clean, type-safe data flow.</p><h2><strong>Component Deep Dive 2: lookup_model_op</strong></h2><p><strong>Purpose</strong>: Find a model in Vertex AI Model Registry by display name.</p><p><strong>Why it exists</strong>: For predictions, we need to retrieve the &#8220;champion&#8221; model. For champion/challenger comparison, we need to find the existing champion.</p><h3><strong>Implementation (Simplified)</strong></h3><pre><code>from kfp.dsl import component, Output, Model
from typing import NamedTuple

@component(
    base_image="python:3.10.14",
    packages_to_install=["google-cloud-aiplatform==1.55.0"],
)
def lookup_model_op(
    model_name: str,
    location: str,
    project: str,
    model: Output[Model],
    fail_on_model_not_found: bool = False,
) -&gt; NamedTuple("Outputs", [("model_resource_name", str), ("training_dataset", dict)]):
    """Fetch a model by display name from Vertex AI Model Registry."""
    import json
    import logging
    from pathlib import Path
    from google.cloud.aiplatform import Model
    TRAINING_DATASET_INFO = "training_dataset.json"
    logging.info(f"Listing models with display name {model_name}")
    models = Model.list(
        filter=f'display_name="{model_name}"',
        location=location,
        project=project,
    )
    logging.info(f"Found {len(models)} model(s)")
    training_dataset = {}
    model_resource_name = ""
    if len(models) == 0:
        logging.error(f"No model found with name {model_name}")
        if fail_on_model_not_found:
            raise RuntimeError("Failed as model was not found")
    elif len(models) == 1:
        target_model = models[0]
        model_resource_name = target_model.resource_name
        # Populate output artifact
        model.uri = target_model.uri
        model.metadata["resourceName"] = target_model.resource_name
        # Read training dataset metadata (for monitoring)
        path = Path(model.path) / TRAINING_DATASET_INFO
        if path.exists():
            with open(path, "r") as fp:
                training_dataset = json.load(fp)
            logging.info(f"Training dataset: {training_dataset}")
    else:
        raise RuntimeError(f"Multiple models with name {model_name} found")
    return model_resource_name, training_dataset</code></pre><h3><strong>Key Design Decisions</strong></h3><p><strong>1. Multiple Return Values</strong></p><pre><code>-&gt; NamedTuple("Outputs", [("model_resource_name", str), ("training_dataset", dict)])</code></pre><p>Components can return multiple outputs. The <code>model_resource_name</code> is used for logging, while <code>training_dataset</code> is used for model monitoring configuration.</p><p><strong>2. Flexible Error Handling</strong></p><pre><code>fail_on_model_not_found: bool = False</code></pre><p>Different scenarios need different behaviors:</p><ul><li><p><strong>First pipeline run</strong>: No model exists yet, don&#8217;t fail</p></li><li><p><strong>Production prediction</strong>: Model must exist, fail if not found</p></li></ul><p><strong>3. Metadata Extraction</strong></p><p>The component reads <code>training_dataset.json</code> from the model directory. This metadata (created during training) contains the information needed for model monitoring, a good example of <strong>components communicating via artifacts and metadata</strong>.</p><h3><strong>Usage: Champion Model Lookup</strong></h3><pre><code>@dsl.pipeline(name="prediction-pipeline")
def prediction_pipeline(model_name: str = "chicago-taxi-fare"):
    # Lookup champion model
    lookup_task = lookup_model_op(
        model_name=model_name,
        location="us-central1",
        project="my-project",
        fail_on_model_not_found=True  # Must exist for predictions
    )

    # Use champion model for predictions
    predict_task = model_batch_predict_op(
        model=lookup_task.outputs["model"],
        # ... other params
    )</code></pre><h2><strong>Component Deep Dive 3: upload_best_model_op</strong></h2><p><strong>Purpose</strong>: Implement the Champion/Challenger pattern: compare the new model against the existing champion, upload it to the registry, and promote it to the default version only if it is better.</p><p><strong>Why it exists</strong>: This is the <strong>gatekeeper</strong> that prevents degraded models from reaching production. It&#8217;s the most critical component in the system.</p><h3><strong>Implementation (Simplified)</strong></h3><pre><code>from kfp.dsl import Dataset, Input, Metrics, Model, Output, component
from google_cloud_pipeline_components.types.artifact_types import VertexModel

@component(
    base_image="python:3.10",
    packages_to_install=[
        "google-cloud-aiplatform==1.55.0",
        "google-cloud-pipeline-components==2.14.1",
    ],
)
def upload_best_model_op(
    model: Input[Model],
    test_data: Input[Dataset],
    model_eval_metrics: Input[Metrics],
    vertex_model: Output[VertexModel],
    project: str,
    location: str,
    model_name: str,
    eval_metric: str,
    eval_lower_is_better: bool,
    pipeline_job_id: str,
    serving_container_image: str,
    model_description: str = None,
    evaluation_name: str = "Imported evaluation",
) -&gt; None:
    """Upload the model; make it the default version only if it beats the champion."""
    import json
    import logging
    import google.cloud.aiplatform as aip
    from google.protobuf.json_format import MessageToDict
    def lookup_model(model_name: str):
        """Look up existing champion model."""
        models = aip.Model.list(
            filter=f'display_name="{model_name}"',
            location=location,
            project=project,
        )
        if len(models) == 0:
            return None
        elif len(models) == 1:
            return models[0]
        else:
            raise RuntimeError(f"Multiple models with name {model_name} found")

    def compare_models(champion_metrics, challenger_metrics, eval_lower_is_better):
        """Compare models by evaluating a primary metric."""
        logging.info(f"Comparing {eval_metric} of models")
        m_champ = champion_metrics[eval_metric]
        m_chall = challenger_metrics[eval_metric]
        logging.info(f"Champion={m_champ} Challenger={m_chall}")
        challenger_wins = (
            (m_chall &lt; m_champ) if eval_lower_is_better
            else (m_chall &gt; m_champ)
        )
        logging.info(f"{'Challenger' if challenger_wins else 'Champion'} wins!")
        return challenger_wins

    def upload_model_to_registry(is_default_version, parent_model_uri=None):
        """Upload model to Vertex AI Model Registry."""
        logging.info(f"Uploading model {model_name} (default: {is_default_version})")
        uploaded_model = aip.Model.upload(
            display_name=model_name,
            description=model_description,
            artifact_uri=model.uri,
            serving_container_image_uri=serving_container_image,
            parent_model=parent_model_uri,
            is_default_version=is_default_version,
        )
        # Populate output artifact for downstream components
        vertex_model.uri = (
            f"https://{location}-aiplatform.googleapis.com/v1/"
            f"{uploaded_model.versioned_resource_name}"
        )
        vertex_model.metadata["resourceName"] = (
            uploaded_model.versioned_resource_name
        )
        return uploaded_model
    # Parse challenger metrics
    with open(model_eval_metrics.path, "r") as f:
        challenger_metrics = json.load(f)
    # Look up champion model
    champion_model = lookup_model(model_name=model_name)
    challenger_wins = True
    parent_model_uri = None
    if champion_model is None:
        logging.info("No champion model found, uploading new model.")
    else:
        logging.info(
            f"Model version {champion_model.version_id} "
            "is being challenged by new model."
        )
        # Get champion evaluation metrics
        champion_eval = champion_model.get_model_evaluation()
        champion_metrics = MessageToDict(
            champion_eval._gca_resource._pb
        )["metrics"]
        # Compare champion vs challenger
        challenger_wins = compare_models(
            champion_metrics=champion_metrics,
            challenger_metrics=challenger_metrics,
            eval_lower_is_better=eval_lower_is_better,
        )
        parent_model_uri = champion_model.resource_name
    # Upload new model version
    # If challenger wins, it becomes the default version (champion)
    # If challenger loses, it's uploaded but not set as default
    # (stored under a new name so the `model` input artifact is not shadowed)
    uploaded_model = upload_model_to_registry(
        is_default_version=challenger_wins,
        parent_model_uri=parent_model_uri
    )
    # Import evaluation results to Vertex AI
    # (import_evaluation is a helper that is not shown in this simplified listing)
    import_evaluation(
        parsed_metrics=challenger_metrics,
        challenger_model=uploaded_model,
        evaluation_name=evaluation_name,
    )</code></pre><h3><strong>Key Design Decisions</strong></h3><p><strong>1. Champion/Challenger Pattern</strong></p><pre><code>is_default_version = challenger_wins</code></pre><p>This single line implements model governance:</p><ul><li><p><strong>Challenger wins</strong>: Becomes the new default (champion) model</p></li><li><p><strong>Challenger loses</strong>: Still uploaded (for audit trail) but not default</p></li></ul><p><strong>2. Metric-Based Comparison</strong></p><pre><code>challenger_wins = (m_chall &lt; m_champ) if eval_lower_is_better else (m_chall &gt; m_champ)</code></pre><p>Flexible comparison logic:</p><ul><li><p><strong>For losses (RMSE, MSE)</strong>: Lower is better</p></li><li><p><strong>For scores (accuracy, AUC)</strong>: Higher is better</p></li></ul><p><strong>3. Model Versioning</strong></p><pre><code>parent_model=parent_model_uri</code></pre><p>All model versions are linked to the same parent, creating a version history in the registry. You can always roll back to a previous version.</p><p><strong>4. Evaluation Import</strong> The component not only uploads the model but also imports its evaluation metrics into Vertex AI, making them visible in the UI. This is critical for:</p><ul><li><p>Comparing model versions visually</p></li><li><p>Audit trails</p></li><li><p>Debugging why a model was/wasn&#8217;t promoted</p></li></ul><h3><strong>The Three Scenarios</strong></h3><p>This component handles three scenarios elegantly:</p><p><strong>Scenario 1: First Model (No Champion)</strong></p><pre><code>No champion found &#8594; Upload as default (becomes champion)</code></pre><p><strong>Scenario 2: Challenger Wins</strong></p><pre><code>Challenger RMSE (2.5) &lt; Champion RMSE (3.1)
&#8594; Upload as default (new champion)</code></pre><p><strong>Scenario 3: Champion Wins</strong></p><pre><code>Challenger RMSE (3.5) &gt; Champion RMSE (3.1)
&#8594; Upload as non-default (champion unchanged)</code></pre><p>Production stays protected &#8212; degraded models never become default.</p><h2><strong>Component Deep Dive 4: model_batch_predict_op</strong></h2><p><strong>Purpose</strong>: Execute batch predictions and enable model monitoring for skew detection.</p><p><strong>Why it exists</strong>: This component combines two critical production needs &#8212; running predictions at scale and monitoring for model degradation.</p><h3><strong>Implementation Highlights</strong></h3><pre><code>from kfp.dsl import Input, Model, component, OutputPath
from typing import List, NamedTuple

@component(
    base_image="python:3.10",
    packages_to_install=[
        "google-cloud-pipeline-components==2.14.1",
        "google-cloud-aiplatform==1.55.0",
    ],
)
def model_batch_predict_op(
    model: Input[Model],
    gcp_resources: OutputPath(str),
    job_display_name: str,
    location: str,
    project: str,
    source_uri: str,
    destination_uri: str,
    source_format: str,
    destination_format: str,
    machine_type: str = "n1-standard-2",
    starting_replica_count: int = 1,
    max_replica_count: int = 1,
    monitoring_training_dataset: dict = None,
    monitoring_alert_email_addresses: List[str] = None,
    monitoring_skew_config: dict = None,
) -&gt; NamedTuple("Outputs", [("gcp_resources", str)]):
    """Execute batch prediction with optional monitoring."""
    import logging
    import time
    from google.cloud.aiplatform_v1beta1.services.job_service import JobServiceClient
    # JobState is needed by the polling loop at the end of this function
    from google.cloud.aiplatform_v1beta1.types import BatchPredictionJob, JobState
    from google.protobuf.json_format import ParseDict
    # Configure input/output based on format
    input_config = {"instancesFormat": source_format}
    output_config = {"predictionsFormat": destination_format}
    if source_format == "bigquery" and destination_format == "bigquery":
        input_config["bigquerySource"] = {"inputUri": source_uri}
        output_config["bigqueryDestination"] = {"outputUri": destination_uri}
    else:
        input_config["gcsSource"] = {"uris": [source_uri]}
        output_config["gcsDestination"] = {"outputUriPrefix": destination_uri}
    # Build batch prediction request
    message = {
        "displayName": job_display_name,
        "model": model.metadata["resourceName"],
        "inputConfig": input_config,
        "outputConfig": output_config,
        "dedicatedResources": {
            "machineSpec": {"machineType": machine_type},
            "startingReplicaCount": starting_replica_count,
            "maxReplicaCount": max_replica_count,
        },
    }
    # Add monitoring configuration if provided
    if monitoring_training_dataset and monitoring_skew_config:
        logging.info("Adding monitoring config to request")
        message["modelMonitoringConfig"] = {
            "alertConfig": {
                "emailAlertConfig": {
                    "userEmails": monitoring_alert_email_addresses or []
                },
                "enableLogging": True,
            },
            "objectiveConfigs": [{
                "trainingDataset": monitoring_training_dataset,
                "trainingPredictionSkewDetectionConfig": monitoring_skew_config,
            }],
        }
    # Submit batch prediction job
    request = ParseDict(message, BatchPredictionJob()._pb)
    client = JobServiceClient(
        client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
    )
    response = client.create_batch_prediction_job(
        parent=f"projects/{project}/locations/{location}",
        batch_prediction_job=request,
    )
    logging.info(f"Submitted batch prediction job: {response.name}")
    # Poll until job completes
    POLLING_INTERVAL = 20
    while True:
        job_status = client.get_batch_prediction_job(name=response.name)
        if job_status.state == JobState.JOB_STATE_SUCCEEDED:
            logging.info("Job completed successfully")
            break
        elif job_status.state in [JobState.JOB_STATE_FAILED,
                                   JobState.JOB_STATE_CANCELLED]:
            raise RuntimeError(f"Job failed with state: {job_status.state}")
        logging.info(f"Job in progress, waiting {POLLING_INTERVAL}s...")
        time.sleep(POLLING_INTERVAL)
    return (gcp_resources,)</code></pre><h3><strong>Key Design Decisions</strong></h3><p><strong>1. Flexible Input/Output Formats</strong></p><pre><code>if source_format == "bigquery" and destination_format == "bigquery":
    # BQ &#8594; BQ (most common for our use case)
else:
    # GCS &#8594; GCS</code></pre><p>The component supports both BigQuery and Cloud Storage, making it reusable for different scenarios.</p><p><strong>2. Optional Monitoring</strong></p><pre><code>if monitoring_training_dataset and monitoring_skew_config:
    message["modelMonitoringConfig"] = {...}</code></pre><p>Monitoring is optional: you can run predictions without it. But when enabled, Vertex AI automatically:</p><ul><li><p>Compares prediction data distribution to training data</p></li><li><p>Detects training-serving skew</p></li><li><p>Sends email alerts if thresholds are exceeded</p></li></ul><p><strong>3. Synchronous Execution with Polling</strong></p><pre><code>while True:
    job_status = client.get_batch_prediction_job(...)
    if job_status.state == JobState.JOB_STATE_SUCCEEDED:
        break</code></pre><p>The component <strong>waits</strong> for the batch prediction to complete. This is intentional:</p><ul><li><p>Downstream components need the predictions to exist</p></li><li><p>Failures are caught immediately, not discovered later</p></li><li><p>Pipeline DAG reflects actual dependencies</p></li></ul><p><strong>4. Resource Configuration</strong></p><pre><code>machine_type: str = "n1-standard-2",
starting_replica_count: int = 1,
max_replica_count: int = 1,</code></pre><p>Predictions can scale horizontally. For large datasets, increase replicas for parallel processing.</p><h2><strong>Component Design Patterns</strong></h2><p>After examining four components, let&#8217;s extract the patterns that make them production-ready.</p><h3><strong>Pattern 1: Explicit Input/Output Types</strong></h3><p><strong>Bad</strong>:</p><pre><code>def my_component(input_path: str) -&gt; str:
    # Returns a string path, no lineage tracking
    return "gs://bucket/output.csv"</code></pre><p><strong>Good</strong>:</p><pre><code>def my_component(
    input_data: Input[Dataset],
    output_data: Output[Dataset]
) -&gt; None:
    # KFP tracks lineage automatically
    process(input_data.path, output_data.path)</code></pre><p>Explicit types enable:</p><ul><li><p><strong>Lineage tracking</strong>: Vertex AI knows data flow</p></li><li><p><strong>Type safety</strong>: Can&#8217;t pass a Model where a Dataset is expected</p></li><li><p><strong>Automatic URI generation</strong>: <code>output_data.uri</code> is created for you</p></li></ul><h3><strong>Pattern 2: Metadata for Communication</strong></h3><p>Components pass data via artifacts, and metadata via artifact properties:</p><pre><code># Component A: Sets metadata
def create_table_op(bq_table: Output[Artifact]) -&gt; None:
    bq_table.metadata["projectId"] = "my-project"
    bq_table.metadata["datasetId"] = "my_dataset"
    bq_table.metadata["tableId"] = "my_table"

# Component B: Reads metadata
def extract_table_op(bq_table: Input[Artifact]) -&gt; None:
    project = bq_table.metadata["projectId"]  # Metadata preserved</code></pre><p>This is cleaner than passing 10 string parameters.</p><h3><strong>Pattern 3: Logging for Observability</strong></h3><p>Every component logs extensively:</p><pre><code>import logging

logging.info(f"Processing {len(data)} records")
logging.debug(f"Raw metrics: {raw_metrics}")
logging.warning("Model not found, using default")
logging.error(f"Validation failed: {error_msg}")</code></pre><p>These logs appear in:</p><ul><li><p>Cloud Logging (searchable, filterable)</p></li><li><p>Vertex AI Pipeline UI (per-step)</p></li><li><p>Component artifacts</p></li></ul><p><strong>Pro tip</strong>: Use structured logging with key-value pairs for easier searching:</p><pre><code>logging.info(f"model_upload status=success model_id={model_id} rmse={rmse}")</code></pre><h3><strong>Pattern 4: Graceful Error Handling</strong></h3><p>Components should fail fast and clearly:</p><p><strong>Bad</strong>:</p><pre><code>models = Model.list(...)
model = models[0]  # IndexError if no models!</code></pre><p><strong>Good</strong>:</p><pre><code>models = Model.list(...)
if len(models) == 0:
    if fail_on_model_not_found:
        raise RuntimeError(
            f"No model found with name {model_name}. "
            f"Expected at least one model in {project}/{location}."
        )
    else:
        logging.warning("No model found, continuing...")
        return None</code></pre><p>Clear error messages save hours of debugging.</p><h3><strong>Pattern 5: Configuration via Parameters</strong></h3><p>Never hardcode:</p><p><strong>Bad</strong>:</p><pre><code>def train_model_op(...):
    BATCH_SIZE = 32  # Hardcoded!
    LEARNING_RATE = 0.001  # Can&#8217;t change without editing code</code></pre><p><strong>Good</strong>:</p><pre><code>def train_model_op(
    batch_size: int = 32,
    learning_rate: float = 0.001
):
    # Configurable via pipeline parameters</code></pre><p>This makes components reusable across different experiments.</p><h2><strong>Testing Strategies for Components</strong></h2><p>Testing ML components requires different approaches than traditional software.</p><h3><strong>Level 1: Unit Tests</strong></h3><p>Test component logic in isolation by calling <code>component.python_func</code>:</p><pre><code>import components

# Extract the underlying Python function
upload_model = components.upload_best_model_op.python_func
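

# Hypothetical pytest fixture (not shown in the original listing): a minimal
# sketch that patches the Vertex AI Model class, mirroring the @patch target
# used in the mocking section, so the test below never calls GCP.
# It assumes pytest is the test runner and that kfp is installed.
from unittest import mock

import pytest
from kfp.dsl import Model  # artifact type used to build the test input


@pytest.fixture
def mock_model_class():
    with mock.patch("google.cloud.aiplatform.Model") as mocked:
        yield mocked

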
def test_model_upload_no_champion(mock_model_class, tmp_path):
    """Test uploading first model (no champion exists)."""
    # Mock Vertex AI Model.list to return no models
    mock_model_class.list.return_value = []
    # Create test inputs
    model = Model(uri="gs://bucket/model")
    metrics_file = tmp_path / "metrics.json"
    metrics_file.write_text('{"problemType": "regression", "rmse": 2.5}')
    # Call component function
    upload_model(
        model=model,
        model_eval_metrics=metrics_file,
        eval_metric="rmse",
        eval_lower_is_better=True,
        model_name="test-model",
        # ... other params
    )
    # Assert model was uploaded as default
    mock_model_class.upload.assert_called_once_with(
        display_name="test-model",
        is_default_version=True,  # First model becomes champion
        # ...
    )</code></pre><p><strong>Benefits</strong>:</p><ul><li><p>Fast (no actual GCP calls)</p></li><li><p>Cheap (no cloud resources)</p></li><li><p>Isolated (test one component at a time)</p></li></ul><h3><strong>Level 2: Integration Tests</strong></h3><p>Test component compilation (validates KFP syntax):</p><pre><code>def test_component_compiles():
    """Ensure component definition is valid."""
    from kfp import compiler
    compiler.Compiler().compile(
        pipeline_func=my_pipeline,
        package_path="pipeline.yaml"
    )
    # If this doesn&#8217;t raise, component syntax is valid</code></pre><h3><strong>Level 3: End-to-End Tests</strong></h3><p>Run actual pipeline in a dev environment:</p><pre><code># Build container
make build
# Run full training pipeline in dev
make e2e-tests pipeline=training
# Verify outputs
# - Check GCS for artifacts
# - Check Model Registry for uploaded model
# - Check BigQuery for prediction results</code></pre><p>E2E tests catch:</p><ul><li><p>IAM permission issues</p></li><li><p>API enablement problems</p></li><li><p>Resource quota limits</p></li><li><p>Real data issues</p></li></ul><p><strong>Run E2E tests on every PR</strong> to catch breaking changes before they reach production.</p><h2><strong>Mocking GCP Services</strong></h2><p>Use <code>unittest.mock</code> to avoid hitting real GCP APIs:</p><pre><code>from unittest.mock import Mock, patch

@patch("google.cloud.aiplatform.Model")
@patch("google.cloud.aiplatform_v1.ModelServiceClient")
def test_upload_best_model(mock_model_service, mock_model_class):
    # Mock returns
    mock_model_class.list.return_value = []
    mock_model_class.upload.return_value = Mock(
        versioned_resource_name="models/123/versions/1"
    )
    # Test component
    # ...</code></pre><p>This is essential for:</p><ul><li><p>Fast test execution</p></li><li><p>Testing error conditions</p></li><li><p>CI/CD without GCP credentials</p></li></ul><h2><strong>Best Practices for Component Development</strong></h2><h3><strong>1. Single Responsibility Principle</strong></h3><p>Each component should do <strong>one thing</strong>:</p><p><strong>Bad</strong>: <code>process_and_train_op</code> (does two things)</p><p><strong>Good</strong>:</p><ul><li><p><code>process_data_op</code> (preprocessing only)</p></li><li><p><code>train_model_op</code> (training only)</p></li></ul><p>Smaller components are easier to:</p><ul><li><p>Test</p></li><li><p>Debug</p></li><li><p>Reuse</p></li><li><p>Understand</p></li></ul><h3><strong>2. Idempotency</strong></h3><p>Components should produce the same output given the same input:</p><p><strong>Bad</strong>:</p><pre><code>timestamp = time.time()  # Different every run!
output_path = f"gs://bucket/data_{timestamp}.csv"</code></pre><p><strong>Good</strong>:</p><pre><code># Use pipeline-provided timestamp
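# Minimal sketch, under an assumption: `pipeline_timestamp` is a hypothetical
# pipeline parameter fixed once per run (the value below is illustrative only).
pipeline_timestamp = "20260217T151659"

def build_output_path(timestamp):
    # Pure function of its input: the same timestamp always yields the same path
    return f"gs://bucket/data_{timestamp}.csv"

# A retry of the same run resolves to the same object, so it can be cached
first_attempt = build_output_path(pipeline_timestamp)
retry_attempt = build_output_path(pipeline_timestamp)
assert first_attempt == retry_attempt
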
output_path = f"gs://bucket/data_{pipeline_timestamp}.csv"</code></pre><p>Idempotency enables:</p><ul><li><p>Pipeline retries</p></li><li><p>Reproducible results</p></li><li><p>Caching</p></li></ul><h3><strong>3. Avoid External State</strong></h3><p>Components should be <strong>self-contained</strong>:</p><p><strong>Bad</strong>:</p><pre><code># Reads from external config file
with open("/config/settings.yaml") as f:
    config = yaml.safe_load(f)  # Where does this file come from?</code></pre><p><strong>Good</strong>:</p><pre><code># Configuration passed as parameters
def my_component(batch_size: int, learning_rate: float):
    # Everything needed is in the function signature
    ...</code></pre><h3><strong>4. Version Dependencies Explicitly</strong></h3><pre><code>@component(
    base_image="python:3.10.14",  # Exact version
    packages_to_install=[
        "google-cloud-aiplatform==1.55.0",  # Exact version
        "pandas==2.0.3",  # Exact version
    ],
)</code></pre><p>Exact versions ensure:</p><ul><li><p>Reproducible builds</p></li><li><p>No surprise breakages from dependency updates</p></li><li><p>Clear dependency audit trail</p></li></ul><h3><strong>5. Document Inputs and Outputs</strong></h3><pre><code>def my_component(
    input_data: Input[Dataset],
    output_data: Output[Dataset],
    threshold: float = 0.5,
) -&gt; None:
    """
    Process input data and filter by threshold.

    Args:
        input_data: Dataset containing features to process.
        threshold: Minimum score to include in output (default: 0.5).
    Outputs:
        output_data: Filtered dataset written to GCS.
    """
</code></pre><p>Good documentation helps:</p><ul><li><p>Other developers use your components</p></li><li><p>Future you remember what it does</p></li><li><p>Auto-generated pipeline documentation</p></li></ul><h2><strong>Composing Components into Pipelines</strong></h2><p>Components are the building blocks; pipelines are the assemblies.</p><h3><strong>Simple Pipeline Example</strong></h3><pre><code>from kfp import dsl

@dsl.pipeline(
    name="training-pipeline",
    description="Train taxi fare prediction model"
)
def training_pipeline(
    project: str,
    location: str,
    model_name: str = "chicago-taxi-fare",
):
    # Step 1: Preprocess data
    preprocess_task = preprocess_data_bq_op(
        project=project,
        location=location,
    )
    # Step 2: Extract to GCS
    extract_task = extract_table_to_gcs_op(
        bq_table=preprocess_task.outputs["bq_table"],
        location=location,
    )
    # Step 3: Train model
    train_task = train_model_op(
        training_data=extract_task.outputs["dataset"],
        # ...
    )
    # Step 4: Evaluate model
    eval_task = evaluate_model_op(
        model=train_task.outputs["model"],
        test_data=extract_task.outputs["test_dataset"],
    )
    # Step 5: Upload if better than champion
    upload_task = upload_best_model_op(
        model=train_task.outputs["model"],
        model_eval_metrics=eval_task.outputs["metrics"],
        eval_metric="rmse",
        eval_lower_is_better=True,
        model_name=model_name,
        # ...
    )</code></pre><p><strong>Notice</strong>:</p><ul><li><p>Each task is a component invocation</p></li><li><p>Outputs of one task feed inputs of another</p></li><li><p>Pipeline is a Python function decorated with <code>@dsl.pipeline</code></p></li></ul><p>The next article will dive deep into this training pipeline.</p><h2><strong>Debugging Components</strong></h2><p>When components fail, here&#8217;s how to debug:</p><h3><strong>1. Check Component Logs</strong></h3><p>In Vertex AI Pipeline UI:</p><ol><li><p>Click on failed component</p></li><li><p>View &#8220;Logs&#8221; tab</p></li><li><p>Filter by severity (ERROR, WARNING)</p></li></ol><h3><strong>2. Examine Artifacts</strong></h3><p>Components write artifacts to GCS:</p><pre><code># List artifacts from a pipeline run
gsutil ls gs://my-project-pl-root/artifacts/training-pipeline-123/

# Read a specific artifact
gsutil cat gs://my-project-pl-root/.../metrics.json</code></pre><h3><strong>3. Test Locally</strong></h3><pre><code># Import component function
from components import extract_table_to_gcs_op

# Call directly (not as a KFP component)
extract_func = extract_table_to_gcs_op.python_func
# Test with local mock data
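# Hypothetical stand-ins, not from the original article: plain Mock objects that
# expose the attributes a component typically reads from its artifacts.
# The metadata key and paths below are assumptions; match them to your component.
from unittest.mock import Mock

mock_table = Mock(metadata={"table_id": "my-project.my_dataset.my_table"})
mock_dataset = Mock(uri="gs://bucket/tmp/dataset", path="/tmp/dataset")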
extract_func(
    bq_table=mock_table,
    dataset=mock_dataset,
    location="US"
)</code></pre><h3><strong>4. Increase Logging</strong></h3><p>Add more logging statements temporarily:</p><pre><code>logging.info(f"DEBUG: input_data.uri = {input_data.uri}")
logging.info(f"DEBUG: metadata = {input_data.metadata}")</code></pre><p>Redeploy and rerun to see additional context.</p><h2><strong>Conclusion</strong></h2><p>Reusable Kubeflow Pipeline components are the foundation of maintainable MLOps systems. The patterns we&#8217;ve explored are:</p><ul><li><p><strong>Single responsibility</strong>: Each component does one thing well</p></li><li><p><strong>Explicit types</strong>: Inputs and outputs are strongly typed</p></li><li><p><strong>Metadata communication</strong>: Pass rich information via artifacts</p></li><li><p><strong>Extensive logging</strong>: Make debugging possible</p></li><li><p><strong>Comprehensive testing</strong>: Unit, integration, and E2E tests</p></li></ul><p>By following them, you can build a component library that makes creating new ML pipelines fast, reliable, and enjoyable.</p><p>Our 8 components cover the essentials:</p><ul><li><p>Data movement (BigQuery &#8596; GCS)</p></li><li><p>Model registry operations (lookup, upload)</p></li><li><p>Training configuration (worker specs, hyperparameters)</p></li><li><p>Inference (batch predictions with monitoring)</p></li></ul><p>In the next article, we&#8217;ll combine these components into a complete production training pipeline that handles everything from raw data to a deployed model in Vertex AI Model Registry.</p><p><strong>Key Takeaways:</strong></p><ul><li><p>KFP components are containerized Python functions with explicit inputs/outputs</p></li><li><p>Components should be small, focused, and reusable</p></li><li><p>Strong typing enables lineage tracking and type safety</p></li><li><p>Comprehensive testing (unit + integration + E2E) catches issues before they reach production</p></li><li><p>Champion/Challenger pattern ensures only better models reach production</p></li><li><p>Good logging makes debugging far easier</p></li></ul><p><strong>Next in Series</strong>: Designing Production ML Pipelines: Training Pipeline Deep Dive</p><p><strong>GitHub Repository</strong>: 
<a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>Component Code</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/components/src/components">components/src/components/</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/components/tests">Component tests</a></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 3: Infrastructure as Code for ML (Terraform + Vertex AI)]]></title><description><![CDATA[Part 3 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-06c</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-06c</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 10:12:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a> (You are here)</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>In the <a href="https://medium.com/google-cloud/production-ready-mlops-on-gcp-part-3-infrastructure-as-code-for-ml-terraform-vertex-ai-86c48fa40bbd#">previous article</a>, we explored the overall architecture of a production-ready MLOps system on GCP. 
Now comes the critical question: how do you actually provision all of this infrastructure reliably across dev, test, and production environments?</p><p>If you&#8217;ve ever manually clicked through the Google Cloud Console to set up Vertex AI pipelines, BigQuery datasets, service accounts, and IAM roles, you know the pain. It&#8217;s error-prone, hard to replicate, and impossible to version control. What worked in dev mysteriously breaks in prod. You forget a critical IAM permission. A teammate can&#8217;t reproduce your setup.</p><p>This is where <strong>Infrastructure as Code (IaC)</strong> transforms MLOps from fragile to rock-solid.</p><p>In this article, we&#8217;ll dive deep into:</p><ul><li><p>Why Infrastructure as Code is non-negotiable for MLOps</p></li><li><p>How to structure Terraform modules for ML workloads</p></li><li><p>Setting up Vertex AI infrastructure across multiple environments</p></li><li><p>IAM best practices for secure, least-privilege ML pipelines</p></li><li><p>Managing Terraform state and deployment workflows</p></li></ul><p>Let&#8217;s build infrastructure that&#8217;s as version-controlled and testable as your ML code.</p><h2><strong>Why Infrastructure as Code for MLOps?</strong></h2><p>Before we dive into code, let&#8217;s address the elephant in the room: why bother with IaC when you can just create resources in the Cloud Console?</p><h3><strong>The Manual Approach Doesn&#8217;t Scale</strong></h3><p>Imagine this scenario:</p><ol><li><p>You manually set up Vertex AI in your dev project</p></li><li><p>Three months later, you need to replicate it in prod</p></li><li><p>You can&#8217;t remember all the steps</p></li><li><p>IAM roles are different between environments</p></li><li><p>The prod deployment fails mysteriously</p></li><li><p>You spend days debugging what should have been a 10-minute deployment</p></li></ol><p>With Infrastructure as Code:</p><ol><li><p>You define your infrastructure once in Terraform</p></li><li><p>You apply it 
to dev: <code>terraform apply</code></p></li><li><p>You apply the same code to prod: <code>terraform apply</code></p></li><li><p>Everything is identical and reproducible</p></li><li><p>Changes are tracked in Git with full audit history</p></li></ol><h3><strong>Key Benefits for MLOps</strong></h3><p><strong>1. Reproducibility</strong> Every environment (dev/test/prod) uses the exact same code. No configuration drift.</p><p><strong>2. Version Control</strong> Infrastructure changes go through pull requests, just like application code. You can see who changed what and when.</p><p><strong>3. Environment Parity</strong> Test environments mirror production exactly, reducing &#8220;works on my machine&#8221; issues.</p><p><strong>4. Disaster Recovery</strong> If a project gets accidentally deleted or corrupted, you can recreate it in minutes from code.</p><p><strong>5. Documentation</strong> Your Terraform code is living documentation of your infrastructure.</p><p><strong>6. Collaboration</strong> Team members can review and understand infrastructure changes before they&#8217;re deployed.</p><p><strong>7. Testing</strong> Infrastructure changes can be previewed with <code>terraform plan</code> before applying.</p><h2><strong>Our Terraform Architecture</strong></h2><p>Our infrastructure follows a <strong>modular design</strong> with clear separation between reusable modules and environment-specific configurations.</p><h3><strong>Directory Structure</strong></h3><pre><code>terraform/
&#9500;&#9472;&#9472; environments/           # Environment-specific configurations
&#9474;   &#9500;&#9472;&#9472; dev/
&#9474;   &#9474;   &#9500;&#9472;&#9472; main.tf        # Dev environment setup
&#9474;   &#9474;   &#9500;&#9472;&#9472; variables.tf   # Dev-specific variables
&#9474;   &#9474;   &#9500;&#9472;&#9472; auto.tfvars    # Dev variable values
&#9474;   &#9474;   &#9492;&#9472;&#9472; backend.tf     # State backend configuration
&#9474;   &#9500;&#9472;&#9472; test/
&#9474;   &#9474;   &#9492;&#9472;&#9472; ...            # Same structure as dev
&#9474;   &#9492;&#9472;&#9472; prod/
&#9474;       &#9492;&#9472;&#9472; ...            # Same structure as dev
&#9474;
&#9492;&#9472;&#9472; modules/               # Reusable Terraform modules
    &#9500;&#9472;&#9472; vertex_deployment/  # Core Vertex AI infrastructure
    &#9474;   &#9500;&#9472;&#9472; main.tf        # Resource definitions
    &#9474;   &#9500;&#9472;&#9472; variables.tf   # Module variables
    &#9474;   &#9500;&#9472;&#9472; iam.tf         # IAM roles and permissions
    &#9474;   &#9500;&#9472;&#9472; outputs.tf     # Exported values
    &#9474;   &#9492;&#9472;&#9472; versions.tf    # Provider versions
    &#9474;
    &#9492;&#9472;&#9472; cloudrunfunction/   # Cloud Run Function for triggers
        &#9492;&#9472;&#9472; ...</code></pre><p><strong>Key Design Principles:</strong></p><ol><li><p><strong>DRY (Don&#8217;t Repeat Yourself)</strong>: Common infrastructure is defined once in modules</p></li><li><p><strong>Environment Isolation</strong>: Each environment has its own state and configuration</p></li><li><p><strong>Separation of Concerns</strong>: Modules handle specific capabilities (Vertex, Cloud Functions, etc.)</p></li><li><p><strong>Consistent Interface</strong>: All environments use the same module interface</p></li></ol><h2><strong>The vertex_deployment Module: Core Infrastructure</strong></h2><p>The <code>vertex_deployment</code> module is the heart of our MLOps infrastructure.</p><p>The following diagram shows all resources provisioned by our Terraform modules:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YIE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_YIE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 424w, https://substackcdn.com/image/fetch/$s_!_YIE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 848w, https://substackcdn.com/image/fetch/$s_!_YIE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_YIE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_YIE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png" width="784" height="185" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/577dbf6b-879a-4915-95d0-56393c300b38_784x185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:185,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_YIE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 424w, https://substackcdn.com/image/fetch/$s_!_YIE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 848w, https://substackcdn.com/image/fetch/$s_!_YIE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_YIE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577dbf6b-879a-4915-95d0-56393c300b38_784x185.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let&#8217;s break down what it provisions.</p><h3><strong>1. Google Cloud APIs</strong></h3><p>First, we enable all required GCP services:</p><pre><code>resource "google_project_service" "gcp_services" {
  for_each                   = toset(var.gcp_service_list)
  project                    = var.project_id
  service                    = each.key
  disable_on_destroy         = var.disable_services_on_destroy
  disable_dependent_services = var.disable_dependent_services
}</code></pre><p><strong>Services enabled (17 total):</strong></p><ul><li><p><code>aiplatform.googleapis.com</code> - Vertex AI core</p></li><li><p><code>artifactregistry.googleapis.com</code> - Docker images and pipelines</p></li><li><p><code>bigquery.googleapis.com</code> - Data warehouse</p></li><li><p><code>cloudbuild.googleapis.com</code> - CI/CD</p></li><li><p><code>cloudfunctions.googleapis.com</code> - Event triggers</p></li><li><p><code>cloudscheduler.googleapis.com</code> - Scheduled runs</p></li><li><p><code>pubsub.googleapis.com</code> - Event messaging</p></li><li><p><code>iam.googleapis.com</code> - Access control</p></li><li><p>And 9 more supporting services&#8230;</p></li></ul><p><strong>Why this matters:</strong> Forgetting to enable a single API can cause cryptic failures. By declaring all dependencies in code, we ensure consistent setup every time.</p><h3><strong>2. Service Accounts: Identity and Access</strong></h3><p>We create two dedicated service accounts with minimal permissions:</p><pre><code># Service account for Vertex AI Pipelines
resource "google_service_account" "pipelines_sa" {
  project      = var.project_id
  account_id   = "vertex-pipelines"
  display_name = "Vertex Pipelines Service Account"
  depends_on   = [google_project_service.gcp_services]
}

# Service account for Cloud Run Function (pipeline trigger)
resource "google_service_account" "vertex_cloudrunfunction_sa" {
  project      = var.project_id
  account_id   = "vertex-cloudrunfunction-sa"
  display_name = "Cloud Run Function Service Account"
  depends_on   = [google_project_service.gcp_services]
}</code></pre><p><strong>Security principle:</strong> Each component has its own identity with only the permissions it needs. If the Cloud Run Function is compromised, it can&#8217;t access resources meant only for pipelines.</p><h3><strong>3. Cloud Storage: Artifact Storage</strong></h3><p>We provision two GCS buckets with security best practices:</p><pre><code># Pipeline artifacts and outputs
resource "google_storage_bucket" "pipeline_root_bucket" {
  name                        = "${var.project_id}-pl-root"
  location                    = var.region
  project                     = var.project_id
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  depends_on                  = [google_project_service.gcp_services]
}

# Cloud Run Function source code
resource "google_storage_bucket" "gcf_source_bucket" {
  name                        = "${var.project_id}-gcf-source"
  location                    = local.cloudrunfunction_region
  project                     = var.project_id
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"
  depends_on                  = [google_project_service.gcp_services]
}</code></pre><p><strong>Security features:</strong></p><ul><li><p><code>uniform_bucket_level_access</code>: Consistent permissions using IAM only (no legacy ACLs)</p></li><li><p><code>public_access_prevention</code>: Blocks any attempt to make objects public</p></li><li><p>Region-specific: Data stays in your preferred location</p></li></ul><h3><strong>4. Vertex AI Metadata Store</strong></h3><p>The metadata store provides lineage tracking for all ML artifacts:</p><pre><code>resource "google_vertex_ai_metadata_store" "default_metadata_store" {
  provider    = google-beta
  name        = "default"
  description = "Default metadata store"
  project     = var.project_id
  region      = var.region
  depends_on  = [google_project_service.gcp_services]
}</code></pre><p>This enables:</p><ul><li><p><strong>Lineage tracking</strong>: See which data produced which models</p></li><li><p><strong>Experiment tracking</strong>: Compare training runs and hyperparameters</p></li><li><p><strong>Reproducibility</strong>: Trace any prediction back to its training data</p></li><li><p><strong>Compliance</strong>: Audit trails for regulatory requirements</p></li></ul><h3><strong>5. Artifact Registry: Docker Images and Pipelines</strong></h3><p>We create two repositories with different formats:</p><pre><code># Docker container images (training containers)
resource "google_artifact_registry_repository" "mlops_docker_repo" {
  repository_id = "mlops-docker-repo"
  description   = "Container images for model training"
  project       = var.project_id
  location      = var.region
  format        = "DOCKER"
  depends_on    = [google_project_service.gcp_services]
}

# Kubeflow Pipeline definitions
resource "google_artifact_registry_repository" "mlops_pipeline_repo" {
  repository_id = "mlops-pipeline-repo"
  description   = "KFP repository for Vertex Pipelines"
  project       = var.project_id
  location      = var.region
  format        = "KFP"
  depends_on    = [google_project_service.gcp_services]
}</code></pre><p><strong>Why separate repositories?</strong></p><ul><li><p>Docker and KFP formats have different versioning and metadata needs</p></li><li><p>Separate permissions: training job builders need Docker access, pipeline deployers need KFP access</p></li><li><p>Cleaner organization and lifecycle management</p></li></ul><h3><strong>6. Pub/Sub: Event-Driven Orchestration</strong></h3><p>For asynchronous pipeline notifications:</p><pre><code>resource "google_pubsub_topic" "pipeline_completion" {
  name       = "pipeline-completion"
  project    = var.project_id
  depends_on = [google_project_service.gcp_services]
}

resource "google_pubsub_subscription" "pipeline_completion_subscription" {
  name    = "pipeline-completion-subscription"
  topic   = google_pubsub_topic.pipeline_completion.id
  project = var.project_id
  push_config {
    push_endpoint = module.cloudrunfunction.function_uri
  }
}</code></pre><p><strong>Event flow:</strong></p><ol><li><p>Pipeline completes (success or failure)</p></li><li><p>Vertex AI publishes to <code>pipeline-completion</code> topic</p></li><li><p>Pub/Sub pushes notification to Cloud Run Function</p></li><li><p>Function can trigger dependent pipelines (e.g., training completes &#8594; run prediction)</p></li></ol><h2><strong>The cloudrunfunction Module: Event-Driven Triggers</strong></h2><p>While scheduled pipelines run periodically, the <code>cloudrunfunction</code> module enables <strong>event-driven execution</strong> triggered by new data in BigQuery.</p><h3><strong>Module Overview</strong></h3><pre><code># terraform/environments/prod/main.tf
module "cloudrunfunction" {
  source = "../../modules/cloudrunfunction"

  project_id          = var.project_id
  region              = var.region
  crf_service_account = module.vertex_deployment.cloudrunfunction_sa_email
  gcf_source_bucket   = module.vertex_deployment.gcf_source_bucket

  # Pipeline configuration (JSON-encoded)
  pipeline_config = {
    type                     = "training"
    display_name             = "event-driven-training"
    bq_location              = var.bq_location
    use_latest_data          = true
    timestamp                = ""
    training_template_path   = "https://${var.region}-kfp.pkg.dev/${var.project_id}/mlops-pipeline-repo/taxifare-training-pipeline/latest"
    prediction_template_path = "https://${var.region}-kfp.pkg.dev/${var.project_id}/mlops-pipeline-repo/taxifare-batch-prediction-pipeline/latest"
    pubsub_topic_name        = "training-pipeline-complete"
  }

  # BigQuery trigger: fires when new data is inserted
  dataset_id = "chicago_taxi_trips"
  table_id   = "taxi_trips"
}</code></pre><h3><strong>How It Works</strong></h3><ol><li><p><strong>BigQuery Audit Logs</strong> generate events when data is inserted</p></li><li><p><strong>Cloud Run Function</strong> is triggered by the audit log event</p></li><li><p><strong>Function reads</strong> <code>PIPELINE_CONFIG</code> from environment variables</p></li><li><p><strong>Resolves template URI</strong> from Artifact Registry (tag &#8594; digest)</p></li><li><p><strong>Submits pipeline job</strong> to Vertex AI</p></li><li><p><strong>Listens on Pub/Sub</strong> for training completion</p></li><li><p><strong>Triggers prediction</strong> pipeline automatically</p></li></ol><p>Function code location: <code>terraform/modules/cloudrunfunction/src/main.py</code></p><h3><strong>Key Configuration</strong></h3><p><strong>Trigger Configuration</strong>:</p><pre><code>event_type = "google.cloud.audit.log.v1.written"
methodName = "google.cloud.bigquery.v2.JobService.InsertJob"
resourceName = "projects/.../datasets/{dataset_id}/tables/{table_id}"</code></pre><p><strong>Environment Variables</strong>:</p><ul><li><p><code>PIPELINE_CONFIG</code>: JSON with pipeline template paths and parameters</p></li><li><p>Standard Vertex AI variables (project, location, service account)</p></li></ul><p>This provides an alternative to scheduled runs, enabling <strong>continuous training</strong> as new data arrives.</p><h2><strong>IAM: Security by Design</strong></h2><p>IAM configuration is where many MLOps projects go wrong. Too permissive, and you&#8217;ve created security holes. Too restrictive, and pipelines fail with cryptic permission errors.</p><p>Our IAM strategy follows <strong>least privilege</strong>: each service account gets only the permissions it needs, nothing more.</p><h3><strong>Vertex Pipelines Service Account Permissions</strong></h3><pre><code># Project-level roles for Vertex Pipelines SA
resource "google_project_iam_member" "pipelines_sa_project_roles" {
  for_each = toset(var.pipelines_sa_project_roles)
  project  = var.project_id
  role     = each.key
  member   = "serviceAccount:${google_service_account.pipelines_sa.email}"
}

# Default roles:
# - roles/aiplatform.user          # Submit Vertex AI jobs
# - roles/logging.logWriter        # Write logs
# - roles/bigquery.dataEditor      # Read/write BigQuery
# - roles/bigquery.jobUser         # Run BigQuery jobs
# - roles/artifactregistry.reader  # Pull Docker images</code></pre><p><strong>Bucket-specific permissions:</strong></p><pre><code>resource "google_storage_bucket_iam_member" "pipelines_sa_pipeline_root_bucket_iam" {
  for_each = toset([
    "roles/storage.objectAdmin",
    "roles/storage.legacyBucketReader",
  ])
  bucket = google_storage_bucket.pipeline_root_bucket.name
  member = "serviceAccount:${google_service_account.pipelines_sa.email}"
  role   = each.value
}</code></pre><p><strong>Why both roles?</strong></p><ul><li><p><code>objectAdmin</code>: Create, read, update, delete objects in the bucket</p></li><li><p><code>legacyBucketReader</code>: List bucket contents (required for Vertex AI)</p></li></ul><h3><strong>Cloud Run Function Service Account Permissions</strong></h3><p>The function needs to trigger pipelines and access compiled pipeline definitions:</p><pre><code># Allow function SA to impersonate pipelines SA
resource "google_service_account_iam_member" "cloudrunfunction_sa_can_use_pipelines_sa" {
  service_account_id = google_service_account.pipelines_sa.name
  role               = "roles/iam.serviceAccountUser"
  member             = "serviceAccount:${google_service_account.vertex_cloudrunfunction_sa.email}"
}

# Access to Artifact Registry for compiled pipelines
resource "google_artifact_registry_repository_iam_member" "cloudrunfunction_sa_can_access_ar" {
  project    = google_artifact_registry_repository.mlops_pipeline_repo.project
  location   = google_artifact_registry_repository.mlops_pipeline_repo.location
  repository = google_artifact_registry_repository.mlops_pipeline_repo.name
  role       = "roles/artifactregistry.reader"
  member     = "serviceAccount:${google_service_account.vertex_cloudrunfunction_sa.email}"
}</code></pre><p><strong>Permission chain:</strong></p><ol><li><p>Cloud Run Function executes with <code>vertex_cloudrunfunction_sa</code></p></li><li><p>To submit a pipeline, it needs to use <code>pipelines_sa</code> (service account impersonation)</p></li><li><p>It needs to read the compiled pipeline from Artifact Registry</p></li></ol><p>This separation ensures the function can trigger pipelines but can&#8217;t directly access pipeline data or training artifacts.</p><h2><strong>Environment Configuration</strong></h2><p>Each environment (dev/test/prod) has its own configuration but uses the same module:</p><h3><strong>Dev Environment (terraform/environments/dev/main.tf)</strong></h3><pre><code>terraform {
  required_version = &#8220;&gt;= 1.9&#8221;</code></pre><h3><strong>Environment-Specific Variables (auto.tfvars)</strong></h3><pre><code># Dev environment
project_id = "my-mlops-dev"
region     = "us-central1"
dataset_id = "chicago_taxi_dev"
table_id   = "taxi_trips"

# Test environment
project_id = "my-mlops-test"
region     = "us-central1"
dataset_id = "chicago_taxi_test"
table_id   = "taxi_trips"

# Prod environment
project_id = "my-mlops-prod"
region     = "us-central1"
dataset_id = "chicago_taxi_prod"
table_id   = "taxi_trips"</code></pre><p><strong>Same module, different values.</strong> This ensures environment parity while allowing environment-specific configuration.</p><h2><strong>Terraform State Management</strong></h2><p>Terraform state is critical: it&#8217;s the source of truth for what&#8217;s deployed. Losing state means losing track of your infrastructure.</p><h3><strong>Remote State in GCS</strong></h3><p>Each environment has its own state bucket:</p><pre><code># Create state bucket for dev environment
export DEV_PROJECT_ID=my-mlops-dev
export DEV_LOCATION=us-central1

gsutil mb -p $DEV_PROJECT_ID \
  -l $DEV_LOCATION \
  gs://${DEV_PROJECT_ID}-tfstate
# Enable versioning for state recovery
gsutil versioning set on gs://${DEV_PROJECT_ID}-tfstate</code></pre><h3><strong>Backend Configuration</strong></h3><p>During Terraform initialization, specify the backend:</p><pre><code>cd terraform/environments/dev

terraform init \
  -backend-config="bucket=${DEV_PROJECT_ID}-tfstate" \
  -backend-config="prefix=terraform/state"</code></pre><p><strong>Benefits:</strong></p><ul><li><p><strong>Shared state</strong>: Team members see the same infrastructure state</p></li><li><p><strong>Locking</strong>: Prevents concurrent modifications</p></li><li><p><strong>Versioning</strong>: Recover from accidental deletions or bad changes</p></li><li><p><strong>Encryption</strong>: State is encrypted at rest in GCS</p></li></ul><h2><strong>Deployment Workflow</strong></h2><p>Here&#8217;s how infrastructure changes flow through environments:</p><h3><strong>1. Local Development</strong></h3><pre><code># Make infrastructure changes
cd terraform/modules/vertex_deployment
# Edit main.tf, iam.tf, etc.

# Validate syntax
terraform fmt -check
terraform validate</code></pre><h3><strong>2. Pull Request</strong></h3><p>Open a PR with your changes. Cloud Build automatically runs:</p><pre><code># cloudbuild/terraform-plan.yaml
steps:
  - name: 'hashicorp/terraform'
    args:
      - init
      - -backend-config=bucket=${_TFSTATE_BUCKET}

  - name: 'hashicorp/terraform'
    args:
      - plan
      - -out=tfplan</code></pre><p><strong>Review the plan output:</strong></p><ul><li><p>Resources to be added (green +)</p></li><li><p>Resources to be modified (yellow ~)</p></li><li><p>Resources to be destroyed (red -)</p></li></ul><p>This preview helps catch unintended changes before they reach production.</p><h3><strong>3. Merge to Main</strong></h3><p>When the PR merges, Cloud Build automatically applies changes:</p><pre><code># cloudbuild/terraform-apply.yaml
steps:
  - name: 'hashicorp/terraform'
    args:
      - init
      - -backend-config=bucket=${_TFSTATE_BUCKET}

  - name: 'hashicorp/terraform'
    args:
      - apply
      - -auto-approve</code></pre><p><strong>Deployment order:</strong></p><ol><li><p>Dev environment (lowest risk)</p></li><li><p>Test environment (validate before prod)</p></li><li><p>Prod environment (final deployment)</p></li></ol><p>Separate Cloud Build triggers ensure controlled, sequential rollout.</p><h2><strong>Best Practices We Follow</strong></h2><h3><strong>1. Explicit Dependencies</strong></h3><p>Use <code>depends_on</code> when a resource has a hidden dependency that Terraform cannot infer from references in its arguments:</p><pre><code>resource "google_storage_bucket" "pipeline_root_bucket" {
  name       = "${var.project_id}-pl-root"
  # ... other config ...
  depends_on = [google_project_service.gcp_services]
}</code></pre><p>This ensures APIs are enabled before creating resources that use them.</p><h3><strong>2. Parameterized Modules</strong></h3><p>Use variables for everything that might change:</p><pre><code>variable &#8220;region&#8221; {
  description = &#8220;GCP region for resources&#8221;
  type        = string
}

variable &#8220;project_id&#8221; {
  description = &#8220;GCP project ID&#8221;
  type        = string
}</code></pre><p><strong>Never hardcode</strong> project IDs, regions, or environment-specific values in modules.</p><h3><strong>3. Resource Naming Conventions</strong></h3><p>Use consistent, predictable naming:</p><pre><code>name = &#8220;${var.project_id}-pl-root&#8221;  # GCS bucket
account_id = &#8220;vertex-pipelines&#8221;      # Service account
repository_id = &#8220;mlops-docker-repo&#8221;  # Artifact Registry</code></pre><p>This makes resources easy to identify and troubleshoot.</p><h3><strong>4. Security-First Configuration</strong></h3><p>Always use the most restrictive settings:</p><pre><code>uniform_bucket_level_access = true   # No legacy ACLs
public_access_prevention    = &#8220;enforced&#8221;  # Never public</code></pre><p>Loosen restrictions only when absolutely necessary with documented justification.</p><h3><strong>5. Module Outputs</strong></h3><p>Export values needed by other modules or external tools:</p><pre><code># outputs.tf
output &#8220;pipeline_root_bucket&#8221; {
  description = &#8220;GCS bucket for pipeline artifacts&#8221;
  value       = google_storage_bucket.pipeline_root_bucket.name
}

output &#8220;pipelines_sa_email&#8221; {
  description = &#8220;Email of Vertex Pipelines service account&#8221;
  value       = google_service_account.pipelines_sa.email
}</code></pre><p>This enables:</p><ul><li><p>Passing values between modules</p></li><li><p>Using Terraform outputs in CI/CD pipelines</p></li><li><p>Documentation of important resource identifiers</p></li></ul><h2><strong>Common Pitfalls and How We Avoid Them</strong></h2><h3><strong>Pitfall 1: API Enablement Race Conditions</strong></h3><p><strong>Problem:</strong> Creating resources before APIs are fully enabled causes failures.</p><p><strong>Solution:</strong> Explicit <code>depends_on</code> for all resources:</p><pre><code>resource &#8220;google_vertex_ai_metadata_store&#8221; &#8220;default_metadata_store&#8221; {
  # ... config ...
  depends_on = [google_project_service.gcp_services]
}</code></pre><h3><strong>Pitfall 2: Service Account Permissions Missing</strong></h3><p><strong>Problem:</strong> Pipelines fail with cryptic &#8220;Permission denied&#8221; errors.</p><p><strong>Solution:</strong> Comprehensive IAM configuration with commented explanations:</p><pre><code># Grant BigQuery access for data preprocessing
&#8220;roles/bigquery.dataEditor&#8221;,
# Enable job submission for BigQuery queries
&#8220;roles/bigquery.jobUser&#8221;,</code></pre><h3><strong>Pitfall 3: Bucket Permissions Too Broad</strong></h3><p><strong>Problem:</strong> Using <code>roles/storage.admin</code> on buckets grants excessive permissions.</p><p><strong>Solution:</strong> Minimal bucket-level permissions:</p><pre><code>for_each = toset([
  &#8220;roles/storage.objectAdmin&#8221;,        # Object operations only
  &#8220;roles/storage.legacyBucketReader&#8221;, # List bucket contents
])</code></pre><h3><strong>Pitfall 4: State File Conflicts</strong></h3><p><strong>Problem:</strong> Multiple developers running <code>terraform apply</code> at the same time can corrupt state.</p><p><strong>Solution:</strong> GCS backend with automatic locking:</p><pre><code>backend "gcs" {
  bucket = "my-project-tfstate"
  # Locking is automatic with GCS backend
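  # Assumption: mirrors the prefix passed via -backend-config during init;
  # hardcoding it here is optional if it is always supplied on the command line.
  prefix = "terraform/state"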
}</code></pre><h2><strong>Testing Infrastructure Code</strong></h2><p>Just like application code, infrastructure should be tested:</p><h3><strong>1. Terraform Validate</strong></h3><pre><code>terraform validate</code></pre><p>Checks syntax and internal consistency.</p><h3><strong>2. Terraform Plan</strong></h3><pre><code>terraform plan -out=tfplan</code></pre><p>Preview changes before applying. Review the plan in CI/CD.</p><h3><strong>3. tflint (Optional)</strong></h3><pre><code>tflint --init
tflint</code></pre><p>Catches common errors and enforces best practices.</p><h3><strong>4. terraform-docs (Documentation)</strong></h3><pre><code>terraform-docs markdown table . &gt; README.md</code></pre><p>Generates documentation from your Terraform code.</p><h2><strong>Real-World Example: Adding a New Service</strong></h2><p>Let&#8217;s walk through adding Cloud Scheduler support to enable scheduled pipeline runs.</p><h3><strong>Step 1: Update Module Variables</strong></h3><pre><code># terraform/modules/vertex_deployment/variables.tf
variable "enable_scheduler" {
  description = "Enable Cloud Scheduler for periodic pipeline runs"
  type        = bool
  default     = false
}

variable "training_schedule" {
  description = "Cron expression for training pipeline schedule"
  type        = string
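  # Sketch (assumption: not part of the original module): reject values that
  # do not have the five space-separated fields of a standard cron expression,
  # so a malformed schedule fails at plan time rather than at deploy time.
  validation {
    condition     = length(split(" ", trimspace(var.training_schedule))) == 5
    error_message = "training_schedule must be a 5-field cron expression."
  }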
  default     = "0 2 * * 0"  # Weekly on Sunday at 2 AM
}</code></pre><h3><strong>Step 2: Add Cloud Scheduler Resource</strong></h3><pre><code># terraform/modules/vertex_deployment/scheduler.tf
resource "google_cloud_scheduler_job" "training_pipeline_schedule" {
  count = var.enable_scheduler ? 1 : 0

  name     = "training-pipeline-schedule"
  schedule = var.training_schedule
  region   = var.region
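  # Assumption (not in the original module): pin the time zone explicitly so
  # "2 AM" is unambiguous; without it the schedule is interpreted in UTC.
  time_zone = "Etc/UTC"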
  pubsub_target {
    topic_name = google_pubsub_topic.pipeline_trigger.id
    data       = base64encode(jsonencode({
      pipeline_type = "training"
    }))
  }
}</code></pre><h3><strong>Step 3: Enable in Prod Only</strong></h3><pre><code># terraform/environments/prod/main.tf
module "vertex_deployment" {
  source     = "../../modules/vertex_deployment"
  project_id = var.project_id
  region     = var.region

  # Enable scheduled runs in prod only
  enable_scheduler  = true
  training_schedule = "0 2 * * 0"  # Weekly retraining
}</code></pre><h3><strong>Step 4: Deploy</strong></h3><pre><code>cd terraform/environments/prod
terraform plan  # Review changes
terraform apply # Deploy scheduler</code></pre><p>Dev and test remain unchanged (scheduler disabled by default).</p><h2><strong>Conclusion</strong></h2><p>Infrastructure as Code transforms MLOps from fragile and manual to robust and automated. With Terraform:</p><ul><li><p><strong>Environments are identical and reproducible</strong></p></li><li><p><strong>Changes are version-controlled and reviewed</strong></p></li><li><p><strong>Deployments are automated and consistent</strong></p></li><li><p><strong>Security is built-in from day one</strong></p></li><li><p><strong>Team collaboration is streamlined</strong></p></li></ul><p>Our Terraform modules provide:</p><ol><li><p>Complete Vertex AI infrastructure (Pipelines, Training, Registry)</p></li><li><p>Secure IAM with least-privilege service accounts</p></li><li><p>Event-driven orchestration with Pub/Sub</p></li><li><p>Artifact storage with GCS and Artifact Registry</p></li><li><p>Multi-environment deployment patterns</p></li></ol><p>In the next article, we&#8217;ll build on this infrastructure to create reusable Kubeflow Pipeline components that execute our ML workflows.</p><p><strong>Key Takeaways:</strong></p><ul><li><p>Use modules for reusable infrastructure patterns</p></li><li><p>Separate environments with distinct configurations</p></li><li><p>Store Terraform state remotely in GCS with versioning</p></li><li><p>Follow least-privilege IAM principles</p></li><li><p>Automate deployment with Cloud Build</p></li><li><p>Preview changes with <code>terraform plan</code> before applying</p></li></ul><p><strong>Next in Series</strong>: Building Reusable Kubeflow Pipeline Components</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>Terraform Files</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/terraform/modules/vertex_deployment">vertex_deployment 
module</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/tree/main/terraform/environments">Environment configs</a></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 2: Tools & Workflows for ML Teams]]></title><description><![CDATA[Part 2 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-5f1</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part-5f1</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 10:12:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5OUf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a> (You are here)</p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><h2><strong>Introduction</strong></h2><p>We&#8217;re building a complete production-ready MLOps system:</p><p>But here&#8217;s the truth: <strong>the best-designed system in the world is useless if developers hate using it.</strong></p><p>Developer experience (DX) is what determines whether your MLOps platform gets adopted or abandoned. 
A great DX means:</p><ul><li><p>Fast iteration cycles</p></li><li><p>Easy to get started</p></li><li><p>Simple debugging</p></li><li><p>Clear documentation</p></li><li><p>Smooth collaboration</p></li></ul><p>In this article, we&#8217;ll explore:</p><ul><li><p>Makefile shortcuts for common tasks</p></li><li><p>Poetry for dependency management</p></li><li><p>Pre-commit hooks for code quality</p></li><li><p>Local testing and debugging workflows</p></li><li><p>IDE setup and productivity tips</p></li><li><p>Team collaboration patterns</p></li></ul><p>By the end, you&#8217;ll understand how to create an MLOps platform that developers actually enjoy using.</p><h2><strong>The Developer&#8217;s Daily Workflow</strong></h2><p>Let&#8217;s follow a typical development day:</p><h3><strong>Morning: Pick Up a Task</strong></h3><pre><code># Pull latest changes
git pull origin main

# Create feature branch
git checkout -b feature/add-new-feature

# Activate environment
cd pipelines
poetry shell</code></pre><h3><strong>Mid-Morning: Develop Locally</strong></h3><pre><code># Make changes to a component
vim components/src/components/my_component.py

# Run unit tests (fast feedback)
make test-components

# Compile pipeline to check syntax
make compile pipeline=training</code></pre><h3><strong>Afternoon: Test in Cloud</strong></h3><pre><code># Build training container
make build

# Run full pipeline in dev environment
make training enable_caching=false

# Check Vertex AI UI for results</code></pre><h3><strong>End of Day: Open PR</strong></h3><pre><code># Commit changes (pre-commit hooks run)
git add .
git commit -m "Add new feature for X"

# Push and create PR
git push origin feature/add-new-feature
gh pr create --title "Add new feature" --body "..."</code></pre><p><strong>Key observation</strong>: Notice the <strong>short feedback loops</strong>. Developers don&#8217;t wait hours for CI/CD; they get fast local feedback first.</p><h2><strong>Makefile: Developer-Friendly Automation</strong></h2><p>Typing long commands is tedious. Our Makefile provides shortcuts:</p><h3><strong>Common Commands</strong></h3><pre><code># Install dependencies
make install

# Run unit tests
make test-components
make test-pipelines

# Compile pipelines
make compile pipeline=training
make compile pipeline=prediction

# Build and push Docker image
make build
# Run pipelines
make training
make prediction
# Run E2E tests
make e2e-tests pipeline=training</code></pre><h3><strong>Behind the Scenes</strong></h3><p>Let&#8217;s look at the <code>make training</code> command:</p><pre><code>training: ## Run training pipeline
&#9;@$(MAKE) run pipeline=training

run: ## Run a pipeline. Set pipeline=&lt;training|prediction&gt;.
&#9;@if [ "$(compile)" = "true" ]; then \
&#9;&#9;$(MAKE) compile ; \
&#9;fi &amp;&amp; \
&#9;if [ "$(build)" = "true" ]; then \
&#9;&#9;$(MAKE) build ; \
&#9;fi &amp;&amp; \
&#9;cd pipelines/src &amp;&amp; \
&#9;poetry run python -m pipelines.utils.trigger_pipeline \
&#9;  --template_path=./taxifare-${pipeline}-pipeline.yaml \
&#9;  --display_name=taxifare-${pipeline}-pipeline \
&#9;  --enable_caching=${enable_caching} \
&#9;  --use_latest_data=${use_latest_data}</code></pre><p><strong>Benefits</strong>:</p><ul><li><p>Simple interface: <code>make training</code> instead of 5 commands</p></li><li><p>Configurable: <code>make training build=false enable_caching=true</code></p></li><li><p>Self-documenting: <code>make help</code> shows all targets</p></li></ul><h3><strong>Custom Targets</strong></h3><p>Add your own shortcuts:</p><pre><code># Quick iteration: compile + run (no build)
# (variables go on the sub-make command line so the recursive call sees them)
quick:
&#9;@$(MAKE) training compile=true build=false enable_caching=true

# Full test: build + compile + run
full:
&#9;@$(MAKE) training compile=true build=true enable_caching=false</code></pre><p>Usage:</p><pre><code>make quick  # Fast iteration
make full   # Complete test</code></pre><h2><strong>Poetry: Dependency Management Done Right</strong></h2><p><strong>Why Poetry?</strong> Better than <code>pip + requirements.txt</code>:</p><h3><strong>Clean Dependency Declaration</strong></h3><pre><code># pyproject.toml
[tool.poetry.dependencies]
python = "^3.10"
google-cloud-aiplatform = "^1.55.0"
kfp = "^2.7.0"
pandas = "^2.0.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.0"
black = "^23.7.0"
flake8 = "^6.1.0"
pre-commit = "^3.3.0"</code></pre><p><strong>Benefits</strong>:</p><ul><li><p>Lock file ensures reproducibility</p></li><li><p>Dependency groups (dev, test, prod)</p></li><li><p>Semantic versioning (<code>^1.55.0</code> allows &gt;=1.55.0, &lt;2.0.0)</p></li><li><p>Fast dependency resolution</p></li></ul><h3><strong>Common Poetry Commands</strong></h3><pre><code># Install all dependencies (including dev)
poetry install --with dev
# Install production dependencies only
poetry install --without dev
# Add a new dependency
poetry add google-cloud-bigquery
# Add a dev dependency
poetry add --group dev pytest-cov
# Update dependencies
poetry update
# Activate virtual environment
poetry shell
# Run command in environment
poetry run python -m pipelines.training</code></pre><h3><strong>Virtual Environment Management</strong></h3><p>Poetry creates isolated environments:</p><pre><code># Where is the environment?
poetry env info
# Output:
# Virtualenv
# Python:         3.10.14
# Path:           /home/user/.cache/pypoetry/virtualenvs/pipelines-abc123

# Multiple Python versions
poetry env use python3.10
poetry env use python3.11</code></pre><h2><strong>Pre-commit Hooks: Automatic Code Quality</strong></h2><p><strong>Problem</strong>: Code style inconsistencies slow down code reviews.</p><p><strong>Solution</strong>: Automate formatting and linting before commit.</p><h3><strong>Hook Configuration</strong></h3><pre><code>repos:
  # Basic checks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    hooks:
      - id: trailing-whitespace    # Remove trailing whitespace
      - id: end-of-file-fixer      # Ensure files end with newline
      - id: check-yaml             # Validate YAML files
      - id: check-added-large-files  # Prevent huge files
      - id: check-merge-conflict   # Catch merge markers
  # Code formatting
  - repo: https://github.com/psf/black
    hooks:
      - id: black
        args: [--line-length=100]
  # Linting
  - repo: https://github.com/pycqa/flake8
    hooks:
      - id: flake8
        args: [--max-line-length=100, --ignore=E203,W503]
  # Modern linting + auto-fix
  - repo: https://github.com/astral-sh/ruff-pre-commit
    hooks:
      - id: ruff
        args: [--fix]
      - id: ruff-format</code></pre><h3><strong>Installation and Usage</strong></h3><pre><code># Install hooks (once per repository)
cd pipelines
poetry run pre-commit install

# Hooks run automatically on git commit
git add my_file.py
git commit -m "Add feature"

# Pre-commit runs:
# &#9989; Removes trailing whitespace
# &#9989; Formats code with black
# &#9989; Checks with flake8
# &#9989; Auto-fixes with ruff

# Commit proceeds only if all pass</code></pre><h3><strong>Manual Execution</strong></h3><pre><code># Run all hooks on all files
pre-commit run --all-files
# Run specific hook
pre-commit run black --all-files
# Skip hooks (use sparingly!)
git commit --no-verify -m "Emergency fix"</code></pre><h3><strong>What Gets Caught</strong></h3><p><strong>Before black</strong>:</p><pre><code>def my_function(x,y,z):
    result=x+y+z
    return result</code></pre><p><strong>After black</strong>:</p><pre><code>def my_function(x, y, z):
    result = x + y + z
    return result</code></pre><p><strong>flake8 errors</strong>:</p><pre><code>my_file.py:45:101: E501 line too long (101 &gt; 100 characters)
my_file.py:67:1: F401 'os' imported but unused</code></pre><h2><strong>Local Testing and Debugging</strong></h2><h3><strong>Unit Tests with pytest</strong></h3><pre><code># Run all tests
cd components
poetry run pytest tests/
# Run specific test file
poetry run pytest tests/test_upload_best_model_op.py
# Run specific test
poetry run pytest tests/test_upload_best_model_op.py::test_champion_wins
# Show print statements
poetry run pytest -s tests/
# Stop at first failure
poetry run pytest -x tests/
# Show coverage
poetry run pytest --cov=components tests/</code></pre><h3><strong>Test Structure</strong></h3><pre><code>components/
&#9500;&#9472;&#9472; src/
&#9474;   &#9492;&#9472;&#9472; components/
&#9474;       &#9492;&#9472;&#9472; upload_best_model_op.py
&#9492;&#9472;&#9472; tests/
    &#9500;&#9472;&#9472; conftest.py                      # Shared fixtures
    &#9500;&#9472;&#9472; test_upload_best_model_op.py
    &#9492;&#9472;&#9472; test_lookup_model_op.py</code></pre><h3><strong>Example Test with Mocking</strong></h3><pre><code># tests/test_upload_best_model_op.py
from unittest.mock import Mock, patch

from components import upload_best_model_op

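# Hypothetical helper, not part of the original article: the test below calls
# create_metrics_file() without showing it. This sketch assumes the component
# only needs an object exposing a .path attribute that points at a JSON file
# of metrics; adjust it to whatever artifact type your component expects.
import json
from types import SimpleNamespace


def create_metrics_file(tmp_path, metrics):
    """Write metrics to a JSON file under tmp_path and return a stand-in artifact."""
    metrics_path = tmp_path / "metrics.json"
    metrics_path.write_text(json.dumps(metrics))
    return SimpleNamespace(path=str(metrics_path))
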
@patch("google.cloud.aiplatform.Model")
def test_first_model_upload(mock_model_class, tmp_path):
    """Test uploading first model (no champion exists)."""

    # Mock: No existing models
    mock_model_class.list.return_value = []

    # Mock: Upload returns model
    mock_uploaded = Mock()
    mock_uploaded.versioned_resource_name = "models/123/versions/1"
    mock_model_class.upload.return_value = mock_uploaded

    # Create test metrics
    metrics = {"problemType": "regression", "rmse": 2.5}

    # Call component
    upload_best_model_op.python_func(
        model_eval_metrics=create_metrics_file(tmp_path, metrics),
        eval_metric="rmse",
        eval_lower_is_better=True,
        model_name="test-model",
        # ... other params
    )

    # Verify upload was called with is_default_version=True
    mock_model_class.upload.assert_called_once()
    call_args = mock_model_class.upload.call_args
    assert call_args.kwargs["is_default_version"] is True</code></pre><p><strong>Key testing patterns</strong>:</p><ul><li><p>Mock GCP APIs (no real API calls)</p></li><li><p>Use <code>tmp_path</code> fixture for file operations</p></li><li><p>Test edge cases (no champion, champion wins, challenger wins)</p></li><li><p>Verify function calls with <code>assert_called_once_with</code></p></li></ul><h2><strong>Debugging Failed Pipelines</strong></h2><p><strong>Scenario</strong>: Pipeline failed in Vertex AI. How to debug locally?</p><pre><code># 1. Find the failed step in Vertex AI UI
#    Example: &#8220;Upload model&#8221; component failed

# 2. Extract component function
cd components
poetry shell

python
&gt;&gt;&gt; from components import upload_best_model_op
&gt;&gt;&gt; func = upload_best_model_op.python_func

# 3. Call with test data
&gt;&gt;&gt; func(
...     model=test_model,
...     model_eval_metrics=test_metrics,
...     eval_metric="rmse",
...     eval_lower_is_better=True,
...     model_name="test-model",
...     # ... other params
... )

# 4. Add print/logging for debugging
&gt;&gt;&gt; import logging
&gt;&gt;&gt; logging.basicConfig(level=logging.DEBUG)
&gt;&gt;&gt; func(...)  # Run again with debug logging</code></pre><h2><strong>Local Pipeline Compilation</strong></h2><p>Test pipeline compiles before pushing:</p><pre><code># Compile training pipeline
cd pipelines/src
poetry run python -m pipelines.training

# Output: taxifare-training-pipeline.yaml

# Inspect compiled YAML
head -50 taxifare-training-pipeline.yaml</code></pre><p><strong>Common compilation errors</strong>:</p><pre><code>Error: Component &#8216;my_component_op&#8217; not found
&#8594; Check import in training.py

Error: Type mismatch: expected Dataset, got Model
&#8594; Check component input/output types

Error: Missing required parameter &#8216;project&#8217;
&#8594; Check pipeline function signature</code></pre><h2><strong>IDE Setup and Productivity</strong></h2><h3><strong>VS Code Configuration</strong></h3><pre><code>// .vscode/settings.json
{
  // Python interpreter
  "python.defaultInterpreterPath": "${workspaceFolder}/pipelines/.venv/bin/python",

  // Formatting
  "python.formatting.provider": "black",
  "editor.formatOnSave": true,

  // Linting
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.pylintEnabled": false,

  // Type checking
  "python.analysis.typeCheckingMode": "basic",

  // Pytest
  "python.testing.pytestEnabled": true,
  "python.testing.unittestEnabled": false,

  // File associations
  "files.associations": {
    "*.yaml": "yaml",
    "*.tf": "terraform"
  }
}</code></pre><h3><strong>Recommended VS Code Extensions</strong></h3><pre><code>{
  "recommendations": [
    "ms-python.python",           // Python support
    "ms-python.vscode-pylance",   // Fast type checking
    "hashicorp.terraform",        // Terraform support
    "redhat.vscode-yaml",         // YAML support
    "eamodio.gitlens",            // Git superpowers
    "ms-azuretools.vscode-docker" // Docker support
  ]
}</code></pre><h2><strong>Environment Management</strong></h2><h3><strong>Environment Variables</strong></h3><pre><code># env.sh (never commit!)
export VERTEX_PROJECT_ID="my-dev-project"
export VERTEX_LOCATION="us-central1"
export BQ_LOCATION="US"
export VERTEX_PIPELINE_ROOT="gs://my-dev-project-pl-root"
export VERTEX_SA_EMAIL="vertex-pipelines@my-dev-project.iam.gserviceaccount.com"
export IMAGE_NAME="training"
export IMAGE_TAG="latest"

# Load environment
source env.sh

# Verify
echo $VERTEX_PROJECT_ID</code></pre><h3><strong>Example File</strong></h3><pre><code># env.sh.example (committed to Git)
export VERTEX_PROJECT_ID="your-project-id"
export VERTEX_LOCATION="us-central1"
export BQ_LOCATION="US"
export VERTEX_PIPELINE_ROOT="gs://your-bucket/pipeline-root"
export VERTEX_SA_EMAIL="vertex-pipelines@your-project.iam.gserviceaccount.com"
export IMAGE_NAME="training"
export IMAGE_TAG="latest"</code></pre><p>New developers:</p><pre><code>cp env.sh.example env.sh
vim env.sh  # Update with your values
source env.sh</code></pre><h2><strong>Git Workflow and Collaboration</strong></h2><h3><strong>Branch Naming Conventions</strong></h3><pre><code>feature/add-hyperparameter-tuning
bugfix/fix-preprocessing-null-handling
refactor/simplify-upload-component
docs/update-readme</code></pre><p><strong>Commit message examples</strong>:</p><pre><code>feat: add learning rate scheduling to training

Implements cosine annealing learning rate schedule.
Improves model convergence speed by 20%.

Closes #123</code></pre><pre><code>fix: handle null values in preprocessing query

Previously, rows with null trip_seconds caused
preprocessing to fail. Now uses COALESCE to replace
with median value.

Fixes #456</code></pre><h2><strong>Commit Message Guidelines</strong></h2><h3><strong>Using Commitizen (Conventional Commits)</strong></h3><p><strong>Commit format</strong></p><pre><code>&lt;type&gt;(optional scope): &lt;short description&gt;

[optional body]

[optional footer]</code></pre><h3><strong>Type Mapping (Your Guidelines &#8594; Commitizen)</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5OUf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5OUf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 424w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 848w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 1272w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5OUf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png" width="788" height="503" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5OUf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 424w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 848w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 1272w, https://substackcdn.com/image/fetch/$s_!5OUf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd36af116-11e6-45fe-a3ef-e84d5c3224f4_788x503.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Feature</strong></p><pre><code>feat(auth): add OAuth2 login support</code></pre><p><strong>Bug fix</strong></p><pre><code>fix(api): handle null user response</code></pre><p><strong>Refactor</strong></p><pre><code>refactor(pipeline): simplify data preprocessing logic</code></pre><p><strong>Docs</strong></p><pre><code>docs(readme): add setup instructions for local dev</code></pre><h2><strong>Code Review Checklist</strong></h2><p><strong>For reviewers</strong>:</p><ul><li><p>[ ] Does code follow existing patterns?</p></li><li><p>[ ] Are tests added/updated?</p></li><li><p>[ ] Is documentation updated?</p></li><li><p>[ ] Do pipelines compile?</p></li><li><p>[ ] Are there any security issues?</p></li></ul><p><strong>For authors</strong>:</p><ul><li><p>[ ] Run pre-commit hooks</p></li><li><p>[ ] Run unit tests locally</p></li><li><p>[ ] Compile pipelines</p></li><li><p>[ ] Update CHANGELOG if needed</p></li><li><p>[ ] Request specific 
reviewers</p></li></ul><h2><strong>Productivity Tips</strong></h2><h3><strong>Shell Aliases</strong></h3><pre><code># ~/.bashrc or ~/.zshrc
# Note: gs, pr, and ps shadow existing commands (Ghostscript, pr, ps);
# pick other names if you rely on those tools.
alias gs='git status'
alias gl='git log --oneline -10'
alias gp='git pull origin main'
alias v='source env.sh'

# Poetry shortcuts
alias pi='poetry install'
alias pr='poetry run'
alias ps='poetry shell'

# Make shortcuts
alias mt='make test-components &amp;&amp; make test-pipelines'
alias mc='make compile pipeline=training &amp;&amp; make compile pipeline=prediction'</code></pre><h3><strong>Quick Iteration Workflow</strong></h3><pre><code># One-liner: test + compile + run
make test-components &amp;&amp; make compile pipeline=training &amp;&amp; make training build=false</code></pre><h2><strong>Jupyter Notebooks for Exploration</strong></h2><pre><code># Install Jupyter
poetry add --group dev jupyter

# Launch
poetry run jupyter notebook

# Explore data (run this part in a notebook cell, not in the shell)
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-dev-project")
df = client.query("SELECT * FROM dataset.table LIMIT 100").to_dataframe()
df.head()</code></pre><h2><strong>Onboarding New Team Members</strong></h2><h3><strong>Day 1 Checklist</strong></h3><p><strong>Setup Checklist</strong></p><ul><li><p>[ ] Install prerequisites</p><ul><li><p>[ ] Python 3.10+</p></li><li><p>[ ] Poetry</p></li><li><p>[ ] Docker</p></li><li><p>[ ] gcloud CLI</p></li><li><p>[ ] Terraform</p></li><li><p>[ ] Git</p></li></ul></li><li><p>[ ] Clone repository</p></li></ul><pre><code>git clone https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP
cd production-ready-MLOps-on-GCP</code></pre><ul><li><p>[ ] Configure environment</p></li></ul><pre><code>cp env.sh.example env.sh
# Edit env.sh with your values
source env.sh</code></pre><ul><li><p>[ ] Install dependencies</p></li></ul><pre><code>make install</code></pre><ul><li><p>[ ] Set up pre-commit hooks</p></li></ul><pre><code>cd pipelines
poetry run pre-commit install</code></pre><ul><li><p>[ ] Authenticate to GCP</p></li></ul><pre><code>gcloud auth login
gcloud auth application-default login</code></pre><ul><li><p>[ ] Run tests</p></li></ul><pre><code>make test-components
make test-pipelines</code></pre><ul><li><p>[ ] Compile pipelines</p></li></ul><pre><code>make compile pipeline=training
make compile pipeline=prediction</code></pre><ul><li><p>[ ] Run first pipeline</p></li></ul><pre><code>make training build=true enable_caching=false</code></pre><h3><strong>First Contribution</strong></h3><pre><code># Pick a good first issue
# Look for issues labeled &#8220;good first issue&#8221; or &#8220;beginner-friendly&#8221;

# Example: Update documentation
git checkout -b docs/fix-typo-in-readme
vim README.md
git add README.md
git commit -m "docs: fix typo in README"
git push origin docs/fix-typo-in-readme
gh pr create</code></pre><h2><strong>Documentation Best Practices</strong></h2><h3><strong>Code Documentation</strong></h3><pre><code>from kfp.dsl import Input, Metrics, Model


def upload_best_model_op(
    model: Input[Model],
    model_eval_metrics: Input[Metrics],
    eval_metric: str,
    eval_lower_is_better: bool,
    model_name: str,
) -&gt; None:
    """
    Upload model to registry only if it beats the champion.

    Implements the Champion/Challenger pattern: compares new model
    against current default model in registry. Uploads new model
    as default only if it has better performance on eval_metric.

    Args:
        model: Trained model to evaluate as challenger.
        model_eval_metrics: Evaluation metrics from test set.
        eval_metric: Metric name for comparison (e.g., "rmse").
        eval_lower_is_better: True for losses, False for scores.
        model_name: Display name in Model Registry.

    Returns:
        None. Uploads model to Vertex AI Model Registry.

    Example:
        &gt;&gt;&gt; upload_best_model_op(
        ...     model=trained_model,
        ...     model_eval_metrics=metrics,
        ...     eval_metric="rmse",
        ...     eval_lower_is_better=True,
        ...     model_name="taxi-fare-model"
        ... )
    """</code></pre><h3><strong>README Structure</strong></h3><pre><code># Project Name

## Overview
[Brief description]

## Prerequisites
[Required tools and versions]

## Setup
[Step-by-step installation]

## Usage
[Common commands]

## Testing
[How to run tests]

## Contributing
[Contribution guidelines]

## Troubleshooting
[Common issues and solutions]</code></pre><h2><strong>Conclusion</strong></h2><p>Developer experience is what separates a good MLOps platform from a great one. Investing in the following creates an environment where developers can focus on improving models, not fighting tools:</p><ul><li><p><strong>Makefile shortcuts</strong>: Common tasks are one command away</p></li><li><p><strong>Poetry</strong>: Reliable dependency management</p></li><li><p><strong>Pre-commit hooks</strong>: Automatic code quality</p></li><li><p><strong>Fast local testing</strong>: Iteration without waiting for CI</p></li><li><p><strong>Clear documentation</strong>: Easy onboarding</p></li><li><p><strong>Smooth Git workflow</strong>: Collaboration without friction</p></li></ul><p>This series has taken you from architecture to implementation to deployment to developer experience. You now have a complete blueprint for building production-ready MLOps systems on GCP.</p><p><strong>What&#8217;s next?</strong></p><ul><li><p>Star the <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">GitHub repository</a></p></li><li><p>Try implementing it for your use case</p></li><li><p>Share feedback and improvements</p></li><li><p>Help others by answering questions</p></li></ul><p>Thank you for following this series. 
Now go build amazing ML systems!</p><p><strong>Key Takeaways:</strong></p><ul><li><p>Makefile provides developer-friendly shortcuts</p></li><li><p>Poetry manages dependencies reliably</p></li><li><p>Pre-commit hooks enforce code quality automatically</p></li><li><p>Fast local feedback loops increase productivity</p></li><li><p>Good documentation lowers onboarding time</p></li><li><p>Developer experience determines platform adoption</p></li></ul><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p><strong>Developer Tools</strong>:</p><ul><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/Makefile">Makefile</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/pipelines/pyproject.toml">pyproject.toml</a></p></li><li><p><a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP/blob/main/.pre-commit-config.yaml">pre-commit-config.yaml</a></p><p></p></li></ul><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Production-Ready MLOps on GCP Part 1: Architecture Overview]]></title><description><![CDATA[Part 1 of an 8-part series on building enterprise-grade MLOps systems]]></description><link>https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/production-ready-mlops-on-gcp-part</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 10:12:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tQ3c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Introduction</strong></h2><blockquote><p><em>If you&#8217;ve ever tried to move a machine learning model from your Jupyter notebook to production, you know the struggle. The model works beautifully on your laptop, but suddenly you&#8217;re drowning in questions: How do I retrain it automatically? How do I version my models? How do I deploy to multiple environments? How do I monitor model performance over time?</em></p></blockquote><p>Welcome to the world of MLOps &#8212; where the real challenge isn&#8217;t building models, it&#8217;s building <strong>systems</strong> that can train, deploy, and maintain models reliably at scale.</p><p>In this series, I&#8217;ll walk you through a complete production-ready MLOps implementation on Google Cloud Platform. 
We&#8217;ll use a real-world use case (predicting Chicago taxi fares) to demonstrate how to build an ML system that&#8217;s actually ready for production, not just a proof-of-concept.</p><p><strong>Complete Series</strong>:</p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 1: Architecture Overview</a> (You are here)</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-5f1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 2: Tools &amp; Workflows for ML Teams</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-06c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 3: Infrastructure as Code</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-8ac?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 4: Reusable KFP Components</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-022?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 5: Production Training Pipeline</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-a6c?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 6: Production Prediction Pipeline </a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-9c6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 7: CI/CD for ML</a></p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/production-ready-mlops-on-gcp-part-e8f?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Article 8: Model Monitoring &amp; Continuous Training</a></p></li></ul><p>By the end of this series, you&#8217;ll understand how to:</p><ul><li><p>Structure multi-environment ML infrastructure (dev/test/prod)</p></li><li><p>Build reusable, testable ML pipeline components</p></li><li><p>Automate CI/CD for machine learning workflows</p></li><li><p>Implement model versioning, evaluation, and monitoring</p></li><li><p>Design event-driven continuous training systems</p></li></ul><p>Let&#8217;s start with the big picture.</p><h2><strong>The Challenge: Why Most ML Projects Fail in Production</strong></h2><p>According to various industry reports, 85&#8211;90% of ML projects never make it to production. Even when they do, many fail within the first year. Why?</p><p>The gap between a working ML model and a production ML system is enormous:</p><ol><li><p><strong>Automation</strong>: Models need to retrain automatically when new data arrives</p></li><li><p><strong>Reproducibility</strong>: You need to recreate any model version from the past</p></li><li><p><strong>Testing</strong>: Both code and data need comprehensive validation</p></li><li><p><strong>Monitoring</strong>: Model performance degrades over time and needs tracking</p></li><li><p><strong>Multi-environment deployment</strong>: Changes must flow through dev &#8594; test &#8594; prod</p></li><li><p><strong>Compliance</strong>: You need audit trails, lineage tracking, and governance</p></li></ol><p>This is where MLOps comes in, applying DevOps principles to machine learning workflows.</p><h2><strong>Our Solution: A Complete MLOps Architecture on GCP</strong></h2><p>Our reference implementation addresses these challenges with a comprehensive architecture built on Google Cloud Platform. 
Let&#8217;s break down the key components.</p><h3><strong>The Use Case: Chicago Taxi Fare Prediction</strong></h3><p>To keep this practical, we&#8217;re solving a real problem: predicting taxi fares for Chicago taxi trips. This use case demonstrates common ML patterns:</p><ul><li><p><strong>Tabular data</strong> from BigQuery (public Chicago taxi dataset)</p></li><li><p><strong>Feature engineering</strong> with both numeric and categorical variables</p></li><li><p><strong>Batch predictions</strong> for generating forecasts at scale</p></li><li><p><strong>Model retraining</strong> when new data becomes available</p></li><li><p><strong>Champion/Challenger comparison</strong> for model evaluation</p></li></ul><p>The model predicts fare amounts based on:</p><ul><li><p>Trip characteristics (distance, duration, time of day)</p></li><li><p>Temporal features (day of week, hour of day)</p></li><li><p>Categorical features (payment type, company)</p></li></ul><h2><strong>High-Level Architecture</strong></h2><p>Our architecture follows a <strong>multi-project strategy</strong> with four distinct GCP projects:</p><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;  Admin Project                                              &#9474;
&#9474;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;   &#9474;
&#9474;  &#9474;  Cloud Build CI/CD Pipelines                         &#9474;   &#9474;
&#9474;  &#9474;  - PR Checks        - Terraform Plan/Apply           &#9474;   &#9474;
&#9474;  &#9474;  - E2E Tests        - Release Management             &#9474;   &#9474;
&#9474;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                            &#9474;
                            &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
                            &#9660;              &#9660;              &#9660;
              &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
              &#9474;  Dev Project     &#9474;  &#9474; Test Project &#9474;  &#9474; Prod Project &#9474;
              &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
              &#9474; &#8226; Vertex AI      &#9474;  &#9474; &#8226; Vertex AI  &#9474;  &#9474; &#8226; Vertex AI  &#9474;
              &#9474; &#8226; BigQuery       &#9474;  &#9474; &#8226; BigQuery   &#9474;  &#9474; &#8226; BigQuery   &#9474;
              &#9474; &#8226; GCS            &#9474;  &#9474; &#8226; GCS        &#9474;  &#9474; &#8226; GCS        &#9474;
              &#9474; &#8226; Artifact Reg.  &#9474;  &#9474; &#8226; Artifact   &#9474;  &#9474; &#8226; Artifact   &#9474;
              &#9474; &#8226; Model Registry &#9474;  &#9474; &#8226; Model Reg. &#9474;  &#9474; &#8226; Model Reg. &#9474;
              &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9474; &#8226; Cloud Run  &#9474;
                                                      &#9474; &#8226; Pub/Sub    &#9474;
                                                      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><p><strong>Why four projects?</strong></p><ol><li><p><strong>Dev Project</strong>: Shared sandbox for experimentation and development</p></li><li><p><strong>Test Project</strong>: Mirrors production for validation before release</p></li><li><p><strong>Prod Project</strong>: Production environment with strict controls</p></li><li><p><strong>Admin Project</strong>: Centralized CI/CD that deploys to all environments</p></li></ol><p>This separation provides:</p><ul><li><p><strong>Isolation</strong>: Changes in dev don&#8217;t affect production</p></li><li><p><strong>Security</strong>: Different IAM policies per environment</p></li><li><p><strong>Cost tracking</strong>: Separate billing for each environment</p></li><li><p><strong>Compliance</strong>: Clear audit trails and change controls</p></li></ul><h2><strong>The GCP Services Stack</strong></h2><p>Our solution leverages these Google Cloud services:</p><p><strong>ML Platform (Vertex AI)</strong></p><ul><li><p><strong>Vertex AI Pipelines</strong>: Orchestrates ML workflows using Kubeflow</p></li><li><p><strong>Vertex AI Training</strong>: Runs custom training jobs with hyperparameter tuning</p></li><li><p><strong>Vertex AI Model Registry</strong>: Versions and manages models with lineage</p></li><li><p><strong>Vertex AI Batch Prediction</strong>: Executes large-scale inference</p></li><li><p><strong>Vertex AI Metadata Store</strong>: Tracks artifacts, lineage, and experiments</p></li></ul><p><strong>Data &amp; Storage</strong></p><ul><li><p><strong>BigQuery</strong>: Data warehouse for preprocessing and feature engineering</p></li><li><p><strong>Cloud Storage</strong>: Stores artifacts, datasets, and pipeline outputs</p></li><li><p><strong>Artifact Registry</strong>: Hosts Docker images and compiled 
pipelines</p></li></ul><p><strong>Automation &amp; Orchestration</strong></p><ul><li><p><strong>Cloud Build</strong>: CI/CD pipelines for testing and deployment</p></li><li><p><strong>Cloud Run Functions</strong>: Event-driven pipeline triggers</p></li><li><p><strong>Cloud Pub/Sub</strong>: Asynchronous messaging for pipeline events</p></li><li><p><strong>Vertex AI Pipeline Schedules</strong>: Periodic pipeline execution</p></li></ul><p><strong>Infrastructure &amp; Security</strong></p><ul><li><p><strong>Terraform</strong>: Infrastructure as Code for reproducible deployments</p></li><li><p><strong>IAM &amp; Service Accounts</strong>: Fine-grained access control</p></li><li><p><strong>Cloud Monitoring &amp; Logging</strong>: Observability and debugging</p></li></ul><h2><strong>The Two Core ML Pipelines</strong></h2><p>Our system implements two main pipelines orchestrated with Kubeflow Pipelines (KFP):</p><h3><strong>1. Training Pipeline</strong></h3><p>The training pipeline executes these steps:</p><pre><code>Data Preprocessing (BigQuery)
         &#8595;
   Data Splitting (80/10/10)
         &#8595;
  Export to GCS (CSV)
         &#8595;
Hyperparameter Tuning (6 trials)
         &#8595;
  Model Training (Custom TF Container)
         &#8595;
Model Evaluation (Test Set)
         &#8595;
Champion/Challenger Comparison
         &#8595;
Upload to Model Registry</code></pre><p><strong>Key features:</strong></p><ul><li><p><strong>Repeatable data splits</strong>: Same random seed ensures reproducibility</p></li><li><p><strong>BigQuery-native preprocessing</strong>: SQL-based feature engineering</p></li><li><p><strong>Custom TensorFlow container</strong>: Full control over training logic</p></li><li><p><strong>Automatic hyperparameter tuning</strong>: Vertex AI optimizes learning rate and batch size</p></li><li><p><strong>Champion/Challenger pattern</strong>: New models must beat existing champion on RMSE</p></li><li><p><strong>Model versioning</strong>: All models tagged and stored in registry</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tQ3c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tQ3c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 424w, https://substackcdn.com/image/fetch/$s_!tQ3c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 848w, https://substackcdn.com/image/fetch/$s_!tQ3c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tQ3c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tQ3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png" width="788" height="373" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:373,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tQ3c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 424w, https://substackcdn.com/image/fetch/$s_!tQ3c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 848w, https://substackcdn.com/image/fetch/$s_!tQ3c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tQ3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299860ad-e4b5-4351-aa84-dc32c23cef5d_788x373.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!80Nd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!80Nd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 424w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 848w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 1272w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!80Nd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png" width="788" height="321" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:321,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!80Nd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 424w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 848w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 1272w, https://substackcdn.com/image/fetch/$s_!80Nd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e0b4999-6672-4639-b062-725b13d92b7d_788x321.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>2. Prediction Pipeline</strong></h3><p>The prediction pipeline handles batch inference:</p><p>Lookup Champion Model (Registry)<br>&#8595;<br>Data Preprocessing (BigQuery)<br>&#8595;<br>Batch Prediction (BQ &#8594; BQ)<br>&#8595;<br>Model Monitoring (Skew Detection)<br>&#8595;<br>Alert on Issues</p><p><strong>Key features:</strong></p><ul><li><p><strong>Consistent preprocessing</strong>: Uses same SQL logic as training</p></li><li><p><strong>Scalable inference</strong>: BigQuery batch predictions for millions of rows</p></li><li><p><strong>Training-serving skew detection</strong>: Monitors for data drift</p></li><li><p><strong>Automated alerts</strong>: Email notifications when skew is detected</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-12d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-12d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 424w, https://substackcdn.com/image/fetch/$s_!-12d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 848w, 
https://substackcdn.com/image/fetch/$s_!-12d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 1272w, https://substackcdn.com/image/fetch/$s_!-12d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-12d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png" width="788" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-12d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 424w, https://substackcdn.com/image/fetch/$s_!-12d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 848w, 
https://substackcdn.com/image/fetch/$s_!-12d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 1272w, https://substackcdn.com/image/fetch/$s_!-12d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa47c21f0-a865-47d5-99b7-74bf2f2aa2a3_788x345.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Reusable Component Library</strong></h2><p>A critical aspect of our architecture is <strong>reusability</strong>. 
We&#8217;ve built 8 custom Kubeflow components that can be mixed and matched:</p><ol><li><p><code>extract_table_to_gcs_op</code>: Export BigQuery &#8594; Cloud Storage</p></li><li><p><code>get_training_args_dict_op</code>: Build training configuration</p></li><li><p><code>get_workerpool_spec_op</code>: Configure distributed training</p></li><li><p><code>get_hyperparameter_tuning_results_op</code>: Parse tuning results</p></li><li><p><code>get_custom_job_results_op</code>: Extract training metrics</p></li><li><p><code>lookup_model_op</code>: Find models in registry by criteria</p></li><li><p><code>upload_best_model_op</code>: Champion/Challenger comparison</p></li><li><p><code>model_batch_predict_op</code>: Execute predictions with monitoring</p></li></ol><p>These components are the building blocks that make our pipelines composable and maintainable.</p><h2><strong>The Development Workflow</strong></h2><p>Here&#8217;s how a typical development cycle works:</p><h3><strong>1. Local Development</strong></h3><pre><code># Developer makes changes locally
git checkout -b feature/improve-model
# Run pre-commit hooks (linting, formatting)
poetry run pre-commit run --all-files
# Run unit tests
make test
# Compile pipeline locally
make run pipeline=training compile=true build=false
# Test in dev environment
make training enable_caching=false</code></pre><h3><strong>2. Pull Request &amp; CI</strong></h3><p>When you open a PR, Cloud Build automatically:</p><ul><li><p>Runs pre-commit hooks (flake8, black, ruff)</p></li><li><p>Executes unit tests on components and pipelines</p></li><li><p>Compiles pipelines to verify syntax</p></li><li><p>Runs Terraform plan to preview infrastructure changes</p></li><li><p>(Optionally) Runs E2E tests with <code>/gcbrun</code> comment</p></li></ul><h3><strong>3. Merge &amp; Deployment</strong></h3><p>On merge to main:</p><ul><li><p>Terraform Apply deploys infrastructure changes</p></li><li><p>Code is ready for release</p></li></ul><h3><strong>4. Release</strong></h3><p>Creating a git tag triggers:</p><ul><li><p>Docker image builds for all environments</p></li><li><p>Pipeline compilation and versioning</p></li><li><p>Upload to Artifact Registry with semantic version tags</p></li><li><p>Ready for scheduling in test/prod</p></li></ul><h3><strong>5. Production Execution</strong></h3><p>In production:</p><ul><li><p><strong>Scheduled</strong>: Vertex AI Pipeline Schedules trigger pipelines periodically</p></li><li><p><strong>Event-driven</strong>: Cloud Run Function triggers on new data arrival</p></li><li><p><strong>Manual</strong>: Direct pipeline submission for ad-hoc runs</p></li></ul><h2><strong>Key Design Principles</strong></h2><p>Our architecture follows several important principles:</p><h3><strong>1. Everything as Code</strong></h3><ul><li><p>Infrastructure: Terraform modules</p></li><li><p>Pipelines: Python with KFP SDK</p></li><li><p>Training logic: Containerized Python</p></li><li><p>Configuration: Version-controlled YAML</p></li></ul><h3><strong>2. Environment Parity</strong></h3><p>Test and production environments are identical, ensuring:</p><ul><li><p>What works in test will work in prod</p></li><li><p>No surprises during deployment</p></li><li><p>Reduced debugging time</p></li></ul><h3><strong>3. 
Immutable Artifacts</strong></h3><p>Once built, artifacts never change:</p><ul><li><p>Docker images tagged with git SHA and version</p></li><li><p>Compiled pipelines versioned in Artifact Registry</p></li><li><p>Models versioned in Model Registry</p></li></ul><h3><strong>4. Automated Testing</strong></h3><p>Multiple testing layers:</p><ul><li><p><strong>Unit tests</strong>: Component logic validation</p></li><li><p><strong>Integration tests</strong>: Pipeline compilation</p></li><li><p><strong>E2E tests</strong>: Full pipeline execution in dev</p></li><li><p><strong>Infrastructure tests</strong>: Terraform validation</p></li></ul><h3><strong>5. Security by Design</strong></h3><ul><li><p>Least privilege IAM (separate service accounts per pipeline)</p></li><li><p>No public bucket access</p></li><li><p>Secrets managed by GCP Secret Manager</p></li><li><p>Audit logging enabled</p></li></ul><h3><strong>6. Observability First</strong></h3><ul><li><p>Cloud Logging for all pipeline steps</p></li><li><p>Vertex AI Metadata for lineage tracking</p></li><li><p>Model monitoring for performance degradation</p></li><li><p>Alerting on failures and anomalies</p></li></ul><h2><strong>Getting Started</strong></h2><p>The complete code for this reference implementation is available on GitHub: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><p>To follow along, you&#8217;ll need:</p><ul><li><p>GCP account with billing enabled</p></li><li><p>Four GCP projects (or start with one for learning)</p></li><li><p>Basic knowledge of Python, Terraform, and ML concepts</p></li><li><p>Familiarity with Docker and CI/CD concepts</p></li></ul><h2><strong>Conclusion</strong></h2><p>Building production-ready ML systems is complex, but it doesn&#8217;t have to be mysterious. 
By following proven patterns and leveraging the right GCP services, you can create ML systems that are:</p><ul><li><p><strong>Reliable</strong>: Automated testing and validation</p></li><li><p><strong>Scalable</strong>: Leveraging managed GCP services</p></li><li><p><strong>Maintainable</strong>: Modular, reusable components</p></li><li><p><strong>Auditable</strong>: Complete lineage and versioning</p></li><li><p><strong>Secure</strong>: Proper IAM and access controls</p></li></ul><p>In the next article, we&#8217;ll dive into the infrastructure layer, exploring how Terraform modules provision and manage our Vertex AI environment across dev, test, and prod projects.</p><p><strong>Next in Series</strong>: Infrastructure as Code for ML: Terraform + Vertex AI</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/production-ready-MLOps-on-GCP">production-ready-MLOps-on-GCP</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack: Part 5]]></title><description><![CDATA[External Tool Integration via Model Context Protocol (MCP)]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-e64</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-e64</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:44:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!chAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p><a href="https://medium.com/google-cloud/building-distributed-multi-agent-systems-with-googles-ai-stack-part-1-c2f872f35bcf">Part 1: From Monolithic AI to Distributed Intelligence: Building Your First Multi-Agent System</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: Making Agents Talk: Agent-to-Agent (A2A) Protocol Deep Dive</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</a></p></li><li><p><a href="https://medium.com/google-cloud/building-distributed-multi-agent-systems-with-googles-ai-stack-part-4-e2d58bfb3957">Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit 
Problem</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5: External Tool Integration via Model Context Protocol (MCP)</a></strong> &#8592; You are here</p></li><li><p>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</p></li></ul><h2><strong>Welcome Back!</strong></h2><p>In <a href="https://medium.com/google-cloud/article-05-context-compaction.md">Part 4</a>, we solved the token limit problem with context compaction. Our multi-agent system now handles complex workflows beautifully.</p><p>But there&#8217;s one more capability we need: <strong>connecting to external services</strong>.</p><p>Our Project Manager agent needs to:</p><ul><li><p>Create tasks in Notion</p></li><li><p>Link tasks to projects</p></li><li><p>Work with any Notion database structure</p></li><li><p>Support multilingual property names</p></li></ul><p>Enter <strong>Model Context Protocol (MCP)</strong>, a standardized way to connect LLMs to external tools.</p><p>In this article, we&#8217;ll:</p><ul><li><p>Understand what MCP is and why it matters</p></li><li><p>Integrate the official Notion MCP server</p></li><li><p>Implement <strong>dynamic schema discovery</strong></p></li><li><p>Deploy MCP-enabled agents to Cloud Run</p></li></ul><p>Let&#8217;s connect our agents to the real world!</p><h2><strong>What is Model Context Protocol (MCP)?</strong></h2><p>MCP is a <strong>standardized protocol</strong> for connecting LLMs to external tools and data sources, created by Anthropic.</p><h3><strong>Why MCP?</strong></h3><p><strong>Without MCP</strong> (traditional approach):</p><pre><code># Custom integration for each service
def create_notion_task(title, status, due_date):
    # Custom API client
    # Custom request formatting
    # Custom error handling
    # Custom response parsing
    ...


def create_slack_message(channel, text):
    # Different custom implementation
    ...


def query_database(query):
    # Yet another custom implementation
    ...</code></pre><p><strong>With MCP</strong>:</p><pre><code># Single standard interface for all tools
mcp_toolset = McpToolset(connection_params=...)
# Agent automatically discovers and uses tools
agent = Agent(
    name="project_manager",
    tools=[mcp_toolset]  # All tools available!
)</code></pre><h3><strong>MCP Benefits</strong></h3><ul><li><p><strong>Standardized</strong>: One protocol for all external tools</p></li><li><p><strong>Discoverable</strong>: Tools describe themselves</p></li><li><p><strong>Composable</strong>: Mix and match tool servers</p></li><li><p><strong>Secure</strong>: Controlled access and permissions</p></li><li><p><strong>Community-driven</strong>: Growing ecosystem of MCP servers</p></li></ul><h2><strong>MCP Architecture</strong></h2><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;     Agent (LLM)                 &#9474;
&#9474;                                 &#9474;
&#9474;  &#8220;I need to create a task       &#9474;
&#9474;   in Notion...&#8221;                 &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#8595; Tool Discovery
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;   MCP Toolset (ADK Integration)  &#9474;
&#9474;                                  &#9474;
&#9474;   - Discovers available tools    &#9474;
&#9474;   - Formats requests             &#9474;
&#9474;   - Handles responses            &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#8595; Stdio/HTTP
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;   MCP Server                     &#9474;
&#9474;   (@notionhq/notion-mcp-server)  &#9474;
&#9474;                                  &#9474;
&#9474;   - Exposes Notion API as tools  &#9474;
&#9474;   - Handles authentication       &#9474;
&#9474;   - Provides tool descriptions   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#8595; HTTPS
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;   Notion API                     &#9474;
&#9474;                                  &#9474;
&#9474;   - Actual database operations   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!chAO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!chAO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 424w, https://substackcdn.com/image/fetch/$s_!chAO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 848w, https://substackcdn.com/image/fetch/$s_!chAO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!chAO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!chAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png" width="784" height="1075" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!chAO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 424w, https://substackcdn.com/image/fetch/$s_!chAO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 848w, https://substackcdn.com/image/fetch/$s_!chAO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!chAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43420be8-0b07-4cd4-a381-897496a7f1aa_784x1075.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Setting Up Notion for MCP</strong></h2><h3><strong>Step 1: Create Notion Integration</strong></h3><ol><li><p>Go to <a href="https://www.notion.so/my-integrations">notion.so/my-integrations</a></p></li><li><p>Click &#8220;New integration&#8221;</p></li><li><p>Name it &#8220;AI Creative Studio&#8221;</p></li><li><p>Select your workspace</p></li><li><p>Click &#8220;Submit&#8221;</p></li><li><p>Copy the <strong>Internal Integration Token</strong> (starts with <code>secret_</code>)</p></li></ol><h3><strong>Step 2: Create Two Notion Databases</strong></h3><p>We need <strong>TWO databases</strong>: Projects and Tasks</p><p><strong>Projects Database</strong>:</p><pre><code>Properties:
- Project name (Title) &#8592; required
- Status (Status: Not started, In progress, Completed)
- Priority (Select: High, Medium, Low)
- Dates (Date with start and end)
- Summary (Rich text)</code></pre><p><strong>Tasks Database</strong>:</p><pre><code>Properties:
- Task name (Title) &#8592; required
- Status (Status: Not started, In progress, Done)
- Priority (Select: High, Medium, Low)
- Due (Date)
- Project (Relation &#8594; Projects database)</code></pre><h3><strong>Step 3: Share Databases with Integration</strong></h3><ol><li><p>Open each database</p></li><li><p>Click &#8220;&#8230;&#8221; menu &#8594; &#8220;Add connections&#8221;</p></li><li><p>Select your &#8220;AI Creative Studio&#8221; integration</p></li><li><p>Repeat for both databases</p></li></ol><h3><strong>Step 4: Get Database IDs</strong></h3><p><strong>Projects Database</strong>:</p><pre><code>URL: https://www.notion.so/workspace/abc123...
                                     ^^^^^^^^
                                     This is the database ID</code></pre><p><strong>Tasks Database</strong>:</p><pre><code>URL: https://www.notion.so/workspace/def456...
                                     ^^^^^^^^
                                     This is the database ID</code></pre><h3><strong>Step 5: Configure Environment Variables</strong></h3><pre><code># .env
NOTION_API_KEY=secret_abc123...
NOTION_DATABASE_ID=abc123...  # Projects database
TASKS_DATABASE_ID=def456...   # Tasks database</code></pre><p><strong>Installing Notion MCP Server</strong></p><p>The Project Manager needs Node.js to run the Notion MCP server:</p><p><em><strong>Local Development</strong></em></p><pre><code># Install Node.js (if not already installed)
# macOS:
brew install node
# Ubuntu/Debian:
sudo apt install nodejs npm
# Verify
node --version  # Should be 18+
npm --version</code></pre><p><em><strong>Cloud Run (Dockerfile)</strong></em></p><pre><code>FROM python:3.12-slim
WORKDIR /app
# Install Node.js for MCP server
RUN apt-get update &amp;&amp; apt-get install -y \
    nodejs \
    npm \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy agent code
COPY agent.py .
# ... rest of Dockerfile</code></pre><h2><strong>Integrating MCP with Project Manager Agent</strong></h2><h3><strong>Step 1: Import MCP Tools</strong></h3><pre><code># agents/project_manager/agent.py
import os
import logging
from google.adk.agents import Agent
from google.adk.tools.mcp_tool import McpToolset, StdioConnectionParams
from mcp import StdioServerParameters
from dotenv import load_dotenv
load_dotenv()
logger = logging.getLogger("ai_creative_studio.project_manager")</code></pre><h3><strong>Step 2: Configure Notion MCP Server</strong></h3><pre><code>def create_project_manager():
    """Create Project Manager agent with Notion MCP integration"""
    # Get configuration
    notion_api_key = os.getenv("NOTION_API_KEY")
    projects_db_id = os.getenv("NOTION_DATABASE_ID")
    tasks_db_id = os.getenv("TASKS_DATABASE_ID", "2ceb1b31123181508894ddb3c597dc48")

    if not notion_api_key or not projects_db_id:
        logger.warning("&#9888;&#65039;  Notion credentials not set - agent will work without Notion integration")
        notion_toolset = None
    else:
        logger.info("&#9989; Configuring Notion MCP integration")
        # IMPORTANT: Notion MCP server expects NOTION_TOKEN, not NOTION_API_KEY
        mcp_env = {
            "NOTION_TOKEN": notion_api_key,  # &#8592; Note: NOTION_TOKEN
            "PATH": os.environ.get("PATH", "/usr/local/bin:/usr/bin:/bin")
        }
        # Configure Notion MCP server using globally installed version
        server_params = StdioServerParameters(
            command="notion-mcp-server",  # Use globally installed version
            args=[],
            env=mcp_env
        )
        # Create MCP toolset
        notion_toolset = McpToolset(
            connection_params=StdioConnectionParams(
                server_params=server_params,
                timeout=30.0  # 30 second timeout for MCP server startup
            )
        )
        logger.info("&#9989; Notion MCP toolset configured")

    # Create agent with MCP tools
    agent = Agent(
        name="project_manager",
        model="gemini-2.5-flash",
        instruction=get_system_instruction(projects_db_id, tasks_db_id),
        description="Project manager for creating timelines, tasks, and organizing deliverables",
        tools=[notion_toolset] if notion_toolset else []
    )
    logger.info("&#9989; Project Manager agent created")
    return agent

root_agent = create_project_manager()</code></pre><p><strong>Key points</strong>:</p><ul><li><p>Uses globally installed <code>@notionhq/notion-mcp-server</code> (pinned to v1.9.1 in Dockerfile)</p></li><li><p>Passes <code>NOTION_TOKEN</code> (not <code>NOTION_API_KEY</code>) to MCP server</p></li><li><p>Stdio transport (communication via stdin/stdout)</p></li><li><p>30-second timeout for server startup</p></li></ul><blockquote><p><em><strong>Note</strong>: We use the globally installed version instead of </em><code>npx -y</code><em> to control the exact MCP server version (see Version Pinning Considerations section below).</em></p></blockquote><h2><strong>Dynamic Schema Discovery</strong></h2><p>Here&#8217;s the problem: Hardcoded property names break easily.</p><h3><strong>The Hardcoded Approach (Fragile)</strong></h3><pre><code>INSTRUCTION = &#8220;&#8221;&#8220;
Create a page in Notion:
properties = {
    &#8220;Name&#8221;: {&#8221;title&#8221;: [{&#8221;text&#8221;: {&#8221;content&#8221;: &#8220;Project X&#8221;}}]},
    &#8220;Status&#8221;: {&#8221;status&#8221;: {&#8221;name&#8221;: &#8220;In progress&#8221;}},
    &#8220;Priority&#8221;: {&#8221;select&#8221;: {&#8221;name&#8221;: &#8220;High&#8221;}}
}</code></pre><p><strong>Problems</strong>:</p><ul><li><p>Breaks if property names change</p></li><li><p>Doesn&#8217;t work with multilingual databases (&#8220;Nom&#8221;, &#8220;Statut&#8221;, &#8220;Priorit&#233;&#8221;)</p></li><li><p>Requires code changes for different databases</p></li><li><p>No flexibility</p></li></ul><h3><strong>Dynamic Schema Discovery (Robust)</strong></h3><p>Instead, we <strong>discover the schema at runtime</strong>:</p><pre><code>def get_system_instruction(projects_db_id: str, tasks_db_id: str) -&gt; str:
    return f"""You are a Project Manager with Notion MCP integration.
**CRITICAL: Dynamic Schema Discovery**
Before creating any pages, you MUST discover the actual database schema.
**Step 1: Discover Projects Database Schema**
Use: API-retrieve-a-database
Database ID: {projects_db_id}
This returns:
- Actual property names (might be &#8220;Project name&#8221;, &#8220;Nom du projet&#8221;, etc.)
- Property types (title, status, select, date, etc.)
- Available options for status and select properties
- Relation configurations
**Step 2: Adapt to Actual Schema**
DO NOT assume property names! Use the discovered schema:
Example response:
{{
    &#8220;properties&#8221;: {{
        &#8220;Project name&#8221;: {{&#8221;type&#8221;: &#8220;title&#8221;}},  &#8592; Could be different!
        &#8220;&#201;tat&#8221;: {{&#8221;type&#8221;: &#8220;status&#8221;}},         &#8592; French!
        &#8220;Priorit&#233;&#8221;: {{&#8221;type&#8221;: &#8220;select&#8221;}},     &#8592; French!
        &#8220;Dates&#8221;: {{&#8221;type&#8221;: &#8220;date&#8221;}}
    }}
}}
Create pages using the ACTUAL property names from the schema.
**Step 3: Create Project Page**
Use: API-post-page
Database ID: {projects_db_id}
Properties: [Use discovered names]
**Step 4: Extract Project ID**
From the response, extract the page ID:
{{
    &#8220;id&#8221;: &#8220;abc-123-def-456&#8221;,  &#8592; Save this!
    ...
}}
**Step 5: Discover Tasks Database Schema**
Use: API-retrieve-a-database
Database ID: {tasks_db_id}
**Step 6: Create Task Pages**
Use: API-post-page (multiple times)
Database ID: {tasks_db_id}
Properties: [Use discovered names from tasks schema]
Link to project using the relation property:
{{
    &#8220;[Relation Property Name]&#8221;: {{
        &#8220;relation&#8221;: [{{&#8221;id&#8221;: &#8220;abc-123-def-456&#8221;}}]  &#8592; Project ID from step 4
    }}
}}
**Example Workflow:**
1. Discover Projects DB &#8594; Get actual property names
2. Create project page &#8594; Get project ID
3. Discover Tasks DB &#8594; Get actual property names
4. Create task 1 &#8594; Link to project ID
5. Create task 2 &#8594; Link to project ID
... (5-10 tasks total)
**IMPORTANT RULES:**
- NEVER hardcode property names like &#8220;Name&#8221;, &#8220;Status&#8221;, &#8220;Priority&#8221;
- ALWAYS use API-retrieve-a-database first
- ALWAYS adapt to the actual schema
- Property names can be in any language
- Relation properties link databases together
**Your Primary Output:**
Create a text-based project timeline with:
- Milestones
- Tasks and deadlines
- Team responsibilities
- Budget breakdown
THEN (if Notion credentials available):
- Create project and tasks in Notion
- Provide links to created pages
"""</code></pre><h3><strong>How It Works in Practice</strong></h3><pre><code>Agent: "I need to create a project in Notion"
    &#8595;
Step 1: Call API-retrieve-a-database (Projects DB)
    &#8595;
Response: {
    &#8220;properties&#8221;: {
        &#8220;Nom du projet&#8221;: {&#8221;type&#8221;: &#8220;title&#8221;},     &#8592; French!
        &#8220;Statut&#8221;: {&#8221;type&#8221;: &#8220;status&#8221;},
        &#8220;Priorit&#233;&#8221;: {&#8221;select&#8221;: {
            &#8220;options&#8221;: [
                {&#8221;name&#8221;: &#8220;Haute&#8221;},
                {&#8221;name&#8221;: &#8220;Moyenne&#8221;},
                {&#8221;name&#8221;: &#8220;Basse&#8221;}
            ]
        }}
    }
}
    &#8595;
Step 2: Agent adapts - uses &#8220;Nom du projet&#8221;, &#8220;Statut&#8221;, &#8220;Priorit&#233;&#8221;
    &#8595;
Step 3: Create page with ACTUAL property names
    &#8595;
&#9989; Works with any database structure!</code></pre><p><strong>Benefits</strong>:</p><ul><li><p><strong>Language-agnostic</strong>: Works with French, Spanish, Japanese databases</p></li><li><p><strong>Flexible</strong>: No hardcoded property names</p></li><li><p><strong>Resilient</strong>: Adapts to schema changes</p></li><li><p><strong>Portable</strong>: Same code works with different Notion workspaces</p></li></ul><h3><strong>Available MCP Tools</strong></h3><p>The Notion MCP server exposes these tools:</p><p><strong>API-retrieve-a-database</strong></p><pre><code># Get database schema
{
    &#8220;name&#8221;: &#8220;API-retrieve-a-database&#8221;,
    &#8220;description&#8221;: &#8220;Retrieve database schema and properties&#8221;,
    &#8220;parameters&#8221;: {
        &#8220;database_id&#8221;: &#8220;abc123...&#8221;
    }
}</code></pre><p><strong>API-post-page</strong></p><pre><code># Create a new page
{
    &#8220;name&#8221;: &#8220;API-post-page&#8221;,
    &#8220;description&#8221;: &#8220;Create a new page in a database&#8221;,
    &#8220;parameters&#8221;: {
        &#8220;parent&#8221;: {&#8221;database_id&#8221;: &#8220;abc123...&#8221;},
        &#8220;properties&#8221;: {
            &#8220;Title Property&#8221;: {&#8221;title&#8221;: [...]},
            &#8220;Status Property&#8221;: {&#8221;status&#8221;: {&#8221;name&#8221;: &#8220;In progress&#8221;}},
            ...
        }
    }
}</code></pre><p><strong>API-patch-page</strong></p><pre><code># Update an existing page
{
    &#8220;name&#8221;: &#8220;API-patch-page&#8221;,
    &#8220;description&#8221;: &#8220;Update page properties&#8221;,
    &#8220;parameters&#8221;: {
        &#8220;page_id&#8221;: &#8220;page-123...&#8221;,
        &#8220;properties&#8221;: {...}
    }
}</code></pre><p><strong>API-post-database-query</strong></p><pre><code># Query database with filters
{
    &#8220;name&#8221;: &#8220;API-post-database-query&#8221;,
    &#8220;description&#8221;: &#8220;Query database with filters and sorts&#8221;,
    &#8220;parameters&#8221;: {
        &#8220;database_id&#8221;: &#8220;abc123...&#8221;,
        &#8220;filter&#8221;: {...},
        &#8220;sorts&#8221;: [...]
    }
}</code></pre><h3><strong>Testing MCP Integration Locally</strong></h3><p><strong>Test Script</strong></p><pre><code># agents/project_manager/test_local_notion.py
import asyncio
from agent import root_agent
from google.adk import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
async def test_notion_integration():
    """Test Project Manager with Notion MCP"""
    brief = """
    Create a project timeline for the EcoFlow Instagram campaign.
    Campaign details:
    - Product: EcoFlow smart water bottle
    - Target: Millennials 25-34
    - Budget: $5,000
    - Duration: 2 weeks
    - Deliverables: 5 Instagram posts, visuals, timeline
    Please create:
    1. A text-based project timeline
    2. Project and tasks in Notion (if available)
    """
    print("&#128203; Testing Project Manager with Notion MCP\n")
    print(f"Brief: {brief}\n")
    session_service = InMemorySessionService()
    runner = Runner(
        app_name="project_manager",
        agent=root_agent,
        session_service=session_service
    )
    session_id = "test_notion"
    user_id = "test_user"
    try:
        await session_service.create_session(
            app_name="project_manager",
            user_id=user_id,
            session_id=session_id
        )
        print("project_manager &gt; ", end='', flush=True)
        async for event in runner.run_async(
            user_id=user_id,
            session_id=session_id,
            new_message=types.Content(role="user", parts=[types.Part(text=brief)])
        ):
            # ADK events carry text and tool calls in event.content.parts
            if not event.content or not event.content.parts:
                continue
            for part in event.content.parts:
                # Highlight MCP tool calls
                if part.function_call:
                    tool_name = part.function_call.name
                    if tool_name == "API-retrieve-a-database":
                        print("\n[MCP] Discovering database schema...")
                    elif tool_name == "API-post-page":
                        print("\n[MCP] Creating page in Notion...")
                if part.text:
                    print(part.text, end='', flush=True)
        print("\n\n&#9989; Project Manager test complete!")
    finally:
        await runner.close()
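
# Illustrative sketch only (hypothetical helper, not in the repository): in this
# article the agent itself reads the schema through API-retrieve-a-database. The
# same lookup in plain Python, assuming a response that maps property names to
# dicts carrying a "type" key, as in the examples above.
def property_names_of_type(schema, wanted_type):
    """Return the property names whose declared type equals wanted_type."""
    properties = schema.get("properties", {})
    return [name for name, spec in properties.items() if spec.get("type") == wanted_type]

# Usage: property_names_of_type(schema, "title") gives the title column,
# whatever language the database uses.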

if __name__ == "__main__":
    asyncio.run(test_notion_integration())</code></pre><h3><strong>Expected Output</strong></h3><pre><code>&#128203; Testing Project Manager with Notion MCP
project_manager &gt; I&#8217;ll create a project timeline for your EcoFlow campaign.
[MCP] Discovering database schema...
I&#8217;ve discovered the Projects database schema.
[MCP] Creating page in Notion...
&#10003; Created project: &#8220;EcoFlow Instagram Campaign&#8221;
Project URL: https://notion.so/...
[MCP] Discovering database schema...
I&#8217;ve discovered the Tasks database schema.
[MCP] Creating page in Notion...
&#10003; Created task: &#8220;Market Research&#8221;
[MCP] Creating page in Notion...
&#10003; Created task: &#8220;Content Creation (5 posts)&#8221;
[MCP] Creating page in Notion...
&#10003; Created task: &#8220;Visual Design&#8221;
... (more tasks)
**Project Timeline:**
Week 1:
- Days 1-2: Market Research &amp; Strategy
- Days 3-5: Content Creation (5 Instagram posts)
- Days 6-7: Visual Design &amp; Image Generation
Week 2:
- Days 1-2: Review &amp; Revisions
- Days 3-5: Final Approvals
- Days 6-7: Campaign Launch
**Notion Pages Created:**
&#9989; Project: EcoFlow Instagram Campaign
&#9989; 8 tasks created and linked to project
&#9989; Project Manager test complete!</code></pre><h2><strong>Two-Database Architecture</strong></h2><h3><strong>Why Two Databases?</strong></h3><p><strong>Projects Database</strong>: High-level campaigns</p><ul><li><p>One project = one campaign</p></li><li><p>Contains overview information</p></li><li><p>Has dates, budget, status</p></li></ul><p><strong>Tasks Database</strong>: Granular work items</p><ul><li><p>Multiple tasks per project</p></li><li><p>Detailed action items</p></li><li><p>Assigned to team members</p></li><li><p>Has deadlines, priorities</p></li></ul><p><strong>Relation</strong>: Tasks link to Projects via relation property</p><h3><strong>Creating the Relation</strong></h3><pre><code># In Tasks database, create &#8220;Project&#8221; relation property:
1. Add property &#8594; Relation
2. Name: &#8220;Project&#8221; (or any name)
3. Select: Projects database
4. Save
# Now tasks can link to projects:
{
    &#8220;Project&#8221;: {
        &#8220;relation&#8221;: [{&#8221;id&#8221;: &#8220;project-page-id&#8221;}]
    }
}</code></pre><h2><strong>Deploying MCP-Enabled Agent to Cloud Run</strong></h2><h3><strong>Updated Dockerfile</strong></h3><pre><code>FROM python:3.12-slim
WORKDIR /app
# Install Node.js for MCP server
RUN apt-get update &amp;&amp; apt-get install -y \
    gcc \
    curl \
    &amp;&amp; curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    &amp;&amp; apt-get install -y nodejs \
    &amp;&amp; rm -rf /var/lib/apt/lists/*
# Install Notion MCP server globally (pinned to 1.9.1)
RUN npm install -g @notionhq/notion-mcp-server@1.9.1
# Verify installations
RUN node --version &amp;&amp; npm --version
# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy agent code
COPY agent.py .
# Create non-root user
RUN useradd -m -u 1000 appuser &amp;&amp; chown -R appuser:appuser /app
USER appuser
# Environment
ENV PORT=8080
ENV HOST=0.0.0.0
EXPOSE 8080
CMD ["python", "agent.py"]</code></pre><h3><strong>Deployment with Notion Credentials</strong></h3><pre><code># deploy.sh
# Deploy with Notion environment variables
gcloud run deploy project-manager \
    --source=. \
    --region=us-central1 \
    --set-env-vars=NOTION_API_KEY=$NOTION_API_KEY,NOTION_DATABASE_ID=$NOTION_DATABASE_ID,TASKS_DATABASE_ID=$TASKS_DATABASE_ID \
    --memory=1Gi \
    --cpu=1 \
    --timeout=300
echo "&#9989; Project Manager deployed with Notion MCP integration"</code></pre><h3><strong>Troubleshooting MCP</strong></h3><p><strong>Issue 1: MCP Server Won&#8217;t Start</strong></p><p><strong>Error</strong>: <code>TimeoutError: MCP server did not start within 30 seconds</code></p><p><strong>Solutions</strong>:</p><pre><code># Increase timeout
connection_params=StdioConnectionParams(
    server_params=server_params,
    timeout=60.0  # Increase to 60 seconds
)
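# A minimal sketch, assuming the timeout is caused by a missing binary: check
# up front that the commands the MCP server needs resolve on PATH. The helper
# name missing_binaries is hypothetical; shutil.which is standard library.
import shutil

def missing_binaries(names=("node", "notion-mcp-server")):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# Usage: an empty list means both commands were found.
print("Missing on PATH:", missing_binaries())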
# Verify Node.js is installed
# docker exec -it container bash
# node --version</code></pre><p><strong>Issue 2: Notion Authentication Fails</strong></p><p><strong>Error</strong>: <code>unauthorized</code></p><p><strong>Solutions</strong>:</p><ul><li><p>Verify that <code>NOTION_API_KEY</code> holds a valid integration token (older tokens start with <code>secret_</code>, newer ones with <code>ntn_</code>)</p></li><li><p>Ensure both databases are shared with the integration</p></li><li><p>Check the environment variable name: the MCP server reads <code>NOTION_TOKEN</code>, not <code>NOTION_API_KEY</code></p></li></ul><p><strong>Issue 3: Property Not Found</strong></p><p><strong>Error</strong>: <code>Property "Name" does not exist</code></p><p><strong>Solution</strong>: Use dynamic schema discovery!</p><pre><code># Don&#8217;t hardcode &#8220;Name&#8221;
# Instead, discover the actual property name</code></pre><h2><strong>MCP Version Pinning Considerations</strong></h2><h3><strong>The Problem with Latest Versions</strong></h3><p>When deploying to cloud environments, you might encounter this issue:</p><pre><code># &#10060; DON&#8217;T DO THIS in cloud deployment
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "@notionhq/notion-mcp-server"],  # Downloads latest version!
    env=mcp_env
)</code></pre><p><strong>Why this is risky</strong>:</p><ul><li><p><code>npx -y</code> downloads the <strong>latest</strong> version every time</p></li><li><p>Version 2.0.0 introduced a UUID reformatting bug</p></li><li><p>Database IDs like <code>2ceb1b311231...</code> get reformatted to <code>2ceb1b31-1231-...</code> with hyphens</p></li><li><p>This breaks Notion API calls &#8594; 404 errors</p></li></ul><h3><strong>The Solution: Version Pinning</strong></h3><p><strong>1. Install specific version in Dockerfile</strong>:</p><pre><code># &#9989; Pin to known working version
RUN npm install -g @notionhq/notion-mcp-server@1.9.1</code></pre><p><strong>2. Use globally installed version</strong>:</p><pre><code># &#9989; Use the pinned version
server_params = StdioServerParameters(
    command="notion-mcp-server",  # Uses globally installed 1.9.1
    args=[],  # No npx needed!
    env=mcp_env
)</code></pre><h3><strong>Why Version 1.9.1?</strong></h3><ul><li><p><strong>Stable</strong>: No UUID reformatting bugs</p></li><li><p><strong>Tested</strong>: Works with all Notion database IDs</p></li><li><p><strong>Reliable</strong>: Consistent behavior across deployments</p></li><li><p><strong>Predictable</strong>: Same version every time</p></li></ul><h3><strong>Testing Different Versions</strong></h3><p>To test a new MCP server version before pinning:</p><pre><code># Install specific version locally
npm install -g @notionhq/notion-mcp-server@2.0.0
# Test with your agent
cd agents/project_manager
python test_local_notion.py
# Check logs for errors
# If stable, update Dockerfile version pin</code></pre><p><strong>Best Practice</strong>: Always pin to specific versions. Only upgrade after thorough testing.</p><p>We&#8217;ve built all the agents and integrated external tools. Now it&#8217;s time to <strong>deploy everything to the cloud</strong>!</p><p>Get ready to go from localhost to the cloud!</p><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><p><strong>Next</strong>: Part 6: Deploying to the Cloud &#8594;</p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack: Part 4]]></title><description><![CDATA[Scaling Multi-Agent Workflows: Solving the Token Limit Problem]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-d85</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-d85</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:44:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q2j9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent?utm_campaign=post-expanded-share&amp;utm_medium=web">Part 1: From Monolithic AI to Distributed Intelligence: Building Your First Multi-Agent System</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: Making Agents Talk: Agent-to-Agent (A2A) Protocol Deep Dive</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit Problem</a></strong> &#8592; You are here</p></li><li><p><a href="https://saoussenchaabnia.substack.com/publish/post/184416479">Part 5: External Tool Integration via Model Context Protocol (MCP)</a></p></li><li><p>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</p></li></ul><h2><strong>Welcome Back!</strong></h2><p>In <a href="https://medium.com/google-cloud/article-04-orchestrator.md">Part </a>3, we built an intelligent orchestrator that coordinates 5 specialist agents. It works beautifully&#8230; until it doesn&#8217;t.</p><h2><strong>The Problem</strong></h2><p>You test a complete 5-agent campaign workflow:</p><pre><code>&#9989; Agent 1 (Brand Strategist): Complete - 2,000 tokens output
&#9989; Agent 2 (Copywriter): Complete - 2,500 tokens output
&#9989; Agent 3 (Designer): Complete - 1,800 tokens output
&#10060; Agent 4 (Critic): Workflow stops!
&#10060; Agent 5 (Project Manager): Never reached!</code></pre><p><strong>What happened?</strong> You hit the <strong>output token limit</strong>.</p><p>In this article, we&#8217;ll solve this with <strong>Lazy Context Compaction</strong>, an elegant solution that:</p><ul><li><p>Summarizes older agent outputs intelligently</p></li><li><p>Preserves recent context quality</p></li><li><p>Scales workflows to 10+ agents</p></li><li><p>Reduces token costs</p></li></ul><p>Let&#8217;s fix it!</p><h2><strong>Understanding the Token Limit Problem</strong></h2><h3><strong>What Are Token Limits?</strong></h3><p>LLMs have two token limits:</p><ol><li><p><strong>Input limit</strong>: How much context they can read (e.g., 128K tokens)</p></li><li><p><strong>Output limit</strong>: How much they can generate in a single response (e.g., 8,192 tokens)</p></li></ol><p>Our problem is the <strong>output limit</strong>.</p><h3><strong>Why Multi-Agent Workflows Hit Limits</strong></h3><pre><code>User Brief: 200 tokens
&#8595;
Agent 1 Output: 2,000 tokens
Agent 2 Output: 2,500 tokens
Agent 3 Output: 1,800 tokens
-----------------------------------
Orchestrator&#8217;s response so far: 6,500 tokens
Agent 4 tries to start...
&#10060; Would exceed 8,192 token limit!
Workflow stops prematurely.</code></pre><h3><strong>Why This Happens</strong></h3><p>The orchestrator presents the <strong>full output</strong> from each agent to maintain transparency. After 3 agents, it&#8217;s already used most of its output budget!</p><p><strong>Traditional solutions</strong>:</p><ul><li><p>Increase <code>max_output_tokens</code> &#8594; Still fails with more agents</p></li><li><p>Summarize everything &#8594; Loses important context</p></li><li><p>Reduce agent outputs &#8594; Loses quality</p></li></ul><p><strong>Our solution</strong>: <strong>Lazy Context Compaction</strong></p><h2><strong>What is Lazy Context Compaction?</strong></h2><p>Lazy Context Compaction is a strategy that:</p><ol><li><p><strong>Compacts only when needed</strong> (after N agents)</p></li><li><p><strong>Summarizes older outputs</strong> (saves tokens)</p></li><li><p><strong>Preserves recent outputs</strong> (maintains quality)</p></li><li><p><strong>Uses LLM for summarization</strong> (intelligent compression)</p></li></ol><h2><strong>The Strategy</strong></h2><pre><code>Agents 1-3: Full context preserved
    &#8595;
After Agent 3: Compaction triggered
    &#8595;
Agents 1-2: Summarized &#8594; ~500 tokens
Agent 3: Full output preserved &#8594; ~1,800 tokens
    &#8595;
Agents 4-5: Execute with room to spare!</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q2j9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q2j9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 424w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 848w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 1272w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q2j9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png" width="784" height="1609" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1609,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Q2j9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 424w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 848w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 1272w, https://substackcdn.com/image/fetch/$s_!Q2j9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2124ebcf-5e99-4790-b8fa-e6b6f7785563_784x1609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Result</strong>: Workflow completes successfully with high-quality outputs.</p><h2><strong>Implementing Context Compaction with ADK</strong></h2><h3><strong>Step 1: Import Required Components</strong></h3><pre><code># agents/creative_director/agent.py
from google.adk.apps.llm_event_summarizer import LlmEventSummarizer
from google.adk.apps.app import EventsCompactionConfig
from google.adk.apps import App
from google.adk.agents import Agent
from google.genai.types import GenerateContentConfig
from google.adk.models import Gemini</code></pre><h3><strong>Step 2: Create Summarizer</strong></h3><pre><code># Use fast model for summarization (cost-efficient)
summarization_llm = Gemini(model="gemini-2.5-flash")
summarizer = LlmEventSummarizer(llm=summarization_llm)</code></pre><p><strong>Why Gemini Flash?</strong></p><ul><li><p>Fast summarization</p></li><li><p>Cost-efficient</p></li><li><p>High-quality summaries</p></li><li><p>Same model family as main agent</p></li></ul><h3><strong>Step 3: Configure Compaction</strong></h3><pre><code>compaction_config = EventsCompactionConfig(
    summarizer=summarizer,
    compaction_interval=3,  # Summarize after every 3 agents
    overlap_size=1          # Keep most recent agent&#8217;s full output
)</code></pre><p><strong>Configuration explained</strong>:</p><ul><li><p><code>compaction_interval=3</code>: Compact after every 3 agent completions</p></li><li><p><code>overlap_size=1</code>: Keep the most recent agent&#8217;s output in full (preserves quality)</p></li></ul><h3><strong>Step 4: Wrap Agent in App</strong></h3><pre><code>def create_creative_director():
    # ... (agent creation code from Part 3) ...
    agent = Agent(
        name="creative_director",
        model="gemini-2.5-flash",
        tools=agent_tools,
        instruction=system_instruction,
        generate_content_config=GenerateContentConfig(
            max_output_tokens=20000,  # Increased from 8,192
            temperature=0.2
        )
    )
    # Wrap agent in App with compaction config
    app = App(
        name="creative_director",
        root_agent=agent,
        events_compaction_config=compaction_config
    )
    logger.info("&#9989; App created with lazy context compaction")
    logger.info("   Compaction interval: 3 agents")
    logger.info("   Overlap size: 1 agent")
    logger.info("   Context will be summarized only when necessary")
    return app

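# --- Illustrative sketch (hypothetical helper, not part of the ADK API) ---
# A back-of-the-envelope model of compaction as this article describes it:
# every agent output except the most recent `overlap_size` ones is replaced by
# a single summary of assumed size `summary_tokens`. Real summary sizes vary.
# For example, estimate_compacted_context([2000, 2500, 1800], 500) gives 2300.
def estimate_compacted_context(output_tokens, summary_tokens, overlap_size=1):
    """Estimate context size (tokens) right after one compaction pass."""
    if overlap_size not in range(len(output_tokens) + 1):
        raise ValueError("overlap_size must be between 0 and len(output_tokens)")
    split = len(output_tokens) - overlap_size
    summarized, kept = output_tokens[:split], output_tokens[split:]
    # Older outputs collapse into one summary; recent ones stay in full.
    return (summary_tokens if summarized else 0) + sum(kept)
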
# Create app (not just agent)
root_agent = create_creative_director()</code></pre><p><strong>Important</strong>: We return an <code>App</code>, not just an <code>Agent</code>!</p><h2><strong>How It Works: Step by Step</strong></h2><h3><strong>5-Agent Workflow Example</strong></h3><p><strong>Phase 1: Agents 1&#8211;3 (No Compaction)</strong></p><pre><code>User: &#8220;Create complete Instagram campaign&#8221;
Orchestrator announces plan:
&#8220;I&#8217;ll coordinate our team:
1. Brand Strategist &#8594; research
2. Copywriter &#8594; posts
3. Designer &#8594; visuals
4. Critic &#8594; review
5. Project Manager &#8594; timeline&#8221;
Agent 1 (Brand Strategist) executes:
&#8594; Output: 2,000 tokens (FULL)
&#8594; Total context: 2,000 tokens
Agent 2 (Copywriter) executes:
&#8594; Output: 2,500 tokens (FULL)
&#8594; Total context: 4,500 tokens
Agent 3 (Designer) executes:
&#8594; Output: 1,800 tokens (FULL)
&#8594; Total context: 6,300 tokens</code></pre><p><strong>Status</strong>: No compaction yet. All outputs preserved.</p><p><strong>Phase 2: After Agent 3 (Compaction Triggered)</strong></p><pre><code>Compaction interval reached (3 agents)
    &#8595;
Summarizer analyzes:
- Agent 1 output (2,000 tokens)
- Agent 2 output (2,500 tokens)
    &#8595;
Creates intelligent summary:
- Brand Strategist research: [key points] (300 tokens)
- Copywriter posts: [post summaries] (200 tokens)
Total summary: 500 tokens
    &#8595;
Keeps Agent 3 full (overlap_size=1):
- Designer visuals: [full output] (1,800 tokens)
    &#8595;
New context size: 500 + 1,800 = 2,300 tokens</code></pre><p><strong>Saved</strong>: 4,000 tokens! (from 6,300 &#8594; 2,300)</p><h3><strong>Phase 3: Agents 4&#8211;5 (With Compacted Context)</strong></h3><pre><code>Agent 4 (Critic) executes:
&#8594; Context available: 2,300 tokens
&#8594; Has: Summary of research/posts + Full visual concepts
&#8594; Output: 1,500 tokens
&#8594; Total: 3,800 tokens
Agent 5 (Project Manager) executes:
&#8594; Context available: 3,800 tokens
&#8594; Output: 2,000 tokens
&#8594; Total: 5,800 tokens
&#9989; Workflow completes successfully!
&#9989; Under 8,192 token limit
&#9989; All 5 agents executed</code></pre><h2><strong>Configuration Strategies</strong></h2><h3><strong>Short Workflows (3&#8211;5 agents)</strong></h3><pre><code>compaction_config = EventsCompactionConfig(
    summarizer=summarizer,
    compaction_interval=3,  # Compact after 3 agents
    overlap_size=1          # Keep last 1 full
)</code></pre><p><strong>Use when</strong>:</p><ul><li><p>3&#8211;5 agents total</p></li><li><p>Moderate output per agent</p></li><li><p>Quality is critical</p></li></ul><h3><strong>Long Workflows (5&#8211;10 agents)</strong></h3><pre><code>compaction_config = EventsCompactionConfig(
    summarizer=summarizer,
    compaction_interval=4,  # Compact after 4 agents
    overlap_size=2          # Keep last 2 full
)</code></pre><p><strong>Use when</strong>:</p><ul><li><p>5&#8211;10 agents total</p></li><li><p>Need more recent context preserved</p></li><li><p>Complex interdependencies</p></li></ul><h3><strong>Very Long Workflows (10+ agents)</strong></h3><pre><code>compaction_config = EventsCompactionConfig(
    summarizer=summarizer,
    compaction_interval=5,  # Compact every 5 agents
    overlap_size=2          # Keep last 2 full
)</code></pre><p><strong>Use when</strong>:</p><ul><li><p>10+ agents total</p></li><li><p>Very complex workflows</p></li><li><p>Multiple rounds of compaction needed</p></li></ul><h3><strong>Quality-Critical Workflows</strong></h3><pre><code>compaction_config = EventsCompactionConfig(
    summarizer=summarizer,
    compaction_interval=3,
    overlap_size=2  # Keep last 2 full (more quality)
)</code></pre><p><strong>Use when</strong>:</p><ul><li><p>Quality &gt; token savings</p></li><li><p>Later agents need rich context</p></li><li><p>Acceptable to compact more frequently</p></li></ul><h2><strong>Testing Context Compaction</strong></h2><h3><strong>Test Script</strong></h3><pre><code># test_context_compaction.py
import asyncio
from creative_director.agent import root_agent
from google.adk import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
async def test_full_workflow():
    """Test complete 5-agent workflow with compaction"""
    brief = """
    Create a complete Instagram campaign for EcoFlow smart water bottle.
    Target: Health-conscious millennials (25-34).
    Budget: $5,000. Launch in 2 weeks.
    Include research, posts, visuals, review, and full timeline.
    """
    print("="*70)
    print("Testing 5-Agent Workflow with Context Compaction")
    print("="*70)
    print(f"\nBrief: {brief}\n")
    session_service = InMemorySessionService()
    runner = Runner(
        app=root_agent,  # root_agent is now an App, so pass it as app= (the app name comes from the App)
        session_service=session_service
    )
    session_id = "test_compaction"
    user_id = "test_user"
    agent_count = 0
    try:
        await session_service.create_session(
            app_name="creative_director",
            user_id=user_id,
            session_id=session_id
        )
        async for event in runner.run_async(
            user_id=user_id,
            session_id=session_id,
            new_message=types.Content(parts=[types.Part(text=brief)])
        ):
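            # ADK events carry their text in event.content.parts, not in a .text attribute
            if event.content and event.content.parts:
                text = "".join(part.text or "" for part in event.content.parts)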
                # Count agent completions
                if "&#10003;" in text and "complete" in text.lower():
                    agent_count += 1
                    print(f"\n[Agent {agent_count} completed]")
                # Detect compaction
                if "summariz" in text.lower():
                    print("\n[!] Context compaction triggered")
                print(text, end='', flush=True)
        print(f"\n\n{'='*70}")
        print("&#9989; Workflow complete!")
        print(f"   Agents executed: {agent_count}/5")
        print(f"{'='*70}")
        if agent_count == 5:
            print("&#9989; SUCCESS: All 5 agents completed (compaction worked!)")
        else:
            print(f"&#10060; PARTIAL: Only {agent_count}/5 agents completed")
    finally:
        await runner.close()

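# --- Illustrative sketch (hypothetical helper, not in the original script) ---
# The completion check in the loop above is a plain string heuristic, so the
# same rule can be exercised offline without calling any model. It assumes the
# orchestrator keeps to its check-mark confirmation format.
CHECK_MARK = "\u2713"  # the check mark character used in confirmations

def count_agent_completions(texts):
    """Count messages containing both a check mark and the word 'complete'."""
    return sum(
        1 for text in texts
        if CHECK_MARK in text and "complete" in text.lower()
    )
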
if __name__ == "__main__":
    asyncio.run(test_full_workflow())</code></pre><h2><strong>Expected Output</strong></h2><pre><code>======================================================================
Testing 5-Agent Workflow with Context Compaction
======================================================================
creative_director &gt; I&#8217;ll coordinate our team to create your campaign:
1. Brand Strategist &#8594; research
2. Copywriter &#8594; posts
3. Designer &#8594; visuals
4. Critic &#8594; review
5. Project Manager &#8594; timeline
Let&#8217;s begin!
[Agent 1 completed]
&#10003; Research complete. I received audience insights...
[Agent 2 completed]
&#10003; Copywriting complete. I received 5 Instagram posts...
[Agent 3 completed]
&#10003; Design complete. I received image concepts...
[!] Context compaction triggered
[Agent 4 completed]
&#10003; Review complete. Quality score: 8.5/10...
[Agent 5 completed]
&#10003; Timeline complete. Project plan created...
======================================================================
&#9989; Workflow complete!
   Agents executed: 5/5
======================================================================
&#9989; SUCCESS: All 5 agents completed (compaction worked!)</code></pre><h2><strong>Token Usage Comparison</strong></h2><h3><strong>Without Compaction</strong></h3><pre><code>Agent 1: 2,000 tokens output
Agent 2: 2,500 tokens output
Agent 3: 1,800 tokens output
-----------------------------------
Total: 6,300 tokens
Agent 4: &#10060; Cannot start (would exceed 8,192)
Result: FAILURE (3/5 agents completed)</code></pre><h3><strong>With Compaction (interval=3, overlap=1)</strong></h3><pre><code>Agent 1: 2,000 tokens output
Agent 2: 2,500 tokens output
Agent 3: 1,800 tokens output
Total before compaction: 6,300 tokens
&#8594; Compaction triggered
Agents 1-2 summarized: 500 tokens
Agent 3 preserved: 1,800 tokens
Total after compaction: 2,300 tokens
Agent 4: 1,500 tokens output (2,300 &#8594; 3,800 total)
Agent 5: 2,000 tokens output (3,800 &#8594; 5,800 total)
-----------------------------------
Final: 5,800 tokens (under 8,192 limit)
Result: &#9989; SUCCESS (5/5 agents completed)</code></pre><p><strong>Token savings</strong>: 4,000 tokens at the compaction point (6,300 &#8594; 2,300)</p><p><strong>Workflow success</strong>: 5/5 agents (100%) vs 3/5 (60%) without compaction</p><h2><strong>Quality Preservation</strong></h2><h3><strong>What Gets Summarized?</strong></h3><p>The summarizer preserves <strong>key information</strong>:</p><p><strong>Original Agent 1 Output</strong> (2,000 tokens):</p><pre><code>**Audience Insights:**
Health-conscious millennials (25-34) are increasingly seeking products...
[1,500 words of detailed analysis]
**Competitive Analysis:**
1. Hydro Flask - Established brand with strong loyalty...
[800 words of competitor details]
**Trending Topics:**
1. #SustainableLiving - 2.3M posts, growing 15% monthly...
[700 words of trend analysis]</code></pre><p><strong>Summarized Version</strong> (300 tokens):</p><pre><code>Research Summary: Target audience is health-conscious millennials (25-34)
valuing sustainability and smart features. Main competitors: Hydro Flask
(premium, no tech), S&#8217;well (design-focused), HidrateSpark (smart but basic).
Key trends: sustainable living, hydration tracking, minimalist aesthetics.
Opportunity: premium sustainable + smart features gap in market.</code></pre><p><strong>Key points preserved</strong>:</p><ul><li><p>Target audience demographics</p></li><li><p>Main competitors identified</p></li><li><p>Key trends listed</p></li><li><p>Strategic opportunity highlighted</p></li></ul><p><strong>Details lost</strong>:</p><ul><li><p>Full competitor analysis</p></li><li><p>Detailed trend statistics</p></li><li><p>Extended audience behaviors</p></li></ul><h3><strong>Quality vs Efficiency Trade-off</strong></h3><pre><code>overlap_size=0: Maximum compression, minimal quality
overlap_size=1: Balanced (recommended)
overlap_size=2: High quality, less compression
overlap_size=3: Maximum quality, minimal compression</code></pre><p><strong>Recommendation</strong>: Start with <code>overlap_size=1</code>, increase if quality issues arise.</p><h2><strong>When NOT to Use Compaction</strong></h2><h3><strong>Scenario 1: Short Workflows</strong></h3><pre><code># 2-agent workflow
brief = "Research the market and write 3 posts"
# No compaction needed - output is small</code></pre><h3><strong>Scenario 2: Small Outputs</strong></h3><pre><code># Each agent outputs &lt; 500 tokens
# Total for 5 agents: 2,500 tokens
# Well under limit - compaction unnecessary</code></pre><h3><strong>Scenario 3: Context-Critical Tasks</strong></h3><pre><code># Legal document review where every detail matters
# Better to split into multiple sessions than compress</code></pre><h3><strong>Advanced: Multiple Compaction Rounds</strong></h3><p>For very long workflows (15+ agents), multiple compaction rounds occur:</p><pre><code>Agents 1-3: Full
&#8594; Compaction 1: Agents 1-2 summarized, Agent 3 kept
Agents 4-6: Execute
&#8594; Compaction 2: Agents 1-5 summarized, Agent 6 kept
Agents 7-9: Execute
&#8594; Compaction 3: Agents 1-8 summarized, Agent 9 kept
... and so on</code></pre><p>Each round further compresses older context while preserving recent work.</p><h2><strong>Troubleshooting</strong></h2><h3><strong>Issue 1: Workflow Still Stops Early</strong></h3><p><strong>Solution</strong>: Reduce <code>compaction_interval</code>:</p><pre><code>compaction_interval=2  # Compact more frequently</code></pre><h3><strong>Issue 2: Quality Degradation</strong></h3><p><strong>Solution</strong>: Increase <code>overlap_size</code>:</p><pre><code>overlap_size=2  # Keep more recent context</code></pre><h3><strong>Issue 3: Too Much Compaction</strong></h3><p><strong>Solution</strong>: Increase <code>compaction_interval</code>:</p><pre><code>compaction_interval=4  # Compact less frequently</code></pre><h2><strong>Cost Analysis</strong></h2><h3><strong>Without Compaction (Failed Workflow)</strong></h3><pre><code>Agents executed: 3/5
Input tokens: 6,300 (wasted partial context)
Output tokens: 6,300
Cost: ~$0.05 (but incomplete workflow)
Value: $0 (workflow failed)</code></pre><h3><strong>With Compaction (Successful Workflow)</strong></h3><pre><code>Agents executed: 5/5 &#9989;
Input tokens: 8,000 (including summarization)
Output tokens: 5,800
Summarization cost: ~$0.01
Total cost: ~$0.07
Value: Complete campaign delivered &#9989;</code></pre><p><strong>ROI</strong>: About 40% higher cost (~$0.05 vs ~$0.07), but a completed workflow instead of a failed one!</p><p>Our agents can now scale to handle complex workflows. But what about integrating with <strong>external services</strong>?</p><p>In Part 5, we&#8217;ll add <strong>Model Context Protocol (MCP)</strong> integration to the Project Manager agent.</p><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><p><strong>Next</strong>: <a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5: External Tool Integration via MCP &#8594;</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack: Part 3]]></title><description><![CDATA[Building the Orchestrator: Coordinating Agents with the AgentTool Pattern]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-9a3</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-9a3</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:44:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fXge!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent?utm_campaign=post-expanded-share&amp;utm_medium=web">Part 1: From Monolithic AI to Distributed Intelligence: Building Your First Multi-Agent System</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: Making Agents Talk: Agent-to-Agent (A2A) Protocol Deep Dive</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</a></strong> &#8592; You are here</p></li><li><p><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit Problem</a></p></li><li><p><a href="https://saoussenchaabnia.substack.com/publish/post/184416479">Part 5: External Tool Integration via Model Context Protocol (MCP)</a></p></li><li><p>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</p></li></ul><h2><strong>Welcome Back!</strong></h2><p>In <a href="https://medium.com/google-cloud/article-03-a2a-protocol.md">Part 2</a>, we made our specialist agents accessible via the A2A protocol. Now we have:</p><ul><li><p>Brand Strategist (A2A server running)</p></li><li><p>Copywriter (A2A server running)</p></li><li><p>Designer (A2A server running)</p></li><li><p>Critic (A2A server running)</p></li><li><p>Project Manager (A2A server running)</p></li></ul><p>But there&#8217;s a problem: <strong>Who coordinates them?</strong></p><p>In this article, we&#8217;ll build the <strong>Creative Director</strong>, an intelligent orchestrator that:</p><ul><li><p>Routes requests to the right agents</p></li><li><p>Creates execution plans before acting</p></li><li><p>Passes context between agents</p></li><li><p>Handles errors gracefully</p></li><li><p>Lets the LLM decide the workflow</p></li></ul><p>This is where the <strong>AgentTool pattern</strong> shines. Let&#8217;s build it!</p><h2><strong>The Orchestration Challenge</strong></h2><h3><strong>Naive Approach: Hardcoded Workflow</strong></h3><pre><code>def create_campaign(brief):
    # Always call all agents in fixed order
    research = call_brand_strategist(brief)
    posts = call_copywriter(brief, research)
    visuals = call_designer(posts)
    feedback = call_critic(research, posts, visuals)
    timeline = call_project_manager(brief, feedback)
    return compile_results(research, posts, visuals, feedback, timeline)</code></pre><p><strong>Problems</strong>:</p><ul><li><p>Not flexible: if the user only wants research, all 5 agents run anyway</p></li><li><p>No intelligence: it can&#8217;t adapt to different requests</p></li><li><p>Error handling is hard: what if the copywriter fails?</p></li><li><p>Can&#8217;t revise: &#8220;make the copy more playful&#8221; requires code changes</p></li></ul><h3><strong>Better Approach: LLM-Driven Routing</strong></h3><p>What if the <strong>LLM decides</strong> which agents to call based on the user&#8217;s request?</p><pre><code>User: &#8220;Just do market research for eco water bottles&#8221;
&#8594; LLM: Call ONLY brand_strategist
User: &#8220;Create complete campaign with timeline&#8221;
&#8594; LLM: Call all 5 agents sequentially
User: &#8220;Make the copy more playful&#8221;
&#8594; LLM: Call copywriter again with feedback</code></pre><p>This is the <strong>AgentTool pattern</strong>!</p><h3><strong>What is the AgentTool Pattern?</strong></h3><p>The AgentTool pattern <strong>wraps remote A2A agents as callable tools</strong> that the orchestrator&#8217;s LLM can use.</p><p><strong>How It Works</strong></p><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;     Orchestrator (Agent)            &#9474;
&#9474;                                     &#9474;
&#9474;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9474;
&#9474;  &#9474;  Gemini 2.5 Flash (LLM)      &#9474;  &#9474;
&#9474;  &#9474;                              &#9474;  &#9474;
&#9474;  &#9474;  &#8220;I need to call the         &#9474;  &#9474;
&#9474;  &#9474;   brand_strategist tool&#8221;     &#9474;  &#9474;
&#9474;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9474;
&#9474;              &#8595;                      &#9474;
&#9474;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;  &#9474;
&#9474;  &#9474;  AgentTool (Wrapper)         &#9474;  &#9474;
&#9474;  &#9474;  - brand_strategist          &#9474;  &#9474;
&#9474;  &#9474;  - copywriter                &#9474;  &#9474;
&#9474;  &#9474;  - designer                  &#9474;  &#9474;
&#9474;  &#9474;  - critic                    &#9474;  &#9474;
&#9474;  &#9474;  - project_manager           &#9474;  &#9474;
&#9474;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
              &#8595; A2A Protocol
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;    Remote A2A Agents (Cloud Run)    &#9474;
&#9474;                                     &#9474;
&#9474;  &#8226; Brand Strategist                 &#9474;
&#9474;  &#8226; Copywriter                       &#9474;
&#9474;  &#8226; Designer                         &#9474;
&#9474;  &#8226; Critic                           &#9474;
&#9474;  &#8226; Project Manager                  &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><p><strong>Key Benefits</strong>:</p><ul><li><p><strong>LLM decides</strong> which agents to call</p></li><li><p><strong>Flexible routing</strong> &#8212; adapt to any request</p></li><li><p><strong>Reusability</strong> &#8212; call same agent multiple times</p></li><li><p><strong>Natural interface</strong> &#8212; function calling</p></li></ul><h2><strong>Building the Creative Director</strong></h2><h3><strong>Step 1: Import Dependencies</strong></h3><pre><code># agents/creative_director/agent.py
import os
import logging
from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent
from google.adk.tools.agent_tool import AgentTool
from google.genai.types import GenerateContentConfig
from dotenv import load_dotenv
load_dotenv()
logger = logging.getLogger("ai_creative_studio.creative_director")</code></pre><h3><strong>Step 2: Create Remote A2A Agents</strong></h3><pre><code>def create_creative_director():
    """
    Create the Creative Director orchestrator using AgentTool pattern.
    Features: Dynamic agent list, LLM-driven routing, planning-first.
    """
    logger.info("="*70)
    logger.info("Initializing Creative Director Orchestrator")
    logger.info("="*70)
    # Read agent URLs AT RUNTIME from environment variables
    # This is crucial for Vertex AI Agent Engine deployment
    copywriter_url = os.getenv("COPYWRITER_AGENT_URL")
    designer_url = os.getenv("DESIGNER_AGENT_URL")
    strategist_url = os.getenv("STRATEGIST_AGENT_URL")
    critic_url = os.getenv("CRITIC_AGENT_URL")
    pm_url = os.getenv("PM_AGENT_URL")
    # Build dynamic agent list and tools
    available_agents_list = []
    agent_tools = []
    # Brand Strategist
    if strategist_url:
        available_agents_list.append(
            "- **brand_strategist**: Researches market trends, competitors, and target audience insights"
        )
        strategist_agent = RemoteA2aAgent(
            name="brand_strategist",
            description="Brand strategist for market research, trend analysis, and competitive insights",
            agent_card=f"{strategist_url}/.well-known/agent.json"
        )
        agent_tools.append(AgentTool(agent=strategist_agent))
        logger.info(f"&#9989; Configured brand_strategist: {strategist_url}")
    # Copywriter
    if copywriter_url:
        available_agents_list.append(
            "- **copywriter**: Creates engaging social media captions and copy"
        )
        copywriter_agent = RemoteA2aAgent(
            name="copywriter",
            description="Expert social media copywriter for creating engaging captions and copy",
            agent_card=f"{copywriter_url}/.well-known/agent.json"
        )
        agent_tools.append(AgentTool(agent=copywriter_agent))
        logger.info(f"&#9989; Configured copywriter: {copywriter_url}")
    # Designer
    if designer_url:
        available_agents_list.append(
            "- **designer**: Generates AI image concepts and visual design prompts"
        )
        designer_agent = RemoteA2aAgent(
            name="designer",
            description="Creative visual designer for generating social media image concepts",
            agent_card=f"{designer_url}/.well-known/agent.json"
        )
        agent_tools.append(AgentTool(agent=designer_agent))
        logger.info(f"&#9989; Configured designer: {designer_url}")
    # Critic
    if critic_url:
        available_agents_list.append(
            "- **critic**: Reviews creative work and provides quality feedback"
        )
        critic_agent = RemoteA2aAgent(
            name="critic",
            description="Creative critic for reviewing campaign materials and providing constructive feedback",
            agent_card=f"{critic_url}/.well-known/agent.json"
        )
        agent_tools.append(AgentTool(agent=critic_agent))
        logger.info(f"&#9989; Configured critic: {critic_url}")
    # Project Manager
    if pm_url:
        available_agents_list.append(
            "- **project_manager**: Creates project timelines, tasks, and deliverables"
        )
        pm_agent = RemoteA2aAgent(
            name="project_manager",
            description="Project manager for creating timelines, tasks, and organizing campaign deliverables",
            agent_card=f"{pm_url}/.well-known/agent.json"
        )
        agent_tools.append(AgentTool(agent=pm_agent))
        logger.info(f"&#9989; Configured project_manager: {pm_url}")
    # Format available agents for prompt injection
    if available_agents_list:
        available_agents_text = "\n".join(available_agents_list)
        logger.info(f"&#9989; Configured {len(agent_tools)} specialist agents")
    else:
        available_agents_text = "&#9888;&#65039; No specialist agents configured. Set agent URLs in environment variables."
        logger.warning("&#9888;&#65039;  No specialist agents configured!")
    # ... (continued in next section)</code></pre><p><strong>Key Innovation</strong>: Dynamic agent discovery at runtime!</p><h3><strong>Step 3: The Planning-First Instruction</strong></h3><p>This is where the magic happens. We give the LLM clear instructions on <strong>how to orchestrate</strong>:</p><pre><code># Inject dynamic agent list into instruction template
    system_instruction = SYSTEM_INSTRUCTION_TEMPLATE.format(
        available_agents=available_agents_text
    )
    # ... (continued)</code></pre><h3><strong>The Instruction Template</strong></h3><pre><code>SYSTEM_INSTRUCTION_TEMPLATE = &#8220;&#8221;&#8220;You are an expert Creative Director AI Orchestrator for social media campaign creation.
**Your Role:**
You interpret campaign requests, create execution plans, and delegate to specialist agents.
You do NOT create content yourself - you manage the specialists who do.
**Your Available Specialist Tools:**
{available_agents}
**Core Directives &amp; Decision Making:**
1. **Understand User Intent &amp; Complexity**
   Carefully analyze the user&#8217;s request to determine the core task(s).
   **Request Classification:**
   - **SIMPLE**: &#8220;just do market research&#8221; &#8594; ONE agent needed
   - **COMPLEX**: &#8220;create complete campaign&#8221; &#8594; MULTIPLE agents needed
   **Examples:**
   - &#8220;Research eco-friendly water bottle market&#8221; &#8594; brand_strategist only
   - &#8220;Write 5 Instagram captions&#8221; &#8594; copywriter only
   - &#8220;Create complete campaign with timeline&#8221; &#8594; ALL 5 agents sequentially
2. **Task Planning &amp; Sequencing (CRITICAL - Do This BEFORE Delegating)**
   **Before calling ANY tool**, you MUST:
   - **Outline the complete plan** in your response to the user
   - **Example plan format:**
     &#8220;I&#8217;ll coordinate our team to create your campaign. Here&#8217;s my plan:
     1. **Brand Strategist** will research the market, competitors, and target audience
     2. **Copywriter** will create 5 Instagram posts using those insights
     3. **Designer** will generate image concepts for each post
     4. **Critic** will review all creative work for quality
     5. **Project Manager** will create the project timeline and deliverables
     Let&#8217;s begin with the market research!&#8221;
3. **Task Delegation &amp; Execution (Executing Your Plan)**
   For each agent in your plan, follow this EXACT sequence:
   **a) CALL** the appropriate tool with complete context
   - Include ALL relevant information from user&#8217;s request
   - For sequential tasks, include output from previous agents
   - Be explicit! Remote agents don&#8217;t have conversation history
   **b) WAIT** for tool_output
   - **DO NOT** proceed until you receive the complete response
   - **DO NOT** assume what the response will be
   **c) VERIFY** tool_output shows successful completion
   - Check that tool_output contains actual results (not an error)
   - **IF ERROR detected:** Go to step (e)
   - **IF SUCCESS:** Go to step (d)
   **d) CONFIRM** to user with specific details
   - Format: &#8220;&#10003; [Agent] complete. I received [brief summary of actual output]&#8221;
   - Examples:
     - &#8220;&#10003; Research complete. I received insights on target audience, 3 competitors, and 5 trending topics&#8221;
     - &#8220;&#10003; Copywriting complete. I received 5 Instagram posts with captions and hashtags&#8221;
   - **Then announce next step:** &#8220;Now moving to [next agent]...&#8221;
   **e) IF ERROR - STOP and Report**
   - **STOP the sequence immediately**
   - Report to user: &#8220;&#10060; Error in [Agent]: [exact error message from tool_output]&#8221;
   - Explain impact: &#8220;Cannot proceed with [next step] without [failed step results]&#8221;
   - Ask: &#8220;Would you like me to retry [failed agent] or adjust the approach?&#8221;
   - **DO NOT** continue to next agent until issue is resolved
4. **CRITICAL Success Verification**
   You **MUST**:
   - Wait for tool_output after EVERY agent tool call
   - Base your decision to proceed ENTIRELY on confirmation from tool_output
   - STOP if ANY tool call fails or produces ambiguous output
   - Report exact failure messages to the user
   You **MUST NOT**:
   - Assume a task was successful
   - Invent success messages
   - Proceed if the previous tool_output shows an error
   - Continue workflow if a critical step failed
   **Only state that a task is complete if the tool_output explicitly shows successful completion.**
5. **Error Handling &amp; Ambiguity Resolution**
   **When a Tool Fails:**
   1. **STOP** the workflow immediately
   2. **Report exact error:** &#8220;&#10060; Error in [Agent]: [exact error message]&#8221;
   3. **Explain impact:** &#8220;Cannot proceed with [next steps] without [failed step]&#8221;
   4. **Offer options:** &#8220;Would you like me to retry or adjust?&#8221;
   5. **Wait for user decision** before proceeding
6. **Communication with User**
   - **Transparency First:** Always present the complete response from each agent
   - **Progress Updates:** Inform user which agent is currently working
   - **No Hallucination:** NEVER say results are ready unless you received them
   - **Present Full Outputs:** Show the user exactly what each specialist produced
**CRITICAL WORKFLOW COMPLETION REQUIREMENT:**
When you create a plan listing multiple agents (e.g., &#8220;I&#8217;ll use agents 1, 2, 3, 4, 5&#8221;),
you MUST execute EVERY SINGLE agent in that plan. Do NOT stop after 2 or 3 agents -
continue until ALL planned agents have been called and have responded.
&#8220;&#8221;&#8220;</code></pre><p><strong>Key Patterns</strong>:</p><ul><li><p><strong>Planning-first</strong>: Create plan before execution</p></li><li><p><strong>Verification</strong>: Check each step succeeds</p></li><li><p><strong>Error handling</strong>: Stop and report on failure</p></li><li><p><strong>Context passing</strong>: Each agent gets previous outputs</p></li></ul><h2><strong>Instruction Design Tips: What Makes a Great Orchestrator</strong></h2><p>Writing effective orchestrator instructions is an art. Here are battle-tested tips:</p><h3><strong>1. Be Explicit About Workflow Steps</strong></h3><p>&#10060; <strong>Vague</strong>:</p><pre><code>&#8220;Call the agents to create a campaign&#8221;</code></pre><p>&#9989; <strong>Clear</strong>:</p><pre><code>&#8220;Before calling ANY tool, you MUST:
1. Outline the complete plan
2. Execute each step sequentially
3. Verify success before continuing
4. Report exact errors if any step fails&#8221;</code></pre><p><strong>Why</strong>: LLMs need explicit step-by-step instructions. Vague directions lead to skipped steps.</p><h3><strong>2. Use Imperative Language with Strong Verbs</strong></h3><p>&#10060; <strong>Weak</strong>:</p><pre><code>&#8220;You should probably check if the task completed&#8221;</code></pre><p>&#9989; <strong>Strong</strong>:</p><pre><code>&#8220;You MUST verify tool_output shows successful completion&#8221;</code></pre><p><strong>Magic words</strong>: MUST, NEVER, ALWAYS, DO NOT, CRITICAL, STOP</p><p><strong>Why</strong>: Strong imperatives reduce ambiguity and increase compliance.</p><h3><strong>3. Provide Concrete Examples</strong></h3><p>&#10060; <strong>Abstract</strong>:</p><pre><code>&#8220;Classify the request complexity&#8221;</code></pre><p>&#9989; <strong>Concrete</strong>:</p><pre><code>&#8220;Request Classification:
- SIMPLE: &#8216;just do market research&#8217; &#8594; brand_strategist only
- COMPLEX: &#8216;create complete campaign&#8217; &#8594; ALL 5 agents sequentially&#8221;</code></pre><p><strong>Why</strong>: Examples ground abstract concepts and reduce misinterpretation.</p><h3><strong>4. Specify Error Behavior Exactly</strong></h3><p>&#10060; <strong>Unclear</strong>:</p><pre><code>&#8220;Handle errors appropriately&#8221;</code></pre><p>&#9989; <strong>Precise</strong>:</p><pre><code>&#8220;When a Tool Fails:
1. STOP the workflow immediately
2. Report: &#8216;&#10060; Error in [Agent]: [exact error message]&#8217;
3. Explain: &#8216;Cannot proceed with [next step]&#8217;
4. Ask: &#8216;Would you like me to retry?&#8217;
5. WAIT for user decision&#8221;</code></pre><p><strong>Why</strong>: Precise error handling prevents the LLM from &#8220;creative&#8221; error recovery that makes things worse.</p><h3><strong>5. Prevent Hallucination with Negative Instructions</strong></h3><p>&#10060; <strong>Allowing hallucination</strong>:</p><pre><code>&#8220;Summarize the results&#8221;</code></pre><p>&#9989; <strong>Preventing hallucination</strong>:</p><pre><code>&#8220;You MUST NOT:
- Assume a task was successful
- Invent success messages like &#8216;Research complete&#8217;
- Proceed if tool_output shows an error
- Summarize or filter error messages
ONLY state a task is complete if tool_output explicitly shows success.&#8221;</code></pre><p><strong>Why</strong>: LLMs tend to fill gaps with plausible-sounding content. Explicit negative instructions make this less likely.</p><h3><strong>6. Use Formatting for Emphasis</strong></h3><pre><code>&#8220;**CRITICAL WORKFLOW COMPLETION REQUIREMENT:**
When you create a plan listing multiple agents,
you MUST execute EVERY SINGLE agent in that plan.
Do NOT stop after 2 or 3 agents.&#8221;</code></pre><p><strong>Techniques</strong>:</p><ul><li><p><strong>Bold</strong> for critical points</p></li><li><p>ALL CAPS for emphasis</p></li><li><p>Numbered lists for sequences</p></li><li><p>Bullet points for options</p></li></ul><p><strong>Why</strong>: Visual hierarchy helps LLMs prioritize instructions.</p><h3><strong>7. Include Verification Checkpoints</strong></h3><pre><code>&#8220;**Workflow checklist before finishing:**
- &#10003; Did I announce a plan with N agents?
- &#10003; Have I called ALL N agents from my plan?
- &#10003; Did each agent respond successfully?
- &#10003; Am I presenting complete results from ALL agents?
If you cannot answer YES to all, DO NOT finish.&#8221;</code></pre><p><strong>Why</strong>: Explicit checkpoints catch premature completions.</p><h3><strong>8. Design for Revision Loops</strong></h3><pre><code>&#8220;**Revision Workflow:**
If Critic returns &#8216;Status: NEEDS_REVISION&#8217;:
1. Announce to user what needs improvement
2. Call the relevant agent (copywriter/designer) with:
   - Original brief
   - First version
   - Critic&#8217;s exact feedback
3. Maximum 1 revision per agent (prevent infinite loops)
4. Proceed to next step with revised version&#8221;</code></pre><p><strong>Why</strong>: Structured revision logic improves quality while capping the number of extra agent calls, so cost stays bounded.</p><h3><strong>9. Set Clear Role Boundaries</strong></h3><pre><code>&#8220;**Your Role:**
You do NOT create content yourself - you manage specialists.
**DO:**
- Interpret requests
- Create execution plans
- Delegate to specialists
- Verify outputs
- Handle errors
**DO NOT:**
- Write campaign copy
- Create visual concepts
- Generate research insights&#8221;</code></pre><p><strong>Why</strong>: Clear boundaries prevent the orchestrator from doing specialist work.</p><h3><strong>10. Test with Edge Cases</strong></h3><p>After writing instructions, test with:</p><pre><code>&#9989; &#8220;Research coffee market&#8221; (simple, 1 agent)
&#9989; &#8220;Create complete campaign&#8221; (complex, all 5 agents)
&#9989; &#8220;Make posts more professional&#8221; (revision, context required)
&#10060; Simulate agent failure (does it stop gracefully?)
&#10060; Ambiguous request (does it ask for clarification?)</code></pre><p><strong>Pro Tip</strong>: Add example patterns directly in the instruction to show expected behavior!</p><h3><strong>Common Instruction Mistakes to Avoid</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fXge!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fXge!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 424w, https://substackcdn.com/image/fetch/$s_!fXge!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 848w, https://substackcdn.com/image/fetch/$s_!fXge!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 1272w, https://substackcdn.com/image/fetch/$s_!fXge!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fXge!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png" width="557" height="332" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:557,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fXge!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 424w, https://substackcdn.com/image/fetch/$s_!fXge!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 848w, https://substackcdn.com/image/fetch/$s_!fXge!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 1272w, https://substackcdn.com/image/fetch/$s_!fXge!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c4e855c-12a2-499d-a240-6719ae23df3b_557x332.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Step 4: Create the Agent</strong></h3><pre><code># Create the orchestrator as an Agent whose tools are the AgentTools
    generation_config = GenerateContentConfig(
        max_output_tokens=20000,  # Increased to support full 5-agent workflows
        temperature=0.2,  # Lower temperature for consistent execution
    )
    agent = Agent(
        name="creative_director",
        model="gemini-2.5-flash",
        description="Creative Director orchestrator with lazy context compaction",
        instruction=system_instruction,
        tools=agent_tools,  # &#128295; AgentTools! LLM can call these as tools
        generate_content_config=generation_config,
    )
    logger.info("&#9989; Agent created successfully")
    logger.info("="*70)
    return agent

# Create root_agent for deployment
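root_agent = create_creative_director()</code></pre><p><strong>Important</strong>: In ADK, <code>Agent</code> is an alias for <code>LlmAgent</code>, so either name works. What makes this agent an orchestrator is the <code>tools=agent_tools</code> argument, which exposes each specialist to the LLM as a callable tool.</p><h3><strong>Testing the Orchestrator Locally</strong></h3><p><strong>Setup</strong></p><pre><code># In .env file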
STRATEGIST_AGENT_URL=http://localhost:8082
COPYWRITER_AGENT_URL=http://localhost:8083
DESIGNER_AGENT_URL=http://localhost:8084
CRITIC_AGENT_URL=http://localhost:8085
PM_AGENT_URL=http://localhost:8086
GOOGLE_API_KEY=your_api_key</code></pre><p><strong>Start All Agent Servers</strong></p><p>Terminal 1:</p><pre><code>cd agents/brand_strategist
python agent.py  # Runs on 8082</code></pre><p>Terminal 2:</p><pre><code>cd agents/copywriter
PORT=8083 python agent.py</code></pre><p>Terminal 3:</p><pre><code>cd agents/designer
PORT=8084 python agent.py</code></pre><p>Terminal 4:</p><pre><code>cd agents/critic
PORT=8085 python agent.py</code></pre><p>Terminal 5:</p><pre><code>cd agents/project_manager
PORT=8086 python agent.py</code></pre><p><strong>Test the Orchestrator</strong></p><pre><code># test_orchestrator_local.py
import asyncio
from creative_director.agent import root_agent
from google.adk import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
async def test_simple_request():
    """Test with simple request - should call only 1 agent"""
    brief = "Research the market for eco-friendly smart water bottles"
    print("&#127916; Testing Simple Request (1 agent expected)\n")
    print(f"Brief: {brief}\n")
    session_service = InMemorySessionService()
    runner = Runner(
        app_name="creative_director",
        agent=root_agent,
        session_service=session_service
    )
    session_id = "test_simple"
    user_id = "test_user"
    try:
        await session_service.create_session(
            app_name="creative_director",
            user_id=user_id,
            session_id=session_id
        )
        async for event in runner.run_async(
            user_id=user_id,
            session_id=session_id,
            new_message=types.Content(role="user", parts=[types.Part(text=brief)])
        ):
            # Text arrives in event.content.parts; events without content are skipped
            if event.content and event.content.parts:
                for part in event.content.parts:
                    if part.text:
                        print(part.text, end='', flush=True)
        print("\n\n&#9989; Simple request complete!")
    finally:
        await runner.close()

async def test_complex_request():
    """Test with complex request - should call all 5 agents"""
    brief = """
    Create a complete Instagram campaign for EcoFlow smart water bottle.
    Target: Health-conscious millennials (25-34).
    Budget: $5,000. Launch in 2 weeks.
    Include full campaign with timeline and tasks.
    """
    print("\n" + "="*70)
    print("&#127916; Testing Complex Request (5 agents expected)")
    print("="*70 + "\n")
    print(f"Brief: {brief}\n")
    session_service = InMemorySessionService()
    runner = Runner(
        app_name="creative_director",
        agent=root_agent,
        session_service=session_service
    )
    session_id = "test_complex"
    user_id = "test_user"
    try:
        await session_service.create_session(
            app_name="creative_director",
            user_id=user_id,
            session_id=session_id
        )
        async for event in runner.run_async(
            user_id=user_id,
            session_id=session_id,
            new_message=types.Content(role="user", parts=[types.Part(text=brief)])
        ):
            # Text arrives in event.content.parts; events without content are skipped
            if event.content and event.content.parts:
                for part in event.content.parts:
                    if part.text:
                        print(part.text, end='', flush=True)
        print("\n\n&#9989; Complex request complete!")
    finally:
        await runner.close()
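
# --- Illustrative addition (not part of the original script) ---
# The tests above print the response but never check how many agents were
# actually called. A minimal sketch of such a check, assuming each ADK event
# exposes get_function_calls() returning objects with a .name attribute.
# The helper name collect_tool_names is hypothetical.
def collect_tool_names(events):
    """Return the tool (agent) names called across events, in first-call order."""
    names = []
    for event in events:
        for call in event.get_function_calls():
            if call.name not in names:
                names.append(call.name)
    return names

# Usage idea: append each event to a list inside the `async for` loop, then
# compare len(collect_tool_names(collected)) with the expected agent count
# (1 for the simple request, 5 for the complex one).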

async def main():
    # Test simple request
    await test_simple_request()
    # Test complex request
    await test_complex_request()

if __name__ == "__main__":
    asyncio.run(main())</code></pre><p><strong>Expected Output</strong></p><p><strong>Simple Request</strong>:</p><pre><code>&#127916; Testing Simple Request (1 agent expected)

Brief: Research the market for eco-friendly smart water bottles
creative_director &gt; I&#8217;ll help you research the eco-friendly smart water bottle market.
Let me use our Brand Strategist to gather market insights.
**Audience Insights:**
[Research results from Brand Strategist...]
**Competitive Analysis:**
[Competitor analysis...]
**Trending Topics:**
[Current trends...]
&#10003; Research complete. I received insights on target audience, 3 main competitors, and 5 trending topics.
&#9989; Simple request complete!</code></pre><p><strong>Complex Request</strong>:</p><pre><code>&#127916; Testing Complex Request (5 agents expected)
creative_director &gt; I&#8217;ll coordinate our team to create your complete Instagram campaign. Here&#8217;s my plan:
1. **Brand Strategist** will research the market, competitors, and target audience
2. **Copywriter** will create 5 Instagram posts using those insights
3. **Designer** will generate image concepts for each post
4. **Critic** will review all creative work for quality
5. **Project Manager** will create the project timeline and deliverables
Let&#8217;s begin with the market research!
[Calls brand_strategist...]
&#10003; Research complete. I received audience insights, competitive analysis, and trending topics.
Now moving to copywriting...
[Calls copywriter...]
&#10003; Copywriting complete. I received 5 Instagram posts with captions and hashtags.
Now creating visual concepts...
[Calls designer...]
&#10003; Design complete. I received image concepts for all 5 posts.
Now getting quality review...
[Calls critic...]
&#10003; Review complete. Quality score: 8.5/10
Finally, creating project timeline...
[Calls project_manager...]
&#10003; Timeline complete. Project plan created with tasks.
Here&#8217;s your complete campaign:
[Full campaign output...]
&#9989; Complex request complete!</code></pre><h2><strong>How the LLM Decides</strong></h2><p>The orchestrator&#8217;s LLM analyzes the user&#8217;s request and decides which tools to call:</p><h3><strong>Decision Tree</strong></h3><pre><code>User Request
    &#8595;
Analyze keywords &amp; intent
    &#8595;
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;                    &#9474;                    &#9474;
&#8220;research&#8221;           &#8220;complete campaign&#8221;  &#8220;make it more playful&#8221;
&#8220;just write&#8221;         &#8220;full package&#8221;       &#8220;try different visuals&#8221;
&#8220;review this&#8221;        &#8220;with timeline&#8221;      &#8220;revise the copy&#8221;
&#9474;                    &#9474;                    &#9474;
&#8595;                    &#8595;                    &#8595;
Call 1 agent         Call all 5           Call 1 agent again
                     sequentially         (revision)</code></pre><h3><strong>Example Requests and Routing</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qorP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qorP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 424w, https://substackcdn.com/image/fetch/$s_!qorP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 848w, https://substackcdn.com/image/fetch/$s_!qorP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 1272w, https://substackcdn.com/image/fetch/$s_!qorP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qorP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png" width="703" height="281" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:281,&quot;width&quot;:703,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qorP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 424w, https://substackcdn.com/image/fetch/$s_!qorP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 848w, https://substackcdn.com/image/fetch/$s_!qorP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 1272w, https://substackcdn.com/image/fetch/$s_!qorP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0397cf3-b65d-4ab5-9e9e-59798714db15_703x281.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Context Passing Between Agents</strong></h3><p>The orchestrator passes context from one agent to the next:</p><pre><code># Pseudo-code of what the LLM does
# Step 1: Call brand_strategist
strategist_output = call_tool(
    name="brand_strategist",
    query=user_brief
)
# Step 2: Call copywriter with strategist's output
copywriter_output = call_tool(
    name="copywriter",
    query=f"{user_brief}\n\nResearch Insights:\n{strategist_output}"
)
# Step 3: Call designer with copywriter's output
designer_output = call_tool(
    name="designer",
    query=f"Create visuals for these posts:\n{copywriter_output}"
)
)
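# --- Illustrative aside (not from the repository) ---
# The steps above all follow one shape: brief first, then labeled outputs
# from earlier agents. A minimal runnable sketch of that assembly, assuming
# plain-text outputs; build_query and its section labels are hypothetical.
def build_query(user_brief, previous_outputs):
    """Join the brief with labeled outputs from earlier agents."""
    sections = [user_brief.strip()]
    for label, output in previous_outputs.items():
        sections.append(f"{label}:\n{output.strip()}")
    return "\n\n".join(sections)

# Example: a query the critic might receive after steps 2 and 3
critic_query = build_query(
    "Review this campaign",
    {"Posts": "Post 1 ...", "Visuals": "Concept 1 ..."},
)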
# And so on...</code></pre><p><strong>Key point</strong>: Each agent receives <strong>relevant context</strong> from previous agents!</p><h3><strong>Error Handling Example</strong></h3><p>What happens when an agent fails?</p><pre><code># Scenario: Copywriter fails
User: &#8220;Create complete campaign&#8221;
Orchestrator: &#8220;I&#8217;ll coordinate our team... [shows plan]&#8221;
[Calls brand_strategist - SUCCESS]
Orchestrator: &#8220;&#10003; Research complete...&#8221;
[Calls copywriter - FAILS]
Orchestrator: &#8220;&#10060; Error in copywriter: Connection timeout
Cannot proceed with designer and critic without the social media posts from the copywriter.
Would you like me to:
1. Retry the copywriter
2. Skip copywriting and continue with what we have
3. Abort the workflow
Please let me know how you&#8217;d like to proceed.&#8221;
[STOPS - waits for user input]</code></pre><p><strong>The workflow stops gracefully</strong> and asks the user what to do!</p><h2><strong>Advantages of the AgentTool Pattern</strong></h2><h3><strong>1. Flexibility</strong></h3><pre><code># Same orchestrator handles different requests:
&#8220;Just research&#8221; &#8594; Calls 1 agent
&#8220;Complete campaign&#8221; &#8594; Calls 5 agents
&#8220;Make it playful&#8221; &#8594; Calls copywriter again
&#8220;Add more visuals&#8221; &#8594; Calls designer again</code></pre><h3><strong>2. Reusability</strong></h3><pre><code># Call same agent multiple times for revisions
User: &#8220;Create posts&#8221;
&#8594; [Calls copywriter]
User: &#8220;Make them more professional&#8221;
&#8594; [Calls copywriter again with feedback]
User: &#8220;Add more CTAs&#8221;
&#8594; [Calls copywriter third time]</code></pre><h3><strong>3. Natural Error Recovery</strong></h3><pre><code># LLM can handle errors intelligently
If critic fails:
&#8594; LLM decides whether to retry or skip
&#8594; Can adjust plan on the fly
&#8594; Asks user for guidance when needed</code></pre><h2><strong>Dynamic Agent Discovery Benefits</strong></h2><p>Our orchestrator discovers agents <strong>at runtime</strong>:</p><pre><code># Local development
STRATEGIST_AGENT_URL=http://localhost:8082
# Production
STRATEGIST_AGENT_URL=https://brand-strategist-xxx.run.app</code></pre><p><strong>Benefits</strong>:</p><ul><li><p><strong>Environment-agnostic</strong>: No code changes between environments</p></li><li><p><strong>Graceful degradation</strong>: Missing agents just aren&#8217;t listed</p></li><li><p><strong>Easy updates</strong>: Change URLs without touching the code</p></li><li><p><strong>Testing</strong>: Point to test vs production agents</p></li></ul><h2><strong>Using adk web for Interactive Testing</strong></h2><p>Test the orchestrator interactively:</p><pre><code>cd agents/creative_director
adk web --log_level DEBUG</code></pre><p>Open http://localhost:8000 and try different requests:</p><ul><li><p>&#8220;Research eco water bottles&#8221;</p></li><li><p>&#8220;Create 3 Instagram posts&#8221;</p></li><li><p>&#8220;Complete campaign with timeline&#8221;</p></li><li><p>&#8220;Make the copy more playful&#8221;</p></li></ul><p>Watch the LLM decide which agents to call!</p><h2><strong>Common Patterns</strong></h2><h3><strong>Pattern 1: Research Only</strong></h3><pre><code>User: &#8220;Research the market for sustainable fashion&#8221;
&#8594; Orchestrator calls: brand_strategist
&#8594; Returns: Research insights only</code></pre><h3><strong>Pattern 2: Content Creation</strong></h3><pre><code>User: &#8220;Write 5 TikTok scripts for a coffee brand&#8221;
&#8594; Orchestrator might call:
  1. brand_strategist (quick trend check)
  2. copywriter (create scripts)
&#8594; Returns: 5 TikTok scripts</code></pre><h3><strong>Pattern 3: Complete Campaign</strong></h3><pre><code>User: &#8220;Create full Instagram campaign with timeline&#8221;
&#8594; Orchestrator calls all 5 agents:
  1. brand_strategist &#8594; research
  2. copywriter &#8594; posts
  3. designer &#8594; visuals
  4. critic &#8594; review
  5. project_manager &#8594; timeline
&#8594; Returns: Complete campaign package</code></pre><h3><strong>Pattern 4: Iterative Refinement</strong></h3><pre><code>User: &#8220;Create 3 posts&#8221;
&#8594; [Orchestrator calls copywriter]
User: &#8220;Make them more formal&#8221;
&#8594; [Orchestrator calls copywriter again with feedback]
User: &#8220;Add more hashtags&#8221;
&#8594; [Orchestrator calls copywriter third time]</code></pre><h3><strong>Pattern 5: Critic Revision Workflow (Quality Improvement Loop)</strong></h3><pre><code>User: &#8220;Create complete campaign for luxury watches&#8221;
&#8594; Orchestrator plans:
  1. brand_strategist &#8594; market research
  2. copywriter &#8594; create posts
  3. designer &#8594; create visuals
  4. critic &#8594; review everything
  5. [Automatic revisions if needed] &#8592; KEY!
  6. project_manager &#8594; timeline
&#8594; Execution:
  Step 1-3: Complete successfully
  Step 4: Critic reviews and returns:
    **POSTS REVIEW:**
    - Score: 6/10
    - Status: NEEDS_REVISION
    - Issue: Tone too casual for luxury audience
    **VISUALS REVIEW:**
    - Score: 8/10
    - Status: APPROVED &#10003;
  Step 5: Orchestrator sees &#8220;NEEDS_REVISION&#8221;
    &#8594; Announces to user: &#8220;Critic identified improvements needed&#8221;
    &#8594; Calls copywriter again with:
      - Original brief
      - First version of posts
      - Critic&#8217;s specific feedback
    &#8594; Copywriter creates revised posts
  Step 6: Project Manager receives:
    - Revised (approved) posts &#10003;
    - Approved visuals &#10003;
    - Complete campaign ready!
&#8594; Returns: High-quality campaign with automatic QA</code></pre><p><strong>How it works</strong>:</p><ol><li><p>Critic provides <strong>structured feedback</strong> with Status: <code>APPROVED</code> or <code>NEEDS_REVISION</code></p></li><li><p>Orchestrator <strong>parses the feedback</strong> automatically</p></li><li><p>If revision needed, orchestrator <strong>calls relevant agent</strong> with critic&#8217;s feedback</p></li><li><p><strong>Maximum 1 revision</strong> per agent (prevents infinite loops)</p></li><li><p>Only <strong>quality-approved deliverables</strong> reach Project Manager</p></li></ol><p><strong>Why this matters</strong>:</p><ul><li><p>Built-in quality assurance</p></li><li><p>No manual intervention needed</p></li><li><p>Consistent quality standards</p></li><li><p>Prevents flawed work from reaching final output</p></li><li><p>Cost-efficient (max 1 revision)</p></li></ul><p><strong>Agent mapping</strong>:</p><ul><li><p>Posts need revision &#8594; Call <strong>copywriter</strong> with feedback</p></li><li><p>Visuals need revision &#8594; Call <strong>designer</strong> with feedback</p></li><li><p>Both need revision &#8594; Call both agents sequentially</p></li></ul><p>This revision workflow ensures every campaign meets quality standards before delivery!</p><p>We have a working orchestrator, but there&#8217;s a problem: <strong>token limits</strong>.</p><p>When running all 5 agents, the context can exceed the model&#8217;s token limit, causing the workflow to stop prematurely. 
In Part 4, we&#8217;ll solve this with <strong>Lazy Context Compaction.</strong></p><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><p><strong>Next</strong>: <a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4: Scaling with Context Compaction &#8594;</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack: Part 2]]></title><description><![CDATA[Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-2a2</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent-2a2</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:44:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9U_H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent?utm_campaign=post-expanded-share&amp;utm_medium=web">Part 1: From Monolithic AI to Distributed 
Intelligence: Building Your First Multi-Agent System</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: Making Agents Talk: Agent-to-Agent (A2A) Protocol Deep Dive</a></strong> &#8592; You are here</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit Problem</a></p></li><li><p><a href="https://saoussenchaabnia.substack.com/publish/post/184416479">Part 5: External Tool Integration via Model Context Protocol (MCP)</a></p></li><li><p>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</p></li></ul><h2><strong>Welcome Back!</strong></h2><p>In <a href="https://medium.com/google-cloud/article-02-building-first-agent.md">Part </a>1, we built three specialist agents that run locally. 
But here&#8217;s the problem: <strong>they can&#8217;t talk to each other yet</strong>.</p><p>In this article, we&#8217;ll solve that using the <strong>Agent-to-Agent (A2A) Protocol</strong>, and I&#8217;ll share the <strong>KEY TRICK</strong> that makes A2A work seamlessly in both local development and Cloud Run deployment.</p><p><strong>What you&#8217;ll learn</strong>:</p><ul><li><p>A2A protocol fundamentals</p></li><li><p>Creating A2A servers</p></li><li><p><strong>The dual configuration pattern</strong></p></li><li><p>Testing A2A endpoints</p></li><li><p>Agent card creation</p></li></ul><p>Let&#8217;s make our agents communicate!</p><h2><strong>The Communication Challenge</strong></h2><p>Currently, our agents are standalone Python processes:</p><pre><code>[Brand Strategist] &#8592; Can&#8217;t talk to each other
[Copywriter]       &#8592; Running separately
[Designer]         &#8592; No communication</code></pre><p>We need them to communicate over HTTP so we can:</p><ul><li><p>Deploy to different servers</p></li><li><p>Scale independently</p></li><li><p>Use standardized protocols</p></li><li><p>Test in isolation</p></li></ul><p><strong>Enter A2A Protocol.</strong></p><h2><strong>What is A2A Protocol?</strong></h2><p>A2A (Agent-to-Agent) is a <strong>standardized protocol</strong> for agent communication developed by Google.</p><h3><strong>Key Features</strong></h3><blockquote><p><em><strong>JSONRPC 2.0 based</strong>: Standard, well-understood format<br><strong>Agent cards</strong>: Discoverable metadata at </em><code>/.well-known/agent.json<br></code><em><strong>Stateless</strong>: Each request is independent<br><strong>HTTP/HTTPS</strong>: Works anywhere<br><strong>Language-agnostic</strong>: Any language can implement</em></p></blockquote><h3><strong>Message Format</strong></h3><p><strong>Request</strong>:</p><pre><code>{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "agent/invoke",
  "params": {
    "prompt": "Research eco-friendly water bottles..."
  }
}</code></pre><p><strong>Response</strong>:</p><pre><code>{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": "**Audience Insights:**\n..."
  }
}</code></pre><h2><strong>Agent Card</strong></h2><p>Every A2A agent exposes metadata at <code>/.well-known/agent.json</code>:</p><pre><code>{
  "name": "brand_strategist",
  "description": "Market research and trend analysis",
  "rpc_url": "https://brand-strategist-xxx.run.app",
  "capabilities": ["research", "analysis"]
}</code></pre><h2><strong>Creating an A2A Server (Simple Approach)</strong></h2><p>Let&#8217;s convert our Brand Strategist to an A2A server using ADK&#8217;s built-in <code>to_a2a</code>:</p><pre><code># agents/brand_strategist/agent.py
from google.adk.agents import Agent
from google.adk.tools import google_search
import os
# ... (agent creation code from Part 1) ...

if __name__ == "__main__":
    import uvicorn
    from google.adk.a2a.utils.agent_to_a2a import to_a2a

    PORT = int(os.getenv("PORT", "8082"))
    HOST = os.getenv("HOST", "0.0.0.0")

    # Convert agent to A2A application
    a2a_app = to_a2a(root_agent, host=HOST, port=PORT, protocol="http")

    # Start server
    print(f"&#128640; Starting Brand Strategist A2A Server on http://{HOST}:{PORT}")
    print(f"&#128203; Agent card: http://{HOST}:{PORT}/.well-known/agent.json")
    uvicorn.run(a2a_app, host=HOST, port=PORT)</code></pre><p>Run it:</p><pre><code>python agent.py</code></pre><p>Output:</p><pre><code>&#128640; Starting Brand Strategist A2A Server on http://0.0.0.0:8082
&#128203; Agent card: http://0.0.0.0:8082/.well-known/agent.json</code></pre><h2><strong>Testing the A2A Endpoint</strong></h2><h3><strong>Test 1: Agent Card</strong></h3><pre><code>curl http://localhost:8082/.well-known/agent.json</code></pre><p>Response:</p><pre><code>{
  "name": "brand_strategist",
  "description": "Brand strategist for market research...",
  "rpc_url": "http://localhost:8082"
}</code></pre><h3><strong>Test 2: Invoke Agent</strong></h3><pre><code>curl -X POST http://localhost:8082/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "agent/invoke",
    "params": {
      "prompt": "Research the smart water bottle market"
    }
  }'</code></pre><p>Response:</p><pre><code>{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": "**Audience Insights:**\n[Research results...]"
  }
}</code></pre><h2><strong>Testing with A2A Inspector</strong></h2><p>While <code>curl</code> works for basic testing, there&#8217;s a much better tool: <strong>A2A Inspector, </strong>a web-based debugging tool specifically designed for A2A agents.</p><h3><strong>What is A2A Inspector?</strong></h3><p>A2A Inspector is an open-source tool that:</p><ul><li><p>Connects to A2A agents (local or cloud)</p></li><li><p>Shows agent cards with full metadata</p></li><li><p>Sends test queries with a visual interface</p></li><li><p>Displays JSONRPC messages (request/response)</p></li><li><p>Validates A2A protocol compliance</p></li></ul><h3><strong>Installing A2A Inspector</strong></h3><pre><code># Clone the repository
git clone https://github.com/a2aproject/a2a-inspector.git ~/a2a-inspector
cd ~/a2a-inspector</code></pre><pre><code># Install dependencies
npm install
cd frontend &amp;&amp; npm install &amp;&amp; cd ..

# Start the inspector
bash scripts/run.sh</code></pre><p>The inspector will start at <code>http://localhost:5001</code>.</p><h3><strong>Using A2A Inspector</strong></h3><p><strong>Step 1: Open the inspector</strong></p><p>Open <code>http://localhost:5001</code> in your browser.</p><p><strong>Step 2: Connect to your agent</strong></p><ul><li><p>Enter agent URL: <code>http://localhost:8082</code></p></li><li><p>Click &#8220;Connect&#8221;</p></li></ul><p><strong>Step 3: View the agent card</strong></p><p>The inspector automatically fetches and displays:</p><pre><code>{
  "name": "brand_strategist",
  "description": "Brand strategist for market research...",
  "protocol": "a2a",
  "version": "1.0",
  "capabilities": {
    "streaming": true
  },
  "endpoints": {
    "query": "/query"
  }
}</code></pre><p><strong>Step 4: Send test queries</strong></p><p>Use the visual interface to send queries:</p><ul><li><p>Query: <code>"Research the eco-friendly water bottle market"</code></p></li><li><p>Click &#8220;Send&#8221;</p></li></ul><p><strong>Step 5: View JSONRPC messages</strong></p><p>The inspector shows both request and response:</p><p><strong>Request:</strong></p><pre><code>{
  "jsonrpc": "2.0",
  "method": "query",
  "params": {
    "query": "Research the eco-friendly water bottle market"
  },
  "id": 1
}</code></pre><p><strong>Response:</strong></p><pre><code>{
  "jsonrpc": "2.0",
  "result": {
    "content": "**Target Audience Insights:**\n\nGen Z (18-25)..."
  },
  "id": 1
}</code></pre><h3><strong>Why Use A2A Inspector?</strong></h3><p><strong>vs curl:</strong></p><ul><li><p>curl: Manual JSON formatting, hard to read responses</p></li><li><p>Inspector: Visual interface, formatted display</p></li></ul><p><strong>Benefits:</strong></p><ul><li><p><strong>Debug protocol issues</strong>: See exact JSONRPC messages</p></li><li><p><strong>Test faster</strong>: No typing JSON by hand</p></li><li><p><strong>Validate compliance</strong>: Ensures your agent follows the A2A spec</p></li><li><p><strong>Test cloud agents</strong>: Works with Cloud Run URLs too</p></li></ul><p><strong>Testing Cloud Agents:</strong></p><p>After deployment (covered in Part 6), you can test cloud agents:</p><pre><code>Agent URL: https://brand-strategist-xxx.run.app</code></pre><p>The inspector works identically with cloud URLs!</p><h2><strong>&#9888; The Local vs Cloud Run Challenge</strong></h2><p>When we deploy to Cloud Run, we hit a <strong>critical issue</strong>:</p><h3><strong>The Problem</strong></h3><p><strong>Local environment</strong>:</p><ul><li><p>Server listens on: <code>0.0.0.0:8082</code></p></li><li><p>Agent card should advertise: <code>http://localhost:8082</code></p></li></ul><p><strong>Cloud Run environment</strong>:</p><ul><li><p>Server listens on: <code>0.0.0.0:8080</code> (internal)</p></li><li><p>Cloud Run routes external <code>443</code> &#8594; internal <code>8080</code></p></li><li><p>Agent card should advertise: <code>https://brand-strategist-xxx.run.app:443</code></p></li></ul><p>If we hardcode the URL, it can&#8217;t be correct in both environments!</p><h2><strong>The Solution: Dual Configuration Pattern</strong></h2><p>This is our <strong>KEY TRICK</strong>: separating the listening configuration from the public configuration.</p><h3><strong>The Pattern</strong></h3><pre><code># agents/brand_strategist/agent.py
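
# --- Optional sketch, not part of the original pattern ---
# A hypothetical helper (the name validate_public_config is an assumption) that
# checks the PUBLIC_* values before they are passed to to_a2a(). It assumes the
# advertised URL has the form protocol://host:port, as in the agent cards shown
# in this article. One possible use: call it right after reading the
# environment variables so a bad setting fails at startup, not at request time.
def validate_public_config(protocol, host, port):
    """Return the advertised base URL, or raise ValueError on bad settings."""
    if protocol not in ("http", "https"):
        raise ValueError(f"PROTOCOL must be 'http' or 'https', got {protocol!r}")
    if not host or "://" in host or "/" in host:
        raise ValueError(f"PUBLIC_HOST must be a bare hostname, got {host!r}")
    if port not in range(1, 65536):
        raise ValueError(f"PUBLIC_PORT must be between 1 and 65535, got {port!r}")
    return f"{protocol}://{host}:{port}"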

if __name__ == "__main__":
    import uvicorn
    from google.adk.a2a.utils.agent_to_a2a import to_a2a

    # === LISTENING CONFIGURATION (Internal) ===
    # Where the server binds and listens
    PORT = int(os.getenv("PORT", "8082"))
    HOST = os.getenv("HOST", "0.0.0.0")

    # === PUBLIC CONFIGURATION (External) ===
    # What the agent card advertises
    PUBLIC_HOST = os.getenv("PUBLIC_HOST", "localhost")
    PUBLIC_PORT = int(os.getenv("PUBLIC_PORT", str(PORT)))
    PROTOCOL = os.getenv("PROTOCOL", "http")

    # Create A2A app with PUBLIC configuration
    a2a_app = to_a2a(
        root_agent,
        host=PUBLIC_HOST,      # &#8592; Goes in agent card
        port=PUBLIC_PORT,      # &#8592; Goes in agent card
        protocol=PROTOCOL      # &#8592; Goes in agent card
    )

    # Run server with LISTENING configuration
    print(f"&#128640; Starting on {PROTOCOL}://{HOST}:{PORT}")
    print(f"&#127760; Public URL: {PROTOCOL}://{PUBLIC_HOST}:{PUBLIC_PORT}")

    uvicorn.run(a2a_app, host=HOST, port=PORT)</code></pre><h3><strong>Local Configuration</strong></h3><p>Create <code>.env</code> in <code>agents/brand_strategist/</code>:</p><pre><code># Listening configuration
HOST=0.0.0.0
PORT=8082

# Public configuration (for agent card)
PUBLIC_HOST=localhost
PUBLIC_PORT=8082
PROTOCOL=http</code></pre><h3><strong>Cloud Run Configuration</strong></h3><p>Set during deployment (automatically):</p><pre><code># Listening configuration
HOST=0.0.0.0
PORT=8080

# Public configuration (updated after deployment)
PUBLIC_HOST=brand-strategist-xxx.us-central1.run.app
PUBLIC_PORT=443
PROTOCOL=https</code></pre><h3><strong>How It Works</strong></h3><h3><strong>Step 1: Agent Card Creation</strong></h3><p>When <code>to_a2a()</code> is called with <code>PUBLIC_HOST</code>, <code>PUBLIC_PORT</code>, and <code>PROTOCOL</code>:</p><pre><code>a2a_app = to_a2a(
    root_agent,
    host="brand-strategist-xxx.run.app",  # PUBLIC_HOST
    port=443,                             # PUBLIC_PORT
    protocol="https"                      # PROTOCOL
)</code></pre><p>The agent card is created with the <strong>public URL</strong>:</p><pre><code>{
  "name": "brand_strategist",
  "rpc_url": "https://brand-strategist-xxx.run.app:443"
}</code></pre><h3><strong>Step 2: Server Listening</strong></h3><p>But uvicorn listens on the <strong>internal address</strong>:</p><pre><code>uvicorn.run(
    a2a_app,
    host="0.0.0.0",  # HOST - internal listening
    port=8080        # PORT - internal listening
)</code></pre><h3><strong>Step 3: Cloud Run Routing</strong></h3><p>Cloud Run automatically routes:</p><ul><li><p>External requests to <code>https://service-xxx.run.app:443</code></p></li><li><p>&#8594; Internal server at <code>0.0.0.0:8080</code></p></li></ul><h3><strong>The Magic</strong></h3><ul><li><p>Agent card advertises the <strong>public URL</strong> (what clients use)</p></li><li><p>Server listens on the <strong>internal address</strong> (what Cloud Run expects)</p></li><li><p><strong>Same code works in both environments!</strong></p></li></ul><h3><strong>Benefits of This Pattern</strong></h3><ul><li><p><strong>Environment-agnostic code</strong>: No changes between local and cloud</p></li><li><p><strong>Clean separation</strong>: Listening vs public configuration</p></li><li><p><strong>Secure by default</strong>: Internal ports not exposed</p></li><li><p><strong>Standard ADK tools</strong>: Uses <code>to_a2a</code> without modifications</p></li><li><p><strong>Easy testing</strong>: Local URLs for development, production URLs for deployment</p></li></ul><h3><strong>Complete Example: Brand Strategist with Dual Configuration</strong></h3><pre><code># agents/brand_strategist/agent.py

import logging
import datetime
import os
from google.adk.agents import Agent
from google.adk.tools import google_search
from dotenv import load_dotenv

load_dotenv()

logger = logging.getLogger("ai_creative_studio.brand_strategist")

SYSTEM_INSTRUCTION = f"""You are a Brand Strategist...
[Full instruction from Part 1]
"""

root_agent = Agent(
    name="brand_strategist",
    model="gemini-2.5-flash",
    instruction=SYSTEM_INSTRUCTION,
    description="Brand strategist for market research, trend analysis, and competitive insights",
    tools=[google_search]
)
logger.info("Brand Strategist agent created successfully")

if __name__ == "__main__":
    import uvicorn
    from google.adk.a2a.utils.agent_to_a2a import to_a2a

    # LISTENING CONFIGURATION (where server binds)
    PORT = int(os.getenv("PORT", "8082"))
    HOST = os.getenv("HOST", "0.0.0.0")

    # PUBLIC CONFIGURATION (what agent card advertises)
    # In Cloud Run: PUBLIC_HOST is the full domain, PUBLIC_PORT is 443
    PUBLIC_HOST = os.getenv("PUBLIC_HOST", "localhost")
    PUBLIC_PORT = int(os.getenv("PUBLIC_PORT", str(PORT)))
    PROTOCOL = os.getenv("PROTOCOL", "http")

    # Convert agent to A2A application with PUBLIC info
    a2a_app = to_a2a(root_agent, host=PUBLIC_HOST, port=PUBLIC_PORT, protocol=PROTOCOL)

    # Start server on INTERNAL host and port
    logger.info(f"&#128640; Starting Brand Strategist A2A Server on {PROTOCOL}://{HOST}:{PORT}")
    logger.info(f"&#128203; Agent card: {PROTOCOL}://{HOST}:{PORT}/.well-known/agent-card.json")
    logger.info(f"&#127760; Public URL: {PROTOCOL}://{PUBLIC_HOST}:{PUBLIC_PORT}")
    uvicorn.run(a2a_app, host=HOST, port=PORT)</code></pre><h3><strong>Testing Locally with Dual Configuration</strong></h3><p><strong>1. Create </strong><code>.env</code></p><pre><code># agents/brand_strategist/.env
HOST=0.0.0.0
PORT=8082
PUBLIC_HOST=localhost
PUBLIC_PORT=8082
PROTOCOL=http</code></pre><p><strong>2. Run the server</strong></p><pre><code>cd agents/brand_strategist
python agent.py</code></pre><p><strong>3. Check the agent card</strong></p><pre><code>curl http://localhost:8082/.well-known/agent.json</code></pre><p>Response shows <strong>localhost</strong> (correct for local):</p><pre><code>{
  "name": "brand_strategist",
  "rpc_url": "http://localhost:8082"
}</code></pre><p><strong>4. Test invocation</strong></p><pre><code>curl -X POST http://localhost:8082/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "agent/invoke",
    "params": {"prompt": "Research smart water bottles"}
  }'</code></pre><p>Works perfectly with localhost URLs!</p><h3><strong>Cloud Run Configuration (Preview)</strong></h3><p>When deployed to Cloud Run, the deployment script will:</p><pre><code># 1. Deploy service
gcloud run deploy brand-strategist --source=. --region=us-central1

# 2. Get the public URL
SERVICE_URL=$(gcloud run services describe brand-strategist \
  --region=us-central1 \
  --format='value(status.url)')

# 3. Extract hostname
PUBLIC_HOST=$(echo "$SERVICE_URL" | sed 's|https://||' | sed 's|/||')

# 4. Update environment variables
gcloud run services update brand-strategist \
  --region=us-central1 \
  --update-env-vars=PUBLIC_HOST=$PUBLIC_HOST,PUBLIC_PORT=443,PROTOCOL=https</code></pre><p>Agent card will then show:</p><pre><code>{
  "name": "brand_strategist",
  "rpc_url": "https://brand-strategist-xxx.us-central1.run.app:443"
}</code></pre><p><strong>Same code, different configuration!</strong></p><h2><strong>Apply to All Agents</strong></h2><p>Update Copywriter, Designer, Critic, and Project Manager with the same pattern:</p><pre><code>agents/
&#9500;&#9472;&#9472; brand_strategist/
&#9474;   &#9500;&#9472;&#9472; agent.py          # &#9989; With dual configuration
&#9474;   &#9492;&#9472;&#9472; .env              # &#9989; Local config
&#9500;&#9472;&#9472; copywriter/
&#9474;   &#9500;&#9472;&#9472; agent.py          # &#8592; Apply pattern
&#9474;   &#9492;&#9472;&#9472; .env              # &#8592; Add config
&#9500;&#9472;&#9472; designer/
&#9474;   &#9500;&#9472;&#9472; agent.py          # &#8592; Apply pattern
&#9474;   &#9492;&#9472;&#9472; .env              # &#8592; Add config
&#9500;&#9472;&#9472; critic/
&#9474;   &#9500;&#9472;&#9472; agent.py          # &#8592; Apply pattern
&#9474;   &#9492;&#9472;&#9472; .env              # &#8592; Add config
&#9492;&#9472;&#9472; project_manager/
    &#9500;&#9472;&#9472; agent.py          # &#8592; Apply pattern
    &#9492;&#9472;&#9472; .env              # &#8592; Add config</code></pre><h2><strong>A2A Clients: The Other Half</strong></h2><p>So far we&#8217;ve built A2A <strong>servers</strong> (the specialist agents). But how do we actually <strong>call</strong> them from code?</p><h2><strong>The A2A Client Side</strong></h2><p>While we&#8217;ve tested with curl and A2A Inspector, production systems need to call agents programmatically:</p><pre><code># How do we call our A2A agents from another agent?
# How does the orchestrator invoke specialists?</code></pre><p><strong>Answer:</strong> ADK provides <code>RemoteA2aAgent</code> &#8212; a client for calling A2A servers.</p><h2><strong>Brief Example (Full Details in Part 3)</strong></h2><pre><code>from google.adk.agents.remote_a2a_agent import RemoteA2aAgent
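import json
import urllib.request


# --- Optional sketch, not part of the original example ---
# A hypothetical pre-flight check (fetch_agent_card is an assumed name) that an
# orchestrator could run before creating RemoteA2aAgent clients. It uses only
# the standard library and performs the same GET as the curl agent-card test
# earlier in this article; the card URL is whatever your server exposes.
def fetch_agent_card(card_url, timeout=5.0):
    """Download an agent card and return it as a dict.

    Raises urllib.error.URLError if the server is unreachable and
    json.JSONDecodeError (a ValueError) if the body is not valid JSON.
    """
    with urllib.request.urlopen(card_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))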

# Create a client for the Brand Strategist
strategist = RemoteA2aAgent(
    name="brand_strategist",
    description="Brand strategist for market research",
    agent_card="http://localhost:8082/.well-known/agent.json"
)

# Call the agent (simplified for illustration; Part 3 shows the full orchestrator wiring)
result = await strategist.invoke(&#8221;Research eco-friendly water bottles&#8221;)</code></pre><h3><strong>What We&#8217;ve Covered vs What&#8217;s Next</strong></h3><p><strong>This article (Part 2):</strong></p><ul><li><p>A2A <strong>servers</strong> (creating specialist agents)</p></li><li><p>A2A protocol and JSONRPC</p></li><li><p>Testing with curl and A2A Inspector</p></li><li><p>Dual configuration pattern</p></li></ul><p><strong>Next article (Part 3):</strong></p><ul><li><p>A2A <strong>clients</strong> (<code>RemoteA2aAgent</code>)</p></li><li><p>Building the orchestrator</p></li><li><p>AgentTool pattern</p></li><li><p>Coordinating multiple agents</p></li></ul><p>We&#8217;ll dive deep into A2A clients when we build the orchestrator!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9U_H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9U_H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 424w, https://substackcdn.com/image/fetch/$s_!9U_H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 848w, https://substackcdn.com/image/fetch/$s_!9U_H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9U_H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9U_H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png" width="784" height="480" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9U_H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 424w, https://substackcdn.com/image/fetch/$s_!9U_H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 848w, https://substackcdn.com/image/fetch/$s_!9U_H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9U_H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79a62c56-9fe5-4748-8c71-ad1ae3142733_784x480.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>A2A Protocol Benefits Recap</strong></h2><ul><li><p><strong>Standardized</strong>: JSONRPC 2.0, widely supported</p></li><li><p><strong>Discoverable</strong>: Agent cards expose metadata</p></li><li><p><strong>Stateless</strong>: No session management complexity</p></li><li><p><strong>HTTP-based</strong>: Works with existing 
infrastructure</p></li><li><p><strong>Scalable</strong>: Deploy agents independently</p></li><li><p><strong>Testable</strong>: Curl, Postman, custom clients</p></li><li><p><strong>Language-agnostic</strong>: Implement in any language</p></li></ul><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><p><strong>Next</strong>: <a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator &#8594;</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Building Distributed Multi-Agent Systems with Google’s AI Stack Part 1]]></title><description><![CDATA[From Monolithic AI to Distributed Intelligence: Why Multi-Agent Systems Matter]]></description><link>https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/building-distributed-multi-agent</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:44:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eo-v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Building Distributed Multi-Agent Systems with Google&#8217;s AI Stack series:</strong></p><ul><li><p><strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent?utm_campaign=post-expanded-share&amp;utm_medium=web">Part 1: From Monolithic AI to Distributed Intelligence: Building Your First Multi-Agent System</a></strong> &#8592; You are here</p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: Making Agents Talk: Agent-to-Agent (A2A) Protocol Deep Dive</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-9a3?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3: Building the Orchestrator: Coordinating Agents with the AgentTool Pattern</a></p></li><li><p><a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-d85?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4: Scaling Multi-Agent Workflows: Solving the Token Limit Problem</a></p></li><li><p><a href="https://saoussenchaabnia.substack.com/publish/post/184416479">Part 5: External Tool Integration via Model Context Protocol (MCP)</a></p></li><li><p>Part 6: Deploying to Cloud: Cloud Run and Vertex AI Agent Engine</p></li></ul><h2><strong>Introduction</strong></h2><p>Imagine you&#8217;re building an AI system to create complete social media campaigns. Your agent needs to:</p><ul><li><p>Research market trends and competitors</p></li><li><p>Write engaging social media copy</p></li><li><p>Generate visual design concepts</p></li><li><p>Review quality and provide feedback</p></li><li><p>Create project timelines and tasks</p></li></ul><p>You could build a single, monolithic AI agent to do all of this. 
But should you?</p><p>In this 6-part series, I&#8217;ll show you <strong>why the answer is no</strong> &#8212; and demonstrate how to build a distributed multi-agent system using Google&#8217;s AI stack. We&#8217;ll explore:</p><ul><li><p><strong>Google Agent Development Kit (ADK)</strong> for building agents</p></li><li><p><strong>Agent-to-Agent (A2A) Protocol</strong> for communication</p></li><li><p><strong>Model Context Protocol (MCP)</strong> for external tool integration</p></li><li><p><strong>Vertex AI Agent Engine</strong> for managed orchestration</p></li><li><p><strong>Cloud Run</strong> for scalable agent deployment</p></li></ul><p>By the end of this series, you&#8217;ll have learned from a real system that generates complete social media campaigns &#8212; and you&#8217;ll be able to apply these patterns to your own projects.</p><h2><strong>Part 1: Why Multi-Agent Systems Matter</strong></h2><h3><strong>The Problem with Monolithic AI Agents</strong></h3><p><strong>Single Agent Approach</strong></p><pre><code>class MonolithicCampaignAgent:
    def create_campaign(self, brief):
        # Research the market
        research = self.research_market(brief)
        # Write social media posts
        posts = self.write_posts(research)
        # Generate visual concepts
        visuals = self.design_visuals(posts)
        # Review quality
        feedback = self.review_quality(posts, visuals)
        # Create timeline
        timeline = self.create_timeline(feedback)
        return {
            'research': research,
            'posts': posts,
            'visuals': visuals,
            'feedback': feedback,
            'timeline': timeline
        }</code></pre><p>This looks clean, but it has <strong>serious problems</strong>:</p><p><strong>Problem 1: Lack of Separation of Concerns</strong></p><p>All functionality lives in one agent. A bug in the research logic can affect the entire system. Changes to the visual generation require redeploying everything.</p><p><strong>Problem 2: No Independent Scaling</strong></p><p>Need more copywriting capacity? You have to scale the entire agent, including the expensive research and visual generation components.</p><p><strong>Problem 3: Prompt Complexity</strong></p><p>Your system instruction becomes a massive document trying to teach one LLM how to:</p><ul><li><p>Research like a market analyst</p></li><li><p>Write like a copywriter</p></li><li><p>Design like a visual artist</p></li><li><p>Review like a creative director</p></li><li><p>Plan like a project manager</p></li></ul><p>The result? A confused agent that&#8217;s mediocre at everything.</p><p><strong>Problem 4: Limited Flexibility</strong></p><p>Want to use the copywriter for a different project? You can&#8217;t &#8212; it&#8217;s tightly coupled to the campaign workflow.</p><p><strong>Problem 5: Testing Nightmare</strong></p><p>How do you test just the visual generation? You can&#8217;t, without running the entire pipeline.</p><h2><strong>The Multi-Agent Solution</strong></h2><p>Instead of one agent doing everything, we create <strong>specialized agents</strong> that each do one thing extremely well:</p><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;         &#127916; Creative Director                &#9474;
&#9474;         (Orchestrator)                      &#9474;
&#9474;    - Routes requests intelligently          &#9474;
&#9474;    - Coordinates specialists                &#9474;
&#9474;    - Passes context between agents          &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
           &#9474;
           &#9474; A2A Protocol (HTTPS)
           &#9474;
    &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
    &#9474;              &#9474;       &#9474;        &#9474;      &#9474;
&#9484;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9472;&#9488;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9660;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9660;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9660;&#9472;&#9472;&#9472;&#9488; &#9484;&#9660;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; &#128269;    &#9474;  &#9474; &#9997;&#65039;      &#9474; &#9474; &#127912;    &#9474; &#9474; &#11088;  &#9474; &#9474; &#128203;   &#9474;
&#9474;Research&#9474;  &#9474;Copywriter&#9474; &#9474;Designer&#9474; &#9474;Review&#9474; &#9474;Planning&#9474;
&#9474;Agent   &#9474;  &#9474;Agent    &#9474; &#9474;Agent   &#9474; &#9474;Agent &#9474; &#9474;Agent  &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><h3><strong>Benefits</strong></h3><p><strong>Separation of Concerns</strong></p><ul><li><p>Each agent has one responsibility</p></li><li><p>Bugs are isolated</p></li><li><p>Independent updates and improvements</p></li></ul><p><strong>Independent Scaling</strong></p><ul><li><p>Scale copywriter separately from designer</p></li><li><p>Cost-efficient resource allocation</p></li><li><p>Match capacity to demand</p></li></ul><p><strong>Specialized Expertise</strong></p><ul><li><p>Each agent has focused instructions</p></li><li><p>Better quality output</p></li><li><p>Clear responsibilities</p></li></ul><p><strong>Flexibility and Reusability</strong></p><ul><li><p>Use copywriter in other projects</p></li><li><p>Mix and match agents</p></li><li><p>Compose new workflows easily</p></li></ul><p><strong>Easier Testing</strong></p><ul><li><p>Test each agent independently</p></li><li><p>Mock dependencies</p></li><li><p>Clear success criteria</p></li></ul><h2><strong>Enter Google&#8217;s Agent Development Kit (ADK)</strong></h2><p>Building a multi-agent system from scratch is complex. You need:</p><ul><li><p>Agent runtime and lifecycle management</p></li><li><p>Communication protocols</p></li><li><p>Tool integration</p></li><li><p>Session management</p></li><li><p>Deployment infrastructure</p></li></ul><p><strong>Google ADK provides all of this out of the box.</strong></p><h3><strong>What is ADK?</strong></h3><p>The Agent Development Kit is a framework for building, deploying, and managing AI agents. 
It provides:</p><ul><li><p><strong>Agent Types</strong>: <code>LlmAgent</code> for LLM-driven agents (<code>Agent</code> is an alias for the same class), plus workflow agents such as <code>SequentialAgent</code>, <code>ParallelAgent</code>, and <code>LoopAgent</code> for fixed orchestration patterns</p></li><li><p><strong>Built-in Tools</strong>: Google Search, code execution, and more</p></li><li><p><strong>Remote Agent Support</strong>: Call agents over HTTP via the A2A protocol</p></li><li><p><strong>Session Management</strong>: Built-in state management</p></li><li><p><strong>Cloud Integration</strong>: Deploy to Vertex AI Agent Engine</p></li></ul><h3><strong>Core Concepts</strong></h3><p><strong>1. Agents</strong></p><pre><code>from google.adk.agents import Agent
from google.adk.tools import google_search

agent = Agent(
    name="brand_strategist",
    model="gemini-2.5-flash",
    instruction="You are a brand strategist...",
    description="Market research and trend analysis",
    tools=[google_search]
)</code></pre><p><strong>2. Tools</strong></p><p>Tools extend agent capabilities:</p><pre><code>from google.adk.tools import google_search

# Built-in tool
tools = [google_search]

# Custom tool: a plain Python function with type hints and a docstring.
# ADK wraps functions passed in tools=[...] as function tools automatically.
def analyze_sentiment(text: str) -&gt; dict:
    """Analyze sentiment of text"""
    # Your implementation
    return {"sentiment": "positive", "score": 0.85}</code></pre><p><strong>3. Sessions</strong></p><p>Sessions maintain conversation context:</p><pre><code>from google.adk.sessions import InMemorySessionService

session_service = InMemorySessionService()</code></pre><p><strong>4. Runners</strong></p><p>Runners execute agents:</p><pre><code>from google.adk import Runner
from google.genai.types import Content, Part

runner = Runner(
    app_name="my_agent",
    agent=agent,
    session_service=session_service
)
# Run this inside an async function; the session must already exist
# (see session_service.create_session in the full example later on).
async for event in runner.run_async(
    user_id="user_123",
    session_id="session_456",
    new_message=Content(parts=[Part(text="Hello!")])
):
    # Text lives in event.content.parts, not on the event itself
    if event.content and event.content.parts and event.content.parts[0].text:
        print(event.content.parts[0].text)</code></pre><h2><strong>Introducing AI Creative Studio: A Real-World Example</strong></h2><p>Throughout this series, we&#8217;ll build <strong>AI Creative Studio</strong> &#8212; a distributed multi-agent system for creating complete social media campaigns.</p><h3><strong>System Architecture</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eo-v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eo-v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 424w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 848w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 1272w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eo-v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png" width="784" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!eo-v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 424w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 848w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 1272w, https://substackcdn.com/image/fetch/$s_!eo-v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f0dc26f-d1b7-4f6c-bfd0-aed1891e2c9d_784x461.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>The Agents</strong></h3><p><strong>1. Brand Strategist</strong> (<code>LlmAgent</code> + Google Search)</p><ul><li><p>Researches market trends</p></li><li><p>Analyzes competitors</p></li><li><p>Identifies target audience insights</p></li></ul><p><strong>2. Copywriter</strong> (<code>LlmAgent</code>)</p><ul><li><p>Creates engaging social media captions</p></li><li><p>Writes hashtags and CTAs</p></li><li><p>Adapts tone and style</p></li></ul><p><strong>3. Designer</strong> (<code>LlmAgent</code>)</p><ul><li><p>Generates visual concepts</p></li><li><p>Creates AI image generation prompts</p></li><li><p>Defines style and mood</p></li></ul><p><strong>4. Critic</strong> (<code>LlmAgent</code>)</p><ul><li><p>Reviews all creative work</p></li><li><p>Provides constructive feedback</p></li><li><p>Scores quality</p></li></ul><p><strong>5.
Project Manager</strong> (<code>Agent</code> + Notion MCP)</p><ul><li><p>Creates project timeline</p></li><li><p>Generates task list</p></li><li><p>Integrates with Notion for task management</p></li></ul><p><strong>6. Creative Director</strong> (<code>Agent</code> - Orchestrator)</p><ul><li><p>Coordinates all specialists</p></li><li><p>Implements planning-first workflow</p></li><li><p>Manages context and error handling</p></li></ul><h3><strong>Deployment Architecture</strong></h3><p><strong>Specialists &#8594; Cloud Run</strong></p><ul><li><p>Containerized services</p></li><li><p>Auto-scaling (0&#8211;100 instances)</p></li><li><p>A2A server endpoints</p></li><li><p>HTTPS communication</p></li></ul><p><strong>Orchestrator &#8594; Vertex AI Agent Engine</strong></p><ul><li><p>Managed runtime</p></li><li><p>No containerization needed</p></li><li><p>Environment-based configuration</p></li></ul><h2><strong>Part 2: Building Your First ADK Agents</strong></h2><p>Now that we understand why multi-agent systems matter, let&#8217;s get hands-on and build our first specialist agents.</p><h3><strong>Setup: Installing ADK</strong></h3><p>First, let&#8217;s set up our development environment.</p><p><strong>Prerequisites</strong></p><pre><code># Python 3.11 or higher
python --version  # Should be 3.11+

# Create project directory
mkdir ai-creative-studio
cd ai-creative-studio

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Google ADK
pip install google-adk google-genai python-dotenv</code></pre><p><strong>Environment Configuration</strong></p><p>Create a <code>.env</code> file:</p><pre><code># Get your API key from: https://aistudio.google.com/app/apikey
GOOGLE_API_KEY=your_gemini_api_key_here</code></pre><h2><strong>Understanding Agent Anatomy</strong></h2><p>Before we build, let&#8217;s understand what makes up an ADK agent:</p><pre><code>from google.adk.agents import Agent

agent = Agent(
    name="agent_name",              # Identifier for logging/debugging
    model="gemini-2.5-flash",       # LLM model to use
    instruction="System instruction...",  # Agent's role and behavior
    description="Brief description...",   # What this agent does
    tools=[...]                     # Optional: External capabilities
)</code></pre><h3><strong>Key Components</strong></h3><p><strong>1. System Instruction</strong></p><ul><li><p>Defines the agent&#8217;s role and expertise</p></li><li><p>Sets boundaries (what it should/shouldn&#8217;t do)</p></li><li><p>Provides output format guidelines</p></li><li><p>Includes examples and best practices</p></li></ul><p><strong>2. Model Selection</strong></p><ul><li><p><code>gemini-2.5-flash</code>: Fast, efficient (our choice)</p></li><li><p><code>gemini-2.5-pro</code>: More capable, slower</p></li></ul><p><strong>3. Tools</strong></p><ul><li><p>Built-in: <code>google_search</code>, <code>code_execution</code></p></li><li><p>Custom: Your own functions</p></li><li><p>MCP: External services</p></li></ul><h2><strong>Agent 1: Brand Strategist (Research Specialist)</strong></h2><p>Our Brand Strategist needs to research markets, analyze competitors, and identify trends. This requires the <strong>Google Search tool</strong>.</p><h3><strong>Step 1: Create the File</strong></h3><pre><code>mkdir -p agents/brand_strategist
cd agents/brand_strategist
touch agent.py</code></pre><h3><strong>Step 2: Define the System Instruction</strong></h3><pre><code># agents/brand_strategist/agent.py
import logging
import datetime
from google.adk.agents import Agent
from google.adk.tools import google_search
import os
from dotenv import load_dotenv

load_dotenv()

SYSTEM_INSTRUCTION = f"""You are a Brand Strategist specializing in market research and trend analysis.

IMPORTANT: Today's date is {datetime.date.today().strftime('%B %d, %Y')}.
When conducting research, focus on current trends from {datetime.date.today().year}.

Your expertise includes:
- Identifying target audience insights and behaviors
- Analyzing competitor strategies
- Researching current social media trends
- Understanding platform algorithms and best practices

You have access to tools:
- google_search: Search the web for competitors, trends, and market insights

When given a campaign brief:
1. Use google_search to research the target audience&#8217;s current interests
2. Search for and analyze 2-3 competitor brands
3. Identify 3-5 trending topics related to the product category
4. Provide high-level strategic insights

DO NOT:
- Create captions, copy, or specific messaging
- Generate image concepts or designs
- Write TikTok scripts or Instagram posts
- Create content calendars

Your job is to provide RESEARCH INSIGHTS that other specialists will use.

Format your output as:

**Audience Insights:**
[Key behaviors, preferences, and pain points based on research]
**Competitive Analysis:**
[What 2-3 competitors are doing - their strengths and weaknesses]
**Trending Topics:**
[3-5 relevant trends to consider]
**Key Strategic Insights:**
[High-level themes and positioning opportunities]
"""</code></pre><p><strong>Why This Instruction Works</strong></p><p><strong>Date-aware</strong>: Ensures current research, not outdated information<br><strong>Clear boundaries</strong>: Explicitly states what NOT to do<br><strong>Tool guidance</strong>: Tells the agent when and how to use google_search<br><strong>Structured output</strong>: Provides a consistent format for downstream agents</p><h3><strong>Step 3: Create the Agent</strong></h3><pre><code># Continue in agent.py
logger = logging.getLogger("ai_creative_studio.brand_strategist")

root_agent = Agent(
    name="brand_strategist",
    model="gemini-2.5-flash",
    instruction=SYSTEM_INSTRUCTION,
    description="Brand strategist for market research, trend analysis, and competitive insights",
    tools=[google_search]  # &#8592; Built-in Google Search tool
)
logger.info("Brand Strategist agent created successfully")</code></pre><h3><strong>Step 4: Add Local Testing</strong></h3><pre><code># Continue in agent.py
if __name__ == "__main__":
    import asyncio
    from google.adk import Runner
    from google.adk.sessions import InMemorySessionService
    from google.genai import types

    async def main():
        print("&#128269; Starting Brand Strategist Agent...\n")
        brief = """
        Research the market for eco-friendly smart water bottles
        targeting health-conscious millennials.
        """
        print(f"Brief: {brief}\n")
        # Create session service
        session_service = InMemorySessionService()
        # Create runner
        runner = Runner(
            app_name="brand_strategist",
            agent=root_agent,
            session_service=session_service
        )
        session_id = "test_session"
        user_id = "test_user"
        try:
            # Create session
            await session_service.create_session(
                app_name="brand_strategist",
                user_id=user_id,
                session_id=session_id
            )
            # Run agent
            print("brand_strategist &gt; ", end='', flush=True)
            async for event in runner.run_async(
                user_id=user_id,
                session_id=session_id,
                new_message=types.Content(role="user", parts=[types.Part(text=brief)])
            ):
                # Text arrives in event.content.parts rather than on the event itself
                if event.content and event.content.parts:
                    for part in event.content.parts:
                        if part.text:
                            print(part.text, end='', flush=True)
            print("\n\n&#9989; Research Complete!")
        finally:
            await runner.close()
    asyncio.run(main())</code></pre><h3><strong>Step 5: Test It!</strong></h3><pre><code>python agent.py</code></pre><p>Example output (yours will differ, because the agent searches live data):</p><pre><code>&#128269; Starting Brand Strategist Agent...

Brief: Research the market for eco-friendly smart water bottles...
brand_strategist &gt; **Audience Insights:**
Health-conscious millennials (25-34) are increasingly seeking products
that align with their values. They prioritize:
- Sustainability and eco-friendly materials
- Smart features for health tracking
- Aesthetic design for social media sharing
- Convenience for active lifestyles
**Competitive Analysis:**
1. Hydro Flask: Strong brand loyalty, premium pricing ($30-50),
   lacks smart features
2. S&#8217;well: Fashion-forward design, sustainability focus,
   limited tech integration
3. HidrateSpark: Smart bottle with app, moderate price ($40-60),
   opportunity for better design
**Trending Topics:**
1. #SustainableLiving - 2.3M posts, growing 15% monthly
2. #HydrationChallenge - Viral trend, 500K+ posts
3. Smart health wearables integration
4. Minimalist lifestyle aesthetics
5. Water bottle as fashion accessory
**Key Strategic Insights:**
- Gap in market: Premium sustainable + smart features
- Millennials willing to pay $50-80 for value-aligned products
- Instagram and TikTok key platforms for awareness
- Positioning opportunity: &#8220;Tech meets sustainability&#8221;
&#9989; Research Complete!</code></pre><p>Perfect! Our Brand Strategist is working. It used Google Search to find real market data and presented insights in a structured format.</p><h2><strong>Agent 2: Copywriter (Pure LLM)</strong></h2><p>The Copywriter creates engaging social media captions. Unlike the Brand Strategist, it doesn&#8217;t need external tools &#8212; just excellent writing skills.</p><h3><strong>Create the Agent</strong></h3><pre><code># agents/copywriter/agent.py
from google.adk.agents import Agent
import logging
from dotenv import load_dotenv
load_dotenv()
logger = logging.getLogger("ai_creative_studio.copywriter")

SYSTEM_INSTRUCTION = """You are an expert Social Media Copywriter specializing in Instagram and TikTok content.

Your expertise includes:
- Writing engaging, scroll-stopping captions
- Creating platform-optimized hashtag strategies
- Crafting clear, compelling CTAs
- Adapting tone and voice to brand personality

When given a campaign brief and research insights:
1. Create 3-5 Instagram posts with complete captions
2. Include relevant hashtags (mix of popular and niche)
3. Suggest strong CTAs that drive action
4. Match the brand voice and target audience

DO NOT:
- Conduct market research (Brand Strategist&#8217;s job)
- Create visual design concepts (Designer&#8217;s job)
- Review your own work (Critic&#8217;s job)

Format each post as:
### Post [Number]: [Theme]
**Full Caption:**
[Engaging caption with emojis where appropriate]
**Hashtags:**
#hashtag1 #hashtag2 #hashtag3...
**Suggested CTA:**
[Clear call-to-action]
---
Remember: You receive research insights from the Brand Strategist.
Use those insights to inform your copy, but create original,
engaging content that resonates with the target audience.
"""

root_agent = Agent(
    name="copywriter",
    model="gemini-2.5-flash",
    instruction=SYSTEM_INSTRUCTION,
    description="Expert social media copywriter for creating engaging captions and copy",
    tools=[]  # &#8592; No tools needed, pure LLM
)
logger.info("Copywriter agent created successfully")
# Add testing code similar to Brand Strategist...</code></pre><h3><strong>Why No Tools?</strong></h3><p>The Copywriter is a <strong>pure LLM agent</strong> because:</p><ul><li><p>Creative writing doesn&#8217;t require external data</p></li><li><p>LLMs excel at language generation</p></li><li><p>Simpler is better when tools aren&#8217;t needed</p></li><li><p>Faster and more cost-efficient</p></li></ul><h2><strong>Agent 3: Designer (Pure LLM)</strong></h2><p>The Designer generates visual concepts and AI image generation prompts.</p><pre><code># agents/designer/agent.py
from google.adk.agents import Agent
import logging
from dotenv import load_dotenv
load_dotenv()
logger = logging.getLogger("ai_creative_studio.designer")

SYSTEM_INSTRUCTION = """You are a Creative Visual Designer specializing in social media visual concepts.

Your expertise includes:
- Creating detailed AI image generation prompts
- Defining visual style, mood, and composition
- Selecting color palettes and design elements
- Ensuring brand consistency

When given social media posts:
1. Create 2-3 visual concepts per post
2. Write detailed Imagen/DALL-E prompts for each concept
3. Specify style, mood, colors, and composition
4. Ensure Instagram-optimized layouts (1:1 or 4:5)

DO NOT:
- Write captions or copy (Copywriter&#8217;s job)
- Actually generate images (you create prompts only)
- Provide strategic insights (Brand Strategist&#8217;s job)

Format each concept as:

**For Post [Number]: [Theme]**
**Concept A: [Visual Theme]**
- **Prompt**: [Detailed AI image generation prompt]
- **Style**: [e.g., minimalist, vibrant, cinematic, lifestyle]
- **Colors**: [Color palette]
- **Mood**: [e.g., energetic, calm, inspiring, professional]
- **Composition**: [Layout and key elements]
**Concept B: [Alternative Theme]**
[Same structure...]
---
Remember: Create prompts that an AI image generator can understand.
Be specific about elements, style, lighting, and mood.
"""

root_agent = Agent(
    name="designer",
    model="gemini-2.5-flash",
    instruction=SYSTEM_INSTRUCTION,
    description="Creative visual designer for generating social media image concepts",
    tools=[]  # Pure LLM
)
logger.info("Designer agent created successfully")</code></pre><h2><strong>Key Patterns and Best Practices</strong></h2><h3><strong>1. Single Responsibility</strong></h3><p>Each agent does ONE thing well:</p><ul><li><p>Brand Strategist &#8594; Research</p></li><li><p>Copywriter &#8594; Writing</p></li><li><p>Designer &#8594; Visual concepts</p></li></ul><p>&#10060; Don&#8217;t: Make agents do multiple jobs<br>&#9989; Do: Create focused specialists</p><h3><strong>2. Clear Boundaries</strong></h3><p>Use &#8220;DO NOT&#8221; instructions to prevent scope creep:</p><pre><code>DO NOT:
- Create captions (that&#8217;s Copywriter&#8217;s job)
- Generate images (you create prompts only)</code></pre><p>This prevents agents from overstepping their roles.</p><h3><strong>3. Structured Output</strong></h3><p>Always specify output format:</p><pre><code>Format your output as:
**Section Header:**
[Content]
**Another Section:**
[More content]</code></pre><p>A consistent format lets downstream agents rely on the same section headers every time.</p><h3><strong>4. Context Awareness</strong></h3><pre><code>import datetime

SYSTEM_INSTRUCTION = f"""
Today's date is {datetime.date.today().strftime('%B %d, %Y')}.
Focus on trends from {datetime.date.today().year}.
"""</code></pre><p>Date-aware instructions keep the agent focused on current, relevant information.</p><h3><strong>5. Tool Selection</strong></h3><p><strong>Use tools when</strong>:</p><ul><li><p>The agent needs external data (google_search)</p></li><li><p>The agent needs computation (code_execution)</p></li><li><p>The agent needs external services (MCP tools)</p></li></ul><p><strong>Don&#8217;t use tools when</strong>:</p><ul><li><p>The task is pure language generation (copywriting)</p></li><li><p>The task is creative (design concepts)</p></li><li><p>The agent only analyzes data it was given</p></li></ul><h2><strong>Common Pitfalls and Solutions</strong></h2><h3><strong>Pitfall 1: Over-Complicated Instructions</strong></h3><p>&#10060; Bad:</p><pre><code>instruction = """You are an expert in everything related to marketing,
including but not limited to research, copywriting, design, analytics,
SEO, SEM, content strategy..."""  # 500 lines later...</code></pre><p>&#9989; Good:</p><pre><code>instruction = """You are a Brand Strategist specializing in market research.
Your expertise: [3-4 bullet points]
When given a brief: [3-4 steps]
DO NOT: [3-4 boundaries]
Format: [clear structure]
"""</code></pre><h3><strong>Pitfall 2: Missing Boundaries</strong></h3><p>&#10060; Bad:</p><pre><code>instruction = "You are a copywriter. Write great content."</code></pre><p>Result: The agent might also try to do research, design, strategy&#8230;</p><p>&#9989; Good:</p><pre><code>instruction = """You are a copywriter.
DO NOT:
- Conduct research (Brand Strategist&#8217;s job)
- Create visuals (Designer&#8217;s job)
"""</code></pre><h3><strong>Pitfall 3: Ignoring Output Format</strong></h3><p>&#10060; Bad: No format specification &#8594; inconsistent outputs</p><p>&#9989; Good: Clear format &#8594; predictable, parseable outputs</p><h2><strong>Local Testing with </strong><code>adk web</code></h2><p>ADK provides a web UI for interactive testing:</p><pre><code>cd agents  # the parent folder that contains brand_strategist/
adk web --log_level DEBUG</code></pre><p>Then open <code>http://localhost:8000</code> in your browser.</p><p>Benefits:</p><ul><li><p>Nice UI for testing</p></li><li><p>See full conversation history</p></li><li><p>Debug mode shows tool calls</p></li><li><p>Export conversations</p></li></ul><p><strong>Code Repository</strong>: <a href="https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun">https://github.com/Saoussen-CH/ai-creative-studio-adk-a2a-mcp-vertexai-cloudrun</a></p><p><strong>Next</strong>: <a href="https://open.substack.com/pub/saoussenchaabnia/p/building-distributed-multi-agent-2a2?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2: A2A Protocol Deep Dive &#8594;</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 9]]></title><description><![CDATA[Full-Stack Deployment &#8212; Frontend + Backend to Cloud Run]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-9b6</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-9b6</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 19:05:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 9 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong> &#8212; the 
series finale! You&#8217;ve journeyed from your first agent to cloud deployment. Now let&#8217;s complete the stack with a <strong>full web application</strong>.</p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities </a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with 
SequentialAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent Engine</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 9</a></strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Full-Stack Deployment with Cloud Run (You are here)</a></p></li></ol><h2><strong>Introduction</strong></h2><p>The final step: A production web app accessible to anyone!</p><blockquote><p><em>This tutorial demonstrates full-stack deployment for <strong>learning, demonstration, and testing purposes</strong>.</em></p></blockquote><p><strong>Architecture:</strong></p><pre><code>React Frontend &#8594; FastAPI Backend &#8594; Agent Engine &#8594; Gemini</code></pre><p>All on Cloud Run (serverless, auto-scaling).</p><p><strong>GitHub</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><h2><strong>Understanding the Architecture</strong></h2><pre><code>User &#8594; Cloud Run (Frontend + Backend) &#8594; Agent Engine &#8594; Gemini</code></pre><p><strong>Components:</strong></p><ol><li><p><strong>Cloud Run Service</strong> &#8212; Hosts both frontend and backend</p></li><li><p><strong>Agent Engine</strong> &#8212; Runs your multi-agent system</p></li><li><p><strong>Connection</strong> &#8212; Backend uses RemoteRunner</p></li></ol><h2><strong>Backend-to-Agent Engine Integration</strong></h2><h3><strong>Connection Flow</strong></h3><pre><code>1. User &#8594; POST /api/create-content
2. Backend &#8594; Initialize RemoteRunner with AGENT_ENGINE_RESOURCE_NAME
3. RemoteRunner &#8594; Authenticate with Google Cloud
4. Request &#8594; Agent Engine via gRPC
5. Agent Engine &#8594; Execute workflow
6. Response &#8594; Stream back to user via SSE</code></pre><h3><strong>Backend Code (</strong><code>backend/api_server.py</code><strong>)</strong></h3><pre><code>from google import genai
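import json
import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()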

# Initialize client
client = genai.Client(
    vertexai=True,
    project=os.getenv("GOOGLE_CLOUD_PROJECT"),
    location=os.getenv("GOOGLE_CLOUD_LOCATION")
)

# Get Agent Engine resource
AGENT_ENGINE_RESOURCE_NAME = os.getenv("AGENT_ENGINE_RESOURCE_NAME")

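# Sketch (parse_resource_name is a hypothetical helper, not part of the original
# repo): split the resource name into its parts, assuming the
# projects/PROJECT/locations/LOCATION/reasoningEngines/ID shape printed at
# deployment, so the location can be compared with GOOGLE_CLOUD_LOCATION.
import re

_RESOURCE_NAME_RE = re.compile(
    r"^projects/([^/]+)/locations/([^/]+)/reasoningEngines/([^/]+)$"
)

def parse_resource_name(name):
    match = _RESOURCE_NAME_RE.match(name or "")
    if match is None:
        raise ValueError(f"Unexpected Agent Engine resource name: {name!r}")
    project, location, engine_id = match.groups()
    return {"project": project, "location": location, "engine_id": engine_id}
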
@app.post("/api/create-content")
async def create_content(request: ContentRequest):
    # Connect to Agent Engine
    agent = client.agentic.get_agent(AGENT_ENGINE_RESOURCE_NAME)

    # Send request
    response = agent.query(
        user_query=request.topic,
        session_id=request.session_id or generate_session_id()
    )

    # Stream response
    async def event_stream():
        for chunk in response:
            yield f"data: {json.dumps(chunk)}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")</code></pre><h3><strong>Docker Container</strong></h3><p><strong>Multi-Stage Dockerfile</strong></p><pre><code># Stage 1: Build React frontend
FROM node:20-alpine AS frontend-build
WORKDIR /frontend
COPY frontend/ ./
RUN npm ci &amp;&amp; npm run build

# Stage 2: Python backend
FROM python:3.11-slim
WORKDIR /app

# Install dependencies
COPY backend/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy frontend build
COPY --from=frontend-build /frontend/dist ./static

# Copy backend
COPY backend/ ./

# Environment variables (set by Cloud Run)
ENV AGENT_ENGINE_RESOURCE_NAME=""

# Start server
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8080"]</code></pre><h2><strong>Deployment Process</strong></h2><h3><strong>Step 1: Deploy Agent to Agent Engine</strong></h3><pre><code>cd deployment
python deploy.py --action deploy

# Copy the AGENT_ENGINE_RESOURCE_NAME output</code></pre><h3><strong>Step 2: Update .env File</strong></h3><pre><code>cat &gt; .env &lt;&lt; EOF
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
AGENT_ENGINE_RESOURCE_NAME=projects/.../reasoningEngines/...
EOF</code></pre><h3><strong>Step 3: Deploy to Cloud Run</strong></h3><pre><code>./deployment/deploy-combined.sh</code></pre><p><strong>What it does:</strong></p><ol><li><p>Builds Docker image (frontend + backend)</p></li><li><p>Pushes to Artifact Registry</p></li><li><p>Deploys to Cloud Run</p></li><li><p>Sets environment variables</p></li><li><p>Returns service URL</p></li></ol><h2><strong>IAM and Security</strong></h2><h3><strong>Required Roles</strong></h3><pre><code>SERVICE_ACCOUNT="content-studio-sa@${PROJECT_ID}.iam.gserviceaccount.com"

# Add roles
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/ml.developer"</code></pre><h2><strong>Testing Production Deployment</strong></h2><h3><strong>Access the Application</strong></h3><pre><code># Get URL
gcloud run services describe content-studio \
    --region=us-central1 \
    --format='value(status.url)'

# Test API
curl -X POST "YOUR_CLOUD_RUN_URL/api/create-content" \
    -H "Content-Type: application/json" \
    -d '{
        "topic": "AI in Healthcare",
        "target_audience": "Healthcare professionals",
        "tone": "Professional",
        "keywords": "AI, healthcare"
    }'</code></pre><h2><strong>Monitoring and Logs</strong></h2><h3><strong>Cloud Run Logs</strong></h3><pre><code># Real-time logs
gcloud beta run services logs tail content-studio \
    --region=us-central1

# Search for errors
gcloud run services logs read content-studio \
    --region=us-central1 \
    --filter="severity&gt;=ERROR"</code></pre><h3><strong>Agent Engine Logs</strong></h3><pre><code>gcloud logging read \
    'resource.type="aiplatform.googleapis.com/ReasoningEngine"' \
    --limit=50</code></pre><h3><strong>Health Check</strong></h3><pre><code>curl https://YOUR_URL/health

# Response:
{
  "status": "healthy",
  "agent_engine_configured": true,
  "agent_engine_resource": "projects/.../reasoningEngines/..."
}</code></pre><h2><strong>Troubleshooting</strong></h2><h3><strong>Common Issues</strong></h3><p><strong>1. &#8220;AGENT_ENGINE_RESOURCE_NAME not set&#8221;</strong></p><pre><code># Check environment variable
gcloud run services describe content-studio \
    --region=us-central1 \
    --format="value(spec.template.spec.containers[0].env)"

# Update if missing
gcloud run services update content-studio \
    --region=us-central1 \
    --update-env-vars="AGENT_ENGINE_RESOURCE_NAME=projects/..."</code></pre><p><strong>2. &#8220;Permission Denied&#8221;</strong></p><pre><code># Check service account
gcloud run services describe content-studio \
    --region=us-central1 \
    --format="value(spec.template.spec.serviceAccountName)"

# Add required role
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT" \
    --role="roles/aiplatform.user"</code></pre><p><strong>3. &#8220;Agent Engine Not Found&#8221;</strong></p><ul><li><p>Verify that the agent is deployed to Agent Engine</p></li><li><p>Check that <code>AGENT_ENGINE_RESOURCE_NAME</code> matches the resource name printed at deployment</p></li><li><p>Ensure both services are in the same region</p></li></ul><p><strong>4. Connection Timeout</strong></p><pre><code># Increase timeout
gcloud run services update content-studio \
    --region=us-central1 \
    --timeout=300</code></pre><h2><strong>Deployment Best Practices for Learning/Demo</strong></h2><blockquote><p><em><strong>Note:</strong> These are basic best practices for demonstration deployments. Production systems would require significantly more robust practices including comprehensive testing, CI/CD automation, security hardening, and advanced monitoring.</em></p></blockquote><h3><strong>1. Region Consistency</strong></h3><pre><code># All in same region
REGION="us-central1"</code></pre><h3><strong>2. Error Handling</strong></h3><pre><code>import logging
from fastapi import HTTPException
from google.api_core.exceptions import GoogleAPIError

logger = logging.getLogger(__name__)

try:
    agent = client.agentic.get_agent(RESOURCE_NAME)
    response = agent.query(user_query)
except GoogleAPIError as e:
    logger.error(f"Agent Engine error: {e}")
    raise HTTPException(status_code=500, detail="Agent unavailable") from e</code></pre><h3><strong>3. Timeouts</strong></h3><pre><code># Cloud Run timeout (deploy-combined.sh)
--timeout=300  # 5 minutes

# Request timeout in code
response = agent.query(user_query, timeout=240)</code></pre><h3><strong>4. Retry Logic</strong></h3><pre><code>from google.api_core.retry import Retry

retry = Retry(
    initial=1.0,
    maximum=10.0,
    multiplier=2.0,
    timeout=60.0  # total retry budget in seconds (newer name for the deprecated deadline argument)
)

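# Sketch for illustration (nominal_backoff is a hypothetical helper, not part of
# google-api-core): the un-jittered delays implied by the settings above.
# The real Retry adds random jitter, so actual sleep times will differ.
# nominal_backoff(1.0, 10.0, 2.0, 5) returns [1.0, 2.0, 4.0, 8.0, 10.0]
def nominal_backoff(initial, maximum, multiplier, attempts):
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))  # each delay is capped at maximum
        delay *= multiplier                 # then grows geometrically
    return delays
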
response = agent.query(user_query, retry=retry)</code></pre><h2><strong>Congratulations!</strong></h2><p>You&#8217;ve completed the entire series! You now know how to:</p><ul><li><p>Build AI agents (Parts 1&#8211;2)</p></li><li><p>Create agent teams (Part 3)</p></li><li><p>Design workflows (Parts 4&#8211;6)</p></li><li><p>Build complete agent systems (Part 7)</p></li><li><p>Deploy to cloud for learning/demo (Parts 8&#8211;9)</p></li></ul><p><strong>Access the full code at GitHub</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><p><strong>Next Steps:</strong></p><ol><li><p>Experiment with your own use cases</p></li><li><p>Extend the system</p></li><li><p>Deploy to production</p></li><li><p>Build amazing things!</p></li></ol><p>Thank you for following along! &#128640;</p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 8]]></title><description><![CDATA[Deploying AI Agents to Google Cloud&#8217;s Agent Engine]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-df1</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-df1</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 19:01:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 8 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong>! You&#8217;ve built sophisticated agents locally. 
Now comes the pivotal transition: <strong>deploying to the cloud</strong>.</p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities </a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with SequentialAgent</a></p></li><li><p><strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent Engine (You are here)</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 9</a></strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Full-Stack Deployment with Cloud Run</a></p></li></ol><h2><strong>Introduction: From Prototype to Cloud Deployment</strong></h2><p>You&#8217;ve built an incredible Content Creation Studio locally &#8212; 11 specialist agents working together in sophisticated workflows. It works beautifully on your laptop. But now you face the critical question:</p><p><strong>How do you make your local prototype accessible to others by deploying it to the cloud?</strong></p><p>This is where many AI projects stall. Moving from &#8220;works on my machine&#8221; to a scalable, cloud-hosted system requires:</p><ul><li><p><strong>Infrastructure management</strong> &#8212; Servers, containers, orchestration</p></li><li><p><strong>Scalability</strong> &#8212; Handling 1 user vs 1,000 users simultaneously</p></li><li><p><strong>Reliability</strong> &#8212; Uptime, error handling, monitoring</p></li><li><p><strong>Security</strong> &#8212; Authentication, authorization, data protection</p></li><li><p><strong>Cost optimization</strong> &#8212; Efficient resource usage</p></li><li><p><strong>Maintenance</strong> &#8212; Updates, logging, debugging</p></li></ul><p>Building all of this from scratch is complex, time-consuming, and expensive. 
That&#8217;s the problem Google Cloud&#8217;s <strong>Vertex AI Agent Engine</strong> solves for deployment.</p><p><strong>What you&#8217;ll learn in this part:</strong></p><ul><li><p>Why deployment matters and when to deploy</p></li><li><p>The power of Vertex AI Agent Engine</p></li><li><p>Setting up your Google Cloud environment with <code>setup_gcp.sh</code></p></li><li><p>Deploying your multi-agent system</p></li><li><p>Understanding what you get: the Agent Engine endpoint</p></li><li><p>How this connects to the full-stack app in Part 9</p></li></ul><p><strong>Colab Notebook</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part8_deployment_agent_engine.ipynb">Part 8 &#8212; Agent Engine Deployment</a></p><h2><strong>Why Deployment Matters</strong></h2><h3><strong>The Local Development Trap</strong></h3><p>Your Content Creation Studio works perfectly locally. You run it in a Jupyter notebook or Python script:</p><pre><code># Local execution
async def create_content():
    session = await session_service.create_session(...)
    await run_agent_query(coordinator_agent, query, session, user_id)</code></pre><p><strong>But this has fundamental limitations:</strong></p><ul><li><p><strong>Single User</strong> &#8212; Only you can use it</p></li><li><p><strong>No Persistence</strong> &#8212; Restart = lose everything (InMemorySessionService)</p></li><li><p><strong>Your Computer</strong> &#8212; Requires your machine to be running</p></li><li><p><strong>No Scalability</strong> &#8212; Can&#8217;t handle multiple requests</p></li><li><p><strong>No Monitoring</strong> &#8212; Can&#8217;t track performance or errors</p></li><li><p><strong>No API</strong> &#8212; Can&#8217;t integrate with web apps or other services</p></li><li><p><strong>No Reliability</strong> &#8212; Crashes stop everything</p></li></ul><h2><strong>The Cloud Deployment Requirements</strong></h2><p>To make your agent accessible to others, you need:</p><ul><li><p><strong>Accessible 24/7</strong> &#8212; Available from anywhere, anytime</p></li><li><p><strong>Scalable</strong> &#8212; Handles 1 or 1,000 concurrent users</p></li><li><p><strong>Reliable</strong> &#8212; Automatic restarts, error recovery</p></li><li><p><strong>Secure</strong> &#8212; Authentication, authorization, data encryption</p></li><li><p><strong>Fast</strong> &#8212; Low latency, optimized performance</p></li><li><p><strong>Observable</strong> &#8212; Logs, metrics, tracing</p></li><li><p><strong>API-First</strong> &#8212; HTTP endpoints for integration</p></li><li><p><strong>Cost-Effective</strong> &#8212; Pay for what you use</p></li></ul><p><strong>This is what cloud deployment enables.</strong></p><h3><strong>When Should You Deploy?</strong></h3><p><strong>Deploy to the cloud when:</strong></p><ul><li><p>Your agent system works reliably in local testing</p></li><li><p>You want others to use it (team, colleagues, demo purposes)</p></li><li><p>You need it integrated with a web app or service</p></li><li><p>You want 24/7 availability for testing and 
demos</p></li><li><p>You need to handle multiple concurrent requests</p></li><li><p>You want cloud monitoring and logging</p></li></ul><blockquote><p><em><strong>&#9888;&#65039; Note:</strong> This deployment approach is suitable for learning, testing, and demonstration. For production use, you&#8217;ll need to add comprehensive testing, CI/CD pipelines, evaluation frameworks, and other production-grade infrastructure not covered in this tutorial.</em></p></blockquote><h2><strong>Introducing Vertex AI Agent Engine</strong></h2><h3><strong>What is Agent Engine?</strong></h3><p><strong>Vertex AI Agent Engine</strong> is Google Cloud&#8217;s fully-managed platform for deploying and running Google ADK agents at scale. Think of it as &#8220;Cloud Run for AI Agents.&#8221;</p><p>According to Google Cloud documentation:</p><blockquote><p><em>&#8220;Vertex AI Agent Engine is a fully managed service that allows you to deploy reasoning engines (ADK agents) to a scalable, serverless infrastructure. It handles infrastructure, auto-scaling, monitoring, and provides a gRPC API for integration.&#8221;</em></p></blockquote><p><strong>In simple terms:</strong> You give Google your agent code, and they handle everything else.</p><h3><strong>The Power of Agent Engine</strong></h3><p><strong>1. Zero Infrastructure Management</strong></p><p>You don&#8217;t set up:</p><ul><li><p>Virtual machines or Kubernetes clusters</p></li><li><p>Container orchestration</p></li><li><p>Load balancers</p></li><li><p>Auto-scaling rules</p></li><li><p>Health checks</p></li><li><p>Network configuration</p></li></ul><p>Google handles all of it. You focus on your agent logic.</p><p><strong>2. Serverless Auto-Scaling</strong></p><pre><code>1 user  &#8594; 1 instance  &#8594; $X
10 users &#8594; 2 instances &#8594; $2X  (automatically scales up)
1 user  &#8594; 1 instance  &#8594; $X   (automatically scales down)
0 users &#8594; 0 instances &#8594; $0   (scales to zero!)</code></pre><p><strong>Pay only for actual usage.</strong> No idle servers burning money.</p><p><strong>3. Managed Reliability Features</strong></p><ul><li><p><strong>Automatic restarts</strong> &#8212; Agent crashes? Restarted instantly</p></li><li><p><strong>Health monitoring</strong> &#8212; Continuous health checks</p></li><li><p><strong>Multi-zone deployment</strong> &#8212; High availability across data centers</p></li><li><p><strong>Versioning</strong> &#8212; Deploy new versions without downtime</p></li><li><p><strong>Rollback</strong> &#8212; Instant rollback to previous versions</p></li></ul><p><strong>4. Built-in Observability</strong></p><ul><li><p><strong>Cloud Logging</strong> &#8212; All agent logs automatically collected</p></li><li><p><strong>Cloud Monitoring</strong> &#8212; Performance metrics, latency, errors</p></li><li><p><strong>Cloud Trace</strong> &#8212; Request tracing through your agent system</p></li><li><p><strong>Debuggable</strong> &#8212; Full visibility into deployed behavior</p></li></ul><p><strong>5. Secure by Default</strong></p><ul><li><p><strong>IAM integration</strong> &#8212; Fine-grained access control</p></li><li><p><strong>Service accounts</strong> &#8212; Secure authentication</p></li><li><p><strong>VPC integration</strong> &#8212; Private network deployment</p></li><li><p><strong>Encryption</strong> &#8212; Data encrypted in transit and at rest</p></li></ul><p><strong>6. Seamless ADK Integration</strong></p><p>Here&#8217;s the magic: <strong>Your local ADK agent code deploys with minimal changes.</strong></p><pre><code># Local execution
agent = create_content_creation_coordinator()
runner = Runner(agent=agent, session_service=session_service)
await runner.run_async(...)

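# Cloud deployment - SAME agent code!
import vertexai
from vertexai import agent_engines

vertexai.init(project="my-project", location="us-central1",
              staging_bucket="gs://my-project-content-studio")
deployed_agent = agent_engines.create(
    agent_engine=agent_engines.AdkApp(agent=agent),
    requirements=["google-cloud-aiplatform[adk,agent_engines]"],
)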
# Done! Now accessible via gRPC API</code></pre><p>The agent you built locally <strong>just works</strong> in the cloud.</p><h2><strong>Prerequisites: Setting Up Google Cloud</strong></h2><p>Before deploying your agent, you need to prepare your Google Cloud environment. We&#8217;ve created a <strong>comprehensive setup script</strong> that handles everything automatically.</p><h3><strong>Step 0: Create a Google Cloud Project</strong></h3><ol><li><p>Go to <a href="https://console.cloud.google.com/">Google Cloud Console</a></p></li><li><p>Create a new project (or select an existing one)</p></li><li><p>Enable billing (required for Agent Engine and Cloud Run)</p></li><li><p>Note your <strong>Project ID</strong> (e.g., <code>my-content-studio</code>)</p></li></ol><h3><strong>Step 1: Run the GCP Setup Script</strong></h3><p>We provide <code>setup_gcp.sh</code> which automates your entire Google Cloud environment setup.</p><p><strong>&#128194; Location:</strong> <code>content_creation_mas/deployment/setup_gcp.sh</code></p><p><strong>What Does </strong><code>setup_gcp.sh</code><strong> Do?</strong></p><p>This script is your one-stop setup for Google Cloud. Here&#8217;s everything it handles:</p><p><strong>1. Environment Configuration</strong></p><ul><li><p>Loads your <code>.env</code> file with <code>GOOGLE_CLOUD_PROJECT</code> and <code>GOOGLE_CLOUD_LOCATION</code></p></li><li><p>Validates required variables</p></li><li><p>Sets gcloud defaults</p></li></ul><p><strong>2. Enables Required Google Cloud APIs</strong></p><pre><code>&#10003; aiplatform.googleapis.com           # Vertex AI / Agent Engine
&#10003; run.googleapis.com                  # Cloud Run (Part 9)
&#10003; cloudbuild.googleapis.com           # Cloud Build
&#10003; artifactregistry.googleapis.com     # Docker registry
&#10003; storage.googleapis.com              # Cloud Storage
&#10003; iam.googleapis.com                  # IAM
&#10003; cloudresourcemanager.googleapis.com # Resource Manager</code></pre><p><strong>3. Creates Artifact Registry Repository</strong></p><ul><li><p>Creates a Docker repository: <code>content-studio</code></p></li><li><p>Location: <code>{region}-docker.pkg.dev/{project}/content-studio</code></p></li><li><p>This stores your Cloud Run container images (Part 9)</p></li></ul><p><strong>4. Configures Docker Authentication</strong></p><ul><li><p>Authenticates Docker with Artifact Registry</p></li><li><p>Allows pushing images: <code>docker push {region}-docker.pkg.dev/...</code></p></li></ul><p><strong>5. Creates Service Account</strong></p><ul><li><p>Name: <code>content-studio-sa</code></p></li><li><p>Email: <code>content-studio-sa@{project}.iam.gserviceaccount.com</code></p></li><li><p>This service account will run your deployed services</p></li></ul><p><strong>6. Grants IAM Roles</strong></p><ul><li><p><code>roles/aiplatform.user</code> - Access Agent Engine</p></li><li><p><code>roles/run.invoker</code> - Invoke Cloud Run services</p></li><li><p><code>roles/storage.objectViewer</code> - Read from Cloud Storage</p></li><li><p><code>roles/logging.logWriter</code> - Write logs</p></li></ul><p><strong>7. Creates Cloud Storage Bucket</strong></p><ul><li><p>Bucket: <code>gs://{project}-content-studio</code></p></li><li><p>Used for Agent Engine staging and assets</p></li></ul><p><strong>Output:</strong></p><pre><code>========================================
  Setup Complete!
========================================

Summary:
  Project: my-content-studio
  Region: us-central1
  Artifact Registry: us-central1-docker.pkg.dev/my-content-studio/content-studio
  Service Account: content-studio-sa@my-content-studio.iam.gserviceaccount.com
  Storage Bucket: gs://my-content-studio-content-studio

Next Steps:
1. Deploy your agent to Agent Engine:
   cd deployment
   python deploy.py

2. Deploy frontend and backend to Cloud Run:
   cd deployment
   ./deploy-cloudrun.sh</code></pre><h3><strong>Running the Setup Script</strong></h3><pre><code># Navigate to deployment directory
cd content_creation_mas/deployment

# Create .env file with your project
cat &gt; .env &lt;&lt; EOF
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_API_KEY=your-api-key
EOF

# Make script executable
chmod +x setup_gcp.sh

# Run setup (takes 2-3 minutes)
./setup_gcp.sh</code></pre><p><strong>The script is interactive</strong> &#8212; it will show you what it will do and ask for confirmation before proceeding.</p><p><strong>Important:</strong> Run this script <strong>once</strong> before deploying. It prepares your entire Google Cloud environment.</p><h2><strong>Deploying Your Agent to Agent Engine</strong></h2><p>Now that your Google Cloud environment is ready, let&#8217;s deploy the Content Creation Studio!</p><h3><strong>Step 1: Prepare Your Environment</strong></h3><pre><code>cd content_creation_mas/deployment

# Verify .env file exists with these variables:
cat .env</code></pre><p>Required variables:</p><pre><code>GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_API_KEY=your-gemini-api-key
GOOGLE_CLOUD_STORAGE_BUCKET=gs://your-project-id-content-studio</code></pre><h3><strong>Step 2: Deploy with the Deployment Script</strong></h3><p>We provide <code>deploy.py</code> which handles the deployment:</p><pre><code># Deploy your agent
python deploy.py --action deploy

# What happens:
# 1. Initializes Vertex AI with your project/region/bucket
# 2. Imports your Content Creation Coordinator agent
# 3. Packages all dependencies
# 4. Uploads to Agent Engine
# 5. Configures auto-scaling and monitoring
# 6. Returns the Agent Engine resource name</code></pre><p><strong>Deployment takes 5&#8211;10 minutes</strong> &#8212; Google is:</p><ul><li><p>Packaging your agent code and dependencies</p></li><li><p>Creating the Agent Engine instance</p></li><li><p>Configuring networking and IAM</p></li><li><p>Running health checks</p></li><li><p>Making it available via gRPC API</p></li></ul><blockquote><p><em><strong>&#9888;&#65039; Important:</strong> This deployment is suitable for learning, development, testing, and demonstration only. For production use, you&#8217;d need to add comprehensive testing pipelines, CI/CD automation, model evaluation frameworks, advanced monitoring and alerting, security hardening, and disaster recovery strategies.</em></p></blockquote><p><strong>Output:</strong></p><pre><code>&#128640; Deploying agent to Vertex AI Agent Engine...

&#128230; Packaging agent code and dependencies...
&#10003; Agent code packaged

&#9729;&#65039;  Uploading to Agent Engine...
&#10003; Agent uploaded

&#9881;&#65039;  Configuring deployment...
&#10003; Deployment configured

&#127881; Deployment successful!

Resource Name:
projects/123456789/locations/us-central1/reasoningEngines/987654321

&#128203; Important: Save this resource name!
Add to your .env file:
AGENT_ENGINE_RESOURCE_NAME=projects/123456789/locations/us-central1/reasoningEngines/987654321

&#128279; Your agent is now accessible at this endpoint</code></pre><p><strong>This resource name is critical</strong> &#8212; it&#8217;s the unique identifier for your deployed agent.</p><h3><strong>Step 3: Save the Resource Name</strong></h3><pre><code># Add to your .env file
echo "AGENT_ENGINE_RESOURCE_NAME=projects/123456789/locations/us-central1/reasoningEngines/987654321" &gt;&gt; .env</code></pre><p><strong>Why is this important?</strong></p><ul><li><p>This resource name is the <strong>endpoint</strong> for your deployed agent</p></li><li><p>Your backend (Part 9) uses this to connect to the agent</p></li><li><p>It&#8217;s how other services query your agent via API</p></li></ul><h2><strong>Understanding What You Just Deployed</strong></h2><h3><strong>The Agent Engine Endpoint</strong></h3><p>Think of the <strong>resource name</strong> as the URL of your deployed agent:</p><pre><code>projects/123456789/locations/us-central1/reasoningEngines/987654321
           &#8593;                 &#8593;                      &#8593;
     Project number        Region            Agent Instance ID</code></pre><p>This endpoint:</p><ul><li><p>Is globally unique</p></li><li><p>Is accessible via gRPC API</p></li><li><p>Handles authentication via IAM</p></li><li><p>Auto-scales based on demand</p></li><li><p>Is monitored and logged automatically</p></li></ul><h3><strong>How the Endpoint Works</strong></h3><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;  Your Application   &#9474;  (Backend server, Cloud Run, etc.)
&#9474;  (Part 9)           &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
           &#9474;
           &#9474; gRPC API call with:
           &#9474; - resource_name
           &#9474; - user_query
           &#9474; - session_id
           &#8595;
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;  Vertex AI Agent Engine                 &#9474;
&#9474;  projects/.../reasoningEngines/...      &#9474;
&#9474;                                         &#9474;
&#9474;  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;    &#9474;
&#9474;  &#9474;  Your Content Creation Studio  &#9474;    &#9474;
&#9474;  &#9474;  - 11 Specialist Agents        &#9474;    &#9474;
&#9474;  &#9474;  - Sequential/Parallel/Loop    &#9474;    &#9474;
&#9474;  &#9474;  - Session Management          &#9474;    &#9474;
&#9474;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;    &#9474;
&#9474;                                         &#9474;
&#9474;  Auto-scaling: 0-N instances           &#9474;
&#9474;  Monitoring: Logs, Metrics, Traces     &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                  &#9474;
                  &#9474; Calls Gemini API
                  &#8595;
         &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
         &#9474;   Gemini 2.5     &#9474;
         &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre><p><strong>The flow:</strong></p><ol><li><p>Your backend sends a request to the Agent Engine endpoint</p></li><li><p>Agent Engine routes it to your Content Creation Studio agent</p></li><li><p>Your agent executes (11 specialist agents working together)</p></li><li><p>Agent calls Gemini API for LLM inference</p></li><li><p>Response streams back through Agent Engine to your backend</p></li><li><p>Your backend serves it to the user</p></li></ol><p><strong>Everything in between is managed by Google.</strong></p><h2><strong>Testing Your Deployed Agent</strong></h2><h3><strong>Using Python Client</strong></h3><pre><code>import vertexai
from vertexai import agent_engines
import os

# Initialize Vertex AI
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION")
RESOURCE_NAME = os.getenv("AGENT_ENGINE_RESOURCE_NAME")
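
# Optional sanity check (a sketch, not part of the original script). The
# helper name check_resource_name is hypothetical; it only confirms the value
# has the projects/.../locations/.../reasoningEngines/... shape shown earlier.
import re

def check_resource_name(name):
    """Return True if name looks like an Agent Engine resource name."""
    pattern = r"projects/[^/]+/locations/[^/]+/reasoningEngines/[^/]+"
    return isinstance(name, str) and re.fullmatch(pattern, name) is not None

# Example use before connecting:
#   if not check_resource_name(RESOURCE_NAME):
#       raise ValueError("AGENT_ENGINE_RESOURCE_NAME is missing or malformed")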

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Connect to deployed agent
agent = agent_engines.get(RESOURCE_NAME)

# Query the agent
response = agent.query(
    query="Create content about sustainable living for eco-conscious millennials"
)

print(response)</code></pre><h3><strong>Using gcloud CLI</strong></h3><pre><code># List your deployed agents
gcloud beta ai reasoning-engines list \
    --project=your-project-id \
    --location=us-central1

# Get details about your specific agent
gcloud beta ai reasoning-engines describe \
    projects/123456789/locations/us-central1/reasoningEngines/987654321</code></pre><h3><strong>Checking Logs</strong></h3><pre><code># View recent logs from your agent
gcloud logging read \
    'resource.type="aiplatform.googleapis.com/ReasoningEngine"' \
    --project=your-project-id \
    --limit=50 \
    --format=json

# Filter for errors
gcloud logging read \
    'resource.type="aiplatform.googleapis.com/ReasoningEngine" AND severity&gt;=ERROR' \
    --project=your-project-id</code></pre><h2><strong>The Bridge to Part 9: Full-Stack Deployment</strong></h2><h3><strong>What We Have Now</strong></h3><p>After completing Part 8, you have:</p><ul><li><p><strong>Agent Engine Endpoint</strong> &#8212; Your multi-agent system running in the cloud</p></li><li><p><strong>Resource Name</strong> &#8212; The API identifier to access it</p></li><li><p><strong>Auto-scaling Infrastructure</strong> &#8212; Scales with demand</p></li><li><p><strong>Built-in Monitoring</strong> &#8212; Logs and metrics</p></li><li><p><strong>gRPC API</strong> &#8212; Programmatic access</p></li></ul><h3><strong>What&#8217;s Missing</strong></h3><p>But users can&#8217;t interact with it yet because:</p><ul><li><p>No user interface (UI)</p></li><li><p>No web backend to handle HTTP requests</p></li><li><p>No authentication flow</p></li><li><p>No session management for web users</p></li><li><p>No public URL</p></li></ul><p><strong>This is what Part 9 solves. Part 8 provides the AI backend.</strong> <strong>Part 9 provides the user-facing application.</strong></p><p>Together, they form a complete, end-to-end system accessible to anyone with a web browser.</p><h2><strong>What&#8217;s Next?</strong></h2><p>Your AI agents are now running in the cloud! 
But users still can&#8217;t interact with them through a web interface.</p><p>In <strong>Part 9: Full-Stack Deployment with Cloud Run</strong>, we&#8217;ll complete the stack:</p><ul><li><p><strong>React Frontend</strong> &#8212; Beautiful web UI for content creation</p></li><li><p><strong>FastAPI Backend</strong> &#8212; REST API that connects to Agent Engine</p></li><li><p><strong>Docker Containerization</strong> &#8212; Package frontend + backend</p></li><li><p><strong>Cloud Run Deployment</strong> &#8212; Serverless hosting with auto-scaling</p></li><li><p><strong>Complete Integration</strong> &#8212; Users &#8594; Web App &#8594; Agent Engine &#8594; Gemini</p></li></ul><p><strong>The final piece of the puzzle!</strong></p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><p><strong>Colab Notebook</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part8_deployment_agent_engine.ipynb">Part 8 &#8212; Agent Engine Deployment</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 7]]></title><description><![CDATA[Complete Multi-Agent System &#8212; The Capstone]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-679</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-679</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 19:00:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 7 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong> &#8212; the capstone! You&#8217;ve mastered individual concepts. 
Now we&#8217;re bringing <strong>everything together</strong> into one sophisticated system.</p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities</a> </p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with SequentialAgent</a></p></li><li><p><strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone (You are here)</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent Engine</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 9</a></strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Full-Stack Deployment with Cloud Run</a></p></li></ol><p><strong>The Journey So Far:</strong></p><ul><li><p>Part 1&#8211;2: Agents and tools</p></li><li><p>Part 3: Agent teams</p></li><li><p>Part 4: Sequential workflows</p></li><li><p>Part 5: Iterative improvement</p></li><li><p>Part 6: Parallel execution</p></li></ul><p><strong>Today</strong>: We combine ALL patterns into one complete multi-agent system!</p><p><strong>Colab</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part7_capstone_project.ipynb">Part 7</a></p><h3><strong>Concept: Hierarchical Orchestration</strong></h3><blockquote><p><em><strong>4-Layer Architecture:</strong></em></p></blockquote><ul><li><p>Layer 1: Master Orchestrator (routing)</p></li><li><p>Layer 2: Sub-Workflows (Sequential, Loop, Parallel)</p></li><li><p>Layer 3: Specialist Agents (11 agents)</p></li><li><p>Layer 4: Tools (custom + built-in)</p></li></ul><p><strong>Benefits:</strong></p><ul><li><p>Clean separation of concerns</p></li><li><p>Easy to extend</p></li><li><p>Testable components</p></li><li><p>Scalable design</p></li></ul><h3><strong>Concept: End-to-End Autonomous Workflows</strong></h3><blockquote><p><em><strong>Complete task execution with minimal human intervention:</strong> User Request &#8594; Parse &#8594; Research &#8594; Draft &#8594; Improve (Loop) &#8594; Multi-Channel (Parallel) &#8594; Package &#8594; Deliver</em></p></blockquote><p>All automatic!</p><h3><strong>The Complete Architecture</strong></h3><pre><code>User Query
     &#8595;
Master Orchestrator
     &#9500;&#9472;&#8594; Full Content Workflow
     &#9474;        &#9500;&#9472; Intake Agent
     &#9474;        &#9500;&#9472; Sequential (Research + Draft)
     &#9474;        &#9500;&#9472; Loop (Quality Check + Improve)
     &#9474;        &#9500;&#9472; Parallel (Blog + Social + Email + SEO)
     &#9474;        &#9492;&#9472; Final Packager
     &#9474;
     &#9492;&#9472;&#8594; Content Analyzer (simple analysis)</code></pre><h2><strong>Building the Complete System</strong></h2><p><em>(Due to length, showing key components)<br></em><strong>All 11 Specialist Agents</strong></p><pre><code># 1. intake_agent
# 2. topic_research_agent
# 3. content_drafter_agent
# 4. quality_checker_agent
# 5. content_improver_agent
# 6. blog_post_writer_agent
# 7. social_media_creator_agent
# 8. email_newsletter_writer_agent
# 9. seo_metadata_agent
# 10. content_analyzer_agent
# 11. final_packager_agent</code></pre><h3><strong>Complete Workflow Assembly</strong></h3><pre><code>from google.adk.agents import SequentialAgent, LoopAgent, ParallelAgent

# Sequential: Research + Draft
research_and_draft_workflow = SequentialAgent(
    name="research_and_draft_workflow",  # ADK agents require a name
    sub_agents=[topic_research_agent, content_drafter_agent]
)

# Loop: Quality Improvement
quality_improvement_loop = LoopAgent(
    name="quality_improvement_loop",
    sub_agents=[quality_checker_agent, content_improver_agent],
    max_iterations=3
)
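
# Sketch (an assumption, not shown in the original): LoopAgent stops after
# max_iterations, or earlier when a sub-agent escalates. A tool like this
# hypothetical exit_loop, added to quality_checker_agent's tools, is one way
# to end the loop once the draft passes review. In real use, annotate the
# parameter with google.adk.tools.ToolContext as the intake agent tool does.
def exit_loop(tool_context):
    """Signal the enclosing LoopAgent to stop iterating."""
    tool_context.actions.escalate = True
    return {"status": "loop exited"}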

# Parallel: Multi-Channel Content
parallel_content_creation = ParallelAgent(
    name="parallel_content_creation",
    sub_agents=[
        blog_post_writer_agent,
        social_media_creator_agent,
        email_newsletter_writer_agent,
        seo_metadata_agent
    ]
)
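
# Timing sketch (stdlib only, not ADK code): when independent tasks run
# concurrently, wall-clock time tracks the slowest task, not the sum. The
# fake_agent coroutine and its delays are made up for illustration.
import asyncio
import time

async def fake_agent(name, seconds):
    await asyncio.sleep(seconds)  # stands in for an LLM call
    return name

async def run_concurrently():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_agent("blog", 0.2),
        fake_agent("social", 0.1),
        fake_agent("email", 0.1),
        fake_agent("seo", 0.1),
    )
    return results, time.perf_counter() - start

# In a plain script: results, elapsed = asyncio.run(run_concurrently())
# elapsed should land near 0.2 s (the slowest task), not the 0.5 s sum.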

# Full Content Workflow
full_content_workflow = SequentialAgent(
    name="full_content_workflow",
    sub_agents=[
        intake_agent,
        research_and_draft_workflow,
        quality_improvement_loop,
        parallel_content_creation,
        final_packager_agent
    ]
)

# Master Orchestrator
from google.adk.agents import Agent
from google.adk.tools.agent_tool import AgentTool

master_orchestrator_agent = Agent(
    name="master_orchestrator_agent",
    model="gemini-2.5-flash",
    instruction="""
    You are the Master Content Creation Studio orchestrator.

    - For FULL content creation, use `full_content_workflow_tool`.
    - For ANALYZING existing text, use `content_analyzer_tool`.

    Always delegate. Present responses clearly.
    """,
    tools=[
        AgentTool(agent=full_content_workflow),
        AgentTool(agent=content_analyzer_agent)
    ]
)</code></pre><h3><strong>Testing the Complete System</strong></h3><pre><code>async def run_capstone_project():
    session = await session_service.create_session(
        app_name=master_orchestrator_agent.name,
        user_id=user_id
    )

    # Query 1: Full Content Creation
    query1 = """
    Create a complete content package for:
    - Topic: Productivity hacks using AI for remote workers
    - Target Audience: Remote professionals and digital nomads
    - Tone: Conversational and helpful
    - Keywords: AI productivity, remote work, automation tools
    """

    # Query 2: Quick Analysis
    sample_text = "Remote work has transformed productivity..."
    query2 = f"Analyze this text:\n\n{sample_text}"

    # Run both queries...</code></pre><h2><strong>What&#8217;s Next?</strong></h2><p><strong>Part 8</strong>: Deploy to Google Cloud&#8217;s Agent Engine<br><strong>Part 9</strong>: Full-stack deployment to Cloud Run</p><p><strong>GitHub</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 6]]></title><description><![CDATA[Parallel AI Workflows &#8212; 4x Faster Content Creation]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-aed</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-aed</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 18:58:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 6 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong>! We&#8217;ve built an incredible system of agents that can delegate, follow ordered plans, and even iterate to solve problems. In this lesson, we&#8217;ll tackle a new dimension: <strong>efficiency</strong>.</p><p>Our workflows so far have been linear. What if we need to perform multiple, independent tasks at the same time? For this, we use <strong>ParallelAgent</strong>. 
Today, we&#8217;ll build a workflow that can simultaneously create blog posts, social media content, and email newsletters &#8212; all from a single request, creating our most efficient agent yet.</p><h2><strong>Prerequisites</strong></h2><p>This article builds on our previous work. Please ensure you&#8217;re familiar with the concepts from the entire series.</p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities </a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with SequentialAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent (You are here)</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent 
Engine</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 9</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Full-Stack Deployment with Cloud Run</a></p></li></ol><h2><strong>Introduction</strong></h2><p><strong>The Efficiency Problem:</strong></p><pre><code>Sequential: Blog (10s) &#8594; Social (10s) &#8594; Email (10s) &#8594; SEO (10s) = 40 seconds
Parallel:   [Blog | Social | Email | SEO] all run together = 10 seconds</code></pre><p><strong>Same result, 4x faster!</strong></p><p>Today, you&#8217;ll learn how to build parallel workflows using ParallelAgent &#8212; executing multiple agents simultaneously for maximum efficiency.</p><p><strong>Colab</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part6_parallel_workflows.ipynb">Part 6</a></p><h3><strong>&#127381; Concept: ParallelAgent</strong></h3><blockquote><p><em><strong>What is ParallelAgent?</strong> A workflow agent that executes ALL sub-agents concurrently (simultaneously). Perfect for independent tasks that don&#8217;t depend on each other.</em></p></blockquote><p><strong>Key Characteristics:</strong></p><ul><li><p>Runs all sub-agents at the same time</p></li><li><p>Total time = longest single agent (not sum!)</p></li><li><p>Each sub-agent works independently</p></li><li><p>Collects all results via <code>output_key</code></p></li></ul><p><strong>When to use:</strong></p><ul><li><p>Multiple independent tasks</p></li><li><p>Content for different channels</p></li><li><p>Information gathering from multiple sources</p></li></ul><p>&#128214; <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h3><strong>&#127381; Concept: Fan-Out/Fan-In Pattern</strong></h3><blockquote><p><em><strong>Architectural pattern for parallel processing:</strong></em></p></blockquote><ul><li><p><strong>Fan-Out</strong>: Distribute single input to multiple workers</p></li><li><p><strong>Fan-In</strong>: Collect all parallel results and combine</p></li></ul><pre><code>Input Brief
     &#8595;
  Fan-Out
     &#9500;&#9472;&#9472;&#8594; Blog Writer
     &#9500;&#9472;&#9472;&#8594; Social Creator
     &#9500;&#9472;&#9472;&#8594; Email Writer
     &#9492;&#9472;&#9472;&#8594; SEO Generator
     &#8595;
  Fan-In
     &#8595;
Complete Package</code></pre><h3><strong>&#127381; Concept: Intake Agent Pattern</strong></h3><blockquote><p><em><strong>Pattern for parsing natural language into structured data:</strong> Instead of requiring structured input, an intake agent extracts parameters from conversational requests.</em></p></blockquote><p><strong>Example:</strong></p><pre><code>User: &#8220;Create content about AI for small businesses, friendly tone&#8221;
     &#8595;
Intake Agent extracts:
     topic = &#8220;AI for small businesses&#8221;
     tone = &#8220;friendly&#8221;
     audience = &#8220;small business owners&#8221;
     &#8595;
Stores in session.state for other agents</code></pre><h2><strong>Building a Parallel Content Factory</strong></h2><h3><strong>Setup</strong></h3><pre><code>!pip install google-adk==1.19.0 -q</code></pre><h3><strong>Step 1: Intake Agent with Session State</strong></h3><pre><code>from google.adk.tools import ToolContext
from google.adk.agents import Agent

def update_session_state(
    tool_context: ToolContext,
    topic: str,
    target_audience: str,
    tone: str,
    keywords: str
) -&gt; str:
    """
    Saves extracted content brief parameters to session state.
    """
    print(f"&#128295; Updating session state...")
    print(f"   Topic: {topic}")
    print(f"   Audience: {target_audience}")
    print(f"   Tone: {tone}")

    tool_context.state['topic'] = topic
    tool_context.state['target_audience'] = target_audience
    tool_context.state['tone'] = tone
    tool_context.state['keywords'] = keywords

    return "Session state updated with content brief parameters."

intake_agent = Agent(
    name="intake_agent",
    model="gemini-2.5-flash",
    instruction="""
    You are a content brief analyzer. From the user&#8217;s request, identify:
    - The main topic
    - The target audience
    - The desired tone
    - Key SEO keywords (comma-separated)

    Then call the `update_session_state` tool with the extracted values.
    """,
    tools=[update_session_state]
)

print("&#129502; Intake agent created!")</code></pre><h3><strong>Step 2: Create Parallel Content Creators</strong></h3><pre><code># Agent 1: Blog Post Writer
blog_post_writer_agent = Agent(
    name="blog_post_writer_agent",
    model="gemini-2.5-flash",
    instruction="""
    Write a complete blog post about: {{topic}}

    Target audience: {{target_audience}}
    Tone: {{tone}}

    Requirements:
    - 600-800 words
    - Engaging introduction
    - 3-4 H2 headings
    - Clear call-to-action

    Output only the blog post in markdown.
    """,
    tools=[],
    output_key="blog_post"
)

# Agent 2: Social Media Creator
social_media_creator_agent = Agent(
    name="social_media_creator_agent",
    model="gemini-2.5-flash",
    instruction="""
    Create social posts about: {{topic}}

    Target audience: {{target_audience}}
    Tone: {{tone}}

    Create THREE posts:

    **1. LinkedIn Post** (150-200 words)
    - Professional and insightful
    - 3-4 professional hashtags

    **2. Twitter/X Thread** (3-4 tweets, 280 chars each)
    - Engaging thread with hashtags
    - Call-to-action in last tweet

    **3. Instagram Caption** (100-150 words)
    - Engaging with emojis
    - 8-10 hashtags at end

    Format clearly with headers for each platform.
    """,
    tools=[],
    output_key="social_media_content"
)

# Agent 3: Email Newsletter Writer
email_newsletter_writer_agent = Agent(
    name="email_newsletter_writer_agent",
    model="gemini-2.5-flash",
    instruction="""
    Write an email newsletter about: {{topic}}

    Target audience: {{target_audience}}
    Tone: {{tone}}

    Structure:
    - **Subject Line**: Compelling (50-60 chars)
    - **Preview Text**: Enticing (40-50 chars)
    - **Body** (300-400 words):
      * Personal greeting
      * Engaging introduction
      * 2-3 key points
      * Clear call-to-action
      * Friendly sign-off

    Format with clear sections.
    """,
    tools=[],
    output_key="email_newsletter"
)

# Agent 4: SEO Metadata Generator
seo_metadata_agent = Agent(
    name="seo_metadata_agent",
    model="gemini-2.5-flash",
    instruction="""
    Generate SEO metadata for content about: {{topic}}

    Target keywords: {{keywords}}

    Create:
    1. **Meta Title** (50-60 characters)
    2. **Meta Description** (150-160 characters)
    3. **URL Slug** (lowercase with hyphens)
    4. **Focus Keyword**
    5. **5 Related Keywords**
    6. **3 Internal Link Suggestions**

    Format as structured list.
    """,
    tools=[],
    output_key="seo_metadata"
)

print("&#129502; All parallel content creator agents created!")</code></pre><h3><strong>Step 3: Build the Parallel Workflow</strong></h3><pre><code>from google.adk.agents import ParallelAgent

# Create the parallel workflow (Fan-Out)
parallel_content_creation = ParallelAgent(
    name="parallel_content_creation",
    sub_agents=[
        blog_post_writer_agent,
        social_media_creator_agent,
        email_newsletter_writer_agent,
        seo_metadata_agent
    ]
)
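
# Why fan-out helps, sketched with plain asyncio rather than ADK internals.
# Assumption: the 0.05-second sleeps stand in for model calls, and the helper
# names below are hypothetical. Concurrent tasks finish in roughly the time of
# the slowest one, not the sum of all four.
import asyncio
import time

async def _fake_agent(name, seconds=0.05):
    await asyncio.sleep(seconds)  # placeholder for a slow LLM request
    return name

async def _fan_out_demo():
    start = time.perf_counter()
    # gather() schedules all four coroutines at once and keeps input order.
    results = await asyncio.gather(
        _fake_agent("blog"),
        _fake_agent("social"),
        _fake_agent("email"),
        _fake_agent("seo"),
    )
    return results, time.perf_counter() - start

# In a notebook: results, elapsed = await _fan_out_demo()
# In a script:   results, elapsed = asyncio.run(_fan_out_demo())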

print("&#9989; Parallel workflow created!")</code></pre><h3><strong>Step 4: Add Synthesizer (Fan-In)</strong></h3><pre><code># Synthesizer combines all parallel outputs
content_package_synthesizer_agent = Agent(
    name="content_package_synthesizer_agent",
    model="gemini-2.5-flash",
    instruction="""
    Combine all created content into one comprehensive package.

    You have:
    - Blog post: {{blog_post}}
    - Social media content: {{social_media_content}}
    - Email newsletter: {{email_newsletter}}
    - SEO metadata: {{seo_metadata}}

    Create a well-organized content package with:
    1. **&#128221; Blog Post** section
    2. **&#128241; Social Media Content** section
    3. **&#128231; Email Newsletter** section
    4. **&#128269; SEO Metadata** section

    Add brief executive summary at top.
    """
)

print("&#129502; Synthesizer agent created!")</code></pre><h3><strong>Step 5: Complete Workflow</strong></h3><pre><code>from google.adk.agents import SequentialAgent

full_parallel_workflow = SequentialAgent(
    name="full_parallel_workflow",
    sub_agents=[
        intake_agent,                           # Parse brief
        parallel_content_creation,              # Fan-out (parallel)
        content_package_synthesizer_agent      # Fan-in
    ]
)

print("&#9989; Complete parallel workflow assembled!")</code></pre><h3><strong>Testing the System</strong></h3><pre><code>from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai.types import Content, Part
from IPython.display import display, Markdown

session_service = InMemorySessionService()
user_id = "adk_content_creator_001"

async def run_parallel_content_creation():
    session = await session_service.create_session(
        app_name=full_parallel_workflow.name,
        user_id=user_id
    )

    query = """
    Create a complete content package for:
    - Topic: Using AI tools to boost small business productivity
    - Target Audience: Small business owners and solopreneurs
    - Tone: Friendly and approachable, but professional
    - Keywords: AI productivity, small business automation, AI tools for business
    """

    print(f"&#128100; User Content Brief:\n{query}\n")

    runner = Runner(
        agent=full_parallel_workflow,
        session_service=session_service,
        app_name=full_parallel_workflow.name
    )

    async for event in runner.run_async(
        user_id=user_id,
        session_id=session.id,
        new_message=Content(parts=[Part(text=query)], role="user")
    ):
        # Each sub-agent emits its own final response; display only the synthesizer's.
        if (
            event.is_final_response()
            and event.author == content_package_synthesizer_agent.name
            and event.content
            and event.content.parts
        ):
            print("\n" + "=" * 60)
            print("&#9989; FINAL CONTENT PACKAGE:")
            print("=" * 60)
            display(Markdown(event.content.parts[0].text))
            print("=" * 60)

await run_parallel_content_creation()</code></pre><p><strong>Sequential Execution (Parts 1&#8211;5):</strong></p><pre><code>Blog (10s) &#8594; Social (10s) &#8594; Email (10s) &#8594; SEO (10s) = 40 seconds total</code></pre><p><strong>Parallel Execution (Part 6):</strong></p><pre><code>Blog    (10s) &#9488;
Social  (10s) &#9500;&#9472; All run together
Email   (10s) &#9508;
SEO     (10s) &#9496;
= 10 seconds total (4x faster!)</code></pre><h2><strong>What&#8217;s Next?</strong></h2><p>We&#8217;ve mastered all workflow patterns! Now it&#8217;s time to <strong>combine them all</strong>.</p><p><strong>Part 7: The Capstone Project</strong> builds a complete production-ready system:</p><ul><li><p>Intake &#8594; Sequential &#8594; Loop &#8594; Parallel &#8594; Package</p></li><li><p>All patterns working together</p></li><li><p>11 specialist agents</p></li><li><p>Hierarchical orchestration</p></li></ul><p><strong>Colab</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part7_capstone_project.ipynb">Part 7</a></p><p><strong>GitHub</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. 
Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 5]]></title><description><![CDATA[Self-Improving AI &#8212; Building Iterative Workflows with LoopAgent]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-6b7</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-6b7</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 18:57:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 5 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong>! You&#8217;ve built agents that work in sequence. 
Now let&#8217;s add intelligence &#8212; agents that <strong>critique and improve their own work</strong></p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities </a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with SequentialAgent</a></p></li><li><p><strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent (You are here)</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent Engine</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 9</a></strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-9b6?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Full-Stack Deployment with Cloud Run</a></p></li></ol><h2><strong>Introduction</strong></h2><p>First drafts are rarely perfect. Professional writers know this &#8212; which is why they revise. But what if your AI agent could <strong>critique and improve its own work</strong> until it meets quality standards?</p><p>Today, you&#8217;ll build a <strong>self-improving AI system</strong> using LoopAgent &#8212; a workflow that iteratively refines content until it passes quality gates. No human intervention required.</p><p><strong>What you&#8217;ll learn:</strong></p><ul><li><p>LoopAgent for iterative workflows</p></li><li><p>ToolContext for runtime control</p></li><li><p><code>tool_context.actions.escalate</code> for loop termination</p></li><li><p>The Critique-Refine Pattern</p></li></ul><p><strong>Colab Notebook</strong>: <a href="https://colab.research.google.com/github/Saoussen-CH/content_creation_mas_workshop/blob/main/content_creation_mas/notebooks/part5_iterative_workflows.ipynb">Part 5</a></p><h2><strong>The Quality Problem</strong></h2><pre><code># Sequential workflow from Part 4
SequentialAgent(name="content_pipeline", sub_agents=[research, draft, format])
# Problem: What if the draft is poor quality?</code></pre><p>We need:</p><pre><code># Loop until quality meets threshold
LoopAgent(name="quality_loop", sub_agents=[check_quality, improve], max_iterations=3)
# Stops when: quality &gt;= 70 OR 3 iterations reached</code></pre><h3><strong>&#127381; Concept: LoopAgent</strong></h3><blockquote><p><em><strong>What is LoopAgent?</strong> A workflow agent that repeatedly executes its sub-agents, in order, until either:</em></p></blockquote><ul><li><p>A tool signals completion (<code>tool_context.actions.escalate = True</code>)</p></li><li><p>The <code>max_iterations</code> limit is reached</p></li></ul><blockquote><p><em>Perfect for quality improvement, optimization, and refinement tasks.</em></p></blockquote><p><strong>Exit mechanisms:</strong></p><ul><li><p><strong>Iteration cap</strong>: <code>max_iterations=3</code> runs at most 3 passes (exactly 3 if nothing escalates)</p></li><li><p><strong>Conditional exit</strong>: Tool sets <code>escalate = True</code> to stop early</p></li></ul><p>&#128214; <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h3><strong>&#127381; Concept: ToolContext</strong></h3><blockquote><p><em><strong>What is ToolContext?</strong> A special parameter, supplied by ADK when a tool function declares it, that gives tools access to runtime information and control over workflow behavior.</em></p></blockquote><p><strong>Usage:</strong></p><pre><code>def my_tool(tool_context: ToolContext, param: str):
    # Access session state
    data = tool_context.state

    # Control workflow
    tool_context.actions.escalate = True  # Exit loop!</code></pre><p><strong>Capabilities:</strong></p><ul><li><p>Access session state via <code>tool_context.state</code></p></li><li><p>Control workflows via <code>tool_context.actions</code></p></li><li><p>Get runtime context</p></li></ul><p>&#128214; <a href="https://google.github.io/adk-docs/tools/">Tool Context</a></p><h3><strong>&#127381; Concept: tool_context.actions.escalate</strong></h3><blockquote><p><em><strong>The Escalate Flag</strong> Setting </em><code>tool_context.actions.escalate = True</code><em> signals to LoopAgent: &#8220;We&#8217;re done, exit now!&#8221;</em></p></blockquote><p><strong>Pattern:</strong></p><pre><code>def exit_loop(tool_context: ToolContext):
    tool_context.actions.escalate = True
    return {"result": "Quality threshold met"}</code></pre><p><strong>When to use:</strong></p><ul><li><p>Quality thresholds met</p></li><li><p>Goal achieved</p></li><li><p>Condition satisfied</p></li></ul><h2><strong>The Critique-Refine Pattern</strong></h2><p><em><strong>Architecture for autonomous quality improvement:</strong></em></p><ul><li><p><strong>Drafter Agent</strong>: Creates the initial version (runs once, before the loop)</p></li><li><p><strong>Checker Agent</strong>: Evaluates quality and calculates a score</p></li><li><p><strong>Improver Agent</strong>: Fixes the reported issues, or exits the loop if quality is met</p></li><li><p><strong>Loop</strong>: Checker &#8594; Improver &#8594; Checker &#8594; &#8230; until the threshold is met or <code>max_iterations</code> is reached</p></li></ul><h2><strong>Building the Quality Loop</strong></h2><h3><strong>Setup</strong></h3><pre><code>!pip install google-adk==1.19.0 -q</code></pre><h3><strong>Step 1: Define Quality Tools</strong></h3><pre><code>from google.adk.tools import ToolContext

def calculate_content_quality_score(
    word_count: int,
    readability_score: float,
    has_headings: bool,
    has_conclusion: bool
) -&gt; dict:
    """
    Calculates overall content quality score (0-100).
    Threshold for approval: 70+
    """
    print("&#128295; Calculating quality score...")

    # Word count scoring (optimal: 800-2000)
    if word_count &lt; 500:
        word_score = 30
    elif word_count &lt; 800:
        word_score = 60
    elif word_count &lt;= 2000:
        word_score = 100
    else:
        word_score = 80

    # Readability scoring
    read_score = min(100, readability_score * 1.5) if readability_score &gt; 0 else 40

    # Structure scoring
    structure_score = 0
    if has_headings:
        structure_score += 50
    if has_conclusion:
        structure_score += 50

    # Overall
    overall = (word_score * 0.3) + (read_score * 0.3) + (structure_score * 0.4)

    result = {
        "overall_score": round(overall, 2),
        "meets_threshold": overall &gt;= 70
    }

    print(f"   Score: {result['overall_score']}/100")
    return result
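
# Worked example of the weighting above (illustrative inputs, not a real run):
# a 350-word draft, estimated readability 50, headings present, no conclusion.
example_word_score = 30                    # 350 words lands in the under-500 bucket
example_read_score = min(100, 50 * 1.5)    # 75.0
example_structure_score = 50               # headings only, no conclusion
example_overall = (
    example_word_score * 0.3
    + example_read_score * 0.3
    + example_structure_score * 0.4
)
# 9.0 + 22.5 + 20.0 = 51.5, which is under 70, so the loop would run again.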

QUALITY_THRESHOLD_MET = "QUALITY_THRESHOLD_MET"

def exit_loop(tool_context: ToolContext):
    """Terminates loop when quality meets threshold."""
    print("&#128295; Quality approved! Terminating loop...")
    tool_context.actions.escalate = True  # &#8592; THE MAGIC
    return {"result": "Quality threshold met"}
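
# Loop semantics in miniature: a plain-Python sketch, not ADK code. The helper
# name and its list-of-booleans input are hypothetical; each entry says whether
# that pass would have ended with a call to exit_loop.
def simulate_quality_loop(threshold_met_per_pass, max_iterations=3):
    passes = 0
    for threshold_met in threshold_met_per_pass[:max_iterations]:
        passes += 1
        if threshold_met:  # stands in for tool_context.actions.escalate = True
            break
    return passes

# simulate_quality_loop([False, False, True]) stops on pass 3 via the exit tool;
# simulate_quality_loop([False] * 5) also stops after 3, via max_iterations.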

print("&#9989; Tools defined!")</code></pre><h3><strong>Step 2: Create the Agent Team</strong></h3><pre><code>from google.adk.agents import Agent

# Agent 1: Drafter (runs once)
content_drafter_agent = Agent(
    name="content_drafter_agent",
    model="gemini-2.5-flash",
    instruction="""
    Write a blog post about: {{topic}}

    Create a draft (300-500 words) with:
    - Engaging intro
    - At least one H2 heading
    - A conclusion

    Output only the content in markdown.
    """,
    tools=[],
    output_key="current_content"
)

# Agent 2: Quality Checker (runs each loop iteration)
quality_checker_agent = Agent(
    name="quality_checker_agent",
    model="gemini-2.5-flash",
    instruction=f"""
    Analyze: {{{{current_content}}}}

    Your job:
    1. Count approximate words
    2. Estimate readability (60+ is good)
    3. Check for headings
    4. Check for conclusion

    Use `calculate_content_quality_score` tool.

    Then:
    - IF overall_score &gt;= 70: respond '{QUALITY_THRESHOLD_MET}'
    - ELSE: respond 'Score: [X]. Issues: [specific problems]'
    """,
    tools=[calculate_content_quality_score],
    output_key="quality_feedback"
)

# Agent 3: Improver (runs each loop iteration)
content_improver_agent = Agent(
    name="content_improver_agent",
    model="gemini-2.5-flash",
    instruction=f"""
    Current content: {{{{current_content}}}}
    Feedback: {{{{quality_feedback}}}}

    - IF feedback is '{QUALITY_THRESHOLD_MET}': call `exit_loop` immediately
    - ELSE: improve based on issues:
      * Expand if short
      * Simplify if complex
      * Add headings if missing
      * Add conclusion if missing

    Output the COMPLETE improved content.
    """,
    tools=[exit_loop],
    output_key="current_content"  # &#8592; Overwrites for next iteration!
)

print("&#129502; Agent team created!")</code></pre><p><strong>Key insight</strong>: <code>current_content</code> gets overwritten each iteration, allowing content to evolve!</p><h3><strong>Step 3: Build the Loop</strong></h3><pre><code>from google.adk.agents import SequentialAgent, LoopAgent

# The iterative quality loop
quality_improvement_loop = LoopAgent(
    name="quality_improvement_loop",
    sub_agents=[quality_checker_agent, content_improver_agent],
    max_iterations=3  # Safety limit
)

# Complete workflow: Draft &#8594; Loop &#8594; Present
quality_workflow = SequentialAgent(
    name="quality_workflow",
    sub_agents=[
        content_drafter_agent,
        quality_improvement_loop,
        # Optional: final presenter agent
    ]
)

print("&#9989; Iterative workflow created!")</code></pre><h3><strong>Testing the Loop</strong></h3><pre><code>from IPython.display import display, Markdown
from google.adk.sessions import InMemorySessionService
from google.adk.runners import Runner
from google.genai.types import Content, Part

session_service = InMemorySessionService()
user_id = "adk_content_creator_001"

async def run_quality_workflow():
    topic = "The benefits of meditation for busy professionals"

    # Seed the topic when the session is created so {{topic}} resolves for the
    # drafter; mutating the returned session object afterwards may not persist.
    session = await session_service.create_session(
        app_name=quality_workflow.name,
        user_id=user_id,
        state={"topic": topic}
    )

    query = f"Create high-quality content about: {topic}"
    print(f"&#128100; User: {query}\n")

    runner = Runner(
        agent=quality_workflow,
        session_service=session_service,
        app_name=quality_workflow.name
    )

    async for event in runner.run_async(
        user_id=user_id,
        session_id=session.id,
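        new_message=Content(parts=[Part(text=query)], role="user"),
        state_delta={"topic": topic}
    ):
        # Show each agent's final text; skip events that carry no text content.
        if (
            event.is_final_response()
            and event.content
            and event.content.parts
            and event.content.parts[0].text
        ):
            display(Markdown(event.content.parts[0].text))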

await run_quality_workflow()</code></pre><h3><strong>Example Output</strong></h3><pre><code>&#128100; User: Create high-quality content about: The benefits of meditation...

[Iteration 1]
&#128221; Draft created (350 words, score: 55)
&#128295; Calculating quality score...
   Score: 55/100 - BELOW THRESHOLD
Issues: Too short, missing headings, needs more structure

[Iteration 2]
&#9999;&#65039; Improving content...
   Added H2 headings, expanded to 650 words
&#128295; Calculating quality score...
   Score: 68/100 - BELOW THRESHOLD
Issues: Almost there, needs better conclusion

[Iteration 3]
&#9999;&#65039; Final improvements...
   Enhanced conclusion, polished language
&#128295; Calculating quality score...
   Score: 75/100 - MEETS THRESHOLD &#9989;
&#128295; Quality approved! Terminating loop...

&#9989; Final approved content delivered!</code></pre><h2><strong>What&#8217;s Next?</strong></h2><p>We can create self-improving workflows! But what about <strong>efficiency</strong>?</p><p>Run four 10-second content agents one after another and you wait about 40 seconds. Run them concurrently and the wait drops to roughly 10 seconds, up to 4x faster.</p><p><strong>Part 6: Parallel AI Workflows</strong> introduces ParallelAgent for concurrent execution.</p><h2><strong>Try It Yourself!</strong></h2><p>Ready to build loop workflows? The full code is in the repository below:</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><p><strong>Happy looping!</strong></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item><item><title><![CDATA[Google ADK: From Local Development to Vertex AI Deployment: Part 4]]></title><description><![CDATA[Building Agent Teams: The Orchestrator Pattern]]></description><link>https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-c50</link><guid isPermaLink="false">https://saoussenchaabnia.substack.com/p/google-adk-from-local-development-c50</guid><dc:creator><![CDATA[Saoussen CHAABNIA]]></dc:creator><pubDate>Tue, 06 Jan 2026 16:40:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elji!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce34980f-3ab2-43bf-ab9b-b0af6997d534_144x144.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Part 4 of <strong>Google ADK: From Local Development to Vertex AI Deployment</strong>! You&#8217;ve mastered agent delegation. 
Now let&#8217;s tackle workflows where <strong>order matters</strong>.</p><h2><strong>Google ADK: From Local Development to Vertex AI Deployment series:</strong></h2><ol><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 1</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Building Your First AI Agent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 2</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-b68?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Custom Tools &#8212; Extending Agent Capabilities </a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 3</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Multi-Agent Orchestration with Agent-as-a-Tool</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 4</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-c50?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Sequential Workflows with SequentialAgent (You are here)</a></p></li><li><p><strong><a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 5</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-6b7?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Self-Improving Agents with LoopAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 6</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-aed?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Efficient Workflows with ParallelAgent</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 7</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-679?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Complete Multi-Agent System &#8212; The Capstone</a></p></li><li><p><strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Part 8</a></strong><a href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-df1?r=4ewvnt&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">: Deploying to Vertex AI Agent Engine</a></p></li><li><p><strong>Part 9</strong>: Full-Stack Deployment with Cloud Run</p></li></ol><h2><strong>Introduction: The Order Problem</strong></h2><p>In <a 
href="https://open.substack.com/pub/saoussenchaabnia/p/google-adk-from-local-development-696?utm_campaign=post-expanded-share&amp;utm_medium=web">Part 3</a>, we built an orchestrator that delegates to specialists based on user intent. Powerful stuff! But there&#8217;s a problem:</p><p><strong>What if a task inherently requires multiple steps in a specific order?</strong></p><p>Consider creating a blog post:</p><ol><li><p><strong>First</strong>: Research trending topics &#8592; Can&#8217;t skip this</p></li><li><p><strong>Then</strong>: Write content based on research &#8592; Needs topic from step 1</p></li><li><p><strong>Finally</strong>: Format for social media &#8592; Needs content from step 2</p></li></ol><p>You can&#8217;t write before researching. You can&#8217;t format before writing. <strong>Order matters.</strong></p><p>Today, you&#8217;ll learn how to build <strong>sequential workflows</strong> using the <strong>SequentialAgent</strong> pattern. By the end of this article, you&#8217;ll have a three-stage content pipeline where data automatically flows from research &#8594; writing &#8594; social formatting.</p><p><strong>What you&#8217;ll learn:</strong></p><ul><li><p>The SequentialAgent workflow pattern</p></li><li><p>The <code>output_key</code> parameter for state management</p></li><li><p>Variable interpolation with <code>{{variable}}</code></p></li><li><p>Automatic data flow between agents</p></li></ul><p><strong>Prerequisites:</strong></p><ul><li><p>Completed Parts 1&#8211;3 or familiar with agents and orchestration</p></li><li><p>Understanding of sessions and state</p></li><li><p>Google API key</p></li></ul><p>Let&#8217;s build your first workflow!</p><h2><strong>The Need for Sequential Execution</strong></h2><h3><strong>Current Limitation</strong></h3><p>With our orchestrator from Part 3:</p><pre><code># User asks for research
orchestrator &#8594; topic_research_agent &#8594; returns topics

# User asks to write based on those topics
orchestrator &#8594; content_writer_agent &#8594; ... wait, it doesn&#8217;t have the topics!</code></pre><p><strong>The problem</strong>: Each agent call is independent. Data doesn&#8217;t flow automatically.</p><h3><strong>What We Need</strong></h3><pre><code>topic_research_agent &#8594; blog_topic (stored)
         &#8595;
content_writer_agent &#8594; uses blog_topic &#8594; blog_content (stored)
         &#8595;
social_formatter_agent &#8594; uses blog_content &#8594; social_posts</code></pre><p><strong>The solution</strong>: SequentialAgent with automatic state passing!</p><h2><strong>Introducing SequentialAgent</strong></h2><h3><strong>Concept: Workflow Agents</strong></h3><blockquote><p><em><strong>What is SequentialAgent?</strong> A SequentialAgent is a workflow agent that executes its sub-agents in a specific order. It&#8217;s designed for processes where the order of operations matters &#8212; each agent&#8217;s output becomes input for the next.</em></p></blockquote><p><strong>Key Characteristics:</strong></p><ul><li><p>Executes sub-agents one after another (not simultaneously)</p></li><li><p>Automatically passes state between agents</p></li><li><p>Useful for tasks with dependencies</p></li><li><p>Parameter: <code>sub_agents</code> - list of agents to run in order</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>Multi-step processes</p></li><li><p>Data transformations (input &#8594; process &#8594; output)</p></li><li><p>Pipelines with dependencies</p></li></ul><p><strong>Reference</strong>: <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h2><strong>State Management: The Key to Data Flow</strong></h2><h3><strong>Concept: output_key Parameter</strong></h3><blockquote><p><em><strong>What is output_key?</strong> The </em><code>output_key</code><em> parameter tells ADK to store an agent&#8217;s final response in the session state under a specific variable name. This makes the output available to subsequent agents in the workflow.</em></p></blockquote><p><strong>How it works:</strong></p><pre><code>agent1 = Agent(
    name="researcher",
    output_key="blog_topic"  # &#8592; Stores response here
)
# After agent1 runs, session.state["blog_topic"] = agent1&#8217;s response</code></pre><p><strong>The magic:</strong></p><ul><li><p>Agent 1 runs &#8594; output stored as <code>blog_topic</code></p></li><li><p>Agent 2 can reference <code>{{blog_topic}}</code> in its instructions</p></li><li><p>ADK automatically replaces <code>{{blog_topic}}</code> with the actual value</p></li></ul><p><strong>Reference</strong>: <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h3><strong>Concept: Variable Interpolation</strong></h3><blockquote><p><em><strong>What is Variable Interpolation?</strong> ADK uses </em><code>{{variable_name}}</code><em> syntax in agent instructions to reference values from the session state. At runtime, ADK automatically replaces these placeholders with actual values.</em></p></blockquote><p><strong>Syntax:</strong></p><pre><code>instruction = "Write a blog post about: {{blog_topic}}"
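
# A hedged aside, not ADK's implementation: a plain-Python stand-in for the
# substitution step, assuming placeholders are simple names in curly braces.
# fill_placeholders and demo_state are hypothetical names used only here.
import re

def fill_placeholders(template, state):
    # Swap each {{name}} for the matching value from the state dict;
    # a name missing from state raises KeyError.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(state[m.group(1)]), template)

demo_state = {"blog_topic": "10 AI Tools for Small Businesses"}
print(fill_placeholders("Write a blog post about: {{blog_topic}}", demo_state))
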
# At runtime: {{blog_topic}} &#8594; &#8220;10 AI Tools for Small Businesses&#8221;</code></pre><p><strong>Rules:</strong></p><ul><li><p>Use double curly braces: <code>{{variable}}</code></p></li><li><p>Variable must exist in session state</p></li><li><p>Previous agents set these via <code>output_key</code></p></li></ul><p><strong>Reference</strong>: <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h2><strong>Building a Three-Stage Content Pipeline</strong></h2><p>Let&#8217;s build a complete workflow: Research &#8594; Write &#8594; Format</p><h3><strong>Step 1: Setup</strong></h3><pre><code>!pip install google-adk==1.19.0 -q</code></pre><pre><code>import os
from getpass import getpass

api_key = getpass('Enter your Google API Key: ')
os.environ['GOOGLE_API_KEY'] = api_key
print('&#9989; API Key configured!')</code></pre><h3><strong>Step 2: Agent 1 &#8212; Topic Researcher</strong></h3><p>This agent finds ONE perfect topic and stores it:</p><pre><code>from google.adk.agents import Agent
from google.adk.tools import google_search

topic_research_agent = Agent(
    name="topic_research_agent",
    model="gemini-2.5-flash",
    description="Researches trending blog topics",
    instruction="""
    You are a content strategist. Find compelling blog topics.

    Process:
    1. Search for trending topics in the niche
    2. Select the SINGLE BEST topic
    3. Output ONLY the title

    Example output: &#8220;10 Zero-Waste Swaps to Transform Your Kitchen&#8221;

    Important: Output ONLY the blog post title, nothing else.
    """,
    tools=[google_search],
    output_key="blog_topic"  # &#128273; Stores result in session.state["blog_topic"]
)

print(f"&#129502; Agent '{topic_research_agent.name}' created!")</code></pre><p><strong>Key points:</strong></p><ul><li><p><code>output_key="blog_topic"</code>: Stores the title for the next agent</p></li><li><p>Instructions emphasize: &#8220;Output ONLY the title&#8221;</p></li><li><p>This ensures clean data for the next stage</p></li></ul><h3><strong>Step 3: Agent 2 &#8212; Content Writer</strong></h3><p>This agent uses the topic from Agent 1:</p><pre><code>content_writer_agent = Agent(
    name="content_writer_agent",
    model="gemini-2.5-flash",
    description="Writes engaging blog posts",
    instruction="""
    You are a blog writer. Write about: {{blog_topic}}

    Requirements:
    - 400-600 words
    - Engaging intro
    - 3-4 sections with H2 headings
    - Clear conclusion with CTA
    - Conversational tone

    Output ONLY the blog post in markdown.
    """,
    tools=[],
    output_key="blog_content"  # &#128273; Stores result in session.state["blog_content"]
)

print(f"&#129502; Agent '{content_writer_agent.name}' created!")</code></pre><p><strong>Notice:</strong></p><ul><li><p><code>{{blog_topic}}</code>: References the variable from Agent 1</p></li><li><p><code>output_key="blog_content"</code>: Stores the post for Agent 3</p></li><li><p>No tools needed, pure content generation</p></li></ul><h3><strong>Step 4: Agent 3 &#8212; Social Media Formatter</strong></h3><p>This agent creates social posts from the blog content:</p><pre><code>social_formatter_agent = Agent(
    name="social_formatter_agent",
    model="gemini-2.5-flash",
    description="Creates social media posts",
    instruction="""
    Create social posts from: {{blog_content}}

    Create THREE posts:

    1. **Twitter/X** (280 chars)
       - Hook + hashtags + CTA

    2. **LinkedIn** (150-200 words)
       - Professional tone
       - Key insights
       - Hashtags

    3. **Instagram** (150 words)
       - Engaging + emojis
       - 8-10 hashtags
       - Strong CTA

    Format with clear headers for each platform.
    """,
    tools=[]
    # Note: No output_key - this is the final agent
)

print(f"&#129502; Agent '{social_formatter_agent.name}' created!")</code></pre><p><strong>Notice:</strong></p><ul><li><p><code>{{blog_content}}</code>: Uses content from Agent 2</p></li><li><p>No <code>output_key</code>: Final output goes to the user</p></li></ul><h3><strong>Step 5: Chain Them with SequentialAgent</strong></h3><p>Now the magic happens:</p><pre><code>from google.adk.agents import SequentialAgent

content_creation_workflow = SequentialAgent(
    name="content_creation_workflow",
    sub_agents=[
        topic_research_agent,      # Step 1: Research
        content_writer_agent,      # Step 2: Write
        social_formatter_agent     # Step 3: Format
    ],
    description="Research &#8594; Write &#8594; Format workflow"
)

print("&#9989; Sequential workflow created!")
print("\n&#128260; Execution Flow:")
print("   1. Research trending topics &#8594; blog_topic")
print("   2. Write blog post using {{blog_topic}} &#8594; blog_content")
print("   3. Format social posts using {{blog_content}} &#8594; final output")</code></pre><p><strong>That&#8217;s it!</strong> Three agents, one <code>SequentialAgent</code>, automatic data flow.</p><h2><strong>Understanding the Data Flow</strong></h2><h3><strong>Concept: State Passing Between Agents</strong></h3><blockquote><p><em><strong>How Does State Passing Work?</strong> In a SequentialAgent workflow, data flows automatically:</em></p></blockquote><ul><li><p>Agent 1 runs &#8594; stores output via <code>output_key="var1"</code></p></li><li><p>Agent 2 reads <code>{{var1}}</code> from state &#8594; stores output via <code>output_key="var2"</code></p></li><li><p>Agent 3 reads <code>{{var2}}</code> from state &#8594; produces final output</p></li></ul><blockquote><p><em>ADK handles storing, interpolating, and passing state. Zero manual work!</em></p></blockquote><p><strong>Visual representation:</strong></p><pre><code>&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; topic_research_agent                &#9474;
&#9474; output_key=&#8221;blog_topic&#8221;             &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#9660; (blog_topic stored in state)
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; content_writer_agent                &#9474;
&#9474; instruction: &#8220;Write about:          &#9474;
&#9474;              {{blog_topic}}&#8221;        &#9474; &#8592; ADK replaces this
&#9474; output_key=&#8221;blog_content&#8221;           &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#9660; (blog_content stored in state)
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; social_formatter_agent              &#9474;
&#9474; instruction: &#8220;Create posts from:    &#9474;
&#9474;              {{blog_content}}&#8221;      &#9474; &#8592; ADK replaces this
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               &#9474;
               &#9660;
         Final Output</code></pre><p><strong>Reference</strong>: <a href="https://google.github.io/adk-docs/agents/">Workflow Agents</a></p><h3><strong>Running the Workflow</strong></h3><p>Set up the execution engine:</p><pre><code>from IPython.display import display, Markdown
from google.adk.sessions import InMemorySessionService, Session
from google.adk.runners import Runner
from google.genai.types import Content, Part

session_service = InMemorySessionService()
user_id = "adk_content_creator_001"

async def run_agent_query(agent, query, session, user_id):
    print(f"\n&#128640; Running: '{agent.name}'...")

    runner = Runner(agent=agent, session_service=session_service, app_name=agent.name)

    final_response = ""
    try:
        async for event in runner.run_async(
            user_id=user_id,
            session_id=session.id,
            new_message=Content(parts=[Part(text=query)], role="user")
        ):
            # Each sub-agent emits its own final response; keep the last one that carries text
            if event.is_final_response() and event.content and event.content.parts:
                final_response = event.content.parts[0].text or final_response
    except Exception as e:
        final_response = f"Error: {e}"

    print("\n" + "-"*50)
    display(Markdown(final_response))
    print("-"*50)

    return final_response

print("&#9989; Execution engine ready!")</code></pre><p>Run the complete workflow:</p><pre><code>async def run_workflow():
    session = await session_service.create_session(
        app_name=content_creation_workflow.name,
        user_id=user_id
    )

    query = "Create content for sustainable living and zero-waste lifestyle blog"
    print(f"&#128100; User: {query}\n")

    await run_agent_query(content_creation_workflow, query, session, user_id)

await run_workflow()</code></pre><h3><strong>Example Execution</strong></h3><p>Watch the three-stage pipeline in action:</p><pre><code>&#128100; User: Create content for sustainable living and zero-waste lifestyle blog

&#128640; Running: &#8216;content_creation_workflow&#8217;...

[Stage 1: Topic Research]
&#128269; Searching for trending topics...
&#10003; Selected topic: &#8220;10 Zero-Waste Swaps to Transform Your Kitchen&#8221;
&#10003; Stored in state as: blog_topic

[Stage 2: Content Writing]
&#10003; Retrieved from state: blog_topic = &#8220;10 Zero-Waste Swaps...&#8221;
&#9997;&#65039; Writing 500-word blog post...
&#10003; Stored in state as: blog_content

[Stage 3: Social Formatting]
&#10003; Retrieved from state: blog_content = &#8220;Transform your kitchen...&#8221;
&#128241; Creating social posts for 3 platforms...

--------------------------------------------------
&#9989; Final Response:

## Twitter/X
&#127793; Transform your kitchen into a zero-waste powerhouse! Discover 10 simple swaps that
save money &amp; the planet. From beeswax wraps to compost bins. Start today!
#ZeroWaste #SustainableLiving #EcoFriendly #GreenKitchen

## LinkedIn
The average household produces 4.4 pounds of waste daily, with kitchens being the
biggest culprit. But transformation doesn&#8217;t require perfection&#8212;it requires small,
intentional swaps.

In our latest blog post, we explore 10 zero-waste kitchen alternatives:
- Beeswax wraps replace plastic wrap
- Glass containers instead of disposable bags
- Compost bins for organic waste
- Reusable produce bags

Each swap is practical, affordable, and immediately implementable. Perfect for
businesses promoting sustainability or individuals starting their eco-journey.

Read the full guide: [link]

#Sustainability #ZeroWaste #EcoFriendly #GreenBusiness #CircularEconomy

## Instagram
&#127807;&#10024; Your kitchen called&#8212;it wants to go green! &#10024;&#127807;

Tired of single-use plastics? We&#8217;ve got you covered with 10 game-changing swaps:
&#128029; Beeswax wraps &gt; plastic wrap
&#129387; Glass jars &gt; disposable containers
&#128465;&#65039; Compost bin &gt; landfill waste
&#128717;&#65039; Reusable bags &gt; plastic produce bags

Small changes, BIG impact! &#127757;&#128154;

Click the link in bio to discover all 10 swaps + how to implement them TODAY!

#ZeroWaste #SustainableLiving #EcoFriendly #GreenKitchen #PlasticFree
#Zerowaste #Sustainability #EcoConscious #GreenLiving #SaveThePlanet</code></pre><p><strong>Notice how:</strong></p><ul><li><p>Stage 1 found a specific topic</p></li><li><p>Stage 2 wrote content about that exact topic</p></li><li><p>Stage 3 created social posts from that exact content</p></li><li><p>All automatic, no manual copying needed!</p></li></ul><h2><strong>What&#8217;s Next?</strong></h2><p>We can now create ordered workflows where data flows automatically! But what if we need <strong>iterative refinement</strong>?</p><p>Imagine:</p><ul><li><p>Draft content &#8594; <strong>Check quality</strong> &#8594; Improve &#8594; <strong>Check again</strong> &#8594; Improve &#8594; &#8230; &#8594; <strong>Until good enough</strong></p></li></ul><p>Current problem:</p><pre><code># This runs once and stops
SequentialAgent(name="draft_once", sub_agents=[drafter, checker, improver])  # name is illustrative</code></pre><p>We need:</p><pre><code># This loops until the quality threshold is met (or max_iterations is reached)
LoopAgent(name="refine_loop", sub_agents=[checker, improver], max_iterations=3)  # name is illustrative</code></pre><p>In <strong>Part 5: Self-Improving AI with LoopAgent</strong>, we&#8217;ll learn how to build <strong>iterative workflows</strong> that improve content through critique-refine cycles until quality standards are met.</p><h2><strong>Try It Yourself!</strong></h2><p>Ready to build sequential workflows? Grab the code from the repository below:</p><p><strong>GitHub Repository</strong>: <a href="https://github.com/Saoussen-CH/content_creation_mas_workshop">content_creation_mas_workshop</a></p><p><strong>Happy workflow sequencing!</strong></p><div><hr></div><p><strong>Thanks for reading</strong>! If this was helpful, hit the &#10084;&#65039;, drop a comment, &#11088; the GitHub repo, and <strong>subscribe</strong> so you don&#8217;t miss the next one. Let&#8217;s connect on <a href="https://www.linkedin.com/in/saoussen-chaabnia">LinkedIn</a>!</p>]]></content:encoded></item></channel></rss>