Building CI/CD for AI Apps with Open-Source Tools
This guide outlines the implementation of a CI/CD pipeline for AI applications, focusing on the integration of prompt evaluation, model versioning, and automated deployment strategies. Unlike standard software pipelines, AI CI/CD must account for non-deterministic outputs and the high cost of evaluation runs.
Decouple Prompts from Application Logic
Store prompts in structured files (JSON or YAML) rather than hardcoding them in application strings. This allows the CI pipeline to detect changes specifically in prompts and trigger specialized evaluation suites without rebuilding the entire application container.
prompts:
  summarization_v1:
    template: "Summarize the following text in {{length}} sentences: {{text}}"
    params:
      model: "gpt-4o"
      temperature: 0.3
⚠ Common Pitfalls
- Mixing prompt templates with runtime logic makes it impossible to run diff-based evaluations.
- Failing to version the prompt schema alongside the code.
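Reading a template like the one above at runtime takes only a few lines. A minimal sketch, assuming PyYAML for parsing and Jinja2 for the {{ }} placeholders (the example values are illustrative):

import yaml                   # PyYAML
from jinja2 import Template   # renders the {{ }} placeholders

with open("config/prompts.yaml") as f:
    config = yaml.safe_load(f)

prompt_cfg = config["prompts"]["summarization_v1"]
prompt = Template(prompt_cfg["template"]).render(
    text="The quick brown fox jumps over the lazy dog.",
    length="1",
)

# Model settings travel with the prompt, not with the application code.
print(prompt_cfg["params"]["model"], prompt_cfg["params"]["temperature"])
print(prompt)

Because the application only ever reads this file, the CI pipeline can treat a change to config/prompts.yaml as the signal to run prompt evaluations.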
Implement Automated Prompt Evaluation (Evals)
Integrate a tool like Promptfoo into your CI workflow. Define a set of 'golden' test cases with expected outputs or assertions (e.g., JSON schema validation, keyword presence, or semantic similarity). This step prevents regressions in model behavior when prompts or model versions change.
prompts: [config/prompts.yaml]
providers: [openai:gpt-4o]
tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
      length: "1"
    assert:
      - type: contains
        value: "fox"
      - type: javascript
        value: output.split('.').length <= 2
⚠ Common Pitfalls
- Running full eval suites on every commit can be expensive. Use a git diff or a CI path filter to trigger evals only when prompt files change (see the sketch below).
- Relying on exact string matching for non-deterministic outputs.
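Letting the CI system do the diffing is often simpler than scripting git diff yourself. A sketch of a GitHub Actions trigger that runs the eval workflow only when prompt or eval config files change (the paths assume the layout above and Promptfoo's default promptfooconfig.yaml file name):

on:
  pull_request:
    paths:
      - "config/prompts.yaml"
      - "promptfooconfig.yaml"

The eval job itself is shown in the next section. For the exact-match pitfall, Promptfoo also provides assertion types such as similar (embedding similarity) and llm-rubric (model-graded) that tolerate wording variation better than contains or equals.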
Configure CI Secrets and Rate Limiting
Set up GitHub Actions secrets for your LLM API keys. To prevent CI failures due to rate limits, implement a retry mechanism in your evaluation scripts or use a proxy that handles queuing. Ensure the CI environment uses a dedicated 'testing' API key with usage limits to control costs.
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo eval
⚠ Common Pitfalls
- Leaking API keys in CI logs by not masking them.
- Failing to set a hard budget cap on the CI API key.
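The retry mechanism mentioned above can be a small wrapper around whatever calls the provider. A minimal sketch using exponential backoff with jitter (the function and parameter names are illustrative, not part of any library):

import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # in real code, catch only rate-limit / transient errors
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

Wrapped around the eval's API calls, this keeps a transient 429 from failing the whole CI run, while the dedicated testing key's budget cap still bounds total spend.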
Containerize with Model Layer Optimization
If using local models or heavy dependencies (PyTorch/Transformers), structure your Dockerfile to cache the heavy layers. If using API-based models, keep the image slim but include the prompt configuration files as a separate layer to ensure fast rebuilds.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy prompts first to allow caching of code layers
COPY config/prompts.yaml ./config/
COPY . .
CMD ["python", "main.py"]⚠ Common Pitfalls
- Including large model weights (.bin or .safetensors) directly in the Git repo; use LFS or download them during the build stage (see the multi-stage sketch below).
- Re-downloading heavy dependencies on every CI run due to poor layer ordering.
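When weights are genuinely needed inside the image, a dedicated build stage keeps them out of Git and out of the runtime layers that rebuild most often. A hedged multi-stage sketch, where MODEL_URL is a placeholder rather than a real artifact:

# Stage 1: fetch model weights (MODEL_URL is a placeholder)
FROM python:3.11-slim AS weights
ARG MODEL_URL=https://example.com/models/model.safetensors
ADD ${MODEL_URL} /weights/model.safetensors

# Stage 2: runtime image copies only what it needs
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY --from=weights /weights ./weights
COPY config/prompts.yaml ./config/
COPY . .
CMD ["python", "main.py"]

As long as MODEL_URL does not change, the weights stage stays cached and CI rebuilds touch only the application layers.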
Deploy to Preview Environments with LLM-as-a-Judge
For every pull request, deploy a preview environment (e.g., on Vercel or Cloudflare Pages). Run a final 'smoke test' where a separate, more capable model (like GPT-4) evaluates the output of the preview deployment to ensure the user-facing AI feature meets quality thresholds before merging.
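A minimal sketch of such a smoke test, assuming the preview exposes a hypothetical /api/summarize endpoint, a PREVIEW_URL variable set by the deployment step, and the OpenAI SDK as the judge (the rubric and threshold are illustrative):

import os
import requests
from openai import OpenAI

preview_url = os.environ["PREVIEW_URL"]   # injected by the PR deployment step
client = OpenAI()                         # reads OPENAI_API_KEY

# Exercise the deployed preview (hypothetical endpoint and payload).
resp = requests.post(
    f"{preview_url}/api/summarize",
    json={"text": "The quick brown fox jumps over the lazy dog.", "length": 1},
    timeout=30,
)
resp.raise_for_status()
candidate = resp.json()["summary"]

# Ask a stronger model to grade the user-facing output.
judge = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Rate this summary from 1 (unusable) to 5 (excellent). "
            f"Reply with a single digit only.\n\nSummary: {candidate}"
        ),
    }],
)
score = int(judge.choices[0].message.content.strip()[0])
if score < 4:
    raise SystemExit(f"Judge score {score} is below the quality threshold")

Because this runs once per pull request rather than per commit, the extra judge call adds little cost while catching regressions that unit tests miss.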
⚠ Common Pitfalls
- Merging PRs that pass unit tests but produce low-quality AI responses.
- Ignoring the latency that a blocking LLM-as-a-judge step adds to the deployment pipeline.
Automated Rollback on Metric Regression
Configure your deployment tool (e.g., ArgoCD or Kamal) to monitor post-deployment metrics. If the LLM integration shows a spike in 5xx errors or if real-time sentiment analysis of user feedback drops below a threshold, trigger an automatic rollback to the previous stable container image and prompt version.
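A post-deploy check can be as simple as a script that the deploy tool runs once the new version is live and that fails loudly when the error budget is blown. A minimal sketch against the Prometheus HTTP API, with the metric name, threshold, and PROM_URL as assumptions:

import os
import sys
import requests

prom_url = os.environ.get("PROM_URL", "http://prometheus:9090")
# Hypothetical metric: share of 5xx responses over the last five minutes.
query = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
threshold = 0.02  # 2% error budget, illustrative

resp = requests.get(f"{prom_url}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
results = resp.json()["data"]["result"]
error_rate = float(results[0]["value"][1]) if results else 0.0

print(f"5xx error rate: {error_rate:.4f}")
if error_rate > threshold:
    # A non-zero exit tells the deploy tool's post-deploy hook or analysis step
    # to roll back the container image and the pinned prompt version together.
    sys.exit(1)

Whether this lives in a Kamal post-deploy hook or an Argo Rollouts analysis step, the important part is that the rollback target includes the prompt version, not just the image tag.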
⚠ Common Pitfalls
- Rolling back the code without rolling back the prompt configuration, leading to version mismatch.
- Not having a manual override for the automated rollback logic.
What you built
A successful CI/CD pipeline for AI moves the uncertainty of LLM outputs into the testing phase. By versioning prompts as code, implementing automated semantic evaluations, and optimizing container builds, teams can deploy AI features with the same confidence as traditional software.