CI/CD for AI Apps implementation checklist

This checklist provides a technical framework for validating CI/CD pipelines used in AI application development. It focuses on the specific challenges of non-deterministic outputs, prompt versioning, and cost-effective evaluation cycles.

Prompt and Model Versioning

  • Decouple Prompts from Source Code

    critical

    Store prompt templates in dedicated JSON or YAML files separate from application logic to enable independent versioning and testing.

  • Implement Schema Validation

    critical

    Run JSON schema validation against prompt templates in the CI pipeline to ensure all required variables are present before deployment; a minimal sketch follows this section's list.

  • Map Prompts to Specific Model Versions

    critical

    Pin exact model versions (e.g., gpt-4-0613) in configuration files rather than using 'latest' aliases to prevent unexpected drift during deployments.

  • Automated Prompt Diffing

    recommended

    Configure the CI pipeline to generate and display a diff of prompt changes in pull request comments for manual reviewer audit.

  • Metadata Tagging

    recommended

    Inject the Git commit SHA and environment name into the metadata of all LLM API calls for downstream traceability.
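
A minimal sketch of the validation step for the items above, assuming prompts live in YAML files with name, model, template, and variables fields; the field names, schema, and file layout are illustrative rather than a standard:

```python
# validate_prompts.py - CI gate for prompt templates (illustrative sketch).
import re
import sys

import yaml                      # pip install pyyaml
from jsonschema import validate  # pip install jsonschema

# Assumed prompt-file layout; adapt the schema to your repository's structure.
PROMPT_SCHEMA = {
    "type": "object",
    "required": ["name", "model", "template", "variables"],
    "properties": {
        "name": {"type": "string"},
        # Reject 'latest'-style aliases so deployments stay pinned to an exact version.
        "model": {"type": "string", "not": {"pattern": "latest"}},
        "template": {"type": "string"},
        "variables": {"type": "array", "items": {"type": "string"}},
    },
}

def check_prompt_file(path: str) -> list[str]:
    """Validate one prompt file; return a list of human-readable problems."""
    with open(path) as f:
        prompt = yaml.safe_load(f)
    validate(instance=prompt, schema=PROMPT_SCHEMA)  # raises ValidationError on structural issues
    # Every {placeholder} used in the template must be declared under `variables`.
    used = set(re.findall(r"{(\w+)}", prompt["template"]))
    declared = set(prompt["variables"])
    return [f"{path}: template uses undeclared variable '{name}'"
            for name in sorted(used - declared)]

if __name__ == "__main__":
    problems = [p for path in sys.argv[1:] for p in check_prompt_file(path)]
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the CI job
```

In CI this might be invoked as python validate_prompts.py prompts/*.yaml so any schema violation or undeclared variable fails the build before deployment.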

Automated Evaluation and Testing

  • Unit Test LLM Wrappers

    critical

    Use mocks for LLM API calls in unit tests to verify that application logic handles various response formats and error codes without incurring costs; see the pytest sketch after this section's list.

  • Structured Output Verification

    critical

    For prompts intended to return JSON, execute a validation step in CI that parses the response and checks for required fields and data types.

  • Semantic Similarity Testing

    recommended

    Run an evaluation script comparing LLM outputs against a 'golden set' of expected answers using cosine similarity or LLM-as-a-judge metrics.

  • Regression Testing Suite

    recommended

    Maintain a dataset of 50-100 edge-case prompts that must be executed and validated whenever the model version or system prompt changes.

  • Prompt Injection Scanning

    recommended

    Integrate a security scanner in the pipeline to test if system prompts can be bypassed via malicious user input patterns.
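
For the first two items, a unit test might look like the following pytest sketch; myapp.llm, call_model, summarize_ticket, and the expected wrapper behaviour are hypothetical stand-ins for your own code:

```python
# test_llm_wrapper.py - mocked LLM calls plus structured-output checks (sketch).
import json
from unittest.mock import patch

from myapp.llm import summarize_ticket  # hypothetical wrapper under test

REQUIRED_FIELDS = {"summary": str, "sentiment": str, "priority": int}

@patch("myapp.llm.call_model")  # replace the real API call: no tokens, no cost
def test_handles_valid_json(mock_call):
    mock_call.return_value = json.dumps(
        {"summary": "Login fails", "sentiment": "negative", "priority": 2}
    )
    result = summarize_ticket("I can't log in since the last release")
    # Structured-output verification: required fields exist with the right types.
    for field, expected_type in REQUIRED_FIELDS.items():
        assert isinstance(result[field], expected_type)

@patch("myapp.llm.call_model")
def test_handles_malformed_response(mock_call):
    mock_call.return_value = "Sorry, I can't help with that."  # not JSON
    # Assumed contract: the wrapper degrades gracefully instead of raising.
    assert summarize_ticket("I can't log in") is None
```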

Deployment and Infrastructure

  • Environment Secret Management

    critical

    Store LLM API keys in a secure vault (e.g., GitHub Secrets, HashiCorp Vault) and ensure they are never logged in CI build traces.

  • Health Check Endpoints

    critical

    Implement a dedicated /health/ai endpoint that verifies connectivity to upstream AI providers and model availability before routing traffic; a minimal sketch follows this section's list.

  • Blue-Green Deployment for Prompts

    recommended

    Deploy new prompt versions to a parallel (green) environment and shift a small percentage of traffic to them before the full cutover.

  • Rollback Automation

    critical

    Define automated rollback triggers based on a spike in 4xx/5xx errors or a drop in semantic similarity scores post-deployment.

  • Inference Latency Benchmarking

    optional

    Measure and log the p95 latency of model responses during integration tests to prevent performance regressions from reaching production.
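
A minimal sketch of the /health/ai endpoint described above, assuming a FastAPI service and OpenAI as the upstream provider; the probe URL, timeout, and pinned model name are illustrative:

```python
# health.py - readiness probe for the AI dependency (illustrative sketch).
import os

import httpx
from fastapi import FastAPI, Response

app = FastAPI()

PROVIDER_MODELS_URL = "https://api.openai.com/v1/models"  # assumed provider endpoint
PINNED_MODEL = "gpt-4-0613"                               # matches the version pinned in config

@app.get("/health/ai")
def health_ai(response: Response) -> dict:
    """Verify provider connectivity and availability of the pinned model."""
    try:
        r = httpx.get(
            f"{PROVIDER_MODELS_URL}/{PINNED_MODEL}",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            timeout=5.0,
        )
        r.raise_for_status()
        return {"status": "ok", "model": PINNED_MODEL}
    except Exception as exc:          # network failure, unknown model, bad key, ...
        response.status_code = 503    # signals the load balancer not to route traffic here
        return {"status": "unavailable", "detail": str(exc)}
```

The deployment pipeline can then gate traffic cutover, and the rollback triggers above, on this endpoint returning 200.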

Cost and Resource Optimization

  • CI Token Usage Caps

    critical

    Set hard limits on the number of tokens or API spend allowed per CI pipeline run to prevent runaway costs from recursive loops or large test suites; a budget-guard sketch follows this section's list.

  • Evaluation Parallelization

    recommended

    Distribute LLM evaluation tasks across multiple parallel CI jobs to reduce build times when running large-scale semantic tests.

  • Model Weight Caching

    recommended

    If self-hosting models, use persistent volume claims or CI caching layers for model weights to avoid multi-gigabyte downloads on every build.

  • Conditional Evaluation Execution

    recommended

    Configure CI to only run expensive LLM evaluation suites if changes are detected in prompt files or model configurations.

  • Small Model Testing

    optional

    Use smaller, cheaper models (e.g., GPT-3.5 or Haiku) for basic logic testing in CI, reserving larger models for final staging evaluations.
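
One way to enforce the token cap from the first item is a small budget guard threaded through the evaluation loop, as in the sketch below; the environment variable name and the shape of the usage object are assumptions about your setup:

```python
# token_budget.py - hard cap on token spend for a CI evaluation run (sketch).
import os

class TokenBudget:
    """Abort the run once cumulative token usage crosses the configured cap."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            # A non-zero exit fails the CI job before costs run away.
            raise SystemExit(f"CI token cap exceeded: {self.used} > {self.max_tokens}")

# Illustrative use inside an evaluation loop (response shape depends on your client):
#   budget = TokenBudget(int(os.environ.get("CI_MAX_TOKENS", "200000")))
#   for case in golden_set:
#       resp = client.chat.completions.create(model=MODEL, messages=case.messages)
#       budget.record(resp.usage.prompt_tokens, resp.usage.completion_tokens)
```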

Observability Integration

  • OpenTelemetry Instrumentation

    recommended

    Ensure the CI/CD pipeline verifies that OpenTelemetry spans are correctly configured for all LLM calls to enable production tracing; a test sketch follows this section's list.

  • Log Masking Verification

    critical

    Run a test case in CI that attempts to log PII through the AI feature and verifies that the logging middleware correctly redacts the data.

  • Metric Dashboard Deployment

    recommended

    Automate the deployment of Grafana or Datadog dashboards alongside the application to track token usage and error rates per model.

  • Feedback Loop Capture

    optional

    Verify that the production environment is configured to capture and store user 'thumbs up/down' feedback linked to specific prompt versions.

  • Alerting Thresholds

    critical

    Verify that deployment scripts update alerting thresholds for inference failures and token exhaustion in the monitoring provider.
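
For the OpenTelemetry item, a CI test can route spans to the SDK's in-memory exporter and assert that the LLM wrapper emits them with the expected attributes; call_model and the attribute names below are illustrative, not the application's real code:

```python
# test_tracing.py - verify LLM calls emit OpenTelemetry spans (illustrative sketch).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Route spans to memory so the test can inspect them without a collector.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("myapp.llm")

def call_model(prompt: str) -> str:
    """Stand-in for the application's instrumented LLM wrapper."""
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("llm.model", "gpt-4-0613")
        span.set_attribute("llm.prompt_version", "2024-06-01")  # e.g. a Git SHA or tag
        return "stubbed response"

def test_llm_call_is_traced():
    exporter.clear()
    call_model("hello")
    spans = exporter.get_finished_spans()
    assert any(s.name == "llm.chat" and "llm.model" in s.attributes for s in spans)
```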