CI/CD for AI Apps: Implementation Checklist
This checklist provides a technical framework for validating CI/CD pipelines used in AI application development. It focuses on the specific challenges of non-deterministic outputs, prompt versioning, and cost-effective evaluation cycles.
Prompt and Model Versioning
Decouple Prompts from Source Code
Critical: Store prompt templates in dedicated JSON or YAML files, separate from application logic, to enable independent versioning and testing.
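As a sketch of this decoupling: the prompt lives in its own versioned file and the application only loads and renders it. The file path (`prompts/summarize.json`) and field layout below are hypothetical, not a prescribed format.

```python
import json

# Hypothetical contents of prompts/summarize.json, versioned independently
# of the application code.
PROMPT_JSON = """
{
  "name": "summarize",
  "version": "1.2.0",
  "model": "gpt-4-0613",
  "template": "Summarize the following text in {max_words} words:\\n{text}",
  "variables": ["max_words", "text"]
}
"""

def render_prompt(prompt: dict, **kwargs) -> str:
    """Fill the template, failing loudly if a declared variable is missing."""
    missing = [v for v in prompt["variables"] if v not in kwargs]
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return prompt["template"].format(**kwargs)

prompt = json.loads(PROMPT_JSON)
rendered = render_prompt(prompt, max_words=50, text="CI/CD for AI apps ...")
```

Because the file carries its own `version` and `model` fields, a prompt change becomes an ordinary reviewable diff rather than an edit buried in application code.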
Implement Schema Validation
Critical: Run JSON Schema validation against prompt templates in the CI pipeline to ensure all required variables are present before deployment.
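A real pipeline would validate against a formal schema with the `jsonschema` package; the stdlib-only stand-in below shows the shape of the CI check, including the key invariant that every placeholder in the template is declared in `variables`. The required keys are illustrative.

```python
import string

# Illustrative required keys for a prompt file; a real pipeline would
# encode these in a JSON Schema document instead.
REQUIRED_KEYS = {"name", "version", "model", "template", "variables"}

def validate_prompt(prompt: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the prompt passes."""
    errors = [f"missing key: {k}" for k in REQUIRED_KEYS - prompt.keys()]
    if not errors:
        # Every {placeholder} in the template must be declared in `variables`.
        placeholders = {
            field for _, field, _, _ in string.Formatter().parse(prompt["template"])
            if field
        }
        undeclared = placeholders - set(prompt["variables"])
        errors += [f"undeclared variable: {v}" for v in sorted(undeclared)]
    return errors

good = {"name": "summarize", "version": "1.0", "model": "gpt-4-0613",
        "template": "Hello {user}", "variables": ["user"]}
assert validate_prompt(good) == []
```

The CI job fails the build whenever the returned error list is non-empty, so a template referencing an undeclared variable never reaches deployment.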
Map Prompts to Specific Model Versions
Critical: Pin specific model versions (e.g., gpt-4-0613) in configuration files rather than using 'latest' aliases to prevent unexpected drift during deployments.
Automated Prompt Diffing
Recommended: Configure the CI pipeline to generate a diff of prompt changes and post it as a pull request comment so reviewers can audit it manually.
Metadata Tagging
Recommended: Inject the Git commit SHA and environment name into the metadata of all LLM API calls for downstream traceability.
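One way to build this metadata, assuming hypothetical `GIT_SHA` and `DEPLOY_ENV` environment variables (many CI systems expose equivalents, e.g. `GITHUB_SHA` on GitHub Actions), with a `git rev-parse` fallback for local runs:

```python
import os
import subprocess

def build_call_metadata() -> dict:
    """Collect traceability tags to attach to every LLM API call.

    GIT_SHA and DEPLOY_ENV are hypothetical variable names; substitute
    whatever your CI system actually exposes.
    """
    sha = os.environ.get("GIT_SHA")
    if sha is None:
        try:
            sha = subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip()
        except (OSError, subprocess.CalledProcessError):
            sha = "unknown"
    return {"git_sha": sha, "environment": os.environ.get("DEPLOY_ENV", "dev")}

metadata = build_call_metadata()
```

The resulting dict is then merged into whatever metadata or header mechanism your LLM client supports, so production traces can be joined back to the exact commit that produced them.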
Automated Evaluation and Testing
Unit Test LLM Wrappers
Critical: Use mocks for LLM API calls in unit tests to verify that application logic handles various response formats and error codes without incurring costs.
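A minimal sketch with `unittest.mock`: the wrapper function and the `client.complete` call signature below are hypothetical stand-ins for whatever SDK the application actually wraps. The point is that CI exercises the response-handling logic with zero network traffic and zero token spend.

```python
from unittest.mock import Mock

def summarize(client, text: str) -> str:
    """Hypothetical thin wrapper around an LLM client."""
    response = client.complete(prompt=f"Summarize: {text}")
    if not response.get("choices"):
        raise ValueError("empty LLM response")
    return response["choices"][0]["text"].strip()

# In CI the client is a mock: deterministic, free, and offline.
client = Mock()
client.complete.return_value = {"choices": [{"text": "  A summary.  "}]}
assert summarize(client, "long document") == "A summary."

# Error-path coverage: an empty choices list must raise, not crash later.
client.complete.return_value = {"choices": []}
try:
    summarize(client, "long document")
    raise AssertionError("expected ValueError")
except ValueError:
    pass
```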
Structured Output Verification
Critical: For prompts intended to return JSON, execute a validation step in CI that parses the response and checks for required fields and data types.
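The validation step can be as small as the sketch below; the field names and types are illustrative, not a fixed contract.

```python
import json

# Illustrative contract for one prompt's expected JSON output.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}

def validate_llm_json(raw: str) -> dict:
    """Parse a model response and enforce required fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"{field}: expected {expected.__name__}")
    return data

result = validate_llm_json('{"sentiment": "positive", "confidence": 0.93}')
```

Running this in CI against recorded or freshly sampled responses catches the common failure mode where a prompt tweak silently breaks the output schema downstream code depends on.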
Semantic Similarity Testing
Recommended: Run an evaluation script comparing LLM outputs against a 'golden set' of expected answers using cosine similarity or LLM-as-a-judge metrics.
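The gating logic of such a script can be sketched as follows. The embedding vectors are stubbed here; in a real evaluation they would come from an embedding model, and the 0.85 threshold is an assumed value to be tuned per task.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

THRESHOLD = 0.85  # assumed pass/fail cutoff; tune per task

# Stub vectors standing in for embed(golden_answer) and embed(model_output).
golden_vec = [0.1, 0.9, 0.3]
candidate_vec = [0.12, 0.88, 0.31]

score = cosine_similarity(golden_vec, candidate_vec)
assert score >= THRESHOLD, f"semantic regression: {score:.3f} < {THRESHOLD}"
```

Failing the CI job on a below-threshold average over the golden set turns "the answers feel worse" into an objective, repeatable gate.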
Regression Testing Suite
Recommended: Maintain a dataset of 50-100 edge-case prompts that must be executed and validated whenever the model version or system prompt changes.
Prompt Injection Scanning
Recommended: Integrate a security scanner in the pipeline to test whether system prompts can be bypassed via malicious user input patterns.
Deployment and Infrastructure
Environment Secret Management
Critical: Store LLM API keys in a secure vault (e.g., GitHub Secrets, HashiCorp Vault) and ensure they are never logged in CI build traces.
Health Check Endpoints
Critical: Implement a dedicated /health/ai endpoint that verifies connectivity to upstream AI providers and model availability before routing traffic.
Blue-Green Deployment for Prompts
Recommended: Configure the deployment pipeline to route a small percentage of traffic to new prompt versions before full cutover.
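One common way to implement the traffic split is deterministic hash bucketing, so each user consistently sees the same prompt version across requests. The 5% fraction below is an assumed rollout value.

```python
import hashlib

CANARY_FRACTION = 0.05  # assumed: 5% of traffic to the new prompt version

def use_canary_prompt(user_id: str) -> bool:
    """Deterministically bucket users so each one sees a stable variant."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < CANARY_FRACTION

# Sanity check over synthetic IDs: roughly 5% land in the canary bucket.
rollout = sum(use_canary_prompt(f"user-{i}") for i in range(10_000)) / 10_000
```

Hash bucketing avoids the sticky-session bookkeeping that random assignment would need, and the same function can drive the full cutover by raising `CANARY_FRACTION` to 1.0.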
Rollback Automation
Critical: Define automated rollback triggers based on a spike in 4xx/5xx errors or a drop in semantic similarity scores post-deployment.
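The decision logic of such a trigger reduces to a small gate evaluated against post-deploy metrics; the thresholds below are assumed placeholders.

```python
def should_rollback(error_rate: float, similarity: float,
                    max_error_rate: float = 0.05,
                    min_similarity: float = 0.80) -> bool:
    """Post-deploy gate: roll back on an error spike OR a quality drop.

    Default thresholds are illustrative; real values come from baselines
    measured before the deployment.
    """
    return error_rate > max_error_rate or similarity < min_similarity

# Either signal alone is enough to trigger the rollback.
assert should_rollback(error_rate=0.12, similarity=0.90)
assert should_rollback(error_rate=0.01, similarity=0.60)
assert not should_rollback(error_rate=0.01, similarity=0.90)
```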
Inference Latency Benchmarking
Optional: Measure and log the p95 latency of model responses during integration tests to prevent performance regressions from reaching production.
Cost and Resource Optimization
CI Token Usage Caps
Critical: Set hard limits on the number of tokens or API spend allowed per CI pipeline run to prevent runaway costs from recursive loops or large test suites.
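A hard cap can be enforced in-process with a small budget guard that every test helper charges before (or after) each API call; the 10,000-token limit below is an assumed value.

```python
class TokenBudget:
    """Hard cap on token spend for one CI run; aborts the suite when exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage from one API call and fail fast past the cap."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"CI token budget exhausted: {self.used}/{self.max_tokens}")

budget = TokenBudget(max_tokens=10_000)  # assumed per-run cap
budget.charge(4_000)
budget.charge(5_000)
try:
    budget.charge(2_000)  # would push the run over the cap
    raise AssertionError("expected budget overrun")
except RuntimeError:
    pass
```

Raising rather than skipping makes an overrun visible as a hard pipeline failure, which is the desired behavior when the likely cause is a recursive loop.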
Evaluation Parallelization
Recommended: Distribute LLM evaluation tasks across multiple parallel CI jobs to reduce build times when running large-scale semantic tests.
Model Weight Caching
Recommended: If self-hosting models, use persistent volume claims or CI caching layers for model weights to avoid multi-gigabyte downloads on every build.
Conditional Evaluation Execution
Recommended: Configure CI to only run expensive LLM evaluation suites if changes are detected in prompt files or model configurations.
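The path-matching half of this check can be sketched as below; the trigger patterns are hypothetical, and in CI the changed-file list would come from something like `git diff --name-only origin/main`.

```python
import fnmatch

# Hypothetical path patterns that should trigger the expensive eval suite.
EVAL_TRIGGERS = ["prompts/*.json", "config/models.yaml"]

def needs_llm_eval(changed_files: list[str]) -> bool:
    """Decide whether a change set touches prompts or model configuration."""
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_files
        for pattern in EVAL_TRIGGERS
    )

assert needs_llm_eval(["prompts/summarize.json", "app/main.py"])
assert not needs_llm_eval(["app/main.py", "README.md"])
```

The CI job then skips (or short-circuits) the evaluation stage when `needs_llm_eval` returns False, which is where most of the token savings come from on ordinary code-only changes.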
Small Model Testing
Optional: Use smaller, cheaper models (e.g., GPT-3.5 or Haiku) for basic logic testing in CI, reserving larger models for final staging evaluations.
Observability Integration
OpenTelemetry Instrumentation
Recommended: Ensure the CI/CD pipeline verifies that OpenTelemetry spans are correctly configured for all LLM calls to enable production tracing.
Log Masking Verification
Critical: Run a test case in CI that attempts to log PII through the AI feature and verifies that the logging middleware correctly redacts the data.
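A self-contained sketch of such a test, using a `logging.Filter` as a minimal stand-in for the redaction middleware and an email address as the PII example; the real middleware and PII patterns would be the application's own.

```python
import io
import logging
import re

# Minimal redaction middleware stand-in: masks email addresses before
# a record reaches any handler.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class RedactPII(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[REDACTED]", str(record.msg))
        return True

# The CI test: log a message containing PII, capture the output, and
# assert the raw value never reached the log stream.
stream = io.StringIO()
logger = logging.getLogger("ai_feature")
logger.addHandler(logging.StreamHandler(stream))
logger.addFilter(RedactPII())
logger.warning("user alice@example.com asked about pricing")

assert "alice@example.com" not in stream.getvalue()
assert "[REDACTED]" in stream.getvalue()
```

Asserting on captured output (rather than on the filter in isolation) verifies the redaction is actually wired into the logging path the AI feature uses.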
Metric Dashboard Deployment
Recommended: Automate the deployment of Grafana or Datadog dashboards alongside the application to track token usage and error rates per model.
Feedback Loop Capture
Optional: Verify that the production environment is configured to capture and store user 'thumbs up/down' feedback linked to specific prompt versions.
Alerting Thresholds
Critical: Verify that deployment scripts update alerting thresholds for inference failures and token exhaustion in the monitoring provider.