Checklists

Prompt Engineering implementation checklist

This checklist provides actionable verification steps for developers to ensure LLM prompts are robust, secure, and cost-effective before deployment to production environments.

Progress0 / 25 complete (0%)

Output Reliability and Formatting

0/5
  • Schema Validation Enforcement

    critical

    Verify that the model output consistently adheres to a strict JSON or XML schema using tools like Zod, Pydantic, or model-native structured output modes.

  • Few-Shot Example Diversity

    recommended

    Include at least 3-5 diverse input/output pairs in the prompt to demonstrate handling of edge cases and specific formatting requirements.

  • Negative Constraint Definition

    recommended

    Explicitly list forbidden elements (e.g., 'Do not include conversational filler', 'Do not mention your training data') to prevent unwanted verbosity.

  • Delimiter Implementation

    critical

    Use clear delimiters like triple backticks (```) or XML tags to separate system instructions, context, and user-provided input.

  • Fallback Logic for Parsing Errors

    critical

    Implement a retry mechanism or a secondary prompt to fix malformed JSON strings when the initial generation fails validation.

Reasoning and Logic Accuracy

0/5
  • Chain-of-Thought (CoT) Verification

    recommended

    For complex tasks, instruct the model to 'think step-by-step' and verify that the reasoning path leads to the correct conclusion.

  • Multi-Step Task Decomposition

    recommended

    Break prompts requiring more than three logical steps into a chain of separate, smaller LLM calls to reduce reasoning errors.

  • Context Window Optimization

    critical

    Ensure the most critical information is placed at the very beginning or end of the prompt to mitigate 'lost in the middle' retrieval issues.

  • Self-Correction Loop

    optional

    Include a step where the model reviews its own output for logical inconsistencies before returning the final response.

  • Reference Material Grounding

    critical

    Provide a 'ground truth' document and instruct the model to only use provided facts, citing specific sections to reduce hallucinations.

Security and Safety Guardrails

0/5
  • Prompt Injection Testing

    critical

    Test the prompt against common injection attacks like 'Ignore previous instructions' or 'System override' to ensure instruction persistence.

  • PII and Sensitive Data Masking

    critical

    Implement regex or NLP-based pre-processing to scrub personally identifiable information from user inputs before they reach the LLM.

  • Output Content Filtering

    critical

    Enable and configure provider-specific safety settings (e.g., OpenAI Moderation API) to block harmful or inappropriate content generation.

  • System Prompt Hardening

    critical

    Treat the system prompt as code; ensure it is stored in version control and not directly exposed to end-users via client-side code.

  • Input Length Sanitization

    recommended

    Set hard limits on user input character counts to prevent denial-of-service style attacks via massive token consumption.

Performance and Cost Management

0/5
  • Token Usage Benchmarking

    recommended

    Calculate the average token count for system instructions and few-shot examples to estimate cost per 1,000 requests.

  • Instruction Compression

    recommended

    Remove redundant adjectives and filler words from the prompt to reduce input token costs without impacting output quality.

  • Model Tier Selection

    recommended

    Verify if the task can be performed by a lower-cost model (e.g., GPT-4o-mini vs GPT-4o) and document any performance degradation.

  • Latency Measurement (TTFT)

    critical

    Measure the Time to First Token and total generation time across 100 samples to ensure it meets application SLA requirements.

  • Response Caching Strategy

    optional

    Implement semantic or exact-match caching for frequent queries to reduce API costs and improve response times.

Evaluation and Regression Testing

0/5
  • Golden Dataset Creation

    critical

    Maintain a versioned set of 50+ diverse inputs and their expected 'perfect' outputs for automated testing.

  • LLM-as-a-Judge Implementation

    recommended

    Use a more capable model (e.g., Claude 3.5 Sonnet) to grade the performance of the production model based on a rubric.

  • Prompt Versioning

    critical

    Assign semantic versions to prompt templates and track which version was used for every production inference call.

  • A/B Testing Deployment

    recommended

    Run the new prompt version alongside the current baseline for 5% of traffic to compare success metrics (e.g., conversion, error rate).

  • Provider Parity Check

    optional

    Test the prompt across at least two different model providers (e.g., OpenAI and Anthropic) to ensure portability and identify provider-specific biases.