# Prompt Engineering in Production

*Battle-tested strategies for building reliable LLM-powered features, from prompt versioning to failure handling and evaluation*
The gap between "works in the playground" and "works in production" is where most LLM projects fail. Prompts that seem robust during development break in surprising ways when exposed to real user input. Here's what I've learned about building LLM features that actually ship.
## Treat Prompts as Code
Your prompts deserve the same rigor as your application code:
```typescript
// prompt-registry.ts
// SummaryResponseSchema is a zod schema defined alongside the registry.
export const PROMPTS = {
  summarize: {
    version: "2.3.1",
    template: `You are a summarization assistant...`,
    model: "gpt-4o",
    temperature: 0.3,
    maxTokens: 500,
    schema: SummaryResponseSchema,
  },
} as const;

// Version your prompts, track changes, review diffs.
// Never edit prompts directly in production.
```

We maintain a prompt registry with semantic versioning. Breaking changes (a different output format, for example) bump the major version. A/B tests compare versions on real traffic before rollout.
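One way to route real traffic across prompt versions is deterministic bucketing on a stable user id. The sketch below is hypothetical: `pickPromptVersion`, the rollout weights, and the FNV-1a hash choice are all illustrative, not part of the registry above.

```typescript
// Weighted rollout entries; weights are relative, not required to sum to 100.
type Rollout = { version: string; weight: number }[];

// FNV-1a hash of the user id, mapped to [0, 1). Stable per id, so a user
// always lands in the same bucket across requests.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

function pickPromptVersion(userId: string, rollout: Rollout): string {
  const u = hashToUnit(userId);
  const total = rollout.reduce((sum, r) => sum + r.weight, 0);
  let cumulative = 0;
  for (const r of rollout) {
    cumulative += r.weight / total;
    if (u < cumulative) return r.version;
  }
  // Floating-point edge case: fall through to the last bucket.
  return rollout[rollout.length - 1].version;
}
```

Because the bucket is a pure function of the user id, you can log the chosen version with every request and attribute eval metrics to it later.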
## Defensive Input Handling
Production inputs are adversarial—not because users are malicious, but because they're creative. Your prompt needs guardrails:
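The handler shown next relies on a `sanitizeForPrompt` helper. Here is a minimal hypothetical sketch of one; the `<<<`/`>>>` delimiters are an assumption about how the surrounding prompt wraps user input, so adapt them to whatever delimiters your templates actually use:

```typescript
// Assumption: prompt templates wrap user input in <<< ... >>> delimiters.
function sanitizeForPrompt(input: string): string {
  return input
    // Drop control characters that can confuse downstream logging and parsing.
    .replace(/[\u0000-\u0008\u000B-\u001F\u007F]/g, "")
    // Strip our own wrapping delimiters so input can't close the block early.
    .replace(/<<<|>>>/g, "")
    .trim();
}
```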
```typescript
async function processUserQuery(userInput: string): Promise<Result> {
  // 1. Input validation
  if (userInput.length > MAX_INPUT_LENGTH) {
    return { error: "Input too long", code: "INPUT_OVERFLOW" };
  }

  // 2. Injection detection (basic example)
  const suspiciousPatterns = [
    /ignore previous instructions/i,
    /system prompt/i,
    /you are now/i,
  ];
  if (suspiciousPatterns.some((p) => p.test(userInput))) {
    logger.warn("Potential injection attempt", { input: userInput });
    return { error: "Invalid input", code: "VALIDATION_FAILED" };
  }

  // 3. Sanitization
  const sanitized = sanitizeForPrompt(userInput);
  return await executePrompt(sanitized);
}
```

## Structured Outputs Are Non-Negotiable
Free-form text responses are a reliability nightmare. Force structure:
```typescript
import { z } from "zod";

const AnalysisSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  keyTopics: z.array(z.string()).max(5),
  summary: z.string().max(200),
});

async function analyzeText(text: string) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: buildAnalysisPrompt(text) }],
    response_format: { type: "json_object" },
  });

  // Parse and validate, always. The content field can be null, so guard first.
  const content = response.choices[0].message.content;
  if (!content) {
    throw new Error("Empty completion");
  }
  return AnalysisSchema.parse(JSON.parse(content));
}
```

When structured outputs fail validation, you have clear options: retry, fall back, or error out. With free-form text, you're parsing hope.
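The retry option can be a small generic wrapper: re-ask the model when parsing or validation throws. This is a sketch with hypothetical names (`generateValidated` is not from the code above); the validator argument can simply wrap `JSON.parse` plus `AnalysisSchema.parse`.

```typescript
// Hypothetical helper: retry generation when the validator throws.
async function generateValidated<T>(
  call: () => Promise<string>,   // produces raw model output
  validate: (raw: string) => T,  // throws on invalid output
  maxAttempts = 2
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await call();
    try {
      return validate(raw);
    } catch (e) {
      lastError = e; // invalid output: try again
    }
  }
  throw lastError; // exhausted retries: surface the last validation error
}
```

Keep `maxAttempts` low; each retry is paid latency and tokens, and persistent validation failures usually mean the prompt, not the sampling, is wrong.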
## Evaluation That Catches Regressions
Manual testing doesn't scale. Build automated evaluation:
```python
# eval_suite.py
from statistics import mean


class PromptEvalSuite:
    def __init__(self, prompt_version: str):
        self.prompt_version = prompt_version
        self.test_cases = load_golden_set("summarization_golden.json")

    def run(self) -> EvalResults:
        results = []
        for case in self.test_cases:
            # run_prompt returns a response object carrying text and latency
            output = run_prompt(self.prompt_version, case.input)
            results.append({
                "case_id": case.id,
                "output": output.text,
                "rouge_score": calculate_rouge(output.text, case.expected),
                "format_valid": validate_format(output.text),
                "latency_ms": output.latency,
            })
        return EvalResults(
            prompt_version=self.prompt_version,
            pass_rate=sum(r["format_valid"] for r in results) / len(results),
            avg_rouge=mean(r["rouge_score"] for r in results),
            p95_latency=percentile([r["latency_ms"] for r in results], 95),
        )
```

Run evals on every prompt change. Block deployment if metrics regress.
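The "block deployment" step can be a pure check over the eval summary, run in CI against the last accepted baseline. The field names mirror the eval results above, but the shape and the thresholds here are illustrative assumptions; tune them per feature.

```typescript
// Illustrative regression gate; the tolerances are assumptions, not policy.
interface EvalMetrics {
  passRate: number;     // fraction of golden cases with valid format
  avgRouge: number;     // mean ROUGE against expected outputs
  p95LatencyMs: number; // 95th-percentile latency
}

function regressionReasons(baseline: EvalMetrics, candidate: EvalMetrics): string[] {
  const reasons: string[] = [];
  if (candidate.passRate < baseline.passRate - 0.02) {
    reasons.push("pass rate dropped more than 2 points");
  }
  if (candidate.avgRouge < baseline.avgRouge - 0.05) {
    reasons.push("avg ROUGE dropped more than 0.05");
  }
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * 1.2) {
    reasons.push("p95 latency grew more than 20%");
  }
  return reasons; // non-empty => fail the CI job and block the deploy
}
```

Returning the list of reasons, rather than a bare boolean, makes the CI failure message actionable.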
## Graceful Degradation
LLMs will fail. Rate limits, timeouts, malformed outputs—plan for all of them:
```typescript
async function withLLMFallback<T>(
  primary: () => Promise<T>,
  fallback: () => T,
  options: { maxRetries: number; timeoutMs: number }
): Promise<T> {
  for (let attempt = 0; attempt < options.maxRetries; attempt++) {
    try {
      return await Promise.race([
        primary(),
        timeout(options.timeoutMs),
      ]);
    } catch (e) {
      logger.warn("LLM attempt failed", { attempt, error: e });
    }
  }
  logger.error("All LLM attempts failed, using fallback");
  return fallback();
}
```

The fallback might be a cached response, a simpler model, or a graceful error message. But never let an LLM failure crash your application.
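The `timeout` raced against the primary call above is assumed; a minimal version is a promise that only ever rejects after the deadline. Note that `Promise.race` does not cancel the losing call, it only stops waiting for it, so the underlying request may still complete (and bill you) in the background.

```typescript
// Minimal timeout helper: rejects after ms milliseconds, never resolves.
function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`LLM call timed out after ${ms}ms`)), ms)
  );
}
```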
Production LLM systems are 20% prompt engineering and 80% defensive programming. The prompt is just the beginning.