AI Evals Engineer (LLM Evaluation & Customer Traceability)
Docket
We’re hiring an AI Evals Engineer to own the evaluation and observability systems that keep our AI clear, accurate, and trustworthy while closing the loop with customers. You’ll design gold‑standard test sets, automate offline/online evaluation, trace customer queries end‑to‑end, and wire quality signals into our product and release process so we can move fast without breaking trust.
What You’ll Do
Build eval pipelines: Implement automated offline evals for key use cases (RAG, agents, chat, extraction), including data ingestion, labeling strategies (human + LLM‑as‑judge), scoring, and regression suites in CI/CD (a minimal sketch of such a regression gate follows this list).
Define quality metrics: Formalize task‑specific and aggregate KPIs (e.g., factuality, faithfulness, toxicity, grounding precision/recall, nDCG/Recall@k for retrieval, latency/cost) with clear acceptance bars; the retrieval metrics are sketched in code after this list.
Own test datasets: Create and maintain golden sets, synthetic variants, adversarial cases, and red‑teaming corpora; version them and track drift.
Production monitoring: Instrument tracing and post‑deployment checks; detect regressions using canary traffic, shadow tests, and guardrails; trigger auto‑rollbacks or alerts.
Customer query tracing & triage: Instrument end‑to‑end traces from a customer question through retrieval, inference, and post‑processing (see the tracing sketch after this list). Reproduce and debug customer‑reported issues, perform root‑cause analysis across ML pipelines, and prioritize fixes.
Customer feedback loop: Partner with Customer Success & Support to translate tickets and feedback into structured QA signals, updated test cases, and proactive communications back to customers.
Experimentation: Run A/Bs and prompt/model experiments; ensure statistical rigor (power analysis, minimum detectable effect, and CUPED variance reduction when needed); analyze results and ship recommendations.
Dashboards & visibility: Build self‑serve dashboards and runbooks that surface trends by segment, feature, and model/prompt version.
Cross‑functional partner: Validate new features before launch and escalate quality risks early in the release cycle.
Process & governance: Establish eval review rituals, documentation, and change management for prompts, tools, model versions, and customer‑facing releases.
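To illustrate the kind of offline eval and CI regression gate described in the first bullet, here is a minimal sketch that assumes a JSONL golden set, the OpenAI Python client as an LLM judge, and a placeholder acceptance bar; the file name, judge model, rubric, and threshold are illustrative, not our actual stack.

```python
# Minimal sketch: offline LLM-as-judge eval run as a pytest regression gate.
# Assumes a JSONL golden set with {"question", "answer", "reference"} records;
# model name, rubric, and the 4.2 acceptance bar are placeholders.
import json
import statistics
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Rate how faithfully the ANSWER is supported by the REFERENCE on a 1-5 scale. "
    "Reply with a single integer only."
)

def judge_faithfulness(question: str, answer: str, reference: str) -> int:
    """Score one answer with an LLM judge; returns an integer from 1 to 5."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}\nREFERENCE: {reference}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

def test_faithfulness_regression():
    """Fail CI if mean judged faithfulness on the golden set drops below the bar."""
    records = [json.loads(line) for line in Path("golden_set.jsonl").read_text().splitlines()]
    scores = [judge_faithfulness(r["question"], r["answer"], r["reference"]) for r in records]
    mean_score = statistics.mean(scores)
    assert mean_score >= 4.2, f"faithfulness regressed: mean={mean_score:.2f}"
```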
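The retrieval KPIs named above (Recall@k, nDCG) reduce to a few lines of arithmetic; this sketch assumes ranked document IDs from a retriever and graded relevance labels from the golden set, with toy data in the usage lines.

```python
# Minimal sketch: Recall@k and nDCG@k over ranked retrieval results.
# `ranked_ids` is the retriever's output order; relevance labels come from the golden set.
import math

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """Normalized discounted cumulative gain for graded relevance labels."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(i + 2) for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: one query, three retrieved documents.
print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=3))          # 0.5
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 3.0, "d9": 2.0}, k=3))  # graded nDCG
```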
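And a minimal sketch of end‑to‑end customer query tracing with OpenTelemetry (one of the observability stacks listed in the requirements below); spans export to the console so it runs standalone, and the function, attribute names, and placeholder values are illustrative only.

```python
# Minimal sketch: tracing one customer query end to end with OpenTelemetry.
# Exports spans to the console so it runs standalone; in production the exporter
# would point at an observability backend. Attribute names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("ai-evals-demo")

def answer_customer_query(customer_id: str, question: str) -> str:
    with tracer.start_as_current_span("answer_customer_query") as root:
        root.set_attribute("customer.id", customer_id)
        root.set_attribute("query.text", question)

        with tracer.start_as_current_span("retrieval") as span:
            docs = ["doc-42"]  # placeholder for the real retriever call
            span.set_attribute("retrieval.num_docs", len(docs))

        with tracer.start_as_current_span("inference") as span:
            answer = "stub answer"  # placeholder for the real model call
            span.set_attribute("model.version", "v0-placeholder")

        with tracer.start_as_current_span("post_processing"):
            return answer.strip()

if __name__ == "__main__":
    print(answer_customer_query("cust-123", "How do I reset my password?"))
```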
What You’ll Bring
2+ years in product analytics, QA, data, or ML—ideally in AI/LLM products.
Strong Python and SQL; comfort querying production data, reading logs, and debugging flows.
Hands‑on experience designing LLM evals (rubric design, LLM‑as‑judge, human review, inter‑rater reliability).
Experience with distributed tracing or observability stacks (OpenTelemetry, Honeycomb, Datadog) and debugging end‑to‑end request paths.
Familiarity with retrieval metrics, prompt/version control, and basic statistics for experiments.
Very high agency and ownership: you turn ambiguity into a plan, ship iteratively, and measure impact.
Nice to Have
Experience with any of: LangSmith / OpenAI Evals / Ragas / DeepEval / Arize / WhyLabs; Airflow/Prefect/dbt; BigQuery/Snowflake; MLflow/W&B; Metabase/Mode/Hex; OpenTelemetry/trace‑based evaluation.
Safety & red‑teaming experience (prompt injection, PII leakage, jailbreaks).
Knowledge of CI/CD and release gates for prompts/tools/models.
Prior collaboration with Customer Success or Support teams.
How We’ll Measure Success (First 90 Days)
30 days: Baseline dashboards live for top journeys; initial golden sets and acceptance criteria published; trace logging in place for all customer‑facing endpoints.
60 days: CI evals block regressions; first A/Bs shipped with readouts; on‑call quality & customer‑impact alerting in place; mean time‑to‑identify (MTTI) for customer issues < 30 min.
90 days: Coverage of priority use cases ≥ 80%; measurable quality lift (e.g., +X pts factuality, −Y % escalations); mean time‑to‑resolution (MTTR) for customer quality tickets reduced by Z %; documented playbook for ongoing evals and customer feedback triage.
(If you’re passionate about building trustworthy AI and love digging into both metrics and customer stories, we’d love to meet you!)