hiring.interview.design

# Usage: curl -sSL https://seed.show/hiring.interview.design | bash -s <install-path>
# <install-path> is the directory where the file should land.

set -euo pipefail
[ -z "${1:-}" ] && {
echo "install requires a path: curl -sSL https://seed.show/hiring.interview.design | bash -s <install-path>" >&2
exit 1
}
TARGET="$1"
mkdir -p "$TARGET"
DEST="$TARGET/seed-fold.DMMvS9.folded.md"

cat > "$DEST" <<'PORTDOWN_4B09CA6C'
<!--
This is a `.folded.md` archive — a directory packed into one markdown
file. The markers below are load-bearing; don't edit them directly.

To unpack (requires bash — if you have no shell, tell the user):
1. curl -sSLf https://fold.dom.vin/skill | bash -s <INSTALL_DIR>
2. <INSTALL_DIR>/fold/scripts/unfold <this-file>
(or: unfold <this-file> if fold/scripts is on your PATH)
-->

---
fold: true
marker: d6e99f
at: 2026-05-07T16:16:08Z
root: seed-pack.nzEyCU
---

# hiring.interview.design

Interview design context for agents. What to know before generating questions, building rubrics, or advising on hiring process.

## The load-bearing fact

Structured interviews significantly outperform unstructured ones. Schmidt & Hunter's 1998 meta-analysis (source 1 in sources.md) found structured interviews have a validity coefficient of ~0.51 versus ~0.38 for unstructured, which corresponds to roughly 26% vs. 14% of variance in job performance (the square of each coefficient). Unstructured interviews still carry some signal, but they give up close to half of that explanatory power. Everything else in this file is downstream of that finding.
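
The variance figures come from squaring each coefficient. A minimal sketch of that arithmetic, using only the two values quoted above (the function name is illustrative, not from the source):

```python
def variance_explained(validity: float) -> float:
    """Share of variance in job performance accounted for by a validity coefficient r."""
    return validity ** 2

# Coefficients as quoted from Schmidt & Hunter (1998) in this file.
for label, r in [("structured", 0.51), ("unstructured", 0.38)]:
    print(f"{label}: r = {r:.2f}, variance explained = {variance_explained(r):.0%}")
```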

## Mental model: interview as a measurement instrument

A job interview is a prediction instrument, not a conversation. Its purpose is to predict future job performance. Everything else — rapport, impressionistic fit, "culture," whether the candidate seems like a good person — is noise that degrades the prediction. Structure is what makes the instrument valid.

Three conditions determine validity:
- **Consistency**: every candidate faces the same questions in the same order
- **Structure**: questions are derived from job analysis, not intuition
- **Independence**: interviewers score before they discuss

When those three hold, interviews predict. When any is missing, you're mostly measuring interviewer confidence and similarity bias.

Why unstructured interviews fail: interviewers form an impression in the first few minutes and spend the rest of the session confirming it. With no common questions across candidates, each candidate is evaluated on different evidence, so panel consensus collapses to whoever spoke most confidently. Research treats this not as a flaw to manage but as a fundamental validity failure.

## What structured interviewing is — and is not

Structured interviewing is not a rigid script that prevents follow-up. It is a discipline: evaluation criteria are defined before any candidate is seen, questions are designed to elicit evidence about those criteria, and scoring is done independently before discussion contaminates it.

Four components distinguish structured from unstructured:

1. **Job analysis first.** Every question traces back to a specific competency derived from analyzing the job — not from intuition about what "a good hire" looks like.
2. **Standardized questions.** Every candidate for a given role answers the same questions, in the same order, with the same probing structure. This creates a common evidence base.
3. **Behavioral or situational format.** Questions elicit concrete past behavior or a response to a hypothetical scenario — not abstract self-description. "What are your strengths?" is not a structured question.
4. **Anchored scoring.** Each question has a rubric: what a 1, 3, and 5 answer looks like for this specific competency in this specific role. Interviewers score independently before any panel discussion.

## Behavioral vs. situational questions

Two valid formats. Not interchangeable.

**Behavioral questions** — *Tell me about a time when...* — rest on the premise that past behavior predicts future behavior. They work best when candidates have had prior opportunity to demonstrate the competency. The STAR probing structure (Situation, Task, Action, Result) extracts the actual behavior from the story rather than letting candidates describe what they would have done. The action is the signal; everything else is context.

**Situational questions** — *Imagine you are six weeks into the role and discover...* — work best for early-career candidates or role-specific scenarios unlikely to have been encountered before. The premise (stated intentions predict behavior) is weaker than the behavioral premise, but far stronger than open-ended self-description.

Both formats require pre-written probes. Without probes, interviewers improvise, and improvised follow-ups reintroduce the variability structure was designed to eliminate.

Watch for: **hypothetical behavioral questions** (*"How would you handle..."*) — they sound behavioral but allow abstract answers. The fix is simple: change "would" to "did" and require a real example.

## Scoring rubrics and anchor development

A rubric without behavioral anchors is not a rubric — it is a label. "1 = poor, 3 = acceptable, 5 = excellent" tells a rater nothing about what evidence would justify each score. The result is that raters apply their own implicit standards, and the scoring is unstructured in everything but name.

Behavioral anchors are concrete descriptions of what a score at each level looks like for this competency in this role. They are written before any candidate is interviewed, by people who know the job well enough to describe what exceptional versus adequate versus poor looks like. Anchors must precede candidates — anchors developed after hearing real candidates are contaminated by those candidates.

Anchor development is iterative and expensive the first time. For a given role: draft anchors, pilot on a practice case, revise based on inter-rater disagreements, finalize before the first live interview. The anchors become a durable asset: they survive panel turnover, can be audited for bias, and make onboarding new interviewers fast.

## What AI is changing

AI is entering the interview process from multiple directions simultaneously. Agents advising on interview design need to understand both the claims and the validity evidence — or lack of it.

**Video interview analysis.** Platforms analyze recorded candidate interviews and score on dimensions like "communication," "confidence," or "cognitive ability." The vendors claim predictive validity; the peer-reviewed validation evidence for most systems is thin or nonexistent. Key question to ask any vendor: what is the criterion-related validity coefficient against actual job performance data, from a prospective study, in a population similar to yours? Most cannot answer this.

**Automated question generation.** LLMs can generate structured behavioral and situational questions from a job description at volume. This is useful — but only if the job description reflects an actual job analysis. Generating questions from a stale JD produces questions with no validity basis regardless of how well-formed they look syntactically.

**Real-time coaching.** Tools that coach interviewers in real-time (flagging when they're asking leading questions, tracking time per question, prompting for follow-up probes) have genuine potential to improve consistency. These tools improve structure without replacing the human judgment that scores behavioral evidence.

**Bias detection.** Automated flagging of questions that reference protected characteristics is useful and legally defensible. Harder claims — that AI systems reduce "overall bias" in candidate evaluation — require the same validity evidence as any other selection tool. Bias reduction claims should be accompanied by adverse impact analyses showing reduced disparate impact in actual hiring outcomes.
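
One common first-pass screen for adverse impact is the four-fifths rule cited in sources.md (source 5): a group whose selection rate falls below 80% of the highest group's rate is flagged for closer review. The sketch below assumes hypothetical group labels and counts, and the function names are illustrative. It is a screening check under those assumptions, not a full adverse impact analysis.

```python
def selection_rates(outcomes):
    """outcomes: group -> (hired, applicants). Returns group -> selection rate."""
    return {g: hired / applicants for g, (hired, applicants) in outcomes.items()}

def four_fifths_flags(outcomes, threshold=0.8):
    """Return groups whose selection rate is below `threshold` of the highest rate."""
    rates = selection_rates(outcomes)
    highest = max(rates.values())
    if highest == 0:
        return []  # nobody was selected, so there is no ratio to compare
    return sorted(g for g, r in rates.items() if r / highest < threshold)

# Hypothetical counts for illustration only.
outcomes = {"group_a": (48, 80), "group_b": (12, 40)}
print(four_fifths_flags(outcomes))
```

A vendor's bias-reduction claim can be checked the same way: ask for the per-group counts behind the claim and compute the ratios yourself.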

**Legal landscape — Illinois AI Video Interview Act (2020).** Illinois requires employers using AI to analyze video interviews to: (1) notify candidates before the interview that AI will be used; (2) explain what the AI is evaluating and how; (3) obtain candidate consent; (4) limit sharing of video footage; (5) delete footage within 30 days of request. Illinois is the leading jurisdiction; similar requirements are developing elsewhere. Any agent advising on video-AI tooling must surface these obligations. Source: IAVIA, 820 ILCS 42.

**What AI vendors should demonstrate vs. what they claim.** Vendors should provide: a prospective criterion-related validity study (not retrospective, not concurrent), adverse impact analyses across protected groups, sample sizes adequate for statistical inference, and a clear description of what construct is actually being measured. Most vendor white papers instead provide: testimonials, engagement metrics, time-to-hire reductions, and "our AI was trained on X interviews." Time-to-hire is not a validity criterion.

## What agents get wrong

**1. Generating questions without a job analysis.**
Questions not grounded in specific competencies required by the job have no validity basis. A generically "good" interview question about leadership may not measure anything relevant to the role. Before writing questions, identify: what are the 4–6 competencies this role requires, and what does evidence of each look like in this specific context? Without that, question generation is noise production.

**2. Treating "culture fit" as a valid construct.**
"Culture fit" has no operational definition in selection research (sources.md, sources 1 and 6), no agreed measurement approach, and no demonstrated validity. It functions as a rationalization for similarity bias — candidates who remind interviewers of themselves score higher. The correct framing is usually one of two real constructs: values alignment (does this person operate with the same core commitments?) or collaboration style (will this person work effectively with this team's norms?). Both can be assessed with structured behavioral questions and scored rubrics. "Culture fit" cannot.

**3. Omitting scoring rubrics.**
An agent that generates questions but not rubrics has produced half the artifact. Questions without anchors produce variable scoring, which produces unreliable aggregate evaluations, which means the structure is cosmetic. Every question needs a rubric. The rubric is the hard part.

**4. Failing to distinguish panel structure from evaluation criteria.**
Who interviews the candidate, and what each interviewer is evaluating, are separate design decisions. A common failure: assigning multiple interviewers to evaluate the same competency, which wastes time and creates false precision. Each competency should be evaluated once, by the interviewer best positioned to judge it. The panel is a division of labor problem. It should be designed, not defaulted.

**5. Skipping calibration.**
Calibration is a session before the first interview — and periodically after — where interviewers review the anchors together, score a practice case, and resolve disagreements about what each score level means. Without calibration, two interviewers using the same rubric can still apply it differently. One person's 4 is another's 2. Calibration is how inter-rater reliability is established before it matters. Interview experience does not substitute for calibration — it produces confident but idiosyncratic scoring.

**6. Ignoring legal constraints.**
Questions must be job-relevant. Questions about protected characteristics (age, national origin, disability, religion, family status) are unlawful in most jurisdictions regardless of how they are framed. See EEOC sources in sources.md (sources 2 and 5). For AI-assisted video interviews in Illinois and a growing number of jurisdictions: disclosure, consent, and deletion obligations apply (source 8 in sources.md).

## For full pipeline detail

See structure.md: stage-by-stage protocol from job analysis through structured debrief. See sources.md for research citations — fetch the URLs there rather than relying on memory.

# sources

Fetch these at task time. Ordered by importance.

1. Schmidt & Hunter (1998) meta-analysis — the foundational study on selection validity; 85 years of research across 19 selection methods. The source for structured vs. unstructured interview validity coefficients (0.51 vs. 0.38) and the comparison with work sample tests and cognitive ability measures:
https://journals.sagepub.com/doi/10.1177/001872679805100604

2. EEOC: pre-employment inquiries guidance — what questions are lawful and unlawful in an interview context; the legal constraint that interview design must operate within. Job-relevance is the standard; questions about protected characteristics are unlawful regardless of phrasing:
https://www.eeoc.gov/pre-employment-inquiries-and-employment-decisions

3. SHRM: structured interviewing guide — practitioner overview of behavioral and situational questions, panel structure, and scoring rubric design:
https://www.shrm.org/topics-tools/tools/toolkits/conducting-structured-interviews

4. Laszlo Bock, Work Rules! (2015), Chapter 4. Covers Google's move from brainteaser interviews to structured behavioral interviews; the internal validity data that drove the change; practical anchor development. No chapter-level URL is available, so cite the book as a primary source for the practitioner case; the book's site is:
https://www.workrulesbook.com

5. EEOC: Uniform Guidelines on Employee Selection Procedures — the federal regulatory framework governing selection procedures, including the validity and adverse impact standards that structured interviews must satisfy. The standard for adverse impact is the 4/5ths rule:
https://www.eeoc.gov/laws/guidance/questions-and-answers-clarify-and-provide-common-interpretation-uniform-guidelines

6. Society for Industrial-Organizational Psychology (SIOP): interview validity overview — the scientific society's practitioner summary of what makes interviews valid predictors; covers reliability, validity, and legal defensibility. The SIOP position on "culture fit" as a construct is that it lacks operational definition and validity evidence:
https://www.siop.org/Research-Publications/Items-of-Interest/ArticleType/ArticleView/ArticleId/549

7. EEOC: questions and answers on the use of tests and other selection procedures — guidance on how disparate impact analysis applies to structured interview scoring:
https://www.eeoc.gov/laws/guidance/questions-answers-clarify-and-provide-common-interpretation-uniform-guidelines

8. Illinois AI Video Interview Act (IAVIA), 820 ILCS 42. In effect since January 1, 2020; requires employer disclosure before AI-analyzed video interviews, explanation of what AI evaluates, candidate consent, restrictions on sharing footage, and deletion of recordings within 30 days upon candidate request. First major AI-specific interviewing regulation in the US; similar legislation developing in other jurisdictions:
https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3996

9. AI in Hiring: Examining Algorithms as Employment Discrimination (Harvard Law Review) — legal analysis of how existing employment discrimination frameworks apply to algorithmic hiring tools, including AI video interview platforms; addresses the gap between vendor validity claims and regulatory standards:
https://harvardlawreview.org/2020/04/ai-in-hiring-examining-algorithms-as-employment-discrimination/

10. EEOC: Using Artificial Intelligence and Algorithms in Employment Decisions — EEOC technical assistance document on how employers using AI selection tools remain responsible for adverse impact even when the tool is vendor-supplied; addresses the "we didn't build it" liability misconception:
https://www.eeoc.gov/laws/guidance/questions-and-answers-clarify-and-provide-common-interpretation-uniform-guidelines-0

# Structured interview anatomy

The pipeline from open role to valid prediction. Each stage is a decision point, not a checkbox. Skipping or compressing any stage propagates forward as error.

---

## Stage 1: Job analysis

**What it is.** A systematic examination of the role to identify the tasks, knowledge, skills, and abilities required for successful performance. Not "what does the hiring manager want in their ideal candidate" — that's preference, not analysis.

**How it's done.** At minimum: interview two to four current high-performers in the role and two to four of their managers. Ask what tasks fill their time, what separates strong performers from adequate ones, and what they'd screen for if they were hiring. Supplement with O*NET data for standard occupational requirements. For technical roles, include a work sample or task inventory.

**Decision point.** Is this a backfill or a new role? Backfills can lean on incumbents. New roles require more inference from adjacent jobs and explicit managerial alignment on what success looks like, because there's no reference population to interview.

**What breaks here.** Using a job description as a proxy for job analysis. Job descriptions are often stale, aspirational, or copied from a template. They describe the credential requirements someone thought mattered, not the tasks that determine performance. Building an interview on an unvalidated job description produces questions with no validity basis.

---

## Stage 2: Competency definition

**What it is.** Translating the job analysis output into 4–6 scoreable competencies — the attributes or behaviors most predictive of performance in this role. Not personality traits ("must be a team player"). Observable, describable behaviors tied to specific job tasks.

**How to choose them.** Sort the job analysis themes by two criteria: frequency (how often does the work require this?) and importance (when it's missing, how much does performance suffer?). The competencies in the top quadrant of both get evaluated in the interview. The rest may surface elsewhere in the process (references, work samples, assessments).

**Decision point.** How many competencies? Four to six is the working limit for a structured interview — enough to cover the critical dimensions of the role, not so many that each gets one thin question with a rushed rubric. If the role genuinely requires more than six, split the competencies across multiple interviewers rather than cramming them into one interview.
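
The frequency-and-importance sort can be sketched as a small filter. The 1-5 rating scale, the cutoff of 4, and the theme names below are assumptions made for illustration; the source only specifies the two criteria and the limit of six.

```python
from dataclasses import dataclass

@dataclass
class Theme:
    name: str
    frequency: int   # hypothetical 1-5 rating: how often the work requires this
    importance: int  # hypothetical 1-5 rating: how much performance suffers without it

def select_competencies(themes, cutoff=4, limit=6):
    """Keep themes rated high on both criteria, strongest first, capped at `limit`."""
    top = [t for t in themes if t.frequency >= cutoff and t.importance >= cutoff]
    top.sort(key=lambda t: (t.frequency + t.importance, t.importance), reverse=True)
    return [t.name for t in top[:limit]]

themes = [
    Theme("incident triage", 5, 5),
    Theme("stakeholder updates", 4, 5),
    Theme("vendor negotiation", 2, 4),
    Theme("code review", 5, 4),
    Theme("conference talks", 1, 2),
]
print(select_competencies(themes))
```

Themes that miss the cutoff are not discarded; they are candidates for references, work samples, or assessments.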

**What breaks here.** Competency lists that are really value statements: "integrity," "growth mindset," "executes with urgency." These have no shared operational definition across raters. Before including a competency, ask: can I write a behavioral anchor for a 1 and a 5? If not, it isn't measurable yet.

---

## Stage 3: Question bank design

**What it is.** Writing the specific questions each interviewer will ask, in the order they'll ask them, with the follow-up probes built in. The output is a question bank: a documented set of questions organized by competency, with the rationale for each question's validity and 2–3 pre-written probes per question. The bank is a shared, versioned artifact — not notes in each interviewer's head.

**Behavioral format.** *"Tell me about a time when you had to…"* Appropriate when the competency requires lived experience. Probes follow the STAR frame: push for Situation (what was the context?), Task (what was your specific responsibility?), Action (what did you actually do — not "we," not "would"), Result (what happened, and how do you know?). The action is the signal; everything else is context. Without probes, candidates substitute a polished narrative for a concrete behavioral account.

**Situational format.** *"Imagine you are…"* Appropriate when the competency is hard to probe through past experience — early-career candidates, novel role requirements, or situations where lived experience in the specific domain is uncommon. Probes here push for specificity: what exactly would you do first? Who would you involve? What would success look like?

**Building the bank.** For each competency, write 3–5 candidate questions before selecting the one to use in the interview. Having alternatives matters when a candidate has already answered the primary question in a previous round, or when the primary question doesn't yield behavioral evidence (candidate says "I haven't encountered that specific situation"). The alternate questions are also useful calibration material: anchor writers can score hypothetical responses across multiple questions to stress-test whether the rubric holds.
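
One way to keep the bank a shared, checkable artifact is to store each entry as a record and lint it against the counts given above. This is a sketch under assumptions: the field names and the `problems` helper are hypothetical, and probes are attached to the entry as a whole rather than to each candidate question.

```python
from dataclasses import dataclass, field

@dataclass
class BankEntry:
    competency: str
    fmt: str                      # "behavioral" or "situational"
    candidates: list[str]         # 3-5 candidate questions; the first is the primary
    probes: list[str] = field(default_factory=list)  # 2-3 pre-written probes

    def problems(self) -> list[str]:
        """Return the ways this entry falls short of the bank's conventions."""
        issues = []
        if self.fmt not in ("behavioral", "situational"):
            issues.append("format must be behavioral or situational")
        if not 3 <= len(self.candidates) <= 5:
            issues.append("write 3-5 candidate questions")
        if not 2 <= len(self.probes) <= 3:
            issues.append("write 2-3 probes")
        return issues

entry = BankEntry(
    competency="conflict resolution",
    fmt="behavioral",
    candidates=[
        "Tell me about a time you disagreed with a teammate about priorities.",
        "Tell me about a time a colleague pushed back on your work.",
        "Tell me about a time you had to raise a problem with a peer.",
    ],
    probes=["What did you actually do?"],
)
print(entry.problems())
```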

**Decision point.** Behavioral or situational per competency? The default is behavioral if the candidate pool is likely to have relevant experience. Situational if not. Using both for the same competency is acceptable if they test different facets — but it extends the interview and requires two rubrics.

**What breaks here.** Hypothetical behavioral questions: *"How would you handle a situation where…"* These sound behavioral but allow abstract answers ("I would make sure to communicate clearly"). They measure self-concept, not behavior. The fix is simple: change "would" to "did" and require a real example. A second common failure: building the question bank from a JD rather than from the job analysis output — this produces questions that assess credential-holder stereotypes, not the actual behaviors that drive performance.

---

## Stage 4: Anchor development

**What it is.** Writing behaviorally-specific descriptions of what a 1, 3, and 5 answer looks like for each question. Not abstract quality labels — concrete descriptions of the evidence that would appear in an answer at each level.

**How anchors are written.** For each question, describe the behaviors or outcomes you'd expect from a top performer's answer (5), a competent-but-not-exceptional performer's answer (3), and an inadequate performer's answer (1). Anchors should be specific enough that two raters who have never discussed the role would score the same answer the same way.

**Example structure for a conflict-resolution question:**
- 5: Candidate proactively surfaced the conflict, named specific behavior rather than characterizing the person, proposed a concrete resolution path, checked in afterward to verify the issue resolved. Result: relationship intact and performance improved.
- 3: Candidate addressed the conflict when it became unavoidable, focused on the task rather than the underlying dynamic, resolution was functional but left residual tension. Result: immediate issue resolved.
- 1: Candidate avoided, escalated to a manager without first attempting direct resolution, or described the other party as the problem without examining their own role. Result: unresolved or resolved by attrition.
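
The same anchors can be stored as data so that a rubric with missing levels is rejected before any interview. A minimal sketch, assuming a hypothetical `Rubric` class and abbreviated anchor text taken from the example above:

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    competency: str
    anchors: dict  # score -> behavioral description of an answer at that level

    def validate(self, required=(1, 3, 5)):
        """Raise if any required score level lacks a written anchor."""
        missing = [s for s in required if not self.anchors.get(s, "").strip()]
        if missing:
            raise ValueError(f"{self.competency}: no anchor for score(s) {missing}")
        return self

conflict = Rubric(
    competency="conflict resolution",
    anchors={
        5: "Proactively surfaced the conflict, named specific behavior, "
           "proposed a concrete resolution path, checked in afterward.",
        3: "Addressed the conflict when unavoidable, focused on the task; "
           "resolution functional but left residual tension.",
        1: "Avoided, escalated without attempting direct resolution, "
           "or described the other party as the problem.",
    },
).validate()
```

A rubric that only carries labels for some levels fails `validate`, which is the point: a label without a behavioral description is not an anchor.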

**Decision point.** 3-point or 5-point scale? A 3-point scale is faster to anchor and easier to calibrate. A 5-point scale gives more discrimination but requires five distinct behavioral descriptions per question. For most roles, 3-point is adequate and more reliably consistent across raters.

**What breaks here.** Writing anchors after interviews have started. Anchors developed after hearing real candidates are contaminated by those candidates — the memorable answer becomes the implicit reference, not the job requirement. Anchors must precede candidates.

---

## Stage 5: Panel design

**What it is.** Deciding who interviews each candidate and which competencies each interviewer evaluates. The panel is a division of labor problem; treat it as one.

**How to assign competencies.** Each competency should be evaluated by the interviewer best positioned to judge it: a technical skill by a technical peer, a stakeholder management competency by someone who works across teams, a leadership competency by a manager or senior individual contributor. One interviewer per competency is the default. Two is acceptable when the competency is the highest-stakes dimension of the role and you want a second independent data point. Three or more produces false consensus, not better data.
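
A panel plan can be checked mechanically against these rules. The sketch below is illustrative: the function name, the interviewer names, and the data shape (interviewer mapped to the competencies they score) are assumptions.

```python
def panel_problems(competencies, assignments, max_raters=2):
    """assignments: interviewer -> list of competencies they score.
    Returns a list of coverage problems; an empty list means the plan is consistent."""
    counts = {c: 0 for c in competencies}
    issues = []
    for interviewer, owned in assignments.items():
        for c in owned:
            if c not in counts:
                issues.append(f"{interviewer} scores '{c}', which is not a defined competency")
            else:
                counts[c] += 1
    for c, n in counts.items():
        if n == 0:
            issues.append(f"'{c}' has no interviewer")
        elif n > max_raters:
            issues.append(f"'{c}' has {n} interviewers (more than {max_raters})")
    return issues

plan = {
    "ana": ["debugging", "overall fit"],
    "raj": ["debugging"],
    "li": ["debugging"],
}
for issue in panel_problems(["debugging", "stakeholder management"], plan):
    print(issue)
```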

**Candidate experience design.** A four-to-six competency interview can be run by one skilled interviewer in 60–75 minutes, or split across two interviewers in 45 minutes each. Longer panels (5+ interviewers) increase candidate fatigue and interviewer coordination costs without proportional validity gain. Once every competency is covered, each additional interview yields diminishing returns.

**Decision point.** How many interviewers? The answer is "however many it takes to cover the competencies, with one interviewer per competency." Common failure mode: adding interviewers because seniority expects it, not because they're evaluating something uncovered.

**What breaks here.** Assigning every interviewer to evaluate "overall fit." This produces unstructured panel consensus — everyone forms an impression, and the loudest voice wins the debrief. Each interviewer should own specific competencies and arrive at the debrief with scored evidence, not impressions.

---

## Stage 6: Calibration

**What it is.** A session with all interviewers before the first candidate is seen, where the team reviews the anchors, scores a practice case, and resolves disagreements about what each score level requires.

**How it runs.** Distribute the rubric and a written or video practice case (a fabricated candidate response at each score level). Ask each interviewer to score independently. Compare scores. Where they disagree by more than one point, discuss the anchor: what evidence would justify the higher score? What would have to be absent for the lower score? Revise the anchor language if the disagreement reveals genuine ambiguity. Repeat until inter-rater agreement is tight. A target of ≤1 point of disagreement on the 3-point scale across 80% of practice-case ratings is a reasonable calibration standard before going live.
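
The calibration standard can be computed directly from the practice-case scores. This sketch makes one interpretive assumption: agreement is measured over every pair of interviewers on each practice response. The function names and sample scores are hypothetical.

```python
from itertools import combinations

def agreement_rate(ratings, tolerance=1):
    """ratings: one list per practice response, holding one score per interviewer.
    Returns the share of interviewer pairs whose scores differ by <= tolerance."""
    within = [abs(a - b) <= tolerance
              for scores in ratings
              for a, b in combinations(scores, 2)]
    if not within:
        raise ValueError("need at least two scores on at least one response")
    return sum(within) / len(within)

def calibrated(ratings, target=0.8):
    """True when the agreement rate meets the go-live target."""
    return agreement_rate(ratings) >= target

# Hypothetical scores from three interviewers on four practice responses.
practice = [[5, 5, 4], [3, 3, 1], [1, 2, 1], [3, 4, 3]]
print(f"{agreement_rate(practice):.0%} of pairs within one point")
```

The responses where pairs fall outside the tolerance (the second one here) are the ones whose anchors deserve discussion first.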

**Decision point.** How often to calibrate? Once before the first interview for each role, then again if a new interviewer joins the panel mid-hiring cycle. If the hiring process runs long (more than 8–10 weeks), a mid-process calibration check is worth running — rater drift is real and well-documented. The calibration session itself is also a quality gate: if the panel cannot agree on anchor scores after two passes, the anchors need revision before interviews begin.

**What breaks here.** Treating calibration as optional for experienced interviewers. Interview experience does not produce calibrated scoring — it produces confident but idiosyncratic scoring. The whole point of calibration is to establish a shared external reference that doesn't rely on each interviewer's private judgment. Experience doesn't substitute for that.

---

## Stage 7: Structured debrief

**What it is.** A post-interview discussion where each interviewer presents their scored evidence per competency before the group moves to a hire/no-hire recommendation. Structure in the debrief preserves the structure in the evaluation.

**How it runs.** Each interviewer presents scores and behavioral evidence for their assigned competencies before any global recommendation is made. Disagreements between interviewers on the same competency are resolved by going back to the behavioral evidence, not to impressions. The hire/no-hire decision follows from the aggregated scores, not from consensus-seeking.

**Decision point.** What to do when scores are ambiguous or the evidence is mixed? Mixed evidence on a core competency is a signal to dig, not to average. Options: request a second structured interview on the competency in question, add a work sample that directly tests it, or decline and document why the evidence was insufficient. Advancing an ambiguous candidate because they "felt right in the room" undoes every prior stage.

**What breaks here.** Anchoring on the first opinion shared. Whoever speaks first in a debrief sets the frame for every subsequent score. The fix: collect written scores from all interviewers before the debrief begins, reveal them simultaneously, and start discussion at the points of disagreement rather than at confirmation of the first speaker's view.
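
The fix can be sketched as an agenda builder: gather written scores first, then order the discussion by how far apart the scores are. The function name and data shape are assumptions for illustration.

```python
def debrief_agenda(written_scores):
    """written_scores: competency -> {interviewer: score}, collected before the meeting.
    Returns competencies ordered by score spread, largest disagreement first."""
    spread = {c: max(s.values()) - min(s.values()) for c, s in written_scores.items()}
    return sorted(spread, key=lambda c: spread[c], reverse=True)

# Hypothetical written scores, revealed simultaneously at the start of the debrief.
scores = {
    "debugging": {"ana": 4, "raj": 4},
    "stakeholder management": {"li": 2, "ana": 5},
    "prioritization": {"raj": 3, "li": 4},
}
print(debrief_agenda(scores))
```

Because the order depends only on the written scores, no interviewer's spoken opinion sets the frame for the rest.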

PORTDOWN_4B09CA6C

# ── post ──
MARKER=$(awk '/^---$/ { f++; if (f==2) exit; next } f==1 && /^marker:[[:space:]]/ { sub(/^marker:[[:space:]]+/, ""); print; exit }' "$DEST")
[ -z "$MARKER" ] && { echo "seed: archive has no marker — corrupt" >&2; exit 1; }
awk -v m="$MARKER" -v outdir="$TARGET" '
BEGIN {
# Match the fold file marker (<!--fold:MARKER@file path="...") with an
# optional mode attr after the path (fold emits mode="644" on executables).
file_re = "^$"
end_re = "^$"
}
$0 ~ end_re { if (current) close(current); exit }
$0 ~ file_re {
if (current) close(current)
line = $0
sub(/^<!--fold:[^@]+@file path="/, "", line); sub(/".*$/, "", line)
current = outdir "/" line
dir = current; sub(/\/[^\/]*$/, "", dir)
if (dir != current) system("mkdir -p \"" dir "\"")
printf "" > current
next
}
current { print >> current }
' "$DEST"
SEED_EXTRACTED=$(find "$TARGET" -type f -not -path "$DEST" 2>/dev/null | wc -l)
if [ "$SEED_EXTRACTED" = "0" ]; then
echo "seed: archive contained no files — refusing to delete the source" >&2
echo " archive preserved at: $DEST" >&2
exit 1
fi
rm -f "$DEST"

echo "" >&2
echo "✓ seed unpacked → $TARGET ($SEED_EXTRACTED files)" >&2
find "$TARGET" -type f | sort | while IFS= read -r _sf; do
echo " ${_sf#${TARGET}/}" >&2
done
echo "" >&2
if [ -f "$TARGET/SKILL.md" ]; then
echo "This seed contains a skill (SKILL.md). Install it in your agent's skills directory." >&2
echo "" >&2
fi
echo "Install the seed skill if not already installed:" >&2
echo " https://seed.show/skill" >&2
echo "" >&2
echo "Publisher prompt:" >&2
sed 's/^/ /' >&2 <<'__SEED_PROMPT_END_AC1F2B__'
You have evidence-based interview design context. Read README.md for the mental model, load-bearing facts, AI tooling risks, and what agents get wrong. Read structure.md for the full pipeline: job analysis → competency definition → question bank → rubric anchors → panel design → calibration → structured debrief. Fetch sources.md for research citations (Schmidt & Hunter 1998, EEOC guidelines, Illinois AI Video Interview Act). To share your own context bundles: curl seed.show/skill | bash
__SEED_PROMPT_END_AC1F2B__
exit 0

# id: hiring.interview.design | size: 29.8 KB | created: 2026-05-06 | expires: permanent