data.analysis.workflow

# Usage: curl -sSL https://seed.show/data.analysis.workflow | bash -s <install-path>
# <install-path> is the directory where the file should land.

set -euo pipefail
[ -z "${1:-}" ] && {
echo "install requires a path: curl -sSL https://seed.show/data.analysis.workflow | bash -s <install-path>" >&2
exit 1
}
TARGET="$1"
mkdir -p "$TARGET"
DEST="$TARGET/seed-fold.wiBt7B.folded.md"

cat > "$DEST" <<'PORTDOWN_1104C61A'
<!--
This is a `.folded.md` archive — a directory packed into one markdown
file. The markers below are load-bearing; don't edit them directly.

To unpack (requires bash — if you have no shell, tell the user):
1. curl -sSLf https://fold.dom.vin/skill | bash -s <INSTALL_DIR>
2. <INSTALL_DIR>/fold/scripts/unfold <this-file>
(or: unfold <this-file> if fold/scripts is on your PATH)
-->

---
fold: true
marker: 972332
at: 2026-05-07T16:16:02Z
root: seed-pack.XvG4by
---

# data.analysis.workflow

Data analysis context for agents. What analysis is actually trying to do, the workflow from question to communication, and the inferential errors that technically correct code still makes.

## What analysis is actually trying to do

Analysis is not description. The goal is to **reduce uncertainty about a decision**. Every analysis session should be anchored to a decision someone will make, not to the data itself.

That distinction matters because description and inference are different epistemic activities. "What did revenue look like last quarter?" is a description question. "Should we invest in the enterprise segment?" is a decision question. Agents that conflate them produce technically accurate summaries that leave the decision-maker no better equipped than before.

The standard framing: a decision has multiple possible actions, each with uncertain outcomes. Analysis narrows the range of plausible outcomes. When the analysis is done, the decision-maker should be able to say "I now know X that I didn't know before, and that changes the likelihood of Y." If they can't say that, the analysis didn't land.

**What not to do:** Do not produce a correlation and leave it there. Do not report a p-value without an effect size. Do not frame descriptive findings as causal ones. Do not generate every possible cut of the data and surface the interesting-looking ones — that is manufacturing findings, not analyzing data.

## Analysis as a chain of inferential claims

Every step in an analysis is a claim that can be wrong in a specific way. Treat the workflow as a chain — if any link is broken, everything downstream is invalid.

**Question → Hypothesis → Data → Analysis → Interpretation → Communication**

**Question.** State the decision being made and what would change it. "Is churn increasing?" is a question. "Should we invest in retention programs this quarter, and what magnitude of effect would justify the spend?" is a question that can be analyzed. The question determines what counts as a relevant finding.

**Hypothesis.** Before touching the data, form a prior. What do you expect to find, and why? This is not optional — it is the safeguard against HARKing (hypothesizing after results are known). A written hypothesis, stated before analysis, is the minimal defense against post-hoc rationalization. Exploratory analysis is legitimate; it just cannot produce confirmatory p-values.

**Data.** Understand the data-generating process before analyzing it. Who collected this, and how? What was measured, and what was omitted? What are the known limitations — non-response, measurement error, truncation? An agent that skips this step produces results that are precise about the sample and wrong about the population.

**Analysis.** Match the method to the question. Descriptive statistics describe. Inferential statistics generalize. Causal inference requires more than either — it requires an argument about mechanism and a design that supports identification. Most analytical errors happen when agents apply inferential or causal framing to descriptive data. If you do not have an experiment, a regression discontinuity, a difference-in-differences design, or an instrumental variable, you do not have causal identification. Say so.

**Interpretation.** Translate statistics into claims about the world. Effect sizes matter; p-values alone do not. A p-value tells you the probability of observing a result this extreme if the null hypothesis were true — it does not tell you the probability that the null is true, the size of the effect, or whether the effect matters. Confidence intervals carry more information than point estimates. Practical significance and statistical significance are not the same thing.

**Communication.** Findings are not self-interpreting. State the decision-relevant conclusion first, then the evidence. Quantify uncertainty explicitly. Say what you can't conclude from this data, not just what you can.

## What agents get wrong

### HARKing — hypothesizing after results are known
The agent runs a broad analysis, finds a pattern, and presents it as if it were the hypothesis under test. The pattern may be real; the inference is not. Without a pre-stated hypothesis, every significant finding is a post-hoc rationalization dressed as confirmation.

**What good looks like:** The hypothesis is written before the data is examined. If exploration generates a new hypothesis, it is labeled as exploratory. No p-value is reported for an exploratory finding. The garden of forking paths is named, not hidden.

### p-value misinterpretation
"p < 0.05" is reported as proof that the effect exists, or that the null is false, or that there is a 95% chance the finding is real. None of these is correct. p = 0.04 means that if the null hypothesis were true, a result at least this extreme would occur about 4% of the time. It says nothing about the probability that the null is true, the size of the effect, or the replicability of the finding.

**What good looks like:** Report the p-value, the effect size, the confidence interval, and the sample size together. Describe what the p-value does and doesn't tell you. If the decision hinges on whether the effect is real, not just whether it crossed a threshold, say that a single p-value is insufficient.

### Multiple testing without correction
Running enough tests all but guarantees false positives. At alpha = 0.05, about one in twenty tests on pure noise will come back significant. An agent that slices a dataset by 10 dimensions, runs 100 comparisons across them, and surfaces the five that crossed the threshold has found the noise floor.

**What good looks like:** Pre-specify comparisons before running them. Report the total number of tests run. Apply Bonferroni or Benjamini-Hochberg correction for multiple comparisons, or treat exploratory comparisons as hypothesis-generating and report them as such.
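
The two corrections named above can be sketched in a few lines. This is a minimal illustration, not a replacement for a statistics library; the p-values are hypothetical and the function names are chosen here for readability.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose p-value is <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from ten pre-specified comparisons.
p_values = [0.001, 0.008, 0.012, 0.030, 0.041, 0.049, 0.20, 0.35, 0.60, 0.90]

naive = [p <= 0.05 for p in p_values]
print("tests run:         ", len(p_values))
print("naive p <= 0.05:   ", sum(naive))
print("Bonferroni:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for more power. Either way, the count of tests run is reported next to the count that survived.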

### Correlation presented as causation
"Users who log in daily retain at 90%; users who log in weekly retain at 40%. Increase login frequency to improve retention." This may be backwards: users who were already going to stay log in more often, so pushing logins up need not move retention at all. Correlation is symmetric; causation is not. The data alone do not determine the direction of inference; that requires an argument about mechanism.

**What good looks like:** For any observed correlation, ask what mechanism would produce the relationship, whether reverse causation is plausible, and whether you have a design that supports causal inference. If not, state explicitly that the finding is associational and name the alternative explanations.

### Base rate neglect
A model with 95% accuracy sounds reliable. On a condition that affects 1% of the population, a test with 95% sensitivity and 95% specificity produces roughly 5 false positives for every true positive: out of 10,000 people, about 95 true positives against about 495 false positives. The base rate transforms the interpretation entirely. The same failure appears when a 2% click rate is called "high" without reference to channel, audience, and prior.

**What good looks like:** Before reporting any rate, proportion, or probability, establish the base rate. Ask "compared to what?" Normalize the finding against its baseline before drawing conclusions.

### Sample size intuitions
A large effect in a small sample is weak evidence: it may be noise. A tiny effect in a large sample may be statistically significant but practically irrelevant. Agents that report "statistically significant" without effect size and sample size are omitting the load-bearing information.

**What good looks like:** Report N, effect size (Cohen's d, odds ratio, relative risk, R²), and confidence interval alongside any significance test. Then ask whether the effect is large enough to matter for the decision being made, independent of whether it crossed a threshold.

### Spurious precision
Reporting a churn rate as 23.47% when the confidence interval spans 19–28%, or projecting revenue to five significant figures from a model with R² = 0.3, implies precision that the data does not support. Numbers with more decimal places than the uncertainty warrants are not more informative — they are misleading.

**What good looks like:** Round to the precision the data supports. State uncertainty ranges explicitly. If the model's error bars are wider than the decision threshold, say so — the analysis may not be resolving enough to answer the question.
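
One way to let the interval decide the rounding is sketched below. It uses the Wilson score interval for a proportion; the sample (80 churned out of 340 customers) is hypothetical, picked so the interval is about as wide as the one described above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical sample: 80 churned customers out of 340.
churned, customers = 80, 340
low, high = wilson_interval(churned, customers)

# The raw rate carries more digits than the sample can support.
print(f"raw rate: {churned / customers:.4%}")
# Report at the precision the interval justifies.
print(f"report:   about {churned / customers:.0%} (95% CI {low:.0%} to {high:.0%})")
```

When the interval spans several percentage points, digits after the decimal point carry no information, so the reported figure drops them.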

### Confounding
A third variable causes both the independent and dependent variable, creating a spurious association. Hot weather causes both ice cream sales and drowning rates. Ice cream does not cause drowning. This class of error is common in any observational dataset where assignment was not randomized.

**What good looks like:** For any observed correlation, name the likely confounders. Can you control for them? Is the correlation robust when you stratify by likely confounders? If you can't run an experiment, name the identification problem explicitly rather than implying causation.

## What AI is changing

AI is reshaping parts of the analysis workflow but not all of it.

**Automating:** Exploratory data analysis (automated EDA tools surface distributions, missingness, and correlations in seconds), NLP-to-SQL (natural language queries against structured data), anomaly detection (pattern recognition at scale across time series and multi-dimensional data), literature synthesis (summarizing what is known in a domain before scoping an analysis).

**Advancing:** Causal inference research (double machine learning, causal forests, and synthetic control methods that handle high-dimensional observational data), forecasting (foundation models for time series that outperform ARIMA on many benchmarks), automated feature engineering.

**What stays human:** Question formulation — deciding what decision matters and what would change it. Context interpretation — knowing that a 3% drop in this metric in this business at this time is a crisis, while a 3% drop in another metric is noise. Business judgment — weighing the cost of being wrong in each direction. Communicating uncertainty — deciding how to frame confidence intervals and caveats for a specific audience making a specific decision. Ethical review — whether the analysis, if acted on, treats people fairly.

The risk AI introduces: automated EDA finds patterns efficiently, but pattern-finding without hypothesis discipline is p-hacking at scale. The failure modes in this file are not reduced by AI assistance — they are amplified if the human in the loop doesn't know what to check.

# mental models

Frameworks practitioners use to structure analysis and spot inferential errors. Two categories: **structuring models** (how to decompose and frame the work) and **error-detection models** (how to catch the mistakes technically correct code still makes).

---

## Structuring models

### 1. MECE decomposition

**What it is.** Mutually Exclusive, Collectively Exhaustive. A way to break a problem into parts that don't overlap and together cover the whole. The test: can every relevant case be placed in exactly one bucket? If buckets overlap (not ME) or leave cases uncategorized (not CE), the decomposition is leaking.

**Why it matters for analysis.** Flawed decomposition is the upstream cause of double-counting, missing segments, and analyses that don't add up. "Revenue by region" and "revenue by channel" are not MECE with each other — a sale has both a region and a channel. Mixing them in a breakdown produces arithmetic that doesn't reconcile.

**What to check.** Before segmenting data: do the segments overlap? Do they sum to the total? If not, name the overlap explicitly and decide how to handle it before reporting.
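
The overlap and coverage checks can be automated before any breakdown is reported. The sketch below assumes each record has an ID and each segment is a set of IDs; the data and the `mece_report` helper are hypothetical.

```python
from collections import Counter


def mece_report(record_ids, segments):
    """Check that segments partition record_ids: no overlap, nothing left out."""
    counts = Counter(rid for members in segments.values() for rid in members)
    overlapping = sorted(rid for rid, c in counts.items() if c > 1)
    uncovered = sorted(set(record_ids) - set(counts))
    return {
        "mutually_exclusive": not overlapping,
        "collectively_exhaustive": not uncovered,
        "overlapping": overlapping,
        "uncovered": uncovered,
    }


# Hypothetical order IDs and a segmentation that mixes region with channel.
orders = [1, 2, 3, 4, 5, 6]
mixed = {"EMEA": {1, 2, 3}, "Online": {3, 4}, "Americas": {4, 5}}
by_region = {"EMEA": {1, 2, 3}, "Americas": {4, 5}, "APAC": {6}}

print(mece_report(orders, mixed))      # overlaps and a gap
print(mece_report(orders, by_region))  # clean partition
```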

---

### 2. First principles decomposition

**What it is.** Breaking a metric or outcome into its irreducible drivers rather than working with the composite. Instead of analyzing "revenue," decompose to price × volume, then volume to acquisition × retention × average order size. Each leaf is independently influenceable.

**Why it matters for analysis.** Composites hide. A flat revenue line could be acquisition falling and order size rising — opposite problems requiring opposite interventions. Analyzing the composite produces no actionable conclusion.

**What to check.** For any metric you're analyzing: what are its component parts? Can you decompose it into factors that are independently measured and independently actionable? Work at the level where the levers are.
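
For a metric that is a product of drivers, the log of the change splits into one additive term per driver. The sketch below uses hypothetical figures and a two-factor split (orders × average order value) to show a flat composite hiding two opposite movements.

```python
import math

# Hypothetical quarter-over-quarter figures; revenue = orders * average order value.
before = {"orders": 10_000, "avg_order_value": 50.0}
after = {"orders": 8_000, "avg_order_value": 62.5}


def revenue(period):
    return period["orders"] * period["avg_order_value"]


def log_contributions(before, after):
    """Split the log change in revenue into additive per-driver terms."""
    return {k: math.log(after[k] / before[k]) for k in before}


contrib = log_contributions(before, after)
total = math.log(revenue(after) / revenue(before))

print(f"revenue: {revenue(before):,.0f} -> {revenue(after):,.0f}")
for driver, value in contrib.items():
    print(f"{driver:>16}: {value:+.3f} log points")
print(f"{'total':>16}: {total:+.3f} log points")
```

The driver terms sum to the total, so a total near zero with large terms of opposite sign is the signal to analyze the drivers rather than the composite.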

---

### 3. Fermi estimation

**What it is.** Rough quantitative reasoning from first principles to produce an order-of-magnitude estimate before precise data is available. The goal is not accuracy to three decimal places — it is determining whether the answer is in the thousands, millions, or billions, and catching claims that are implausible on their face.

**Why it matters for analysis.** A Fermi estimate before analysis anchors expectations and catches results that are wrong by an order of magnitude. If your model predicts 50,000 daily active users for a product that has 2,000 downloads, the model is wrong before you run a single test.

**What to check.** Before accepting any quantitative finding: does this number make sense given what you know about the scale of the system? Can you reconstruct it from first principles? If the analysis result and the Fermi estimate disagree by more than an order of magnitude, find out why before proceeding.
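
The order-of-magnitude comparison is easy to make explicit. In this sketch the two share parameters are assumptions invented for illustration; only the 2,000 downloads and the 50,000 DAU figure come from the example above.

```python
import math


def orders_of_magnitude_apart(estimate, result):
    """Absolute difference between two positive quantities in powers of ten."""
    return abs(math.log10(result / estimate))


# Hypothetical Fermi estimate of daily active users from assumed inputs.
downloads = 2_000
share_still_installed = 0.5    # assumption
share_active_on_a_day = 0.2    # assumption
fermi_dau = downloads * share_still_installed * share_active_on_a_day

model_dau = 50_000             # the figure under review
gap = orders_of_magnitude_apart(fermi_dau, model_dau)

print(f"Fermi estimate: ~{fermi_dau:,.0f} DAU, model says {model_dau:,} DAU")
if gap > 1:
    print(f"Differ by {gap:.1f} orders of magnitude: investigate before proceeding")
```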

---

### 4. Systems thinking

**What it is.** Modeling a system as stocks (accumulated quantities), flows (rates of change), and feedback loops (where outputs become inputs). A stock is a level; a flow is a rate. Feedback loops are reinforcing (amplify) or balancing (dampen).

**Why it matters for analysis.** Many analytical errors come from treating feedback systems as linear pipelines. "Increase marketing spend → more users" ignores the balancing loop: more users → saturate addressable market → diminishing returns per dollar. "Improve onboarding → better retention" ignores the reinforcing loop: better-retained users refer others → higher-quality new users → even better retention.

**What to check.** For any proposed intervention: what are the feedback loops? What stocks does this affect, and what downstream flows change as a result? Are there time delays between the intervention and the effect? (Time delays in feedback loops produce oscillation — over-correction followed by under-correction.)

---

### 5. The counterfactual frame

**What it is.** For any causal claim, ask: what would have happened if the intervention had not occurred? The causal effect is the difference between the observed outcome and the counterfactual outcome — the latter is always unobservable, which is why causal inference is hard.

**Why it matters for analysis.** Most "impact" claims in business analytics do not answer the counterfactual question. "Users who completed onboarding had 40% higher retention" does not tell you the impact of onboarding — it tells you the correlation between completion and retention. The counterfactual (what retention would have been if those users hadn't completed onboarding) is not in the data.

**What to check.** For any causal claim: what is the counterfactual? How close is the comparison group to the treated group in the absence of treatment? Randomization produces the best counterfactual. Matching, difference-in-differences, and regression discontinuity produce weaker but sometimes defensible ones. No design = no causal claim.
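
As one illustration of building a counterfactual from a comparison group, here is the difference-in-differences arithmetic. The retention rates are hypothetical, and the estimate is only as good as the parallel-trends assumption stated in the docstring.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of a treatment effect.

    Valid only under the parallel-trends assumption: without treatment, the
    treated group would have changed by the same amount as the control group.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical 30-day retention rates (fractions), before and after a launch.
naive_change = 0.46 - 0.40          # before/after in the treated group only
did = diff_in_diff(
    treated_before=0.40, treated_after=0.46,
    control_before=0.38, control_after=0.42,
)

print(f"naive before/after change: {naive_change:+.2f}")
print(f"difference-in-differences: {did:+.2f}")
```

The control group's change stands in for what the treated group would have done anyway; subtracting it removes the part of the naive change that is not attributable to the intervention.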

---

## Error-detection models

### 6. Base rates

**What it is.** Any finding must be interpreted against its baseline frequency. A positive test result is only meaningful relative to how common the condition is in the tested population.

**Concrete example.** A fraud detection model flags 1% of transactions as fraudulent with 99% accuracy. If the true fraud rate is 0.1%, a "99% accurate" model still produces roughly 10 false positives for every true positive. The model is performing well; the interpretation without the base rate is wrong.

**What to check.** Before reporting any rate, proportion, or probability: what is the base rate? What does the finding look like when normalized against it?
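
The arithmetic behind the fraud example can be checked directly with Bayes' rule. This sketch assumes "99% accurate" means 99% sensitivity and 99% specificity, which the example does not state outright.

```python
def false_positives_per_true_positive(prevalence, sensitivity, specificity):
    """Expected false positives for each true positive among flagged cases."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return false_pos / true_pos


def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | flagged), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Assumption: "99% accurate" means 99% sensitivity and 99% specificity.
ratio = false_positives_per_true_positive(0.001, 0.99, 0.99)
ppv = positive_predictive_value(0.001, 0.99, 0.99)

print(f"false positives per true positive: {ratio:.1f}")
print(f"share of flags that are real fraud: {ppv:.1%}")
```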

---

### 7. Simpson's paradox

**What it is.** A trend that appears in aggregated data reverses when the data is disaggregated by a confounding variable. The aggregate is not wrong — it is describing a different population than you think.

**Concrete example.** A university's overall acceptance rate is higher for male applicants. Within every individual department, the acceptance rate is higher for female applicants. The aggregate reversal is driven by women applying disproportionately to the most competitive departments.

**What to check.** When comparing groups, ask: is there a third variable that stratifies the population and is correlated with both the independent and dependent variable? Segment before concluding. If you cannot explain why the aggregate and disaggregated trends agree or diverge, you do not yet understand your data.
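
A small worked case makes the reversal concrete. The counts below are hypothetical, built so that women have the higher rate in each department while men have the higher pooled rate.

```python
# Hypothetical admissions counts: (applicants, admitted) per department and group.
admissions = {
    "Dept A": {"men": (800, 480), "women": (100, 70)},
    "Dept B": {"men": (200, 20), "women": (900, 135)},
}


def rate(applicants, admitted):
    return admitted / applicants


def aggregate_rate(data, group):
    """Pool applicants and admits across departments, then compute one rate."""
    applicants = sum(dept[group][0] for dept in data.values())
    admitted = sum(dept[group][1] for dept in data.values())
    return rate(applicants, admitted)


for name, dept in admissions.items():
    print(f"{name}: men {rate(*dept['men']):.0%}, women {rate(*dept['women']):.0%}")
print(f"Overall: men {aggregate_rate(admissions, 'men'):.1%}, "
      f"women {aggregate_rate(admissions, 'women'):.1%}")
```

The pooled rate is a weighted average of the department rates, and the weights differ by group: most women in this sketch apply to the department with the low admission rate.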

---

### 8. Survivorship bias

**What it is.** Analyzing only the cases that passed through a selection filter, then drawing conclusions about the full population. The filter is invisible because the filtered-out cases are not in the dataset.

**Concrete example.** Studying the characteristics of successful startups to determine what makes a startup succeed. Failed startups, which shared many of the same characteristics, are not in the dataset. The analysis produces a portrait of survivors, not a theory of success.

**What to check.** What cases are missing from this data? Is there a selection mechanism — attrition, failure, non-response, opt-in, graduation — that means the observed sample is not representative of the population you want to draw conclusions about?

---

### 9. Goodhart's Law

**What it is.** When a measure becomes a target, it ceases to be a good measure. Optimizing for a proxy corrupts its relationship to the underlying thing it was proxying.

**Concrete example.** A support team is measured on ticket closure rate. Tickets start closing faster. Re-open rates climb — agents close tickets prematurely to hit the metric. The metric is now measuring ticket-closing behavior, not problem resolution.

**What to check.** Is the metric you're analyzing one that people have been incentivized to optimize? If so, treat it as a measure of metric-optimization behavior, not of the underlying construct. Ask what gaming behavior would look like, and check whether you see evidence of it in the data.

---

### 10. Regression to the mean

**What it is.** Extreme values on a measurement tend to be followed by less extreme values on subsequent measurements, due to random variation — not due to any intervention.

**Concrete example.** A student scores exceptionally poorly on one exam. Their parents hire a tutor. They score closer to their average on the next exam. The tutor gets credit for an improvement that was going to happen anyway.

**What to check.** Was the intervention applied to extreme cases? If so, some portion of any "improvement" is statistical regression, not treatment effect. Separating the two requires a control group. Any before/after analysis without a control group cannot distinguish treatment effect from regression to the mean.
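
A short simulation shows the effect with no intervention at all. The score model (stable ability plus independent noise) and every parameter in it are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical model: observed score = stable ability + independent noise.
n_students = 10_000
ability = [random.gauss(70, 8) for _ in range(n_students)]
exam_1 = [a + random.gauss(0, 8) for a in ability]
exam_2 = [a + random.gauss(0, 8) for a in ability]   # no intervention at all

# Select the bottom 10% on exam 1, as a tutoring programme might.
cutoff = sorted(exam_1)[n_students // 10]
selected = [i for i in range(n_students) if exam_1[i] < cutoff]

mean_1 = statistics.mean(exam_1[i] for i in selected)
mean_2 = statistics.mean(exam_2[i] for i in selected)

print(f"selected students: {len(selected)}")
print(f"exam 1 mean: {mean_1:.1f}")
print(f"exam 2 mean: {mean_2:.1f}  (no treatment was applied)")
```

The selected group moves back toward the population mean on the second exam purely because part of its low first score was noise. A control group drawn from the same extreme cases would show the same shift, which is what lets the treatment effect be separated out.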

---

### 11. Effect size vs. statistical significance

**What it is.** Statistical significance tells you the probability of observing a result this extreme if the null hypothesis were true. It does not tell you the size, direction, or practical importance of the effect. Large samples make tiny effects significant.

**Concrete example.** An e-commerce site tests a new checkout flow on 20 million users, split evenly between control and variant. Conversion rate increases from 3.00% to 3.05%. The result is highly statistically significant (p < 0.001). The effect can still be economically irrelevant: if the revenue impact at current scale does not justify the engineering cost, significance alone should not decide the question.

**What to check.** Always report effect size alongside significance: Cohen's d for means, odds ratio or relative risk for proportions, R² for regression. Then ask whether the effect size is large enough to matter for the decision being made, independent of whether it crossed a threshold.
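
Reporting the test and the effect size side by side might look like the sketch below. It uses a pooled two-proportion z-test with a normal approximation; the counts are hypothetical and the function name is chosen here, not taken from any library.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return p_a, p_b, z, p_value


# Hypothetical A/B test: 10 million users per arm.
p_a, p_b, z, p_value = two_proportion_z_test(300_000, 10_000_000, 305_000, 10_000_000)

print(f"control {p_a:.2%}, variant {p_b:.2%}")
print(f"absolute lift: {(p_b - p_a) * 100:.2f} percentage points")
print(f"relative risk: {p_b / p_a:.3f}")
print(f"z = {z:.2f}, p = {p_value:.1e}, N = {20_000_000:,}")
```

The p-value answers whether the difference is distinguishable from noise; the absolute lift and relative risk answer whether it is worth acting on.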

---

### 12. Selection bias

**What it is.** The sample analyzed is not representative of the population you want to draw conclusions about, because of how the sample was collected or filtered.

**Concrete example.** A company surveys customers who responded to an email to measure NPS. Email responders are disproportionately engaged. The NPS score reflects the engaged segment, not the full customer base — and it may be systematically biased in the direction of positive sentiment, since disengaged customers didn't respond.

**What to check.** How was this data collected? Who is in the sample, and who is not? What is the non-response or dropout rate? Is the selection mechanism correlated with the variable you're measuring? If so, the direction and magnitude of the bias need to be addressed before drawing conclusions.

---

### 13. Ecological fallacy

**What it is.** Drawing conclusions about individuals from aggregate data. Relationships that hold at the group level do not necessarily hold at the individual level — and they can go in opposite directions.

**Concrete example.** Countries with higher average income have lower rates of a particular disease. A researcher concludes that wealthier individuals have lower disease risk. But within countries, poorer individuals may have lower rates of the disease — the aggregate relationship and the individual relationship can run in opposite directions because of how income and disease co-vary with other country-level variables.

**What to check.** Is the unit of analysis the same as the unit of inference? If the data is aggregated (by region, cohort, time period), you can draw conclusions only about that aggregate — not about the individuals within it. State the unit of analysis explicitly.

# sources

Fetch these at task time. Ordered by importance.

1. Andrew Gelman's blog — statistical modeling, causal inference, and the replication crisis. The "garden of forking paths" post is the canonical treatment of researcher degrees of freedom — how analysis choices made after seeing data inflate false positive rates even without deliberate p-hacking:
https://statmodeling.stat.columbia.edu

2. Calling Bullshit — Carl Bergstrom and Jevin West on data reasoning and statistical literacy. Lecture materials and case studies are freely available. Covers base rate neglect, spurious precision, misleading visualizations, and how selection effects corrupt published findings:
https://www.callingbullshit.org

3. R for Data Science (Hadley Wickham) — the tidy data workflow, exploratory analysis discipline, and visualization grammar. Free online. The chapter on EDA is the best short treatment of how to look at data before running any tests:
https://r4ds.hadley.nz

4. The Effect (Nick Huntington-Klein) — causal inference for social scientists and practitioners. Covers identification strategies (difference-in-differences, regression discontinuity, instrumental variables, matching) with plain-language explanations of when each applies. Free online:
https://theeffectbook.net

5. Ben Goldacre — Bad Science and Bad Pharma. The clearest writing on how evidence gets misread, misrepresented, and weaponized in practice. The chapters on publication bias and surrogate endpoints are directly applicable to business analytics:
https://www.badscience.net

PORTDOWN_1104C61A

# ── post ──
MARKER=$(awk '/^---$/ { f++; if (f==2) exit; next } f==1 && /^marker:[[:space:]]/ { sub(/^marker:[[:space:]]+/, ""); print; exit }' "$DEST")
[ -z "$MARKER" ] && { echo "seed: archive has no marker — corrupt" >&2; exit 1; }
awk -v m="$MARKER" -v outdir="$TARGET" '
BEGIN {
# Match  with an optional mode attr after
# the path (fold emits mode="644" on executables).
file_re = "^$"
end_re = "^$"
}
$0 ~ end_re { if (current) close(current); exit }
$0 ~ file_re {
if (current) close(current)
line = $0
sub(/^<!--fold:[^@]+@file path="/, "", line); sub(/".*$/, "", line)
current = outdir "/" line
dir = current; sub(/\/[^\/]*$/, "", dir)
if (dir != current) system("mkdir -p \"" dir "\"")
printf "" > current
next
}
current { print >> current }
' "$DEST"
SEED_EXTRACTED=$(find "$TARGET" -type f -not -path "$DEST" 2>/dev/null | wc -l)
if [ "$SEED_EXTRACTED" = "0" ]; then
echo "seed: archive contained no files — refusing to delete the source" >&2
echo " archive preserved at: $DEST" >&2
exit 1
fi
rm -f "$DEST"

echo "" >&2
echo "✓ seed unpacked → $TARGET ($SEED_EXTRACTED files)" >&2
find "$TARGET" -type f | sort | while IFS= read -r _sf; do
echo " ${_sf#${TARGET}/}" >&2
done
echo "" >&2
if [ -f "$TARGET/SKILL.md" ]; then
echo "This seed contains a skill (SKILL.md). Install it in your agent's skills directory." >&2
echo "" >&2
fi
echo "Install the seed skill if not already installed:" >&2
echo " https://seed.show/skill" >&2
echo "" >&2
echo "Publisher prompt:" >&2
sed 's/^/ /' >&2 <<'__SEED_PROMPT_END_AC1F2B__'
You have data analysis context. README.md covers: what analysis is actually doing (reducing decision uncertainty, not producing descriptions), the analysis workflow as a chain of inferential claims each of which can fail in a specific way, eight named failure modes with exact descriptions of what the mistake is and what good looks like, and what AI is and isn't changing. mental-models.md covers two categories — structuring models (MECE, first principles decomposition, Fermi estimation, systems thinking, the counterfactual frame) and error-detection models (base rates, Simpson's paradox, survivorship bias, Goodhart's law, regression to the mean, effect size vs. significance, selection bias, ecological fallacy). sources.md has five foundational resources to fetch. To share your own context bundles: curl seed.show/skill | bash
__SEED_PROMPT_END_AC1F2B__
exit 0

id: data.analysis.workflow · size: 26.7 KB · created: 2026-05-06 · expires: permanent