
Stop SERP Poison: Shipping a Hard-Fail Pre-Write Gate
Writing at networkr.dev
Raw search dumps break lineage graphs. The engine now runs a cheap LLM evaluator to score research, reject noise, and map a narrative spine before generation. Here is the architecture and the latency tradeoffs.
What we shipped
Last week, the pipeline served a container orchestration guide that spent three paragraphs praising the NYC Mayor’s building tours. The Serper API dumped raw search results directly into the lineage graph. Municipal noise masqueraded as infrastructure documentation. That is exactly how domain authority quietly burns.
This week, we shipped a hard failure before the writer model ever touches a prompt. An editorial gate sits between the research collector and the article generator. Every raw snippet passes through src/gate/score_candidates.ts. The evaluator call returns a numerical relevance score. Anything below 0.5 gets discarded immediately. Surviving items feed into src/gate/build_narrative.ts. That function picks two to four outbound references and maps a story arc with an open question and falsifiable experiments baked in. One extra call. One structural guarantee. The engine now knows what question it must answer before drafting starts.
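The gate's shape can be sketched in a few lines. This is a minimal illustration, not the real src/gate/score_candidates.ts: the evaluator call is stubbed out as a plain function, and the type names are hypothetical. Only the 0.5 cutoff and the discard-below behavior come from the pipeline as described.

```typescript
// Hypothetical sketch of the pre-write gate. The real evaluator is an LLM
// call; here it is abstracted as any function that returns a 0.0-1.0 score.

interface Candidate {
  url: string;
  snippet: string;
}

interface ScoredCandidate extends Candidate {
  relevance: number; // 0.0-1.0, returned by the evaluator
}

const RELEVANCE_CUTOFF = 0.5;

type Evaluator = (topic: string, c: Candidate) => number;

function filterCandidates(
  topic: string,
  candidates: Candidate[],
  evaluate: Evaluator,
  cutoff: number = RELEVANCE_CUTOFF,
): ScoredCandidate[] {
  return candidates
    .map((c) => ({ ...c, relevance: evaluate(topic, c) }))
    .filter((c) => c.relevance >= cutoff); // hard fail: below cutoff is discarded
}
```

Survivors of this filter are what build_narrative.ts sees; everything else never reaches a prompt.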
Why regex died on semantic edge cases
Traditional filtering seemed adequate at first. We built a chain of keyword matchers and negative lookaheads to catch unrelated traffic. It collapsed inside an afternoon. Container orchestration means scheduler architecture and network policies. It does not mean downtown building inspections. Semantic boundaries do not respect exact string matching. The regex chain accepted IPO watchlists and university governance forums. The post-mortem revealed a lineage graph clogged with administrative noise.
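A toy reconstruction shows the failure mode. This is not the production regex chain, just a deliberately simple pattern that demonstrates why shared surface vocabulary defeats string matching:

```typescript
// Keyword matching accepts any text that shares surface vocabulary with the
// topic, including municipal noise that has nothing to do with orchestration.

const looksRelevant = /\b(container|orchestrat\w*|network|schedul\w*)\b/i;

const good = "Tuning the kubernetes scheduler for dense container workloads";
const noise = "The mayor's downtown tour will schedule container-garden inspections";

looksRelevant.test(good);  // matches, correctly
looksRelevant.test(noise); // also matches: same strings, wrong semantics
```

No amount of negative lookahead fixes this, because the disqualifying context ("mayor", "downtown tour") is open-ended and the qualifying words are genuinely present.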
You can find old pre-writing examples online that suggest listing as a valid filtering method. That advice works for manual drafts. It breaks completely at programmatic scale. The system needs context recognition, not phrase matching. We scrapped the string matchers by Thursday night. Adding a language model step was the obvious pivot, but it cost token budget and added roughly two seconds of latency per topic. The engineering team debated the tradeoff openly. Shipping incoherent programmatic content destroys long-term signal faster than a slow API response builds it. We kept the call.
How determining purpose shapes the drafting stage
Academic curricula always reference clustering, freewriting, looping, questioning, and mind mapping as core manual methods. Those are friction points designed to slow a human down so ideas sort themselves. Machines do not need friction. They need gates. Legacy pre-writing technique manuals, the kind still circulating as PDFs, assume the writer will catch drift mid-sentence. The model will not. It will happily amplify municipal zoning data into a Kubernetes tuning guide.
The prewrite stage only works when purpose aligns with architecture. Knowing why a draft exists dictates which references survive the cut. Our revised prewriting process forces the evaluator to declare the central thesis before any paragraph generates. That simple constraint stops hallucinated fluff at the door. It also answers why determining your purpose during pre-writing helps in the drafting stage. The writer model receives a locked outline instead of a raw bag of text. Draft completion speeds up because the generator never guesses at intent. Context windows stay clean. Token waste drops.
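The locked outline the writer model receives can be sketched as a small contract. The field names below are hypothetical; the constraints (a declared thesis, an open question, falsifiable experiments, two to four references) are the ones described above:

```typescript
// Hypothetical shape of the spine produced by src/gate/build_narrative.ts.
// The writer model receives this instead of a raw bag of snippets.

interface NarrativeSpine {
  thesis: string;       // declared before any paragraph generates
  openQuestion: string; // the question the article must answer
  experiments: string[]; // falsifiable checks baked into the arc
  references: string[]; // two to four surviving outbound links
}

function validateSpine(spine: NarrativeSpine): string[] {
  const errors: string[] = [];
  if (spine.thesis.trim().length === 0) errors.push("missing thesis");
  if (spine.openQuestion.trim().length === 0) errors.push("missing open question");
  if (spine.experiments.length === 0) errors.push("no falsifiable experiments");
  if (spine.references.length < 2 || spine.references.length > 4) {
    errors.push("reference count outside 2-4 range");
  }
  return errors; // empty array means the spine is accepted
}
```

A non-empty error list is a hard fail: the topic goes back to research, it never reaches drafting half-formed.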
What broke when the scorer tried to justify noise
Adding an LLM step solved the matching problem. It immediately created a new one. The evaluator started hallucinating justifications for irrelevant sources. We fed it the BuiltIn article about 2026 tech IPOs as a test. The model returned a 0.6 score and invented a convoluted link between venture capital surges and container resource allocation. Prompt drift was real. The function was rewarding verbosity over relevance.
We enforced a rigid JSON schema and rewrote the system instructions at line 142 of score_candidates.ts. Temperature dropped to zero. Output structure locked. The evaluator now rejects sources that cannot articulate a direct technical connection using strict key-value pair formatting. We kept a handful of edge cases where municipal data genuinely intersects public infrastructure mapping, but the pipeline now defaults to silence rather than hallucination. The latency penalty sits at roughly two seconds per topic. API spend increased slightly on the research phase. Draft failures dropped to near zero. The math favors a slow, accurate gate.
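The strict-output contract is easier to see in code. The field names here are illustrative, not the real schema at line 142 of score_candidates.ts; the policy is the one above: any output that fails the shape check is a rejection, never a retry into free prose.

```typescript
// Sketch of a rigid verdict parser: the evaluator must return exactly this
// JSON shape or the source dies. Hypothetical field names.

interface EvaluatorVerdict {
  relevance: number;  // 0.0-1.0
  connection: string; // one direct technical link, stated plainly
}

function parseVerdict(raw: string): EvaluatorVerdict | null {
  let obj: unknown;
  try {
    obj = JSON.parse(raw);
  } catch {
    return null; // malformed output defaults to silence, not hallucination
  }
  if (typeof obj !== "object" || obj === null) return null;
  const v = obj as Record<string, unknown>;
  if (typeof v.relevance !== "number" || v.relevance < 0 || v.relevance > 1) {
    return null;
  }
  if (typeof v.connection !== "string" || v.connection.trim().length === 0) {
    return null;
  }
  return { relevance: v.relevance, connection: v.connection };
}
```

Forcing the model to articulate the connection in a bounded string, at temperature zero, is what killed the verbose justifications: there is no room to reward rambling.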
Cross-referencing and the leaky threshold
A single scalar threshold still slips at the margins. A 0.9 score on a loosely related forum thread made it past the validator last Tuesday. We are building a secondary pass that checks structural alignment. The function will require at least two surviving references to share overlapping technical terminology before approving the spine.
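A rough sketch of that secondary pass, under stated assumptions: the real implementation is still being built, and the term extraction here is deliberately crude, a regex over lowercase tokens rather than anything linguistically informed.

```typescript
// Planned structural-alignment check: at least one pair of surviving
// references must share a minimum number of technical terms.

function techTerms(text: string): Set<string> {
  // Crude stand-in for real term extraction: lowercase tokens of 4+ chars.
  return new Set(text.toLowerCase().match(/[a-z][a-z0-9-]{3,}/g) ?? []);
}

function hasTermOverlap(snippets: string[], minShared = 2): boolean {
  for (let i = 0; i < snippets.length; i++) {
    for (let j = i + 1; j < snippets.length; j++) {
      const a = techTerms(snippets[i]);
      const b = techTerms(snippets[j]);
      let shared = 0;
      for (const t of a) if (b.has(t)) shared++;
      if (shared >= minShared) return true; // one aligned pair approves the spine
    }
  }
  return false;
}
```

A loosely related forum thread can fool a scalar score, but it is much less likely to share concrete vocabulary with an independently approved reference.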
The World Economic Forum recently mapped agentic evaluation patterns to public trust thresholds. We are adapting that exact vocabulary for internal scoring. Automated systems must verify their own outputs before handing them to the next stage. The UCL governance framework provides a structural analogy for our pipeline transitions. State changes in the research collector now require explicit validation checkpoints rather than blind progression. What makes pre-writing effective in a human context is deliberate friction. What makes it effective in our context is enforced schema validation and hard rejection rules. Pre-writing strategies for students rely on peer review. Our engine relies on recursive JSON verification.
What is next
The gate holds for now, but single-pass scoring feels fragile for highly technical niches. Recursive verification loops might be the only way to guarantee accuracy before tokens hit the generator. We are reading heavily from Self-Evaluating LLMs for Fact Verification to model a cheap feedback layer. The architecture exists. The implementation details remain messy. We are wiring a lightweight validation loop that runs after the first draft completion. It will check citations against the approved reference set and flag any drift. That loop will run locally on a seven billion parameter quantized model to keep costs manageable.
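The deterministic half of that loop, the citation check against the approved reference set, can be sketched now even while the 7B verifier wiring stays messy. The function names are hypothetical:

```typescript
// Planned post-draft check: extract URLs from the finished draft and flag
// any citation that is not in the spine's approved reference set.

function extractUrls(draft: string): string[] {
  return draft.match(/https?:\/\/[^\s)\]]+/g) ?? [];
}

function findCitationDrift(draft: string, approved: string[]): string[] {
  const allowed = new Set(approved);
  return extractUrls(draft).filter((url) => !allowed.has(url)); // drifted links
}
```

Anything this returns sends the draft back through the local verifier rather than out the door.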
Two experiments will run before Friday. We will feed a hundred mixed topics through the old regex chain and the new evaluator side by side. Manual reviewers will score false positives against our internal domain standards. We will also A/B test a 0.5 cutoff against a 0.8 threshold. The goal is clear. Measure the API credit savings against the drop in successful draft completions. Mind mapping and classroom brainstorming belong in theory. Engineers building content pipelines need reproducible gates that fail loudly. The spine of every article now starts clean. The rest follows.
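The threshold A/B is simple enough to sketch up front. This is an illustrative harness over already-scored candidates, not the live experiment code, and the sample scores are invented:

```typescript
// Side-by-side survival counts for the planned 0.5 vs 0.8 cutoff test.
// Feed both thresholds the same scored batch and compare what survives.

function survivors(scores: number[], cutoff: number): number {
  return scores.filter((s) => s >= cutoff).length;
}

function compareCutoffs(scores: number[], loose = 0.5, strict = 0.8) {
  return {
    loose: survivors(scores, loose),   // candidates surviving the 0.5 gate
    strict: survivors(scores, strict), // candidates surviving the 0.8 gate
  };
}
```

The delta between the two counts, priced against API credits and successful draft completions, is the number Friday's decision hangs on.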
Networkr Team -- Writing at networkr.dev