Production · Weekly build-log · Apr 22, 2026 · 3 min read · 838 words

Our Crawler Choked on Its Own Outputs

Networkr Team

Writing at networkr.dev

Heuristic similarity scoring collapses under LLM paraphrasing. We swapped to deterministic graph hashing. Crawl velocity recovered in hours.

What We Ship This Week

The feedback loop starts quietly. Our internal cross-linker generates anchor text for the network. An edge case in the prompt templates suddenly produces paragraphs identical to early training payloads. The crawler treats everything as fresh until a duplicate flag fires. It fires exactly zero times. The engine triples index bloat in seventy-two hours. Crawl velocity flatlines. Every new node queues for processing and gets stuck behind synthetic clones. We fix an architectural mismatch live.

The obvious move points at prompt engineering. Builders blame generation quality when search indexes choke. The bottleneck actually sits in the traversal layer. The graph walker treats near-duplicate paraphrase trees as distinct paths. That design burns through crawl budget and triggers the duplicate-content signals that Google Search Central's Crawling and Indexing documentation spells out clearly. The fix does not happen upstream. It lands at the crawl layer.

Why Heuristic Thresholding Fails

We test similarity scoring first. Token overlap and fuzzy string matching feel like natural entry points. They collapse under LLM generation. The model paraphrases the same semantic structure across multiple syntax trees. A fixed overlap threshold catches lazy rewrites. It misses sophisticated variations. DOM shifts add another layer of noise. Class names rotate. Attribute ordering changes. The heuristic engine flags mismatched fingerprints while the actual content carries identical ranking signals.
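A rough sketch of the kind of token-overlap check we started with. The Jaccard scoring and the threshold value here are illustrative, not the production code; the point is that one fixed number has to cover both lazy rewrites and full paraphrases.

```ts
// Illustrative token-overlap duplicate check. A fixed threshold catches
// near-verbatim copies but scores an LLM paraphrase of the same
// structure as "distinct" content.
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, " ")
      .split(/\s+/)
      .filter(Boolean),
  );
}

function jaccardSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  let intersection = 0;
  for (const token of tokensA) {
    if (tokensB.has(token)) intersection++;
  }
  const union = tokensA.size + tokensB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// The threshold is the problem: 0.8 misses paraphrases, 0.4 drowns the
// queue in false positives on boilerplate-heavy pages.
const DUPLICATE_THRESHOLD = 0.8;
const isDuplicate = (a: string, b: string) =>
  jaccardSimilarity(a, b) >= DUPLICATE_THRESHOLD;
```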

The team spends a morning adjusting threshold sliders. It becomes a dead end quickly. You either accept massive false negatives or drown the queue in false positives. People ask if SEO is being phased out when the tools behave like this. The discipline survives. The architecture simply outgrows string-based matching. Index bloat requires deterministic resolution.

Swapping to Deterministic Graph Hashing

We remove the fuzzy matcher in lib/crawler/similarity.ts on Tuesday. The replacement lives in src/graph/dedup/hash_edge.ts starting at line 182. The function name is canonicalizeSubgraph(). It walks the immediate node neighborhood, strips session tokens, sorts attributes alphabetically, serializes the structure to JSON, and runs SHA-256 over the output. Exact matches drop immediately. Valid variations generate distinct hashes.

Handling circular references requires a visited-set tracker inside the serialization routine. The stack would overflow otherwise. Content-based addressing resolves the ambiguity problem. Content-addressable storage proves the math scales. The compute overhead jumps because we hash every traversal step. We pay the overhead for zero ambiguity. CPU cycles cost less than polluted search indexes. The engine stops guessing about semantic closeness and reads exact byte sequences. The noise floor drops.
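A minimal TypeScript sketch of the canonicalization step. The node types, attribute names, and the exact walk are assumptions for illustration; the real canonicalizeSubgraph() in src/graph/dedup/hash_edge.ts is larger and walks the immediate neighborhood. The shape is the same: strip volatile attributes, sort keys, serialize deterministically with a visited set, hash the bytes.

```ts
import { createHash } from "node:crypto";

// Illustrative graph node shape, not the production types.
interface GraphNode {
  id: string;
  attributes: Record<string, string>;
  neighbors: GraphNode[];
}

// Assumed names for session-scoped attributes that must never reach the hash.
const VOLATILE_ATTRIBUTES = new Set(["session", "sid", "csrf_token"]);

function serializeNode(node: GraphNode, visited: Set<string>): unknown {
  // Visited-set tracker: circular references would otherwise overflow the stack.
  if (visited.has(node.id)) return { ref: node.id };
  visited.add(node.id);

  // Drop volatile attributes, then sort keys alphabetically so attribute
  // ordering can never change the fingerprint.
  const attributes = Object.keys(node.attributes)
    .filter((key) => !VOLATILE_ATTRIBUTES.has(key))
    .sort()
    .map((key) => [key, node.attributes[key]]);

  return {
    attributes,
    neighbors: node.neighbors.map((n) => serializeNode(n, visited)),
  };
}

// Serialize the subgraph to JSON and run SHA-256 over the output.
export function canonicalizeSubgraph(root: GraphNode): string {
  const canonical = JSON.stringify(serializeNode(root, new Set()));
  return createHash("sha256").update(canonical).digest("hex");
}
```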

LLMs stay non-deterministic. Index management requires hard boundaries. The tradeoff sacrifices a handful of legitimate structural variations to keep the queue clean. The crawler middleware filters cache busters and tracking parameters before the hash function touches the payload. That step alone cuts the collision noise in half. The routing table stays lean.
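A sketch of the parameter filter in front of the hash step. The parameter list is an assumption; the intent is that cache busters and tracking junk are gone before the payload ever reaches the canonicalizer.

```ts
// Assumed list of tracking and cache-busting query parameters.
const STRIPPED_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "gclid", "fbclid", "sessionid", "cachebust", "_",
];

export function stripTrackingParams(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const param of STRIPPED_PARAMS) {
    url.searchParams.delete(param);
  }
  // Sort the surviving parameters so ordering never affects the hash.
  url.searchParams.sort();
  return url.toString();
}
```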

The Rollback Window and The Numbers

The routing middleware intercepts requests at src/middleware/request_filter.ts. We route every inbound payload through a normalization layer first. It decodes HTML entities, collapses whitespace, and removes empty DOM nodes. Only then does the request reach the crawler queue. The extra hop adds four milliseconds of latency per request. We absorb the latency because it prevents twenty milliseconds of wasted retry cycles downstream. The tradeoff holds.
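A rough sketch of that normalization hop, assuming jsdom as the parser; the function name and the exact pass order are illustrative. Parsing decodes the HTML entities, empty elements get dropped, and whitespace collapses before the document hits the queue.

```ts
import { JSDOM } from "jsdom"; // assumed dependency for this sketch

export function normalizePayload(html: string): string {
  const dom = new JSDOM(html);
  const { document, NodeFilter } = dom.window;

  // Remove elements with no children and no text (empty DOM nodes).
  // Reverse document order so empty children disappear before their parents
  // are checked.
  for (const el of Array.from(document.querySelectorAll("*")).reverse()) {
    if (el.children.length === 0 && (el.textContent ?? "").trim() === "") {
      el.remove();
    }
  }

  // Collapse runs of whitespace in text nodes. Parsing already decoded
  // HTML entities into plain characters.
  const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
  let textNode = walker.nextNode();
  while (textNode) {
    textNode.textContent = (textNode.textContent ?? "").replace(/\s+/g, " ");
    textNode = walker.nextNode();
  }

  return dom.serialize();
}
```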

We almost reverse the deployment forty minutes in. Dynamic timestamp nodes generate unique hashes on every request. A page with an active clock widget looks completely fresh to the engine. We pause the rollout. The rollback window sits at ten minutes. Instead of killing the feature, we add a structural ignore list. The serializer strips known mutable elements like live counters before computing the fingerprint. The deployment resumes.
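A sketch of that structural ignore list. The selectors are assumptions; the real list targets the mutable widgets we actually see in the wild, and it runs on the parsed document before canonicalization.

```ts
// Assumed selectors for known mutable elements: live counters, clock
// widgets, anything that changes on every request.
const MUTABLE_SELECTORS = [
  "[data-live-counter]",
  ".clock-widget",
  "time[datetime]",
  ".visitor-count",
];

// Strip mutable elements so dynamic timestamps never reach the fingerprint.
export function stripMutableElements(document: Document): void {
  for (const selector of MUTABLE_SELECTORS) {
    for (const el of Array.from(document.querySelectorAll(selector))) {
      el.remove();
    }
  }
}
```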

Crawl velocity recovers fast. Request throughput roughly doubles over a single business day. The duplicate queue shrinks by more than half. Search console reports show the exact drop in rejected nodes. We stop guessing about prompt alignment and trust the graph math. The system runs clean. We monitor hash distribution curves because a single spike warns us about a broken serializer. The pipeline breathes again.

Open Questions and Next Steps

Strict deterministic hashing works until the content actually changes. Localized SEO for LLMs evolves rapidly this year. Regional variations require distinct page graphs even when the core structure stays identical. Our current threshold treats them as unique. We still need to adjust granularity for dynamically injected modules without reintroducing false positives. The plan relies on partitioned hashing by region tag. The middleware now tags payloads with geo-headers before the hash computation.
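A sketch of what partitioned hashing by region tag could look like. The header name and the key format are assumptions; the intent is that two structurally identical pages in different regions land in different hash partitions instead of colliding.

```ts
import { createHash } from "node:crypto";

export function partitionedFingerprint(
  canonicalPayload: string,
  headers: Record<string, string>,
): string {
  // Geo tag applied by the middleware before the hash computation
  // (header name is an assumption).
  const region = headers["x-geo-region"] ?? "global";
  const digest = createHash("sha256").update(canonicalPayload).digest("hex");
  return `${region}:${digest}`;
}
```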

When does strict deterministic graph hashing suppress legitimate content variations that drive ranking signals? We need a safe collision threshold. Too loose and the queue fills. Too tight and regional nuance disappears. The team runs a split test on staging next week.

Run a deterministic JSON serializer on your internal link graph. Hash the output and compare collision rates against your search console duplicate metric over a seven day window. Strip non-deterministic query parameters like UTMs and session tokens from your crawler middleware. Log the delta in crawl requests per second before and after the change. The data exposes where the bloat hides.
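A minimal sketch of that collision-rate check, assuming you already have a deterministic serialization for each node. Feed it the serialized output of your link graph and compare the number against your duplicate metric over the same window.

```ts
import { createHash } from "node:crypto";

// Hash each serialized node and report the fraction of payloads that
// collide with an earlier one.
export function collisionRate(serializedNodes: string[]): number {
  const seen = new Map<string, number>();
  for (const payload of serializedNodes) {
    const digest = createHash("sha256").update(payload).digest("hex");
    seen.set(digest, (seen.get(digest) ?? 0) + 1);
  }
  const collisions = [...seen.values()]
    .filter((count) => count > 1)
    .reduce((sum, count) => sum + count - 1, 0);
  return serializedNodes.length === 0 ? 0 : collisions / serializedNodes.length;
}
```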

Networkr Team -- Writing at networkr.dev

Related

seo-automation · crawler-architecture · deduplication · ai-generated-content · index-bloat