
Dedup SOTA: How We Built an LLM-First Domain Deduplication System
Duplicate accounts in your CRM are not a cleanup problem. They are an architecture problem. Here is how OpenFunnel built an LLM-first deduplication system that handles the edge cases naive matching always misses.

OpenFunnel
Why Naive Deduplication Breaks
String matching on domain names fails fast. Companies use marketing domains that do not resemble their primary domain. They run subdomains that are either the same business or a completely separate one. They have regional variants, shortened URLs, and legacy domains pointing to the same homepage.
A few examples of what breaks a naive matcher:
Marketing domains: getstripe.com vs stripe.com. Different strings. Same company.
Subdomains that are separate businesses: aws.amazon.com is a distinct product. careers.google.com is just a page. String similarity treats these identically.
Regional variants: novonordisk-us.com vs novonordisk.com. A suffix-strip heuristic catches this sometimes. Not always.
Shortened URLs: A CRM entry with a bit.ly link tells you nothing about the underlying domain without resolving it first.
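The mismatch between string similarity and company identity is easy to demonstrate. The sketch below uses Python's standard-library `difflib`; the `stride.com` pairing is purely illustrative, standing in for any unrelated business with a near-identical name.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two domain strings."""
    return SequenceMatcher(None, a, b).ratio()

# Same company, high score -- the heuristic happens to agree here.
print(similarity("getstripe.com", "stripe.com"))   # ~0.87

# Illustrative pair of unrelated businesses -- the score is just as high.
print(similarity("stripe.com", "stride.com"))      # ~0.90
```

A threshold that accepts the first pair accepts the second too; no cutoff separates "same company" from "different company" on string evidence alone.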
Add to that the baseline reality: every customer CRM we have seen is messy. Wrong domains entered manually. The same company logged under three slightly different names. Deduplication logic that was bolted on after the fact and breaks on anything beyond exact match.
The data coming in is not clean. The matching logic has to be smart enough to handle that.
Why We Went LLM-First
The standard approach to deduplication is rules-based: normalize the domain, strip subdomains, compare stems, apply a similarity score threshold. This works for the easy cases. It fails on everything interesting.
The fundamental issue is that domain similarity does not map cleanly to company identity. Two domains can look completely different and belong to the same company. Two domains can look nearly identical and belong to different companies. No string-matching rule handles both correctly.
An LLM can. It understands that getstripe.com and stripe.com are the same business from context. It understands that aws.amazon.com operates as a distinct entity even though it shares a parent domain. It brings world knowledge to the matching decision that no rules-based system can replicate.
So we built a tiered architecture that uses the LLM as the decision layer, with cheaper operations running first to avoid burning compute on cases that do not need it.
The Architecture: Cheapest to Most Expensive
The system runs in tiers. Each tier only escalates to the next if the current one cannot make a confident decision.
Tier 1: Domain resolution and verification
Before anything else, we resolve the domain. Follow all redirects. Expand shortened URLs to their final destination. Confirm the domain actually belongs to the company via a web search cross-check.
This catches the easy errors immediately. A bit.ly link in a CRM field resolves to a real domain in one step. A mistyped domain either resolves correctly or is flagged as invalid. No LLM needed.
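A Tier 1 resolver can be sketched with the standard library alone; the function names are ours, not OpenFunnel's production code, and a real system would add error handling and the web-search cross-check.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

def normalize_netloc(url: str) -> str:
    """Lowercase the host and strip a leading 'www.' so variants compare equal."""
    host = urlparse(url if "://" in url else f"https://{url}").netloc.lower()
    return host[4:] if host.startswith("www.") else host

def resolve_domain(url: str, timeout: float = 10.0) -> str:
    """Follow all redirects (shortened URLs, legacy domains) to the final host."""
    with urlopen(url if "://" in url else f"https://{url}", timeout=timeout) as resp:
        return normalize_netloc(resp.geturl())  # geturl() reflects redirects
```

A bit.ly entry and the real domain it points to now normalize to the same string before any comparison happens.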
Tier 2: Homepage scraping and LLM content understanding
For domains that resolve cleanly but still need identity verification, we scrape the homepage and pass the content to an LLM with a single question: what does this company do, and what is its name?
This builds a content fingerprint for the domain. Not a keyword match. An actual semantic understanding of what the company is.
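The Tier 2 prompt can be as simple as the sketch below. The exact wording is illustrative, not OpenFunnel's production prompt, and the call to an LLM API is left out since any provider works.

```python
def fingerprint_prompt(domain: str, homepage_text: str, max_chars: int = 4000) -> str:
    """Build the Tier 2 prompt: ask the model what the company is and does.
    Homepage text is truncated so the prompt stays within context limits."""
    return (
        f"Homepage content of {domain} (truncated):\n"
        f"{homepage_text[:max_chars]}\n\n"
        "In two sentences: what is this company's name, and what does it do?"
    )
```

The model's answer is stored as the domain's content fingerprint and reused by the later tiers, so each homepage is scraped and summarized once.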
Tier 3: Subdomain classification
Subdomains are their own problem. The same parent domain can host a completely separate business (aws.amazon.com) or a page that is functionally part of the main site (careers.google.com). No heuristic distinguishes these reliably.
We pass the subdomain and its scraped content to the LLM and ask it to classify: is this a distinct business entity, or an extension of the parent domain? The LLM uses site content combined with its pre-trained world knowledge to make the call. For well-known companies it already knows the answer. For lesser-known ones it infers from content.
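One way to keep this classification machine-checkable is to constrain the model to a one-word answer and parse it strictly. The prompt wording and labels below are our own sketch, not the production prompt.

```python
DISTINCT, EXTENSION = "distinct_business", "extension_of_parent"

def subdomain_prompt(subdomain: str, parent: str, content: str) -> str:
    """Ask the model to classify a subdomain with a constrained answer."""
    return (
        f"{subdomain} shares the parent domain {parent}. Based on the page "
        f"content below, answer with exactly one word, DISTINCT or EXTENSION: "
        f"is this a distinct business entity, or an extension of the parent?\n\n"
        f"{content[:3000]}"
    )

def parse_subdomain_answer(reply: str) -> str:
    """Map the model's one-word reply onto the two classes; fail loudly otherwise."""
    word = reply.strip().split()[0].upper().strip(".")
    if word == "DISTINCT":
        return DISTINCT
    if word == "EXTENSION":
        return EXTENSION
    raise ValueError(f"unexpected classification reply: {reply!r}")
```

Failing loudly on anything outside the two labels matters more than it looks: a silently misparsed reply would merge a distinct business into its parent account.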
Tier 4: Fuzzy candidate generation
For detecting duplicates that do not share obvious domain relationships, we generate a set of candidate domains using heuristics: strip regional suffixes, remove common marketing prefixes (get-, try-, use-), compare domain stems. This produces a list of plausible duplicates to check.
Heuristics alone cannot make the final call. They produce candidates. The next tier makes the decision.
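A minimal version of the Tier 4 stemming heuristics, with prefix and suffix lists that are illustrative rather than exhaustive:

```python
MARKETING_PREFIXES = ("get", "try", "use")          # getstripe, trynotion, ...
REGIONAL_SUFFIXES = ("-us", "-uk", "-eu", "-de")    # illustrative, not exhaustive

def domain_stem(domain: str) -> str:
    """Reduce a domain to a comparable stem: drop the TLD, then strip
    marketing prefixes and regional suffixes."""
    label = domain.lower().split(".")[0]
    for p in MARKETING_PREFIXES:
        if label.startswith(p) and len(label) > len(p):
            label = label[len(p):].lstrip("-")
            break
    for s in REGIONAL_SUFFIXES:
        if label.endswith(s):
            label = label[: -len(s)]
            break
    return label

def candidate_pairs(domains):
    """Group domains whose stems collide; every group is a set of candidates."""
    by_stem = {}
    for d in domains:
        by_stem.setdefault(domain_stem(d), []).append(d)
    return [group for group in by_stem.values() if len(group) > 1]
```

Note the false positives this invites (a name that merely starts with "use" gets mangled, for instance). That is acceptable here by design: the heuristic only nominates candidates, and Tier 5 makes the actual match decision.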
Tier 5: Pairwise LLM matching
For each candidate pair, we scrape both sites and pass both content fingerprints to the LLM with a direct question: are these the same company?
The LLM has full context. It sees what each company does, what it calls itself, how its products are described. It returns a match decision with reasoning. We use the reasoning to audit edge cases and improve candidate generation over time.
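A Tier 5 call can return both the decision and the audit trail in one structured reply. The JSON shape and prompt wording below are our sketch of the idea, assuming the model is instructed to reply with valid JSON.

```python
import json
from dataclasses import dataclass

@dataclass
class MatchDecision:
    same_company: bool
    reasoning: str   # kept for auditing edge cases and tuning candidate generation

def pairwise_prompt(fp_a: str, fp_b: str) -> str:
    """Put both content fingerprints in front of the model with one question."""
    return (
        "Company A:\n" + fp_a + "\n\nCompany B:\n" + fp_b + "\n\n"
        "Are these the same company? Reply as JSON: "
        '{"same_company": true/false, "reasoning": "..."}'
    )

def parse_match_reply(reply: str) -> MatchDecision:
    """Parse the model's JSON reply into a typed decision."""
    data = json.loads(reply)
    return MatchDecision(bool(data["same_company"]), str(data["reasoning"]))
```

Keeping the reasoning string alongside the boolean is what makes the audit loop in the text possible: disputed matches can be reviewed later without re-running the model.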
What This Handles That Rules Cannot
The tiered architecture resolves cases that no rules-based system handles reliably:
Marketing domains
getstripe.com resolves and scrapes to Stripe content. The LLM matches it to stripe.com without needing a string similarity rule.
Subsidiary relationships
A domain that belongs to a wholly owned subsidiary gets flagged and linked to the parent. The LLM knows that Instagram is owned by Meta. It knows that Zappos is owned by Amazon. World knowledge fills a gap that scraping alone cannot close.
Regional variants
novonordisk-us.com and novonordisk.com scrape to near-identical content. Tier 5 matches them correctly. The heuristic in Tier 4 also catches the suffix pattern as a candidate, so the system finds it from two directions.
Rebrands and legacy domains
A company that changed its name and domain still serves the same content or redirects cleanly. The resolution and content layers catch this without any hardcoded rule.
What This Means for Your CRM
The output is not a cleaned list you run once and forget. It is a live deduplication layer that runs on every new account that enters the system.
When OpenFunnel ingests a new domain, it runs through the tiers automatically. Duplicates are flagged before they create a second account. Subsidiaries are linked to their parent. Regional variants are consolidated.
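The ingestion flow above can be sketched as a single function that wires the tiers together. The tier implementations are injected as callables, so this skeleton stays testable; the names and the exact short-circuit order are our assumptions, not OpenFunnel's production code.

```python
from typing import Callable, Optional

def dedupe_new_domain(
    raw_url: str,
    resolve: Callable[[str], str],                  # Tier 1: redirects, normalization
    find_existing: Callable[[str], Optional[str]],  # exact hit in the account store
    fingerprint: Callable[[str], str],              # Tier 2: scrape + LLM summary
    candidates: Callable[[str], list],              # Tier 4: heuristic nominees
    same_company: Callable[[str, str], bool],       # Tier 5: pairwise LLM match
) -> Optional[str]:
    """Return the existing account's domain if the new entry is a duplicate,
    else None. Cheap tiers short-circuit before any LLM call is made."""
    domain = resolve(raw_url)
    hit = find_existing(domain)
    if hit is not None:
        return hit                       # exact duplicate, no LLM needed
    fp = fingerprint(domain)
    for other in candidates(domain):
        if same_company(fp, fingerprint(other)):
            return other
    return None
```

The ordering is the whole point of the tiered design: resolution and an exact lookup settle most entries for the cost of one HTTP request, and the LLM only runs on the residue.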
The CRM stays clean without manual cleanup. The matching is accurate because the decision layer understands company identity, not just string patterns.
Frequently Asked Questions
What is domain deduplication in a CRM?
Domain deduplication is the process of identifying and merging account records in a CRM that belong to the same company but were entered as separate records, typically because they use different domain formats, subdomains, or marketing URLs. Naive deduplication uses string matching on domain names. More robust systems use content-based matching and world knowledge to identify company identity across different domain formats.
Why does string matching fail for domain deduplication?
String matching compares domain names as text. It cannot account for marketing domains that look nothing like the primary domain, subdomains that may or may not be separate businesses, regional variants with different suffixes, or shortened URLs that need to be resolved before comparison. Two domains can look nearly identical and belong to different companies, or look completely different and belong to the same one. String similarity scores miss both cases.
What is LLM-first deduplication?
LLM-first deduplication uses a large language model as the core matching decision layer rather than rules or similarity thresholds. The LLM reads the actual content of each domain and uses its pre-trained world knowledge to determine whether two domains belong to the same company. This handles edge cases that no rules-based system can reliably resolve, including marketing domains, subsidiaries, rebrands, and regional variants.
How do you handle subdomains in deduplication?
Subdomains require case-by-case classification. Some subdomains are separate business entities (aws.amazon.com), while others are simply pages within the main site (careers.google.com). We pass the subdomain content to an LLM that classifies whether it operates as a distinct entity or an extension of the parent domain, using both scraped content and world knowledge to make the decision.
Does deduplication need to run continuously or just once?
It needs to run continuously. A one-time cleanup addresses existing duplicates but does nothing about new ones entering the system. Every new account record should run through the deduplication layer on ingestion so that duplicates are caught before they create separate records, not after.
