Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: Spider
- Labels: None
Description
This data cross-contamination problem occurred because Otto was crawling a domain (www.tree.com) with the "all subdomains" box checked. Some links on the crawled pages point to hosts whose names contain part of the crawled domain name (e.g., www.lendingtree.com). As a result, those links were treated as subdomain links rather than as links to separate domains, and were included in the ADR accordingly. We need to identify subdomain links correctly, even when a link's host name has some textual overlap with the primary domain being crawled.
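A minimal sketch of one possible check, in Python. It assumes the misclassification came from a substring-style or raw suffix comparison of host names, which this ticket does not confirm. The function name `is_subdomain_link` and the `base_domain` parameter are hypothetical and are not taken from the Spider code. The idea is to match only on a label boundary: a host counts as a subdomain when it equals the base domain or ends with a dot followed by the base domain.

```python
from urllib.parse import urlsplit


def is_subdomain_link(link: str, base_domain: str) -> bool:
    """Return True if link's host is base_domain or one of its subdomains.

    Hypothetical helper for illustration only. base_domain is assumed to be
    the crawled domain without a leading "www." (e.g. "tree.com").
    """
    # urlsplit only yields a hostname for links that carry a scheme or
    # start with "//"; anything else is treated as not matching here.
    host = (urlsplit(link).hostname or "").lower().rstrip(".")
    base = base_domain.lower().rstrip(".")
    if not host or not base:
        return False
    # Require an exact match or a match on a label boundary (the leading
    # dot), so "lendingtree.com" is not mistaken for "*.tree.com".
    return host == base or host.endswith("." + base)


if __name__ == "__main__":
    for url in (
        "http://www.tree.com/loans",
        "http://blog.tree.com/",
        "http://www.lendingtree.com/",
    ):
        print(url, is_subdomain_link(url, "tree.com"))
```

With this boundary check, www.lendingtree.com is rejected because the character before "tree.com" in its host name is not a dot. How the base domain is derived from the crawl's starting host (for example, whether "www." is stripped) is left open here and would need to follow the crawler's existing conventions.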