  AdMax / ADMAX-3122

Cloud Spider support for robots.txt wildcard

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Sustaining
    • Component/s: Cloud Spider
    • Labels: None

      Description

      From SEO Team:

      We’ve noticed some blocked URLs showing up in ADRs, even when robots.txt is honored. This appears to be due to the use of a wildcard (*) in the Disallow rule. Example:

      Last week, AvalonCommunities added Disallow: /*launch-guest-card/1/ to their robots.txt file, in order to block URLs such as http://www.avaloncommunities.com/maryland/gaithersburg-apartments/avalon-rothbury/launch-guest-card/1/.

      (Attachment 1)

      Crawling the site afterward, these URLs continue to appear in the ADR:

      (Attachment 2)

      Google is honoring the disallow. In our monitoring report, GWT’s number of URLs restricted by robots.txt increased from 168 to 435 in the week following the robots.txt update.

      (Attachment 3)

      And there was a warning related to one of the URLs specifically being blocked by robots.txt:

      (Attachment 4)

      Bing is showing only reference URLs for these pages now:

      (Attachment 5)

      Not sure if this has come up before, but would it be possible for our spider to support the wildcard character when robots.txt is honored?
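
      For reference, here is a minimal sketch of the matching behaviour being requested, following the wildcard rules Google documents for robots.txt: '*' matches any sequence of characters and a trailing '$' anchors the end of the URL path. The function names below (rule_to_regex, is_blocked) are illustrative only and are not based on the Cloud Spider code:

      import re
      from urllib.parse import urlparse

      def rule_to_regex(rule):
          """Translate a robots.txt Disallow value into a compiled regex.

          '*' matches any sequence of characters; a trailing '$' anchors the
          end of the URL path. Everything else is matched literally, as a
          prefix.
          """
          anchored = rule.endswith("$")
          if anchored:
              rule = rule[:-1]
          pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in rule)
          return re.compile("^" + pattern + ("$" if anchored else ""))

      def is_blocked(url, disallow_rules):
          """True if the URL's path (plus query) matches any Disallow rule."""
          parsed = urlparse(url)
          path = parsed.path or "/"
          if parsed.query:
              path += "?" + parsed.query
          return any(rule_to_regex(r).match(path) for r in disallow_rules if r)

      # The Avalon Communities example from above:
      rules = ["/*launch-guest-card/1/"]
      print(is_blocked("http://www.avaloncommunities.com/maryland/"
                       "gaithersburg-apartments/avalon-rothbury/"
                       "launch-guest-card/1/", rules))  # prints True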

      ----------
      Avalon Communities is the only client I work with that is currently using wildcards in their robots.txt. The first instance was added to the file in March, and every ADR generated since then has included blocked URLs.

      Pro: Adding this feature would mean the ADRs present a more accurate picture of what search engines actually see when they crawl a site. Currently the ADR may flag issues, such as duplicate content and other on-page factors, on pages that search engine spiders never see, which makes those findings largely irrelevant. To gather only the relevant data, whether for presenting to clients or for our own monitoring/auditing, we currently have to check the robots.txt file manually and filter the data accordingly (a rough sketch of that workaround follows the con below).

      Con: I don’t see any cons – the option to not honor robots.txt is available if there is a need to crawl these pages.
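
      Until the spider supports this, the manual workaround mentioned above looks roughly like the following: fetch the live robots.txt, collect the Disallow rules for the relevant user-agent group, and drop any matching URLs from the exported ADR list before reporting. This reuses is_blocked from the sketch above; the file name adr_urls.txt is only a placeholder for whatever export the ADR provides:

      import urllib.request

      def disallow_rules(robots_url, agent="*"):
          """Collect Disallow paths from the robots.txt group matching `agent`."""
          rules, applies, in_group_header = [], False, False
          with urllib.request.urlopen(robots_url) as resp:
              for raw in resp.read().decode("utf-8", "replace").splitlines():
                  line = raw.split("#", 1)[0].strip()
                  if ":" not in line:
                      continue
                  field, value = (part.strip() for part in line.split(":", 1))
                  field = field.lower()
                  if field == "user-agent":
                      if not in_group_header:      # a new group starts here
                          applies = False
                      in_group_header = True
                      applies = applies or value == agent
                  else:
                      in_group_header = False
                      if field == "disallow" and applies and value:
                          rules.append(value)
          return rules

      # Keep only the URLs search engines are actually allowed to crawl;
      # is_blocked comes from the sketch earlier in this ticket.
      rules = disallow_rules("http://www.avaloncommunities.com/robots.txt")
      with open("adr_urls.txt") as f:
          crawlable = [u.strip() for u in f
                       if u.strip() and not is_blocked(u.strip(), rules)]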

        Attachments

        1. attachment1.JPG
          15 kB
          Jeff Shih
        2. attachment2.JPG
          39 kB
          Jeff Shih
        3. attachment3.JPG
          59 kB
          Jeff Shih
        4. attachment4.JPG
          37 kB
          Jeff Shih
        5. attachment5.JPG
          40 kB
          Jeff Shih

            People

            • Assignee: Unassigned
            • Reporter: jshih Jeff Shih (Inactive)
            • Votes: 0
            • Watchers: 1
