Search-engines are EVIL

Posted on August 12th, 2008 by DK1 in How to, Internet, SEO, Webmaster

Search engines: where would we be without them? Well, if you can remember the pre-Google era you’re older than me :p With these, the massive hub that we call the internet becomes navigable and we find what we want with ease.

Though along with the good, this technological must-have brings its own security risks. I focused on affiliate sites before; for this article I would like to extend the range to member sites, pay-for-download sites and the like. Though I will only be focusing on Google for this post, remember that other sites with crawlers, such as alexa.com, technorati.com, Yahoo/MSN etc., can also be security risks to you.

Google bot

Lock? what lock?

One of the main ‘holes’ for most landing pages etc. is that the end pages are not encrypted/hidden. So these can be easily found by searching for ‘thank you’ etc. in search engines/social media sites, or by using the big G and running queries such as:

This shows all indexed pages of a site:

 site:www.google.com

This does a similar job to the above, except that it shows Google’s stored copy of a page, so it can still be used for a period after an apparent hole is plugged:

 cache:www.google.com

This will return all the pages that have Thankyou in the title:

 allintitle:Thankyou

This will search for pages with Thankyou in the title and purchase on the page:

 intitle:Thankyou purchase

This will search for both words in the URL:

 allinurl:Thankyou purchase

This will search for Thankyou in the URL and purchase on the page:

 inurl:Thankyou purchase
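
To check your own site for this kind of exposure, the operators above can be combined with site: and turned into search URLs you open by hand. Here is a minimal Python sketch of that idea. The domain www.example.com, the keyword list and the function name audit_queries are placeholders of mine, not part of any particular tool:

```python
from urllib.parse import quote_plus


def audit_queries(domain, keywords=("thankyou", "download", "purchase")):
    """Return (query, search URL) pairs to run by hand against your own domain."""
    queries = [f"site:{domain}"]  # everything indexed for the domain
    for word in keywords:
        queries.append(f"site:{domain} intitle:{word}")  # keyword in the page title
        queries.append(f"site:{domain} inurl:{word}")    # keyword in the URL
    return [(q, "https://www.google.com/search?q=" + quote_plus(q)) for q in queries]


for query, url in audit_queries("www.example.com"):
    print(query, "->", url)
```

If any of these turn up a page that should only be reachable after payment, that page is one of the holes described above.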

The question is how to protect your site from this. Well, you could use robots.txt or password-protect the URLs.

The problem with this is that robots.txt is a plain file that can be viewed by the whole world, so the URLs you list in it are not really hidden. Also, not all robots *cough* yahoo *cough* tend to follow it strictly.
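
To make that first point concrete, here is a small Python sketch that pulls every Disallow path out of a robots.txt body. The sample file contents are made up for illustration; anyone can fetch the real thing from /robots.txt on a site and do the same:

```python
def disallowed_paths(robots_txt):
    """Collect every path named in a Disallow line of a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths


# Made-up robots.txt for illustration only.
sample = """\
User-agent: *
Disallow: /thankyou/
Disallow: /members/downloads/  # paid content
Disallow:
"""

print(disallowed_paths(sample))  # ['/thankyou/', '/members/downloads/']
```

In other words, the file meant to keep crawlers out doubles as a list of the folders worth looking at.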

The latter method of password protection is just too inconvenient, and it only takes one person to leak the password.

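The easiest method is to put these ‘thankyou’ etc. pages in a separate folder on your hosting and, via cPanel, set the indexing for that folder to ‘no index’. (Note this can also be done via HTML tags when writing the page, i.e. a robots meta tag with the content noindex, though cPanel is faster and broader.) Bear in mind that the cPanel setting hides the folder’s file listing from visitors; it is the noindex tag that tells search engines to leave a page out of their results.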

For the more tech-savvy I would suggest using something like the snippet below:

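<?php
// Hypothetical URL of your payment-verified page; swap in your own.
$verified = 'https://www.example.com/payment-verified.php';
$referer = $_SERVER['HTTP_REFERER'] ?? '';
if ($referer !== $verified) {
    // No referrer, or the wrong one: act as if the page does not exist.
    header('HTTP/1.0 404 Not Found');
    exit;
}
require 'download.php'; // run your download script (hypothetical file name)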

Now this method could be further improved by having the site generate a random download page for each successful/verified purchase so it cannot be transferred, but that is another post altogether.
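
As a rough idea of what a random, per-purchase download page could look like, here is a minimal Python sketch using single-use tokens. The class name DownloadTokens, the item id and the /download/ path are all hypothetical, and the in-memory dictionary stands in for whatever database a real site would use:

```python
import secrets


class DownloadTokens:
    """In-memory store of single-use download tokens (a stand-in for a database)."""

    def __init__(self):
        self._pending = {}  # token -> item id

    def issue(self, item_id):
        """Call after a purchase is verified; returns an unguessable token for the URL."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = item_id
        return token

    def redeem(self, token):
        """Return the item id the first time a token is used, None afterwards or if unknown."""
        return self._pending.pop(token, None)


tokens = DownloadTokens()
t = tokens.issue("ebook-42")
print("/download/" + t)   # the link you would hand to this one buyer
print(tokens.redeem(t))   # first use: ebook-42
print(tokens.redeem(t))   # second use: None
```

Because each link works once and cannot be guessed, it does not matter much if it later ends up indexed or passed around.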

Give me your keys

Now you think you have secured your site by no-indexing the key pages… but another way you may reveal your download/thankyou/member-only pages is by giving the big bad Googlebot access to your member content. Whether it happens by accident or as a plan for better SERPs, it can be easily exploited.

Now, in the old days (pre-Firefox), to gain the power of the great Googlebot and go where he goes, you had to add the following to your registry:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\5.0\User Agent]
@="Googlebot/2.1"
"Compatible"="+http://www.googlebot.com/bot.html"

But now it is so much easier: all you need is this Firefox add-on and this file to import into it. Once done, you may go where only one bot has gone before!
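
Both the registry tweak and the add-on do one thing: they change the User-Agent header the browser sends. That header is whatever the client says it is, which is why it makes a poor key. A minimal Python sketch for testing how your own pages respond to it; the URL is a placeholder and the request is only built here, not sent:

```python
from urllib.request import Request

# A user-agent string in the form Googlebot announces itself with.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def as_googlebot(url):
    """Build (but do not send) a request that announces itself as Googlebot."""
    return Request(url, headers={"User-Agent": GOOGLEBOT_UA})


req = as_googlebot("http://www.example.com/members/")
print(req.get_header("User-agent"))
```

If a member page on your own site serves its full content to a request like this, it is open to anyone who cares to claim to be the bot.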

An easy way to protect against this is… just don’t give it access; make demos/excerpts available to the public instead.

Use the forcesource

One other flaw I have seen in some sites is that they have something like the following in their payment form:

<input name="return" type="hidden" value="http://www.thesite.com/someitemcode/thanks/gibersish/thankyou.html" />

Now I wouldn’t recommend using this method to ‘succeed’ a transaction, since anyone who views the page source can read the return address without paying; but if you must, encrypt it! Or ideally use email verification or a secure transaction method as talked about in my post ‘Buying any PayPal item for $0.01’.

Jush dake da lot

The last item in this already 600-word-plus essay (and this is like the Cliff Notes version!) is using an offline browser to take the lot, i.e. when people just download the whole site. The easiest method to stop this is not to index some of your pages (the key ones :p) so they don’t get what you don’t want them to get.

Other methods include using the power of htaccess, for which I would point you to the user-agent snippet in ‘A few tricks up my sleeves – htaccess style’ and to the 3G Blacklist and assorted tricks over at Perishable Press.

Here endeth another post, or 700+ word essay! Hope it was as useful as the previous posts!

I hope to be back again with another fantastic post; just bug Donace to let me write some more!

4 Comments

  • At 2008.08.13 13:11, Jeff Starr said:

    Excellent post! Where did you find that incredible image of the evil GoogleBot? Some great advice in the article. Browsing the Internet as a bot (via the user-agent switcher) is a great way to learn some behind-the-scenes SEO stuff. Also, please check that 3G link! :)
    Cheers,
    Jeff

    • At 2008.08.13 14:57, Donace said:

      found the pic via google :p

      The article is meant to help you stop stuff like that Jeff not encourage using it :p but yes check out the 3g list and keep an eye out for 4g;)

    • At 2008.08.17 22:40, kipram said:

      Is Search Engine evil like this pic

      • At 2008.08.18 06:50, Donace said:

        Well we don’t really know :p google hides how they actually work :p
