Let’s start by examining what the two pieces of software do:
Lynx: retrieves a web page specified by the user and reformats it for display on a screen. Included in that formatting are various extra bits of information, such as what to do if a user performs a particular action (for example, the title attribute in anchor tags).
Search Engine Crawler: retrieves a web page specified by a software program (often known as url control) and saves it. It also extracts additional urls from the page; later this information is fed through the indexer to generate the actual search index.
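To ground that description, here's a minimal fetch-save-extract sketch in Python. It's my own illustration of the shape of the job, not any engine's actual design; the pattern, the queue and the store are all simplified assumptions.

    import re
    import urllib.request

    def crawl_one(url, queue, store):
        # Fetch the page and save it raw; the indexer processes it later.
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        store[url] = html
        # Extract further urls for "url control" to schedule.
        for link in re.findall(r'href\s*=\s*["\']([^"\']+)', html, re.IGNORECASE):
            queue.append(link)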
These are very different tasks. Whilst Lynx has to actually understand the elements of the page, the search engine crawler does not. Because the crawler is not re-formatting for human viewing there is greater tolerance for error, and it can do its job using simple pattern matching. Let's take extracting urls as an example. Lynx has to actually display the anchor; the crawler does not. So whilst Lynx would have to understand every element of an anchor such as this (a made-up example):
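    <a href="http://www.example.com/page.html" title="An example page" target="_blank">Example</a>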
the crawler merely needs to look for the pattern that represents an anchor (href=" or href='). Two things follow from this:
1. Whilst Lynx must understand that the same thing can be written in a different order or in a different way, to the crawler's simple pattern match that doesn't matter.
2. Following on from point 1, because it is a simple pattern match there is greater tolerance for errors. Consider bad code like this made-up, mis-nested anchor:
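    <!-- hypothetical snippet: the mis-nested closing tags mean it won't validate -->
    <p><a href="http://www.example.com/">Example site</p></a>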
It shouldn't validate, and so the browser has to choose how to deal with it. The crawler is just pattern matching; the code still matches the rules I described earlier, so it's just fine, as the sketch below shows.
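To make both points concrete, here's a small Python sketch (again my own illustration, with an assumed pattern) showing one crude anchor pattern coping with reordered attributes and invalid markup alike:

    import re

    # A crude anchor pattern of the sort a crawler might use: it cares nothing
    # for attribute order, tag nesting or whether the markup validates.
    HREF = re.compile(r'href\s*=\s*["\']([^"\']+)', re.IGNORECASE)

    samples = [
        '<a href="http://www.example.com/">fine</a>',                 # valid markup
        '<a title="odd order" href="http://www.example.com/">x</a>',  # reordered attributes
        '<p><a href="http://www.example.com/">Example site</p></a>',  # won't validate
    ]

    for html in samples:
        print(HREF.findall(html))  # each prints ['http://www.example.com/']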
Incidentally, this is also why crawlers could, if their programmers chose to, easily find links in Javascript or unlinked citations. There's a fundamental difference between interpreting Javascript and being able to find urls in Javascript. Thinking about this in human terms: if you give somebody who doesn't know Javascript a bit of code with a url in it and ask them to tell you what the url is, the chances are they'll spot it.
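The same idea in code, as a rough sketch: a crawler can pull absolute urls out of a script body without ever executing it.

    import re

    # Find absolute urls in script text without running any Javascript.
    URL = re.compile(r'https?://[^\s"\'<>]+')

    script = 'function go() { window.location = "http://www.example.com/page.html"; }'
    print(URL.findall(script))  # ['http://www.example.com/page.html']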
When we get to indexing this retrieved page (which just means creating the database for people to search), it's actually nothing like Lynx either. With indexing we want to break things down into the smallest units possible. So the page is turned into a list of the positions of each word that occurs in the page, along with any special attributes. By special attributes I mean things like bold text, or a font or color that's different from the rest of the page. This really means that we have a very limited subset of html with very few tags, and because it is not actually displaying them the search engine has no need to understand what they mean, merely that they delimit a section of text.
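A toy version of that in Python, assuming for simplicity that the only special attribute we track is bold:

    # Build a positional index: word -> list of (position, is_bold) pairs.
    def index_page(words):
        index = {}
        for position, (word, is_bold) in enumerate(words):
            index.setdefault(word.lower(), []).append((position, is_bold))
        return index

    page = [("Widgets", True), ("for", False), ("sale", False), ("cheap", True), ("widgets", False)]
    print(index_page(page))
    # {'widgets': [(0, True), (4, False)], 'for': [(1, False)], 'sale': [(2, False)], 'cheap': [(3, True)]}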
I can only presume, then, that those who support the view that Lynx shows pages as a crawler would see them do so because they believe that the more simplistic view must be something closer to a crawler's. This again does not hold water. Sure, it shows you a page without images, Javascript, Flash and so on. But that's a very superficial way of looking at things. Take images: what about the filename? That's used in ranking, but it doesn't show in Lynx. All you get, without navigating through its horrible menus, is the alt text; well, I can hover my mouse in IE as well as the rest of them. Javascript? I've already mentioned that search engines could read Javascript if they wanted to. It's there, it gets read and it gets processed, just not run. Flash? Doesn't AlltheWeb index Flash? It sure does. Is this going to be a growing trend? You bet it is. So hang on, which of those simplifications is actually giving you a truer or a better view when you're using Lynx? My answer is none of them.
Many of the people I've spoken to in an effort to understand the Lynx myth have pointed me to the "Google Information for Webmasters" page, which states:
--"Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as Javascript, cookies, session ID's, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site. --"
We've dispensed with many of these elements already, showing why they don't hold water. Let's pick a couple more:
Cookies. Does Google's crawler support cookies? Nope. Does Lynx? It sure does, so why would we want to test our sites with it to check that the cookies are okay for Google?
Session ID's. Does Google's crawler support session ids? Nope. Does Lynx? It sure does, so why would we want to test our sites with it to check that the session ids are okay for Google?
The answer of course is in a little word that many of the people I spoke to forgot to read: "may". This essentially means the whole paragraph could be true, false, or partially true and partially false. The only item in that list that's genuinely true for Google is Flash, and even that's unlikely to stay true for long. And frankly, if you don't know when you're using Flash on your pages you've got problems.
In reality, the average person using Lynx to check their site, in the light of current advice given by many SEOs and by Google themselves, is likely to end up making mistakes and not finding them. I don't argue that there is never a time when it's of benefit; I merely argue that a regular old browser, with hovering the mouse or right clicking, is more often than not less confusing, easier, and has a shallower learning curve. To imply that Lynx is anything like a crawler is to tell newbie Niel that because his site doesn't render or work in Lynx it won't get crawled. That's just plain wrong. It will always get crawled, and the vast majority of the time it will get indexed.
I know that now I've written this there will be those who choose to disbelieve me because of established belief, or because of the perception that the established belief is doing something beneficial for them (i.e. that Lynx helps them). I know this because I've spoken to a few people, and that has been the general reaction. My one and only answer to that is that I've programmed crawlers, I know the differences, and that using Lynx this way shifts your conceptual understanding of crawlers further away from the truth, not closer to it. Maybe you believe you can see something in it that you could not elsewhere, but in all likelihood you are doing yourself more damage than your perceived gain. The benefits you perceive could well exist precisely because you believe that Lynx views things like a crawler, i.e. the logic is circular in nature. Take another look at Lynx and ask yourself: "if this is not a representation of what a crawler sees, then what do I gain from this viewpoint?". In either case I ask you to look at things afresh, not with the eyes of what has been said in the past or the provably bad "may"s of one particular search engine, to make your own reasoned decision and, hopefully, to stop another myth.