Here's an interesting tip that I'll share with you, just in case it saves you the hours and hours of research I did to come up with this solution!
Having built a number of sites and written a stack-load of PHP code, I still could not find a solution on the web for creating search engine friendly urls. And not just any urls, but search engine friendly urls for user generated content.
That's simple. Search engines love static urls. That is, urls that do not contain parameters being passed to dynamic pages. You probably know the type. An example might be:
The above (made up) link could, for example, be used by a site to display product information for a specific type of Blue Widget. Search engines hate that! A search engine friendly version of that url could be something like:
A few essential rules to stick to when creating search friendly urls:
There are many good articles on the web explaining the details about what urls search engines like and don't like, and many sites dealing with ways to make search engine friendly urls. Check out this search for more info.
Ah, I hear you cry, what's the problem if there's so much information around?
User generated titles. That's the problem.
When you have a website that allows users to contribute content, you don't have any control over what they title their pages. Assuming that you wish to use the title of a page as its url, that can cause real problems. As these types of sites are database driven and the site creates links to the pages automatically, you can end up with some really long urls. What's more, they can contain any typed character in the title. Take, for example, a fictional article called "The grey line between what's right & what's wrong in politics - what's your view?". Potentially, an automatically generated url for this article might look like this:
They're bad. Really bad. And really long.
When I've encountered this on my own sites before, I've usually given up and simply used an ID that represents that entry in the database, so the url looks more like:
That's good. It is short, and contains no bad punctuation or other unfriendly stuff. It's also bad. There are no keywords in it.
So how do you go about addressing this situation?
Ok, let's start by specifying our aims and by setting out what we hope to achieve:
In other words, our code must be able to convert a title to a search engine friendly url so that it doesn't cause any problems for web crawlers/spiders. It must also be able to handle a visitor who arrives at our site via a search engine, and correctly identify which page to show from the url they requested. Let's tackle this in stages.
Below is a function that strips out punctuation and other fairly standard things:
<?php
function string_to_filename($string)
{
    $filename = strtolower($string); // Convert to lowercase
    $filename = preg_replace('/[^a-z0-9\s]/', '', $filename); // Remove all punctuation
    $filename = preg_replace('/\s+/', ' ', $filename); // Collapse multiple spaces (ereg_replace() was removed in PHP 7)
    $filename = str_replace(' ', '-', trim($filename)); // Trim, then convert spaces to hyphens
    return $filename;
}
?>
This on its own leaves our example url looking like this:
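As a quick check, the sketch below runs the same steps on the example title. The function name string_to_filename_demo is made up for this illustration, and preg_replace() is used for the whitespace step because ereg_replace() is not available in PHP 7 and later.

```php
<?php
// Hypothetical stand-in for string_to_filename(), using only preg_replace().
function string_to_filename_demo($string)
{
    $filename = strtolower($string);                          // Convert to lowercase
    $filename = preg_replace('/[^a-z0-9\s]/', '', $filename); // Remove all punctuation
    $filename = preg_replace('/\s+/', ' ', $filename);        // Collapse runs of whitespace
    $filename = str_replace(' ', '-', trim($filename));       // Convert spaces to hyphens
    return $filename;
}

$title = "The grey line between what's right & what's wrong in politics - what's your view?";
echo string_to_filename_demo($title), "\n";
// the-grey-line-between-whats-right-whats-wrong-in-politics-whats-your-view
```

Note that "what's" becomes "whats" because the apostrophe is stripped along with the rest of the punctuation.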
Our aim is to generate a url containing the main words in the title. To keep the url short, we would also like to remove unnecessary words (stopwords) such as "the" or "and". The function below strips out stopwords from any string passed to it:
<?php
function strip_stopwords($string)
{
    $ignore_words = array('and', 'the'); // Define your stopwords
    $ignore_words = array_flip($ignore_words);
    $strarray = explode(' ', $string); // Create an array of words from the string
    $new_text = array(); // Create an array to hold the new string
    foreach($strarray as $key => $word) // Loop through all words in the string
    {
        if(!array_key_exists($word, $ignore_words))
        {
            $new_text[] = $word; // If this word isn't a stopword, add it
        }
    }
    $new_str = implode(' ', $new_text); // Create a new string without stopwords
    return $new_str; // Return the new string
}
?>
Now, our example url looks like this:
Of course, you can be as strict or as flexible as you like with your list of stopwords. It would be possible to create urls as short as these quite easily using the above function:
We now have a url that can easily be read by search engines and users alike. It's short, and contains only the words that provide an insight into what the page is about.
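One way the two steps might be combined is sketched below: lowercase and strip punctuation first, drop stopwords while the text is still split into words, and only then join with hyphens. The function name title_to_short_url and the particular stopword list are assumptions chosen for this example, not part of the code above.

```php
<?php
// Hypothetical helper combining punctuation stripping and stopword removal.
function title_to_short_url($title)
{
    // Assumed stopword list; "whats" is listed without its apostrophe
    // because punctuation is removed before the words are compared.
    $stopwords = array_flip(array('the', 'and', 'between', 'whats', 'in', 'your'));

    $text = strtolower($title);                       // Convert to lowercase
    $text = preg_replace('/[^a-z0-9\s]/', '', $text); // Remove all punctuation
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY); // Split on whitespace

    $kept = array();
    foreach ($words as $word) {
        if (!isset($stopwords[$word])) {
            $kept[] = $word; // Keep only words that are not stopwords
        }
    }
    return implode('-', $kept); // Join the remaining words with hyphens
}

$title = "The grey line between what's right & what's wrong in politics - what's your view?";
echo title_to_short_url($title), "\n";
// grey-line-right-wrong-politics-view
```

Removing stopwords before hyphenating matters here: once the spaces have been replaced, splitting on ' ' would no longer find individual words.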
Right, now we have a site capable of creating pages like www.example.com/grey-line-right-wrong-politics-view.html. So, a search engine spider crawls the page and lists it in its index. Someone clicks through to your page from a search engine. What page do you display? After all, you don't have an article called "grey line right wrong politics view", do you? Here's how I solved that problem:
A dynamic, MySQL database-driven site could contain a simple table for page definitions like this:
--
-- Table structure for table `articles`
--
CREATE TABLE `articles` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) NOT NULL default '',
  `body_text` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;
In order to track which short url corresponds to which article, you need to add a column to your database table which will hold a copy of the short url. This can then be used to cross-reference from one to the other. I'll call this new column 'filename', so our table now looks like this:
--
-- Table structure for table `articles`
--
CREATE TABLE `articles` (
  `id` int(11) NOT NULL auto_increment,
  `title` varchar(255) NOT NULL default '',
  `body_text` text NOT NULL,
  `filename` varchar(255) NOT NULL default '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;
This new column, filename, allows you to query the database with a string like "grey line right wrong politics view" and pull out the correct page entry for the article called "The grey line between what's right & what's wrong in politics - what's your view?".
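The lookup itself might be sketched as follows. Several things here are assumptions: the helper names filename_from_path and find_article are made up, the hyphenated form of the short url is what gets stored in the filename column, and the web server is already configured (via URL rewriting) to hand requests for the .html address to a PHP script. An in-memory SQLite database stands in for MySQL so that the example runs on its own; with MySQL only the PDO connection string would differ.

```php
<?php
// Hypothetical helper: "/grey-line-right-wrong-politics-view.html"
// becomes "grey-line-right-wrong-politics-view".
function filename_from_path($path)
{
    return basename($path, '.html');
}

// Hypothetical helper: fetch the article row whose filename matches, or null.
function find_article(PDO $pdo, $filename)
{
    // A prepared statement keeps user-supplied url text out of the SQL itself.
    $stmt = $pdo->prepare('SELECT id, title, body_text FROM articles WHERE filename = ?');
    $stmt->execute(array($filename));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

// Demo setup (assumption: SQLite in memory instead of the MySQL table above).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body_text TEXT, filename TEXT)');
$insert = $pdo->prepare('INSERT INTO articles (title, body_text, filename) VALUES (?, ?, ?)');
$insert->execute(array(
    "The grey line between what's right & what's wrong in politics - what's your view?",
    'Article body goes here.',
    'grey-line-right-wrong-politics-view',
));

$article = find_article($pdo, filename_from_path('/grey-line-right-wrong-politics-view.html'));
echo $article === null ? 'Not found' : $article['title'], "\n";
```

Returning null for an unknown filename lets the calling page decide how to respond, for example by sending a 404.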
If two entries are created with two different titles, and they both result in the same short url due to the removal of stopwords and punctuation, then the same page will be shown for both entries. For example, if one user creates an entry called "A Fairy Tale" and another user creates "Fairy Tale", then they will both result in the short url of "fairy-tale" if 'a' is in your stopword list (I imagine it might be).
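One possible workaround, which is not part of the code above, is to append a numeric suffix whenever a short url is already taken. In this sketch the function name make_unique_filename is made up, and the list of existing filenames is passed in as a plain array; on a real site it would come from the filename column.

```php
<?php
// Hypothetical helper: return $base, or $base-2, $base-3, ... if already taken.
function make_unique_filename($base, array $taken)
{
    $candidate = $base;
    $suffix = 2;
    while (in_array($candidate, $taken, true)) {
        $candidate = $base . '-' . $suffix; // Try the next numbered variant
        $suffix++;
    }
    return $candidate;
}

$taken = array('fairy-tale'); // Filenames already stored (assumed sample data)
echo make_unique_filename('fairy-tale', $taken), "\n"; // fairy-tale-2
echo make_unique_filename('grey-line', $taken), "\n";  // grey-line
```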
The code provided will not distinguish between the strings "example's" and "examples" since it removes all punctuation.
A stopword list is not a one-size-fits-all solution. Sure, you can have a very successful general stopword list that will do for most situations, but not all. The ability to provide specific stopword lists for specific situations would be a good extension to the code above.
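Such an extension might look something like the sketch below, where the stopword list is chosen by context. The function name strip_stopwords_for, the context names and the lists themselves are all assumptions made up for illustration.

```php
<?php
// Hypothetical extension: pick a stopword list by context, falling back to 'default'.
function strip_stopwords_for($string, $context, array $lists)
{
    $stopwords = isset($lists[$context]) ? $lists[$context] : $lists['default'];
    $words = explode(' ', $string);                  // Split the string into words
    return implode(' ', array_diff($words, $stopwords)); // Drop every word found in the list
}

$lists = array(
    'default' => array('a', 'and', 'the'), // General-purpose list
    'books'   => array('and', 'the'),      // Keeps "a", so "a fairy tale" stays distinct
);

echo strip_stopwords_for('a fairy tale', 'news', $lists), "\n";  // fairy tale
echo strip_stopwords_for('a fairy tale', 'books', $lists), "\n"; // a fairy tale
```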
This is a great article.
The strip_stopwords function will come in handy when I try to improve my Keywords & Density, thank you.
Also, I have always used a while loop to remove multiple spaces, now I can use a regular expression.
Thanks again!
In my application, user "xyz" created his home page just by entering his information in a form. A record is created in a table with id=24 and username="xyz".
Currently I am able to view the home page by using a url
www.example.com?id=24 or www.example.com?q=xyz ///AAA
My objective is to create a friendly url like this:
www.example.com/xyz.html ///BBB
As the page xyz.html does not physically exist, the request ends with a page not found error.
Problem: assuming that the url at ///AAA works fine, how can I get the same result
with the url at ///BBB?
Note: I know you have given a good explanation, but I am not able to understand it. Please make it more clear.
Hi SSJha - You need to use URL Rewriting. If you're using Apache then this page has a good summary of the technique used in this article.
A great place to get help about htaccess and rewrite rules is the WebMasterWorld Forums