

Generating XML Sitemaps for MediaWiki

August 22nd, 2008 by Jeff

Not long ago, I took advantage of a nifty WordPress plugin to enable XML sitemaps for the blog. For those who’ve never heard of XML sitemaps (I hadn’t for quite a while), they are little XML files in a specific format that give search engines like Google hints on how to index your site. They don’t necessarily improve your search rankings per se, but they help the search engine decide what to index, when pages were last updated, the relative priorities of different pages, and so on. You then throw a special line into your robots.txt file or directly submit the file to the search engine to let it know the file is available. Once the engine knows about it, it will check it periodically to optimize how the site is indexed.
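For the robots.txt route, one extra line pointing at the sitemap (or sitemap index) is all it takes, and it can go anywhere in the file; the file name and location below are just an example:

Sitemap: http://www.gpf-comics.com/sitemap_index.xml

Note that the URL has to be the full, absolute address, not a path relative to robots.txt itself.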

The plugin, of course, makes this ridiculously easy for WordPress. However, GPF gets orders of magnitude higher traffic than the blog does, so finding a way to generate sitemaps there would be ideal. I toyed with the idea for a while until I finally sat down, examined the sitemap specification, and figured out how to roll my own code. It now successfully runs via cron each morning and gives a pretty thorough census of what’s available on the GPF server. The problem is that the GPF site is divided into several parts that are largely autonomous and self-contained:

  • The archive, of course, is the main bread and butter.  This is what people come to see, so it is vitally important to get everything represented, especially since the archive is entirely dynamic PHP. I count the main index page as part of the archive, as it displays the latest strip.
  • The forum is self-contained because it uses a third-party application: phpBB. I decided long ago to discourage spamming by blocking search engines from indexing the forum, relying on the forum’s internal search capabilities for users who want to find things. Since search engines were being blocked anyway, I decided the forum didn’t need a sitemap; that would just be a waste of processing time.
  • The wiki, however, is slowly becoming a vital part of the site. It replaces the old cast pages and adds tons more GPF metadata in a convenient, easily searchable form. I wanted to make sure that got included too.  The GPF Wiki runs MediaWiki, the same software that powers Wikipedia.  As far as I know, MediaWiki does not include any internal mechanism for generating XML sitemaps, nor are there any extensions to do so.  (I could easily be wrong, however; I’ll admit my research on this was extremely limited.)
  • Then there’s everything else. The GPF site beyond the two pre-packaged items above is completely custom-built PHP, so there’s no one to blame for that mess but myself.

Ignoring the forum, that left me with three major sub-projects for creating sitemaps. It’s easy enough to segregate these into separate files and tie them together using a “sitemap index” file, so that wasn’t a problem.  The archive would just be a formatted dump of the archive database, deriving approximate update times from the posting dates. The bulk of the rest of the site could be handled by stepping through the site’s file structure and noting every HTML or PHP file and its last modification time (conveniently ignoring certain files and directories that don’t need to be counted, like access-restricted Premium pages). And that leaves the wiki.
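For reference, a sitemap index is just another small XML file that lists the individual sitemap files; the file names below are made up for illustration:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.gpf-comics.com/sitemap_archive.xml</loc>
      <lastmod>2008-08-22T06:00:07Z</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.gpf-comics.com/sitemap_wiki.xml</loc>
      <lastmod>2008-08-22T06:00:07Z</lastmod>
   </sitemap>
</sitemapindex>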

I managed to come up with a decent wiki sitemap routine that I thought I’d share, just in case someone else might be interested. Of course, it’s not likely to be useful for massive wikis like Wikipedia—sitemaps are restricted to 10MB in size and 50,000 URLs—but something small like the GPF Wiki would be easy to submit and index. It was built using MediaWiki 1.12.0; I am uncertain what database changes may be needed for older or newer versions. Here’s my current process:

I only want to index relevant pages, including category pages. The relevant database table for this is “page”. (How… convenient). Unfortunately, this table also contains things like redirects and images. Each image has its own “page” assigned to it; try clicking on an image in Wikipedia or in the GPF Wiki to see what I mean. The time stamp of the latest revision, however, is stored in the “revision” table, joined to the page table by the latest revision ID number. So a good starting bit of SQL would be:

select p.page_title, r.rev_timestamp from page p, revision r where p.page_latest = r.rev_id and p.page_is_redirect = 0 and p.page_title not like '%.gif' and p.page_title not like '%.png' and p.page_title not like '%.jpg';

Unfortunately, this also returns a few meta pages like the sidebar and editing pages. Before selecting, I define a look-up hash of titles I want to avoid, and as I loop through the results I simply skip those.
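In Perl, that works out to something like the sketch below. The database credentials and the contents of the skip list are placeholders, not what I actually use:

use strict;
use warnings;
use DBI;

# Connection details are placeholders; point these at your wiki database.
my $dbh = DBI->connect('DBI:mysql:database=wikidb;host=localhost',
    'wikiuser', 'secret', { RaiseError => 1 });

# Meta pages I never want showing up in the sitemap (example titles only):
my %skip = (
    'Sidebar'      => 1,
    'Common.css'   => 1,
    'Monobook.css' => 1,
);

my $sth = $dbh->prepare(q{
    select p.page_title, r.rev_timestamp
    from page p, revision r
    where p.page_latest = r.rev_id
    and p.page_is_redirect = 0
    and p.page_title not like '%.gif'
    and p.page_title not like '%.png'
    and p.page_title not like '%.jpg'
});
$sth->execute();

while (my ($title, $timestamp) = $sth->fetchrow_array()) {
    next if $skip{$title};
    # ... build the <url> entry for this page (see below) ...
}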

The title, of course, is both the displayed title and the portion of the URL that uniquely identifies the page. Thus, knowing the base URL (http://www.gpf-comics.com/wiki/), I can easily reconstruct the public URL of any article from its title. As with Wikipedia links, spaces have already been converted to underscores, but the rest of the string needs to be URL encoded. This is easy enough, so we can quickly build the full URL as required by the XML schema.
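A rough sketch of that step, using the URI::Escape module from CPAN ($title is the page title pulled back by the query above):

use URI::Escape qw(uri_escape);

my $base_url = 'http://www.gpf-comics.com/wiki/';
# Spaces are already underscores in page_title; uri_escape leaves
# underscores alone and encodes anything else that isn't URL-safe.
my $loc = $base_url . uri_escape($title);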

The time stamp is a little bit tougher. MediaWiki stores time stamps as a 14-digit number in YYYYMMDDHHMMSS format, always in UTC. In Perl (in which almost all my crons are coded) this is easy enough to break apart and turn into a UNIX time stamp; there’s a rough sketch of the conversion just after the sample below. I then output the date in the W3C datetime format (a profile of ISO 8601) required by the schema. A sample of a resulting entry would be:

<url>
   <loc>http://www.gpf-comics.com/wiki/Nick</loc>
   <lastmod>2008-08-22T06:00:07Z</lastmod>
   <changefreq>monthly</changefreq>
   <priority>0.3</priority>
</url>
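The time stamp conversion mentioned above looks roughly like this in Perl, using the standard Time::Local and POSIX modules ($timestamp is the rev_timestamp value from the query; the variable names are mine):

use Time::Local qw(timegm);
use POSIX qw(strftime);

# MediaWiki's 14-digit UTC time stamp, e.g. 20080822060007
my ($yr, $mon, $day, $hr, $min, $sec) =
    $timestamp =~ /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;

# timegm() treats its arguments as UTC, so no time zone fudging is needed.
my $unix_time = timegm($sec, $min, $hr, $day, $mon - 1, $yr);

# Format the value for the <lastmod> element:
my $lastmod = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($unix_time));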

The change frequency and priority in my sitemap are purely guesses and fudges. According to the sitemap specification, priorities are relative to other parts of the same site. I rated the wiki pages relatively low since the wiki is a “supporting” section of GPF, subordinate to things like the archive. As for change frequency, the specification offers a handful of predefined choices (hourly, daily, weekly, monthly, etc.). Monthly was an off-the-cuff guess; some pages update more or less frequently than that, but it seemed like a reasonable average. It is entirely possible to rate select pages higher in priority or frequency than others, but I decided to take the easy route and rate everything the same. To apply different values, you just need to watch for particular titles and assign a non-default value when one of them crops up.
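If I ever change my mind and want to bump a handful of pages above the defaults, a simple look-up keyed on the title would do it; the titles and values below are purely hypothetical:

# Hypothetical per-page overrides; anything not listed gets the defaults.
my %priority_for   = ( 'Main_Page' => '0.8', 'Nick' => '0.5' );
my %changefreq_for = ( 'Main_Page' => 'weekly' );

my $priority   = $priority_for{$title}   || '0.3';
my $changefreq = $changefreq_for{$title} || 'monthly';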

Well, I hope someone out there might find this helpful. I’m not sure if it really helps anyone find anything at GPF, but it was a fun little exercise nonetheless.
