Not long ago, I took advantage of a nifty WordPress plugin to enable XML sitemaps for the blog. For those who’ve never heard of XML sitemaps (I hadn’t for quite a while), they are little XML files in a specific format that give search engines like Google hints on how to index your site. They don’t necessarily improve your search rankings per se, but they help the search engine better decide what to index, when it was last updated, relative priorities of different pages, etc. You then throw a special line into your robots.txt file or directly submit the file to the search engine to let it know the file is available. Once the engine knows about it, it will check it periodically to optimize how the site is indexed.
The plugin, of course, makes this ridiculously easy for WordPress. However, GPF gets orders of magnitude higher traffic than the blog does, so finding a way to generate sitemaps there would be ideal. I toyed with the idea for a while until I finally sat down, examined the sitemap specification, and figured out how to roll my own code. It now successfully runs via cron each morning and gives a pretty thorough census of what’s available on the GPF server. The problem is that the GPF site is divided into several parts that are largely autonomous and self-contained:
Ignoring the forum, that left me three major sub-projects for creating sitemaps. It’s easy enough to segregate these into separate files and tie them together using a “sitemap index” file, so that wasn’t a problem. The archive would just be a formatted dump of the archive database, deriving approximate update times from the posting date. The bulk of the rest of the site could be done by stepping through the file structure of the site and taking note of every HTML or PHP file and its last modification time (conveniently ignoring certain files and directories that don’t need to be counted, like access-restricted Premium pages). And that leaves the wiki.
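As a rough illustration of the “sitemap index” idea, the sketch below builds an index file pointing at three child sitemaps. It is written in Python rather than the Perl the cron actually uses, and the three file names and dates are hypothetical placeholders, not the real GPF values. The element names and namespace are the ones defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(entries):
    """Build a sitemap index from (sitemap_url, lastmod) pairs."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = loc
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


# Hypothetical child sitemaps: one each for the archive, the main site, and the wiki.
index_xml = build_sitemap_index([
    ("http://www.gpf-comics.com/sitemap_archive.xml", "2008-08-22"),
    ("http://www.gpf-comics.com/sitemap_site.xml", "2008-08-22"),
    ("http://www.gpf-comics.com/sitemap_wiki.xml", "2008-08-22"),
])
print(index_xml)
```

Each child file is then an ordinary sitemap, and only the index needs to be submitted or listed in robots.txt.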
I managed to come up with a decent wiki sitemap routine that I thought I’d share, just in case someone else might be interested. Of course, it’s not likely to be useful for massive wikis like Wikipedia—sitemaps are restricted to 10MB in size and 50,000 URLs—but something small like the GPF Wiki would be easy to submit and index. It was built using MediaWiki 1.12.0; I am uncertain what database changes may be needed for older or newer versions. Here’s my current process:
I only want to index relevant pages, including category pages. The relevant database table for this is “page”. (How… convenient). Unfortunately, this table also contains things like redirects and images. Each image has its own “page” assigned to it; try clicking on an image in Wikipedia or in the GPF Wiki to see what I mean. The time stamp of the latest revision, however, is stored in the “revision” table, joined to the page table by the latest revision ID number. So a good starting bit of SQL would be:
select p.page_title, r.rev_timestamp
from page p, revision r
where p.page_latest = r.rev_id
  and p.page_is_redirect = 0
  and p.page_title not like '%.gif'
  and p.page_title not like '%.png'
  and p.page_title not like '%.jpg';
Unfortunately, this also returns a few meta pages like the sidebar and editing pages. Before selecting, I define a look-up hash of titles I want to avoid and as I loop through the results I just skip those.
The title, of course, is both the displayed title and the input portion of the URL that uniquely identifies the page. Thus, knowing the base URL (http://www.gpf-comics.com/wiki/) I can easily reconstruct the public URL of any article from the title. As with Wikipedia links, spaces have already been converted to underscores, but the rest of the string needs to be URL-encoded. This is easy enough, so we can quickly build the full URL as required by the XML schema.
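A minimal sketch of the skip-list and URL-building steps might look like the following. It is Python rather than Perl, the entries in SKIP_TITLES are hypothetical stand-ins for whatever meta pages you want to exclude, and keeping ":" unescaped (so namespace prefixes such as "Category:" stay readable) is an assumption on my part rather than something the sitemap schema requires.

```python
from urllib.parse import quote

BASE_URL = "http://www.gpf-comics.com/wiki/"

# Hypothetical titles to leave out of the sitemap (meta pages and the like).
SKIP_TITLES = {"Sidebar", "Edithelp"}


def wiki_url(title):
    """Return the public URL for a page title, or None if it should be skipped."""
    if title in SKIP_TITLES:
        return None
    # Spaces are already underscores in page_title; percent-encode the rest.
    return BASE_URL + quote(title, safe=":/")


print(wiki_url("Nick"))
print(wiki_url("Some_Page's_Title"))
print(wiki_url("Sidebar"))
```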
The time stamp is a little bit tougher. MediaWiki stores time stamps as a 14-digit number in YYYYMMDDHHMMSS format, always in UTC time. In Perl (in which almost all my crons are coded) this is easy enough to break apart and turn into a UNIX time stamp. I then output the date in W3C ISO 8601 format as required by the schema. A sample of a resulting entry would be:
<url>
  <loc>http://www.gpf-comics.com/wiki/Nick</loc>
  <lastmod>2008-08-22T06:00:07Z</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.3</priority>
</url>
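For anyone who would rather see the time stamp handling spelled out, here is a small Python sketch (again, the real cron is Perl, and the function names here are made up for illustration). It parses the 14-digit MediaWiki value directly instead of going through a UNIX time stamp, which gives the same result because the stored value is already in UTC.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def mw_timestamp_to_w3c(rev_timestamp):
    """Convert a 14-digit YYYYMMDDHHMMSS UTC value to W3C date-time format."""
    parsed = datetime.strptime(rev_timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def url_entry(loc, rev_timestamp, changefreq="monthly", priority="0.3"):
    """Build one <url> element for the sitemap as a single-line string."""
    return (
        "<url>"
        f"<loc>{escape(loc)}</loc>"
        f"<lastmod>{mw_timestamp_to_w3c(rev_timestamp)}</lastmod>"
        f"<changefreq>{changefreq}</changefreq>"
        f"<priority>{priority}</priority>"
        "</url>"
    )


print(url_entry("http://www.gpf-comics.com/wiki/Nick", "20080822060007"))
```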
Change frequency and priority are pure guesses and fudges in my case. According to the sitemap specification, priorities are purely relative to other parts of the site. I rated the wiki pages as relatively low since the wiki at GPF is considered a “supporting” page and subordinate to things like the archive. As for change frequency, the sitemap specification includes a number of predefined choices (hourly, daily, weekly, monthly, etc.). Monthly was a purely off-the-cuff guess; some pages may update more or less frequently, but monthly would be a good average. It is entirely possible to rate select pages as higher priority or frequency than others, but I decided to take the easy route and rate everything the same. To apply different values, you just need to pay special attention to the title and assign a non-default value when that title crops up.
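If you did want per-page values, one simple way to do it is a look-up table keyed on the title with a fallback to the defaults. The sketch below is Python, and the override for "Main_Page" is a purely hypothetical example, not a value the GPF sitemap actually uses.

```python
# Defaults applied to every wiki page: (changefreq, priority).
DEFAULTS = ("monthly", "0.3")

# Hypothetical per-title overrides; anything not listed gets DEFAULTS.
OVERRIDES = {
    "Main_Page": ("weekly", "0.5"),
}


def freq_and_priority(title):
    """Return the (changefreq, priority) pair to use for a page title."""
    return OVERRIDES.get(title, DEFAULTS)


print(freq_and_priority("Main_Page"))
print(freq_and_priority("Nick"))
```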
Well, I hope someone out there might find this helpful. I’m not sure if it really helps anyone find anything at GPF, but it was a fun little exercise nonetheless.
I hope to post more on this when there’s more data to post, but I thought I’d throw up a quick note stating that the latest episode of the Security Now! “netcast” features a question posed by yours truly. (The best part was listening to Leo Laporte stumble over my long-winded rambling.)
The high-quality version of the show can be found at the previous link; a low-bandwidth version as well as a text-only transcript can be found at the corresponding page at GRC.com. A search in the transcript for “Darlington” will take you to the beginning of my question; in the netcast, it starts around 38 minutes, 22 seconds in. (Of course, I encourage everyone to read/listen to the entire thing.)
For the full effect, though, you’ll also need to listen to/read the previous two non-Q&A episodes of the show, #149 and #151. (Low-bandwidth versions and transcriptions can be found here and here.) The entire dialog concerns the recent trend of ISPs selling out their customers to allow third-party advertisers to come in and install hardware at the ISP to facilitate tracking the ISPs’ customers’ surfing habits across sites. While the ad companies in question claim to not be recording personally identifiable information about the ISPs’ customers, the capability is there and the possibilities for abuse are enormous. It brings back many shades of the DoubleClick controversies of the late 1990s-early 2000s, only much more ominous. I provided a unique standpoint to the discussion: that of a Web developer hosting a site and encountering similar mysterious “first party” cookies set for my domain but not set by me.
The full body of my question is present, but I’m not completely satisfied with the answer.
Let’s just say I think Steve Gibson made an assumption about the GPF site that’s not 100% true. I’ve replied to his response with additional information. I don’t necessarily expect another response (he does, after all, have his own agenda to follow on his show), and even if he does it will likely be in episode #154, the next scheduled Q&A episode. If anyone is interested, I’ll post updates if and when this occurs. If I don’t get a response, I’ll post my response here, especially since it contains some disturbing observations about “first party” cookies that have mildly paranoid folks like me nervous. (I’d hate to see what it does to really paranoid people.)
So ICANN, the organization that oversees the doling out of domain names on the Internet, has approved the relaxation of the rules for top-level domains (TLDs) to allow for arbitrary TLDs for whoever has the money and technical capability to grab it. If things go according to plan, by the middle of next year you may be able to just type into your browser something like http://search.google/ rather than http://www.google.com/, or perhaps you’d rather http://drink.coke/ or http://drive.ford/ or even http://have.crazy.monkey.sex/.
To quote virtually every character in the Star Wars universe, I have a bad feeling about this.
I am so sitting on the fence on this one. My initial gut reaction is this can’t be a good thing. I know far too many non-techies who are confused by Internet addressing as it is, so let’s confuse them some more by adding even more things for them to figure out. J.D. Frazer over at User Friendly hit the nail on the head; anyone who has ever used Usenet is probably rolling their eyes a lot more lately. The potential for cybersquatting and trademark dilution is enormous. ICANN insists that an “objection-based mechanism” will be in place to prevent such things, but how much red tape (and legal dollars) will someone have to go through to protect their brand? Every day that a squatter sits on a domain equates to valuable time, money, and reputation that can be lost, something big corporations may be able to wait out but little guys like me can’t afford. It’s been hard enough right now for me to keep up with all the variants of gpf-comics.something out there. And let’s not get into the discussion of what “offensive” TLDs creative individuals might come up with….
Of course, it’s not like I’m going to be registering .gpf anytime soon anyway. I suppose that’s one thing ICANN did right: to create your own TLD, you’ll need a truckload of money first. The CBC is reporting an estimated $100,000 per TLD (I have no idea if that’s Canadian dollars or not), but ICANN only says for now that “fee information is not yet available”. Ordinary domain names are dirt cheap nowadays, which is a blessing to small-time operators like me but a curse in that squatters with cash to burn can snap up thousands at a time and hold them for ransom. At least starting a new TLD will take capital, making it a serious investment. It will also be quite a technical undertaking; owning a TLD also means you have to build the infrastructure to support it. So if Google were to grab .google with their pocket change, they’d also need to pony up the hardware and bandwidth to maintain the name servers for that TLD. Google may be a bad example (they’ve got servers to spare, I’m sure), but for organizations not used to maintaining that kind of “big iron” it will be a significant learning curve.
But then it occurred to me… how awesome would it be if all your favorite comics or comic-related sites could be found at “something dot comics”?
Imagine if you will that some philanthropic comics creator/reader with a hundred grand in “mad money” under his bed were to snatch up .comics and register that with ICANN. Being philanthropic, this individual would charge a minimal fee to register a domain there, just enough to cover operational costs and maybe make a modest living in the process, aggregated out to anticipated demand (of which I’m sure there’d be plenty). There would be only one additional requirement for application beyond the current standard (ethical) process: the domain must be used for a site publishing, promoting, or discussing comics in some way, shape, or form. Consideration for approval would require proof of content, such as a preview development site, previously published work, portfolios, etc., just enough to prove the site really will be used for something comic-related. Individual titles would be encouraged to register at the root level (dilbert.comics, gpf.comics, x-men.comics) while companies would register their names (dc.comics, marvel.comics, keenspot.comics) and potentially use sub-domains for their own titles (x-men.marvel.comics). Our hypothetical philanthropic registrar would also be fair and balanced so as not to let big conglomerates dominate the little guys. Disputes over domains would come down to traditional copyright and trademark resolutions, requiring proof of prior art, etc.
Wouldn’t that be just grand?
Of course, what will really happen will be that some big company will come along and buy up .comics with far more misanthropic intentions (and we know such an obvious TLD wouldn’t sit dormant for long). They’d either squirrel it away selfishly for promoting their own works and no one else’s, or they’d charge such an exorbitant “premium” price for registrations that only big publishing houses like DC, Marvel, etc. will be able to afford it, shutting out the little independents and webcomics. Even if they price it fairly and keep it open, I’d bet it would get so swamped with squatters that the novelty of the whole TLD would become as diluted as .info is today. Maybe it’s just that I’m pessimistic… or that I’ve been annoyed for so long that some jerk had been holding gpf-comics.org hostage for years… but I just don’t see this turning into as promising a possibility as I think it could be.
Oh, well. I’ve been waiting for gpf.com for nearly a decade now. I guess I can just add gpf.comics to the list. Wishful thinking….
For both of you out there who care, WinHasher has now been bumped to version 1.3. The changes are very minor, so there’s no need to upgrade unless you find the following two new features useful:
I had originally started adding support for HMAC signed hashes but have abandoned that for now. If there’s anyone out there who might actually find that useful, drop me a line and I’ll revisit the code to see what I might be able to add. Downloads can be found at the first link above.
The following is a specification proposal for a new pseudo-random character generator (PRCG), tentatively called the “Tiny Tots PRCG”. This specification is to be considered open and royalty free; everyone is free to implement and extend this specification, although attribution is appreciated. Its usefulness, however, may be limited and may only be of interest to cryptographic and mathematical academics or really bored parents.
System Requirements:
Implementation:
Caveats, Limitations, and Additional Notes:
Just a heads-up to say I’ll be guest hosting Friday’s installment of the Jesus Geek podcast. I apologize in advance for any static or artifacts in the audio; chalk that up to my podcasting inexperience and not as an overall indicator of the quality of Jesus Geek as a whole. I’ll post a direct link to the download page as soon as I see that it goes live.
Update March 21: Aaaand… here it is.
The new GPF site has been running live for half a month now, and I’m proud to say things have been running incredibly smoothly. That is, at least, from my perspective; I haven’t seen any major glitches, and aside from a few typos in the comic (which are obviously independent of the site code), nobody has written me about any problems. This is especially heartening because the new site was pretty much entirely coded by hand by me, sans a few bits and pieces. (I can’t take credit for the OS, the web server software, the database engine, or the forum. But everything else… yep, that was me.)
There were a lot of motivations for writing my own archiving system, but the primary one was efficiency. While I considered trying something off-the-shelf, so to speak, like ComicPress or Drupal, I really wanted something that would be blazingly fast yet still dynamically generated to let me do things like GPF Premium on the server side, primarily for security reasons. (Server-side processing means no messy JavaScript is required by the users, thus exposing them to fewer risks, while Premium content doesn’t even get sent to the browser at all if Premium isn’t enabled.) So the GPF site is optimized out the wazoo, with certain high-volume pages built once by nightly crons while others that require more interactivity reduce database queries to simple selects as much as possible. I’m never one to brag and toot my own horn, but I’m actually pretty proud of the new site and how responsive it is.
Of course, I can’t really take all the credit. I do have to give some serious props to XCache.
For those unfamiliar with PHP, it is one of many server-side, interpreted scripting languages commonly used for dynamic Web site development. The caveat, however, to any interpreted language is that on each request the source script must be read, parsed, compiled, and executed before anything is sent back to the end user’s browser. This is one reason why dynamic sites are and will always be slower than serving purely static HTML files. Static HTML just needs to be read and regurgitated; anything that requires the Web server to actually think takes more time. Add to that the fact that there could be hundreds or even thousands of requests all competing at once for content and it’s a miracle anything gets served at all.
XCache is one of several opcode caching extensions for PHP. Essentially, when the first request for a script is made, the script is parsed and compiled as usual. However, XCache stores the compiled code so subsequent requests can skip the parsing and compilation steps and go directly to executing the code. This significantly increases the speed of execution by eliminating one of the costliest parts of the process (except perhaps database connections). In addition, XCache also includes the ability to cache variables and objects, so commonly repeated and expensive variable generation–such as the cryptographic hashes I use for salting cookie hashes or database look-ups for common elements like the Premium subscription levels–can be stored in the cache rather than rebuilt on each request.
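The variable-caching half of that idea boils down to “build it once, reuse it until it expires.” The sketch below shows that pattern in Python with a plain in-process dictionary. It is not XCache’s API (XCache is a PHP extension that shares its cache across requests); the function name, cache key, and time-to-live value are all made up for illustration.

```python
import time

_cache = {}  # key -> (expiry_time, value)


def cached(key, ttl_seconds, build):
    """Return the cached value for key, calling build() only on a miss or expiry."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]
    value = build()
    _cache[key] = (now + ttl_seconds, value)
    return value


def load_subscription_levels():
    # Stand-in for an expensive database look-up; the data here is invented.
    return {"basic": 1, "plus": 2}


levels = cached("premium_levels", 300, load_subscription_levels)
levels_again = cached("premium_levels", 300, load_subscription_levels)  # served from the cache
print(levels_again)
```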
I was first introduced to XCache by the XCache for WordPress plugin, which was probably mentioned in one of the development feeds built into the WordPress dashboard. I’ve been running this combination here on the blog for a little while with moderate success; I’m still trying to find a good balance of configuration settings to get the best results, but I’ve been happy with the results so far. Without putting much thought into it, I went ahead and installed XCache on the GPF server, hoping that it would help even if I never got a chance to optimize it. Fortunately, it has helped, and now that I’ve optimized the settings it’s exceeded most of my expectations. I’m not sure if there’s something about my code that caches better than WordPress, but GPF has done much better with XCache than the blog has.
Admittedly, I haven’t compared it to any other opcode cachers, nor have I benchmarked it against any of the competition. That said, however, I heartily recommend it to anybody running PHP applications. To get the greatest benefit, you may need to modify some code (or install a plugin if you’re using a prepackaged application) to take advantage of the variable/object caching. But even without modification the opcode caching alone makes for a vast improvement.
Not sure if anyone noticed, but both the blog and the new GPF beta test site were down last night. Our hosting service, Slicehost, informed us that a breaker blew in their data center and they were forced to bring a number of machines down to protect them. In addition, the blog server (which also hosts a number of other private sites I run) stopped responding, so they had to reboot it again.
Unfortunately, while Slicehost was very informative and sent me several e-mails to keep me apprised of the situation, the sites continued to be down until early this morning. That’s when I discovered that for some bizarre reason the MySQL and Apache services were not configured to start at boot time. This is baffling, in my opinion, as I thought this was automatic with Fedora. You install the application package and, if it’s a service like this, it also installs the appropriate links in the init directories to make sure the services start on boot. Not so, apparently. I’m not sure if this is Fedora’s fault, Slicehost’s, or mine, to be honest, but it should be fixed now.
There’s one part of me that thinks this outage is an ominous sign on the eve of my leaving Keenspot. Then again, it also helped me catch a critical flaw that would have been extremely annoying if it happened a week later, after the move when thousands of readers would be hitting the new site. So I don’t know whether to be paranoid or relieved. (O_O)
Anyone interested in the history of webcomics should check out this week’s episode of the This Week in Tech (TWiT) podcast. Especially since it has nothing to do with webcomics.
Here’s my line of reasoning: In this episode, Leo Laporte and his unusual round of suspects are joined by Jonathan Coulton, geek musician extraordinaire. Aside from discussing a few topics of current note (like the death of HD DVD), they discuss a recent concert by Coulton where Leo and company joined him to play Rock Band before a nerd-filled audience. They go on to talk about the “new” Internet phenomenon of niche entertainment targeting–skipping the big, mass-market blitzkrieg typically used by music, TV, and movie studios and canvassing thousands or millions of potential customers, to instead go directly to your core fans, the few dedicated people who are the ones that will really appreciate what you do. Coulton talks of making a living catering to a small handful of hard-core fans and how this is much more fulfilling than the big media alternative, where both the artist and the audience are faceless statistics on the bottom line of a balance sheet. And they discuss this with such freshness and enthusiasm, as if this were the next new thing, some epiphany that no one has yet uncovered.
What I find so funny about it is… those of us in webcomics have already been doing this… for years.
I’ve noticed this a lot over the past near-decade of GPF’s existence. Blogs, podcasts, and other forms of grass-roots media have all cropped up during that time, putting publishing power in the hands of the masses, becoming “innovative” and “groundbreaking” in bringing content production to the people. But a fair number of “new” trends (and problems) associated with these technologies are things I remember seeing crop up among webcartoonists several years before. Long before the term “blog” was coined, I remember chatting with other cartoonists on mailing lists and news groups, swapping ideas about search engine optimization (before that term was coined as well), getting and retaining readers, how to monetize your site, etc. It’s entertaining now to watch many tech headlines to see “fresh” ideas crop up that I’ve personally tried–and abandoned–a couple years before. It’s like the wheel reinventing itself every couple of years, only with different colors and/or materials.
Of course, I would never be so conceited as to believe webcomics “did it first.” Webcomics themselves borrow heavily from the underground comics movement of the 1950s, 60s, and 70s, where small independent publishers ducked under government censors to push out innovative and controversial content directly to the people who wanted it. What changed between then and now is that the interconnectivity of the Internet moved this from basements and back rooms to hidden mailing lists and chat rooms, eventually making its way to the mainstream, all while expanding the sphere of availability from isolated pockets of common interest to global reach. It would also be naive to believe this flow of “innovation” is one-way; RSS and other syndication technologies took off first in the blogosphere, and were only later ret-conned and shoe-horned into webcomic automation systems as a handy update notification system.
Perhaps one of the reasons bloggers and podcasters didn’t learn any lessons from webcartoonists is the difference in skill level–real or perceived, take your pick–required for entry. Cartooning obviously requires some level of artistic talent as cartooning, in all of its myriad forms, is a form of art. It’s often a commercial art, intended more to generate revenue than anything else, but an art nonetheless, conveying ideas and emotions graphically. And while a well-crafted blog certainly requires a talent for writing, that is often easier to come by than the ability to both write and draw. Thus the critical mass of webcartoonists is much smaller than that of bloggers and podcasters, making it less noticeable to the mainstream. That’s also why “break-out” blogs now seem to be a dime a dozen, but it’s still major news when an online comic gets noticed by big media and gets optioned for TV/movie deals. Everyone knows about blogs and maybe even reads a few, but there are other comics on the “intraweb” besides Dilbert?
I’m not sure if there’s anything useful to these observations, other than the fact that they amuse me occasionally and it gives me something to post about. I’m not sure if anyone else has made these kinds of observations or, for that matter, if anybody else cares. But I’ve often wondered if those underground cartoonists of yesteryear thought the same way about us webcartoonists as I have about bloggers. I’d like to think so, just because it creates a nice symmetry. I can’t wait for bloggers to sit around in the old bloggers’ home, thinking such thoughts about whatever comes next. “Those kids with their holocasts… if they had learned the lessons we did about AI search, they’d be raking in the quatloos by now….”
By now, I’m assuming most of you have read Monday’s GPF News item. (If you haven’t, shame on you.) GPF is leaving Keenspot, and I’m neck-deep in unit testing the new site with hopes of releasing it to beta testers soon. If you’re interested in beta testing, you can volunteer in this thread on the old forum.
However, I’ve hit upon one little programming snag, so I thought I’d put out an appeal for help. I thought the blog would be a more appropriate venue for this than the forum; that assumption could be wrong, but I’ll go with it anyway. For those of you with some Web-based programming knowledge, especially in the areas of PHP and cookies, please put on your thinking caps.
As part of the new site, I’m implementing my own version of Keenspot’s PREMIUM service, reusing the old relabeling of GPF Premium. Keenspot PREMIUM is going away (for several reasons I won’t go into here), but as the service’s biggest proponent and largest beneficiary, I’d hate to lose that functionality. So the new site will launch with its own independent Premium functionality including all the old service’s features (optional ad-free surfing, weekly archives, High-Def archives, tons of exclusives like Jeff’s Sketchbook, etc.) plus a few new features that I’ve been wanting to implement but haven’t had the time or technological hoop-jumping expertise to work on at Keen.
For security reasons, I want to secure Premium sign-ups and account management via secure HTTP (HTTPS). The benefits should be obvious. By encrypting account creation & management pages, you eliminate sniffing attacks and protect user privacy. While these pages may still be susceptible to other forms of attacks (and I’ve coded them to be as resilient as I know how), encrypting the traffic end-to-end can go a long way to cutting off those vectors of attack.
However, I seem to have hit a brick wall when it comes to setting the Premium authentication cookie. Like Keenspot’s implementation, the subscriber’s browser will be “enabled” by “branding” it with a cookie, which will be read and authenticated each time the page is loaded. If valid, Premium features for that page will be turned on; if invalid, the page will default to a non-enabled state, which could be as simple as showing all ads or as complex as denying access to the content within. Unlike Keenspot’s implementation, which was JavaScript based, mine is scripted server-side in PHP, meaning it should be more accessible to a wider range of browsers and in theory more secure (no Premium content is sent at all if Premium is not enabled, rather than letting the client browser decide). My implementation has been thoroughly tested and appears to work pretty much flawlessly… with one hitch.
The problem occurs when I set the cookie over the encrypted HTTPS connection, then try to read it over unencrypted HTTP. It appears that none of my test browsers send the cookie back when the encryption state changes. The reverse is the same; if I change the URL and set the cookie over HTTP, then try to access a page via HTTPS, the encrypted page can’t see the cookie either. It works like an either-or situation, when what I really want is both. If I set a cookie over HTTPS, I want to see it in both HTTP and HTTPS mode.
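For readers wondering what “read and authenticated” can mean in practice, one common approach is to sign the cookie value with a keyed hash on the server and re-check the signature on every request. The sketch below shows that general technique in Python. It is not the GPF implementation (which is PHP and isn’t described in detail here); the secret key, the account ID format, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in real use this would be long, random, and private.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"


def _signature(account_id):
    return hmac.new(SECRET_KEY, account_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_cookie(account_id):
    """Return a cookie value of the form 'account_id|signature'."""
    return f"{account_id}|{_signature(account_id)}"


def verify_cookie(cookie_value):
    """Return the account ID if the signature checks out, otherwise None."""
    account_id, sep, mac = cookie_value.rpartition("|")
    if not sep:
        return None
    expected = _signature(account_id)
    # compare_digest avoids leaking how many characters matched.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return account_id
    return None


value = sign_cookie("subscriber42")
print(verify_cookie(value))               # the account ID comes back
print(verify_cookie("subscriber42|bad"))  # a forged value is rejected
```

A page would then turn Premium features on only when verify_cookie() returns an account ID, and fall back to the non-enabled state otherwise.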
PHP’s primary cookie interface is the setcookie() function (for setting) and the $_COOKIE array (for reading). setcookie() includes a boolean parameter for secure cookies, i.e. cookies that will only be sent via HTTPS. What’s annoying is that even when I set this flag to false to force it to be insecure, the scripts continue to exhibit the same behavior: cookies set via HTTP can only be read via HTTP and vice versa. I’ve also tried setting the same cookie both ways–first in one protocol, then the other, without erasing the first cookie–but that didn’t seem to work. The second cookie overwrites the first one, effectively turning it off.
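To make the secure flag concrete, here is a small Python sketch that builds the Set-Cookie header both ways using the standard library. The cookie name and value are hypothetical, and this only shows what the flag looks like on the wire; it is not a diagnosis of the behavior described above. As I understand the cookie rules (RFC 6265), a cookie sent without the Secure attribute should come back over both HTTP and HTTPS, provided the domain and path on the cookie match the page being requested.

```python
from http.cookies import SimpleCookie


def set_cookie_header(name, value, secure):
    """Return the value of a Set-Cookie header, with or without the Secure attribute."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["domain"] = "www.gpf-comics.com"
    cookie[name]["path"] = "/"
    if secure:
        # With Secure set, browsers only return the cookie over HTTPS.
        cookie[name]["secure"] = True
    return cookie[name].OutputString()


# Hypothetical cookie name and value, for illustration only.
print(set_cookie_header("gpf_premium", "abc123", secure=False))
print(set_cookie_header("gpf_premium", "abc123", secure=True))
```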
I had heard that IE 6 exhibited this behavior as a bug. However, I tried the exact same tests in Firefox 2.0.0.11, Opera 9.24, and Safari 3.0.4 (all on Windows) as well as IE 7, and all reacted the same way. Cookies set over HTTP could not be read over HTTPS and vice versa. It’s a bit frustrating. Obviously, I don’t want my Premium folks to be forced to use the new site in encrypted mode all the time, as this would slow down all the pages and put a significant extra load on the server as the number of subscribers increases. But I want to protect my users’ privacy and settings (and one of my important revenue streams) by encrypting their account access.
So I guess I’m looking for answers to two questions:
Any responses via e-mail or (preferred) comments below will be appreciated.
Update March 5, 2008: Thanks to the input of many commenters below, it looks like I’ve got a solution. The problem, as usual, was somewhere between the chair and the keyboard and the faulty component has been sufficiently flogged with a wet noodle. Immense thanks to everyone who provided feedback and suggestions.