The last thing I had to do was to clean up the actual posts, by removing HTML tags, web addresses and embedded scripts. With the right tool, this is a straightforward process. For my needs, the right tool was Retrievem, which I developed for just such tasks.
Inside a WordPress Dashboard
Everything is so deceptively tidy in the WordPress admin area. Let’s look at a typical links post.
WordPress Admin Area
When we click the Edit link, we can choose the visual or HTML view:
Inside a WordPress Database
You wouldn’t believe how messy a WordPress post looks! First of all, the image information is stored as a web address. Of course, you can expect to see raw HTML tags and hyperlinks. However, the database converts several special characters into code. For example, each \r\n marks the end of a line. They have nothing to do with the HTML break and paragraph tags; when you see a post in the visual editor, \r\n are the line breaks. Other codes – also known as escaped characters – may be present.
The most troublesome one is the \’ because, to the database, apostrophes are used to mark the start and end of data in each field. The escaped apostrophe tells the database to actually display an apostrophe when showing the post. So, in the example below, proBlogger\’s will show up as proBlogger’s.
<p>This post takes up the <a href="http://www.problogger.net/archives/2010/07/16/take-the-7-link-challenge-today/">#7Link Challenge</a> issued by proBlogger\'s <a href="http://www.problogger.net/about-problogger/">Darren Rowse</a>.</p>\r\n<img src="http://www.morphodesigns.com/mdblog/wp-content/uploads/linklove.jpg" border="2" alt="Blog to ebook" title="Blog to ebook" />\r\n<br><font size="1"><em>Photo by <a href="http://www.flickr.com/photos/whatmegsaid/">whatmegsaid</a></em></font> \r\nThe rules are simple and fun!\r\n<blockquote>\r\n<font size="2" color="green">\r\nThe idea is to publish a post that is a list of 7 links to posts that you and others have written that respond to the following 7 categories. Your links should be to:\r\n<ul>\r\n<li>Your first post</li>\r\n<li>A post you enjoyed writing the most</li>\r\n<li>A post which had a great discussion</li>\r\n<li>A post on someone else’s blog that you wish you’d written</li>\r\n<li>Your most helpful post</li>\r\n<li>A post with a title that you are proud of</li>\r\n<li>A post that you wish more people had read</li>\r\n</ul>\r\n\r\nYou might like to add a few explanations to different links – for example to talk a little about why you enjoyed writing a post or what you like about the post on another blog that you link to . . .\r\n</font>\r\n</blockquote>\r\nVisit Darren\'s post to get more details, if you wish to participate!\r\nSo, here goes:\r\n<h2> My First Post</h2>\r\n<span class="removed_link" title="http://www.morphodesigns.com/archives/2006/hello-world-no-really-hi-there/">Hello World</span>\r\nIf you look at the permalink, I tried to make a point, here. Since this is a links post, I can only refer to the first post on my self-hosted blog. My <strong>real</strong> first post was uploaded to the now defunct WritingUp.com, titled <u>I Wonder If I am an Artist or an Engineer</u>:\r\n<blockquote>\r\n<font size="2" color="green">\r\n
This is just how it appears, a single line of characters! Aren’t you glad WordPress and browsers do such a great job of displaying our posts?
Now that you see what a mess extracted blog is, let me show you how Retrievem cleans it up. The goals are to:
- sanitize, which means to remove hyperlinks and scripts
- scrub, which means to remove the HTML tags
- spit-shine, which means to reformat the text and selectively reintroduce any links desired
Retrievem took care of the first two tasks, so that I was free to focus on the third task. After all, I wanted to add my own style to the books. That’s why I chose not to use Anthologize, a WordPress plugin that helps you convert your blog posts to ebook format.
Earlier (see previous fieldnote ), I had gotten the database fields into an Excel spreadsheet, using regular expressions. I used a technique known as concatenation to merge the fields into a single spreadsheet cell. Look at the row for the links post:
All I had to do was select all of the rows in column F and copy/paste them into a plain .TXT file. I added some markers, such as [posttype], to make my job easier:
The “Before” picture
This is the file Retrievem had to clean up. The image only highlights one of the nearly 300 rows from the WordPress database. (One complication was the presence of attachments, which is why I needed the [posttype] marker. I filtered out attachments from column F before pasting the results into the text file.)
Retrievem has a task called HTML to Text, which removes everything between <script> and </script> tags, including the tags. It also removes hyperlinks, but does not delete anchor text. Losing the links to post images was only slightly annoying, as I could always visit my blog or open up my saved post if I wanted to grab an image.
HTML to Text also takes care of removing all other HTML tags. Some HTML files have DOCTYPE declarations. HTML to Text will remove that, as well. WordPress dynamically generates the DOCTYPE, so that wasn’t present in my text file.
The last thing HTML to Text does is remove blank lines left by deleted tags. The HTML-free file is written to another .TXT file. Here is the cleaned file:
The “After” picture
I pasted the clean text file into Microsoft Word. I globally replaced the \’ code with apostrophes, \r\n with paragraph marks (^p) and \t with tab marks (^t.) Needless to say, the formatting was a mess. However, since my blog posts were formatted to make them easier to scan, I would have had to reformat the selected posts, anyway.
After changing the escaped characters
More problematic was the handling of bulleted lists. As it turned out, formatting was preserved for some lists, but only if the post had been created in the WordPress visual editor. Remember the escaped characters? Well, \t represents the tab character. In the visual editor, tabs are used to indent list items. When stored in the database, a list will look like the following:
Visual editor uses tabs (\t)
In the list post example, you’ll notice that the \t is absent. That’s because the post had been composed in the WordPress HTML editor. They were still easy to spot, in context of the post, since I generally precede lists with a colon and line break after the final word before the list.
I had to carefully compare the live post with the cleaned version, to see what visual effects I wanted to preserve, such as text coloring, quotes, bold and emphasized text.
I selected a small number of posts from the blog. The spreadsheet came in handy for filtering out posts. The preparation was relatively painless, except for the part where I had to build that massive regular expression to parse the WordPress database. The easiest part was searching through the clean copy in Microsoft Word, pasting each selected post into the related ebook document.
Having this level of control may be overkill, but I love the way the ebooks look, so far!