Blog to eBook Project

My blog to ebook project is going to be an exercise in parsing and extracting. I briefly considered using Anthologize, a WordPress plugin that many people seem to love. Personally, I want total control over the entire process. So, I’ll begin by extracting those posts, sorting them and picking out the ones that will be added to the ebooks.

The WordPress database tables that store your posts, pages, comments and other data can be exported into a simple text file. In fact, that’s what happens when you perform a backup, using a plugin such as WordPress Database Backup (WPDB).

I am taking full advantage of this. I instructed WPDB to email my backups to my Gmail account. I can save any one of them to my hard drive and unzip it into a folder. I use 7-zip, a free, open-source program that creates and manages archive files. After a bit of parsing and extracting, I end up with a spreadsheet of post titles, dates and actual text.

I’ll explain the parsing and extracting, next time. For now, here is a montage of the action:

Blog to eBook
From Database to Spreadsheet to Word Document

Blog to eBook – Parsing and Extracting

The parsing and extracting portion of my Blog to eBook project was a fun, one-time exercise in reading the WordPress database backup file. Even though the database can be read using a powerful tool like phpMyAdmin, I took advantage of the fact that WordPress can make a plain text file when it creates the backup. Besides, I wasn’t about to tinker around with the original database!

The key to parsing and extracting my blog posts was deciphering the backup file. Every MySQL database can export some or all of its records into a single file. Records, such as details about blog posts and pages, are stored in tables. Each table has a structure, basically a list that describes how to store each detail. The most important things I had to know were the order and type of data used to store a blog post.

This requirement is the main drawback to working directly with a file. If a future version WordPress changes the database structure, my parser would have to be updated, as well. However, this is not an ongoing project, so I’m not worried about being able to re-extract anything once this project is completed.

Read more about RegexBuddy …

I used a program called RegexBuddy to build the pattern-matching instructions. One set of instructions helped me find all posts. WordPress stores posts, pages and attachments in the same database table. Attachments include images, video, spreadsheets and other documents. Since those were probably not going to be in my ebook, I didn’t need them. The second set of instructions helped me find just the attachments.

By running both sets of instructions, I was able to extract the blog posts from the backup file. I pasted the extracted information into an Excel spreadsheet. Next, I compared the list of attachments to the list of everything and deleted the spreadsheet rows that contained attachments. Then, I sorted the records by date and went through each row, cherry-picking the posts that I wanted to include in my ebook. The last thing I had to do was to clean up the actual posts, by removing HTML tags, web addresses and embedded scripts.

I’ll explain the cleanup process next time. Here is a summary, in pictures (I chose Excel rather than RegexBuddy to display the second set of instructions, used to get rid of attachments. Otherwise, it would not mean anything to you unless you understood regular expressions):

Parsing and extracting blog posts from a WordPress backup
From the .sql file to the spreadsheet – parsing and extracting made easy!

Blog to eBook Post Cleanup

The last thing I had to do was to clean up the actual posts, by removing HTML tags, web addresses and embedded scripts. With the right tool, this is a straightforward process. For my needs, the right tool was Retrievem, which I developed for just such tasks.

Inside a WordPress Dashboard

Everything is so deceptively tidy in the WordPress admin area. Let’s look at a typical links post.
WordPress admin area
WordPress Admin Area

When we click the Edit link, we can choose the visual or HTML view:
WordPress visual editor
Visual Editor

WordPress HTML editor
HTML Editor

Inside a WordPress Database

You wouldn’t believe how messy a WordPress post looks! First of all, the image information is stored as a web address. Of course, you can expect to see raw HTML tags and hyperlinks. However, the database converts several special characters into code. For example, each \r\n marks the end of a line. They have nothing to do with the HTML break and paragraph tags; when you see a post in the visual editor, \r\n are the line breaks. Other codes – also known as escaped characters – may be present.

The most troublesome one is the \’ because, to the database, apostrophes are used to mark the start and end of data in each field. The escaped apostrophe tells the database to actually display an apostrophe when showing the post. So, in the example below, proBlogger\’s will show up as proBlogger’s.

<p>This post takes up the <a href="http://www.problogger.net/archives/2010/07/16/take-the-7-link-challenge-today/">#7Link Challenge</a> issued by proBlogger\'s <a href="http://www.problogger.net/about-problogger/">Darren Rowse</a>.</p>\r\n<img src="http://www.morphodesigns.com/mdblog/wp-content/uploads/linklove.jpg" border="2" alt="Blog to ebook" title="Blog to ebook" />\r\n<br><font size="1"><em>Photo by <a href="http://www.flickr.com/photos/whatmegsaid/">whatmegsaid</a></em></font> \r\nThe rules are simple and fun!\r\n<blockquote>\r\n<font size="2" color="green">\r\nThe idea is to publish a post that is a list of 7 links to posts that you and others have written that respond to the following 7 categories. Your links should be to:\r\n<ul>\r\n<li>Your first post</li>\r\n<li>A post you enjoyed writing the most</li>\r\n<li>A post which had a great discussion</li>\r\n<li>A post on someone else’s blog that you wish you’d written</li>\r\n<li>Your most helpful post</li>\r\n<li>A post with a title that you are proud of</li>\r\n<li>A post that you wish more people had read</li>\r\n</ul>\r\n\r\nYou might like to add a few explanations to different links – for example to talk a little about why you enjoyed writing a post or what you like about the post on another blog that you link to . . .\r\n</font>\r\n</blockquote>\r\nVisit Darren\'s post to get more details, if you wish to participate!\r\nSo, here goes:\r\n<h2> My First Post</h2>\r\n<span class="removed_link" title="http://www.morphodesigns.com/archives/2006/hello-world-no-really-hi-there/">Hello World</span>\r\nIf you look at the permalink, I tried to make a point, here. Since this is a links post, I can only refer to the first post on my self-hosted blog. My <strong>real</strong> first post was uploaded to the now defunct WritingUp.com, titled <u>I Wonder If I am an Artist or an Engineer</u>:\r\n<blockquote>\r\n<font size="2" color="green">\r\n


This is just how it appears, a single line of characters! Aren’t you glad WordPress and browsers do such a great job of displaying our posts?

Cleanup Time

Now that you see what a mess extracted blog is, let me show you how Retrievem cleans it up. The goals are to:

  • sanitize, which means to remove hyperlinks and scripts
  • scrub, which means to remove the HTML tags
  • spit-shine, which means to reformat the text and selectively reintroduce any links desired

Retrievem took care of the first two tasks, so that I was free to focus on the third task. After all, I wanted to add my own style to the books. That’s why I chose not to use Anthologize, a WordPress plugin that helps you convert your blog posts to ebook format.

Earlier (see previous fieldnote ), I had gotten the database fields into an Excel spreadsheet, using regular expressions. I used a technique known as concatenation to merge the fields into a single spreadsheet cell. Look at the row for the links post:
Blog to ebook spreadsheet
Spreadsheet magic

All I had to do was select all of the rows in column F and copy/paste them into a plain .TXT file. I added some markers, such as [posttype], to make my job easier:

Blog to ebook raw text
The “Before” picture

This is the file Retrievem had to clean up. The image only highlights one of the nearly 300 rows from the WordPress database. (One complication was the presence of attachments, which is why I needed the [posttype] marker. I filtered out attachments from column F before pasting the results into the text file.)

Retrievem
Retrievem

Sanitize

Retrievem has a task called HTML to Text, which removes everything between <script> and </script> tags, including the tags. It also removes hyperlinks, but does not delete anchor text. Losing the links to post images was only slightly annoying, as I could always visit my blog or open up my saved post if I wanted to grab an image.

Scrub

HTML to Text also takes care of removing all other HTML tags. Some HTML files have DOCTYPE declarations. HTML to Text will remove that, as well. WordPress dynamically generates the DOCTYPE, so that wasn’t present in my text file.

The last thing HTML to Text does is remove blank lines left by deleted tags. The HTML-free file is written to another .TXT file. Here is the cleaned file:

Blog to ebook cleaned text
The “After” picture

Spit-Shine

I pasted the clean text file into Microsoft Word. I globally replaced the \’ code with apostrophes, \r\n with paragraph marks (^p) and \t with tab marks (^t.) Needless to say, the formatting was a mess. However, since my blog posts were formatted to make them easier to scan, I would have had to reformat the selected posts, anyway.

Word document
After changing the escaped characters

More problematic was the handling of bulleted lists. As it turned out, formatting was preserved for some lists, but only if the post had been created in the WordPress visual editor. Remember the escaped characters? Well, \t represents the tab character. In the visual editor, tabs are used to indent list items. When stored in the database, a list will look like the following:
Tabbed lists
Visual editor uses tabs (\t)

In the list post example, you’ll notice that the \t is absent. That’s because the post had been composed in the WordPress HTML editor. They were still easy to spot, in context of the post, since I generally precede lists with a colon and line break after the final word before the list.

I had to carefully compare the live post with the cleaned version, to see what visual effects I wanted to preserve, such as text coloring, quotes, bold and emphasized text.

Summary

I selected a small number of posts from the blog. The spreadsheet came in handy for filtering out posts. The preparation was relatively painless, except for the part where I had to build that massive regular expression to parse the WordPress database. The easiest part was searching through the clean copy in Microsoft Word, pasting each selected post into the related ebook document.

Having this level of control may be overkill, but I love the way the ebooks look, so far!