Extracting Your Blog

Blogs and eBooks are two completely different beasts that just happen to live in the same digital jungle. One of the things exposed by converting between the two formats is that blog content, in its browser layout form, looks horrible in a PDF.

By design, most blog content is bite-sized, both visually and conceptually. While eBooks can emulate that, they also have the freedom to be dense tomes. Conventional wisdom claims that blogs generally cannot keep visitors’ attention with dense layout. (Look at this blog, for example. I’ve attempted to buck the norm with a denser layout and fewer visual breaks.)

[Quote image: Lorelle VanFossen, from Lorelle on WordPress]

Therefore, if you want total control over the eBook creation process, you are going to have to get comfortable with the idea of editing and cleaning things up, as Lorelle advised. If you have not read Preparing Your Blog to eBook Categories, do yourself a favor and check it out.

Let’s get your blog content out in raw format, with no restrictions on the layout. There is no reason to compromise your vision with inflexible software. With the right mix of general-purpose software and specialized tools, you can automate the drudgery, yet ably manage the task of converting your content to an eBook that you’ll love.

The general-purpose software includes a word processor and a spreadsheet. The specialized tool I will use is my own Windows desktop application called Retrievem. It has a built-in task, unimaginatively titled blog2ebook. You run that task, set up a few rules and, in a few seconds, you’ll have a text file that contains your desired content. From there, you could import that file into a spreadsheet in order to keep track of which posts should be grouped together.

Originally, I was going to convert the text file to a CSV (comma-separated values) file. Then, I was going to set up the columns that I used for my own project. However, that goes in the wrong direction; I wouldn’t want to impose my structure on your content. Besides, you may not feel like bothering with the spreadsheet approach.

Instead, just look at the text file. You will see the sections easily enough to cherry-pick what you need. If you are proficient with your word processor, you may prefer to paste the whole file into that software before editing.

The next post will actually step you through the process of extracting your content.

Fieldnotes

Original Method: WordPress Backup Files

Convert WordPress Blog to e-Book
Backup WordPress Posts
Extracting Posts from WordPress Backup Files
Working with Extracted WordPress Blog Posts
Accessing the WordPress Database Posts Table
(While writing the last post in this list, I discovered a better way...)

How-to

Getting Started

Preparing Your Blog to eBook Categories

New Method: WordPress Export Tool

Using the WordPress Exporter

Raw Content Retrieval

Extracting Your Blog

Preparing Your Blog to eBook Categories

Two concepts drive the strategy outlined in this article. First, no one method of categorization is superior to another. Second, let the tools do as much of the work as possible. Each of the following scenarios embraces these concepts. All you have to do is decide which scenario best describes your blog to eBook status.

Before looking at the scenarios, let us review the two concepts and briefly discuss comments, images and attachments.

Categories

You can segment your content based on any of the following criteria:

  • Title
  • Publication date
  • Post type (page, post, attachment)
  • Category
  • Tag
  • Any attribute, really

So, pick whatever combination makes sense. Keep in mind, though, your prep work will be easier if the combination includes only existing elements. Retrofitting your content may not be an option if your blog is still active; you don’t want to risk messing it up for your visitors or SEO.

A good idea is to create a rough outline of the eBook. Consider the posts or pages that will go into each section. Perhaps your current blog taxonomy—tags, categories and other groupings—will help you decide. WordPress.org has a technical discussion of Taxonomies.

Tools

WordPress Export

Obviously, WordPress itself is the best tool you have for preparing your content to make the trip from blog to eBook. Be sure you understand how the built-in WordPress Export tool works (and its limitations!). If you have a huge blog, consider limiting the amount of exported content. You might even plan ahead by deciding to perform multiple exports.

WordPress Posts Menu

Using the Posts menu, add categories and tags to your taxonomy as needed. By adding these ahead of time, you can use the bulk editing tools for pages and posts.

[Image: Add categories and tags via the WordPress Posts Menu]

WordPress Bulk Editor

The bulk editors are great time-savers. Not only can you filter your posts, you can also select the ones you wish to edit. In this instance, editing is limited to a few attributes, such as categories, tags, author and publication status.

[Image: WordPress Bulk Editor. Select posts to be edited]

[Image: WordPress Bulk Editor. Bulk editing multiple posts]

Deferred Preparation

You don’t have to mess around with the bulk editors. In fact, if your blog is still active, you may want to defer categorization until after you have exported the posts you want in your eBook. Don’t take risks with your visitors’ experience.

Again, if you have a huge blog, consider doing multiple exports to reduce the size of the export files. With the right file managing tool, you can work with multiple export files as easily as you can with a single, massive file. (Of course, the recommended tool is Retrievem!)

One of the neat things about the WordPress XML file format is that it can be used to create other types of files. For example, if you can get your blog information into a spreadsheet, you can play around with the titles, grouping, filtering and sorting them by categories. You can create new categories, rename others and generally treat your blog as the actual outline for your eBook!
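As a rough illustration of that idea, here is a minimal Python sketch that turns an export file into spreadsheet-ready CSV rows. It assumes the export wraps each post in an `<item>` element and marks categories with `<category domain="category">`; the function name `export_to_csv` and the sample data are made up for this example, so check them against your own export file.

```python
import csv
import io
import xml.etree.ElementTree as ET


def export_to_csv(xml_text):
    """Return CSV text with one row per <item>: title, date, categories."""
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "pubDate", "categories"])
    for item in root.iter("item"):
        # Assumption: categories carry domain="category"; tags use another domain.
        categories = [
            (c.text or "").strip()
            for c in item.findall("category")
            if c.get("domain") == "category"
        ]
        writer.writerow([
            item.findtext("title", default=""),
            item.findtext("pubDate", default=""),
            "; ".join(categories),
        ])
    return buffer.getvalue()


# Hypothetical, heavily trimmed export used only to exercise the function.
SAMPLE = """<rss version="2.0">
<channel>
<item>
<title>My Post</title>
<pubDate>Thu, 28 Aug 2014 12:00:00 +0000</pubDate>
<category domain="category" nicename="tutorials"><![CDATA[Tutorials]]></category>
<category domain="post_tag" nicename="xml"><![CDATA[xml]]></category>
</item>
</channel>
</rss>"""

print(export_to_csv(SAMPLE))
```

Save the returned text with a .csv extension and any spreadsheet program should open it, ready for sorting and filtering by category.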

(A future how-to article will show you some ideas for using spreadsheets to organize your blog to eBook projects.)

Comments, Images and Attachments

Comments are what make a blog post come to life. You may wish to recapture some or all of the engagement related to the posts you add to your eBook. Of course, you get to decide whether to include images but, if you plan to add sourced material, be sure to keep track of the attributions. As for attachments, you will most likely be deciding whether or not to link to them.

Thanks to the WordPress Export tool, the technical bits will be available. Depending on your skill with other tools, organizing these extras will be easy, challenging or impossible. You must consider how much time you are willing to spend to recreate the blog. If you have teaching content, you’ll probably want your eBook to faithfully reproduce your lessons.

On the other hand, if you have a bunch of essays where the images were just added for the sake of esthetics (or catching eyeballs), you may not need the images.

Scenarios

Let’s consider some likely scenarios. Your blog may be active, undergoing changes or dead. Your desired eBook will either replace or supplement your blog content. That’s six possible scenarios. Your situation may not be among these six but the ideas should still be helpful.

Scenario 1: Active Blog, eBook to Supplement Posts

Creating an eBook is one way to deal with the invisibility of older posts. This is different from the eBook-for-email address offer, in that you’re culling existing content. That is not to say you couldn’t offer the eBook of old posts as an inducement, especially if you provide a good amount of time-saving information.

The more common strategy I have seen is to offer the eBook for sale to those who either wish to save time or just want a tangible collectible from their favorite author. Transparency is the key to making this work. Just be upfront about the choices available to the reader, especially if the content is being sold.

Whatever your motivation is for keeping both formats, your preparation should include adding a blurb to each post that will be going into the eBook. Think of it as advertising. At the very least, you’ll want to mention that the post is part of a collection. Add a link to your eBook download and you’re set!

A plugin that provides shortcodes for text snippets will be very handy for such blurbs. I use WebSimon Tables, but you could use anything that works for you.

Scenario 2: Active Blog, eBook to Replace Posts

I am not going to give advice about SEO. I don’t care about it, so my actions may seem reckless to those who do care. This scenario is the one I defaulted to when my previous web host crashed and burned. (Okay, I botched an upgrade and hosed my site.)

Whether you remove old posts all at once or little by little, the most important thing you can do is to decide to redirect the permalinks, rather than delete them. That blurb from Scenario 1 would be a good target for such redirects.

You’ll also want to think about customizing your 404 page for those links you decide not to redirect. Link rot can fertilize eBook downloads if you let wandering visitors know what happened to old blog posts. Be sure to include a link to the eBook.

[Image: At last! My own wacky 404 page]

Scenario 3: Evolving Blog, eBook to Supplement Posts

Again, consider your SEO ramifications before going nuts with categories. Your best bet is to defer preparation until you have an offline copy of the blog. Presumably, evolution simply means that you won’t be taking pains to keep interlinking old content. Or maybe you’re just lazy and don’t feel like embarking on Scenario 4…

Scenario 4: Evolving Blog, eBook to Replace Posts

I suspect that you’ll need to do some homework, whether or not you care about SEO. As with Scenario 2, think about how you want to handle the old permalinks. But, unlike replacing posts that may have been topically relevant, find out what you can expect from visitors encountering evidence of unrelated links.

This is one time where those bulk editing tools can come in handy. You’ll basically have three classes of categories, tags and other groupings:

  • The New Stuff
  • The Good Old Stuff
  • The Bad Old Stuff

Try to categorize the old stuff in such a way that it can be hidden from the readers who land on your blog looking for the new stuff. You can use plugins to hide categories from the various list pages generated by WordPress. List pages include Archives, Categories, Tags, Search, etc.

To hide your own pages, take a look at Page-List, for example. Once you understand how it works, you’ll be able to evaluate similar plugins for posts.

Scenario 5: Dead Blog, eBook to Supplement Posts

This is kind of silly, except where you may consider your blog to be inactive rather than dead. Perhaps the blog was the delivery medium for a course. If you no longer offer the course but still wish to share the content via eBook then consider these ideas:

  • Use the course outline as-is for your eBook chapters
  • Tag obsolete posts so that you can filter them out later, either to ignore or update
  • Create an “ignore” category and assign it to posts you want to skip
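Once the posts are offline, the filtering itself is simple. The sketch below is only an illustration: it assumes each post has already been loaded into a dictionary with `categories` and `tags` lists, and the names `select_posts`, "ignore" and "obsolete" are placeholders for whatever labels you actually use.

```python
def select_posts(posts, skip_categories=("ignore",), skip_tags=("obsolete",)):
    """Return the posts that carry none of the skip categories or skip tags."""
    skip_c = {c.lower() for c in skip_categories}
    skip_t = {t.lower() for t in skip_tags}
    kept = []
    for post in posts:
        cats = {c.lower() for c in post.get("categories", [])}
        tags = {t.lower() for t in post.get("tags", [])}
        # Drop the post if it shares any label with the skip lists.
        if cats & skip_c or tags & skip_t:
            continue
        kept.append(post)
    return kept


# Hypothetical posts, standing in for data read from an export file.
POSTS = [
    {"title": "Lesson 1", "categories": ["Course"], "tags": []},
    {"title": "Old schedule", "categories": ["Ignore"], "tags": []},
    {"title": "Lesson 2 (2012 edition)", "categories": ["Course"], "tags": ["obsolete"]},
]

for post in select_posts(POSTS):
    print(post["title"])
```

To review the obsolete posts instead of dropping them, call the function with `skip_tags=()` and filter on the tag yourself.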

Scenario 6: Dead Blog, eBook to Replace Posts

All of the ideas from Scenario 5 can be used here. In addition, think hard about ignoring posts if your blog is going to be deleted. If you don’t already have an archive of old blog posts, you should at least store the posts as saved web pages. You never know when you’ll want to refer to them.

Summary

You should be ready to tackle your eBook before you even log into your WordPress site. Once you have a basic outline, you’ll have a better idea of how to prep your posts and pages. Don’t be too quick to add categories and tags. Also, be careful about handling old permalinks and discarding old content in its original format.

The safest bet is to defer all planning until you have a local copy of your blog. It means more work, but you won’t have to look for an Undo button! The Export tool built into WordPress makes retrieving your blog content a snap, no matter how you prep it. If things go wrong, just download another copy.


Using the WordPress Exporter

The built-in WordPress Exporter utility is a great tool for retrieving some or all of my blog content. Finally, I can describe a process that is potentially useful to others.

First of all, without getting too technical, the WordPress Exporter creates a WXR (WordPress eXtended RSS) file: an RSS feed of the blog, extended with WordPress-specific details. The developers chose XML, a file format that simplifies the monumental task of describing a chunk of content – namely a blog post or blog page. If you use an RSS feed reader, you will appreciate the care taken to preserve the blog post layout.

I decided to start small. I set up the exporter to retrieve a single category of posts. I picked a category that had exactly one post. If I could manage the extraction of a single post, I figured that scaling up would be a simple matter of repetition.

The WordPress Exporter has three main choices for what to include in the export file:

  • All Content
  • Posts
  • Pages

Each choice reveals a second set of options that can be used to limit the amount of content exported. When I chose Posts, I saw these options:

[Image: WordPress Export offers many options]

The ability to fine-tune the export makes this a great tool for a Blog to eBook project. With a bit of planning ahead of time, I imagine that I would save a lot of time by not having to sift through irrelevant posts.

When I clicked the Download Export File button, I received a tiny, 8KB file with my one post. The next step is to extract that post and any other relevant information. Since my goal is to make this process useful to others, I will try to be as flexible as possible – probably grabbing more data than necessary. Stay tuned.


Accessing the WordPress Database Posts Table

When I first wrote about converting some of my blog posts to an e-book, I hadn’t planned on repeating the process. Since I’m including a tutorial this time around, I will find out if my biggest concern is valid:

The most important things I had to know were the order and type of data used to store a blog post.
This requirement is the main drawback to working directly with a file. If a future version of WordPress
changes the database structure, my parser would have to be updated, as well.

from Extracting Posts From WordPress Files

Although I used iThemes Security Plugin for the tutorial, I installed WP DB Backup so that I could compare its backup file to the one I created for my e-book.

[Image: WP DB Backup plugin page]

Regrouping

I quickly discovered two things: the table structure had indeed changed, and WP DB Backup and iThemes Security both extracted the same fields from the post table. This meant that I could not use my old parsing pattern. On the other hand, at least I stood a chance to make a pattern that would work, regardless of the plugin used to create the backup file.

Clearly, I needed to standardize my extraction procedure; otherwise, this project would be of no use to anyone else. For the morbidly curious, here is a snapshot of the two table structures:

[Image: Changed tables. Don’t count on table columns staying the same!]

As I was putting this together, I realized that I didn’t have a clue about why the structures were different. Rather than speculate, I hunted for the answer and found it deep within the WordPress Codex. If you examine the Changelog for the Post Table, you’ll notice that the category field was dropped in version 2.8:

[Image: WordPress Codex. The Posts table changes frequently]

It is one thing to account for table structure changes. It is quite another thing to map the changes to a complex pattern. In fact, doing so might not be the best option. I came across an interesting post about importing posts and pages from one website to another. This gave me a new direction to explore.

Using the WordPress Exporter

I played around with the tool provided by WordPress. Its output turns out to be a simple XML file! Of course, simple is relative. The exporter has three options: all content, posts or pages. The good news is that if you don’t want to bother with adding pages to your e-book, you could use the posts option.

At this point, I was ready to abandon the old pattern in favor of parsing the XML file. After all, the XML file is much cleaner than the raw data from the database. I would need to extract the title, publication date and content. I found the specific XML tags that identified these elements:

  • <title> and </title>
  • <pubDate> and </pubDate>
  • <content:encoded><![CDATA[ and ]]></content:encoded>
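To give a feel for how little work those three tags require, here is a minimal Python sketch using the standard library's XML parser. It is not the method used in this project, only an illustration. It assumes each post sits inside an `<item>` element and that the `content` prefix maps to the usual RSS content-module namespace; the function name `extract_posts` and the sample are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: <content:encoded> uses the standard RSS content-module namespace.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


def extract_posts(xml_text):
    """Return a list of dicts holding the title, pubDate and content of each <item>."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "pubDate": item.findtext("pubDate", default=""),
            # The parser unwraps the CDATA section and returns the raw HTML.
            "content": item.findtext("content:encoded", default="", namespaces=NS),
        })
    return posts


# Hypothetical one-post export, trimmed to the three elements of interest.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item>
<title>Hello World</title>
<pubDate>Thu, 28 Aug 2014 12:00:00 +0000</pubDate>
<content:encoded><![CDATA[<p>First post.</p>]]></content:encoded>
</item>
</channel>
</rss>"""

for post in extract_posts(SAMPLE):
    print(post["title"], "|", post["pubDate"])
```

A real export file contains many more elements per post; anything the function does not ask for is simply ignored.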

That is the topic of the next post.


How to Backup All Your WordPress Posts

Project update

Aug. 28, 2014:

The WordPress Exporter provides a better way to extract blog posts. I will use it for the rest of this project.


As part of the Blog-to-eBook Project, I will present a step-by-step procedure for acquiring your blog posts and pages. You will also gain the benefit of having a backup plan for your WordPress blog.

Plugins for backing up WordPress have many features. At the most basic, a good backup plugin will export your posts from the WordPress database on your web host. Once the posts have been extracted, you can save, download, email or copy them to a cloud service like Dropbox. As long as you can access the backup files and copy them to your local hard drive, you can use whatever plugin and storage scheme you’d like. (For the sake of clarity, I use the term posts only. WordPress considers pages and attachments to be posts as well, and they all get backed up. Be aware that only the links to attachments are backed up in the posts table. Depending on your chosen plugin, the actual attachments may be added to the backup file.)

For this tutorial, I used iThemes Security, a great plugin for securing and backing up WordPress installations. I set the backups to be emailed to me, so that I can easily download the attachments. (Plus, I don’t want to use up server space.)

To start, you have to install your chosen plugin. Once you have activated it, find the setting that allows you to configure the backups.

[Image: iThemes Security plugin page]

Weirdly, the iThemes Security Backups tab emphasizes the Create Database Backup button when, in fact, your first step is to click the Adjust Backup Settings link.

[Image: iThemes Security backup tab. Do not click the button…yet]

On the massive settings tab, the backup settings are about midway down. You have just three Backup Methods from which to choose. I selected Email Only. If you choose Save Locally Only, you’ll have to transfer the file via FTP. This might actually be necessary if your email chokes on huge attachments.

You should check the box for Zip Database Backups. Compressing the backup greatly reduces the size of the file you have to download. (See final image)

[Image: iThemes backup settings]

Finally, set up scheduled backups. It doesn’t matter for this project but, if you are blogging actively, you may as well reap the benefits of current backups.

[Image: You may as well enable scheduled backups]

Back on the main iThemes Security Backups tab, click the Create Database Backup button to generate a current backup. Get that file onto your hard drive so that you can begin the next step.

[Image: Now you can click the button]


Here is the downloaded attachment. I opened it in 7-zip to show you the compression: the zip file is just over 20% of the original file’s size! (375 KB attachment vs. 1.7 MB original)

[Image: 7-zip Info screen, showing a size reduction of nearly 80%]



Extracting Posts from WordPress Backup Files

The parsing and extracting portion of my Blog to eBook project was a fun, one-time exercise in reading the WordPress database backup file. Even though the database can be read using a powerful tool like phpMyAdmin, I took advantage of the fact that WordPress can make a plain text file when it creates the backup. Besides, I wasn’t about to tinker around with the original database!

The key to parsing and extracting my blog posts was deciphering the backup file. Every MySQL database can export some or all of its records into a single file. Records, such as details about blog posts and pages, are stored in tables. Each table has a structure, basically a list that describes how to store each detail. The most important things I had to know were the order and type of data used to store a blog post.

This requirement is the main drawback to working directly with a file. If a future version of WordPress changes the database structure, my parser would have to be updated, as well. (In the how-to portion of this project, I’ll discuss ways to mitigate this.)

RegexBuddy

WordPress stores posts, pages and attachments in the same database table. I used a program called RegexBuddy to build two pattern-matching instructions. The first pattern matched all three types of entries. Attachments include images, video, spreadsheets and other documents. Since those were probably not going to be in my ebook, I used a second pattern to match just the attachments.

By running both patterns, I was able to extract the blog posts from the backup file. I pasted the extracted information into an Excel spreadsheet. Next, I compared the list of attachments to the list of everything and deleted the spreadsheet rows that contained attachments. Then, I sorted the records by date and went through each row, cherry-picking the posts that I wanted to include in my ebook. The last thing I had to do was to clean up the actual posts, by removing HTML tags, web addresses and embedded scripts.
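For readers who would rather see the two-pattern idea than imagine it, here is a small Python sketch. The record layout (id|date|type|title) is a made-up simplification, not the real backup file format, and these are not the patterns I built in RegexBuddy. The point is only the technique: match everything, match the attachments, subtract, then sort by date.

```python
import re

# Hypothetical, simplified record layout: id|date|type|title (NOT the real dump format).
ALL_ENTRIES = re.compile(r"^(\d+)\|(\d{4}-\d{2}-\d{2})\|(\w+)\|(.*)$", re.MULTILINE)
ATTACHMENTS = re.compile(r"^(\d+)\|\d{4}-\d{2}-\d{2}\|attachment\|", re.MULTILINE)


def posts_without_attachments(dump):
    """Match every entry, drop those the attachment pattern also matches, sort by date."""
    attachment_ids = {m.group(1) for m in ATTACHMENTS.finditer(dump)}
    rows = [
        m.groups()
        for m in ALL_ENTRIES.finditer(dump)
        if m.group(1) not in attachment_ids
    ]
    # ISO-style dates sort correctly as plain strings.
    return sorted(rows, key=lambda row: row[1])


# Invented sample records for illustration.
DUMP = """3|2014-03-01|post|Third
1|2014-01-15|post|First
2|2014-02-10|attachment|photo.jpg
4|2014-01-20|page|About"""

for row in posts_without_attachments(DUMP):
    print(row)
```

In the spreadsheet version, the subtraction step was the manual comparison of the two lists; here a set of attachment ids does the same job.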

I’ll explain the cleanup process next time. Here is a summary, in pictures. (To display the second pattern, the one used to get rid of attachments, I chose Excel rather than RegexBuddy; the RegexBuddy view won’t mean anything unless you understand regular expressions.)

[Image: Parsing and extracting blog posts from a WordPress backup]

Project update

Aug. 28, 2014:

The WordPress Exporter provides a better way to extract blog posts. I will use it for the rest of this project.



Working with Extracted WordPress Blog Posts

The last thing I had to do was to clean up the actual posts, by removing HTML tags, web addresses and embedded scripts. I used Retrievem, software that I developed for just such tasks.

[Image: Extracted Post BEFORE Cleanup]

The raw text extracted from the WordPress database is practically unreadable. Line break markers, hyperlinks and HTML formatting tags had to be removed or replaced with their visual equivalents.
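Retrievem handled this for me, but the general idea can be sketched in a few lines of Python with the standard library's HTML parser. Treat it as an approximation under my own assumptions: the class name `TextExtractor`, the function `clean_post` and the list of tags that become line breaks are choices made for this example, not a description of how Retrievem works.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skip <script>/<style>, turn block tags into line breaks."""

    BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3"}  # assumed list
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is dropped.
        if not self.skip_depth:
            self.parts.append(data)


def clean_post(html):
    """Return the post as plain text, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


# Invented sample post for illustration.
RAW = ('<p>Hello <a href="http://example.com">world</a></p>'
       '<script>alert(1)</script><p>Bye<br/>now</p>')
print(clean_post(RAW))
```

Links lose their web addresses but keep their visible text, which matches what I wanted for the e-book.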

[Image: Extracted Post AFTER Cleanup]

As part of the cleanup, I added some of my own markers, tags in brackets that identified each post. A combination of Word documents and Spreadsheet references simplified the final task of choosing the posts I wanted in my e-book.

[Image: Extracted Post in e-Book]

Project update

Aug. 28, 2014:

The WordPress Exporter provides a better way to extract blog posts. I will use it for the rest of this project.



One Way to Convert Your WordPress Blog to an e-Book

My blog to ebook project is going to be an exercise in parsing and extracting. I briefly considered using Anthologize, a WordPress plugin that many people seem to love. Personally, I want total control over the entire process. So, I’ll begin by extracting all posts, sorting them and picking out the ones that will be added to the ebooks.

The WordPress database tables that store your posts, pages, comments and other data can be exported into a simple text file. In fact, that’s what happens when you perform a backup, using a plugin such as WordPress Database Backup (WPDB) or iThemes Security.

I am taking full advantage of this. I instructed WPDB to email my backups to my Gmail account. I can save any one of them to my hard drive and unzip it into a folder. I use 7-zip, a free, open-source program that creates and manages archive files. After a bit of parsing and extracting, I end up with a spreadsheet of post titles, dates and actual text.

I’ll explain the parsing and extracting, next time. For now, here is a montage of the action:

[Image: Blog to eBook. From Database to Spreadsheet to Word Document]

Project update

Aug. 28, 2014:

The WordPress Exporter provides a better way to extract blog posts. I will use it for the rest of this project.



The ParserMonster Project

I have installed the ParserMonster Project wiki on this site to document the features of the new ParserMonster Framework. There is not much on it, at the moment, so you should bookmark it or subscribe to the blog feed if you want to keep up with it as it grows.

[Image: Check out DokuWiki.org]

I decided to use DokuWiki, mostly because it doesn’t rely on a database, but also because it reminds me of TiddlyWiki, which I used to create the older version’s documentation.

Updates

As with all of the projects you will find on Morpho Designs, I’ll be sharing updates on how I actually use the DokuWiki software. I think that one of the main things I will be doing is figuring out how to automate the documentation process.

Software documentation needs to be consistent. While DokuWiki provides a consistent interface, I need to ensure that the content is presented in a uniform manner. That’s why I will probably make a bunch of boilerplate snippets. Stay tuned!