Diamonds of Data: Digging Beneath the Surface of Your Content

The theme this month here on Digital Book World is data and analytics. We are all hyper-aware of the data that is outside our books (metadata). Simon & Schuster CEO Carolyn Reidy made that point very well last year when she gave an example to Publishers Weekly about the success that came from adding better metadata to just one title on the S&S list.

But I’d like to talk about an aspect of data that most publishers are still not engaging and are having a hard time even knowing what to do with: the data beneath the surface.

Now, anyone who knows me personally knows that I am a huge Minecraft nerd. Minecraft is a computer game that has become very popular in the last five or six years, appealing mostly to kids in the 6-12 age range (but also to a growing number of adults like me). It has spawned some great books, opened up the minds of children around the world to new creative endeavors, and developed a culture quite unlike any other. If you have never heard of Minecraft, I would be very surprised; the game’s creator sold it to Microsoft last year for $2.5 billion (yes, with a “b”).


Minecraft is largely about collecting resources from the randomly generated world and using them to survive against monsters that try to attack you, and to build whatever your mind can conceive. The most valuable resource in the game, especially when you are first starting out in a new world, is diamonds. These shiny little objects are very important for building the best tools and armor, but are also difficult to find and hazardous to obtain. You have to dig down deep into the ground, navigate through caves filled with monsters, and avoid lakes of lava that can both kill you and destroy all the resources you have gathered.

Much of the data in publishing is like diamonds in Minecraft: valuable, useful for building tools and protecting against competition, and integral to discoverability and success. This data also lies beneath the surface of our world, down inside manuscripts and published books, in editing room bins, and in the minds of our authors. However, most publishing professionals are just as befuddled by that data as they are by their kids’ love of Minecraft.

Not sure what kind of data I’m talking about? Think about these examples:

• The author of that new treatise on the Civil War wrote 100 more pages of great content than you can fit into the print book. What can you do with that instead of just cutting it?
• The YA novel coming out in two months introduces 30 characters with names that are just similar enough to be confusing. How can you ensure that readers have that information readily available when they need it? And how can you use that data for discoverability?
• The edited volume on healthcare you published last year has some chapters you would like to re-use, but others that you don’t. What can you do with that?

… and the list goes on. When you start looking, I’m sure you’ll be able to find a lot of great material with a variety of uses.

So, how do you collect this data? How do you store it and recall it? That’s a good question!

The first step is always to train your team to know how to recognize good data that might be useful later. Editors are obviously the front line on this, but every team member can be on the lookout for valuable diamonds sitting there waiting to be collected.

A big part of data collection and management is connected to your content workflow. Publishers that have implemented an XML-based workflow can have the upper hand when it comes to the collection of deeper data, but having an XML workflow and having accessible, usable data are not always the same thing. The key is to collect the data in a repository that allows you to tag it, manage it, pull it out and re-use it at will. Also, you don’t actually have to start with XML. Your EPUB files might be a good way to start your content archive without having to revamp your entire workflow.
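To make the idea of a taggable, re-usable repository a little more concrete, here is a minimal sketch in Python. It is only an illustration: the names ContentChunk and ContentArchive, and the sample chapter titles and tags, are hypothetical, and a real archive would sit on a database or content management system rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ContentChunk:
    """One reusable piece of content, e.g. a chapter or a passage cut from print."""
    chunk_id: str
    title: str
    body: str
    tags: set = field(default_factory=set)


class ContentArchive:
    """Hypothetical in-memory archive: store chunks, tag them, pull them back out."""

    def __init__(self):
        self._chunks = {}

    def add(self, chunk):
        self._chunks[chunk.chunk_id] = chunk

    def tag(self, chunk_id, *tags):
        # Attach one or more tags to a chunk that is already stored.
        self._chunks[chunk_id].tags.update(tags)

    def find(self, *tags):
        # Return every chunk that carries ALL of the requested tags.
        wanted = set(tags)
        return [c for c in self._chunks.values() if wanted <= c.tags]


# Example data (invented for illustration only).
archive = ContentArchive()
archive.add(ContentChunk("hc-03", "Rural Clinics", "Chapter text...", {"healthcare"}))
archive.add(ContentChunk("hc-07", "Billing Codes", "Chapter text...", {"healthcare"}))
archive.tag("hc-03", "reusable")

reusable_titles = [c.title for c in archive.find("healthcare", "reusable")]
```

The point of the sketch is the shape of the workflow, not the tooling: once each piece of content has an identifier and a set of tags, pulling out "the healthcare chapters we want to re-use" becomes a simple query instead of a hunt through old files.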

Of course, the quality of your EPUB data is important if you are going to be using those files as the basis of your content archive (or even if you are not!). EPUB 3 is now accepted by all the major ebook retailers and the vast majority of the minor ones, so switch to that from EPUB 2 if you have not already. With that, be sure to dig into the semantic markup and semantic inflection capabilities of HTML5 and EPUB 3, and develop a plan for implementing accessibility in all your EPUB files.
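As a rough illustration of why that semantic markup matters for a content archive, the Python sketch below opens an EPUB (which is a ZIP container) and lists every element that carries an epub:type attribute. The function name semantic_sections is my own invention, and the sketch assumes the content documents are well-formed XHTML; files that fail to parse (for example, because they use named HTML entities) are simply skipped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the epub:type attribute used for EPUB 3 semantic inflection.
EPUB_OPS_NS = "http://www.idpf.org/2007/ops"
EPUB_TYPE = f"{{{EPUB_OPS_NS}}}type"


def semantic_sections(epub_file):
    """Return (file name, epub:type value, text preview) for each tagged element.

    epub_file may be a path or a file-like object. Hypothetical helper,
    written for illustration only.
    """
    found = []
    with zipfile.ZipFile(epub_file) as book:
        for name in book.namelist():
            if not name.endswith((".xhtml", ".html")):
                continue
            try:
                root = ET.fromstring(book.read(name))
            except ET.ParseError:
                continue  # not well-formed XML; skip rather than guess
            for el in root.iter():
                kind = el.get(EPUB_TYPE)
                if kind:
                    # Collapse whitespace and keep a short preview of the text.
                    text = " ".join("".join(el.itertext()).split())
                    found.append((name, kind, text[:60]))
    return found
```

Run against a well-tagged file, a function like this would hand back the chapters, footnotes, glossaries and so on by name, which is exactly the kind of data you would want to load into an archive. Run against a file with no semantic markup, it returns an empty list, which tells you something too.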

Those steps are not just about making better ebooks or selling them to readers with disabilities; a semantically rich, accessible EPUB file may actually be a better resource than an XML file, especially if you don’t have the ability to build a new XML-based workflow.

I hope these thoughts have stimulated your imagination and creativity a bit and have encouraged you to dig deeper into your content archives and develop new strategies to take advantage of the data sitting beneath the surface.

