What Agents Should Know About Ebooks Made from PDFs

Expert publishing blog opinions are solely those of the blogger and not necessarily endorsed by DBW.

Agents, here’s what you should know about ebooks made from PDFs: they’re not perfect. Knowing this will put you in a better position to protect your authors’ work.

While an ebook made from a PDF is almost certainly better than one made from a scan (OCR), don’t fall into the trap of thinking that it will be perfect. Many people do fall into this (understandable) trap, though. Even some ebook conversion “experts,” who know that conversion from PDF is not perfect, won’t be completely clear with you on this point.

(For this article I’m assuming that PDFs are not scans.)

Conversion from a PDF has many of the same challenges as conversion from a scan. Some of the more important (and error-prone!) of these challenges are as follows:

1. Distinguishing hard vs. soft line breaks (unbreaking lines)
2. Distinguishing hard vs. soft hyphens (de-hyphenating)
3. Handling images
4. Handling notes (footnotes & endnotes)
5. Rejecting headers & footers
6. Handling the table of contents

In other words, compared to conversion from a scan, the only thing conversion from a PDF doesn’t have to worry about is OCR in its narrowest sense: recognizing characters. With a PDF, the conversion software knows what every character is, and where it lies on the page. That sounds like all it would need to know, but it is far from enough. The software, or the human operator (often low-paid and/or low-skilled), needs to understand what each character’s role is on the page. Is it part of a footnote? Is it part of a footer? Knowing such things can be difficult.
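To make the header-and-footer problem concrete, here is a minimal Python sketch of one common heuristic: treat the first or last line of a page as a running header or footer if (after replacing digits) it recurs on many pages. The function names, the `pages` structure (a list of pages, each a list of text lines), and the `min_share` threshold are all assumptions made for illustration, not a description of any particular conversion tool.

```python
import re
from collections import Counter


def normalize(line):
    # Replace digit runs so "Page 12" and "Page 13" compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, min_share=0.5):
    """Return normalized first/last lines seen on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        lines = [l for l in page if l.strip()]
        if not lines:
            continue
        # Use a set so a one-line page is not counted twice.
        counts.update({normalize(lines[0]), normalize(lines[-1])})
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


def strip_running_lines(pages, min_share=0.5):
    """Drop the first/last line of each page when it looks like a header/footer."""
    running = find_running_lines(pages, min_share)
    cleaned = []
    for page in pages:
        lines = [l for l in page if l.strip()]
        if lines and normalize(lines[0]) in running:
            lines = lines[1:]
        if lines and normalize(lines[-1]) in running:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


pages = [
    ["MOBY DICK", "Call me Ishmael.", "1"],
    ["MOBY DICK", "Some years ago.", "2"],
    ["MOBY DICK", "Never mind how long.", "3"],
]
print(strip_running_lines(pages))
```

Even this simple rule can go wrong: a body line that happens to be just a number, or a header that changes with every chapter, will be misjudged. That is exactly the kind of error a human proofreader then has to catch.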

It is easy enough for conversion software to know what line a character belongs to, but beyond that, things get complicated quickly. For example, is the end of a particular line a hard break, as in a line of poetry, or is it a soft break, as in all lines of a prose paragraph except the last? Is a particular hyphen at the end of a line hard, meaning it should “survive” the unbreaking of the line, or is it soft, meaning it should disappear when the line is unbroken?
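As an illustration of why this is error-prone, here is a rough Python sketch of a dictionary-based guess at hard vs. soft hyphens while unbreaking the lines of a single prose paragraph. The function name `unbreak` and the `known_words` set are hypothetical, and the sketch assumes every line break in the input is soft (so it would ruin poetry).

```python
def unbreak(lines, known_words):
    """Join the lines of one prose paragraph into a single string.

    known_words is an assumed set of lowercase dictionary words, used to
    guess whether a line-ending hyphen is soft (drop it) or hard (keep it).
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            before = out[:-1].split()
            head = before[-1] if before else ""
            tail = line.split()[0]
            joined = (head + tail).strip(".,;:!?\"'").lower()
            if joined in known_words:
                out = out[:-1] + line  # soft hyphen: remove it
            else:
                out = out + line       # hard hyphen: keep it
        elif out:
            out = out + " " + line     # ordinary soft line break
        else:
            out = line
    return out


lines = ["The conver-", "sion of a well-", "known book."]
print(unbreak(lines, {"conversion"}))
```

The dictionary test is only a guess. A word like “re-creation” broken at its hyphen would be joined into “recreation,” which is also a real word but a different one. No word list can settle that case; only the source file or a human reader can.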

The role of a character is usually captured in the publishing software (e.g. InDesign), but the role is lost when the publishing software creates a PDF. So in the vast majority of cases, it’s desirable to create an ebook directly from the publishing software rather than from a PDF. There is no guarantee, however, that an ebook created from InDesign (or similar programs) will be perfect, either. But, if InDesign (or a similar program) is indeed used, entire categories of errors, like de-hyphenation errors, become impossible.

If a book must be converted from PDF (for example, if the publishing software’s source files for the book have been lost, which does happen!), then other quality control measures must be put in place if high quality is to be achieved. These include proofreading, spell-checking (surprisingly, not always done, or not always done well), and re-keying (re-typing) the book and comparing the results.
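The “re-key and compare” step can be partly automated. Below is a small Python sketch that uses the standard library’s `difflib.SequenceMatcher` to list the places where the converted text and the re-keyed text disagree, word by word. The function name `word_differences` and the sample strings are made up for illustration; a real workflow would compare whole chapters and would still need a person to decide which version is right.

```python
import difflib


def word_differences(converted, rekeyed):
    """Return (converted_words, rekeyed_words) pairs where the texts disagree."""
    a = converted.split()
    b = rekeyed.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


converted = "It was a well known conver- sion error"
rekeyed = "It was a well-known conversion error"
for left, right in word_differences(converted, rekeyed):
    print(repr(left), "vs.", repr(right))
```

Because the two versions are produced independently, they are unlikely to contain the same mistake in the same place, so each disagreement points to a spot worth a human look.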

Now, why have I directed this article toward agents? The answer is, it reflects the current state of my thinking about who can improve ebook quality. When I first discovered the low quality of ebooks, I was outraged, and thought that it was the publishers’ fault. Now I don’t think so.

A publisher should be expected to set the price and quality of their books so as to maximize profit. That notion of quality includes production quality, e.g. things like typos. The profit-maximizing quality, in terms of typos, is (or is perceived to be) very high in the paper book world. But who’s to say what quality and price maximize profit in the ebook world? So I must admit that for all I know, publishers are doing the right thing by releasing ebooks whose quality is low compared to their paper counterparts. Any further investment in ebook quality might yield negligible returns.

In other words, unless customers demand high quality (which apparently they don’t), don’t expect publishers to supply it.

So if not publishers, who might be interested in increasing ebook quality? My next thought was, maybe authors would be interested. Maybe authors would be perturbed to see their books appear with “typos” (conversion errors) even if they were selling like hotcakes.

But many authors don’t read ebooks and/or don’t understand the technical issues surrounding ebook conversion. For example, they may assume ebooks inherently match their paper counterparts perfectly (if only it were so!). Or, on the other hand, they may be aware of the problems but resignedly assume that ebooks can never match paper books well.

The bottom line is, in practice most authors don’t (and shouldn’t have to!) worry about such things. In my opinion, that’s what their agents are for.

So I landed upon agents as the logical advocates for protecting their authors’ work in the new medium of ebooks. Just as agents hopefully negotiated new publication rights and royalties for this new medium, they should have negotiated quality standards as well. To my knowledge, this very rarely (if ever) happened. Agents did not, and do not, include quality standards in ebook contracts. That’s what I’d like to see happen.

As it is, all agents and authors can rely on is whatever quality provisions exist in the paper book’s contract. If the paper book must reasonably match the approved page proofs, I think one can argue that the ebook should be held to the same standard, though what is a “reasonable match” is admittedly a little harder to define for an ebook.
