What Authors Should Know About OCR

Expert publishing blog opinions are solely those of the blogger and not necessarily endorsed by DBW.

If you published a book before 2008, its ebook edition was probably created using optical character recognition (OCR). And if your ebook was created using OCR, it probably has typos in it. That’s the bad news.

The good news: you don’t have to accept this situation.

What’s special about the year 2008? Nothing, really. I just chose 2008 because the first Kindle came out in late 2007. So 2008 is the earliest year I can imagine a significant number of publishers adopting a single-source workflow: a workflow in which the ebook is created from the same files used to create the paper book. For example, nowadays Adobe InDesign can create an ebook and a paper book (well, a PDF) from the same file. A single-source workflow avoids OCR and OCR-caused typos. It doesn’t avoid all problems, but it goes a long way toward making higher-quality ebooks.

Many publishers continued to use OCR for books published more recently than 2008. On the other hand, commendably, some publishers used single-source workflows for books published before 2008. Since files may be available for books published as long ago as the 1970s, single-source workflows are possible (though unlikely) for books published while Jeff Bezos was still a child.

The bottom line for authors is this: regardless of its year of paper publication, ask your publisher whether OCR was used to create the ebook edition of your book.

If OCR was used, your ebook probably has typos in it. It was probably spellchecked, but not carefully. The whole conversion, including spellchecking, was probably outsourced to inexpensive workers who, even if their English skills were good, were probably working under severe time constraints. And even the most careful spellchecking, as you know, is no substitute for good old proofreading. Your ebook was almost certainly not proofread.
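To see why spellchecking alone falls short, here is a minimal sketch in Python. It is not any real spellchecker: the tiny word list and the `spellcheck` function are hypothetical, made up for illustration. The point is that a dictionary lookup only catches OCR errors that produce non-words. An OCR error that happens to produce a real word passes straight through.

```python
# Toy word list standing in for a real spellchecker's dictionary.
# This list is an assumption made purely for illustration.
KNOWN_WORDS = {"we", "must", "not", "now", "surrender"}


def spellcheck(text):
    """Return the words in `text` that are not in KNOWN_WORDS."""
    # Strip common punctuation and lowercase each word before the lookup.
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return [w for w in words if w and w not in KNOWN_WORDS]


# Suppose the paper book says "We must not surrender".
# An OCR error that yields a real word ("now" for "not") is not flagged.
print(spellcheck("We must now surrender"))

# An OCR error that yields a non-word ("rnust" for "must") is flagged.
print(spellcheck("We rnust not surrender"))
```

Only a human reading the text against the paper book would notice that the first sentence now says the opposite of what the author wrote.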

So what can you do?

Ask your publisher to tell you what efforts they made to have your ebook match your paper book. If you are not impressed, try to get them to make more of an effort and fix it now, after the fact. You should not be impressed with an answer like “we outsourced the conversion to one of the most popular conversion houses used by the industry,” because in this case “popular” means low-cost, not high-quality. And you definitely should not be impressed with an answer like “we spellchecked it.”

You should only be impressed with one answer: “We proofread it just as carefully as we did the paper version.”

By the way, if you have an agent, ask your agent to take up these ebook quality issues with the publisher. After all, it is your agent who should have been advocating for the proper treatment of your work in the first place. By and large, agents have either not understood what has happened to their authors’ work in ebooks, or they have understood but have been unwilling or unable to prevent it. Therefore, they share quite a bit of the blame here. For-profit publishers are just that: for profit. Where profit is aligned with authors’ interests, great. Where it is not aligned, that’s where agents need to advocate. Ebook quality may be one of those “unaligned” areas.

Regardless of who asks the publisher—you or your agent—what can you ask your publisher to do if you suspect a low-quality ebook conversion has been performed?

Some good remediation options include asking them to:

1. reconvert it using a single-source workflow.
2. proofread it carefully.
3. have the book re-typed and compared to the OCR conversion.

The publisher may claim that #1 (single-source reconversion) is impossible. This is certainly true if the book is old enough to have been typeset using pre-computer technology. But as long as it was typeset with a computer, there is (or should be!) a file somewhere, and this file should be decodable. If they’ve lost the file (incredibly, this happens!), or won’t pay to have it decoded (by the way, this is my specialty), then perhaps #2 (proofreading) or #3 (re-typing) would work.

Re-typing, when combined with OCR, is a laborious but high-quality strategy. The reason it results in high quality is that the types of errors that a (human) typist makes are very different from the types of errors that OCR software makes. When you combine two noisy (error-containing) versions of the same signal (the book), if the two noises are uncorrelated, you can recover a surprisingly high-quality version of the original.
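The comparison step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any conversion house's actual tooling: the `disagreements` function and both sample strings are made up, and the comparison uses the standard library's `difflib.SequenceMatcher` on whitespace-separated words. Wherever the OCR version and the re-typed version agree, the text is probably right. Wherever they disagree, a person checks the paper book.

```python
import difflib


def disagreements(ocr_text, typed_text):
    """Return (ocr_words, typed_words) pairs where the two versions differ."""
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    # autojunk=False turns off a speed heuristic that can hide matches
    # in long texts with many repeated words.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=typed_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ocr_words[i1:i2]), " ".join(typed_words[j1:j2])))
    return diffs


# Hypothetical sample: the OCR has two OCR-style errors ("now", "tirnes"),
# and the typist has one typist-style error ("teh").
ocr = "We must now surrender, he said. It was the best of tirnes."
typed = "We must not surrender, he said. It was teh best of times."

for ocr_version, typed_version in disagreements(ocr, typed):
    print(f"OCR: {ocr_version!r}   typist: {typed_version!r}")
```

In this sample the two versions never make the same mistake in the same place, so every error shows up as a disagreement. That is the sense in which uncorrelated noise helps: an error survives unnoticed only if both versions get the same word wrong in the same way.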

Finally, it is reasonable to ask on what basis you (the author) or your agent can make such ebook quality demands of the publisher. Even if you don’t ask yourself this question, your publisher almost certainly will!

The answer, as is so often the case, is that it depends on your contract. Before diving into questions of ebook quality, it might be worth stepping back for a second and asking if your publisher even has the rights to make an ebook. If your original contract explicitly included ebook rights, then of course there is no issue. Similarly, there is no issue if you signed a new contract giving your publisher ebook rights.

The only case in which there might be an issue is if your original and only contract was interpreted by your publisher to include ebook rights. This means one of two things. If your publisher’s interpretation was correct, then, in my opinion, your original contract was over-broad, but that’s water under the bridge. The other possibility, however, is that your publisher’s interpretation went beyond their rights, in which case you could of course sue. But it probably makes more sense to just demand a new explicit contract for ebooks. You should demand pretty favorable terms as compensation for their going beyond their original contractual bounds.

So assuming the basic issue of rights is resolved, let’s get back to the much trickier question of what ebook quality obligations are implied by the contract. I say “implied” because even in a contract that explicitly lays out ebook rights, it is very rare that ebook quality obligations are explicitly laid out.

So you’ll have to rely on whatever paper-edition quality obligations are explicit in the contract. Hopefully there was a final correction phase in your paper publication process, as specified by your contract. This would typically be the correction of page proofs. Your publisher was hopefully obligated to print—and hopefully did print—something that more or less perfectly matched the corrected page proofs.

My argument, which I urge to be your argument, is that, lacking any specific ebook quality obligations, your contract implies that your ebook, like your paper book, should have been produced to match the corrected page proofs. This is the basis on which I think you can make ebook quality demands of your publisher.

I hope this article has alerted you to what might have happened to your book during conversion to ebook, and what you can do about it. Feel free to post any thoughts or questions in the comments below.



4 thoughts on “What Authors Should Know About OCR”

  1. Michael W. Perry

    As someone who’s created digital books from OCR more times than I care to remember, I’ll add some other techniques to get that text right.

    1. Use text-to-speech to read the book aloud. There are mistakes that are easily missed in reading that become glaringly obvious in listening. Over time, you may even acquire the knack for telling when a period at the end of a sentence is actually a hard-to-see comma. Someone should create a text-to-speech app specifically designed to make catching typos easier. For instance, it could read “their” aloud as “t-h-e-i-r” to make it easier to catch sound-alike problems.

    2. Display the written text on different devices with differing line and page breaks, i.e. proof the PDF on a large computer screen and the epub on a tablet. Looking different often brings out mistakes. Try different physical settings too. Read once in your office at a desk. Read again at home on your living room sofa. Do whatever it takes to keep your mind focused and open.

    3. Especially for academic materials, consider making a pass in which you look for particularly horrible inaccuracies that result from bad OCRing. For _Chesterton on War and Peace_, I had a quote that was along the lines of “we must now…” What was being said didn’t seem quite right for the person being quoted, so I checked the source. That OCRed “now” was actually “not” (a ‘w’ being much like a lowercase ‘t’). OCRing is mindless. It does not know that there’s a huge difference in meaning between “now” and “not.”

    4. Make multiple passes with specialized purposes. When I proofed for Microsoft, I often made several passes, each time looking at one specific aspect such as the punctuation or consistency with terms. That made catching a particular sort of error more likely. Do less each time, and you can make each pass more quickly and probably less tiringly.

    5. Make heavy use of search and replace. If you find a type of error that might be repeated, use search to see if there are other instances. That also means you have one less thing to watch for as you read. Search for “smith” for instance, and you know that this person named Smythe is always spelled that way, that you didn’t occasionally type it as Smith.

    6. Take frequent breaks. Proofing is a grind and proofing what you wrote is particularly hard since you know what you meant and often see what you think is there rather than what is there.

    Good luck making that book digital and typo-free.

    –Michael W. Perry, Inkling Books

  2. Julanna

    Those of us who proofread Gutenberg books or Trove newspaper articles know how much work is involved. You need the images next to the OCR text to get it right and I always have the guidelines open to quickly check something new or unusual I find. Of course the people working on Gutenberg books are volunteers so the money constraints aren’t such a problem. The books go through 2 proofreading rounds, the first round being checked by the second set of proofreaders, and they are done page by page by lots of people so the fatigue problem is lessened, before they even think about formatting. And if a reader spots a mistake after that it is fixed quickly so future downloads are better.

    I like the idea of using text to speech at one level of the work. Nice.

    Gutenberg makes a special font available to download to your own computer that enables easy identification of letters and numbers that may look similar. It has improved the workflow so much. Both Gutenberg and Trove enable enlarging the original image to make it easier to read unclear images.

    If at all possible, if the writer has the original (edited and corrected) word or text document it’s best to work from that for the e-book version.

    1. Ben Denckla

      Thanks for your comment: it seems freely available texts produced by dedicated volunteers may be achieving far higher quality than commercial texts! Perversely, I have even heard of people breaking the DRM on commercial ebooks to fix errors and (illegally) distributing these improved copies!

      1. Julanna

        😀 I have heard of that too but I haven’t needed that method myself. This …probably… wouldn’t happen so much if publishers were a little more responsive about fixing mistakes. I wrote to a publisher once about how the word even had replaced the word ever through the entire e-book and the only response I received was asking for the page numbers … I didn’t know as much back then but I’m pretty sure now it was a book originally published in paper and then OCRed badly.
