Automated Cleanup for Word Manuscripts

Expert publishing blog opinions are solely those of the blogger and not necessarily endorsed by DBW.

If you have been doing ebook development for any amount of time, you have heard a lot about XML workflows. These conversations, sometimes in hushed tones of trepidation and other times in hopeful anticipation, usually end in the conclusion that an XML workflow is just not possible, is too costly, or lacks tools accessible to anyone other than the biggest publishers.

And that may all be true. In my opinion, while I do believe it will happen eventually, the jury is still out on how long it will be until we see XML workflows building the majority of print and ebook files. However, that should not keep us from working toward that eventual goal.

The question is, what do we do in the interim? While there are some XML content management systems available, most publishers are continuing to use the same Word-to-InDesign-to-EPUB process that has become commonplace. This process usually involves little or no work in Word itself, though sometimes an enterprising publisher will develop a style system in editorial that can be mapped more easily to paragraph and character styles when the Word doc is imported into InDesign. Once the content is loaded into InDesign, the production and editorial teams still have to clean up the text, fix issues and craft the content into something resembling a book.

The standard Word-to-InDesign process can also bring with it a lot of junk code and styling, and I have often heard designers say that they just strip all the styling out and start from scratch in the InDesign file. That adds time and potential for error to the process, both of which I think are avoidable.

How many times have you had to run a bunch of GREP commands in InDesign to find common issues and fix them? How many times have you looked at each and every em-dash to ensure it has the right spacing, or tried to figure out what structure the editor or author had in mind when adding those five different heading styles?

This part of the process between Word and InDesign is ripe for change. One of the most important lessons I learned when I started out building ebooks in 2002 was the value of building automated quality control steps into my process. Yes, you can certainly do a lot of things by hand, but that does not scale well and it is not usually very fun.

Automated quality control takes on many forms. Obviously there are important tools like EPUBCheck and FlightDeck that can be used to catch issues in the EPUB file, but there are issues that those tools can’t currently catch. Also, the end of the ebook creation process is not really the right time to truly engage textual quality issues; it should be done before the print book is designed.

What publishers need are automated quality testing steps earlier in the process. Those could take on many forms, especially considering the diversity of the production processes used by different publishers, but here are a few options that I think you may find helpful:

1. That sweet spot between the Word manuscript creation and the InDesign import is the best place to do these kinds of quality control steps. Bookalope handles a lot of that work for you automatically. It can import your Word document, analyze the structure of the book just based on the formatting used (no Word Styles needed), run a full complement of checks on the text to find and flag potential issues, and more. It then shows you the entire book in a non-code interface and allows you to verify and mark the styles and structure with semantic meaning (including deciding whether a given run of italic text is emphasis, a citation, or something else). Once you are finished, Bookalope can export an IDML (InDesign Markup Language) file that you can then import directly into InDesign for your print production. The work that Bookalope does is invaluable for finding the kinds of issues we are talking about here, and for making it easier to apply semantic, accessible markup to your books earlier in the production process.

2. Another option is creating your own quality control checks in Word. The VBA (Visual Basic for Applications) macro language in Word is not a complex language to learn, and there are also developers out there who know it very well. If you already have a style guide in place and are requiring your authors or editors to apply those styles to manuscripts before they are sent to production, then this might be a good solution for you. You can have a series of checks run on the content to look for consistency in the styling, highlight potential issues with the text, and more, then fix those issues within Word itself and continue with your import. As with all homebrew solutions, this can be costly to create and will require periodic tweaks, but for some tech-savvy publishers it can work quite well.

3. If you are comfortable scripting InDesign you could probably come up with a similar process to run after the Word file is imported. There are also some good InDesign script creators out there who can help with that.

4. The last option I’ll throw out there might be a bit odd, but it could also work: Export the Word document as Web Page (Filtered) and run a series of scripts on it that can do the majority of this cleanup for you. These could be written in Python, Perl, JavaScript or other languages, and can be customized to your styles and your process. When you are finished, you would have an HTML (or even XML) file that can be imported into InDesign.
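To make option 4 a little more concrete, here is a minimal Python sketch of what such a cleanup script might look like. The rules are my assumptions about the kind of residue a Web Page (Filtered) export tends to carry (inline style attributes, leftover spans with no attributes, empty paragraphs, inconsistently spaced em-dashes), and closing up the em-dashes is just one possible house style. The names clean_word_html and heading_levels are hypothetical. Treat it as a starting point to adapt to your own styles, not a finished tool.

```python
import re

# Ordered (pattern, replacement) rules. Each one is an illustrative
# assumption about Word HTML residue, not a complete or official list.
RULES = [
    # Drop inline style attributes (single- or double-quoted values).
    (re.compile(r"""\s+style=(?:'[^']*'|"[^"]*")""", re.I), ""),
    # Unwrap attribute-less spans that contain text only (no nested tags).
    (re.compile(r"<span>([^<]*)</span>", re.I), r"\1"),
    # Remove paragraphs holding nothing but whitespace or &nbsp; entities.
    (re.compile(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*", re.I), ""),
    # Close up spaces around em-dashes (assumed house style).
    (re.compile(r"[ \u00a0]*(?:\u2014|&mdash;)[ \u00a0]*"), "\u2014"),
]


def clean_word_html(html: str) -> str:
    """Apply every rule in order, repeating until nothing changes."""
    previous = None
    while previous != html:
        previous = html
        for pattern, replacement in RULES:
            html = pattern.sub(replacement, html)
    return html


def heading_levels(html: str) -> list[str]:
    """Return the distinct heading tags in use, to spot odd structure."""
    found = re.findall(r"<(h[1-6])\b", html, re.I)
    return sorted({tag.lower() for tag in found})
```

You would read the exported file into a string, pass it through clean_word_html, and use heading_levels to see at a glance whether the manuscript jumps between heading levels in unexpected ways. Regular expressions are fine for this kind of narrow, predictable residue; for anything that depends on nesting, a real HTML parser is the safer choice.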

Automation is your friend. Whether you build something yourself, have something built for you, or use a tool that is already doing the job right now, adding this step in your process can easily result in more efficiency and better quality books. I recommend you think about how this can work for you, check out the solutions I suggested, and comment below if you have other ideas for quality control steps in this part of your process.


6 thoughts on “Automated Cleanup for Word Manuscripts”

  1. Rebecca Springer

    There are also off-the-shelf or semi-custom Word plug-in options: eXtyles, Editorium, and JouveEdit come to mind.

    I agree that manuscripts should be plain text with markup, but Track Changes is the real missing link.

  2. Ken Jones

    Good stuff Joshua.

    I’d like to mention that along with bespoke Mac development my company Circular Software is currently previewing at London Book Fair a new tool for InDesign called ‘GreenLight’ which runs quality control checks and fixes for print but equally for any documents destined for digital output.

    We aim to have some free checklists based on #eprdctn community suggested checks and will also have a paid for setup where publishers can dictate their own checks and fixes to include in their own checklists which are easy to share with others in their workflow.

    Anyone interested should please feel free to contact me via

    Ken Jones
    Circular Software

  3. Thad McIlroy

    This is great information, including Rebecca’s tips.

    I’m a big believer that, wherever possible, fixes should be applied in the source document. If they take place in InDesign or post-InDesign, they are essentially lost, i.e., if the book is reissued in a corrected or revised format all of the changes need to be re-applied.

    My go-to Word plug-in is PerfectIt. It amounts to a copy editing tool: when I use PerfectIt (along with tools from the Editorium suite) the amount of time real humans need to devote to copy editing plummets.

  4. Jens Troeger, Bookalope

    Thank you for the great write-up, Joshua!

    You make many good points regarding XML and the Word-InDesign-Epub workflow, all of which were motivators for us to build the Bookalope tools.

    In addition to the clean-up options you mention above, Bookalope puts a great deal of effort into the initial document ingestion, all without user interaction: it consolidates and removes styling (residue problems that Word carries in its internal document structure), checks the Unicode character encodings of all the text, consolidates duplicated and superfluous characters, automatically assigns semantic structure, and more.

    Internally, Bookalope works with XML that is similar to DocBook. While the XML is not yet exposed through the web site itself, the web API allows users to upload and download an XML version of their books, and soon we will add DocBook export. From that aspect, Bookalope offers a full XML workflow without the laborious XML exposure.

    With kind regards,

  5. Aaron

    Sublime Text 3 has a great plugin called RegReplace which allows you to string together and execute multiple saved RegEx’s which I have found work great for cleaning up my Word HTML. It really allows me to control what I want to strip out depending on the project.


