Is It Possible to Predict the Next New York Times Bestseller?

Expert publishing blog opinions are solely those of the blogger and not necessarily endorsed by DBW.

The upcoming book The Bestseller Code is getting a great deal of buzz, forcing many of us to ask the question: Can one genuinely predict what kind of book will become a New York Times bestseller (typically considered the most prestigious bestseller list)?

The promise of a formula for predicting a bestseller is getting many in the publishing industry and those who write about books excited, or at least curious. Several journalists contacted me for an opinion about the book because of my background in pub-tech and reader analytics. Thus, I became interested in reading it, and St. Martin’s Press was kind enough to provide me with an advance reader copy.

First of all, this is a delightful book to read. I would recommend it as both an entertaining and educational read for anybody interested in the business of books. This is not a magisterial work, like Merchants of Culture by John Thompson, but a book written for the mass market with plenty of anecdotes and examples that readers and authors can relate to.

The “code” is based on some of the latest advances in machine learning as applied to literature, but the authors attempt to simplify the computer science behind the book. There is no mention of “big data” or artificial intelligence—just plain and simple descriptions of what the “black box” does, with references for interested readers to find out more about its inner workings.

However, one statement in the book was misunderstood by many of those who discussed the book with me: the claim that the algorithm can predict whether a book will be a bestseller with an accuracy of 80 percent.

I had a sense when being interviewed that most journalists took this to mean something along these lines: "If there are something like 500 New York Times bestsellers this year, then this algorithm can produce a list of 500 titles, and 400 of those will indeed turn out to be bestsellers."

Well, that’s not actually what 80-percent accuracy means. The misunderstanding is in the “can produce a list of 500.”

One needs a bit of statistics to understand this concept properly, so I will first restate (with some statistical elaboration) how the authors describe the 80-percent accuracy claim:

If the algorithm is applied to 50 books that are genuinely bestsellers, then it will recognize that 40 of these (80 percent) are indeed bestsellers, but will classify incorrectly (“falsely”) that 10 of the books (20 percent) are not bestsellers (a “negative” result). Thus, the 10 titles that are missed are what statisticians call the “false negatives.”

The inverse is also true: if the algorithm is applied to 50 books that are known not to be bestsellers, then it will recognize that 40 of these (80 percent) are indeed not bestsellers, but will classify incorrectly (“falsely”) that 10 of the books (20 percent) are, in the opinion of the algorithm, in fact bestsellers (a “positive” result), when in fact they never were bestsellers. Thus, these 10 titles that are incorrectly predicted to be bestsellers are false positives.
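These two error types can be sketched in a few lines of code. This is a minimal illustration of the 80-percent figure as described above, not the authors' actual model; the function name and the 50/50 sample split are my own:

```python
# Hypothetical sketch: a classifier that is 80% accurate on both
# bestsellers (sensitivity) and non-bestsellers (specificity).
def confusion_counts(n_bestsellers, n_others, sensitivity, specificity):
    """Return (true_pos, false_neg, true_neg, false_pos) counts."""
    true_pos = round(n_bestsellers * sensitivity)   # bestsellers correctly flagged
    false_neg = n_bestsellers - true_pos            # bestsellers missed
    true_neg = round(n_others * specificity)        # non-bestsellers correctly rejected
    false_pos = n_others - true_neg                 # non-bestsellers wrongly flagged
    return true_pos, false_neg, true_neg, false_pos

tp, fn, tn, fp = confusion_counts(50, 50, sensitivity=0.8, specificity=0.8)
print(tp, fn, tn, fp)  # 40 10 40 10
```

With 50 books of each kind, the algorithm catches 40 real bestsellers, misses 10 (false negatives), clears 40 non-bestsellers, and wrongly flags 10 (false positives).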

Let’s construct a different scenario. Imagine a Barnes & Noble megastore in the American Midwest with 200,000 nicely ordered titles on its shelves, including 1,000 titles in a section called “Past and Present New York Times Bestsellers.”

Now a mob of Donald Trump supporters enters the store and throws all the books on the floor in protest of Trump’s Art of the Deal not being displayed in the bestseller section. They don’t actually take any of the books with them, however, so there are now 200,000 books lying in a jumble on the floor.

A poor B&N staff member is now assigned to put the 1,000 bestsellers back on the shelf, but, being new to the job, he has no idea what makes a bestseller and therefore decides to make use of this magical new algorithm.

The poor worker now tests all 200,000 books against the algorithm (stay with me).

When applied to the 1,000 bestsellers, the algorithm identifies 800 of them correctly as bestsellers, but dismisses 200 as not being bestsellers.

Now it gets interesting. When analyzing the remaining 199,000 books, the algorithm correctly identifies 80 percent of them (159,200 books) as not being bestsellers, but it believes (incorrectly) that the remaining 20 percent are. That is a whopping 39,800 books.

Our B&N staffer, using the algorithm, flagged a total of 40,600 (800 + 39,800) books as New York Times bestsellers. He found only 800 of the 1,000 bestsellers he was looking for, picked up 39,800 false “bestsellers” along the way, and missed 200 real bestsellers that the algorithm classified incorrectly. That is what 80-percent accuracy means.

We applied the algorithm to a large sample dominated by books that were not bestsellers, and as a result the algorithm produced many, many false positives.

It did do its job, though. Whereas the original 200,000 books contained only 0.5 percent bestsellers (i.e. 1,000 books), the new, smaller pool of 40,600 flagged books contains about 2 percent bestsellers (800 books), a roughly fourfold “enrichment.” That enrichment came at the cost of 200 bestsellers going missing, because the algorithm is not 100-percent perfect.
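The whole bookstore scenario reduces to a few lines of arithmetic. A quick sketch, assuming (as above) the same 80-percent rate on both bestsellers and non-bestsellers:

```python
# The 200,000-book store: 1,000 true bestsellers, 199,000 others.
n_best, n_other = 1_000, 199_000
tp = round(n_best * 0.8)      # 800 bestsellers correctly flagged
fn = n_best - tp              # 200 bestsellers missed
fp = round(n_other * 0.2)     # 39,800 non-bestsellers wrongly flagged
flagged = tp + fp             # 40,600 books the staffer would shelve

precision = tp / flagged                  # share of flagged books that are real bestsellers
base_rate = n_best / (n_best + n_other)   # 0.5% bestsellers in the original pile
print(flagged, round(precision, 3), round(precision / base_rate, 1))
# 40600 0.02 3.9  -> roughly a fourfold enrichment over the 0.5% base rate
```

The low precision despite "80-percent accuracy" is the base-rate effect: when 99.5 percent of the pile is non-bestsellers, even a 20-percent false-positive rate swamps the true positives.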

We could play this thought experiment a bit differently. Suppose the staffer is lazy and fills the shelf with the first 1,000 books that the algorithm identifies as bestsellers. Based on the enrichment factor above, we know that among the first 1,000 books the staffer selects, only about 2 percent (i.e. 20 books) will be bestsellers. So the new “bestseller” shelf will consist almost entirely of books that are not bestsellers. There is even a roughly 1-in-200 chance that Trump’s book will end up on the shelf.
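As an aside, the 1-in-200 figure can be checked directly. A sketch under the same 80/20 assumptions, where 40,600 is the total number of flagged books from the scenario above:

```python
# Probability that one specific non-bestseller (Trump's book) lands on
# the lazy staffer's shelf of the first 1,000 algorithm-approved books.
p_flagged = 0.2                          # chance a non-bestseller is wrongly flagged
p_shelved_if_flagged = 1_000 / 40_600    # chance a flagged book is among the first 1,000
p_on_shelf = p_flagged * p_shelved_if_flagged
print(round(1 / p_on_shelf))  # about 1 in 203, i.e. roughly 1 in 200
```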

Now, this result doesn’t sound quite as impressive, does it? But this is what 80-percent accuracy means. And given that something like a million new books and manuscripts are written every year, an algorithm with 80-percent accuracy will simply not cut it; it will not turn publishing on its head.

Don’t be deterred from reading the book, though. It still offers some genuine and novel insights as to what makes a bestseller. But that said, it is not going to put acquisition editors out of their jobs.

What should not get lost in all this, however, is that machines are getting smarter, machine learning is improving, and artificial intelligence is getting more intelligent. So what if the algorithm were 99.9-percent accurate rather than just 80-percent accurate? In that case, the staffer would have correctly identified 999 of the 1,000 bestsellers lying on the floor as New York Times bestsellers and missed only one.

But the staffer also had to test the 199,000 other books, and that would have produced 199 “false positives,” meaning he would have 1,198 books to put on the shelves—198 more than he would have expected if the algorithm were 100-percent accurate (like an inventory list with no mistakes or typos).
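Rerunning the earlier sketch with a hypothetical 99.9-percent accuracy reproduces these numbers:

```python
# Same 200,000-book store, but with a 99.9-percent-accurate algorithm.
n_best, n_other = 1_000, 199_000
tp = round(n_best * 0.999)     # 999 bestsellers correctly identified
fn = n_best - tp               # 1 bestseller missed
fp = round(n_other * 0.001)    # 199 non-bestsellers wrongly flagged
print(tp + fp, fn)  # 1198 books to shelve, 1 bestseller missing
```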

Now that would sound a heck of a lot more impressive, but an algorithm that is 99.9-percent accurate is still a long way off for the simple reason that human taste and fashion are so incredibly unpredictable.

Book publishing will always be a bit of a lottery, but that does not mean the odds cannot be improved with good data and smart algorithms. At my own company, Jellybooks, the emphasis is on generating good data. That means understanding how people read books and when they recommend them, not just judging success based on sales data or a book’s position on a particular bestseller list.

Going forward, code will appear more and more in publishing even if it can’t write novels yet or predict with 100-percent accuracy the next New York Times bestseller.


5 thoughts on “Is It Possible to Predict the Next New York Times Bestseller?”

  1. Michael W. Perry

    Quote: “Now a mob of Donald Trump supporters enters the stores…”
    I’m not a Trump supporter, but I do find this politicizing irritating. We all know that if Trump were running as a Democrat, which he well might, the news media would be delighted by his ‘frankness’ and championing his appeal to “the common man.”

    I suspect what’s going on here is like the computer modeling that was used to “predict” global warming. The models were tweaked to match recent data (think 1990s) and to create a disaster scenario, since that meant lots of funding for climate research. (To understand science, always follow the money.) But as soon as those models had to deal with the future, they fell apart. None predicted the last fifteen years plus of essentially flat global temperatures.

    In this case, it’s easy to run recent bestsellers through a computer, examining hundreds of “criteria” and finding some set of them that “predicts” (after the fact) which became bestsellers. But apply those same criteria to future books, and that contrived relationship is likely to collapse.

    It’s a well-known issue in medical research. Study enough factors and there’ll always seem to be a correlation somewhere. Look for environmental “causes” of childhood leukemia, add enough factors, and you’ll discover that, for a particular set of data, children who get leukemia are more likely to live in (say) green houses or within (say) half a mile of a McDonald’s. That means nothing. Use a different set of data and that correlation disappears. That happens all the time with criteria that seem more credible than house color. The ability of computers to crunch enormous quantities of data often turns up bogus results.

    Don’t forget something else. What publishers can use, authors can also use. The latter can rewrite their stories to give them a high ‘bestseller’ score. Ironically, even if the story itself is unremarkable, it may do well because the publisher, seeing that high score, puts a lot of money into promoting it. Self-fulfilling prophecy.

  2. andrew

    Hi Michael,

    Having Trump supporters mess up a BN store was a more plausible scenario than having an earthquake causing all the books to tumble on the floor, so I made use of it. Yep, I am a sneaky Londoner in that regard (we have our own Brexit earthquake to contend with).

    If you read the book I discussed (“The Bestseller Code” is out on 20th September), you will notice that large parts seem to be very much aimed at an audience of self-published authors. Some may indeed use it as a kind of “editorial guidance.” However, you can’t create a “masterpiece” by “painting by numbers,” and the same applies to any kind of blockbuster.

    The best music producers can get a song onto the charts using a “formula,” but rarely into the Top 10. It’s the same in the movie business: some kinds of story arcs consistently work, but for a blockbuster you need something “extra.” At the same time, movie studios do test screenings all the time to collect data on whether they should spend the big marketing budget or not (we do something similar at Jellybooks…)

    However, the focus of my post was on what it means to say with 80% accuracy that a book might be a bestseller. My point was to highlight that 80% accuracy carries much weaker predictive power than some people imagine. I used Trump supporters in an attempt to make a basic statistical concept more accessible …

    Quite frankly it was more fun writing the story that way even if the “numbers” suggested I should have used a more “bland” example…

  3. CJ

    Enjoyed the article, Andrew, and thought your explanation of 80% accuracy was about as clear as one can get without sitting people on the floor with colored marbles. I look forward to reading the book. Michael’s comments remind me of a few students I’ve had. Both are good reminders that no matter how we strive to be clear, the author has no control over how the reader interprets what was written.

  4. Lizzienewell

    Interesting article. The next question, which the book might address, is what happens when authors get hold of the algorithm and use it to write books. The result would be a glut of books written according to the algorithm, and it would become useless for predicting which of them would come out on top. Probably that unpredictable 20 percent would win out.


