The Literary Darknet of Independent Publishing
The independent and self-publishing space recently found itself in a cascading bit of drama that escalated to affect everyone from Amazon and Barnes & Noble to WHSmith and Kobo. It began with an article on The Kernel about how Amazon sells incest, rape, and underage erotica in its online bookstores. This is not mild content.
The story quickly spread through larger news channels to include virtually every major online retailer, though somehow, the Google Play Store escaped notice, despite having the exact same content. WHSmith, the respected online book seller, responded by shutting down their entire site to categorically remove all independent books until they could be verified “clean.” In case it’s back up by the time this article goes up, the image below is what a major site looks like when the universe implodes.
The relative ease with which independent authors can publish content directly to a digital store has created a tremendous swell in content with no editorial oversight. The vast majority of these titles have almost no reliable metadata about what’s in them. It is a large, invisible ocean of content that most people are not really aware of.
Learn more about the future of ebook retail at Digital Book World 2014
The Literary Darknet
On the internet, the Darknet is a collection of underground or largely unindexed websites that you have to know exist in order to find. A lot of questionable content has grown around these Darknet communities — if you’re familiar with the Silk Road that was recently taken down by the authorities, you’re at least partly familiar with the Darknet.
The invisible, generally unregulated ocean of written content coming onto the market from the self-publishing community is, in some ways, a literary equivalent of the Darknet. This tremendous volume of content is far greater than any current social-based review system can handle, not only from a sexual content standpoint, but from a review and discovery standpoint. The vast majority of these books have zero reviews and zero star ratings on even the largest social review sites. You can see this in the hundreds of pages of “zero rating” books in almost any Goodreads keyword search.
This creates a problem. Online retailers like Amazon, Google, and B&N end up putting books on their shelves without content oversight.
How to Map the Literary Darknet
Contrary to popular belief, there is a way to map these sorts of issues, and to do so with millions of books. The Book Genome Project, where I work, has spent years building and tuning computer-based tools that catalog the vast amount of invisible content, generally books that don’t have the marketing resources to be visible on social discovery sites. We also build tools to help retailers identify and reclassify books with potentially objectionable content, such as flagging a Juvenile title that has sex, bestiality, or incest in it. We do this on a scene-by-scene basis in a book, and we do it at scale — normally in the range of 40,000 to 100,000 titles a week.
You can read more details about how our tools work here in an article we did about the impact of 50 Shades of Grey on sexual content in publishing, but for a quick glimpse of what our system sees when it looks at a book, here’s a single sexual content graph from that article: 50 Shades of Grey, from beginning to end of the book. Each block represents roughly 1,000 words. Green means no sexual content. Yellow means some. Red means… well…
From our perspective, we’re mostly interested in whether or not a book is in the right category. As Erotica, this graph wouldn’t have raised an eyebrow, but if it had been misclassified as Juvenile Fiction we would certainly have flagged it. To give you an example that’s more specific to this topic, here’s a graphic showing the sexual content of one of the objectionable books identified by The Kernel as being for sale at Amazon, called Daddy’s Invisible Condom. This book was flagged as both Erotica and Incest by our automated tools:
As you can see, virtually every scene in this book contains sexual content, and as the name implies, incest or pseudo-incest features throughout. It’s also interesting to note that almost immediately after this book was highlighted in the article on The Kernel, the name of the book was changed from Daddy’s Invisible Condom to simply Invisible Condom, removing the ability for title-based screening methods to identify it as containing incestuous themes. However, when we ran the book through our system, it was still flagged as containing those themes anyway, meaning that at the time of analysis the incestuous content was likely still there.
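As a rough illustration of the block-by-block view described above, the sketch below splits a text into blocks of about 1,000 words, maps a per-block score to green, yellow, or red, and flags a book for review when a non-adult category contains red blocks. This is not the Book Genome Project’s actual tooling: the scoring function is supplied by the caller, and the thresholds, function names, and the set of adult categories are all hypothetical.

```python
from typing import Callable, FrozenSet, List

BLOCK_WORDS = 1000  # the article describes blocks of roughly 1,000 words


def split_blocks(text: str, size: int = BLOCK_WORDS) -> List[str]:
    """Split text into consecutive blocks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def color_for(score: float, some: float = 0.2, heavy: float = 0.6) -> str:
    """Map a 0..1 score to a color. Both thresholds are hypothetical."""
    if score >= heavy:
        return "red"
    if score >= some:
        return "yellow"
    return "green"


def content_map(text: str, score_fn: Callable[[str], float],
                size: int = BLOCK_WORDS) -> List[str]:
    """Return one color per block, using a caller-supplied scoring function."""
    return [color_for(score_fn(block)) for block in split_blocks(text, size)]


def needs_review(colors: List[str], category: str,
                 adult_categories: FrozenSet[str] = frozenset({"Erotica"})) -> bool:
    """Flag a book whose category is not adult but whose map contains red blocks."""
    return category not in adult_categories and "red" in colors
```

Under these assumptions, the same content map would pass without comment when the book is shelved as Erotica and would be flagged when it is shelved as Juvenile Fiction, which is the category check the article describes.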
What Percentage of Self-Published Books Are Erotica?
Now, let’s look at books with enough sexual content that they’re likely to be considered erotica. These are self-published books whose amount and type of sexual content puts them statistically in the erotica category established by traditional publishers. In our observations, roughly 28.5% of self-published content falls into this category. This is based on a “slice of life” sub-sample of data covering several tens of thousands of books. I have no concrete way to estimate how representative that sample is of all self-published content, so I would not treat it as necessarily representative, though I believe it’s relatively typical as self-published content goes. I can only speak to what we’ve observed: a little under 30% of the independent content in our sample fell into the sexual category. For comparison, about 1.11% of the roughly 110,000 traditionally published books in the Book Genome Project fall into the Erotica category, though I had difficulty tracking down an industry-wide breakdown of Fiction categories.
This supports my personal observations that the self-published marketplace is producing a great deal of sexual content compared to traditional sources.
Type of Erotic Content in Self-Published Titles
Here’s the tough question: How much of this content is of concern to a company like Amazon, Kobo, Google, or news outlets like The Kernel? If we define Erotic Incest and Erotic Bestiality as objectionable, how many books are we actually talking about here?
That, too, we can provide some information on. Out of any given 1,000 self-published books that we observed, roughly 19 (1.9%) will contain erotic incestuous themes, and 9 (0.91%) will contain erotic bestiality themes. Put another way, just under 3% of self-published titles are likely to contain objectionable content by The Kernel’s definition.
There are many ways to spin that, depending on your particular view. On one hand, this means that 97% of self-published titles do NOT contain this content. Yes, that is substantially more than we’ve found in traditional publishing (we’ve observed virtually no erotic incest or bestiality in traditional titles), but self-published books are overwhelmingly about something other than those themes.
On the other hand, if you’re inclined to look at it from the other direction, it potentially indicates that the amount of questionable content in self-published books is significant. Another way of stating our observations is that nearly 1 out of 10 erotic titles in our self-published sample contained either bestiality or incest. Personally, I find it more eye-opening to compare that potential 2.81% overall objectionable content rate in our sample with the prevalence of common genres in traditional publishing. For example, it would be nearly three times the percentage of traditionally published Cookbooks in 2010, which made up only 1.04% of total new books. Sports titles made up only 2.26%. If Erotic Incest/Bestiality were a single category of books, it would be a larger category than nearly half of the genres listed in Bowker’s data, and bigger than most sub-categories of Fiction:
Do I really think that the combined categories of self-published Erotic Incest and Bestiality compete in scale with Computer or Literature books? I certainly think it’s possible, but there are some caveats that have to be included.
- There might be substantially more Incest & Bestiality books because: There are more self-published books published each year than traditionally published books. As a consequence, 3% of the self-published books is likely to be far more books than 3% of traditionally published ones. In terms of sheer numbers, there could be substantially more Incest books coming onto the market than this data implies.
- There might be substantially fewer Incest & Bestiality books because: Not all self-publishing companies attract the same authors, and we didn’t shape the data to represent source distribution. I do believe that the books we’ve observed are highly similar to what most people think of as “self-published” — the sort you’d expect to see in CreateSpace, Smashwords, Amazon Kindle Direct, and other similar publishers. But I’d never try to pass the above numbers off as somehow a complete picture of the universe of self-publishing; even if we had access to those books, most of that data would be proprietary and we wouldn’t be able to share it. So, this is more an indication of the potential scale in a single slice, not definitive.
- There are no sales numbers in this data. As with any long tail, it’s likely irrelevant how many books on a topic are available compared to how many people are reading them. After all, does it matter that there is really objectionable content in the long tail of the book market if no one ever sees or purchases it? As a percentage of sales volume, they could be virtually invisible. They could also be one of the few categories of the market that’s filling a niche not already addressed in traditional publishing. The answer to that would require additional data I don’t have.
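The percentages quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the article. Summing the incest and bestiality rates assumes no book falls into both groups, which is an assumption of this example and makes the combined figure an upper bound. The function names are illustrative only.

```python
# Percentages quoted in the article (self-published sample unless noted).
INCEST_PCT = 1.9
BESTIALITY_PCT = 0.91
EROTICA_PCT = 28.5
COOKBOOKS_PCT = 1.04  # traditionally published titles, 2010
SPORTS_PCT = 2.26     # traditionally published titles, 2010


def combined_objectionable_pct() -> float:
    """Sum of the two rates; assumes the two groups do not overlap."""
    return INCEST_PCT + BESTIALITY_PCT


def share_of_erotica_pct() -> float:
    """Objectionable titles as a percentage of erotic titles in the sample."""
    return combined_objectionable_pct() / EROTICA_PCT * 100


def ratio_to(genre_pct: float) -> float:
    """How many times larger the combined rate is than a given genre's share."""
    return combined_objectionable_pct() / genre_pct


if __name__ == "__main__":
    print(f"combined rate: {combined_objectionable_pct():.2f}%")
    print(f"share of erotica: {share_of_erotica_pct():.1f}%")
    print(f"vs. Cookbooks: {ratio_to(COOKBOOKS_PCT):.1f}x")
    print(f"vs. Sports: {ratio_to(SPORTS_PCT):.1f}x")
```

The combined rate works out to 2.81%, which is a little under one tenth of the 28.5% erotica share and about 2.7 times the 1.04% share for Cookbooks.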
How Do We Know — The Tools for Mapping Content in the Literary Darknet
In order for any of the above to have any validity, even as a curiosity, it requires some faith in the technology used to generate the data. The Book Genome Project focuses on using computers to understand the thematic, emotional, and stylistic make-up of the content of a book. It’s often been compared to the Pandora of the book industry, in terms of methodology. Every theme that we measure is done on a scene-by-scene basis, allowing for a very granular degree of content mapping throughout a book. In terms of accuracy, our tools for identifying erotic content have a better than 99% catch rate, and a less than 1% false positive rate. The same is true with bestiality. For a more detailed example of measuring sexual content in books, check out how Fifty Shades of Grey impacted the amount of sexual content in Romance.
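To give a sense of what those accuracy figures could mean in practice, the sketch below converts a catch rate and a false positive rate into expected counts for a batch of books. It rests on several assumptions that are not from the article: the rates are treated as exactly 99% and 1% (the article gives them only as bounds), "false positive rate" is read as the share of non-erotic books that get flagged, the batch size of 10,000 is arbitrary, and the names `FlagEstimate` and `estimate_flags` are hypothetical. The 28.5% prevalence is the erotica rate observed in the self-published sample.

```python
from dataclasses import dataclass


@dataclass
class FlagEstimate:
    true_flags: float   # erotic books correctly flagged
    missed: float       # erotic books not flagged
    false_flags: float  # non-erotic books flagged by mistake

    @property
    def precision(self) -> float:
        """Share of all flagged books that are actually erotic."""
        flagged = self.true_flags + self.false_flags
        return self.true_flags / flagged if flagged else 0.0


def estimate_flags(total_books: int, prevalence: float,
                   catch_rate: float, false_positive_rate: float) -> FlagEstimate:
    """Expected counts for a batch, given prevalence and the two error rates."""
    positives = total_books * prevalence
    negatives = total_books - positives
    true_flags = positives * catch_rate
    return FlagEstimate(
        true_flags=true_flags,
        missed=positives - true_flags,
        false_flags=negatives * false_positive_rate,
    )


if __name__ == "__main__":
    est = estimate_flags(10_000, prevalence=0.285,
                         catch_rate=0.99, false_positive_rate=0.01)
    print(est, round(est.precision, 3))
```

Under these assumptions, a batch of 10,000 titles would yield roughly 2,820 correct flags, about 70 false flags, and fewer than 30 missed titles. Since the stated rates are bounds, the real figures could be better than this.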
If you’re interested in the more general application of the Book Genome Tools on search and discovery, or you happen to be a Stephen King fan, check out Visualizing the Data of Stephen King.
For more information on BookLamp or the Book Genome Project, feel free to visit here, or fire questions my direction.