an analysis of Danbooru tags and metadata
Danbooru (NSFW) is a site that hosts anime-related images and lets users tag them based on their features. This post provides a rough statistical analysis of Danbooru's image tags and metadata based on publicly available data, with a focus on trends and scores.
The methodology for the data analysis and details about the companion tool Danbooru Tags Explorer can be found at the bottom of this post. Importantly, the data for this analysis is sourced from Danbooru2021, which means the data points end at the start of 2022.
graphs and commentary
post rating category counts
An overwhelming majority of content on Danbooru is actually rated safe. Danbooru's active moderation of submitted images has likely contributed to this, as its focus has historically been on archiving high quality images rather than explicit images with no other qualities (#1524, #1966).
average scores for post rating categories
Users really like explicit content. It is also likely that explicit images demand greater skill from artists to draw, and so the overall quality of such images is also higher.
post tag counts
The most common number of tags on an image seems to be 26. The post with 1150 tags has lots of Pokémon.
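As a rough illustration of how the most common tag count can be found, the sketch below builds a histogram of tags per post and takes its mode. The post layout (a dict with a `tags` list of tag names) and the sample data are assumptions made for this example, not the actual Danbooru2021 schema.

```python
from collections import Counter


def tag_count_distribution(posts):
    """Map each tag count to the number of posts that have that many tags."""
    return Counter(len(post["tags"]) for post in posts)


# Hypothetical sample; real posts would come from the Danbooru2021 metadata.
sample_posts = [
    {"id": 1, "tags": ["1girl", "solo", "touhou"]},
    {"id": 2, "tags": ["1girl", "smile", "long_hair"]},
    {"id": 3, "tags": ["scenery"]},
]

distribution = tag_count_distribution(sample_posts)
# most_common(1) returns the (tag count, number of posts) pair with the most posts.
most_common_count, num_posts = distribution.most_common(1)[0]
print(most_common_count, num_posts)  # prints "3 2"
```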
post scores with tag count
Posts with a higher tag count tend to have a higher score up to a certain point, beyond which higher tag counts appear to correlate weakly with a lower score. Given how scattered the data points are, this may be because there are too few posts at high tag counts, which destabilizes the average score.
The data for this graph has been truncated to posts with a tag count of 200 or less.
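The averaging and truncation described above can be sketched as follows. This is a minimal example under assumed field names (`tags`, `score`), not the code used for the graph. It returns the number of posts in each bucket alongside the mean so that sparsely populated tag counts are easy to spot.

```python
from collections import defaultdict


def average_score_by_tag_count(posts, max_tags=200):
    """Return {tag count: (mean score, number of posts)} for posts with at most max_tags tags."""
    totals = defaultdict(lambda: [0, 0])  # tag count -> [score sum, post count]
    for post in posts:
        tag_count = len(post["tags"])
        if tag_count > max_tags:
            continue  # truncate the long tail of heavily tagged posts
        totals[tag_count][0] += post["score"]
        totals[tag_count][1] += 1
    return {
        tag_count: (score_sum / count, count)
        for tag_count, (score_sum, count) in sorted(totals.items())
    }


# Hypothetical sample data; a small cutoff is used here to show the truncation.
sample_posts = [
    {"tags": ["1girl", "solo"], "score": 4},
    {"tags": ["1girl", "smile"], "score": 6},
    {"tags": ["scenery"], "score": 3},
    {"tags": ["a", "b", "c"], "score": 100},
]
averages = average_score_by_tag_count(sample_posts, max_tags=2)
print(averages)
```

Buckets with a low post count are the ones where a single outlier can swing the average.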
number of posts created over time
The rate at which posts were added to Danbooru accelerated significantly near the start of 2020. It is unclear what the principal cause is, but the advent of COVID-19 may have had an effect.
The anomalous data point at January 2022 is an artifact of the dataset being truncated at that time.
post scores with age
Newer images tend to be of higher resolution and quality, which means they should have a higher score. However, very recent images have not spent enough time on the site for a sufficient number of users to vote on them. This likely explains the reversed trend near the end of 2019.
The anomalous data points at the end of 2021 are likely due to there being too few posts, which destabilizes their average score.
post scores normalized by votes over time
As Danbooru allows users to both upvote and downvote posts, the data points on this graph represent an average ratio between upvotes and downvotes on posts. This removes the effect that the size of Danbooru's user base over time has on the absolute score of a post.
It is unclear what causes the dislocated group of data points between 2013 and 2016. The sum of the upvotes and downvotes should also be equal to the score of a post, but this is not the case for every entry.
Ignoring that time period, it seems that the average score ratio across Danbooru's lifetime is roughly 0.9. In absolute terms, this means for every 9 upvotes on a post, there is 1 downvote. This suggests that newer images might not actually be considered higher quality, but just that more users are voting on them.
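One way to read the 0.9 figure is as the share of votes that are upvotes, which matches the 9-to-1 example above. The sketch below computes that share per post and averages it. The field names `up_score` and `down_score` are assumptions, and the absolute value is taken so the function works whether downvotes are stored as negative or positive numbers.

```python
def upvote_fraction(up_score, down_score):
    """Share of votes that are upvotes, or None if the post has no votes."""
    upvotes = abs(up_score)
    downvotes = abs(down_score)  # assumption: downvotes may be stored as a negative number
    total_votes = upvotes + downvotes
    if total_votes == 0:
        return None
    return upvotes / total_votes


def mean_upvote_fraction(posts):
    """Average the per-post upvote share, skipping posts without any votes."""
    fractions = []
    for post in posts:
        fraction = upvote_fraction(post["up_score"], post["down_score"])
        if fraction is not None:
            fractions.append(fraction)
    return sum(fractions) / len(fractions) if fractions else None


# Hypothetical sample data.
sample_posts = [
    {"up_score": 3, "down_score": -1},
    {"up_score": 1, "down_score": -1},
    {"up_score": 0, "down_score": 0},  # no votes, so it is ignored
]
print(upvote_fraction(9, -1))
print(mean_upvote_fraction(sample_posts))
```

Because each post contributes one ratio regardless of how many votes it received, this average is not inflated by periods when more users were voting.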
post tag counts over time
Newer posts tend to have more tags on them. This is likely due to a combination of factors including the growing strength of Danbooru's tags wiki, more stringent tag counts for posts to pass moderation, and a more diverse database of tags over time.
top tags by average post score
The top-scoring tags change significantly depending on the minimum post count required for a tag to be included. As expected, many of these tags refer to NSFW image features.
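The effect of the minimum post count can be sketched as below. The function name, the field names (`tags`, `score`) and the thresholds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def top_tags_by_average_score(posts, min_posts, limit=10):
    """Rank tags by mean post score, keeping only tags with at least min_posts posts."""
    score_sum = defaultdict(int)
    post_count = defaultdict(int)
    for post in posts:
        for tag in set(post["tags"]):  # count each tag at most once per post
            score_sum[tag] += post["score"]
            post_count[tag] += 1
    ranked = [
        (tag, score_sum[tag] / post_count[tag], post_count[tag])
        for tag in post_count
        if post_count[tag] >= min_posts
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked[:limit]


# Hypothetical sample data with made-up tag names.
sample_posts = [
    {"tags": ["common_tag", "other_tag"], "score": 10},
    {"tags": ["common_tag"], "score": 2},
    {"tags": ["rare_tag"], "score": 50},
]
print(top_tags_by_average_score(sample_posts, min_posts=1))
print(top_tags_by_average_score(sample_posts, min_posts=2))
```

With a threshold of 1, the rare tag leads on the strength of a single high-scoring post. Raising the threshold to 2 removes it, which is the same sensitivity described above.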
methodology
Technical instructions for reproducing these results can be found on this post's companion GitHub repository.
The raw data used for this analysis comes from Gwern's Danbooru2021 dataset. Danbooru2021 contains not only Danbooru's image metadata and tags, but also the images themselves. While the metadata can be sourced directly from Danbooru's BigQuery mirror, Danbooru2021's snapshot is used instead to aid with reproducibility.
confounding factors
Posts without a valid ID are ignored. This includes deleted and banned posts. The list of banned artists is small, so their absence is unlikely to have a noticeable effect on the results. Posts from banned artists may normally be expected to correlate with a higher score.
danbooru tags explorer
Danbooru Tags Explorer is a tool for exploring the correlations between tags on Danbooru. For a given tag, it ranks the other tags that are most likely to appear alongside it:
For all posts that have the touhou tag, 64% of them also have the 1girl tag.
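The ranking described here amounts to a conditional frequency: of the posts carrying a given tag, what fraction also carry each other tag. The sketch below shows that calculation under assumed names and sample data. It is not the explorer's actual implementation, which lives in the companion repository.

```python
from collections import Counter


def cooccurrence_ranking(posts, tag, limit=10):
    """Rank other tags by the fraction of posts with `tag` that also carry them."""
    together = Counter()
    posts_with_tag = 0
    for post in posts:
        tags = set(post["tags"])
        if tag not in tags:
            continue
        posts_with_tag += 1
        together.update(tags - {tag})  # count every other tag on this post once
    if posts_with_tag == 0:
        return []
    return [
        (other, count / posts_with_tag)
        for other, count in together.most_common(limit)
    ]


# Hypothetical sample data.
sample_posts = [
    {"tags": ["touhou", "1girl", "solo"]},
    {"tags": ["touhou", "1girl"]},
    {"tags": ["touhou", "1girl", "hat"]},
    {"tags": ["touhou", "scenery"]},
    {"tags": ["1girl", "original"]},
]
ranking = cooccurrence_ranking(sample_posts, "touhou")
print(ranking[0])  # prints "('1girl', 0.75)"
```

The relationship is not symmetric: the fraction of touhou posts tagged 1girl generally differs from the fraction of 1girl posts tagged touhou.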
This tool can aid the construction of image generation prompts that accept Danbooru-style tags. It is particularly useful for augmenting uncommon tags such as character tags with other sets of tags that image generation models are more able to understand. An example with the character Suguri from the SUGURI series:
Images are generated using NovelAI and cherry-picked for safeness. The source code for the explorer is in this post's companion GitHub repository.