It'll be our dirty little secret
Posted Dec 19, 2017
Here’s another update on what I’ve been doing lately, for the 1 person who follows my blog. Hi mom! Just kidding. My mom doesn’t read my blog.
Years ago, I joined a tiny IRC channel on a whim and ended up liking the people so much that I stayed there ever since, and I have even visited some of them in real life. At one point I made an IRC bot called nda specifically for the channel. IRC bots are the classic Baby’s first Python program, but for once it became a thing everyone actually liked (or at least tolerated). It did a bunch of things, but its best features were arguably !quote and !porn.
!quote was a system that would record every message in the channel and save it to a SQLite database. You could then query it using the !quote command, which would give you a random message from the entire collection. I wrote a simple script that would import years and years of chat logs acquired from one of the channel founders, so it ended up with a table of over a million messages to choose from. I initially ran nda on a Raspberry Pi under my desk, but the large quantity of quote data soon meant that I had to move it to the cloud to have any semblance of responsiveness. nda was single-threaded and blocked the event loop for the duration of the (quite expensive) database query.
Side note: the queries were expensive because the only way to get a random row from an arbitrary result set is doing ORDER BY RANDOM() LIMIT 1
. There are a number of tricks to improve on this, and for nda, the best solution I came up with was:
- perform a
SELECT COUNT(*)
of the matched rows ascount
- find a random number
r
in the range0 ≤ r < count
- perform the actual
SELECT *
usingLIMIT 1 OFFSET r
This avoids ordering the entire result set at the cost of performing a count, which is pretty fast, and skipping through a number of rows, which is slower the larger r
happens to be. My solution had decent performance for the use case, but in retrospect, this strategy would have been even faster had I thought of it.
The other feature, aptly named !porn, was pretty simple in comparison: you typed in !porn and it gave you a link to a random adult video and the longest user comment on that video. Okay, okay, I know how it sounds. But it’s a lot of fun. I don’t want to kinkshame anyone, but some people practically write entire novels in the comment section of porn sites – which I know exclusively from the professional research conducted in order to develop this feature – and it can be absolutely hilarious.
So, how would you go about finding a random video? For some sites, like Reddit or Imgur, it’s easy: you either have an API to help you out, or the URL pattern is simple enough to just brute force with random combinations until you find a link that works. But for the adult sites I investigated, they all used a much more complicated pattern, and porn APIs weren’t really a thing. For instance, Xhamster used a combination of the video title and a unique id in its URLs, which is impossible to crack by blindly hammering away at it. But the good thing about Xhamster: their randomizer button that just gave you a 302 to a random video.
Once you have a link, you can fetch the page and extract comments from it. The Python standard library has everything, including an HTML parser, and nda used this to parse the title and comments out of whatever video page was sent back. Thankfully, the layout stayed relatively stable and rarely broke the !porn feature. To remain inconspicuous, it spoofed the User-Agent
header by using a couple of widely used identifiers, though in retrospect, they would probably just block the IP if it somehow became enough of a nuisance.
Over a year ago, the IRC channel became a WhatsApp group. A few people continued to use both, but WhatsApp was the only reliable way to reach everyone, and eventually we all abandoned the channel. nda stayed in the channel until all regular users had left, and was finally shut down for good last month. The channel registration has now expired. But I’m excited to say that the spirit of nda lives on in two spinoff projects based on its two best features!
!quote lives on in quote-importer, which expands upon nda’s quote database. It focuses on capturing and storing as much information as possible from chat logs instead of fast random access. It can currently parse Irssi logs, WhatsApp logs, and even nda’s logs, and saves several types of events; not just messages. It can then store them in various database systems like SQLite, MySQL, and MongoDB. I have recently used it to parse and import the entire catalog of our group’s IRC and WhatsApp logs into a single database, which currently clocks in at just about 3.4 million objects.
!porn lives on in talkdirty2me, which is a small .NET Core/Vue web app (live demo here, turn up the volume). It works much like the original feature did: it gets you a random video and its longest comment. It also saves the comment and generates a shareable permalink that works nicely in rich previews. Xhamster remains a valuable source of quality comments, but there’s an important technical update: Xhamster now has a REST API! I know!! It’s undocumented and not particularly pretty, but a recent update to their site uses it to – you guessed it – fetch comments. You no longer have to parse HTML to get the comments for a video. If you have the video’s id (which can be extracted from the URL) and a few secret headers up your sleeve, you can get its comments in a paged JSON list, which is a huge upgrade from trying to match element ids and CSS classes. I haven’t yet found an endpoint that gives me the video title, so for now, a small amount of HTML fiddling is done with a regular expression.
So that’s what I’ve been up to recently. As I mentioned before, I’ve also rewritten crimsonwars.net as a static Jekyll site. Even though it’s technically simple, I’m still proud of it and want to make it public on Github. I just need to separate the somewhat questionable downloads from the rest of the site so they don’t endanger my account. More on that later, maybe!