Hi! I’m Oliver


I'm a Human-Computer Interaction (HCI) researcher and programmer living slightly north of Castle Black. I study online communities, focusing on how people consume content, how user behaviour varies between desktop and mobile platforms, and how we can best understand systemic bias in peer-production communities. My C.V. can be found here.

A long-time Wikipedia editor, I'm employed by the Wikimedia Foundation as a researcher, geolocating the heck out of everything. Prior to that I worked at the Foundation as the first community manager in our product development team, and prior to that I worked as (by decreasing length of survival) a political campaigner, a free culture advocate, a librarian and a butcher.

Outside of research I'm a dedicated C++ and R programmer, and enjoy nothing more than writing whip-fast packages for a variety of tasks (see 'Code') and contrasting that with writing things you should definitely not write in R, in R. Like this website. Outside of outside of research, I blog about both.

I'm powered by books, archery and that Midwest hippity-hoppity nonsense the kids love so much.

Doing terrible things to R so that other researchers don’t have to

I spend kind of a lot of time programming, both in support of my research and to give back to the ecosystem I depend on. My work stack is Java, Hadoop and Hive, with bits of Python or AWK on occasion; my recreational-and-also-work stack is a mix of R and C++, with C when I absolutely can't avoid non-reference pointers.

The “giving back to the ecosystem” consists of building useful R packages covering a variety of subjects. These include:

Access log parsing and data extraction

There's a lot of focus on R as a language for, say, biostatistics or financial modelling, and it's very good at those things - but there are a lot of researchers, particularly in web-based industries, who work with raw access log data (I'm one of them). I wrote a series of tools that let a researcher read in web access logs, parse and normalise them, and finally calculate common metrics over them.

This starts with webtools, a package for reading files in common access log formats; thanks to its dependencies, it is 10-15 times faster than the equivalent task in base R. It also contains functions to normalise IPs, resolve x_forwarded_for fields, and decode URLs in a vectorised fashion.
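For context on that comparison, here is roughly what the "equivalent task in base R" looks like against a standard combined-format log. This is a sketch of the slow baseline, not code from the package itself:

    # Combined-format logs are space-separated, with the timestamp split across
    # two bracketed fields and the request/referer/agent fields quoted.
    logs <- read.table(
      "access.log", sep = " ", quote = "\"", comment.char = "",
      stringsAsFactors = FALSE,
      col.names = c("ip", "identd", "user", "timestamp", "offset",
                    "request", "status", "bytes", "referer", "agent")
    )

    # URLdecode() isn't vectorised, so decoding means looping over every row -
    # one of the things a vectorised C++ backend avoids.
    logs$request <- vapply(logs$request, URLdecode, character(1), USE.NAMES = FALSE)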

For more dedicated URL handling, urltools (see the pattern?) contains vectorised decoders, encoders and parsers for URLs, along with lubridate-like syntax that lets you modify specific parts of a URL, such as the host or scheme.
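By way of example - the function names here are from memory, so check the package documentation for the definitive list - the decode/parse/modify workflow looks something like:

    library(urltools)

    urls <- c("https://en.wikipedia.org/wiki/Main%20Page",
              "http://www.wikidata.org/wiki/Q42")

    url_decode(urls)     # vectorised percent-decoding
    url_parse(urls)      # one row per URL: scheme, domain, port, path...

    # lubridate-like modification of individual components
    scheme(urls) <- "https"
    domain(urls)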

And if you decide that you want to take this data and use it to calculate, say, information about user sessions, you can use reconstructr - a generalised package for time-based session reconstruction approaches.
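The underlying idea is simple enough to show in a few lines of plain R. This is the general time-based technique, not reconstructr's actual interface: order a user's events by time, then start a new session whenever the gap between consecutive events exceeds a threshold (conventionally 30 minutes).

    # Plain-R sketch of time-based session reconstruction; sessionise() here is
    # an illustrative helper, not necessarily what reconstructr exports.
    sessionise <- function(timestamps, threshold = 1800) {
      timestamps <- sort(timestamps)
      gaps <- c(0, diff(as.numeric(timestamps)))
      cumsum(gaps > threshold) + 1   # session ID for each event
    }

    events <- as.POSIXct(c("2015-01-01 10:00:00", "2015-01-01 10:04:00",
                           "2015-01-01 13:30:00"), tz = "UTC")
    sessionise(events)   # 1 1 2 - the long gap opens a second session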

All of the above have an R frontend and a C++ backend, allowing for extremely speedy computations that are still user-friendly.
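The pattern itself is easy to demonstrate with Rcpp; this snippet is just an illustration of the approach, not code lifted from any of those packages:

    library(Rcpp)

    # The C++ "backend": a loop over timestamps that would be slow as interpreted R.
    cppFunction('
    NumericVector time_gaps(NumericVector timestamps) {
      int n = timestamps.size();
      NumericVector out(n);
      if (n == 0) return out;
      out[0] = 0;
      for (int i = 1; i < n; i++) {
        out[i] = timestamps[i] - timestamps[i - 1];
      }
      return out;
    }')

    # The R "frontend": users just see an ordinary vectorised function.
    time_gaps(c(0, 300, 7200))   # 0 300 6900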

API clients

I've written, am writing and have contributed to a large number of API clients - some useful, some not. In 'written' are API client packages for MediaWiki, Wikidata, The Setup and the Star Wars API - so you can stop looking for Alderaan data in all the wrong places.

'Writing' includes a client package for Google Drive, and 'contributed to' includes rsunlight, taxize and the underlying httr and rvest packages.
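Stripped of error handling, continuation and rate limiting, the core of any of these is the same pattern: an httr request against a JSON endpoint. A bare-bones illustration against the MediaWiki API - not the actual code of any of the packages above:

    library(httr)

    response <- GET(
      "https://en.wikipedia.org/w/api.php",
      query = list(action = "query", titles = "Alderaan",
                   prop = "info", format = "json"),
      user_agent("example-r-client")
    )
    stop_for_status(response)
    page_info <- content(response, as = "parsed")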

Miscellanea

Contributions that don't fit into the categories above include openssl (which provides tested cryptography to R users), an IETF RFC webscraper, and a geolocation client library. I was also a lead on the ua-parser family of open-source user agent parsers, implementing the R version and maintaining the core definitions.

My research is centred around geodata and reader behavioural patterns: I'm happiest when I have something to localise.

Papers and presentations