This is an alternate FAQ for R. Specifically, it’s an FAQ that tries to answer all the questions about R’s weird standards, formatting and persnicketiness that you’re afraid to ask.

It was written sober but is heavily influenced by stuff written while drunk. It does not contain expletives or adult language but has tried really hard to give the impression it does.

About

Why?

Because R’s FAQs and docs are great at what they’re designed to do (teaching you how to do X), and pretty terrible at teaching you why to do X, or how in the blinding fires of hell X became the convention in the first place.

It can be pretty frustrating for inquisitive folks, so I decided to write down all the stuff I’ve learned over the last few years about why we do the things we do.

Licensing?

Informally, this is public domain. Formally:

CC0
To the extent possible under law, Oliver Keyes has waived all copyright and related or neighboring rights to this work. This work is published from: United States.

If you live in some weird backwards country that treats copyright as a mandatory license to print money (looking at you, British people), and they somehow don’t accept public domain declarations, you have my permission to treat your local copyright law as the asinine benighted mass of political donations it undoubtedly is and ignore it completely.

Who do I complain to?

That’d be Oliver Keyes, although the name came from Alex and the underlying concept, idea and thoughts from Erin.

The questions themselves were generated by Oliver, Alex and Erin, with further submissions from Jen Haskell, Andrew McDonald, Auriel Fournier and probably some other people I’m forgetting who I will include if they poke me.

Thanks to boB Rudis and Ben Edwards for copyediting!

FAQ

Here are the actual questions and answers!

As a general rule of thumb, if you encounter something truly ludicrous, don’t know where it comes from, and don’t see it listed here, randomly select from one of the following explanations:

  1. Backwards-compatibility.
  2. Nobody thought it was important to get right at the time.
  3. That still exists?! I thought we’d removed tha- oh, wait, backwards-compatibility.
  4. Scheme did that.
  5. S did that.
  6. APL did that.
  7. Lisp did that.
  8. That’s the only use case late-20th century pure statisticians have, and if it’s good enough for us it should be good enough for you.
  9. Are you kidding?! If we’d done it that way it wouldn’t work on Solaris 8!

Why do we use <- for assignment?

R as a programming language originates in S, which was the previous language of choice for statistical people. And S had <- as the assignment operator. And so we do. Standard explanation.

But wait, there’s more!

The reason S uses <- is that APL did. APL being a programming language from the year of our lord nineteen hundred and sixty four. APL used it because it was designed on a machine that had a single key that printed <-, so, hey, it’s right there on the keyboard! It looks like it assigns things! Let’s use that!

This got incorporated into S, and everything from S got incorporated into R, because R started off as basically “S with knobs on”, and we’ve been using it ever since, despite the fact that keyboards have long since standardised away from having a <- key.

So, to summarise: the reason we use <- for assignment is that it made sense in a programming language written before the incorporation of Apple Computer, because it made sense in a programming language written before the moon landings.

I regret to inform the audience that this is not the last R idiosyncrasy in this document that can be summarised as “because keyboards”.

Okay. And should we be using <- or =? People keep telling me to use <-.

Pictured: A C programmer reading their first R, pre-2001.

<- used to be preferred for a very good reason. = wasn’t introduced as an assignment operator until 2001; before that it was only used in function calls:

foo(bar = "baz")

Then in 2001 = was introduced due to the dawning recognition that everyone else uses it and so it might make switching to R from another language slightly less of a horror show if at least basic syntax lined up with other environments.

Initially there were some things you couldn’t do with = that you could do with <-, precisely because = is also used in function arguments and has a very different meaning there. Basically, you could only use = if it was at the top level of a call, or wrapped in parentheses. Otherwise the parser would read it, choke and die due to ZOMGambiguity!
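If the two meanings of = are hard to picture, here is a minimal sketch in present-day R. The function name demo_assignment is made up for illustration, and median is only standing in for "any function with an argument called x".

```r
# Inside a call, `=` names an argument, while `<-` really does assign.
demo_assignment <- function() {
  here <- environment()  # this function's own frame

  m1 <- median(x = c(1, 5, 9))    # `x` is matched to median()'s argument
  after_equals <- exists("x", envir = here, inherits = FALSE)

  m2 <- median(x <- c(1, 5, 9))   # creates `x` here, then passes its value along
  after_arrow <- exists("x", envir = here, inherits = FALSE)

  list(after_equals = after_equals,
       after_arrow = after_arrow,
       same_median = identical(m1, m2))
}

res <- demo_assignment()
res$after_equals   # FALSE: no variable called x was created
res$after_arrow    # TRUE: x now exists in the function's frame
```

Both calls compute the same median; the only difference is whether a variable named x is left lying around afterwards.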

These days a lot of the differences between <- and = have been eliminated, and it’s safe to use = pretty much everywhere. There are exceptions around specific functions, like quote, which holds unevaluated R expressions; it doesn’t like =:

> quote(y <- 5)
y <- 5
> quote(y = 5)
Error in quote(y = 5) : supplied argument name 'y' does not match 'expr'

This error is kind of hilarious because the announcement of the introduction of = included quote(y = 5) as something with an equals sign that’d work fine.
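A rough sketch of what is going on, and of one way around it, assuming a reasonably recent R: quote reads y = 5 as "an argument named y", whereas wrapping the assignment in parentheses turns it back into a single expression. The scratch environment is only there so the example doesn't scribble on your workspace.

```r
# `quote(y = 5)` fails: `y` is treated as an argument name, not an assignment.
bare <- tryCatch(quote(y = 5), error = function(e) "error")
bare

# Parentheses make the assignment one expression that quote() can hold.
wrapped <- quote((y = 5))

# Evaluate the held expression in a throwaway environment.
scratch <- new.env()
eval(wrapped, envir = scratch)
get("y", envir = scratch)
```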

But = didn’t exist until 2001, and wasn’t safe to use without worrying until even later than that. And <- was still the standard in S, where R, most of its code and a lot of its people originated. Combine the two, and the result is that <- became The Idiom. All the examples use it, all of the books use it, most of the programmers who might teach you use it.

Most of the time, though? Use whatever the heck you want. There are very, very few situations in which = will trip you up and knock you down, and pretty much all of them are situations where you’re doing something you should in no way be doing. The only thing you should consider when choosing which to use is your audience: if you’re writing for other R programmers, who are primarily <- based as a group, using = might make your code harder for them to read. Similarly, if you’re planning on spending a lot of time reading everyone else’s code, having = as your brain’s default might make that harder.

Generally, though, do what you want. Just don’t operator-shame.

Why do we use . so much in function parameters? Everyone else uses _.

So this is stuff like mean(), which looks like (as its default):

mean(x, trim = 0, na.rm = FALSE, ...)

Pictured: an Execuport. No really, this is what S was developed on.

na.rm is “remove NAs”. It’s multiple words, separated by a period. Why a period? Well, this is another “blame S” story.

The reason R doesn’t use _ to separate words is very simple; R used to use _ as an assignment operator. x _ 5 set x to 5.

Why? S did! Because S was developed on an Execuport. If you’ve never heard of those, that’s perfectly understandable, because an Execuport isn’t even a computer in the modern sense of the word; it’s a thermal printer. It’s a glorified keyboard with a cable leading out the back to an actual computer as large as your house, as inefficient as your local council and many orders of magnitude slower than a modern phone.

But what the Execuport did have to make up for all these deficiencies was a key that printed <-. Which for some reason had a _ on it. So the solution of course was to introduce both what the key said and what the key did as assignment operators just so nobody could get confused.

Anyway, as a result of that _ couldn’t be used to separate words in arguments or any of that lovely stuff because it’d make the parser cry and hide under its bed until the nasty programmers went away. So . was used instead, despite the fact that . is also used for defining methods in object-oriented programming, and is visually identical to that when defining something that isn’t actually a method at all, and argh, R really likes its periods, period.
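To see why the dot is such a hazard, here is a small sketch using S3, R's simplest object system. Everything in it (the area generic, the "square" and "units" classes) is invented for the example.

```r
# A generic plus a real method: area() dispatches on the class of `shape`.
area <- function(shape, ...) UseMethod("area")
area.square <- function(shape, ...) shape$side^2   # method for class "square"

# Intended as a plain helper that merely has a dot in its name...
area.units <- function(...) "square metres"

sq <- structure(list(side = 3), class = "square")
area(sq)        # 9
area.units()    # "square metres"

# ...but give something the class "units" and the helper becomes a method.
m <- structure(list(), class = "units")
area(m)         # "square metres"
```

Nothing in the name distinguishes the method from the helper; S3 just pastes the generic and the class together with a dot and goes looking.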

_ for assignment has now been removed. You can now use _ anywhere you damn well want - except at the beginning of an object name, because R wouldn’t be R unless the cleanup after its idiosyncrasies was itself idiosyncratic. But you can use them most places, including to separate words in function parameters, and I thoroughly recommend you do because it’s just_so_much_more_readable than this.kind.of.thing.

The reason it’s not used more widely and we still have things like na.rm? Changing things would promptly break the code of anyone who had explicitly used the parameter name. The R developers very highly prize backwards compatibility, quite understandably.
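A quick sketch of where underscores are and aren't welcome these days. safe_mean and its na_rm argument are made-up names; the wrapper simply forwards to the real na.rm argument of mean.

```r
# Underscores are fine inside names, including argument names.
safe_mean <- function(x, na_rm = FALSE) mean(x, na.rm = na_rm)
safe_mean(c(1, 2, NA), na_rm = TRUE)   # 1.5

# A leading underscore is still rejected by the parser.
leading_ok <- tryCatch({
  parse(text = "_x <- 1")
  TRUE
}, error = function(e) FALSE)
leading_ok   # FALSE
```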

How does library(foo) even work? Shouldn’t it be library("foo")?

Oh this is a joyous one. The thing you have to thank is non-standard evaluation, once lauded and celebrated by one of the original authors of R as…oh, actually he says it’s “discouraged”.

You pass an argument through to a function, it evaluates the argument, it uses the argument, right? Wrong! Evaluating arguments is completely optional!

See, in R, everything is ultimately an expression. foo? Expression. It only does something when it’s evaluated, which could be when you hit return in the interpreter or never depending on the intentions of the person who wrote the function you’re passing foo into.
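A minimal illustration of "evaluating arguments is optional", with two made-up functions: one never touches its argument, the other does.

```r
# Never looks at `x`, so the stop() call is never evaluated.
ignores_argument <- function(x) "never looked at x"
ignores_argument(stop("this error is never raised"))

# Touching `x` forces evaluation, and only then does the error fire.
uses_argument <- function(x) {
  x
  "looked at x"
}
result <- tryCatch(uses_argument(stop("boom")),
                   error = function(e) conditionMessage(e))
result   # "boom"
```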

In the case of library calls, foo isn’t evaluated directly. Instead it’s held, unevaluated, and then the unevaluated expression is turned into a string with as.character (after a bit of mangling). You can run:

as.character(substitute(foo))

to see what it looks like, and if you do, you’ll end up with “foo”. Or to put it another way, inside the function, library(foo) is precisely the same as library("foo").
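The same trick can be sketched in a function of your own. name_of is a made-up helper, not how library is actually written, but it shows the bare name and the string ending up in the same place. The character.only argument is library's real escape hatch for when the package name lives in a variable.

```r
# Capture the argument unevaluated and turn it into a string.
name_of <- function(pkg) as.character(substitute(pkg))

name_of(stats)     # "stats"
name_of("stats")   # "stats"

# With the name stored in a variable, tell library() to evaluate it normally;
# without character.only = TRUE it would go looking for a package called "pkg".
pkg <- "stats"
library(pkg, character.only = TRUE)
```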

Why this is there…I’m provisionally going to blame some form of backwards compatibility but I have no idea. Anyone who does know, I’d love to hear.

Why is it called library anyway? Shouldn’t it be package?

Actually, no. If you’ve ever submitted a package to CRAN and, somewhere in the package description, put “a library for…” you will have experienced the singular pleasure of getting a shouty email along the lines of “IT IS NOT A LIBRARY. IT IS A PACKAGE. PACKAGES LIVE IN THE LIBRARY”.

Quite what libraries CRAN browses I have no idea, since it’s normally books living in mine, but the point is that a library is “the R packages you have on your system”. A package is a collection of code that, on install, lives in your library. So the reason the function is called library is that it’s not “load this package” it’s “from the library, load…”.
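You can poke at the distinction yourself. This sketch uses the stats package purely because it ships with every R installation; the paths you get back will depend on your system.

```r
# Libraries are directories; .libPaths() lists the ones R searches.
libs <- .libPaths()
libs

# A package is a folder living inside one of those libraries.
pkg_dir <- find.package("stats")
basename(pkg_dir)   # the package's own folder: "stats"
dirname(pkg_dir)    # the library it lives in
```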

To make things more confusing of course, in S the informal term for packages (known as “library sections” or “chapters” formally) was “library”. So a <- is a <- because of S but a package is a package despite it.

Why do people shout at me when I use require instead of library?

The reason people care so much about library versus require is simply that when library can’t find something, it issues an error and your script falls over. When require can’t find something, it…issues a warning. Which means that if you’re running your script automatically, or the loading call is inside a function, the script doesn’t stop until it needs the contents of the package it couldn’t find, can’t get it, and falls over with a confusing message.

And then tells you oh by the way it couldn’t load this library. Super helpful.

Accordingly, library is a much safer idiom, because it means you don’t run into surprises when you’re writing automated code and something goes funky in your system.
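A small sketch of the difference, assuming the package name below (which is made up) really isn't installed on your machine. suppressWarnings is there only to keep the output tidy.

```r
missing_pkg <- "aPackageThatIsNotInstalled"   # hypothetical name

# require() shrugs, warns, and hands back FALSE; the script carries on.
loaded <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
loaded   # FALSE

# library() raises an error on the spot.
outcome <- tryCatch({
  library(missing_pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
outcome   # "error"
```

If you do want require, checking its return value (for example with if (!loaded) stop(...)) gets you the early failure back.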

Yihui Xie has a longer and more detailed writeup on this (with examples and everything!) that’s a thoroughly good read.

Why is stringsAsFactors=TRUE by default?

Because for the data R programmers originally dealt with, it made a lot of sense. R is a statistical programming language that was oriented towards traditional data sources: when strings appeared, they were usually categorical variables (“control”, “test1”, “test2”), and so treating them as a set of fixed values by default was the sensible thing to do, particularly since it saved massively on space (each entry only stored a number pointing at the factor level, not the whole string, and numbers tend to be smaller than strings).
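To make the "numbers instead of strings" point concrete, here is a short sketch with invented data. stringsAsFactors is passed explicitly so the result doesn't depend on whichever default your version of R ships with.

```r
group <- c("control", "test1", "test2", "control", "test1")

f <- factor(group)
levels(f)       # each distinct string, stored once
as.integer(f)   # the per-entry integer codes: 1 2 3 1 2
typeof(f)       # "integer"

# The same conversion, done by data.frame() when asked.
df <- data.frame(group = group, stringsAsFactors = TRUE)
class(df$group)
```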

These days people throw a ton of different things into R, and factors are used less commonly as a proportion than they once were. Plus, caching of character strings inside R means that basically all strings are factors from a memory point of view. But defaults are important; changing them means breakage, which is why it took until R 4.0.0 for the default to finally flip to FALSE.

For a more detailed explanation I thoroughly recommend Roger Peng’s writeup.

Why doesn’t round work like you think it should?