Why is append() an anti-pattern in R?

By Oliver Keyes

As a case study: suppose we have a vector of 1,000 URLs and we want to decode them? URLdecode isn’t vectorised, so we’ll have to run a for loop over it or something:

encoded_urls <- readRDS("urls.RDS")
decoded_urls <- c()
for(i in seq_along(encoded_urls)){
  append(decoded_urls, URLdecode(encoded_urls[i]))
}

If you do this, the R performance fanatics in the room (this may be a population of…like, Kevin, Hadley and myself) will promptly make animal noises and explain that append is an anti-pattern and it’s not performant and you should not do it and aaaah. Only nicer than that. If there’s coffee.

But why? The reason I see this style so often is that in Python, it’s totally acceptable. In fact, it’s encouraged! To perform precisely the same operation in Python, you’d do something like:

from urllib import unquote
from pickle import load
file = open('urls.pkl')
encoded_urls = load(file)
file.close()
decoded_urls = []

for url in encoded_urls:
  decoded_urls.append(unquote(url))

…and that would be totally acceptable. Absent some handwaving around the file loading and specificity of the importing, it’s actually the preferred method. It’s totally performant! Why the difference?

A lot of the time the answers are things like “R is not optimised for scalar operations” and “function calls have a cost which is why vectorisation is so important”. And these are true. But a bigger issue is: storage format. Python lists and R vectors look very similar and behave in very similar ways, but they’re totally different under the hood.

R vectors

How are R vectors stored? Not in R, but in the underlying C? The answer is…well, thorny but the important bit is: R vectors are stored contigously.

What this means is that when I have an R vector called a_vector of 5 elements, there’s a blob of memory 5 elements long that is reserved for this vector. It might be surrounded by lots of other things from R, or from totally unrelated programmes. It might be surrounded by empty space. R doesn’t know. And so when I want to add an item to a_vector and make it of length 6…well, R doesn’t know if the memory around it is free, so it instead:

  1. Finds an entirely new, unused space in memory, of length 6, and reserves it;
  2. Copies the (now-expanded) vector to that space;
  3. Changes the name a_vector to point to the new memory address;
  4. Deletes the old version of the vector and garbage collects it.

We can actually see this happening. Let’s use an example vector provided by Professor Mathers:

example_vector <- c("Reggie",
                    "Jay-Z",
                    "Tupac",
                    "Biggie",
                    "Andre from OutKast",
                    "Jada",
                    "Kurupt",
                    "Nas")

This has a length of 8, and so when you create it, R goes off, reserves memory equivalent to a vector of 8 elements, and reports back. Great. We can actually find where in memory this is with the pryr package, which provides address() to identify where in memory an object is stored. On my system, example_vector lives at “0x3584c38”. But if we append “me” to the list, as Mathers suggests, it suddenly lives at “0x4dea320” instead - because R has had to go off and carve out a bigger, totally different space to store the newly-expanded vector in.

That’s why append is an anti-pattern; because carving out new space, shifting everything over, reassigning the name and garbage collection takes time, and when you do it thousands or tens of thousands of times, it adds up.

Python lists

So why is append() perfectly workable with Python lists, the equivalent data type?

Because Python lists might look similar, but they’re actually stored in a totally different way.

Python lists consist of two elements:

  1. An index saying how long the list is (say, ‘9’)
  2. A contiguous array of pointers to where each element is stored.

Pointers, not the actual entries. Beause the content of a Python list is stored non-contiguously: the entries don’t have to be anywhere near each other in memory! When you ask for the 8th element, it goes down the list of pointers, finds the 8th one, grabs that memory address and provides the contents. And so when you append to a list, you’re incrementing the index by 1, finding memory to store one more element, and popping that pointer on to the end of the pointer list. And that’s it.

So in R, appending means allocating memory for the entire vector, which could be millions of elements. In python, appending means allocating memory for one element. And that’s the main reason it’s faster.