C++ types for R programmers

A lot of my work involves Dirk’s fantastic Rcpp package, which provides a C++ API for seamlessly integrating compiled code into R. This means we can drastically increase the speed of a lot of specific operations - at the cost of having things like strict typing, and a lack of easily-accessed mathematical functionality.

There are maps between R data types (vectors, lists, data.frames, matrices) and C++ equivalents, along with syntactic sugar to make programming C++ as “R-like” as possible for people new to the language. It’s a fantastic pile of code, and there are already a lot of guides on how to use the Rcpp data types with R.

But C++ has a lot of types of its own, designed with different goals in mind: some are faster for particular sorting algorithms, others take up less memory, others are easier to manipulate. And these are just as usable as the Rcpp types, and can sometimes be better for particular tasks. So this will hopefully end up as a brief guide to the very basic C++ types, and the advantages and disadvantages they have, aimed at R programmers.

Deques

The first basic data type I find really useful is the deque; I’ll explain why below.

What the heck is a deque?

Deques (or “double-ended queues”) are very similar to vectors on the face of it, and they can be declared in a similar way to a vector:

// Construct a deque of strings, containing space for 12
std::deque < std::string > example(12);

// Construct a vector of strings, containing space for 12
std::vector < std::string > vector_example(12);

In practice, though, deques look very different from vectors. Vectors are contiguous: they’re stored, all of the elements, in a single long block of memory. That means that when you expand or reduce the size of the vector, it can get pretty expensive, particularly with large objects. But writing to a pre-allocated or pre-assigned vector is pretty much the cheapest way of doing anything.

(This is why Hadley always tells you, R-side, to pre-assign vectors wherever possible. Same principle).

Deques, however, are not; individual elements are stored wherever they’ll fit. That makes it a lot cheaper to expand or contract the object. And by extension - because C++ tries not to let people do things that are too bad an idea - deques are also more manipulatable than vectors. With vectors, you can only add elements to the middle or the end, and adding them to the middle can get incredibly expensive.

With deques, on the other hand, inserting into the middle is very cheap, and you can also remove or add elements from the front of the deque - operations that are not (directly) possible with a vector. This makes it faster and more memory efficient when it comes to objects you’re going to resize, add to or pop bits off of.

Where and when should I use it?

If you’ve got something you’d normally store in a vector, and you want to be inserting new elements or removing them from anywhere other than the end, or resizing the object as you go? Use a deque. It’s faster and cheaper.

On the other hand, if you just want to create an object of a fixed size and then do object[i] = result_of_calculation() a lot, a pre-allocated vector is going to be faster - but the key word there is ‘pre-allocated’. Moving to C++ does not mean the cautionary tales about for-loops in R don’t apply, it just means they’re less noticeable to people operating on human scales.

You can see an example of deque usage in the humaniformat package, which parses human names. Because it’s applying different heuristics to different bits of the phrase it’s given, and not necessarily from back to front, we want to be able to pop bits off really trivially - and so we use a deque. See here for a specific use of it.

Sets

Sets are fun, although less useful compared to vectors than deques.

What the heck is a set?

Think of a set as a group of keys. Because, uh. Well, that’s what it is.

It’s again defined similarly to a vector, the difference being you can’t pre-allocate space (although inserting is very cheap so this isn’t much of a problem):

//Define a set of strings
std::set < std::string > set_of_strings;

//Define a vector
std::vector < std::string > vector_of_strings;

A set is roughly equivalent to a vector (in the sense that it can only store one, fixed list of items), but with one crucial difference: a set only contains unique values. Because of this, and because of how it’s implemented, it can (note the qualifier) be a fast way of checking whether a value you have is within a specified range of unique values.

Where and when should I use it?

In R we have the %in% keyword:

1 %in% c(2,3,4)
[1] FALSE

Nothing that simple and pretty exists in C++ for achieving that; instead, we’d have (for example) an integer as the value, and a vector of integers as the range of values to test against, and we’d do:

if(std::find(vector_of_values.begin(), vector_of_values.end(), value) != vector_of_values.end()){
	return true;
} else {
	return false;
}

This is a pretty annoying operation because it’s very common to have to do and there’s nothing as easy as %in% for it (long practice has seared the syntax above into my brain, but not willingly). It can also be pretty slow; when you’re searching an unordered vector, performing this operation over and over again on various different values, your code starts to move slower than a nihilist on quaaludes.

A set can sometimes be a faster way of doing this, because it’s ordered by default - and, because it’s ordered, it has strong guarantees about how performant comparisons and checks will be. In addition, inserting new values into a set is very cheap compared to inserting new values into a vector, so if you need to incrementally build up your range of values, a set may be the choice for you.

Having said that, an ordered vector can be pretty fast too. And, since a set uses more memory than an equivalently-sized vector, it can end up looking very expensive from a RAM point of view. So if you’re dealing with a small number of values in the set/vector (meaning that the cost of inserting new elements is not an issue) and a small number of comparisons (meaning that the marginal speed improvement sets have over ordered vectors isn’t that useful), just use a vector, order it once, and rely on that.

On the other hand, if you’ve got a lot of unique keys to insert, or a lot of comparisons to make, or both, and you don’t mind about the memory consumption, a set may be what you need. And if you’re interested in reading up more on the implementation and performance differences Matt Austern has a great writeup.

How do I use these in Rcpp?

Just use them! Originally I had some advice here about how you can’t export those objects and so Handle With Care, but Hadley pointed out that actually yes, yes you can. Not sure where the idea you couldn’t comes from, but there we go.

Sets and deques can both be returned in exactly the same way vectors or lists are; specify that the output type is std::set < foo > or std::deque < foo >, and return an object of that type. You can’t use them as arguments to exposed Rcpp functions, of course, because R has no analogous types internally (and Rcpp converts sets and deques both to vectors when you return them).