Stop Mapping Names to Gender

By Oliver Keyes

I go to a lot of software-related events. Many of them feature efforts to increase inclusivity and diversity in the tech industry. In the last 6 months or so, I’ve been confronted two different times in such spaces with a presentation that features someone mapping names to gender. It’s always “for inclusion”: we want to find out what percentage of people in a community or event were women, and so we worked out how many were called Alice versus John or Bob versus Mary.

Such methods highlight complete ignorance about gender identity and centre spaces on cis white women. They exacerbate deep and really obvious racist and transphobic problems in how we identify people and map inclusivity. I sort of expected that by 2017 we’d be doing better than this kind of broad-strokes clumsiness, but apparently not.

Accordingly, I’ve written up a very short guide on why name-gender mappings:

  1. produce incorrect results;
  2. are morally horrifying, and;
  3. are usually unnecessary.

It’s inaccurate

There are seven billion human beings on the planet. Even in the event we’re only dealing with generic cis white people from European or North American backgrounds, there’s considerable variation in the data: ambiguous names that you can at best probabilistically tie to a binary gender. “Sam” could be of any gender or none.

There are also a lot of names that simply can’t be mapped to a gender. Mapping of this sort is usually based on data releases from the Social Security Administration and similar bodies in other countries, which (for the privacy of citizens) aggregate the data and only actually publish the list of top 100 or 1,000 names per assigned gender at birth.

Given that four million children were born in the United States alone in 2010, and given that (as every data scientist knows) there’s always some arsehole human outlier messing up one’s careful models and assumptions, a lot of people aren’t going to be included in the data at all. So you’ve got a system which misclassifies or can’t classify a lot of people due to ambiguity, and can’t even form an opinion on millions more because their names aren’t in the system to begin with.
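To make those two failure modes concrete, here is a minimal sketch of the kind of lookup these projects tend to run. Everything in it is an assumption for illustration: the table, the counts, the 0.9 threshold and the attendee list are all invented, and none of it is real Social Security Administration data.

```python
from collections import Counter

# Hypothetical toy counts, invented purely for illustration; not real SSA data.
# Note the labels describe sex assigned at birth, not anyone's actual gender.
TOY_TABLE = {
    "Mary": {"F": 980, "M": 20},
    "John": {"F": 10, "M": 990},
    "Sam": {"F": 450, "M": 550},
}


def classify(name, table, threshold=0.9):
    """Return 'F', 'M', 'ambiguous', or 'unknown' for a given name."""
    counts = table.get(name)
    if counts is None:
        # The name never made the published top-N list.
        return "unknown"
    total = sum(counts.values())
    label, top = max(counts.items(), key=lambda kv: kv[1])
    # Only commit to a label when one column clearly dominates.
    return label if top / total >= threshold else "ambiguous"


# Hypothetical attendee list.
attendees = ["Mary", "John", "Sam", "Nkechi", "Xiuying"]
outcomes = Counter(classify(name, TOY_TABLE) for name in attendees)
```

In this toy run, three of the five attendees end up either "ambiguous" or "unknown", and the two who do get a label are still only a probabilistic guess.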

It’s racist

The voids in these datasets don’t cover everyone evenly. Rather, they often fall straight down lines of race, culture and ethnicity.

Names vary between groups, even within a nation. In a European or North American nation, that means overlapping but distinctive pools of names for people of African, Asian, Native American and European backgrounds. In all of those nations, the majority of the citizens are white people.

Accordingly, names that are largely unique to non-white groups are far more likely to be excluded from a top-N dataset than common names used by white people, for the simple reason that there are fewer non-white people.

Congratulations: your methodology is racially biased.
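The arithmetic behind that is easy to sketch. The example below assumes two groups with completely separate name pools and an identical popularity curve (the k-th most common name in each group is held by a share proportional to 1/k), differing only in size. All of these are simplifying assumptions of mine, the numbers are invented, and real name pools overlap.

```python
def name_counts(group, size, n_names=10):
    """Hypothetical name counts: the k-th name gets a share proportional to 1/k."""
    weights = [1 / k for k in range(1, n_names + 1)]
    total = sum(weights)
    return {
        f"{group}_name_{k}": size * w / total
        for k, w in enumerate(weights, start=1)
    }


def coverage_by_group(groups, top_n):
    """Share of each group whose name survives a population-wide top-N cutoff."""
    counts = {}
    for group, size in groups.items():
        counts.update(name_counts(group, size))
    # The "published" list is simply the top_n names across everyone.
    published = set(sorted(counts, key=counts.get, reverse=True)[:top_n])
    result = {}
    for group, size in groups.items():
        covered = sum(
            count
            for name, count in name_counts(group, size).items()
            if name in published
        )
        result[group] = covered / size
    return result


# Invented group sizes: an 80/20 split.
coverage = coverage_by_group({"majority": 8000, "minority": 2000}, top_n=10)
```

Under these assumptions, roughly 93% of the larger group has a name on the published list, against about 51% of the smaller group, even though nothing about the cutoff mentions either group.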

Things get much worse when you talk about global populations, because the availability of data falls drastically for places that aren’t (you’ve guessed it) North America and Europe. This may have something to do with centuries of exploitation and imperialist destabilisation making it harder to put together cohesive long-term records, trust central government, or just throw money around on digitisation, I dunno.

The result is, invariably, that you end up with a model that underrepresents people of colour, be they from European/North American contexts or elsewhere. Both are vital, non-excludable populations to consider in even the most half-hearted inclusion initiative. It’s not enough to just include white women and call it done: marginalisation is both deeper and different for women of colour, and excluding them from these kinds of assessments leads to bad outcomes.

It erases trans and non-binary people

Nobody maps name to gender to end up with gender information. Rather, they want the values that they can impute from gender - what someone’s socialisation, lived experiences and current treatment are or were probably like.

At best, this is a really inaccurate measure. Women of colour have very different experiences from white women, and that reflects across all the axes of intersectionality and marginalisation. But sure, there are probably some commonalities.

The problem gets teeth in the cases of trans and non-binary people, whose socialisation, lived experiences and current treatment are likely to be all over the map if the baseline is set by cis people. A non-out trans person might be treated as their assigned (rather than identified) gender. But their experience of that treatment–and the experiences any trans person had pre-transition–are going to be very different from the experiences of cis people of the same assigned gender.

If you have someone in your dataset called ‘Cindy’, there are five main options:

  1. They are a cis woman called Cindy
  2. They are a trans man called Cindy
  3. They are a trans woman called Cindy
  4. They are a non-binary person called Cindy, or;
  5. They are a cis man called Cindy (unlikely, but possible).

All of these people have very different experiences of gender. They are all treated differently, and react to that treatment differently. Bundling them all in together guarantees inaccuracies and (given how demographic percentages work out) guarantees missing the experiences of trans and non-binary people in your project. Frankly, claiming that birth name maps immutably to gender in the first place is the kind of essentialist TERFy nonsense that has no place in inclusion efforts.

Name datasets won’t save you on gender. Not only are they full of holes, but even if they weren’t, they can’t serve to distinguish cis and trans people (and thank god for that). They can’t, so long as they’re based on the idea of binary gender, account for people with fluid or non-binary identities at all, which also taps into the geography and race problem: >2 genders is pretty common in a lot of social structures that aren’t of European origin, so how is your fancy DB gonna handle those names? They’ll give you crap data that glosses over the existence of some of the most marginalised people in your community.

Most of the time it’s unnecessary

The good news is: most of the time you can do better! It’s really simple: all you have to do to find out gender information is… ask your community. Create a survey, stick in a question that asks the respondent’s gender (and provides an open text box), add another question on whether the respondent is trans, and you’re done. Entirely done. No fuzziness, no racist or trans-excluding bias in your dataset. You have precisely the genders of the community you’re studying or working with. And you didn’t have to download a metric ton of name data. Everyone wins!
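As a rough sketch of how little machinery that takes, the two survey questions can be stored and tallied like this. The class name, field names and sample responses are hypothetical, and allowing respondents to skip either question is my assumption rather than something prescribed above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyResponse:
    # Free text, exactly as the respondent wrote it; None if skipped.
    gender: Optional[str]
    # "yes" or "no"; None if skipped.
    is_trans: Optional[str]


def tally_gender(responses):
    """Count self-described genders, merging only case and whitespace variants."""
    counts = Counter()
    for response in responses:
        answer = (response.gender or "").strip().lower()
        counts[answer or "(no answer)"] += 1
    return counts


# Hypothetical responses, invented for illustration.
responses = [
    SurveyResponse("Woman", "no"),
    SurveyResponse("woman ", "yes"),
    SurveyResponse("non-binary", "yes"),
    SurveyResponse(None, None),
]
gender_counts = tally_gender(responses)
```

The tally keeps whatever words people used for themselves rather than forcing them into two buckets, and any further grouping is a decision you make openly, not one a name table makes for you.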

The alternative is that you keep doing this, and I keep having to write about it. You keep studying gender as a binary, keep mapping names to said gender, keep ignoring POC in your community or research space, and keep communicating to trans and non-binary people, with every presentation, that they are not welcome in your community.

The result will be inclusion and diversity for cis white women. Any benefits for anyone else will be incidental.

If that’s the sort of initiative you want to run, go right ahead, but it’s not inclusion. If you want to do better–well, just do better. Start surveys, ask those questions, involve your community in the research you’re doing so problems like this aren’t missed. If your inclusion efforts can’t be inclusive, precisely how do you expect your outcomes to be?