Splitting names

August 17, 2009

You can find the latest version of this code on GitHub. There are libraries for both PHP and JavaScript.

The quest

I’m on an ongoing search to find the best algorithm for splitting a full name into a first name and a last name. I’m sure this sounds like a ridiculously trivial quest: just explode the string on a space, right?

The challenge

But how do you tell the difference between people with double first names like Jo Ann Smith and people with double last names like Jo Von Trapp? What would you do if I gave you a double first name AND a double last name at the same time?
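To make the problem concrete, here is a minimal sketch of the naive approach. The helper name `naive_split` is hypothetical and not part of the code in this post; it only assumes that the first word is the first name and everything after it is the last name.

```php
<?php
// Hypothetical helper (not part of the original code): split on the first
// space only, treating the first word as the first name and the rest as
// the last name.
function naive_split($full_name) {
    $parts = explode(" ", trim($full_name), 2);
    return array(
        'fname' => $parts[0],
        'lname' => isset($parts[1]) ? $parts[1] : '',
    );
}

$a = naive_split("Jo Ann Smith");
$b = naive_split("Jo Von Trapp");

// "Jo Von Trapp" happens to come out right, but "Jo Ann Smith" loses half
// of the first name to the last name.
echo $a['fname'], " | ", $a['lname'], "\n"; // Jo | Ann Smith
echo $b['fname'], " | ", $b['lname'], "\n"; // Jo | Von Trapp
```

Splitting on the last space instead would simply flip which of the two names breaks.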

Did you remember that you might need to parse out prefixes (Mr, Mrs, etc) and suffixes (II, Jr, PhD, etc)?

How do you turn Paul T. S. Williams into Paul Williams while intelligently deducing that T. James Adams probably wants to go by James Adams, but T. Adams should probably stay as T. Adams?
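One way to read those three examples is as a rule: prefer the first word that is not an initial, and fall back to the leading initial when there is nothing else. The sketch below illustrates that reading only. Both function names are hypothetical, and it deliberately ignores double first names, prefixes, and suffixes.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the original code.

// A single letter, optionally followed by a period ("T" or "T.").
function looks_like_initial($word) {
    return preg_match('/^[A-Za-z]\.?$/', $word) === 1;
}

// Pick the name a person probably goes by: the first non-initial word
// before the last name, or the leading initial if there is nothing else.
function preferred_name($full_name) {
    $words = preg_split('/\s+/', trim($full_name));
    $last = array_pop($words);
    foreach ($words as $word) {
        if (!looks_like_initial($word)) {
            return $word . " " . $last;
        }
    }
    return count($words) > 0 ? $words[0] . " " . $last : $last;
}

echo preferred_name("Paul T. S. Williams"), "\n"; // Paul Williams
echo preferred_name("T. James Adams"), "\n";      // James Adams
echo preferred_name("T. Adams"), "\n";            // T. Adams
```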

And how do you straighten out the capitalization? I MIGHT WRITE IN ALL CAPS or all lowercase. Most names have the first letter capitalized and everything else in lowercase, but of course there are exceptions. J.P. likes to have both initials capitalized and Mr. McDonald always gets fussy when you forget to capitalize the D. Oh, and I hope you’re prepared for other anomalies like people with dashes in their name.
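A short sketch shows why the obvious fix is not enough. The helper `naive_fix_case` is hypothetical; it just lowercases the word and capitalizes the first letter using PHP's built-in `strtolower` and `ucfirst`.

```php
<?php
// Hypothetical helper: the "obvious" fix of lowercasing everything and
// capitalizing the first letter.
function naive_fix_case($word) {
    return ucfirst(strtolower($word));
}

echo naive_fix_case("SMITH"), "\n";      // Smith
echo naive_fix_case("McDonald"), "\n";   // Mcdonald
echo naive_fix_case("j.p."), "\n";       // J.p.
echo naive_fix_case("kimura-fay"), "\n"; // Kimura-fay
```

Plain words come out fine, but mixed-case names, initials, and hyphenated names all need their own handling.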

As you have probably realized by now, splitting a full name into its proper parts is a little more complicated than it appears on the surface.

I wrote the first version of my name-parsing algorithm two years ago and I’ve been gradually refining it ever since. It’s not perfect, but it’s improved a lot over time. I’m posting this code along with a demo in hopes that it will spur contributions to improve its accuracy even more. Throw the hardest names you know at it and let me know how it performs. I know I’m missing words for the various dictionaries of prefixes, suffixes, and compound name identifiers. Please let me know what I missed.

The algorithm

We start by splitting the full name into separate words. We then do a dictionary lookup on the first and last words to see if they are a common prefix or suffix. Next, we take the middle portion of the string (everything minus the prefix & suffix) and look at everything except the last word of that string. We then loop through each of those words, concatenating them together to make up the first name. While we’re doing that, we watch for any indication of a compound last name. It turns out that almost every compound last name starts with one of 15 prefixes (Von, Van, Vere, etc). If we see one of those prefixes, we break out of the first name loop and move on to concatenating the last name. We handle the capitalization issue by checking for camel-case before uppercasing the first letter of each word and lowercasing everything else. I wrote special cases for periods and dashes. We also have a couple of other special cases, like ignoring words in parentheses altogether.
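The heart of that description, the first-name loop that bails out when it sees a compound last-name indicator, can be sketched on its own. This is a simplified illustration rather than the full algorithm. The function name `sketch_split` is hypothetical, and salutations, suffixes, initials, parentheses, and case fixing are all left out.

```php
<?php
// Simplified sketch of the first-name / last-name loop only.
// The function name is hypothetical.
function sketch_split($full_name) {
    // Compound last-name indicators, copied from the list in the full code.
    $indicators = array('vere','von','van','de','del','della','di','da',
                        'pietro','vanden','du','st.','st','la','ter');
    $words = preg_split('/\s+/', trim($full_name));
    $last = count($words) - 1;

    // A single word is assumed to be a first name.
    if ($last == 0) {
        return array('fname' => $words[0], 'lname' => '');
    }

    // Walk the words before the final one; stop early if we hit an
    // indicator that is not the very first word.
    $i = 0;
    for (; $i < $last; $i++) {
        if ($i > 0 && in_array(strtolower($words[$i]), $indicators, true)) {
            break;
        }
    }
    return array(
        'fname' => implode(" ", array_slice($words, 0, $i)),
        'lname' => implode(" ", array_slice($words, $i)),
    );
}

print_r(sketch_split("Jo Ann Smith")); // fname "Jo Ann", lname "Smith"
print_r(sketch_split("Jo Von Trapp")); // fname "Jo", lname "Von Trapp"
```

Skipping the indicator check on the first word keeps a name like "Von Fabella" from being parsed as all last name.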

The code

<?php

// split full names into the following parts:
// - prefix / salutation (Mr., Mrs., etc)
// - given name / first name
// - middle initials
// - surname / last name
// - suffix (II, PhD, Jr, etc)
function split_full_name($full_name) {
    $full_name = trim($full_name);
    // start with empty parts so the concatenation below never touches an undefined variable
    $name_parts = array();
    $fname = $initials = $lname = "";
    // split into words
    $unfiltered_name_parts = explode(" ",$full_name);
    // completely ignore any words in parentheses
    // (substr is used because the $word{0} syntax no longer works in PHP 8)
    foreach ($unfiltered_name_parts as $word) {
        if (substr($word,0,1) != "(")
            $name_parts[] = $word;
    }
    $num_words = sizeof($name_parts);

    // is the first word a title? (Mr. Mrs, etc)
    $salutation = is_salutation($name_parts[0]);
    $suffix = is_suffix($name_parts[sizeof($name_parts)-1]);

    // set the range for the middle part of the name (trim prefixes & suffixes)
    $start = ($salutation) ? 1 : 0;
    $end = ($suffix) ? $num_words-1 : $num_words;

    // concat the first name
    for ($i=$start; $i < $end-1; $i++) {
        $word = $name_parts[$i];
        // move on to parsing the last name if we find an indicator of a compound last name (Von, Van, etc)
        // we use $i != $start to allow for rare cases where an indicator is actually the first name (like "Von Fabella")
        if (is_compound_lname($word) && $i != $start)
            break;
        // is it a middle initial or part of their first name?
        // if we start off with an initial, we'll call it the first name
        if (is_initial($word)) {
            // is the initial the first word?
            if ($i == $start) {
                // if so, do a look-ahead to see if they go by their middle name
                // for ex: "R. Jason Smith" => "Jason Smith" & "R." is stored as an initial
                // but "R. J. Smith" => "R. Smith" and "J." is stored as an initial
                if (is_initial($name_parts[$i+1]))
                    $fname .= " ".strtoupper($word);
                else
                    $initials .= " ".strtoupper($word);
            // otherwise, just go ahead and save the initial
            } else {
                $initials .= " ".strtoupper($word);
            }
        } else {
            $fname .= " ".fix_case($word);
        }
    }

    // check that we have more than 1 word in our string
    if ($end-$start > 1) {
        // concat the last name ($i picks up where the first name loop stopped)
        for (; $i < $end; $i++) {
            $lname .= " ".fix_case($name_parts[$i]);
        }
    } else {
        // otherwise, single word strings are assumed to be first names
        $fname = fix_case($name_parts[$i]);
    }

    // return the various parts in an array
    $name = array();
    $name['salutation'] = $salutation;
    $name['fname'] = trim($fname);
    $name['initials'] = trim($initials);
    $name['lname'] = trim($lname);
    $name['suffix'] = $suffix;
    return $name;
}

// detect and format standard salutations
// I'm only considering English honorifics for now
function is_salutation($word) {
    // ignore periods
    $word = str_replace('.','',strtolower($word));
    // returns normalized values
    if ($word == "mr" || $word == "master" || $word == "mister")
        return "Mr.";
    else if ($word == "mrs")
        return "Mrs.";
    else if ($word == "miss" || $word == "ms")
        return "Ms.";
    else if ($word == "dr")
        return "Dr.";
    else if ($word == "rev")
        return "Rev.";
    else if ($word == "fr")
        return "Fr.";
    else
        return false;
}

// detect and format common suffixes
function is_suffix($word) {
    // ignore periods
    $word = str_replace('.','',$word);
    // these are some common suffixes - what am I missing?
    $suffix_array = array('I','II','III','IV','V','Senior','Junior','Jr','Sr','PhD','APR','RPh','PE','MD','MA','DMD','CME');
    foreach ($suffix_array as $suffix) {
        if (strtolower($suffix) == strtolower($word))
            return $suffix;
    }
    return false;
}

// detect compound last names like "Von Fange"
function is_compound_lname($word) {
    $word = strtolower($word);
    // these are some common prefixes that identify compound last names - what am I missing?
    $words = array('vere','von','van','de','del','della','di','da','pietro','vanden','du','st.','st','la','ter');
    // in_array returns a real boolean; array_search would return index 0 for 'vere', which reads as false
    return in_array($word,$words,true);
}

// single letter, possibly followed by a period
function is_initial($word) {
    return ((strlen($word) == 1) || (strlen($word) == 2 && $word[1] == "."));
}

// detect mixed case words like "McDonald"
// returns false if the string is all one case
function is_camel_case($word) {
    if (preg_match("|[A-Z]+|s", $word) && preg_match("|[a-z]+|s", $word))
        return true;
    return false;
}

// ucfirst words split by dashes or periods
// ucfirst all upper/lower strings, but leave camelcase words alone
function fix_case($word) {
    // uppercase words split by dashes, like "Kimura-Fay"
    $word = safe_ucfirst("-",$word);
    // uppercase words split by periods, like "J.P."
    $word = safe_ucfirst(".",$word);
    return $word;
}

// helper function for fix_case
function safe_ucfirst($separator, $word) {
    // uppercase words split by the separator (ex. dashes or periods)
    $words = array();
    $parts = explode($separator,$word);
    foreach ($parts as $part) {
        $words[] = (is_camel_case($part)) ? $part : ucfirst(strtolower($part));
    }
    return implode($separator,$words);
}

?>

Josh Fraser
Entrepreneur, world traveler and rock climber.
Software engineer and co-founder of Din, Torbit and EventVue.

Comments

Pete Warden said at 3:16 pm on August 18th, 2009:

That rocks, thanks Josh! I have very similar problems, but nowhere near so comprehensive a solution.

In my case I'm trying to canonicalize display names from email address headers. One common case is that the name will appear as "Warden, Pete" – I try to detect and flip those, but I'm guessing that's not an issue for your data set? Also there's sometimes multiple words inside the parentheses, eg "Pete Warden (Mailana Inc)", but from inspection it looks like you're only catching the first word with your parentheses check?

I'd love to see this on Google Code, there's some other functionality I'm working on that might fit here, like gender guessing from first names:
http://search.cpan.org/~edaly/Text-GenderFromName…

Josh Fraser said at 8:06 pm on August 18th, 2009:

Glad you found this useful and good catch on the parentheses issue. Perhaps you can merge in your code for handling last name, first name? That's definitely a common use-case that I missed. I've set up Google Code and given you commit access at http://code.google.com/p/php-name-parser/.

Jason Priem said at 11:44 am on September 7th, 2010:

Hey Josh, nice work. I just finished writing something similar, along with a test suite of names. It does pretty much what yours does, although it's object-oriented and captures nicknames and first-initials separately. Here are a few names your lib misses that HumanNameParser.php parses correctly:
George (gob) bluth // gets "gob" as a nickname (not part of first name)
smith, john // reverses around the comma
carlos garcia y luz // gets "garcia y luz" as a last name
e.e. cummings // keeps original capitalization

I like your idea of matching all middle names as part of the first name; that way you never miss names like 'Billie Jo'. However, I'd argue that this is less of a problem than always treating middle names as parts of first names, since it's far more common to have a single-word first name. My lib is at GitHub, and of course it's open, so take or fork anything you like.

Josh Fraser said at 3:25 pm on September 10th, 2010:

Nice! Thanks for sharing. One thing I've realized is that proper parsing varies a lot depending on the context of where the names came from and how they are being used. For example, in my use-case, anything in parentheses should be ignored; in yours, it's a nickname. I guess ideally we should write a class where people can change that behavior w/ a single variable to customize it for their own purposes. Let me know if you're interested. Perhaps we could combine forces to see what we could come up with.

Michael Scott McGinn said at 2:50 pm on April 27th, 2014:

How would you solve this one in JavaScript?
Sort the given array of names by last name and then by first name, so that the result matches the target array.

//beginning
var aNames = ['Gabriel Ba','John Adams','Kieth Richards','Prince','John Adams McKensie'];
//result target
var aNames = ['John Adams','Gabriel Ba','John Adams McKensie','Prince','Kieth Richards'];

Jim said at 1:16 pm on June 10th, 2014:

Josh,

By any chance has the Name Parser been ported to C#?

Sorry to ask but I can't find anything close to what this does and I need it for .Net.

Any help is greatly appreciated,

Jim
