Splitting names

August 17, 2009


You can find the latest version of this code on GitHub. There are libraries for both PHP and JavaScript.

The quest

I’m on an ongoing search to find the best algorithm for splitting a full name into a first name and a last name. I’m sure this sounds like a ridiculously trivial quest: just explode the string on a space, right?

The challenge

But how do you tell the difference between people with double first names like Jo Ann Smith and people with double last names like Jo Von Trapp? What would you do if I gave you a double first name AND a double last name at the same time?
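To make the problem concrete, here is a minimal sketch of the naive approach, assuming the rule "the last word is the last name, everything before it is the first name". The helper name naive_split() is hypothetical and is not part of the library.

```php
<?php
// Naive splitter for illustration only: the last word becomes the last name
// and everything before it becomes the first name.
// naive_split() is a hypothetical helper, not part of the published code.
function naive_split(string $full_name): array {
    $words = preg_split('/\s+/', trim($full_name));
    $last = array_pop($words);
    return ['fname' => implode(' ', $words), 'lname' => $last];
}

// Looks right: fname "Jo Ann", lname "Smith"
var_export(naive_split("Jo Ann Smith"));
echo "\n";

// Looks wrong: fname "Jo Von", lname "Trapp" (the surname should be "Von Trapp")
var_export(naive_split("Jo Von Trapp"));
echo "\n";
```

Both inputs have three words, so no rule based only on word position can get both right. Some knowledge of the words themselves is needed.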

Did you remember that you might need to parse out prefixes (Mr, Mrs, etc) and suffixes (II, Jr, PhD, etc)?

How do you turn Paul T. S. Williams into Paul Williams while intelligently deducing that T. James Adams probably wants to go by James Adams, but T. Adams should probably stay as T. Adams?

And how do you straighten out the capitalization? I MIGHT WRITE IN ALL CAPS or all lowercase. Most names have the first letter capitalized and everything else in lowercase, but of course there are exceptions. J.P. likes to have both initials capitalized and Mr. McDonald always gets fussy when you forget to capitalize the D. Oh, and I hope you’re prepared for other anomalies like people with dashes in their name.

As you have probably realized by now, splitting a full name into its proper parts is a little more complicated than it appears on the surface.

I wrote the first version of my name-parsing algorithm two years ago and I’ve been gradually refining it ever since. It’s not perfect, but it’s improved a lot over time. I’m posting this code along with a demo in hopes that it will spur contributions to improve its accuracy even more. Throw the hardest names you know at it and let me know how it performs. I know I’m missing words for the various dictionaries of prefixes, suffixes, and compound name identifiers. Please let me know what I missed.

The demo

Splitting names demo

The algorithm

We start by splitting the full name into separate words. We then do a dictionary lookup on the first and last words to see if they are a common prefix or suffix. Next, we take the middle portion of the string (everything minus the prefix and suffix) and look at everything except the last word of that string. We loop through those words, concatenating them to build the first name. While we’re doing that, we watch for any indication of a compound last name. It turns out that almost every compound last name starts with one of 15 prefixes (Von, Van, Vere, etc). If we see one of those prefixes, we break out of the first name loop and move on to concatenating the last name. We handle the capitalization issue by checking for camel-case first: a word that already mixes upper and lower case (like McDonald) is left alone, and every other word gets its first letter uppercased and the rest lowercased. I wrote special cases for periods and dashes. We also have a couple of other special cases, like ignoring words in parentheses altogether.
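The steps above can be sketched in condensed form. This is a simplified illustration, not the full code: the word lists are deliberately tiny, initials and capitalization are left out, and sketch_split() is a hypothetical name.

```php
<?php
// Condensed sketch of the steps described above.
// Assumptions: sketch_split() and its short word lists are illustrative only;
// initials and case fixing are intentionally omitted.
function sketch_split(string $full_name): array {
    $prefixes  = ['mr', 'mrs', 'ms', 'dr'];
    $suffixes  = ['jr', 'sr', 'ii', 'iii', 'phd'];
    $compounds = ['von', 'van', 'de', 'la'];

    // normalize a word for dictionary lookups: lowercase, no periods
    $key = function (string $w): string {
        return str_replace('.', '', strtolower($w));
    };

    $words = preg_split('/\s+/', trim($full_name), -1, PREG_SPLIT_NO_EMPTY);

    // step 1: peel off a prefix and a suffix, if present
    $prefix = $suffix = '';
    if (count($words) > 1 && in_array($key($words[0]), $prefixes, true)) {
        $prefix = array_shift($words);
    }
    if (count($words) > 1 && in_array($key(end($words)), $suffixes, true)) {
        $suffix = array_pop($words);
    }

    // a single remaining word is assumed to be a first name
    if (count($words) === 1) {
        return ['prefix' => $prefix, 'fname' => $words[0], 'lname' => '', 'suffix' => $suffix];
    }

    // step 2: walk every word except the last, building the first name,
    // and stop early when a compound-surname marker shows up
    // ($i > 0 lets a marker in first position count as a first name)
    $first = [];
    $i = 0;
    for (; $i < count($words) - 1; $i++) {
        if ($i > 0 && in_array($key($words[$i]), $compounds, true)) {
            break;
        }
        $first[] = $words[$i];
    }

    // step 3: whatever is left is the last name
    $last = array_slice($words, $i);

    return [
        'prefix' => $prefix,
        'fname'  => implode(' ', $first),
        'lname'  => implode(' ', $last),
        'suffix' => $suffix,
    ];
}

var_export(sketch_split("Mr. Jo Ann Smith Jr."));
echo "\n";
var_export(sketch_split("Jo Von Trapp"));
echo "\n";
```

With these word lists, "Jo Ann Smith" keeps "Jo Ann" as the first name, while "Jo Von Trapp" breaks at "Von" and yields "Von Trapp" as the last name.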

The code

<?php

// split full names into the following parts:
// - prefix / salutation  (Mr., Mrs., etc)
// - given name / first name
// - middle initials
// - surname / last name
// - suffix (II, Phd, Jr, etc)
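function split_full_name($full_name) {
    $full_name = trim($full_name);
    // initialize the parts so the .= appends below never touch undefined variables
    $fname = $initials = $lname = "";
    $name_parts = array();
    // split into words
    $unfiltered_name_parts = explode(" ",$full_name);
    // completely ignore empty words (from repeated spaces) and any words in parentheses
    foreach ($unfiltered_name_parts as $word) {
        if ($word !== "" && $word[0] != "(")
            $name_parts[] = $word;
    }
    $num_words = sizeof($name_parts);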

    // is the first word a title? (Mr. Mrs, etc)
    $salutation = is_salutation($name_parts[0]);
    $suffix = is_suffix($name_parts[sizeof($name_parts)-1]);

    // set the range for the middle part of the name (trim prefixes & suffixes)
    $start = ($salutation) ? 1 : 0;
    $end = ($suffix) ? $num_words-1 : $num_words;

    // concat the first name
    for ($i=$start; $i < $end-1; $i++) {
        $word = $name_parts[$i];
        // move on to parsing the last name if we find an indicator of a compound last name (Von, Van, etc)
        // we use $i != $start to allow for rare cases where an indicator is actually the first name (like "Von Fabella")
        if (is_compound_lname($word) && $i != $start)
            break;
        // is it a middle initial or part of their first name?
        // if we start off with an initial, we'll call it the first name
        if (is_initial($word)) {
            // is the initial the first word?  
            if ($i == $start) {
                // if so, do a look-ahead to see if they go by their middle name
                // for ex: "R. Jason Smith" => "Jason Smith" & "R." is stored as an initial
                // but "R. J. Smith" => "R. Smith" and "J." is stored as an initial
                if (is_initial($name_parts[$i+1]))
                    $fname .= " ".strtoupper($word);
                else
                    $initials .= " ".strtoupper($word);
            // otherwise, just go ahead and save the initial
            } else {
                $initials .= " ".strtoupper($word);
            }
        } else {
            $fname .= " ".fix_case($word);
        }  
    }

    // check that we have more than 1 word in our string
    if ($end-$start > 1) {
        // concat the last name ($i carries over from the first-name loop above)
        for (; $i < $end; $i++) {
            $lname .= " ".fix_case($name_parts[$i]);
        }
    } else {
        // otherwise, single word strings are assumed to be first names
        $fname = fix_case($name_parts[$i]);
    }

    // return the various parts in an array
    $name['salutation'] = $salutation;
    $name['fname'] = trim($fname);
    $name['initials'] = trim($initials);
    $name['lname'] = trim($lname);
    $name['suffix'] = $suffix;
    return $name;
}

// detect and format standard salutations
// I'm only considering english honorifics for now & not words like
function is_salutation($word) {
    // ignore periods
    $word = str_replace('.','',strtolower($word));
    // returns normalized values
    if ($word == "mr" || $word == "master" || $word == "mister")
        return "Mr.";
    else if ($word == "mrs")
        return "Mrs.";
    else if ($word == "miss" || $word == "ms")
        return "Ms.";
    else if ($word == "dr")
        return "Dr.";
    else if ($word == "rev")
        return "Rev.";
    else if ($word == "fr")
        return "Fr.";
    else
        return false;
}

//  detect and format common suffixes
function is_suffix($word) {
    // ignore periods
    $word = str_replace('.','',$word);
    // these are some common suffixes - what am I missing?
    $suffix_array = array('I','II','III','IV','V','Senior','Junior','Jr','Sr','PhD','APR','RPh','PE','MD','MA','DMD','CME');
    foreach ($suffix_array as $suffix) {
        if (strtolower($suffix) == strtolower($word))
            return $suffix;
    }
    return false;
}

// detect compound last names like "Von Fange"
function is_compound_lname($word) {
    $word = strtolower($word);
    // these are some common prefixes that identify a compound last name - what am I missing?
    $words = array('vere','von','van','de','del','della','di','da','pietro','vanden','du','st.','st','la','ter');
    // in_array returns a real boolean; array_search returns index 0 for 'vere', which is falsy
    return in_array($word,$words);
}

// single letter, possibly followed by a period
function is_initial($word) {
    return ((strlen($word) == 1) || (strlen($word) == 2 && $word[1] == "."));
}

// detect mixed case words like "McDonald"
// returns false if the string is all one case
function is_camel_case($word) {
    if (preg_match("|[A-Z]+|s", $word) && preg_match("|[a-z]+|s", $word))
        return true;
    return false;
}

// ucfirst words split by dashes or periods
// ucfirst all upper/lower strings, but leave camelcase words alone
function fix_case($word) {
    // uppercase words split by dashes, like "Kimura-Fay"
    $word = safe_ucfirst("-",$word);
    // uppercase words split by periods, like "J.P."
    $word = safe_ucfirst(".",$word);
    return $word;
}

// helper function for fix_case
function safe_ucfirst($separator, $word) {
    // uppercase words split by the separator (ex. dashes or periods)
    $words = array();
    $parts = explode($separator,$word);
    foreach ($parts as $part) {
        $words[] = (is_camel_case($part)) ? $part : ucfirst(strtolower($part));
    }
    return implode($separator,$words);
}

?>