Detecting subdomains & effective TLDs using publicsuffix.org

March 17, 2011


How do you detect if a domain contains a subdomain and then return that subdomain? At first it seems like it should be a simple problem. Just write a regexp that looks for the first period and returns everything up to that point, right?

example.domain.com

But then you remember that subdomains can have multiple levels:

another.example.domain.com

And then you have those effective top level domains like co.uk to consider:

another.example.domain.co.uk

Hmmm, maybe this isn’t going to be so simple after all.
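
To make the failure concrete, here is a minimal sketch of that naive "everything before the first period" idea. The function name naive_subdomain is my own, made up for illustration, and the example domains are the ones used above.

```php
<?php
// Naive approach: treat everything before the first period as the subdomain.
// naive_subdomain() is a hypothetical helper, written only for this example.
function naive_subdomain($domain) {
    $pos = strpos($domain, '.');
    return $pos === false ? false : substr($domain, 0, $pos);
}

echo naive_subdomain('example.domain.com'), "\n";          // example
echo naive_subdomain('another.example.domain.com'), "\n";  // another (drops "example")
echo naive_subdomain('domain.co.uk'), "\n";                // domain (not a subdomain at all)
```

The first case looks right, the second loses a level, and the third reports a subdomain where there is none.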

There are a couple of ways to look at this problem. One way is to ask: how do you know which part of a domain is the registered domain? Once you know the registered domain, you can figure out the subdomain.

One way of finding the registered domain is to do a whois lookup on the entire domain. If the whois comes up empty, strip the leftmost subdomain and try the whois lookup again. Continue this process until you get a match and that is your registered domain. While this method works, it comes at the cost of multiple network lookups.
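
A rough sketch of that stripping loop is below. It is only an illustration: find_registered_domain and the $is_registered callback are names I made up, and the callback stands in for whatever whois client you would actually use. The stub here fakes the lookups so the loop can run without touching the network.

```php
<?php
// $is_registered is a hypothetical callback that would perform the whois
// lookup and return true when a record exists for the given name.
function find_registered_domain($domain, callable $is_registered) {
    $labels = explode('.', strtolower($domain));
    while (count($labels) > 1) {
        $candidate = implode('.', $labels);
        if ($is_registered($candidate)) {
            return $candidate;
        }
        array_shift($labels); // strip the leftmost label and try again
    }
    return false; // nothing matched before reaching the bare TLD
}

// Stub standing in for real whois queries, for illustration only.
$lookups = 0;
$fake_whois = function ($name) use (&$lookups) {
    $lookups++;
    return $name === 'domain.co.uk';
};

echo find_registered_domain('another.example.domain.co.uk', $fake_whois), "\n"; // domain.co.uk
echo $lookups, "\n"; // 3
```

Even in this small example it takes three lookups to resolve one name, which is the cost mentioned above.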

Thankfully, there is a better way. It turns out browser vendors need to be able to parse domains too. The Mozilla Foundation started a cross-vendor initiative, publicsuffix.org, which maintains a list of all the effective top level domains. Browsers rely on this list to restrict cookies to a given domain. For example, without the public suffix list, a site on a co.uk domain could set a cookie for co.uk itself, and that cookie would then be sent to every other co.uk site.

To figure out the registered domain, you need to know all the possible top level domains (like .uk) as well as all the effective top level domains (like .co.uk). There are some great libraries available that make it easy for you to parse the public suffix list to find a registered domain. Google has one in Java and regdom has libraries available in C, Perl and PHP.
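
To show the idea behind those libraries, here is a simplified sketch of suffix matching against a tiny hand-written list. Treat it as an assumption-laden toy: the function name and the three sample entries are mine, and the real public suffix list is far longer and also contains wildcard and exception rules that this sketch ignores. It is not how regdom or any other library is actually implemented.

```php
<?php
// Hand-picked sample entries, for illustration only. The real list also has
// wildcard (*.) and exception (!) rules, which are not handled here.
$sample_suffixes = ['com', 'uk', 'co.uk'];

// Hypothetical helper: returns the registered domain, or false if none is found.
function registered_domain_from_list($domain, array $suffixes) {
    $labels = explode('.', strtolower($domain));
    // Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for ($i = 0; $i < count($labels); $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            if ($i === 0) {
                return false; // the name is itself a public suffix
            }
            // The registered domain is the suffix plus one more label.
            return implode('.', array_slice($labels, $i - 1));
        }
    }
    return false; // no known suffix matched
}

echo registered_domain_from_list('another.example.domain.co.uk', $sample_suffixes), "\n"; // domain.co.uk
echo registered_domain_from_list('example.domain.com', $sample_suffixes), "\n";           // domain.com
var_dump(registered_domain_from_list('co.uk', $sample_suffixes));                         // bool(false)
```

Checking the longest suffix first is what keeps domain.co.uk from being misread as a subdomain of co.uk.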

While it feels a little inelegant to keep a list of effective TLDs that has to be updated every now and then, it really is your best option for now. Using this list, it’s trivial to get a subdomain from a domain. For example, using the PHP library from regdom, the code to check for and return a subdomain is simple:

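// Assumes regdom's PHP library is already loaded, so getRegisteredDomain() is available.
function has_subdomain($domain) {
    $registered_domain = getRegisteredDomain($domain);
    // An empty result means no registered domain was found (for example, a bare public suffix).
    return !empty($registered_domain) && strcasecmp($domain, $registered_domain) !== 0;
}

function get_subdomain($domain) {
    if (!has_subdomain($domain)) {
        return false;
    }
    $registered_domain = getRegisteredDomain($domain);
    // Cut ".registered-domain" off the end only; str_replace() would also
    // remove any earlier occurrence of it inside the subdomain.
    return substr($domain, 0, -strlen("." . $registered_domain));
}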

The public suffix list is incredibly useful and there are probably many more applications for it that I haven’t considered. I’m surprised it doesn’t get more attention from developers. If you need to do anything that requires you to know a registered domain, the public suffix list is the best place to start.