Thursday, August 18, 2016

How to gather domains belonging to a Top Level Domain (TLD)

I was recently looking for a list of all domains belonging to a country. I tried the usual methods of googling for lists and came up empty; the sites I found that claimed to have such lists were asking for money... and you know how that usually goes. So I took it upon myself to scrape the Internet and find as many domains belonging to a country as possible. If anyone has a more elegant way of getting these kinds of lists that doesn't involve paying money, I'm all ears. For example, if you wanted to find the domains that belong to a country like Uganda, you'd have to find as many .ug domains as possible, or if you're German, find as many .de sites as possible. For this exercise I didn't bother including the .com and .net domains and the like, for obvious reasons. I have picked Kenya (.ke) as the country of interest. Without further ado let's dive in, but before we start, certain assumptions have to be made:
1.    All sites ending with .ke belong to Kenya
2.    All the domains consist of three parts, for example ku.ac.ke
3.    We don't have any money to use the paid services (not really an assumption :-))

My first order of business was to collect the top visited sites for that country from Alexa. Alexa lists the top 500 sites visited in each country, and what better way to start finding domains than by going to a site that has already done part of the work for you. With this, I quickly fired up Python and wrote a simple script to give us the domains in a nicely formatted list. I leveraged the Beautiful Soup library, which has amazing capabilities when it comes to scraping web pages.
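A minimal sketch of what topKEwebsites.py might look like is below. It assumes Alexa's per-country listing lives at www.alexa.com/topsites/countries;<page>/KE with 20 pages of 25 entries each, and that each site name sits inside a "DescriptionCell" div; both the URL pattern and the selector are assumptions about the page structure at the time of writing.

# topKEwebsites.py -- sketch of a scraper for Alexa's top 500 Kenyan sites.
# The URL pattern and the 'DescriptionCell' selector are assumptions about
# the page layout; any change to it will break the scraper.
import urllib2
from bs4 import BeautifulSoup

BASE_URL = 'http://www.alexa.com/topsites/countries;{0}/KE'

sites = []
for page in range(20):  # 20 pages x 25 entries = top 500 sites
    html = urllib2.urlopen(BASE_URL.format(page)).read()
    soup = BeautifulSoup(html, 'html.parser')
    for cell in soup.find_all('div', class_='DescriptionCell'):
        link = cell.find('a')
        if link:
            sites.append(link.get_text().strip())

with open('top500kenyansites.txt', 'w') as out:
    out.write('\n'.join(sites) + '\n')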


Disclaimer: Any change in the structure of the web page will break the scraper, so if Alexa has changed its website's structure by the time you read this post, the scraper will cease to work as expected.

Run the above script.

roman@ubuntu /tmp $ python topKEwebsites.py 

This gives us a nice list like the one below. (I've trimmed it for brevity)

We'll check our top500kenyansites.txt list to make sure it contains 500 sites.

roman@ubuntu /tmp $ cat top500kenyansites.txt | wc -l
500

With this, we'll use bash to keep only the entries ending in .ke. The one-liner below should suffice, leaving us with 66 domains.

roman@ubuntu /tmp $ cat top500kenyansites.txt | grep '\.ke$' > alexalist.txt
roman@ubuntu /tmp $ cat alexalist.txt | wc -l
66

Next we're going to scrape Google to find as many .ke domains as possible. For this post I'll use only Google, but if you want to be more thorough in your results you can leverage other search engines as well; the concepts are the same.

For this next step there are a number of ways of achieving this. You could use the Google API, but it is heavily limiting in terms of the number of results you get back; when I tried it, I was limited to about 10 results, which is hardly helpful since we're trying to get all the domains Google has indexed for the TLD in question. We could also pay for services like import[.]io, but that goes against one of the goals of this post (not spending money in the process of getting our list). Another way would be to scrape the Google results pages, as I first did when I attempted this exercise, but that is now a dead end as well: what comes back from the HTTP request is meant for a browser to render, and Beautiful Soup doesn't render pages, so we'd probably end up with empty containers that are only filled in dynamically as the page renders. Google probably did this to thwart attempts like this one. I finally settled on the method used by Chris Ains. With his method you need the Chrome browser, since he wrote a bookmarklet that extracts all the URLs from a particular search. Follow the steps listed in the link below.

http://www.chrisains.com/seo-tools/extract-urls-from-web-serps/

Once your browser is all set, we'll use the Google search operator "site:.ke". This should give us results like those in the screenshot below.

[Screenshot: Google search results for the site:.ke query]

From this we can see the various sub-domains for the .ke TLD, i.e. .go.ke, .ac.ke, .or.ke and .me.ke. We'll repeat the search for each of the other sub-domains we have found, as shown below. Don't forget to scroll to the end of each set of search results before clicking the bookmarklet, as stated in Chris' steps.
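The follow-up searches are just the same operator with each sub-domain we spotted substituted in, for example:

site:.go.ke
site:.ac.ke
site:.or.ke
site:.me.ke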

Go to the results processed by the bookmarklet and scroll to the URL list section. Copy all the URLs returned by the various sub-domain searches and put them in a text file (kenyaURLs.txt).

roman@ubuntu /tmp $ cat kenyaURLs.txt | wc -l
3249

This gave me 3249 URLs. The next step is to extract the domain names from all the URLs: we'll strip the scheme and paths, keeping only the host part of each URL, and remove duplicates. This leaves us with 1216 domains.

roman@ubuntu /tmp $ cat kenyaURLs.txt | cut -d"/" -f 3 | sort -u > uniqueDomains.txt
roman@ubuntu /tmp $ cat uniqueDomains.txt | wc -l
1216
roman@ubuntu /tmp $ cat uniqueDomains.txt | head
aasciences.ac.ke
abdalla.me.ke
about.me.ke
academics.uonbi.ac.ke
accommodation.ku.ac.ke
account.ecitizen.go.ke
acorce.nca.go.ke
actuarieskenya.or.ke
adc.or.ke
adis.uonbi.ac.ke

We quickly run into a problem: for example, academics.uonbi.ac.ke and adis.uonbi.ac.ke both belong to the same domain but are listed as separate entries in our list.

roman@ubuntu /tmp $ sed -e 's/./\L&/' alexalist.txt >> uniqueDomains.txt 
roman@ubuntu /tmp $ cat uniqueDomains.txt | wc -l
1282
roman@ubuntu /tmp $ cat uniqueDomains.txt | sort -u | wc -l
1278
roman@ubuntu /tmp $ cat uniqueDomains.txt | sort -u > unifiedlist.txt
roman@ubuntu /tmp $ cat unifiedlist.txt | wc -l
1278
roman@ubuntu /tmp $ sed 's/^\( *\).*\.\(.*\..*\.\)/\1\2/' unifiedlist.txt | sort -u > FINAL.txt
roman@ubuntu /tmp $ cat FINAL.txt | wc -l
958

In the snippets above, I appended our first list (the one we got from Alexa) to the uniqueDomains.txt list. I first changed the first character of each Alexa entry to lowercase, since they all started with an uppercase letter and sort would otherwise have treated them differently during de-duplication.
I then sorted the combined list so that duplicates were removed, and finally, with that somewhat esoteric sed command, stripped the sub-domain prefixes (remember our academics.uonbi.ac.ke and adis.uonbi.ac.ke problem? This snippet solves it), keeping only the last three parts of each name; a quick check is shown below. We end up with a nice list of 958 unique .ke domains. You can find the list of the domains as well as the code snippets on my GitHub.
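As a quick sanity check of what that last sed expression does, running it over the two problem entries from earlier trims each of them down to its last three parts, so they collapse into a single domain:

roman@ubuntu /tmp $ echo 'academics.uonbi.ac.ke' | sed 's/^\( *\).*\.\(.*\..*\.\)/\1\2/'
uonbi.ac.ke
roman@ubuntu /tmp $ echo 'adis.uonbi.ac.ke' | sed 's/^\( *\).*\.\(.*\..*\.\)/\1\2/'
uonbi.ac.ke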