How do we get from URL to the content we want to see?
You type twitter.com in your browser, and within a second or two the page loads and, boom, you’re off exploring the latest trending hashtags and quick blurbs from those you follow. This is a very normal everyday occurrence, at least if you are a Twitter user, but have you ever considered how you get from your computer to twitter.com? What exactly is the process of visiting a webpage on the internet?
Before we dive into the web stack and how a page on the internet is served, let’s cover the very basics. The client-server model describes how you, as a client, request information from a server and how that content is served to you over the internet. The client is typically the browser you use to access content, i.e., websites. The server is the software that serves the content you request. The word server can also refer to the physical hardware that the server software runs on. A data center, essentially a large tech warehouse, often holds thousands of server racks, each a tower of several server computers. These data centers serve many thousands and sometimes many millions of requests for content a day, and they keep the internet up and running.
Domain Name System (DNS)
When we go looking for twitter.com, it is important to know that addresses on the internet are not the same as the URLs we recognize. While you are looking for www.twitter.com, behind the scenes your browser is navigating to an address like 199.59.148.10, the Internet Protocol (IP) address for Twitter. It’s easy to see why we as humans prefer to remember www.twitter.com instead of 199.59.148.10, but that convenience means our browser must work out where we want to go from the human-readable URL we type. In finding the IP address, our first stop is the browser’s cache. Whether you use Chrome, Firefox, Edge, or anything else, the first place a browser looks for an IP address is its own cache of recently visited sites. If you have not visited the site before, or if you have cleared your browser’s cache, the browser will not find the IP address there and will check with your operating system. The operating system, like the browser, checks its own cache to see if it has the IP address for the website you want to visit. If that fails too, the operating system reaches out to a name server, usually one run by your Internet Service Provider (ISP). A name server is a bit like an address book: its main job is to help us find the IP address for the site we are seeking. Most ISP name servers are configured as recursive resolvers, which means they can both give an answer themselves and send queries to other name servers if they don’t know the answer. We’ll call the ISP name server Resolver for short.
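The order of those stops (browser cache, then operating system cache, then Resolver) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `CacheLayer`, `lookup`, and `fake_resolver` are names made up for this post, and the caches are plain dictionaries rather than anything a real browser or OS exposes.

```python
class CacheLayer:
    """One stop on the lookup path: a named cache of hostname -> IP address."""

    def __init__(self, name):
        self.name = name
        self.entries = {}

    def get(self, hostname):
        return self.entries.get(hostname)

    def put(self, hostname, ip):
        self.entries[hostname] = ip


def lookup(hostname, layers, ask_resolver):
    """Check each cache in order; fall back to the resolver on a full miss.

    Returns (ip, source), where source names whoever supplied the answer.
    """
    for position, layer in enumerate(layers):
        ip = layer.get(hostname)
        if ip is not None:
            # Copy the answer into the caches that sit closer to the user.
            for nearer in layers[:position]:
                nearer.put(hostname, ip)
            return ip, layer.name

    # Every cache missed, so ask the (hypothetical) ISP resolver.
    ip = ask_resolver(hostname)
    for layer in layers:
        layer.put(hostname, ip)
    return ip, "resolver"


def fake_resolver(hostname):
    # Hypothetical stand-in for the ISP name server; always gives the
    # address used as the example in this post.
    return "199.59.148.10"


browser_cache = CacheLayer("browser cache")
os_cache = CacheLayer("OS cache")
layers = [browser_cache, os_cache]

print(lookup("www.twitter.com", layers, fake_resolver))  # first visit: resolver answers
print(lookup("www.twitter.com", layers, fake_resolver))  # second visit: browser cache answers
```

In real Python code you would not write any of this yourself: a call such as `socket.getaddrinfo("www.twitter.com", 443)` hands the question to the operating system, which takes care of its own cache and the trip to the resolver.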
Resolver has its own cache of IP addresses for websites that customers of the ISP have recently visited or queried. Once again, if the IP address is in the cache, it is returned to the operating system, which gives it to the browser, and off we go. If not, Resolver at least knows where to look next: a root name server. There are 13 named root servers (each one actually a cluster of many machines spread around the world), and collectively they form the root of DNS. The root servers sit atop the DNS hierarchy, and their only role is to point queries to the correct Top-Level Domain (TLD) servers. A top-level domain is what we know as the ending on a website address, such as .com, .org, .net, .gov, and country codes like .uk, .de, and .jp. While .com is one of the oldest TLDs, created in 1985, today we are getting even more TLDs with additions like .tech, .quest, .cool, and many more. TLD servers, much like root servers, have a simple job, which is to point to the Authoritative Name Server. The TLD servers always know where the Authoritative Name Servers are thanks to the registrar. Whenever a domain name is purchased, its information, such as the domain’s responsible party and contact details as well as its name servers, is shared via the registrar with the correct TLD.
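One way to picture the hierarchy is to read a hostname from right to left. The small sketch below does just that; the function name `dns_hierarchy` is my own invention for illustration, and it only splits text, it does not contact any server.

```python
def dns_hierarchy(hostname: str) -> list:
    """List the zones a lookup passes through, starting from the root."""
    labels = hostname.rstrip(".").split(".")
    zones = ["."]  # the root zone sits above every TLD
    # Walk the labels right to left: TLD first, then each more specific name.
    for start in range(len(labels) - 1, -1, -1):
        zones.append(".".join(labels[start:]))
    return zones


print(dns_hierarchy("www.twitter.com"))
# ['.', 'com', 'twitter.com', 'www.twitter.com']
```

Each entry in the list is one level lower in the hierarchy: the root points to the .com TLD servers, which point to the name servers for twitter.com, which know about www.twitter.com.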
The Authoritative Name Servers are normally run by the company that owns the website you’re seeking (if the company is big enough) or, if not by the owner directly, then by the host of the website. The naming convention for an Authoritative Name Server is therefore typically along the lines of ns for name server, a number identifying which server it is, and then the domain of the site. For example, ns01.twitter.com would be one of many name servers that run so that queries for the IP address of www.twitter.com can always be answered, even if one of the name servers is down for any reason. The Authoritative Name Server (ANS) will definitely have the IP address we are looking for, so this is where Resolver gets its answer and caches it. Resolver then hands the IP address to our OS, which caches it and feeds it to the browser. One of the truly amazing things is that all this happens in fractions of a second, and off we go to visit the site.
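To tie the DNS walk together, here is a toy version of Resolver following the chain from root to TLD to authoritative name server. Everything in it is an assumption made for illustration: the three dictionaries are hard-coded stand-ins for real name servers, the server labels such as `tld-com` are invented, and the code assumes a simple two-part domain like twitter.com (it would not handle something like example.co.uk).

```python
# Hypothetical, hard-coded data standing in for real name servers.
ROOT_SERVER = {"com": "tld-com", "org": "tld-org"}
TLD_SERVERS = {
    "tld-com": {"twitter.com": "ns01.twitter.com"},
    "tld-org": {},
}
AUTHORITATIVE_SERVERS = {
    "ns01.twitter.com": {"www.twitter.com": "199.59.148.10"},
}


class Resolver:
    """Toy recursive resolver: checks its cache, then follows referrals."""

    def __init__(self):
        self.cache = {}
        self.queries_sent = 0

    def resolve(self, hostname):
        if hostname in self.cache:
            return self.cache[hostname]

        labels = hostname.split(".")
        tld = labels[-1]
        domain = ".".join(labels[-2:])  # assumes a two-label domain

        # 1. Ask the root which server handles this TLD.
        self.queries_sent += 1
        tld_server = ROOT_SERVER[tld]

        # 2. Ask the TLD server which name server is authoritative.
        self.queries_sent += 1
        authoritative = TLD_SERVERS[tld_server][domain]

        # 3. Ask the authoritative name server for the actual IP address.
        self.queries_sent += 1
        ip = AUTHORITATIVE_SERVERS[authoritative][hostname]

        self.cache[hostname] = ip
        return ip


resolver = Resolver()
print(resolver.resolve("www.twitter.com"), resolver.queries_sent)  # 199.59.148.10 3
print(resolver.resolve("www.twitter.com"), resolver.queries_sent)  # 199.59.148.10 3
```

The second call sends no new queries because the answer is already in Resolver’s cache, which is exactly why repeat visits feel instant.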
Now that we know where we are going, it’s time to connect and visit our webpage. This process starts with a synchronization (SYN) signal sent from our browser to the server at the IP address we got from the resolver. The SYN is sent using TCP, the Transmission Control Protocol, a part of the Internet Protocol suite, which determines how data is sent over the internet. The super short and sweet definition: TCP is a protocol for how information is broken down into packets and sent over a network, and how the receiver confirms everything was received properly. Once the original SYN is received, the server sends back an acknowledgment (ACK) that it received the SYN from the browser, plus its own SYN. Our browser then sends back an ACK that it received the SYN from the server. This exchange is known as the three-way handshake.
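The SYN, SYN plus ACK, ACK exchange is easier to follow with numbers attached. In real TCP each side picks an initial sequence number, and an ACK always carries the other side’s sequence number plus one. The sketch below models only that bookkeeping; the `Segment` class, the function name, and the sample numbers 1000 and 5000 are all made up for illustration, and no real packets are sent.

```python
from dataclasses import dataclass
from typing import List, Optional

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


@dataclass(frozen=True)
class Segment:
    sender: str
    flags: str
    seq: int
    ack: Optional[int] = None


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three segments that open a TCP connection."""
    # 1. Browser -> server: SYN carrying the browser's initial sequence number.
    syn = Segment("browser", "SYN", seq=client_isn)

    # 2. Server -> browser: its own SYN, plus an ACK of the browser's SYN.
    syn_ack = Segment("server", "SYN+ACK", seq=server_isn,
                      ack=(syn.seq + 1) % SEQ_SPACE)

    # 3. Browser -> server: ACK of the server's SYN.
    ack = Segment("browser", "ACK", seq=(syn.seq + 1) % SEQ_SPACE,
                  ack=(syn_ack.seq + 1) % SEQ_SPACE)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

In everyday code this all happens out of sight: when a program calls something like `socket.create_connection((host, port))`, the operating system performs the handshake before the call returns.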
If we are dealing with standard HTTP (Hypertext Transfer Protocol), this is where we are finally in contact with the server that will be able to serve us the webpage we have been after. However, if we are working with HTTPS, the secure, encrypted version of HTTP, we have a few more steps. With HTTPS, once the TCP connection is established, the browser sends a hello message listing the ciphers it supports, and the server replies with its choice of cipher and its SSL certificate, or Secure Sockets Layer certificate. (Modern connections actually use TLS, the successor to SSL, but the SSL name has stuck.) SSL certificates are small data files that digitally bind a cryptographic key to an organization’s details. The browser verifies the certificate and, using the public key included in it, works with the server to agree on shared encryption keys; it then switches to the agreed cipher and tells the server it has done so. The server confirms the switch in the same way. Now all of the traffic between our browser (the client) and the server is encrypted.
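If you want to watch this step from code, Python’s standard `ssl` module can play the role of the browser. The sketch below is a minimal example under a few assumptions: the helper names `make_client_context` and `fetch_certificate_details` are my own, the returned dictionary layout is just one convenient choice, and the second function needs a working internet connection, so it is defined here but not called.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    # create_default_context() loads the system's trusted certificate
    # authorities and turns on certificate and hostname verification.
    return ssl.create_default_context()


def fetch_certificate_details(hostname: str, port: int = 443,
                              timeout: float = 5.0) -> dict:
    """Open a TCP connection, run the TLS handshake, and report what was agreed."""
    context = make_client_context()
    # Step 1: the TCP handshake.
    with socket.create_connection((hostname, port), timeout=timeout) as tcp_sock:
        # Step 2: the TLS handshake, including certificate verification.
        with context.wrap_socket(tcp_sock, server_hostname=hostname) as tls_sock:
            cert = tls_sock.getpeercert()
            return {
                "protocol": tls_sock.version(),      # negotiated protocol version
                "cipher": tls_sock.cipher()[0],      # name of the agreed cipher
                "subject": cert.get("subject"),      # who the certificate was issued to
                "expires": cert.get("notAfter"),     # certificate expiry date
            }


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Calling `fetch_certificate_details("twitter.com")` on a connected machine should return the negotiated protocol, the cipher, and the certificate’s subject and expiry date; if the certificate cannot be verified, `wrap_socket` raises an error instead of quietly continuing.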
Another layer of security that we may need to get through, if it is configured, is a firewall. A firewall is a piece of software, or occasionally a piece of hardware, that prevents malicious connections and malicious data from entering a network. The first and simplest method for a software firewall is to allow connections only on a few dedicated ports. Typical connection ports include 22 for SSH (Secure Shell), 80 for standard HTTP, and 443 for HTTPS. By allowing connections only over these ports and rejecting all other connection requests, the server is already far safer, as it has less traffic to monitor and less ‘surface area of attack’. Beyond limiting the points of connection, some firewalls also have a filtering system that scans the contents of packets flowing into and out of the network, looking for anything suspicious when compared against known malicious code.
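Both ideas, the port allow-list and the content filter, fit in a short sketch. Treat it as a toy model only: the function names are invented for this post, and the byte patterns in `KNOWN_BAD_SIGNATURES` are hypothetical placeholders, not entries from any real signature database.

```python
# Ports the firewall lets through, as listed above.
ALLOWED_PORTS = {22: "SSH", 80: "HTTP", 443: "HTTPS"}

# Hypothetical byte patterns standing in for a real signature database.
KNOWN_BAD_SIGNATURES = [b"DROP TABLE users", b"<script>steal()</script>"]


def port_is_open(dest_port: int) -> bool:
    """First line of defense: only a few dedicated ports accept connections."""
    return dest_port in ALLOWED_PORTS


def payload_is_clean(payload: bytes) -> bool:
    """Second line of defense: compare packet contents against known-bad patterns."""
    return not any(signature in payload for signature in KNOWN_BAD_SIGNATURES)


def firewall_decision(dest_port: int, payload: bytes = b"") -> str:
    if not port_is_open(dest_port):
        return "REJECT: port closed"
    if not payload_is_clean(payload):
        return "REJECT: suspicious payload"
    return "ACCEPT"


print(firewall_decision(443))                             # ACCEPT
print(firewall_decision(3306))                            # REJECT: port closed
print(firewall_decision(80, b"id=1; DROP TABLE users"))   # REJECT: suspicious payload
```

Checking the port first is the cheap test, so most unwanted traffic is turned away before the more expensive content scan ever runs.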
When it comes to the SSL certificate and the firewall, in a standard multi-server setup both normally live on the load-balancing server. Let’s take a step back and talk about server architecture. While small sites may need only a single server to handle all client requests, medium and large sites often need many servers to handle all the traffic the site gets. To use multiple servers effectively, you need traffic routed by a load balancer. A load balancer uses an algorithm to assign incoming requests to the servers it manages. The algorithm can be something like a naive round robin, which just hands requests to each server in turn, or something more complex like a resource-based adaptive approach, which knows the capabilities of each server and how much strain it is under and then assigns traffic to the server with the most available resources. There are many ways a load balancer can go about its work, but the bottom line is that it distributes traffic to balance the load on its servers.
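Here is what those two styles of algorithm might look like in miniature. The class names and the server names `web-01` to `web-03` are invented for illustration. Note that the second class is a deliberately simplified stand-in for a resource-based approach: it only counts active requests per server, whereas a real adaptive balancer would look at measures such as CPU and memory.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Naive approach: hand each new request to the next server in the list."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


class LeastBusyBalancer:
    """Simplified load-aware approach: pick the server with the fewest active requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # On a tie, min() returns the server that was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes so the server's load goes back down."""
        self.active[server] -= 1


servers = ["web-01", "web-02", "web-03"]

round_robin = RoundRobinBalancer(servers)
print([round_robin.pick() for _ in range(4)])
# ['web-01', 'web-02', 'web-03', 'web-01']

least_busy = LeastBusyBalancer(servers)
first = least_busy.pick()    # web-01
second = least_busy.pick()   # web-02
least_busy.release(first)    # web-01 finishes its request
print(least_busy.pick())     # web-01 again, since it is idle
```

Round robin is trivially simple but blind to how hard each server is working; the load-aware version costs a little bookkeeping in exchange for not piling requests onto a busy machine.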
The servers themselves are first and foremost hardware: special-purpose computers designed to serve content over the web. Unlike the computer you are probably reading this on, server computers are normally very powerful, specialized machines with a lot of computing power, storage, and redundant systems to prevent any single hardware point of failure. Beyond the hardware, a server for our purposes is a web server, a piece of software meant for serving the content of a webpage. If we have a very simple static page made up of HTML and CSS, a web server such as Nginx or Apache can simply serve this content from the codebase that has been uploaded to our server, and that will be it: the web server sends the content back to the client browser, which then renders the page for you, the user. If, however, we have a more dynamic page that loads things that update frequently, like a social media timeline, or has user-specific features, like a forum or settings page, we will need to build dynamic content using an application server. An application server serves content like a web server does, but it does so by running application code. In our case, the application server constructs dynamic content by running PHP code from our codebase, and the content is built from data in our database, which is queried using SQL. Here, our web server loads the HTML and CSS content and calls on the application server to run the PHP and fetch the data via SQL, then everything is sent back to the client browser to be rendered for you, the user.
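The split between static and dynamic content can be shown with a small sketch. To keep it self-contained, it swaps in Python for the PHP described above and uses an in-memory SQLite database in place of a real database server; the paths, table name, and sample rows are all made up for illustration.

```python
import sqlite3
from html import escape

# Static content: files that are sent back exactly as they were uploaded.
STATIC_FILES = {"/about.html": "<h1>About</h1>"}


def make_database() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a couple of sample posts."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (author TEXT, body TEXT)")
    db.executemany(
        "INSERT INTO posts (author, body) VALUES (?, ?)",
        [("alice", "Hello world"), ("bob", "Learning DNS")],
    )
    return db


def application_server(db: sqlite3.Connection) -> str:
    """Build a page on the fly from whatever is in the database right now."""
    rows = db.execute("SELECT author, body FROM posts ORDER BY rowid").fetchall()
    items = "".join(
        f"<li>{escape(author)}: {escape(body)}</li>" for author, body in rows
    )
    return f"<ul>{items}</ul>"


def web_server(db: sqlite3.Connection, path: str):
    """Serve static files directly; hand dynamic pages to the application server."""
    if path in STATIC_FILES:
        return 200, STATIC_FILES[path]
    if path == "/timeline":
        return 200, application_server(db)
    return 404, "<h1>Not Found</h1>"


db = make_database()
print(web_server(db, "/about.html"))  # static: same bytes every time
print(web_server(db, "/timeline"))    # dynamic: rebuilt from the database
print(web_server(db, "/missing"))     # neither: 404
```

Add a row to the `posts` table and `/timeline` changes on the next request, while `/about.html` stays the same until someone uploads a new file. That is the whole difference between the two kinds of content.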
Considering we are visiting a site like Twitter, we are absolutely talking about hundreds or maybe thousands of servers, all managed by load balancers, with web servers (software) working with application servers (more software) to serve us both static content (the look and structure of Twitter) and dynamic content (Tweets, users, hashtags) that is stored in some sort of database (usually another type of server and another type of software).
So congratulations, we’ve made it all the way from typing a URL in our browser to a loaded website! The final offering I have on this adventure is a diagram I made of the whole process, which is embedded just below. There are a ton of great resources that cover each step of this process in much greater detail, a few of which are linked in the sources. Thanks for learning along with me, and let me know what you think in the comments.
Sources and References
Root name server
SSL Certificate - SSL Information Center | GlobalSign
How HTTPS (SSL) Works 🔐 & Differs From HTTP