Posted August 11th, 2011 by Jason N
The Internet is a complex place, and if you haven’t already, you should go read part 1 of Explaining the Internet.
This article will cover the real in-depth details regarding the Internet.
So I’ve already covered that the Internet is a way of communicating between computers, but how do these computers ‘talk’ exactly? Well, they actually communicate in binary, which is transmitted as electrical, light or radio (including satellite) signals. The signal is really a series of changes in voltage, light, etc., which the computer interprets as a series of values, each either one or zero. A single one of these values is called a ‘bit’. Eight bits make a ‘byte’, one thousand and twenty-four (1024) bytes make a ‘kilobyte’, another 1024 kilobytes make a ‘megabyte’, and this pattern of 1024 goes on and on. Most kinds of data are transmitted as bytes, and these bytes are then stored in a ‘buffer’ where they wait to be processed by a program. The programs that process these bytes can be anything from a web browser to an email program to an online game, all of which communicate differently while using the same means of travel.
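To make the bit and byte arithmetic concrete, here is a small Python sketch. The function names are made up for illustration, and it uses the 1024-based units described above.

```python
# Units as described above: 8 bits per byte, 1024 bytes per kilobyte, and so on.
BITS_PER_BYTE = 8
KILOBYTE = 1024
MEGABYTE = 1024 * KILOBYTE


def to_bits(value: int) -> str:
    """Return the eight ones and zeros that make up a single byte (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds a value from 0 to 255")
    return format(value, "08b")


def megabytes_to_bytes(megabytes: int) -> int:
    """Convert a number of megabytes into bytes using the 1024 pattern."""
    return megabytes * MEGABYTE


if __name__ == "__main__":
    print(to_bits(72))                              # 01001000
    print(megabytes_to_bytes(1))                    # 1048576
    print(megabytes_to_bytes(1) * BITS_PER_BYTE)    # 8388608 bits in a megabyte
```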
Explaining the protocols – So what’s this stuff about HTTP, FTP, SMTP, POP3, and all of the other abbreviations on the Internet called ‘protocols’? Well, I went into it a bit in my last article about the Internet, but here I’ll go a bit deeper into the HTTP protocol.
HTTP stands for Hypertext Transfer Protocol. It’s the protocol behind the World Wide Web, and it makes sure that all the websites and web browsers can understand each other correctly. As the name suggests, it’s a text-based protocol, and most data is sent as text. But wait… You just said that data is transmitted as bytes, so how do we get this text to be sent as binary data? Well, there’s another standard, called a character encoding, which describes how to change text characters into binary data and back again. ASCII is the classic example, and it is what HTTP headers are written in. Other character encodings, such as UTF-8 (an encoding of Unicode) or the Windows ‘ANSI’ code pages, are also used.
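Here is a rough Python sketch of what a character encoding does: it turns text into bytes and back again. The helper names are invented for this example. Python's built-in `str.encode` and `bytes.decode` do the real work.

```python
def text_to_bits(text: str, encoding: str = "utf-8") -> list:
    """Encode text into bytes, then show each byte as eight bits."""
    return [format(byte, "08b") for byte in text.encode(encoding)]


def bits_to_text(bits: list, encoding: str = "utf-8") -> str:
    """Reverse the process: bits back to bytes, bytes back to text."""
    return bytes(int(b, 2) for b in bits).decode(encoding)


if __name__ == "__main__":
    bits = text_to_bits("GET")
    print(bits)                 # ['01000111', '01000101', '01010100']
    print(bits_to_text(bits))   # GET

    # The same character can take a different number of bytes
    # depending on the encoding that is chosen.
    print(len("\u00e9".encode("utf-8")))    # 2
    print(len("\u00e9".encode("latin-1")))  # 1
```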
The actual HTTP protocol uses a ‘header’ followed by the content or data that is being transmitted. A header is a couple of lines that describe the kind of content being sent. It also contains details about the browser, any cookies that belong to the site, timestamps, etc. These work in a format like this:
This is sent to the server:
GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0
Accept-Encoding: gzip, deflate
So what does this mean?
- Well, the first line says to get the page / using the protocol HTTP 1.1. The page / is the page that you get when you ask for the site itself without naming a particular file. In this example, the browser is asking for http://www.cyderize.org/. If it said ‘GET /folder/file.html HTTP/1.1’ then it would be asking for the URL http://www.cyderize.org/folder/file.html.
- The second line specifies the host domain or IP address that the website resides on. In this case, www.cyderize.org. It could be any domain name, which gets resolved to an IP address by a DNS request (more on that in my next article), or an IP address itself.
- The third line tells the website what kind of browser and operating system is being used, in case there are special requirements for different computers and browsers. This one indicates Mozilla Firefox 5.0 (built on the Gecko engine) and Windows NT version 6.1 (a.k.a. Windows 7).
- The next four lines let the server know what kinds of character sets, languages, and types of files the browser will accept.
- The last line tells the server to keep the connection open for future communications.
That’s what’s called an HTTP ‘request’ because it’s requesting data from the server.
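As a rough illustration, a request like the one described above could be put together in Python along these lines. `build_request` is a hypothetical helper, not part of any library, and it only builds the bytes; it doesn't send anything over the network.

```python
def build_request(host, path="/", extra_headers=None):
    """Build the bytes of a simple HTTP/1.1 GET request (a sketch only)."""
    lines = [
        "GET {} HTTP/1.1".format(path),  # what to get, and which protocol version
        "Host: {}".format(host),         # which website the request is for
    ]
    for name, value in (extra_headers or {}).items():
        lines.append("{}: {}".format(name, value))
    # Header lines end with CRLF, and an empty line marks the end of the header.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    request = build_request(
        "www.cyderize.org",
        "/folder/file.html",
        {"Accept-Encoding": "gzip, deflate", "Connection": "keep-alive"},
    )
    print(request.decode("ascii"))
```

Changing the `path` argument changes the first line of the request, which is how the browser asks for different pages on the same site.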
The server would then give a ‘response’ to the browser:
HTTP/1.1 200 OK
Date: Thu, 11 Aug 2011 11:25:49 GMT
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Last-Modified: Thu, 11 Aug 2011 11:25:49 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8
- The first line is an ‘okay’ response, meaning that the server understands HTTP 1.1 and will work correctly with the request. The 200 signifies a successful request. If it’s another number, such as 404, it may be an error: 404 means file not found, 503 means service unavailable, 403 means forbidden, etc.
- The second line is simply the date.
- The next two lines tell the browser that the server is an Apache HTTP server with PHP 5.2.17.
- The fifth line is the Pingback URL of this website. It’s usually not included unless the site is a blog site with Pingback.
- The next line indicates when the page becomes ‘stale’, after which it shouldn’t be retrieved from the computer’s ‘cache’ any more. A date in the past, like this one, means the page is already stale.
- The next line is pretty self-explanatory, stating the date the file was last modified.
- The eighth and ninth lines tell the browser not to keep a cached copy of the web page.
- The tenth line tells the browser to close the connection after the request finishes, because the server has no need to send any more data.
- The next line is to let the browser know that the data is to be transferred in small chunks.
- The final line tells the browser that the data being sent back is HTML in UTF-8 encoding.
There’s also an empty line after the end of the header, before the data begins.
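Here is a small Python sketch of how a program might pull such a response apart, using the empty line to separate the header from the data. `parse_response` is a made-up name, the sample response is shortened, and real browsers handle many more cases than this.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status code, reason, headers and body."""
    # The empty line (two CRLFs in a row) separates the header from the data.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # First line, e.g. "HTTP/1.1 200 OK"
    version, code, reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


if __name__ == "__main__":
    raw = (
        b"HTTP/1.1 200 OK\r\n"
        b"Date: Thu, 11 Aug 2011 11:25:49 GMT\r\n"
        b"Content-Type: text/html; charset=UTF-8\r\n"
        b"\r\n"
        b"<html></html>"
    )
    code, reason, headers, body = parse_response(raw)
    print(code, reason)             # 200 OK
    print(headers["content-type"])  # text/html; charset=UTF-8
    print(body)                     # b'<html></html>'
```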
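After the empty line, because the data is being sent in chunks, the server would send the length of the first chunk (written as a hexadecimal number), followed by a new line, then the actual HTML data for that chunk. This repeats for each chunk, and a chunk with a length of zero marks the end.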
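When the data is sent in chunks, each chunk is preceded by its length written in hexadecimal, and a chunk of length zero marks the end. A minimal Python sketch of putting the chunks back together might look like this. `decode_chunked` is a hypothetical helper, and it assumes the whole body has already been received and is well-formed.

```python
def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a chunked HTTP body back into the original data."""
    data = b""
    pos = 0
    while True:
        # Each chunk starts with its length in hexadecimal, ending with CRLF.
        line_end = body.index(b"\r\n", pos)
        size_text = body[pos:line_end].split(b";")[0].decode("ascii")
        size = int(size_text, 16)
        if size == 0:
            break  # a zero-length chunk means there is no more data
        start = line_end + 2
        data += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return data


if __name__ == "__main__":
    chunked = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
    print(decode_chunked(chunked))  # b'Hello, world'
```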
A great way to see this in action is to use Wireshark to watch the requests in real time. I’ll post a tutorial on that soon.
The same principle of the header followed by data is used every time the browser connects to a website.
And that concludes my little section on the HTTP protocol.