ABIS Infor - 2017-01
How old is a web page?
When "surfing" the internet, we often stumble upon web pages that are clearly outdated. For many other pages (for most of them, actually) it is not immediately clear how up-to-date the information on the page really is. Sometimes there is an explicit timestamp (often at the bottom of the page), but maybe even that information is outdated!
How can we find out the "real" age of a certain web page? And does a search engine like Google take the age of a page into account when ranking the results of our search?
HTML versus HTTP
When a web server sends a web page to a (client) web browser, it does not only send the HTML of the web page: the HTTP protocol (or HTTPS) by which client and server communicate first transmits a so-called HTTP header. Based on the information in that header, the client can then decide whether or not to request the HTTP body (containing the HTML of the actual web page).
A browser only shows the HTTP body (i.e., the HTML) of a web page, although it always receives the header first. As a user of a normal browser we unfortunately cannot inspect that header info directly. Of course there are network tools (such as wget with its --server-response option, or curl -I) that allow us to display the header information of a certain web page.
The HTTP header
The HTTP header of an HTML web page could e.g. look as follows:
HTTP/1.1 200 OK
Date: Mon, 16 Jan 2017 14:12:38 GMT
Expires: Mon, 16 Jan 2017 14:12:39 GMT
Cache-Control: no-cache, must-revalidate
Last-Modified: Sat, 23 Feb 2013 10:00:17 GMT
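These header fields can also be processed programmatically. As a minimal sketch (the field values below are simply copied from the example header above), Python's standard library can parse the RFC 1123 date format that HTTP uses:

```python
from email.utils import parsedate_to_datetime

# The example HTTP header fields, as key/value pairs.
header = {
    "Date": "Mon, 16 Jan 2017 14:12:38 GMT",
    "Expires": "Mon, 16 Jan 2017 14:12:39 GMT",
    "Cache-Control": "no-cache, must-revalidate",
    "Last-Modified": "Sat, 23 Feb 2013 10:00:17 GMT",
}

# HTTP dates follow RFC 1123; parsedate_to_datetime handles that format.
date = parsedate_to_datetime(header["Date"])
last_modified = parsedate_to_datetime(header["Last-Modified"])

# Age of the page content at the moment the response was sent.
age = date - last_modified
print(age.days)  # → 1423
```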
"Date" just tells us when this information was sent: "now", in other words (at least if the server's clock is set correctly).
More interesting is the field "Last-Modified". This information is sent to the client by the web server, often based on the actual file timestamp on the server.
Note that the age of that file does not necessarily reflect the age of the information inside it: the file can, for example, be regenerated on the server every night from old back-end information, or, the other way around, it can be a PHP script written years ago that still generates up-to-date information, e.g. by extracting data from a database.
On the other hand, the server can of course also "make up" this timestamp, e.g. to give the impression that a page does not contain outdated data.
Finally there is the field "Expires", which should normally be later than Last-Modified. A client (browser) should never serve this page from its cache beyond the moment given by the Expires field. That is also what the field "Cache-Control" tells us in the above example: must-revalidate. This is typical for dynamic pages.

With static pages (such as CSS files or photos), especially when they are relatively large, it is more interesting to tell the client that a potential browser cache (i.e., a locally stored copy of the same page from some time ago) can be reused. The client (browser) may detect, after receiving the HTTP header, that its cached copy is at least as recent as the "Last-Modified" it just received, and consequently decide not to request the HTTP body but to display its cached version instead.
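That revalidation decision can be illustrated in a few lines of Python. This is a simplified model of what a browser cache does (real clients follow the HTTP caching rules of RFC 7234 in much more detail):

```python
from email.utils import parsedate_to_datetime

def can_reuse_cache(cached_last_modified, received_last_modified):
    """Reuse the locally cached copy if the server's Last-Modified
    is not newer than the one stored with the cached copy."""
    cached = parsedate_to_datetime(cached_last_modified)
    received = parsedate_to_datetime(received_last_modified)
    return received <= cached

# The cached copy was stored with the same Last-Modified the server
# now reports, so the body does not need to be transferred again.
print(can_reuse_cache("Sat, 23 Feb 2013 10:00:17 GMT",
                      "Sat, 23 Feb 2013 10:00:17 GMT"))  # → True
```

In real HTTP traffic the client typically sends its cached timestamp in an If-Modified-Since request header, and the server answers with status 304 Not Modified (and no body) when the page is unchanged.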
For example, this is the header of CSS file "abisstruktuur.css" on the ABIS web site http://www.abis.be/:
HTTP/1.1 200 OK
Date: Mon, 16 Jan 2017 14:12:38 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Tue, 18 Oct 2016 21:38:36 GMT
Keep-Alive: timeout=15, max=100
Last-Modified suggests that the file has not been changed for a few months. Since Expires and Cache-Control are missing, the server gives no explicit freshness lifetime; the client is then free to apply a heuristic to decide how long its cached version of this file may be reused without revalidation.
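The HTTP caching specification (RFC 7234, section 4.2.2) allows such heuristic expiration, and suggests taking a fraction (typically 10%) of the time elapsed since Last-Modified as the freshness lifetime. A sketch, using the Date and Last-Modified values from the CSS example above:

```python
from email.utils import parsedate_to_datetime

def heuristic_freshness(date, last_modified, fraction=0.1):
    """Heuristic freshness lifetime: a fraction (commonly 10%) of the
    time between Last-Modified and the response Date (RFC 7234, 4.2.2)."""
    age = parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)
    return age * fraction

lifetime = heuristic_freshness("Mon, 16 Jan 2017 14:12:38 GMT",
                               "Tue, 18 Oct 2016 21:38:36 GMT")
print(lifetime.days)  # → 8
```

In other words: a file untouched for roughly three months may, under this heuristic, be served from cache for about nine more days before the client revalidates.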
"Stale" web pages
Is there a way for a web client (and more specifically for a web crawler like those used by Google for feeding their search engines) to conclude whether the information on a web page is up-to-date, or instead completely outdated?
The only "official" HTTP way to know this is by using the HTTP header information, as explained above. All other ways to estimate the "age" of data are undocumented, and hence everyone is free to implement their own.
For example, a web crawler (like Google's) could compare the content of a page with the content of that same page a day before, or a few months before, and decide, based on that comparison, to attach a certain "stale index" value to the page.
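Such a comparison could be sketched with Python's difflib. The "stale index" below is a toy score of my own invention, purely for illustration: the more similar two snapshots of a page are, the staler the page presumably is.

```python
import difflib

def stale_index(snapshot_then, snapshot_now):
    """Similarity ratio between two snapshots of the same page:
    1.0 means identical (likely stale), 0.0 means completely rewritten."""
    return difflib.SequenceMatcher(None, snapshot_then, snapshot_now).ratio()

# An unchanged page scores 1.0; a rewritten one scores much lower.
print(stale_index("same old news", "same old news"))  # → 1.0
print(round(stale_index("same old news", "breaking story"), 2))
```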
To circumvent such a stale index, one could e.g. decide to insert a few "random" bytes into a web page, so that every snapshot of it looks slightly different.
Going a bit further, a sophisticated "ad hoc" age estimation algorithm could scan the content of a page and make an "educated guess" about its age, based on style, vocabulary, references to current events (e.g. hyperlinks to pages which did not exist a few months before), ...
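As a deliberately naive illustration of such content-based guessing (this is a toy of my own, not anything a real search engine is known to use): the most recent plausible year mentioned in the text gives at least a lower bound on when the page was last written.

```python
import re

def guess_year(text):
    """Toy age estimator: take the most recent plausible year (1980-2029)
    mentioned in the page text as a lower bound on when it was written."""
    years = [int(y) for y in re.findall(r"\b(19[89]\d|20[0-2]\d)\b", text)]
    return max(years) if years else None

print(guess_year("Copyright 2013. Updated for the 2016 release."))  # → 2016
```

A real algorithm would of course combine many such weak signals, which is exactly what makes it a "Big Data" problem.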
I'll bet that Google uses a very sophisticated version of such an algorithm! A perfect "Big Data" application.
In summary: there is no watertight, guaranteed way to find out to what extent the data of a certain web page is up-to-date. Admittedly, the HTTP header contains interesting information (which we cannot see from within a normal web browser), and this information is accurate most of the time (but not necessarily, e.g. when a web site tries to mislead Google). All other ways to attach a reasonable age to the information in a web page are the subject of "Big Data" analysis.