Searching for information using search engines.

How Search Engines Retrieve Information — Search and Rank Algorithms

Vincent T.
The Startup
9 min read · Oct 21, 2019

Before databases were electronic, people who wanted information had to go to an archive or library. They would then be presented with volumes of paperwork that could take weeks or months to research. When the Computer Age arrived, it cut the search for relevant information down to seconds, because we now have search engines that can retrieve information from anywhere on the Internet, i.e. the World Wide Web. Information is stored on various websites and can be accessed directly by entering a URL into a web browser. However, it would be a time-consuming task for users to search each and every website and memorize the URL of every link.

In popular culture, searching for information on the web has become synonymous with Google. People will now say that if you want to know about something, just “google it”.

Structured Query Language (SQL)

Searching for general and specific information began with databases. In a database, information is stored in tables, where columns contain the data fields and rows hold the records. Databases were created to store information that users can retrieve with a query tool, which allows them to request information using a standard language like SQL. This was the start of the basic search engine, relying on simple requests called statements. These statements make what are called boolean requests: exact matches against the columns or rows of a database.

For example, if the query selects the column ‘SSN’ from a database table called ‘Payments’, the result is a column that contains all values for ‘SSN’.

SELECT SSN
FROM PAYMENTS
An example of a column retrieved from a database.

A query can also be made specifically to get a record for a social security number '555-55-5555', and the result will be a row that contains all values that pertain to that social security number.

SELECT *
FROM PAYMENTS
WHERE SSN = '555-55-5555'
An example of a row record retrieved from a database.

Databases eventually matured into a reliable technology for information retrieval. However, they were for the most part not interconnected; users often had to be at the location of the database computer to retrieve information. That changed with the arrival of the Internet and the use of web browsers. Now anyone with access to an online database can execute a query that returns matching results. For the most part, databases use structured, relational data, which works fine for organizations. Putting the entire Internet into a single database just isn’t practical: it would already exceed terabytes of data, and when databases get large they become slow. And what about when users want information that is not structured or relational, like information on the web? It can be plain text, a video clip or photos of sunsets. This gave rise to search engines.

Search Engines

Early search tools included Gopher, a document retrieval protocol that let users search documents before the web. With the popularity of the web came AltaVista, Yahoo and later on Google. Search engines are like the query tools used for relational databases, but they differ in the information retrieval process. While query tools can quickly return results based on relational indexes, a search engine has to cover the entire web. To help search engines, indexes are created by search bots called crawlers (also called spiders, in reference to the web), which compile information for faster retrieval. A minimal sketch of the crawling step follows.
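
Here is that sketch: a tiny breadth-first crawler in Python, using only the standard library. The start URL, page limit and crawl strategy are illustrative assumptions, not a description of any real search engine’s bot.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch a page, record it, queue its links."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # skip unreachable pages and malformed links
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the current page's URL.
        queue.extend(urljoin(url, link) for link in parser.links)
    return pages

pages = crawl("https://example.com")  # hypothetical starting point
print(list(pages))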

Before Google became popular, Yahoo! was one of the top search engines on the web.

Users on the web don’t need to know the column or row of the data they are searching for. They often just want information that pertains to what they are searching for, with no particular organization or structure. For example, if a user wants to research ‘Digital Cameras’, all they need to do is type it in and let the search engine return every result it finds about digital cameras. There is no need for the user to specify what database, column or row to search. The results will include anything found on the web that contains ‘Digital Cameras’, whether it is text from a web page or an entire article about the topic.

This simple diagram shows the search engine process from the user’s search request up to the return results.

What search engines do is provide the search results (also called the SERP, or Search Engine Results Page) for a request using databases that have already indexed that information. The information was first gathered from the web by the spiders, then compiled into indexes that are stored not in one database, but in many databases distributed across the network. Going back to the earlier example, when you search for ‘Digital Cameras’, the search engine will look up an index that contains the keyword and be directed to a database that stores it. Then the search engine will generate a web page containing hyperlinks to the results. A small sketch of these two steps appears below.
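
To make the indexing and lookup steps concrete, here is a small Python sketch over a handful of made-up pages. The page names and text are assumptions for the example; a real engine distributes the index across many databases, as described above.

from collections import defaultdict

# A few made-up pages standing in for crawled web content.
docs = {
    "page1.html": "reviews of the best digital cameras of 2019",
    "page2.html": "digital photography tips for beginners",
    "page3.html": "film cameras versus digital cameras",
}

# Indexing: map every keyword to the set of pages that contain it.
index = defaultdict(set)
for url, text in docs.items():
    for word in text.lower().split():
        index[word].add(url)

# Lookup: the keyword leads straight to the matching pages, which
# the engine would then render as a page of hyperlinks.
for url in sorted(index["cameras"]):
    print(f'<a href="{url}">{url}</a>')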

Search Optimization

Webmasters put a robots.txt file on their site to control how spiders retrieve information for a search engine. The robots.txt file contains directives for spiders to follow when indexing a website, including which web pages to exclude from crawling. Another method that can be used alongside robots.txt is a standard called Sitemaps, an inclusion standard for websites. If a website doesn’t have a robots.txt file, the spider will still crawl it and add it to the index, using relevant data like the page title, content and HTML meta tags. Developers want their website to have the greatest visibility to spiders, so they use a set of methods called SEO (Search Engine Optimization).

This is an example of meta tags in an HTML document:

<head>
<meta charset="UTF-8">
<meta name="description" content="Search Engines">
<meta name="keywords" content="web,search,ranking,information">
</head>

Enclosed in the ‘head’ tags are the meta tags. The keywords meta tag describes the content of the website. Meta tags contain what is called metadata: information that describes the contents and other details of the website. Below is a sketch of how a spider might read it.
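
Here is a small Python sketch that extracts the metadata above using the standard html.parser module. The parsing approach is an assumption for illustration, not how any particular crawler is implemented.

from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects the name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

parser = MetaExtractor()
parser.feed('<head>'
            '<meta charset="UTF-8">'
            '<meta name="description" content="Search Engines">'
            '<meta name="keywords" content="web,search,ranking,information">'
            '</head>')
print(parser.metadata)
# {'description': 'Search Engines', 'keywords': 'web,search,ranking,information'}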

Here is an example of a robots.txt file.

# Group 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Group 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Using details about robots.txt from Google, here is what the code means (a small sketch of how a crawler can check these rules follows the list):

  1. The crawler with the user agent “Googlebot” should not crawl the folder http://example.com/nogooglebot/ or any of its subdirectories.
  2. All other user agents can access the entire site.
  3. The site’s Sitemap file is located at http://www.example.com/sitemap.xml.
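
Here is that sketch, which checks the rules with the standard library’s urllib.robotparser; the URLs are the example.com placeholders from the file above.

from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

robots = RobotFileParser()
robots.parse(rules.splitlines())

# Googlebot is blocked from /nogooglebot/; everything else is allowed.
print(robots.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
print(robots.can_fetch("Googlebot", "http://www.example.com/index.html"))             # True
print(robots.can_fetch("OtherBot", "http://www.example.com/nogooglebot/page.html"))   # True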

The reason for these optimizations is to get a higher score in the rankings when users search. That way the website can attract more traffic and increase engagement. How ranking websites works will be discussed in the next section.

BIR and the PageRank Algorithm

Traditional searching uses a BIR (Boolean Information Retrieval) technique. Its algorithms return the exact match of what a user is requesting and nothing more; it is boolean in the sense that a document either matches (1) or does not match (0). This works well for simple searches, but the model says nothing about which matches are most relevant. That is why different search techniques were introduced, like Google’s PageRank. A small sketch of boolean retrieval follows.
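
Here is a minimal Python sketch of boolean retrieval over an inverted index; the postings sets are made up for illustration.

# Postings: which pages contain each term.
postings = {
    "digital": {"page1.html", "page2.html", "page3.html"},
    "cameras": {"page1.html", "page3.html"},
    "film": {"page3.html"},
}

# Each page either matches (1) or does not (0), so boolean operators
# map directly onto set operations over the postings lists.
print(sorted(postings["digital"] & postings["cameras"]))  # AND -> ['page1.html', 'page3.html']
print(sorted(postings["digital"] | postings["film"]))     # OR  -> all three pages
print(sorted(postings["cameras"] - postings["film"]))     # NOT -> ['page1.html']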

The PageRank algorithm does not just use exact match criteria to return search results. It takes into consideration the significance of a website. According to the Wikipedia definition:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Google ranks search results based on relevance, returning the most popular websites for the search request. A score is given to each website that matches the criteria of the search request; the higher the score, the more likely it is to appear at the top of the results. The PageRank scores are recalculated after each web crawl in order to keep results relevant. (Note: the actual PageRank algorithm is too involved to explain here and would require a separate post.)

The web holds so much information that the results returned by search engines can be overwhelming. PageRank was developed to surface the most popular or significant websites. If searches used only exact-match BIR criteria, users might not get the best results: the matches would come back in no particular order, effectively at random, and not all of them would be relevant to the search. It would take the user a long time to find a website that is significant for their criteria. With PageRank, a score rates a website’s importance relative to the search request, so users see the most important websites at the top of the results list, while those with lower scores are deemed less significant. A minimal sketch of the PageRank computation follows.
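
For illustration, here is a minimal Python sketch of the classic PageRank iteration on a tiny made-up link graph. The damping factor of 0.85 follows the commonly published formulation; everything else is a simplification, not Google’s production algorithm.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score, plus the shares it
        # receives from the pages that link to it.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            # A page splits its rank equally among the pages it links to.
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
# c.com is linked from two pages, so it ends up with the highest score.
for page, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))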

Because of ranking algorithms, SEO became an important way for websites to earn a higher score and appear at the top of searches. This led to many services that offer customers ways to increase the rank of their website. For businesses, appearing at the top of a search result has clear advantages, because the site is seen before other websites. Then again, the ranking also takes into account how often the website is clicked and linked to from other websites.

PageRank is used along with other techniques for searches. At one time users could view PageRank scores from the Google Toolbar in the web browser. Google has since removed it so that people would not focus on the numbers; instead, the emphasis is on the quality of the links to websites, in order to discourage spamming. Bots can create links to a website to boost its ranking, so a system was put in place to check how significant the linking website itself is.

For example, if a website is being linked from a forum or blog, that website must have some sort of significance and not just be used to spam the spiders that crawl it. Take a look at this list of how PageRank scores a keyword relative to other keywords found.

In this list, PageRank scores the most popular results for the keyword ‘iPhone’. A score of 10 ranks ‘iPhone 11’ the highest, and the list descends from the most searched to the least searched.
These are the lowest-ranked results for the keyword ‘iPhone’. They sit at the bottom of the list because, even though they appear as keyword suggestions based on popularity, they are the least likely to be visited.

What if a keyword you type in has no suggestions or predictions? It can still be used, and the search engine will learn it for future searches. If there has never been a search request for the word ‘DeX’, the search engine will still pull up whatever was crawled on the web for that keyword. It will then become part of the keyword predictions for future search requests, ranked by popularity relative to other keyword suggestions. If a search term violates the search engine’s policy, it will not appear as a suggestion.

Autocomplete

The Google search engine has another feature that assists searches, called autocomplete. You begin typing a word you want to search for, and Google automatically presents a list of suggestions for what you may want to request. Take a look at the example below.

The Google search engine autocomplete tool.

The purpose of autocomplete is to get to search results faster. Autocomplete makes keyword suggestions based on predictions using popularity and similarity. However, certain factors determine these suggestions, and they won’t be the same for every person. If you are logged in with a Google account, this becomes easier to implement, since your searches can be tracked through your user account. Autocomplete looks at the following factors during a search (a small prefix-matching sketch follows the list):

  • Search terms a user types (these can be unique to every user).
  • Searches a user has done in the past, i.e. search history (if signed in to Google).
  • What is trending. These are popular topics not related to search history, and can be based on the user’s region (e.g. what is trending in Southern California or Eastern Pennsylvania) and other information known about the user.
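
Here is that sketch: a toy prefix-based autocomplete in Python. The stored queries and popularity counts are made-up stand-ins for the trend and history signals described above.

# Stored past queries with made-up popularity counts.
suggestions = {
    "iphone 11": 100,
    "iphone 11 pro": 80,
    "iphone xr": 60,
    "iphone case": 40,
    "iphone 8": 20,
}

def autocomplete(prefix, limit=5):
    """Return the most popular stored queries that start with the prefix."""
    matches = [q for q in suggestions if q.startswith(prefix.lower())]
    return sorted(matches, key=lambda q: -suggestions[q])[:limit]

print(autocomplete("iph"))
# ['iphone 11', 'iphone 11 pro', 'iphone xr', 'iphone case', 'iphone 8']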

Summary

Information has become the key to knowledge in the Digital Age, and people become more informed, or misinformed, through web searches. The algorithms continue to evolve as web searches become more optimized. Since boolean exact-match searches don’t always return the best results, a rank-and-score system proved more useful, providing users with the results most relevant to their search requests.
