Search engines are an indispensable technology in our daily lives; it's through them that our journeys through cyberspace begin. As soon as a question or a need arises, all we have to do is think of a word and type it into a text box, and even before we press Enter we're already inundated with results.
Search engines haven't changed much in their general appearance over time; their structure has remained the same throughout their evolution:
- a text box to type words;
- a button to start the search;
- a section to show the paginated results.
The most significant advances throughout their history have been achieved mainly behind the scenes: in the algorithms that power their engines, in the speed at which they operate and, finally, in the amount of information they're fed with. Nor can we ignore the progress that hardware has made in terms of speed and processing capacity, as well as the fact that the cost and capacity of storage have evolved in completely opposite directions as these technologies mature.
There are several types of search engines, but we'll focus our study on crawler-based World Wide Web (WWW) search engines. We'll first present a summary of the main innovations and discoveries in the search engine industry so far. After that we'll present the basic concepts behind their operation and suggest a solution for implementing your own search engine. Finally, we'll present the latest developments applied to commercial search engines.
Search engines’ history
Most of the retrieval methods used by current search engines originate from developments made in the 1970s and 1980s in the field of computer science dedicated to Information Retrieval (IR). Information Retrieval studies the use of computer technologies for the acquisition, organization, storage, retrieval and distribution of information. With the advent of the WWW and the widespread use of the Internet, IR became part of our daily lives: whenever we do a Google search, we're using Information Retrieval technologies. Information Retrieval can be expressed as an "equation": retrieval = indexing + search. Indexing is the tabulation of the contents of the documents in the collection, and search consists of matching a query against these tables. In turn, search can be represented by another "equation": search = query processing + matching (+ scoring). Scoring only occurs when retrieval ordered by relevance is intended.
The first tool that emerged to search the Internet was the Archie search engine, in 1990. At the time, the WWW hadn't been invented yet. Due to the limited storage available at the time, this search engine only indexed the names of files available on anonymous-access FTP (File Transfer Protocol) servers. As a result of Archie's popularity, a new protocol emerged in 1991 - Gopher - intended as an alternative to Archie. For this new protocol two new interfaces (search engines) emerged, first Veronica and later Jughead.
In 1991, Tim Berners-Lee created the WWW and the first browser. But it was only in 1993 that Mosaic was launched, the browser that would later popularize the use of the WWW.
At the beginning of the WWW there were no proper search engines, only catalogs or boards of existing pages. It wasn't until 1993 that the first WWW search engine appeared, the Wanderer, a web robot developed in the Perl programming language that navigated the existing web and created an index of the pages it found. In the same year another search engine appeared, Aliweb, which didn't use a web robot but a notification system, in which website administrators themselves informed the search engine that there was new content. At the end of that year came JumpStation, the first search engine to combine the three main features of a modern search engine (crawling, indexing and searching), although it was still limited to searching the titles and headings of the pages it indexed. Only in 1994 did search engines that indexed the entire content of the pages start to appear, such as WebCrawler, Lycos and Infoseek.
In 1995 search engines grew in popularity, and it was in that year that AltaVista appeared, remaining the leader until the arrival of Google. AltaVista was the first search engine to allow queries in natural language. 1995 is also the year SAPO was born, in Portugal.
Other search engines appeared during this period, but it was in 1998 that the one that is, to this day, the most popular search engine and the leader practically all over the world (except in Russia and China) arose - Google.
Google jumped to the leadership as early as 2000 as a result of the innovation in its results-ranking algorithm, PageRank. This algorithm assesses the relative importance of each web page by taking into account the number of links that point to it compared to all the other web pages, thus establishing a relative importance for each page. But it wasn't only this technological innovation that made Google the leader in the sector; it was also the idea of small pay-per-click (PPC) text ads (Google AdWords) that appeared alongside the search results for related keywords.
As was to be expected, search engines based on open-source technology also appeared. Currently, the most popular is the combination of Apache Lucene with Apache Solr and Apache Nutch. Apache Lucene first appeared in 1999, created by Doug Cutting, and has been supported by the Apache Foundation since 2001.
How search engines work
The search engine's work begins well before the user starts typing a set of keywords into the text box and submitting it. We can even say that this is the last part of a cycle that repeats itself indefinitely. The three key phases in a search engine's operation are crawling, indexing and searching.
Before any search is possible, the search engine needs to fill its index with the documents it will search over. The search engine delegates the crawling task - finding on the WWW the documents that will become part of its index - to a piece of software called a web crawler (also known as a spider or Internet bot). The crawler is given a list of URLs (called the seed) from which it starts, following every link found in those pages and in the subsequent ones, and so on, until it has visited and copied all the intended pages. There are thousands of bots constantly traversing the web; Google's bot is appropriately called Googlebot. It's with these copies of the pages that the search engine builds its index. Currently, the web is so vast that these crawlers can't cover it entirely. The part of the web that isn't indexed by search engines is called the Invisible Web (Deep Web).
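The crawl cycle described above - start from a seed, follow every link found, and keep a copy of each page visited - can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `web` dictionary stands in for real HTTP requests, and politeness rules (robots.txt, rate limiting) are omitted.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch):
    """Breadth-first crawl: start from the seed URLs, follow every
    link found, and return a copy of each page visited."""
    frontier, visited, pages = deque(seed), set(), {}
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)          # in a real crawler: an HTTP GET
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(parser.links)
    return pages

# A tiny in-memory "web" stands in for real network access.
web = {
    "http://a": '<html><a href="http://b">b</a></html>',
    "http://b": '<html><a href="http://a">a</a><a href="http://c">c</a></html>',
    "http://c": "<html>no links here</html>",
}
pages = crawl(["http://a"], web.get)
print(sorted(pages))  # all three pages are reached from the single seed
```

The copies collected in `pages` are exactly what the indexing phase, described next, would consume.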
The indexing phase is the process by which the search engine extracts the necessary information from these documents and stores it in its database, so that the searches carried out on this index are fast and accurate. If there were no indexing process, searches on the set of documents (corpus) could take hours, or even days, for just one query.
The index is usually in the form of an inverted index. The idea is to keep a vocabulary of all the terms found in the documents, with an indication (a list) of where each term occurs. This index is a key factor in the efficiency of Information Retrieval systems, of which search engines are a part.
The main steps required for the construction of an inverted index are:
- Collection of the documents to be indexed.
- Tokenization of the text: the task of separating a stream of characters into words or, more precisely, "tokens". This step is essential for further analysis; without it, it would be difficult to extract high-level information from any type of document.
- Language processing of tokens (terms).
- Indexing of the documents in which each of the terms occurs.
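The four steps above can be sketched as follows. Note this is a simplification: the "language processing" step here is reduced to lowercasing, whereas a real system would also apply stemming, stop-word removal and similar normalizations.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: split a character stream into lowercase tokens
    (lowercasing stands in for the full language-processing step)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Steps 3-4: map each term of the vocabulary to the sorted list
    of document ids (its postings list) in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Step 1: the collection to be indexed (three toy documents).
docs = {
    1: "Search engines index the web.",
    2: "The web grows every day.",
    3: "Crawlers feed the search index.",
}
index = build_inverted_index(docs)
print(index["index"])  # → [1, 3]
print(index["web"])    # → [1, 2]
```

Answering a one-word query is then just a lookup in this dictionary, which is why searches over an index take milliseconds rather than hours.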
The first step in answering a query is to parse that query. The types of queries available in Information Retrieval systems are diverse: we can have Boolean queries, proximity queries, wildcard queries and queries with automatic spelling correction, among others.
Once the query has been interpreted, the IR system searches for all documents that match the keyword(s) used in the query. At this stage there is no ranking among the processed documents; the intention is only to identify all the documents that are candidates for the list of possible results. As the number of results can be in the hundreds or even thousands, it's important that some sort of order be applied to the results list. The results found are thus returned in a list sorted by relevance. The order is established by a ranking/scoring algorithm. One of the most popular methods for scoring documents is "Term Frequency - Inverse Document Frequency" (tf-idf), but there are other methods, such as Okapi BM25 or Latent Semantic Analysis (LSA).
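To make the scoring step concrete, here is a minimal sketch of tf-idf ranking in its classic form: each document's score for a query is the sum, over the query terms, of the term's frequency in the document multiplied by log(N / df), where N is the number of documents and df is how many documents contain the term. Real engines use refined variants, but the principle is the same.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Rank documents against a query using classic tf-idf weighting:
    score(d) = sum over query terms t of tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()                       # document frequency of each term
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)             # term frequency within document d
        scores[d] = sum(
            tf[t] * math.log(N / df[t])
            for t in query.lower().split() if t in df
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "the crawler visits the web",
    "d2": "the index stores the web pages",
    "d3": "ranking orders the search results",
}
ranked = tf_idf_scores("web pages", docs)
print(ranked[0][0])  # → d2 (contains both query terms, one of them rare)
```

Note how the idf factor does the intuitive work: "pages" occurs in only one document, so it weighs more than "web", and a term occurring in every document would weigh log(1) = 0.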
How to implement a custom search engine
Commercial search engines allow for some degree of customization. Some of them even sell versions that can be installed on private computer networks. However, there's a completely free and open-source solution that can be used to build a search engine for an internal network or even to search the entire Internet. This system consists of three open-source projects: Apache Lucene, Apache Solr and Apache Nutch. They're three distinct projects that have joined forces to provide a powerful and effective search tool, with all the resources needed to build a search engine. Solr is a search server with a REST API, and Lucene is a text search engine built over a high-performance index, currently the most popular free search library thanks to its scalability and performance. Apache Nutch is the crawler that feeds Lucene's index.
The main features found in Lucene, in addition to its inverted index for efficient retrieval of documents by indexed terms, are a set of chained text-analyzer components, which allow the index and the search-term interpretation system to be adapted to the desired language, and a completely customizable and adaptable scoring system.
Solr can be considered a kind of layer that runs over Lucene, extending it with new features, making it more flexible and allowing it to integrate with a variety of programming languages and systems. The added features are, in particular:
- A flexible data schema, with dynamic fields, unique keys, etc.
- Extensions to Lucene Query Language
- Faceted search and dynamic filtering
- Geospatial search
- Configurable text analyzers
- Highly configurable and user-extensible caching
- Performance optimizations
- External configuration via XML
- An administration interface
- Event log
- Incremental index updates
- Index replication
- Highly scalable distributed search with indexes distributed across multiple nodes
- JSON, XML, CSV / delimited text and binary update formats
- Easy ways to import data from databases and XML files from the local disk and HTTP sources
- Indexing and parsing of several file formats (PDF, Word, HTML, etc.) using Apache Tika
- Integration with Apache UIMA for configurable metadata extraction
- Multiple search indexes
The Lucene index can be fed by crawlers other than Apache Nutch, but there are significant advantages to using it. Perhaps the most important is that, as it's also an Apache project, interoperability with the other Apache systems is greater, giving us the guarantee of a more cohesive search engine. Its other advantages are being highly scalable, having a set of features that make it very flexible, and allowing complete customization of the algorithm that decides which pages to crawl.
This platform consisting of three open source projects can be installed on any operating system capable of running Java.
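Because Solr exposes its search server through a REST API, querying it from any language comes down to building a URL against its /select handler. The sketch below shows how such a request could be assembled in Python; the host, port and the core name "articles" are assumptions for illustration - adjust them to your own installation and schema.

```python
from urllib.parse import urlencode
# from urllib.request import urlopen   # uncomment with a live Solr server

def solr_select_url(base_url, core, query, rows=10):
    """Build the URL for a query against Solr's /select request
    handler, asking for at most `rows` results in JSON format."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"

# "articles" is a hypothetical core name; the default Solr port is 8983.
url = solr_select_url("http://localhost:8983", "articles", "title:lucene")
print(url)
# With a running server, the results could then be fetched with e.g.:
#   import json; results = json.load(urlopen(url))
```

Note that the query string itself uses Lucene's query syntax (here, `title:lucene` restricts the match to the title field), extended by Solr as listed above.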
The future of search engines
The most recent advances in search engine technology consist of moving to a new phase: Information Retrieval (IR) is considered a solved challenge, and the attempt now is to conquer a new one - Information Extraction (IE). IE tries to find useful information in documents without the user needing to read the full text. We can therefore consider Information Extraction a subfield of natural language processing (NLP) that focuses on finding specific facts in relatively unstructured documents. It differs from IR, which tries to find the documents that contain certain information.
Within this field we are beginning to see the first steps, such as the Knowledge Graph developed by Google, which already holds over 570 million concepts, and the relationships among them, in its database. Google is thus already able to apply Named Entity Recognition (NER) and extract information about those entities from the indexed documents. NER is the first step towards identifying relationships in an information extraction system.
At the same time, search engines keep developing their natural language processing capabilities, as the latest update to Google's search engine, called Hummingbird, demonstrates.
Search engines are moving towards becoming more than "simple" search engines: they'll also answer questions asked by their users in natural language. These systems - Question Answering (QA) systems - combine the scientific fields of Information Retrieval (IR) and Natural Language Processing (NLP).
The evolution in this industry doesn't stop; a good example is this infographic showing the main innovations in Google's search engine over the past 15 years.
By José Fernandes