A key building block of an Open Search infrastructure: Serci WebCrawler launched
Designed for speed, the Serci WebCrawler is the first part of the Serci SearchEngine project.
The Serci SearchEngine project will provide an open-source web search engine. When fully deployed, the Serci Suite will consist of a WebCrawler, an Indexer with index tools, and a query server. The Serci WebCrawler is the first component to have been published. It will be a candidate for the crawling necessary to gather data for the Open Web Index as part of an Open Search infrastructure.
Hartmut Stein, a software consultant, one of the developers of the crawler and a member of the OSF working group Tech, explains: “The Serci WebCrawler was designed from the ground up for speed – after all, high crawling speed means faster throughput and lower power consumption per fetched web page.” First experiments indicate that the crawler is much faster and more energy-efficient than the other crawlers it was compared with, for example about four times faster than Heritrix (measurements still to be verified by independent parties).
The Serci WebCrawler: an important cornerstone for the Open Web Search initiative
“I am very excited to see how the Serci Crawler will be received in the community and in federated operations. From what I’ve seen so far, it has the potential to establish itself as a key building block of an Open Search infrastructure,” said Dr. Stefan Voigt, board member of the Open Search Foundation and one of the leading coordinators of the EU project OpenWebSearch.eu. “The release of the first elements of the Serci search engine by Hartmut Stein, supported by NLNet, IT4I and the University of Passau, all active players in the Open Search community, is a great example of how cross-linking diverse expertise will make the open Internet search of the future possible.”
A crawler with history
The Serci WebCrawler is not a completely new development. As a derivative of the AREXERA X-Crawler, it dates back to the early 2000s, when AREXERA GmbH (formerly TECOMAC GmbH) wrote it as part of a toolset for running public search engines such as Seekport in Germany and some other European countries. The tool was in full production use until the company went out of business.
The crawler supports common features such as TLS, robots.txt handling, politeness rules, de-chunking, decompression, and WARC file output. It is written in C++ and prepared for distributed operation. An experimental web front-end is available for administration. The Serci WebCrawler is free software and can be redistributed and/or modified under the terms of the Apache License, Version 2.0.
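For readers unfamiliar with the term, de-chunking means reassembling an HTTP/1.1 response body that the server sent in pieces using chunked transfer encoding. The following is a minimal C++17 sketch of that idea only. It is not taken from the Serci sources: the function name dechunk is hypothetical, and the sketch ignores trailer fields and assumes the whole body is already in memory.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper for illustration; not part of the Serci code base.
// Decodes a body sent with "Transfer-Encoding: chunked".
// Returns std::nullopt if the input is malformed or truncated.
std::optional<std::string> dechunk(std::string_view body) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        // Every chunk starts with a hexadecimal size line ending in CRLF.
        const std::size_t eol = body.find("\r\n", pos);
        if (eol == std::string_view::npos) return std::nullopt;

        std::string_view size_line = body.substr(pos, eol - pos);
        // Drop optional chunk extensions that follow a ';'.
        size_line = size_line.substr(0, size_line.find(';'));
        // Assumption: cap the size field at 8 hex digits to avoid overflow.
        if (size_line.empty() || size_line.size() > 8) return std::nullopt;

        std::size_t size = 0;
        for (const char c : size_line) {
            std::size_t digit = 0;
            if (c >= '0' && c <= '9') digit = static_cast<std::size_t>(c - '0');
            else if (c >= 'a' && c <= 'f') digit = static_cast<std::size_t>(c - 'a') + 10;
            else if (c >= 'A' && c <= 'F') digit = static_cast<std::size_t>(c - 'A') + 10;
            else return std::nullopt;
            size = size * 16 + digit;
        }
        pos = eol + 2;

        // A zero-sized chunk marks the end of the body (trailers ignored).
        if (size == 0) return out;

        // The chunk data must be present and followed by CRLF.
        const std::size_t remaining = body.size() - pos;
        if (size > remaining || remaining - size < 2) return std::nullopt;
        out.append(body.substr(pos, size));
        pos += size;
        if (body.substr(pos, 2) != "\r\n") return std::nullopt;
        pos += 2;
    }
}
```

Called on the two-chunk body "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n", the function yields the reassembled payload, while a truncated or malformed body yields an empty optional. A production crawler would more likely decode incrementally while data arrives from the socket instead of buffering the full response first.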
Wonder where the name comes from? The name “Serci” stems from the Esperanto word serĉi: to search, to seek, to find, to look for.
About the developer:
Hartmut Stein, born in 1956, holds a degree in physics and has spent his working life as a software consultant and developer. He developed most of the AREXERA Internet and Intranet search engine and played a major role in the AREXERA X-Crawler, which is the basis for the Serci WebCrawler.