A synchrotron, as an experimental physics facility, offers opportunities for multidisciplinary research and collaboration among scientists in fields such as physics and chemistry. During the construction and operation of such a facility, valuable data on the design of the facility, its instruments, and the experiments conducted there are published and stored. Researchers spend a long time sifting through results from general-purpose search engines to find the scientific information they need, so a domain-specific search engine can help them find it with greater precision. The crawled data can also be used to build a knowledge base and to generate the datasets that researchers require. Several vertical search engines have already been designed for scientific data, for example for medical information. In this paper we propose the design of such a search engine on top of the Apache Hadoop framework. The Hadoop ecosystem provides the necessary scalability, fault tolerance, and availability. It also abstracts away much of the complexity of search engine design by offering open source tools as building blocks, among them Apache Nutch for crawling and Apache Solr for indexing and query processing.
Keywords: Synchrotron, Search Engine, Information Retrieval, Big Data, Hadoop, Solr, Nutch.
A vertical search engine called HVSE has been proposed in , in which the authors improved topic-oriented web crawler algorithms and developed a search engine based on the Hadoop platform. Because the Hadoop cluster can be expanded, this search engine on the decentralized Hadoop platform can achieve higher efficiency on massive amounts of data.
The architecture of a search engine consists of four main parts, as shown in Figure 1. The crawler, the first part, is responsible for collecting data from web pages. The indexer then creates a searchable index of the collected raw data. The third part is the query parser, which parses the user's input query and retrieves the related information. The fourth and last part is the user interface, which may take the form of a web application or a mobile app and which supports searching and displays the results to the end user.
Fig. 1. Architecture of a search engine
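The four parts described above can be illustrated with a toy, in-memory sketch. It is not the proposed Hadoop-based implementation and uses neither the Apache Nutch nor the Apache Solr API. The page collection `PAGES`, the URLs, and all function names are hypothetical and chosen only for illustration: a dictionary stands in for the web, a plain inverted index stands in for the indexer, and queries are matched with a simple AND over their terms.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/a": ("Synchrotron beamline design notes",
                             ["http://example.org/b"]),
    "http://example.org/b": ("Undulator magnet design for a synchrotron",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("Vacuum chamber measurements", []),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(seed, pages):
    """Part 1, crawler: follow links breadth-first from the seed URL."""
    seen, queue, collected = {seed}, [seed], {}
    while queue:
        url = queue.pop(0)
        if url not in pages:
            continue  # unreachable page in this toy collection
        text, links = pages[url]
        collected[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


def build_index(collected):
    """Part 2, indexer: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in collected.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def parse_query(query):
    """Part 3a, query parser: turn the raw user input into search terms."""
    return tokenize(query)


def search(index, terms):
    """Part 3b, retrieval: return URLs that contain every query term."""
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


def render(query, hits):
    """Part 4, user interface: format the hits as plain text."""
    return "\n".join([f"Results for '{query}':"] + [f"  {url}" for url in hits])


if __name__ == "__main__":
    index = build_index(crawl("http://example.org/a", PAGES))
    query = "synchrotron design"
    print(render(query, search(index, parse_query(query))))
```

In a real deployment each function would be replaced by a dedicated component, but the data flow between the four parts stays the same: crawled text feeds the index, and parsed queries are answered from the index before the results reach the user.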