Our Universal Web Crawler crawls more than 40 web portals of different trading companies and loads data about their products into the database. The information for each product includes its code, model, description, all technical features listed on the site, the price (if the site shows one), the measurement unit for purchased-item counts, etc.
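As an illustration, a product record of this kind might be modeled as follows. This is a minimal sketch; the class and property names are hypothetical, not the actual schema.

```csharp
using System.Collections.Generic;

// Hypothetical shape of one crawled product record (names are illustrative).
public class ProductRecord
{
    public string Code { get; set; }             // product code on the seller's site
    public string Model { get; set; }
    public string Description { get; set; }
    public decimal? Price { get; set; }          // null when the site does not show a price
    public string MeasurementUnit { get; set; }  // unit for purchased-item counts
    public IDictionary<string, string> Features { get; set; } // all technical features listed on the site
}
```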
The interface is built with ASP.NET; the programming language is C# (.NET Framework 3.5). The interface allows setting all crawl parameters, scheduling, and running the crawler, and it shows information about the crawler's last run.
Detailed logging is available for monitoring the process: you can determine where the crawler went wrong and rectify the situation.
Our database contains information about several million products, and this number grows every day. You can view summary data for any category, manufacturer, and period, as well as detailed information about individual goods.
SQL Server 2008 R2 Enterprise is used for data storage and processing.
There are two databases: the first serves Online Transaction Processing (OLTP), and the second serves as a Data Warehouse. Data is transferred to the Data Warehouse by transactional replication.
SQL Server Agent jobs are used extensively for various purposes such as database maintenance, filling summary tables, notifications about the current state of different processes, etc.
CLR stored procedures and functions are used for tasks that cannot be implemented in Transact-SQL, for example, applying regular expressions or downloading data from external sources (the Internet or local networks).
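For instance, .NET regular expressions can be exposed to Transact-SQL through a CLR function along these lines. This is a minimal sketch; the function name and options are illustrative, not the project's actual code.

```csharp
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Transact-SQL has no built-in regex support; this CLR function fills the gap.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBoolean RegexIsMatch(SqlString input, SqlString pattern)
    {
        if (input.IsNull || pattern.IsNull)
            return SqlBoolean.Null;
        return Regex.IsMatch(input.Value, pattern.Value, RegexOptions.IgnoreCase);
    }
}
```

After the assembly is registered with CREATE ASSEMBLY and the function with CREATE FUNCTION ... EXTERNAL NAME, it can be called from ordinary T-SQL queries.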
SQL Server Reporting Services is used to generate a number of reports on both the current state of the system and market activity: the activity of different product categories, manufacturers, and sellers; price changes over any period; price comparisons across companies; analysis of price-index dynamics; etc. Multidimensional structures and data mining models in SQL Server Analysis Services are used to identify the main pricing trends for different product categories, manufacturers, and sellers.
SQL Server Integration Services packages are used heavily for database maintenance and other tasks, such as uploading data to the FTP server.
The following techniques and approaches are widely used to make everything mentioned above possible:
- Transforming a document's malformed HTML markup into a well-structured XML document using SgmlReader (see the first sketch after this list)
- Crawling all pages of a website, or a given part of it, to retrieve the necessary data
- Filling in required form fields when needed (e-mail, ZIP code, login/password, etc.)
- Managing the scanning of Internet information resources
- Starting crawler processes automatically
- Configuring and managing crawlers through the user interface, as well as providing various reports based on crawl results
- Extracting the necessary information from a given Internet resource
- Handling HTML frames
- Handling complex AJAX constructions
- Handling exception pages such as 404 (page not found), “Site under reconstruction”, etc.
- Avoiding blacklists
- Using anonymizers
- Using custom proxy servers (see the proxy sketch after this list)
- Using regular expressions and XPath to extract textual and graphical information (see the extraction sketch after this list)
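The first sketch shows the SgmlReader transformation: malformed HTML is loaded through SgmlReader and comes out as a well-formed XML document that standard .NET XML APIs can query. The HTML fragment and node names are illustrative.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // SgmlReader library

class HtmlToXmlSketch
{
    static void Main()
    {
        // Malformed HTML of the kind real pages often contain:
        // unquoted attribute, unclosed <br> and <span>.
        string html = "<html><body><div class=price>1 299,00<br><span>USD</body></html>";

        SgmlReader sgml = new SgmlReader();
        sgml.DocType = "HTML";                  // parse the input as HTML
        sgml.CaseFolding = CaseFolding.ToLower; // normalize tag names
        sgml.InputStream = new StringReader(html);

        // SgmlReader derives from XmlReader, so the cleaned markup
        // loads straight into a well-formed XmlDocument.
        XmlDocument doc = new XmlDocument();
        doc.Load(sgml);

        Console.WriteLine(doc.SelectSingleNode("//div[@class='price']").InnerText);
    }
}
```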
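Once the markup is well-formed, XPath pinpoints the nodes of interest and regular expressions clean up their text. A sketch with an illustrative fragment and pattern:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

class ExtractionSketch
{
    static void Main()
    {
        // A cleaned-up product fragment, e.g. the output of the SgmlReader step.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<product><name>Widget X200</name><price>USD 1,299.00</price></product>");

        // XPath selects the node that holds the price text.
        string raw = doc.SelectSingleNode("//price").InnerText;

        // A regular expression then pulls the numeric part out of the surrounding text.
        Match m = Regex.Match(raw, @"[\d,]+\.\d{2}");
        if (m.Success)
            Console.WriteLine("Price: {0}", m.Value); // Price: 1,299.00
    }
}
```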
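Routing a request through a custom proxy server is done with the standard .NET 3.5 networking classes. A sketch, assuming a hypothetical proxy address and target URL:

```csharp
using System;
using System.IO;
using System.Net;

class ProxyRequestSketch
{
    static void Main()
    {
        // Hypothetical target and proxy; in production the proxy would come from a rotating pool.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/catalog");
        request.Proxy = new WebProxy("203.0.113.10", 8080);
        request.UserAgent = "Mozilla/5.0 (compatible; crawler)";
        request.Timeout = 30000; // fail fast so a dead proxy does not stall the crawl

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string page = reader.ReadToEnd();
            Console.WriteLine("Fetched {0} characters via proxy", page.Length);
        }
    }
}
```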
Tools and Technologies