The application is a desktop scraper backed by an MS SQL database. It contains several crawlers that collect data from three real estate websites: https://www.redfin.com/, https://www.glassdoor.com/ and http://www.zillow.com/. The scraped data is stored in the application's database and exported to CSV.
The project was fairly complex. https://www.glassdoor.com/ uses browser identification, and when a large number of queries comes from the same user, the server responds with a CAPTCHA. The problem was solved with IP rotation, plus client identification for the cases where a CAPTCHA still has to be dealt with. A dedicated service was created for this: it scans free proxies available on the web at a predefined interval, checks whether each one can be used with each target website, and saves the working proxy addresses to the database for later use.
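The proxy service described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the names scan_proxies, http_check and ProxyRotator are hypothetical, the results are kept in memory rather than in the MS SQL database, and simple round-robin rotation is assumed because the text does not say which rotation policy was used.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List

# Hypothetical signature: returns True if `proxy` ("host:port") can reach `site`.
ProxyCheck = Callable[[str, str], bool]


def http_check(proxy: str, site: str, timeout: float = 5.0) -> bool:
    """Try to fetch `site` through `proxy`; any failure means 'not usable'."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(site, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Timeouts, refused connections, bad responses: skip this proxy.
        return False


def scan_proxies(
    candidates: Iterable[str], sites: Iterable[str], check: ProxyCheck
) -> Dict[str, List[str]]:
    """Return, for each target site, the candidate proxies that passed the check."""
    site_list = list(sites)
    usable: Dict[str, List[str]] = {site: [] for site in site_list}
    for proxy in candidates:
        for site in site_list:
            if check(proxy, site):
                usable[site].append(proxy)
    return usable


class ProxyRotator:
    """Hands out the proxies known to work for one site in round-robin order."""

    def __init__(self, proxies: Iterable[str]) -> None:
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one working proxy is required")
        self._index = 0

    def next_proxy(self) -> str:
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy
```

In a real deployment, scan_proxies would be run on a schedule with http_check as the check function, and each crawler would call next_proxy() before every request. Keeping the check as a parameter makes the scanning logic testable without network access.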
The real estate website https://www.redfin.com/ is protected by Google CAPTCHA, which is very difficult to solve. The crawler was integrated with the third-party service http://www.deathbycaptcha.com, which provides the correct answers for Google CAPTCHA, and a browser emulator was created to click the correct Google CAPTCHA options.
The difficulty with http://www.zillow.com/ was a limit of 300 items that can be scraped per day. As a solution, postal codes were used to restrict each search to a small area, so that all new items available on the website can be scraped daily without hitting the limit.
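The postal-code approach can be illustrated with a short Python sketch. It rests on assumptions that the text does not confirm: that the 300-item limit applies to each individual search, that every item carries a unique "id" field, and that a search(postcode) function exists. The names scrape_by_postcode and RESULT_CAP are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

RESULT_CAP = 300  # limit taken from the text; applying it per search is an assumption

Item = Dict[str, object]


def scrape_by_postcode(
    postcodes: Iterable[str],
    search: Callable[[str], List[Item]],
    seen_ids: Set[object],
) -> Tuple[List[Item], List[str]]:
    """Search one postal code at a time and collect items not seen before.

    Returns the new items plus the postal codes whose result count reached
    the cap, since those areas may hold more items than were returned.
    """
    new_items: Dict[object, Item] = {}
    capped: List[str] = []
    for code in postcodes:
        results = search(code)
        if len(results) >= RESULT_CAP:
            capped.append(code)
        for item in results:
            item_id = item["id"]
            # Skip items stored on earlier runs or already found in this run.
            if item_id not in seen_ids and item_id not in new_items:
                new_items[item_id] = item
    return list(new_items.values()), capped
```

Postal codes reported as capped are candidates for splitting into even smaller search areas. Deduplicating by ID matters because neighbouring search areas can return the same item.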
Tools and Technologies