Downloading/scraping 500,000 pages from a website
I need to scrape data from a particular website, say example.com.
My interest is in example.com/foo/: I need to download pages like
example.com/foo/xyz-vs-abc.html, example.com/foo/def-vs-pqr.html, and so on,
so that I can process the data in them offline. I have all the links
I want to download in a CSV file.
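For reference, this is roughly the plain-Python download loop I had in mind. The names urls.csv and pages/, and the one-URL-per-row layout, are just my assumptions here:

    import csv
    import time
    from pathlib import Path
    from urllib.parse import urlparse

    import requests

    OUT_DIR = Path("pages")                      # where the downloaded HTML goes (assumed name)
    OUT_DIR.mkdir(exist_ok=True)

    with open("urls.csv", newline="") as f:      # assumed: one URL per row
        urls = [row[0] for row in csv.reader(f) if row]

    session = requests.Session()
    session.headers["User-Agent"] = "offline-archiver/0.1"

    for url in urls:
        filename = Path(urlparse(url).path).name or "index.html"
        target = OUT_DIR / filename
        if target.exists():                      # lets the script resume after interruptions
            continue
        resp = session.get(url, timeout=30)
        if resp.ok:
            target.write_bytes(resp.content)
        time.sleep(1)                            # throttle so 500,000 requests don't hammer the server

At one request per second this would take roughly six days for 500,000 pages, which is part of why I am wondering whether a dedicated tool would handle it better.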
I am aware that the data could be extracted directly from the website without
downloading it first, but in my case I think it is best to have local copies
of all the pages so I can process them at leisure.
I would also like to know whether a tool like Scrapy would serve the purpose.
I'm skeptical about it, since the number of HTTP requests sent might get me
blocked.
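To make the Scrapy question concrete, this is roughly the spider I am picturing; urls.csv and pages/ are the same assumed names as above, and the throttling settings are only my guess at a polite configuration:

    import csv
    from pathlib import Path
    from urllib.parse import urlparse

    import scrapy


    class OfflinePagesSpider(scrapy.Spider):
        # Sketch only: saves each fetched page to disk unchanged.
        name = "offline_pages"
        custom_settings = {
            "DOWNLOAD_DELAY": 1,                  # at least ~1 s between requests
            "AUTOTHROTTLE_ENABLED": True,         # back off if the site slows down
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hit the site in parallel
        }

        def start_requests(self):
            Path("pages").mkdir(exist_ok=True)
            with open("urls.csv", newline="") as f:
                for row in csv.reader(f):
                    if row:
                        yield scrapy.Request(row[0], callback=self.parse)

        def parse(self, response):
            filename = Path(urlparse(response.url).path).name or "index.html"
            (Path("pages") / filename).write_bytes(response.body)

I would run it with scrapy runspider spider.py; my worry is whether even this rate is low enough to avoid being blocked over half a million requests.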
In case it is useful, here is what the robots.txt of my target site looks
like:
User-agent: *
Sitemap: http://www.example.com/sitemap.xml
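As far as I understand, a robots.txt with no Disallow lines permits crawling everything; this standard-library check (against the placeholder URLs above) is how I verified my reading of it:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.example.com/robots.txt")
    rp.read()
    # With no Disallow rules under "User-agent: *", this should print True.
    print(rp.can_fetch("*", "http://www.example.com/foo/xyz-vs-abc.html"))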