Scraper¶
The Scraper is the heart of our app: it fetches the data and feeds it to the app, so its design is critical. I considered several tools and technologies for scraping.
I have written scrapers in both Node.js and Python. I chose Python because it is simple and follows a sequential programming style.
The sections below describe the scraping frameworks and browsers I evaluated.
1. Scrapy-Python¶
Scrapy is one of the best frameworks for scraping. However, it cannot scrape dynamic sites where the content is generated on the fly by JavaScript, because it does not run a browser: it downloads the raw HTML and never executes the page's scripts.
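The limitation can be illustrated without Scrapy itself. The sketch below uses Python's standard `html.parser` as a stand-in for any parser that only sees the raw HTML; the two pages and the `/shoes/1` link are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical static page: the link is already in the server's HTML.
STATIC_PAGE = '<html><body><a href="/shoes/1">Shoe</a></body></html>'

# Hypothetical dynamic page: the link only exists after the script runs.
DYNAMIC_PAGE = (
    '<html><body><div id="products"></div>'
    "<script>document.getElementById('products').innerHTML ="
    " '<a href=\"/shoes/1\">Shoe</a>';</script>"
    "</body></html>"
)

print(extract_links(STATIC_PAGE))   # ['/shoes/1']
print(extract_links(DYNAMIC_PAGE))  # []
```

The second call finds nothing because the script body is plain text to the parser; only a browser that executes it would produce the link.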
2. Selenium-xvfb¶
Selenium drives a real browser, so it can scrape dynamic sites. It is somewhat slower, but that does not matter here; it does the job well. Xvfb is a virtual framebuffer X server that lets the browser run without a visible display.
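A minimal sketch of this setup follows. It assumes the third-party `pyvirtualdisplay` package as the Xvfb wrapper and Firefox as the browser; neither choice is stated in this document, and `fetch_rendered_html` is a hypothetical helper name.

```python
def fetch_rendered_html(url):
    """Load a page in a real browser on a virtual display and return its HTML."""
    # Third-party packages (assumed): pip install selenium pyvirtualdisplay
    # Imported here so the module still loads where they are not installed.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Display starts an Xvfb server, so no physical screen is needed.
    display = Display(visible=False, size=(1280, 800))
    display.start()
    try:
        driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
        try:
            driver.get(url)
            # page_source reflects the DOM after the page's scripts have run.
            return driver.page_source
        finally:
            driver.quit()
    finally:
        display.stop()
```

The nested `try`/`finally` blocks make sure both the browser and the Xvfb process are shut down even if the page load fails.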
Puma India scraping¶
Puma India is a dynamic site, but its initial page is static. So we used Scrapy for the first page and Selenium for the product pages.
Puma-lev1¶
- Its job is to fetch all the unique product pages, along with the following fields:
- id
- process_lev1
- process_lev2
- sale
- name
- url
- image_small
- regular_price
- discounted_price
- command to run ::
cd puma-spider/stack/stack/
scrapy crawl puma-lev1
Program Logic¶
- Fetch the data from the site and, while saving it, do the following:
- If the item is not present in the DB (checked using the URL as the key), insert it.
- If the item already exists in the DB, update the following fields:
- sale
- regular_price
- discounted_price
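The insert-or-update logic above can be sketched as follows. This is only an illustration: it assumes SQLite and a `products` table, neither of which is specified here, and `init_db` and `save_item` are hypothetical helper names.

```python
import sqlite3

# Fields refreshed on every crawl for items that already exist.
REFRESHED_FIELDS = ("sale", "regular_price", "discounted_price")
# All level 1 fields except the auto-generated id.
ALL_FIELDS = (
    "url", "name", "image_small", "process_lev1", "process_lev2",
) + REFRESHED_FIELDS


def init_db(conn):
    # Assumed schema; url is UNIQUE because it serves as the lookup key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, name TEXT, image_small TEXT, "
        "process_lev1 TEXT, process_lev2 TEXT, "
        "sale TEXT, regular_price TEXT, discounted_price TEXT)"
    )


def save_item(conn, item):
    """Insert a new item, or refresh only the price fields of an existing one."""
    exists = conn.execute(
        "SELECT 1 FROM products WHERE url = ?", (item["url"],)
    ).fetchone()
    if exists is None:
        columns = ", ".join(ALL_FIELDS)
        placeholders = ", ".join("?" for _ in ALL_FIELDS)
        conn.execute(
            f"INSERT INTO products ({columns}) VALUES ({placeholders})",
            [item.get(field) for field in ALL_FIELDS],
        )
        action = "inserted"
    else:
        assignments = ", ".join(f"{field} = ?" for field in REFRESHED_FIELDS)
        conn.execute(
            f"UPDATE products SET {assignments} WHERE url = ?",
            [item.get(field) for field in REFRESHED_FIELDS] + [item["url"]],
        )
        action = "updated"
    conn.commit()
    return action
```

Calling `save_item` twice with the same URL leaves a single row: the second call changes only sale, regular_price and discounted_price, so fields such as name and process_lev2 keep their stored values.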
Puma-lev2¶
- command to run ::
python puma-lev2.py refresh=N
python puma-lev2.py refresh=Y
- The level 2 spider reads each product page and yields the following fields:
- images - original, big, small
- style number
- availability
- size
Keep a refresh flag for this program. If the refresh flag is Y, process all products; otherwise, process only those whose process_lev2 flag is N.
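One possible reading of the refresh rule is sketched below. The helper names and sample rows are hypothetical, and treating a missing argument like refresh=N is an assumption, since the default is not stated here.

```python
def parse_refresh(argv):
    """Return True when the arguments contain refresh=Y."""
    for arg in argv:
        key, sep, value = arg.partition("=")
        if sep and key == "refresh":
            return value.upper() == "Y"
    return False  # assumption: no flag behaves like refresh=N


def select_products(products, refresh):
    """Pick the level 1 rows that the level 2 spider should visit."""
    if refresh:
        return list(products)
    return [p for p in products if p.get("process_lev2") == "N"]


# Hypothetical rows standing in for the data saved by puma-lev1.
products = [
    {"url": "/p/1", "process_lev2": "Y"},
    {"url": "/p/2", "process_lev2": "N"},
]

print([p["url"] for p in select_products(products, parse_refresh(["refresh=N"]))])
# ['/p/2']
print([p["url"] for p in select_products(products, parse_refresh(["refresh=Y"]))])
# ['/p/1', '/p/2']
```

In the real script the argument list would come from `sys.argv[1:]`.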