After months of soaking up American culture and adapting to the course load at USC, I finally got a chance to brush up my Python skills. Fortunately, I still remember the basic commands quite well, and perhaps that's how interest can drive an ordinary person to learn something deeply.
In short, I still want to sharpen my skills at grabbing content from different kinds of websites. Everybody knows there are tons of Python packages that can do this, but everybody should also be careful about the learning curve of those tools unless you major in computer science. During this semester, I have heard a lot of my classmates claim that Python is easy to learn, but COME ON, Python is after all a very powerful object-oriented programming language. If you have never touched programming before, it can drive you crazy. To sum up with an analogy: it's like learning to play the guitar. At first it might be easy, but if you want to play a BLUES SOLO on stage in front of thousands of people, you'd better invest a decent amount of time in it.
Alright… To me, a business undergrad, learning Python is a process full of challenging tasks, but I really enjoy doing it in my spare time. Now it's time for Scrapy. People who use Python to scrape websites will definitely have heard of it. It is "easy" to use, and its automation saves users a lot of work. So, after successfully grabbing data with BeautifulSoup and regular expressions, I looked into Scrapy to do a similar task, hoping it would go much, much faster thanks to the asynchronous networking built into Scrapy.
There are plenty of tutorials on how to set up Scrapy, so I won't repeat them here. It is basically a matter of typing commands in CMD (Windows) or Terminal (Mac), and the computer sets up the project folder for you. Afterwards, you need a text editor to write the code. In theory, you only need to fill in items.py and a spider file to put Scrapy to work, which is what I am going to do. Power users can customise way more settings.
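For reference, the setup boils down to a couple of commands (assuming Scrapy is already installed, e.g. via pip; stockmarket is the project name used throughout this post):

```shell
# create the project skeleton: settings.py, items.py, and a spiders/ folder
scrapy startproject stockmarket
cd stockmarket

# later, run the spider by the name attribute defined in its class
scrapy crawl stockmarket
```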
# -*- coding: utf-8 -*-
# Define here the models for your scraped items.
# See documentation in:

from scrapy.item import Item, Field


class StockmarketItem(Item):
    # the fields for the item, one per column of the quotes table
    stock_date = Field()
    stock_open = Field()
    stock_high = Field()
    stock_low = Field()
    stock_close = Field()
    stock_volume = Field()
Above is my items.py file. It sets up the fields that will store the content grabbed by the spider file. You can omit this step if you just want to print the results to the console.
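To see why declaring the fields matters: a Scrapy Item behaves like a dict, but assigning a key that was never declared as a Field raises a KeyError, which catches typos early. Here is a tiny pure-Python stand-in (not Scrapy's real implementation, just a sketch of the behavior):

```python
class FakeItem:
    """Toy stand-in for scrapy.Item: dict-like, but only declared fields allowed."""
    fields = ('stock_date', 'stock_open', 'stock_close')

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


item = FakeItem()
item['stock_close'] = '50.17'
print(item['stock_close'])          # prints 50.17

try:
    item['stock_colse'] = '50.17'   # typo: rejected at assignment time
except KeyError as e:
    print('rejected:', e)
```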
# -*- coding: utf-8 -*-
import time

from scrapy.contrib.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.support.ui import Select

from stockmarket.items import StockmarketItem


class StockmarketSpider(CrawlSpider):
    name = 'stockmarket'
    allowed_domains = ['nasdaq.com']
    start_urls = ['http://www.nasdaq.com/symbol/yhoo/historical']

    def parse(self, response):
        # Selenium drives a real browser, because the quotes table is
        # loaded by AJAX and is not in the raw HTML that Scrapy downloads
        driver = webdriver.Chrome()
        driver.get(response.url)

        # pick "1 Year" from the time-frame dropdown by its visible text
        select = Select(driver.find_element_by_id('ddlTimeFrame'))
        select.select_by_visible_text('1 Year')

        # wait for the AJAX table to reload after changing the dropdown
        time.sleep(5)

        # key step: each <tr> of the historical-quotes table is one trading day
        sites = driver.find_elements_by_xpath(
            '//*[@id="quotes_content_left_pnlAJAX"]/table/tbody/tr')

        items = []
        for i in sites:
            tds = i.find_elements_by_xpath('.//td')  # the six cells of the row
            item = StockmarketItem()
            item['stock_date'] = tds[0].text.strip()
            item['stock_open'] = tds[1].text.strip()
            item['stock_high'] = tds[2].text.strip()
            item['stock_low'] = tds[3].text.strip()
            item['stock_close'] = tds[4].text.strip()
            item['stock_volume'] = tds[5].text.strip()
            items.append(item)

        driver.quit()
        return items
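If you want to test the row-to-cells logic without launching a browser, the same idea can be sketched with the standard-library html.parser on a saved snippet of the table (the sample row below is made-up data, not real NASDAQ numbers):

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the text of every <td> inside each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._cells = None      # cells of the row being read, or None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._in_td = True
            self._cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data.strip()


html = """<table><tbody>
<tr><td>01/02/2015</td><td>50.00</td><td>51.00</td>
    <td>49.50</td><td>50.17</td><td>11,924,968</td></tr>
</tbody></table>"""

parser = RowParser()
parser.feed(html)
date, open_, high, low, close, volume = parser.rows[0]
print(date, close)   # prints: 01/02/2015 50.17
```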
Another tricky thing is that Scrapy is very fast thanks to its Twisted framework, and this can raise an error if your Internet connection is not fast enough: the error says that the tags you are looking for do not exist at the index you just used, simply because the AJAX table has not finished loading yet. That's why I use "time.sleep" to wait for the table to load fully.
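A plain time.sleep either wastes time or is still too short on a slow connection. The more robust pattern, which is essentially what Selenium's WebDriverWait does, is to poll a condition until it becomes true or a timeout expires. A generic pure-Python sketch of that pattern (the name wait_until and the poll interval are my own choices):

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` (a zero-argument callable) until it returns a truthy
    value or `timeout` seconds pass. Returns the value, or raises TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll)


# toy usage: the "table rows" only appear after a short delay
loaded_at = time.monotonic() + 0.2
rows = wait_until(
    lambda: ['row1', 'row2'] if time.monotonic() >= loaded_at else [],
    timeout=5.0, poll=0.05)
print(rows)   # prints: ['row1', 'row2']
```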
In the terminal, I type "scrapy crawl stockmarket -o items.csv -t csv" to generate a CSV file. Everything goes well, and I enjoy the moment when the terminal scrolls quickly to show all the results from the website. It looks like a pretty easy job, but who would know I had been debugging my code for a week to make it work.
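The -o flag is Scrapy's built-in feed export doing the work. For anyone curious what it amounts to, here is a minimal sketch of writing the same kind of rows by hand with Python's csv module (the sample numbers are made up):

```python
import csv
import io

fieldnames = ['stock_date', 'stock_open', 'stock_high',
              'stock_low', 'stock_close', 'stock_volume']

items = [
    {'stock_date': '01/02/2015', 'stock_open': '50.00', 'stock_high': '51.00',
     'stock_low': '49.50', 'stock_close': '50.17', 'stock_volume': '11,924,968'},
]

# io.StringIO stands in for open('items.csv', 'w', newline='')
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())   # prints the CSV header plus one data row
```

Note that csv automatically quotes the volume cell because it contains commas itself.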
So sad that I am a business student….