Scrapy, Finally….

After months of experiencing American culture and adapting to the course level at USC, I finally got a chance to brush up my Python skills. Fortunately, I still remember the basic commands quite well; perhaps that's how interest drives an ordinary person to learn something deeply.

In short, I still want to sharpen my skill at grabbing content from different kinds of websites. Everybody knows there are tons of Python packages that can do that, but everybody should be careful about the learning curve of that stuff unless they major in computer science. This semester I have heard a lot of my classmates claim that Python is easy to learn, but COME ON, Python is, after all, a very powerful object-oriented programming language. If you have never touched programming before, it will drive you crazy. To sum up with an analogy, it's like learning to play the guitar: at first it might be easy, but if you want to play a blues SOLO on stage in front of thousands of people, you'd better invest a decent amount of time in it.

Alright… To me, a business undergrad, learning Python is a process full of challenging tasks, but I really enjoy working on it in my spare time. Now it's time for Scrapy. Anyone who learns to scrape websites with Python has definitely heard of it. It is “easy” to use, and the automation it sets up helps users a lot. So, after successfully grabbing content with BeautifulSoup and regular expressions, I looked into Scrapy to do a similar task, hoping it could be done much, much faster with the concurrent request handling built into Scrapy.

There are a lot of tutorials about how to set up Scrapy, so I won't repeat them here. It basically comes down to typing commands in CMD (Windows) or Terminal (Mac), and Scrapy sets up the project folder for you; the commands are sketched below. Afterwards, you need a text editor to write the code. In theory, you only need to fill in items.py and a spider file to put Scrapy to work, which is what I am going to do. Power users can customise way more settings.
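For reference, the setup amounts to something like this (“stockmarket” here is just a placeholder project name):

```shell
# Generate a new Scrapy project skeleton; "stockmarket" is a placeholder name.
scrapy startproject stockmarket
cd stockmarket
# Scrapy creates items.py and a spiders/ folder to hold the spider code.
```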

[Screenshot: my items.py file]

Above is my items.py file. It sets up some fields to store the content that will be grabbed by the spider file. You can omit this step if you just want to print the results to the console.
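Roughly, the file looks like this; the field names below are illustrative stock fields, not necessarily the exact ones in my screenshot:

```python
# items.py, roughly; the field names here are illustrative stock fields.
import scrapy

class StockItem(scrapy.Item):
    symbol = scrapy.Field()  # ticker symbol
    price = scrapy.Field()   # last traded price
    change = scrapy.Field()  # daily change
    volume = scrapy.Field()  # trading volume
```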

[Screenshot: my spider file]

The tricky part is that I use Selenium to select some options on the page. I bumped into this brand-new (to me) package not because I wanted to use it but because I had to. Yahoo's stock pages are full of JavaScript and jQuery, which means I could not simply select a tag and pull the information back, because the tags are not even there until the scripts run! I needed to simulate the click action, and that's where Selenium comes in to help. I might write about this package in the future.
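Stripped of the Scrapy plumbing, the Selenium step looks something like this; the URL and the CSS selector are placeholders, not the ones in my real code:

```python
# A bare-bones sketch of the Selenium step; URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://finance.yahoo.com/quote/AAPL/history")  # placeholder page

# The data only exists after the JavaScript runs, so simulate the click
# a human would make to choose an option on the page.
driver.find_element(By.CSS_SELECTOR, "span.placeholder-option").click()

html = driver.page_source  # rendered HTML, tags included, ready to parse
driver.quit()
```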

Another tricky thing is that Scrapy is so fast on its Twisted framework that it can raise an error if your Internet speed is not fast enough. The error states that the tags you are looking for are not at the index you just set, because the page has not finished loading yet. That's why I use “time.sleep” to wait for the page to load fully.
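One way to wrap that wait, roughly (the three-second pause is an arbitrary number, not something I tuned):

```python
import time

def click_and_wait(driver, element, seconds=3):
    """Click an element, then pause so the JavaScript-rendered content
    has time to appear before the page source is parsed."""
    element.click()
    time.sleep(seconds)  # crude, but it waits out the slow loading
    return driver.page_source
```

Selenium also ships a WebDriverWait helper that waits for a specific element instead of a fixed number of seconds, which is the more robust fix, but time.sleep gets the job done.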

In the terminal, I type “scrapy crawl stockmarket -o items.csv -t csv” to generate a CSV file. Everything goes well, and I enjoy the moment when the terminal refreshes quickly to show all the results coming back from the website. It looks like a pretty easy job, but who would know I had been debugging my code for a week to make it work.

So sad that I am a business student….
