Scrapy, Finally….

After months of experiencing American culture and adapting to the course level at USC, I finally got a chance to review my Python skills. Fortunately, I remember the basic commands quite well; perhaps that’s how interest can drive an ordinary person to learn something deeply enough.

In short, I still want to get better at grabbing content from different kinds of websites. Everybody knows there are tons of Python packages that can do that, but unless you major in computer science you should be careful about their learning curves. During this semester, I have heard a lot of my classmates claim that Python is easy to learn, but COME ON, Python is after all a very powerful object-oriented programming language. If you have never touched programming before, it can drive you crazy. To sum up with an analogy, it’s like learning to play the guitar: at first it might be easy, but if you want to play a BLUES SOLO on stage in front of thousands of people, you’d better invest a decent amount of time in it.

Select other columns when using functions in SQL

Thanks to INF 551, I finally got a chance to practice the SQL skills that I learned on my own. In this class, the instructor recommends that we use MySQL, which is not bad, but it is not as familiar to me as SQLite.

SQL is straightforward in most ways, but I ran into a tricky problem when I wanted to select another column while using aggregate functions such as sum() or count() in a SELECT clause. In Excel or R, I can easily specify which column to show when applying some conditions. However, SQL does not seem to provide such a user-friendly way to do that.

For example, when I wanted to select the column next to the maximum of the summed “sales_amount” column, I ended up nesting two “SELECT” subqueries in the statement and using a WHERE clause to state the final condition. Note that the two subqueries in brackets are almost identical except for their letter aliases.

I searched the Internet but could not find a more elegant way to solve this. Another method is to create a new table or view to stand in for the subqueries in brackets.
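One possible alternative, sketched here with Python’s built-in sqlite3 module: group by the other column, sort by the summed amount, and keep only the first row. The table name “sales” and the column “salesperson” are made up for illustration (only “sales_amount” comes from the example above), and the sketch assumes a tie for first place can be ignored, since LIMIT 1 returns a single row.

```python
import sqlite3


def top_seller(rows):
    """Return (salesperson, total) for the largest summed sales_amount.

    `rows` is a list of (salesperson, sales_amount) tuples. The table and
    the salesperson column are hypothetical names used only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (salesperson TEXT, sales_amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    # Aggregate per salesperson, then sort so the largest total comes first.
    query = """
        SELECT salesperson, SUM(sales_amount) AS total
        FROM sales
        GROUP BY salesperson
        ORDER BY total DESC
        LIMIT 1
    """
    result = conn.execute(query).fetchone()
    conn.close()
    return result
```

The same ORDER BY ... LIMIT 1 query should also work in MySQL, so the nested subqueries may not be needed when only the top row matters.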

Does anyone know a better solution?

Is Excel kidding me with ” ‘ ” ?

I just took my midterm exam, which really kicked my ass. Obviously it has something to do with Excel, and it was not a pleasant experience.

One question required me to convert a cell’s format from a date to the day of the week. I fully understand the formula, which is =TEXT(A2, "dddd"). However, because I have been busy teaching myself programming languages this semester, I wrongly assumed that a single quote (') would work the same as a double quote ("). Written out like that it looks quite weird, doesn’t it?

But no, Excel’s formulas still follow their own old rules. How is it possible that it does not support single quotes? I spent about half an hour on it and was still unaware of the problem. Only after I had submitted the exam did I realize that this could be the cause.

So be careful next time you use a format string in Excel. DO NOT USE SINGLE QUOTES!

Hands on the Tkinter module but a sad story….

Actually, these days I’ve been working hard on learning Python’s Tkinter module because I find it quite interesting. It is a tool to make my teenage dreams come true, since I have wanted to make something visually enjoyable for a long time.

To be honest, learning Tkinter is pretty complex because it requires a lot of prerequisite knowledge about the Tk programming “protocol”. I have only been at it for 3 days, but I want to create some programs to keep up my confidence and my interest in learning more. So, I want to share a sad but true story told with the Tkinter module. Here’s the code:

Please don’t laugh at me, because I want to keep writing….(T_T)

Come on Python! Just Read My Clipboard

I don’t know whether you guys have been annoyed by MyStatLab, powered by Pearson. These days I have been busy doing homework on it. My homework is about statistical analysis and requires a lot of work in RStudio. However, as everybody knows, R doesn’t support the xls format very well, so users have to convert the xls file into a csv file first.

That might not seem like a big deal as long as you have a csv file, a format that is ubiquitous in the data world. Nevertheless, MyStatLab doesn’t provide the csv format and gives users the xls format instead! It gives the user two options: one is to copy the datasheet to the clipboard; the other is to download it as an Excel file.

This was a total nightmare when I tried to do my work on MyStatLab. I want to avoid the uncertainties of reading a txt file in RStudio (which would mean copying and pasting the data into a txt file, and you know we can’t simply rename “.txt” to “.csv” because it won’t work!), so I have to download the Excel file, open it in Excel, save it as a csv file, and then read it in RStudio. It is all repetitive work, quite time consuming and not pleasant.

[Screenshot, 2015-09-14]

After finishing my job painfully and stupidly, I calmed down and tried to solve it in Python. I hate all repetitive work and I believe a machine should do it better. Therefore, I wrote a small program to read my clipboard and save its contents as a csv file.

P.S. I did consider reading the Excel file and running a Python program to save it as a csv file, but I don’t see any gain in efficiency compared with copying the data and running the Python script. Another reason is that writing csv files is supported natively by Python.

Here is the code:

I referred to http://stackoverflow.com/questions/16188160/how-to-read-data-from-clipboard-and-pass-it-as-value-to-a-variable-in-python, which was quite useful and also shows how to keep the Tkinter window from popping up automatically. Also, I added support for UTF-8 encoding because I might use it to convert csv files whose headers contain Chinese characters.
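A minimal sketch of the approach described above, written for Python 3 and not necessarily identical to the original script. It assumes the copied datasheet arrives on the clipboard as tab-separated text, one row per line; the function names and the output file name are made up for illustration.

```python
import csv


def read_clipboard():
    """Return the clipboard text without showing a Tkinter window."""
    import tkinter  # imported here so text_to_csv works without a display

    root = tkinter.Tk()
    root.withdraw()  # hide the empty root window
    try:
        return root.clipboard_get()
    finally:
        root.destroy()


def text_to_csv(text, path):
    """Write tab-separated text to a UTF-8 csv file; return the row count."""
    # Assumption: cells are separated by tabs and rows by line breaks.
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


# Typical use after copying the datasheet:
#     text_to_csv(read_clipboard(), "clipboard.csv")
```

Keeping the clipboard read separate from the conversion means the conversion can be tried on any string, even on a machine without a display.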

Something I like about SAS

I’ve been actively learning SAS for about a week. During this time, I found out something about SAS that has really kept it at the top of statistical software for a long time. I want to write down my feelings about it here, as a reminder.

First of all, I want to ask a question: what software do you think of when you hear other people talk about statistics? Answers may vary: R, Python, SAS, STATA, SPSS... To be honest, there was one of them I had never touched, and that is SAS. Because of my undergraduate major, I used STATA quite a lot, and my whole paper relied on it. Additionally, I have tried to broaden my view of statistical software, so I have installed and played with RStudio and SPSS before. Apart from the wide range of things R can do, no statistical software had stunned me. At least that was my initial impression. I love Python very much, but I don’t have much experience with statistics-related packages such as NumPy.

Only when I was forced to learn SAS did I discover its most interesting point: it separates data input from data analysis with the DATA and PROC steps! I appreciate this design very much, because as the command file grows longer, it helps you find the part you want to navigate to more efficiently. There is some learning curve at first, but once you get the hang of it, you will treasure the clarity it gives you.

When I wrote my undergraduate paper using STATA, I thought of it as Python-like statistical software because the two have similar coding logic. It is simple, but still not easy for a non-programmer to understand at once. As for R, it simplifies a lot of the commands and makes it easier for statisticians to code, but it is much harder to learn than SAS because it is more versatile and comes with a lot of packages a researcher may never need. It’s true that people have different tastes, and that is also why there are many similar but different tools in one area.

Frankly speaking, the coding structure of SAS is not as efficient as that of R. For example, I have to type “RUN;” each time to make it run. What I like is its logic of separating data input from analysis. Just think about it: if Python could open every type of data file with a “DATA” command, it would be a great convenience!

Generating unbreakable password ?

The other day, when I was busy learning SAS and R in the workshop provided by the MSBA program at USC, I thought a lot about my next Python program. All the programs I write are driven by my interest in Python, so until I find something interesting again, I will hardly write anything notable.

I came across a post at http://www.williamlong.info/archives/3224.html, which inspired this short program. Unfortunately it is written in Chinese; its main idea is about the security of your password. To sum up, there are a lot of ways to protect a stored password, and most of them are related to MD5 (Message-Digest Algorithm), which generates a hash value. However, with fast-growing technology, MD5 hashes can also be cracked using supercomputers. Many people would recommend a newer algorithm, but in my opinion cracking is only a matter of time.

Therefore, what I am going to do is make password cracking more difficult. I am going to give the password a random shuffle and run it through a for loop to increase the complexity and the time needed to break it. Here is my simple idea:
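A minimal sketch of the “loop to raise the cost” part of this idea. It is not the shuffle-based program described above: it swaps in a salted, iterated hash (PBKDF2 from Python’s standard hashlib module) as a stand-in. The iteration count and the function names are assumptions chosen for illustration.

```python
import hashlib
import hmac
import os

# Assumption: more iterations make each guess more expensive for an attacker.
ITERATIONS = 200_000


def hash_password(password, salt=None):
    """Return (salt, digest) for a password, using a random salt by default."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected):
    """Re-hash the candidate password and compare it with the stored digest."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

The random salt plays a role similar to the random shuffle: two users with the same password end up with different stored values, while the loop count sets how long each guess takes.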

While writing this example I found out that Sublime Text 3, surprisingly, does not support the “raw_input” function, because it cannot pop up a window for users to type in. This makes ST3 so annoying that I have to keep a terminal open all the time.

Overall, I don’t think any kind of password is unbreakable, so security should be considered carefully every time you store one somewhere. What’s more, we need to increase the time and cost of cracking, and also lower the value of breaking our passwords. Protect yourself on the Internet!

Jikexueyuan Spider

www.jikexueyuan.com is one of my favorite websites for studying coding on my own. One of its biggest advantages is that the lessons are relatively short compared with the long and boring lessons on some websites. However, the UI of jikexueyuan.com does a poor job of showing learners how different classes connect. For example, I want to study HTML and CSS, but I cannot filter by lecturer or by time added, so it is extremely hard to work out which class comes next (the time added is only shown on each course’s main page, not on the index page). The ONLY thing I can do is select a specific category such as HTML and face a number of unordered courses, which leaves me frustrated about which one to take after the current course.

Therefore, I would like to design an application that retrieves all the related courses under a specific category, automatically opens the URL of each course in the background, and then retrieves the name and time added of each course. Finally, I want it to sort the courses by time added and write the information to a .txt file so that I can analyze it conveniently. Here is the code:
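As a rough sketch of only the last two steps (sorting by time added and writing the .txt file): the scraping itself is left out because it depends on the site’s page structure. The (name, time added) tuples, the date format, and the function names are assumptions made for illustration.

```python
from datetime import datetime


def sort_courses(courses, date_format="%Y-%m-%d"):
    """Sort (name, time_added) tuples from oldest to newest.

    Assumption: time_added is a string such as "2015-03-02".
    """
    return sorted(courses, key=lambda c: datetime.strptime(c[1], date_format))


def write_courses(courses, path):
    """Write one course per line as 'time_added<TAB>name' in UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for name, added in courses:
            f.write(f"{added}\t{name}\n")
```

Parsing the dates before sorting avoids surprises if the site ever shows dates in a format that does not sort correctly as plain text.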

Use python to check my email inboxes

Last month I had the idea of checking my email inboxes using Python. I did some research on the web and tried to imitate the great solutions from other coders. Because I currently use the IMAP service for my main email accounts, I chose the “imaplib” module for the job. Here is the code:

I needed this because in China it is kind of difficult to check my Gmail account, and I want a simpler way to check my inboxes without opening bulkier email clients.
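A minimal sketch of what such a check might look like with imaplib, not necessarily the original script. It only counts unread messages; the host, user name, and password are placeholders that the caller has to supply, and the function names are made up for illustration.

```python
import imaplib


def count_unseen(conn, mailbox="INBOX"):
    """Count unread messages using an already logged-in IMAP connection."""
    conn.select(mailbox, readonly=True)  # read-only: do not mark mail as read
    status, data = conn.search(None, "UNSEEN")
    if status != "OK":
        raise RuntimeError("IMAP search failed")
    # data[0] is a space-separated byte string of message numbers.
    return len(data[0].split())


def check_inbox(host, user, password):
    """Log in over SSL, count the unread messages, and log out again."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        return count_unseen(conn)
    finally:
        conn.logout()
```

Splitting the counting from the login makes it easy to reuse one connection for several mailboxes.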

My First Python Application

It’s been two months since I started learning Python on my own. It all started from the bottom of my heart, from what I really want to learn. To be honest, I genuinely feel that I’ve enjoyed this journey.

After reading related books and watching videos for quite a long time, I decided to get some hands-on practice, which is more difficult than expected.

What drove me to write this small application is that I love watching sports, and I often check a website, “www.zhibo8.cc”, to see which games are live today. However, I found it repetitive and a waste of time to open my browser, select the bookmark and then scroll down to the part I want. I have tried several modules for web scraping, and this time I chose BeautifulSoup to parse the web page. The code is as follows:

It worked pretty well and showed me what I needed in the console. But after I came to the U.S., it started to raise an error saying the variable “today” is NoneType. After debugging, I found the cause kind of funny: it is the time zone difference. The variable “i” here represents today’s date in the U.S., while in China it is sometimes already one day later. The “zhibo8” website deletes all of a day’s live game information at the end of that day. So if I happen to run this application at night in Los Angeles, when it is already the next morning in China, the lookup for today’s date finds nothing and I hit this error.

I plan to figure out a way to solve this problem later, and I am quite satisfied with my first executable python file after two months.
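One possible fix for the time zone issue, as a minimal sketch: work out today’s date in China time instead of local time before searching the page. It assumes the site lists games by China date (UTC+8, with no daylight saving time); the function name china_date is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the website's "today" follows China Standard Time (UTC+8).
CHINA_TZ = timezone(timedelta(hours=8))


def china_date(now=None):
    """Return the current calendar date in China, wherever the script runs."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(CHINA_TZ).date()
```

Even with this in place, it is probably worth checking whether the page lookup returned None before using the result, since the site may simply have no games listed for that day.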