www.jikexueyuan.com is one of my favorite websites for studying coding on my own. One of its biggest advantages is that its lessons are relatively short compared with the long, tedious lessons on some other sites. However, the UI of jikexueyuan.com does a poor job of showing learners the connections between different courses. For example, when I want to study HTML and CSS, I cannot filter courses by lecturer or by the date they were added, so it is extremely hard to work out which course should come after the current one (the date added appears only on each course's main page, not on the index page). All I can do is select a category such as HTML and then face a pile of unordered courses, with no idea which one to take next.
Therefore, I designed an application that retrieves all the courses under a given category, opens the URL of each course in the background, and extracts each course's name and the date it was added. Finally, it sorts the courses by date added and writes the result to a .txt file so that I can analyze it conveniently. Here is the code:
#coding=utf-8
from bs4 import BeautifulSoup
import re
import urllib2
import time
import os
import sys

# Python 2 hack so unicode course names can be written to plain files.
reload(sys)
sys.setdefaultencoding('utf8')


class See_Course_History(object):
    """This application collects the names and dates added of all the
    courses under one category on www.jikexueyuan.com."""

    def __init__(self, website='http://www.jikexueyuan.com/course/web/', pageNum=11):
        super(See_Course_History, self).__init__()
        self.website = website
        self.pageNum = pageNum
        self.lesson_tags = []   # <h2 class="lesson-info-h2"> tags from every listing page
        self.urlbox = []        # course URLs extracted from those tags

    def Htmlparser(self):
        """Download every listing page and keep the course-header tags."""
        print 'I am retrieving the .html files of all pages under this category'
        for i in xrange(1, self.pageNum + 1):
            page = urllib2.urlopen(self.website + '?pageNum=' + str(i))
            soup = BeautifulSoup(page, 'html.parser')
            self.lesson_tags += soup.find_all(class_='lesson-info-h2')

    def get_page_urls(self):
        """Pull the href of each course out of the collected header tags."""
        self.Htmlparser()
        print 'I am working hard to get the urls on the .html files'
        joined = ','.join(str(tag) for tag in self.lesson_tags)
        self.urlbox = re.findall(r'href="(.*?)"', joined)

    def write_urls(self):
        """Write the urls to a txt file, so that get_course_info can reuse
        them later without connecting to the listing pages again."""
        self.get_page_urls()
        with open('../desktop/page_urls.txt', 'w') as f:
            for url in self.urlbox:
                f.write(url + '\n')
        print 'I have successfully finished writing the document. Now you can check it.'

    def get_course_info(self):
        """Visit every saved course url, collect each course's name and date
        added, sort by date added, and write the result to courses_info.txt."""
        print 'I am scanning and connecting all the urls in the document... It might take some time.'
        with open('../desktop/page_urls.txt', 'r') as f:
            urls = f.read().rstrip().split('\n')
        d = {}
        for url in urls:
            soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
            for box in soup.find_all(class_='bc-box'):
                # key: course name, value: the "time added" text inside the box
                d[box.h2.string] = box.div.span.next.next.string
        courses = sorted(d.iteritems(), key=lambda kv: kv[1])
        with open('../desktop/courses_info.txt', 'w') as f:
            for name, added in courses:
                f.write(name + ' ' + added + '\n')
        print 'All work is done. You can see the courses_info.txt right now'


if __name__ == '__main__':
    a1 = See_Course_History()
    # Only rebuild the url list if it is not on disk yet.
    if not os.path.exists(r'../desktop/page_urls.txt'):
        a1.write_urls()
        time.sleep(3)
    a1.get_course_info()
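If you are on Python 3, where urllib2 and the setdefaultencoding hack no longer exist, the same pipeline could be sketched as below. This is only a minimal sketch: the ?pageNum= parameter, the 'lesson-info-h2' and 'bc-box' class names, and the tag layout inside each box are assumptions carried over from the scraper above, and they may no longer match the live site.

# -*- coding: utf-8 -*-
# Minimal Python 3 sketch of the same scraping pipeline.
import urllib.request
from bs4 import BeautifulSoup

BASE = 'http://www.jikexueyuan.com/course/web/'

def collect_course_urls(page_count=11):
    """Walk every listing page and pull the course links out of the headers."""
    urls = []
    for page in range(1, page_count + 1):
        html = urllib.request.urlopen(BASE + '?pageNum=' + str(page)).read()
        soup = BeautifulSoup(html, 'html.parser')
        for header in soup.find_all(class_='lesson-info-h2'):
            link = header.find('a')  # assumes each header wraps an <a> tag
            if link is not None and link.get('href'):
                urls.append(link['href'])
    return urls

def collect_course_info(urls):
    """Visit each course page and map course name -> time added."""
    info = {}
    for url in urls:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        for box in soup.find_all(class_='bc-box'):
            # The tag layout inside 'bc-box' is carried over from the
            # Python 2 scraper above; adjust it if the markup has changed.
            info[box.h2.string] = box.div.span.next.next.string
    return info

if __name__ == '__main__':
    courses = collect_course_info(collect_course_urls())
    with open('courses_info.txt', 'w', encoding='utf-8') as f:
        for name, added in sorted(courses.items(), key=lambda kv: kv[1]):
            f.write('%s %s\n' % (name, added))

One design note: sorting by the raw "time added" string only gives true chronological order if the site formats its dates consistently (for example YYYY-MM-DD); if not, you would want to parse the strings into dates before sorting.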