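The Zhengfang academic affairs system seems to have been upgraded recently, and the code circulating online no longer works. The likely cause is that the cookie and the captcha fall out of sync: every simulated login against the new URL ended with "Object moved to here." Below is code that uses the requests module to log in to the academic affairs system and scrape the timetable. (The timetable is printed as-is; it is not exported to Excel or prettified.)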
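Before the script itself, here is a small offline sketch of the cookie/captcha mismatch described above. It does not talk to a real Zhengfang server: `FakeLoginHandler`, the `sid` cookie and the `/login` path are all made up for illustration, and the "captcha" is plain text rather than an image. It uses only the standard library; the opener with its own cookie jar plays the role that `requests.session()` plays in the script.

```python
import itertools
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the school server: one captcha per session cookie.
CAPTCHA_BY_SESSION = {}
SESSION_IDS = itertools.count(1)


class FakeLoginHandler(BaseHTTPRequestHandler):
    def log_message(self, *args):
        pass  # keep the demo output quiet

    def do_GET(self):
        # Read the session id from the Cookie header, or start a new session.
        sid = None
        for part in self.headers.get("Cookie", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name == "sid":
                sid = value
        is_new = sid is None
        if is_new:
            sid = str(next(SESSION_IDS))

        url = urlparse(self.path)
        if url.path == "/CheckCode.aspx":
            # A real server returns an image; here the captcha is plain text.
            CAPTCHA_BY_SESSION[sid] = "code" + sid
            body = CAPTCHA_BY_SESSION[sid]
        else:
            typed = parse_qs(url.query).get("code", [""])[0]
            body = "ok" if CAPTCHA_BY_SESSION.get(sid) == typed else "mismatch"

        payload = body.encode("utf-8")
        self.send_response(200)
        if is_new:
            self.send_header("Set-Cookie", "sid=%s; Path=/" % sid)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def new_client():
    """An opener with its own cookie jar, similar in role to requests.session()."""
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # never route the local demo via a proxy
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )


def fetch(client, url):
    with client.open(url, timeout=5) as response:
        return response.read().decode("utf-8")


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), FakeLoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        same = new_client()
        code = fetch(same, base + "/CheckCode.aspx?")
        with_same_cookies = fetch(same, base + "/login?code=" + code)

        other = new_client()  # fresh cookie jar, so the server sees a new session
        with_new_cookies = fetch(other, base + "/login?code=" + code)
    finally:
        server.shutdown()
        server.server_close()
    return with_same_cookies, with_new_cookies


if __name__ == "__main__":
    print(run_demo())
```

The same captcha text is accepted only when it is sent back with the cookie it was issued under, which is why the script keeps a single session for the login page, the captcha and the login POST.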
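The full script is about 60 lines long. Note that you have to fill in your own account and password.
Zhengfang MIS installations are almost always reachable at http://server-address/default2.aspx
The captcha is served from http://server-address/CheckCode.aspx?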
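As a small sketch of the URL pattern above, the two addresses can be derived from the server address alone. The helper name `zhengfang_urls` is made up for illustration, and individual schools may deviate from this layout.

```python
def zhengfang_urls(server):
    """Build the login and captcha URLs for a given server address.

    `server` is just the host part, e.g. "jw3.edu.cn" (hypothetical helper,
    following the pattern described in the post).
    """
    base = "http://{}".format(server)
    return {
        "login": base + "/default2.aspx",
        "captcha": base + "/CheckCode.aspx?",
    }


if __name__ == "__main__":
    for name, address in zhengfang_urls("jw3.edu.cn").items():
        print(name, address)
```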
Code:
from lxml import etree
import requests

studentnumber = "*******"  # your student number
password = "*******"  # your password

# Use one session for every request so that the login page, the captcha
# and the login POST all share the same cookie.
s = requests.session()
url = "http://jw3.edu.cn/default2.aspx"
response = s.get(url)
selector = etree.HTML(response.content)
# Hidden ASP.NET form field that has to be sent back with the login POST.
__VIEWSTATE = selector.xpath('//*[@id="form1"]/input/@value')[0]

# Download the captcha with the same session and save it to read by eye.
imgUrl = "http://jw3.edu.cn/CheckCode.aspx?"
imgresponse = s.get(imgUrl)
image = imgresponse.content
try:
    # Change this to a path that exists on your machine.
    with open("C:/Users/dell/Desktop/1.jpg", "wb") as jpg:
        jpg.write(image)
except IOError:
    print("IO Error\n")

code = input("Captcha: ")
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "txtUserName": studentnumber,
    "TextBox2": password,
    "txtSecretCode": code,
    "Button1": "",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36",
}
response = s.post(url, data=data, headers=headers)


def getInfor(response, xpath):
    # The page source is GB2312-encoded, so decode it before parsing.
    content = response.content.decode("gb2312")
    selector = etree.HTML(content)
    infor = selector.xpath(xpath)[0]
    return infor


text = getInfor(response, '//*[@id="xhxm"]/text()')
text = text.replace(" ", "")
print("Hello ", text)
# text[:-2] drops the last two characters, the suffix that follows the name on the page.
kburl = "http://jw3.edu.cn/xskbcx.aspx?xh=" + studentnumber + "&xm=" + text[:-2] + "&gnmkdm=N121603"
print(kburl)
headers = {
    # The Referer must carry the logged-in student's own number.
    "Referer": "http://jw3.edu.cn/xs_main.aspx?xh=" + studentnumber,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36",
}
response = s.get(kburl, headers=headers)
html = response.content.decode("gb2312")
print(html)
selector = etree.HTML(html)
# Print every cell of the timetable table, one per line.
content = selector.xpath('//*[@id="Table1"]/tr/td/text()')
for each in content:
    print(each)
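The script prints every timetable cell on its own line. If you want something Excel can open, one possible follow-up is to group the cells by row and write them out as CSV. The sketch below is not part of the original script: it swaps lxml for the standard-library `html.parser` so that it runs without extra packages, the names `TableRows` and `timetable_to_csv` are made up, and `SAMPLE` is invented data rather than real server output.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> in the table with the given id, row by row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")
        elif self.in_cell and tag == "br":
            # A line break inside a cell becomes a single space.
            self.rows[-1][-1] += " "

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def timetable_to_csv(html, table_id="Table1"):
    """Return (rows, csv_text) for the table; nested tables are not handled."""
    parser = TableRows(table_id)
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return parser.rows, buffer.getvalue()


# Made-up sample, shaped like a tiny timetable.
SAMPLE = (
    '<table id="Table1">'
    "<tr><td>Period</td><td>Mon</td><td>Tue</td></tr>"
    "<tr><td>1</td><td>Calculus<br>Room 101</td><td>&nbsp;</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    rows, text = timetable_to_csv(SAMPLE)
    print(text)
```

With the real page you would pass the decoded `html` string from the script instead of `SAMPLE` and write the returned text to a `.csv` file.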
Result:
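I have recently wanted to write code that scrapes the timetable from my school's academic affairs system, but even after finishing the Beijing Institute of Technology crawler course on MOOC I still cannot quite write it myself. If possible, could you give me some pointers?
First find out which system your academic affairs office uses, and work out whether it is a static site or one rendered with JavaScript. Then study crawler packages such as urllib, beautifulSoup4 and Scrapy. Start with fetching data, and move on to submitting data after that.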
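To make the "fetch first, then submit" order concrete, here is a rough sketch using only urllib from the standard library. It only builds the two request objects and does not send anything; the helper names `build_fetch` and `build_submit` and the form values are placeholders for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

USER_AGENT = {"User-Agent": "Mozilla/5.0"}


def build_fetch(url):
    """Fetching data: a request without a body is sent as GET."""
    return Request(url, headers=USER_AGENT)


def build_submit(url, form):
    """Submitting data: a request with a form-encoded body is sent as POST."""
    body = urlencode(form).encode("ascii")
    return Request(url, data=body, headers=USER_AGENT)


if __name__ == "__main__":
    login_url = "http://jw3.edu.cn/default2.aspx"
    fetch = build_fetch(login_url)
    submit = build_submit(login_url, {"txtUserName": "demo", "TextBox2": "secret"})
    print(fetch.get_method(), submit.get_method(), submit.data)
    # To actually send either one: urllib.request.urlopen(fetch)
```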