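1. I saved a web page as an HTML file and then parsed it with BeautifulSoup.
It turns out that calling it through my wrapper function buildSoupFromStr(content) raises an error,
while calling BeautifulSoup(content,fromEncoding="GBK") directly does not.
2. Also, a question for the experts here: for extracting the text portion of a web page, is there a better approach than regular expressions or BeautifulSoup?
BeautifulSoup does not feel very convenient either.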
The error is as follows:
Traceback (most recent call last):
  File "C:\Program Files\Python27\code\hanhan.py", line 29, in <module>
    buildSoupFromStr(content)
  File "C:\Program Files\Python27\code\hanhan.py", line 20, in buildSoupFromStr
    soup = BeautifulSoup(content,fromEncoding)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Program Files\Python27\lib\sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1463, in parse_declaration
    j = SGMLParser.parse_declaration(self, i)
  File "C:\Program Files\Python27\lib\markupbase.py", line 109, in parse_declaration
    self.handle_decl(data)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1448, in handle_decl
    self._toStringSubclass(data, Declaration)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1379, in _toStringSubclass
    self.endData()
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1251, in endData
    (not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
The full code is as follows:
# -*- coding: cp936 -*-
from sys import *
from BeautifulSoup import *

def getContent(filename):
    try:
        file_object = open(filename, 'r')
    except IOError:
        print 'Can not find file'
        return -1
    try:
        content = file_object.read( )
    finally:
        file_object.close( )
    return content

def buildSoupFromStr(content,fromEncoding="GBK"):
    print type(content)
    # fromEncoding is passed positionally here (line 20 in the traceback)
    soup = BeautifulSoup(content,fromEncoding)
    #return soup

if __name__ == '__main__':
    content = getContent('han.html')
    #print content
    if -1 == content:
        print 'error happen'
    buildSoupFromStr(content)
    #BeautifulSoup(content,fromEncoding="GBK")
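On the error in question 1: the last frame of the traceback fails on self.parseOnlyThese.text with a str object, which suggests the "GBK" string ended up in parseOnlyThese instead of fromEncoding. In BeautifulSoup 3 the second positional parameter of the constructor is parseOnlyThese, so BeautifulSoup(content,fromEncoding) binds the encoding to the wrong parameter, whereas the direct call passes it by keyword. A minimal sketch of that binding follows; fake_soup is a hypothetical stand-in that only mirrors the first three parameter names and is not the library itself.

```python
def fake_soup(markup="", parseOnlyThese=None, fromEncoding=None):
    """Hypothetical stand-in for the constructor: it parses nothing and
    only reports which parameter each argument was bound to."""
    return {"parseOnlyThese": parseOnlyThese, "fromEncoding": fromEncoding}


# Same shape as the call inside buildSoupFromStr: encoding passed positionally.
positional = fake_soup("<html></html>", "GBK")

# Same shape as the direct call that works: encoding passed by keyword.
keyword = fake_soup("<html></html>", fromEncoding="GBK")

print(positional)  # "GBK" is bound to parseOnlyThese
print(keyword)     # "GBK" is bound to fromEncoding
```

If that reading is right, changing the call in the wrapper to BeautifulSoup(content, fromEncoding=fromEncoding) should make it behave like the direct call.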
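On question 2, one option that needs no third-party package is the event-based HTML parser in the standard library. The sketch below is only an illustration under some assumptions: TextExtractor and extract_text are hypothetical names, the input is expected to be already decoded text (for example content.decode("gbk") for the file in this post), and the output is a rough list of text fragments rather than a faithful rendering of the page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as used in this post


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = ("script", "style")

    def __init__(self):
        HTMLParser.__init__(self)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.chunks)


sample = ("<html><head><style>p{}</style></head><body><p>Hello</p>"
          "<script>var x=1;</script><p>World</p></body></html>")
print(extract_text(sample))
```

This trades convenience for control: there is no tree to search, so picking out one particular part of the page still needs extra state in the handler methods.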