I'm trying to get some data from the website http://www.mof.gov.cn/ with Jsoup, but I keep getting an EOFException.

The connect call is very basic:

Jsoup.connect("http://www.mof.gov.cn/").ignoreContentType(true).userAgent("Mozilla/5.0(Windows NT 6.1; rv:23.0) Gecko/20100101 Firefox/23.0").timeout(30000).get();

The stack trace is:

java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
at java.util.zip.GZIPInputStream.readUInt(GZIPInputStream.java:189)
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:179)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:94)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.FilterInputStream.read(FilterInputStream.java:90)
at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:124)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:464)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.staging.Crawfinance.main(Crawfinance.java:30)

Any idea why? Many thanks.
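
One reading of the trace above: the exception is thrown from GZIPInputStream.readTrailer, which suggests the compressed response ended before the 8-byte gzip trailer (CRC32 plus length) was received. The sketch below tries to reproduce that condition locally with the standard library only. The class and method names are hypothetical, and cutting off exactly the trailer is an assumption made for illustration, not something confirmed about the server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Jsoup or of the original program.
class TruncatedGzipDemo {

    // Compress a string into a complete gzip stream (header, deflate data, trailer).
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Assumption for illustration: simulate a cut-off response by dropping
    // the last 8 bytes, which hold the CRC32 and the uncompressed length.
    static byte[] dropTrailer(byte[] gzipped) {
        return Arrays.copyOf(gzipped, gzipped.length - 8);
    }

    // Read the stream to the end and report whether an EOFException occurred.
    static boolean failsWithEof(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] chunk = new byte[1024];
            while (in.read(chunk) != -1) {
                // Discard the decompressed bytes; only the outcome matters here.
            }
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] complete = gzip("hello from a gzip stream");
        System.out.println("complete stream fails:  " + failsWithEof(complete));
        System.out.println("truncated stream fails: " + failsWithEof(dropTrailer(complete)));
    }
}
```

If the truncated case fails the same way as the trace, the problem is more likely in the compressed transfer than in the parsing code.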

===========================================================================================

Please try this with a valid charset name:

Document doc = Jsoup.parse(new URL("http://www.mof.gov.cn/").openStream(), "UTF-8", "http://www.mof.gov.cn/");


Works great! Thanks! I just changed the charset to GBK, and it works fine.
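
The charset name is the second argument of the Jsoup.parse call suggested above, and it tells the parser how to turn the raw bytes of the page into text. The sketch below shows, with the standard library only, why the name has to match the encoding the bytes were written in. The sample string and the class and method names are assumptions for illustration; nothing here is fetched from the site, and it assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of Jsoup or of the original program.
class CharsetDecodeDemo {

    // Sample text written with Unicode escapes so the source file encoding does not matter.
    static final String SAMPLE = "\u8d22\u653f\u90e8";

    // Bytes as a GBK-encoded page would deliver them.
    static byte[] sampleGbkBytes() {
        return SAMPLE.getBytes(Charset.forName("GBK"));
    }

    // Decode raw bytes with the named charset, as a parser must do before reading HTML.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = sampleGbkBytes();
        System.out.println("decoded as GBK matches:   " + decode(raw, "GBK").equals(SAMPLE));
        System.out.println("decoded as UTF-8 matches: " + decode(raw, "UTF-8").equals(SAMPLE));
    }
}
```

Decoding with the wrong charset does not throw here; it silently produces replacement characters, which is why garbled text on a page is often the first sign that the charset argument needs changing.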








