To Avoid Copy-Pasting, I Was Forced to Learn Java Web Scraping
I've been working remotely from home because of the pandemic. Company business is moving slowly and, honestly, there isn't that much work: eat, sleep, browse tech communities, write blog posts, slacking off in comfort. This morning I was about to go back for another nap when a voice message from my department manager came in.
He threw me a link, http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/, and told me to pull out every province and city name in the country along with their region codes and build a dictionary table. Deadline: by noon.

Analyzing the requirement
We need the names of every province and city nationwide, stored in a dictionary table. The table design is the easy part, but how do we get the city data?
There are two options:
Grind through it with copy and paste; it's only a few hundred entries after all.
Write a scraper and be done with it once and for all.
As a programmer, though, there is nothing that can't be solved with a program. I use Ctrl+C / Ctrl+V plenty at work, but brainless copy-pasting like this would be embarrassing.
Time to build the scraper
Since this requirement only needs city names, I chose Jsoup as the scraping tool. Jsoup is a Java HTML parser that can parse a URL or an HTML string directly. It provides a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
Jsoup pulls text out of HTML tags such as <body>, <td>, and <tr>, so the first step is to analyze the structure of the target page. Opening the F12 developer tools shows that the data we want sits in the fifth <tbody> tag, inside the <tr> tags whose class attribute is provincetr.

The page structure around the province names looks like this:
<tr class="provincetr">
    <td>
        <a href="11.html">北京市<br></a>
    </td>
    <td>
        <a href="12.html">天津市<br></a>
    </td>
    .........
</tr>
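The same extraction can also be written with Jsoup's CSS selector API instead of tag-by-tag traversal. Here is a minimal, self-contained sketch that parses the snippet above from an inline string (so it runs without network access); the class name ProvinceSelectorSketch is mine, not from the article:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class ProvinceSelectorSketch {

    // Extract province name -> relative href from provincetr rows.
    public static Map<String, String> extractProvinces(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> provinces = new LinkedHashMap<>();
        // "tr.provincetr td a" selects every <a> inside a <td> of a provincetr row
        for (Element a : doc.select("tr.provincetr td a")) {
            provinces.put(a.text(), a.attr("href"));
        }
        return provinces;
    }

    public static void main(String[] args) {
        String html = "<table><tr class=\"provincetr\">"
                + "<td><a href=\"11.html\">北京市</a></td>"
                + "<td><a href=\"12.html\">天津市</a></td>"
                + "</tr></table>";
        System.out.println(extractProvinces(html));
        // {北京市=11.html, 天津市=12.html}
    }
}
```

In a real crawl you would replace Jsoup.parse(html) with Jsoup.connect(url).get(), as the article's code does.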
From there we just read the attributes of the <a> tag inside each <td>. That gives us the province names; next, where are the cities for each province? The href="11.html" attribute is exactly the URL of that province's city page:
http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/11.html
Finding the city names follows the same analysis as above, so I won't repeat it. With the data located, let's move on to the actual crawl.

Scraper implementation
1. Add the Jsoup dependency
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>
2. Write the code
The implementation is simple, just two methods. There's nothing hard about it; the main thing is to be careful, keep the nesting of the page tags straight, and follow the analysis above step by step.
/**
 * @author xin
 * @description Parse the province names
 * @date 2019/11/4 19:24
 */
public static void parseProvinceName(Map<String, Map<String, String>> map, String url) throws IOException {
    // Fetch and parse the page
    Document doc = Jsoup.connect(url).get();
    // All <tbody> tags on the page
    Elements elements = doc.getElementsByTag("tbody");
    // The target data lives in the fifth <tbody>
    Element tbody = elements.get(4);
    // Base URL of the current page, used to build absolute links
    String baseUri = tbody.baseUri();
    for (Element child : tbody.children()) {
        for (Element provincetr : child.getElementsByClass("provincetr")) {
            for (Element td : provincetr.getElementsByTag("td")) {
                String provinceName = td.getElementsByTag("a").text();
                String href = td.getElementsByTag("a").attr("href");
                System.out.println(provinceName + " " + baseUri + "/" + href);
                map.put(provinceName, null);
                // Build the city page URL and crawl the city names from it
                parseCityName(map, baseUri + "/" + href, provinceName);
            }
        }
    }
}
One thing to watch when scraping the city names: for a municipality directly under the central government, the province name and the city name are the same.
/**
 * @author xin
 * @description Parse the city names
 * @date 2019/11/4 19:26
 */
public static void parseCityName(Map<String, Map<String, String>> map, String url, String provinceName) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Same layout as the province page: the data is in the fifth <tbody>
    Element tbody = doc.getElementsByTag("tbody").get(4);
    Map<String, String> cityMap = new HashMap<>();
    for (Element child : tbody.children()) {
        for (Element citytr : child.getElementsByClass("citytr")) {
            Elements tds = citytr.getElementsByTag("td");
            String cityName = tds.get(1).getElementsByTag("a").text();
            // For a municipality the page shows "市轄區" (municipal districts);
            // use the province name itself as the city name
            if (cityName.equals("市轄區")) {
                cityName = provinceName;
            }
            String href = tds.get(1).getElementsByTag("a").attr("href");
            System.out.println(cityName + " " + href);
            cityMap.put(cityName, href);
        }
    }
    map.put(provinceName, cityMap);
}
public class test2 {
    public static void main(String[] args) throws IOException {
        Map<String, Map<String, String>> map = new HashMap<>();
        parseProvinceName(map, "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018");
        System.out.println(JSON.toJSONString(map));
    }
}
3. Output the JSON string
For now we only need the province and city names, so the crawl has no real depth. If you also need district and county data, you can keep crawling downward from the URL attached to each city, e.g. 35/3508.html.
{
"福建省": {
"龍巖市": "35/3508.html",
"南平市": "35/3507.html",
"莆田市": "35/3503.html",
"福州市": "35/3501.html",
"泉州市": "35/3505.html",
"漳州市": "35/3506.html",
"廈門市": "35/3502.html",
"三明市": "35/3504.html",
"寧德市": "35/3509.html"
},
"西藏自治區": {
"拉薩市": "54/5401.html",
"昌都市": "54/5403.html",
"日喀則市": "54/5402.html",
"那曲市": "54/5406.html",
"林芝市": "54/5404.html",
"山南市": "54/5405.html",
"阿里地區": "54/5425.html"
},
"貴州省": {
"貴陽市": "52/5201.html",
"畢節(jié)市": "52/5205.html",
"銅仁市": "52/5206.html",
"六盤水市": "52/5202.html",
"遵義市": "52/5203.html",
"黔西南布依族苗族自治州": "52/5223.html",
"安順市": "52/5204.html",
"黔東南苗族侗族自治州": "52/5226.html",
"黔南布依族苗族自治州": "52/5227.html"
},
"上海市": {
"上海市": "31/3101.html"
},
"湖北省": {
"黃岡市": "42/4211.html",
"孝感市": "42/4209.html",
"恩施土家族苗族自治州": "42/4228.html",
"省直轄縣級行政區劃": "42/4290.html",
"襄陽市": "42/4206.html",
"鄂州市": "42/4207.html",
"十堰市": "42/4203.html",
"咸寧市": "42/4212.html",
"黃石市": "42/4202.html",
"荊州市": "42/4210.html",
"隨州市": "42/4213.html",
"宜昌市": "42/4205.html",
"武漢市": "42/4201.html",
"荊門市": "42/4208.html"
},
"湖南省": {
"湘潭市": "43/4303.html",
"衡陽市": "43/4304.html",
"張家界市": "43/4308.html",
"益陽市": "43/4309.html",
"岳陽市": "43/4306.html",
"婁底市": "43/4313.html",
"株洲市": "43/4302.html",
"常德市": "43/4307.html",
"湘西土家族苗族自治州": "43/4331.html",
"郴州市": "43/4310.html",
"邵陽市": "43/4305.html",
"長沙市": "43/4301.html",
"永州市": "43/4311.html",
"懷化市": "43/4312.html"
},
"廣東省": {
"河源市": "44/4416.html",
"韶關市": "44/4402.html",
"茂名市": "44/4409.html",
"汕頭市": "44/4405.html",
"清遠市": "44/4418.html",
"深圳市": "44/4403.html",
"珠海市": "44/4404.html",
"廣州市": "44/4401.html",
"肇慶市": "44/4412.html",
"中山市": "44/4420.html",
"江門市": "44/4407.html",
"云浮市": "44/4453.html",
"惠州市": "44/4413.html",
"湛江市": "44/4408.html",
"東莞市": "44/4419.html",
"揭陽市": "44/4452.html",
"陽江市": "44/4417.html",
"佛山市": "44/4406.html",
"汕尾市": "44/4415.html",
"潮州市": "44/4451.html",
"梅州市": "44/4414.html"
}
.......
}
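If you do go a level deeper, note that hrefs like 35/3508.html are relative to the 2018/ directory, so they need resolving against it before the next fetch. java.net.URI handles this; the base URL below is the one from the article, and the trailing slash matters:

```java
import java.net.URI;

public class UrlResolveSketch {
    public static void main(String[] args) {
        // The base must end with "/" so relative hrefs resolve inside the directory
        URI base = URI.create("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/");
        String countyPage = base.resolve("35/3508.html").toString();
        System.out.println(countyPage);
        // http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/35/3508.html
    }
}
```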
Jsoup's page scraping depends heavily on the network. On my 500-yuan-a-year 50M Founder broadband, it lagged unbearably; on average only one run in three succeeded. Life is hard!
Exception in thread "main" java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:443)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:465)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at com.xinzf.project.jsoup.test2.parseProvinceName(test2.java:32)
at com.xinzf.project.jsoup.test2.main(test2.java:17)
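Two mitigations would have helped here: raising the read timeout via Jsoup's Connection.timeout(int millis) method, e.g. Jsoup.connect(url).timeout(10_000).get(), and retrying failed fetches. A generic retry helper, independent of Jsoup, might look like the sketch below; the class name and the attempt count are my own choices:

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    // Run a flaky call up to maxAttempts times, rethrowing the last failure.
    public static <T> T withRetry(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky fetch: fails twice, then succeeds on the third attempt
        int[] calls = {0};
        String page = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("Read timed out");
            }
            return "<html>ok</html>";
        }, 5);
        System.out.println(page + " after " + calls[0] + " attempts");
        // <html>ok</html> after 3 attempts
    }
}
```

Wrapping each Jsoup.connect(...).get() call this way would have turned my one-in-three success rate into a mostly reliable crawl.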
Summary
From analyzing the page to writing the code probably took longer than plain copy-pasting would have, but I still chose to solve the problem with a program. Not because I'm diligent; quite the opposite, it's because I'm lazy. Think about it, think it over carefully!