Yesterday something very frustrating happened: the database behind my personal site, web 前端中文站, failed, and nearly 100 articles were lost.
The site's database is only backed up about once a month, so all of last month's articles, everything from September, are currently gone.

From what I know about search engines, Google's page cache still held copies of some of them, so I fetched those snapshots over https to recover what I could. That brings us to the focus of this article: how to use HttpsClient to fetch the content of https pages.

Common crawler frameworks such as jsoup do not handle https particularly well, so I rely on an HttpsClient utility class here.

Note: if you run my example against an https URL and hit "unable to find valid certification path to requested target" or a "peer not authenticated" exception, the likely cause is that you are on JDK 1.6; try JDK 1.7, and if the error persists, re-wrap the HttpClient used for fetching as shown below.

Now let's get into the code.
```java
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import org.apache.http.client.HttpClient;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;

// web 前端中文站:www.lisa33xiaoq.net
public class HttpsClient {

    /**
     * Wraps an existing HttpClient so that https connections trust all
     * certificates and skip hostname verification.
     */
    public static DefaultHttpClient getNewHttpsClient(HttpClient httpClient) {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            // Trust manager that accepts any certificate chain
            X509TrustManager tm = new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                public void checkClientTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
                }
                public void checkServerTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
                }
            };
            ctx.init(null, new TrustManager[] { tm }, null);
            // Register an https scheme on port 443 backed by the permissive socket factory
            SSLSocketFactory ssf = new SSLSocketFactory(ctx, SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
            SchemeRegistry registry = new SchemeRegistry();
            registry.register(new Scheme("https", 443, ssf));
            ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager(registry);
            return new DefaultHttpClient(mgr, httpClient.getParams());
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }
}
```
Before fetching, wrap the client again to get an https-capable instance (`httpClient = HttpsClient.getNewHttpsClient(httpClient);`). The wrapper gets around the certificate errors mentioned above by trusting every certificate and skipping hostname verification.
```java
import java.io.IOException;

import org.apache.commons.httpclient.HttpStatus;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class Test {
    // web 前端中文站:www.lisa33xiaoq.net

    public static void main(String[] args) {
        String url = "https://baidu.com";
        String html = getPageHtml(url);
        System.out.println(html);
    }

    /**
     * Fetch the HTML of the given page.
     */
    public static String getPageHtml(String currentUrl) {
        HttpClient httpClient = new DefaultHttpClient();
        // Re-wrap the client so https pages can be fetched without certificate errors
        httpClient = HttpsClient.getNewHttpsClient(httpClient);
        String html = "";
        HttpGet request = new HttpGet(currentUrl);
        HttpResponse response = null;
        try {
            response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity mEntity = response.getEntity();
                html = EntityUtils.toString(mEntity);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}
```
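To tie this back to the recovery scenario at the start of the post, here is a minimal sketch of how the helper can be pointed at a Google cached snapshot and the result saved to disk. The webcache.googleusercontent.com URL pattern, the lost article URL, and the output file name are my own illustrative assumptions and are not part of the original code.

```java
import java.io.FileWriter;
import java.io.IOException;

public class RecoverFromCache {

    public static void main(String[] args) throws IOException {
        // A lost article URL (hypothetical example)
        String lostUrl = "www.lisa33xiaoq.net/some-lost-article";
        // Google's cached copy of that page (URL pattern assumed for illustration)
        String cacheUrl = "https://webcache.googleusercontent.com/search?q=cache:" + lostUrl;

        // Fetch the cached snapshot using the https-capable client shown above
        String html = Test.getPageHtml(cacheUrl);

        // Save the snapshot locally so the article text can be restored from it later
        try (FileWriter writer = new FileWriter("snapshot.html")) {
            writer.write(html);
        }
    }
}
```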
Jars used:
- commons-httpclient-3.1.jar
- commons-logging.jar
- httpclient-4.2.5.jar
- httpcore-4.2.4.jar
The code above was tested and works on JDK 1.7.

The source code and jars can be downloaded here; import them into Eclipse and they will run.
[Note: this article draws on material from around the web, compiled and published by the site owner.]