Python Web Crawler Learning Notes (1) Crawling Simple Static Web Pages

I. Using urllib3 to Implement HTTP Requests#

1. Generating Requests#

  • Generate a request with the request method; its prototype is as follows

urllib3.request(method,url,fields=None,headers=None,**urlopen_kw)

Parameter | Description
--- | ---
method | Accepts str. Indicates the type of request, such as "GET" (most common), "HEAD", "DELETE", etc. No default value.
url | Accepts str. Indicates the request URL as a string. No default value.
fields | Accepts dict. Indicates the parameters carried by the request. Defaults to None.
headers | Accepts dict. Indicates the parameters carried by the request headers. Defaults to None.
**urlopen_kw | Accepts dict and other Python data types. Extra keyword arguments that can be added according to the specific needs and request type, usually a dict or specific data.

Code:
import urllib3
http = urllib3.PoolManager()
rq = http.request('GET',url='http://www.pythonscraping.com/pages/page3.html')
print('Server response code:', rq.status)
print('Response body:', rq.data)

2. Handling Request Headers#

The headers parameter can be passed as a dictionary. Define a dictionary containing User-Agent information, for example one declaring a browser such as Firefox or Chrome and the operating system "Windows NT 6.1; Win64; x64", then send a GET request carrying headers to the target page, passing the defined User-Agent dictionary as the headers parameter.

import urllib3
http = urllib3.PoolManager()
head = {'User-Agent':'Windows NT 6.1;Win64; x64'}
http.request('GET',url='http://www.pythonscraping.com/pages/page3.html',headers=head)

3. Timeout Settings#

To avoid a request hanging indefinitely when the network is unstable, you can add a timeout parameter, usually a floating-point number of seconds. You can set a single timeout for the whole request directly after the URL, or set the connection and read timeouts separately. Setting the timeout parameter on the PoolManager instance applies it to all requests made through that instance.

Set directly on a single request

http.request('GET',url='http://www.pythonscraping.com/pages/page3.html',headers=head,timeout=3.0)
# Timeout if exceeds 3s
http.request('GET',url='http://www.pythonscraping.com/pages/page3.html',headers=head,timeout=urllib3.Timeout(connect=1.0,read=2.0))
# Timeout if connection exceeds 1s, reading exceeds 2s

Apply to all requests of this instance

import urllib3
http = urllib3.PoolManager(timeout=4.0)
head = {'User-Agent':'Windows NT 6.1;Win64; x64'}
http.request('GET',url='http://www.pythonscraping.com/pages/page3.html',headers=head)
# Timeout if exceeds 4s

4. Request Retry Settings#

The urllib3 library controls retries through the retries parameter. By default, a request is retried 3 times and up to 3 redirects are followed. You can customize the number of retries by assigning an integer to the retries parameter, or define a urllib3.Retry instance to customize the retry and redirect counts separately. To disable both retries and redirects, assign False to the retries parameter; to disable only redirects, assign False to the redirect parameter. As with the timeout settings, you can set the retries parameter on the PoolManager instance to apply the retry strategy to all requests made through that instance.
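
Before the instance-level example below, here is a minimal sketch of the per-request options just described; it reuses the sample page URL from the earlier examples.

import urllib3
http = urllib3.PoolManager()
url = 'http://www.pythonscraping.com/pages/page3.html'
# Disable both retries and redirects for this request
http.request('GET', url, retries=False)
# Keep retries but disable only redirects
http.request('GET', url, redirect=False)
# Customize retries and redirects with a Retry instance
http.request('GET', url, retries=urllib3.Retry(total=5, redirect=2))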

Apply to all requests of this instance

import urllib3
http = urllib3.PoolManager(timeout=4.0,retries=10)
head = {'User-Agent':'Windows NT 6.1;Win64; x64'}
http.request('GET',url='http://www.pythonscraping.com/pages/page3.html',headers=head)
# Timeout if exceeds 4s, retry 10 times

5. Generating Complete HTTP Requests#

Use the urllib3 library to generate a complete request to http://www.pythonscraping.com/pages/page3.html. This request should include the link, request headers, timeout settings, and retry settings.
Note that the response body is decoded with utf-8.

import urllib3
# Send request instance
http = urllib3.PoolManager()
# URL
url = 'http://www.pythonscraping.com/pages/page3.html'
# Request headers
head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'}
# Timeout
tm = urllib3.Timeout(connect=1.0,read=3.0)
# Retry and redirect settings, generate request
rq = http.request('GET',url=url,headers=head,timeout=tm,redirect=4)
print('Server response code:', rq.status)
print('Response body:', rq.data.decode('utf-8'))
Output:

Server response code: 200
Response body: <html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br>
123 Main St.<br>
Abuja, Nigeria
</br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr id="gift1" class="gift"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</td></tr>

<tr id="gift2" class="gift"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</td></tr>

<tr id="gift3" class="gift"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg">
</td></tr>

<tr id="gift4" class="gift"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg">
</td></tr>

<tr id="gift5" class="gift"><td>
Mystery Box
</td><td>
If you love surprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg">
</td></tr>
</table>
</p>
<div id="footer">
&copy; Totally Normal Gifts, Inc. <br>
+234 (617) 863-0736
</div>

</div>
</body>
</html>

II. Using the requests Library to Implement HTTP Requests#

import requests
url = 'http://www.pythonscraping.com/pages/page3.html'
rq2 = requests.get(url)
rq2.encoding = 'utf-8'
print('Response code:',rq2.status_code)
print('Encoding:',rq2.encoding)
print('Request headers:',rq2.headers)
print('Body:',rq2.text)

Solving Character Encoding Issues#

The requests library guesses the response encoding from the HTTP headers; when it guesses incorrectly, you need to specify the encoding manually to avoid garbled text in the returned page. Manually specifying the encoding is inflexible and cannot adapt to the different encodings encountered while scraping, whereas using the chardet library is more convenient and flexible. The chardet library is an excellent string/file encoding detection module.
The chardet library uses the detect method to detect the encoding of a given string, with common parameters and their descriptions as follows:

Parameter | Description
--- | ---
byte_str | Accepts bytes. Indicates the byte string whose encoding needs to be detected. No default value.

import chardet
chardet.detect(rq2.content)

Output: 100% probability of being encoded in ASCII
Complete code

import requests
import chardet
url = 'http://www.pythonscraping.com/pages/page3.html'
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.56'}
rq2 = requests.get(url,headers=head,timeout=2.0)
rq2.encoding = chardet.detect(rq2.content)['encoding']
print('Body:',rq2.text)

III. Parsing Web Pages#

The functions of the Chrome Developer Tools panels most relevant to web scraping are described below.

1. Elements Panel#

In web scraping development, the elements panel is mainly used to view the corresponding positions of page elements, such as the location of images or text links. The left side of the panel shows the structure of the current page as a tree structure, and you can click the triangle symbol to expand branches.

2. Source Code Panel#

Switch to the source code panel (Sources), click the "tipdm" folder on the left, and select the "index.html" file; its complete code is then displayed in the middle pane.

3. Network Panel#

Switch to the network panel (Network); the page needs to be reloaded first. Click a resource, and the middle pane will display that resource's header information, preview, response, cookies, and timing details.

IV. Using Regular Expressions to Parse Web Pages#

1. Python Regular Expressions: Finding Names and Phone Numbers in Strings#

Regular expressions are a tool for pattern matching and replacement: the user constructs a matching pattern from a series of special characters, compares it against the target string or file, and then acts according to whether the target contains the pattern.

rawdata = '555-1239Moe Szyslak(636) 555-0113Burns, C.Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson,Homer5553642Dr. Julius Hibbert'

Try it

import re
string = '1. A small sentence - 2.Anthoer tiny sentence. '
print('re.findall:',re.findall('sentence',string))
print('re.search:',re.search('sentence',string))
print('re.match:',re.match('sentence',string))
print('re.match:',re.match('1. A small sentence',string))
print('re.sub:',re.sub('small','large',string)) 
print('re.sub:',re.sub('small','',string)) 

Output:
re.findall: ['sentence', 'sentence']
re.search: <re.Match object; span=(11, 19), match='sentence'>
re.match: None
re.match: <re.Match object; span=(0, 19), match='1. A small sentence'>
re.sub: 1. A large sentence - 2.Anthoer tiny sentence.
re.sub: 1. A sentence - 2.Anthoer tiny sentence.

Common generalized symbols

  1. Period ".": can represent any character except the newline character "\n";
string = '1. A small sentence - 2.Anthoer tiny sentence. '
re.findall('A.',string)

Output: ['A ', 'An']

  2. Character class "[]": Any character contained within square brackets will be matched;
string = 'small smell smll smsmll sm3ll sm.ll sm?ll sm\nll sm\tll'
print('re.findall:',re.findall('sm.ll',string))
print('re.findall:',re.findall('sm[asdfg]ll',string))
print('re.findall:',re.findall('sm[a-zA-Z0-9]ll',string))
print('re.findall:',re.findall('sm\.ll',string))
print('re.findall:',re.findall('sm[.?]ll',string))

Output:

re.findall: ['small', 'smell', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small']
re.findall: ['small', 'smell', 'sm3ll']
re.findall: ['sm.ll']
re.findall: ['sm.ll', 'sm?ll']

  3. Quantifier "{}": Indicates how many times the preceding element can be matched
print('re.findall:',re.findall('sm..ll',string))
print('re.findall:',re.findall('sm.{2}ll',string))
print('re.findall:',re.findall('sm.{1,2}ll',string))
print('re.findall:',re.findall('sm.{1,}ll',string))
print('re.findall:',re.findall('sm.?ll',string)) # ? is equivalent to {0,1}
print('re.findall:',re.findall('sm.+ll',string)) # + is equivalent to {1,}
print('re.findall:',re.findall('sm.*ll',string)) # * is equivalent to {0,}

Output:
re.findall: ['smsmll']
re.findall: ['smsmll']
re.findall: ['small', 'smell', 'smsmll', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']
re.findall: ['small', 'smell', 'smll', 'smll', 'sm3ll', 'sm.ll', 'sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']
re.findall: ['small smell smll smsmll sm3ll sm.ll sm?ll', 'sm\tll']

PS: quantifiers are greedy by default and match as much as possible.

Complete Code#

import re
import pandas as pd
rawdata = '555-1239Moe Szyslak(636) 555-0113Burns, C.Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson,Homer5553642Dr. Julius Hibbert'
names = re.findall('[A-Z][A-Za-z,. ]*',rawdata)
print(names)
number = re.findall(r'\(?[0-9]{0,3}\)?[ \-]?[0-9]{3}[ \-]?[0-9]{4}',rawdata)
print(number)
pd.DataFrame({'Name':names,'TelPhone':number})

Output: a DataFrame pairing each extracted name with the corresponding phone number.

V. Using Xpath to Parse Web Pages#

XML Path Language (XPath) is an XML-based language for finding nodes in the tree structure of a document, i.e. for determining the location of some part of an XML document. Using XPath requires importing the etree module from the lxml library, and the HTML class must be used to initialize the HTML object to be matched (XPath can only process the DOM representation of a document). The basic syntax of the HTML class is as follows.

1. Basic Syntax#

lxml.etree.HTML(text, parser=None, *, base_url=None)

Parameter | Description
--- | ---
text | Accepts str. Indicates the string to be converted into an HTML document. No default value.
parser | Accepts str. Indicates the selected HTML parser. No default value.
base_url | Accepts str. Indicates the original URL of the document, used for resolving relative paths of external entities. Defaults to None.

If the nodes in the HTML are not closed, the etree module also provides an auto-completion feature. Calling the tostring method will output the corrected HTML code, but the result is of bytes type and needs to be converted to str type using the decode method.
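
A minimal sketch of this behaviour (the unclosed-tag fragment below is made up purely for illustration):

from lxml import etree
# A fragment with unclosed <p> tags; etree.HTML completes the markup while parsing
html = etree.HTML('<div id="content"><p>first<p>second')
# tostring returns bytes, so decode it to obtain a str
print(etree.tostring(html).decode('utf-8'))
# roughly: <html><body><div id="content"><p>first</p><p>second</p></div></body></html>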

XPath uses path expressions to match content in HTML documents, with the commonly used expressions as follows.

Expression | Description
--- | ---
nodename | Selects all child nodes of the nodename node
/ | Selects direct child nodes from the current node
// | Selects descendant nodes from the current node
. | Selects the current node
.. | Selects the parent node of the current node
@ | Selects attributes
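
A small sketch applying these expressions to the sample page fetched earlier; the expected results (shown in the comments) follow from the page source listed above.

import requests
from lxml import etree
url = 'http://www.pythonscraping.com/pages/page3.html'
html = etree.HTML(requests.get(url).content)
# '/' selects direct children: the h1 directly under the wrapper div
print(html.xpath('/html/body/div/h1/text()'))   # ['Totally Normal Gifts']
# '//' selects descendants anywhere; '@' selects attributes
print(html.xpath('//tr/@id'))                   # ['gift1', 'gift2', 'gift3', 'gift4', 'gift5']
# '..' moves to the parent of the current node
print(html.xpath('//h1/../@id'))                # ['wrapper']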

2. Predicates#

Predicates in XPath are used to find a specific node or a node containing a specified value, and predicates are embedded in square brackets after the path, as follows.

Expression | Description
--- | ---
/html/body/div[1] | Selects the first div child of body
/html/body/div[last()] | Selects the last div child of body
/html/body/div[last()-1] | Selects the second-to-last div child of body
/html/body/div[position()<3] | Selects the first two div children of body
/html/body/div[@id] | Selects the div children of body that have an id attribute
/html/body/div[@id="content"] | Selects the div children of body whose id attribute equals content
/html/body/div[xx>10.00] | Selects the div children of body whose xx element value is greater than 10.00
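
A brief sketch of a few of these predicates on a made-up snippet (the div ids and values below are purely illustrative):

from lxml import etree
doc = etree.HTML('''
<body>
  <div id="header">top</div>
  <div id="content"><span>10.50</span></div>
  <div id="footer">bottom</div>
</body>''')
print(doc.xpath('/html/body/div[1]/text()'))                    # ['top']
print(doc.xpath('/html/body/div[last()]/text()'))               # ['bottom']
print(doc.xpath('/html/body/div[position()<3]/@id'))            # ['header', 'content']
print(doc.xpath('/html/body/div[@id="content"]/span/text()'))   # ['10.50']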

3. Fuzzy Matching Functions#

XPath also provides some utility functions for fuzzy matching. Sometimes you only know part of the characteristics of the target object; when you need to search for such objects fuzzily, you can use these functions, as follows.

Function | Example | Description
--- | --- | ---
starts-with | //div[starts-with(@id,"co")] | Selects div nodes whose id value starts with co
contains | //div[contains(@id,"co")] | Selects div nodes whose id value contains co
and | //div[contains(@id,"co") and contains(@id,"en")] | Selects div nodes whose id value contains both co and en
text() | //li[contains(text(),"first")] | Selects li nodes whose text contains first
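
A similar sketch for these fuzzy-matching functions, again on a made-up snippet (the id values are chosen only to exercise each function):

from lxml import etree
doc = etree.HTML('''
<body>
  <div id="content-top">first block</div>
  <div id="content-end">second block</div>
  <div id="menu">third block</div>
  <li>first item</li>
</body>''')
print(doc.xpath('//div[starts-with(@id,"co")]/@id'))                       # ['content-top', 'content-end']
print(doc.xpath('//div[contains(@id,"en")]/@id'))                          # ['content-top', 'content-end', 'menu']
print(doc.xpath('//div[contains(@id,"co") and contains(@id,"end")]/@id'))  # ['content-end']
print(doc.xpath('//li[contains(text(),"first")]/text()'))                  # ['first item']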

4. Using Google Developer Tools#

Google Developer Tools provides a very convenient way to copy XPath paths.
Example: complete code for scraping the Zhihu hot list
I tried scraping the Zhihu hot list; it requires logging in, so log in yourself and grab your cookie.

import requests
from lxml import etree
url = "https://www.zhihu.com/hot"
hd = { 'Cookie':'Your Cookie', #'Host':'www.zhihu.com',
        'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

response = requests.get(url, headers=hd)
html_str = response.content.decode()
html = etree.HTML(html_str)
title = html.xpath("//section[@class='HotItem']/div[@class='HotItem-content']/a/@title")
href = html.xpath("//section[@class='HotItem']/div[@class='HotItem-content']/a/@href")
f = open("zhihu.txt",'r+')
for i in range(1,41):
    print(i,'.'+title[i])
    print('Link:'+href[i])
    print('-'*50)
    f.write(str(i)+'.'+title[i]+'\n')
    f.write('Link:'+href[i]+'\n')
    f.write('-'*50+'\n')
f.close()

Scraping results: the hot-list titles and links are printed to the console and saved to zhihu.txt.

VI. Data Storage#

1. Storing in JSON Format#

import requests
from lxml import etree
import json
# Above code omitted
with open('zhihu.json','w',encoding='utf-8') as j:
    json.dump({'title':title,'href':href},j,ensure_ascii=False)

Storage result (PS: the file was pretty-printed afterwards for readability)
