Basic Usage of Beautiful Soup#
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')   --->html is the HTML string to parse; 'lxml' selects the parser and requires the lxml package
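As a minimal sketch of the setup above, the following parses a small HTML string and reads the page title. The sample markup is hypothetical, and the example assumes Python's built-in 'html.parser' so that it runs without the lxml package; swapping in 'lxml' should work the same way when lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; any HTML string can be used here.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Assumption: use the standard-library parser instead of 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# soup.title is the first <title> tag; .string is its text.
title_text = soup.title.string
print(title_text)
```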
Node Selector#
To get the text inside a node, access the node by its tag name as an attribute of the soup object, then read its string attribute.
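soup.a.string   --->Get the text of the first a tag; returns a NavigableString (a str subclass), or None if the tag has more than one child
soup.a    --->Get the first a tag as a Tag object; access can be chained, e.g. soup.body.a
soup.a.attrs['class']   --->attrs is a dict of all attributes of the tag
soup.a['class'] --->Get an attribute; single-valued attributes such as id return a string, while multi-valued attributes such as class return a list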
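The node selector calls above can be tried on a small document. This is only a sketch: the markup, attribute values, and variable names are made up for illustration, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags; only the first one is reached via soup.a.
html = '<p><a class="link external" id="first" href="/one">One</a><a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                 # first <a> tag, a Tag object
text = first_a.string            # text of the tag
classes = first_a["class"]       # class is multi-valued, so this is a list
link_id = first_a["id"]          # id is single-valued, so this is a string
href = first_a.attrs["href"]     # same lookup through the attrs dict

print(text, classes, link_id, href)
```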
Related Selector#
- Child nodes
 
soup.a.contents   --->Returns a list
soup.a.children  --->Returns a generator
- Descendant nodes
 
soup.a.descendants  --->Returns a generator
- Parent node
 
soup.a.parent  --->Returns the parent node of the first a node
- Ancestor nodes
 
soup.a.parents  --->Returns a generator
- Sibling nodes
 
soup.a.next_sibling  --->Returns the next sibling node (this may be a text node, such as the whitespace between two tags)
soup.a.previous_sibling  --->Returns the previous sibling node
soup.a.next_siblings  --->Returns a generator over all following sibling nodes
soup.a.previous_siblings  --->Returns a generator over all preceding sibling nodes
If one of the attributes above returns a single node, you can read string, attrs, and other attributes on it directly to get its text and attribute content.
If the result is a generator over multiple nodes, convert it to a list, pick the element you need, and then read string, attrs, and other attributes on that element.
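The relationships above can be sketched on a tiny tree. The markup is hypothetical and deliberately written without whitespace between tags, so that every sibling is a Tag rather than a text node; 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <div> with three child tags and no whitespace in between.
html = '<div id="box"><a href="/1">One</a><span>mid</span><a href="/2">Two</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first_a = soup.a

# Children: direct child nodes only.
child_names = [child.name for child in div.children]

# Descendants: child tags plus the text nodes nested inside them.
descendant_count = len(list(div.descendants))

# Parent and ancestors of the first <a>.
parent_id = first_a.parent["id"]
nearest_ancestor = list(first_a.parents)[0].name

# Siblings: a single node, then a generator converted to a list.
next_name = first_a.next_sibling.name
later_names = [sib.name for sib in first_a.next_siblings]

print(child_names, descendant_count, parent_id, next_name, later_names)
```

With real pages, the whitespace between tags shows up as text-node siblings, so it is often necessary to skip nodes whose name is None.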
Method Selector#
find_all(name, attrs, recursive, text, **kwargs): Find all elements that match the given conditions. Since Beautiful Soup 4.4.0 the text parameter is named string; text is the older name.
name: Query by node name; returns a list of Tag objects
attrs: Query by attributes; returns a list of Tag objects
soup.find_all(attrs={'id':'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element')  --->class is a reserved word in Python, so append an underscore
text: Matches against the text of nodes. The value can be a string or a compiled regular expression object
soup.find_all(text=re.compile('link')) -->Returns a list of all text nodes that match the regular expression (requires import re)
find(name, attrs, recursive, text, **kwargs): Find the first element that meets the conditions. Usage is the same as find_all(), but it returns a single Tag, or None if nothing matches
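A short sketch of the method selectors, under a few assumptions: the markup is invented for illustration, 'html.parser' replaces 'lxml', and the text filter is passed as string=, the newer name of the text parameter (Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup reusing the id and class values from the calls above.
html = """
<ul id="list-1">
  <li class="element">Foo link</li>
  <li class="element">Bar</li>
</ul>
<ul id="list-2">
  <li class="element">Another link</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("li")                       # query by node name
by_attrs = soup.find_all(attrs={"id": "list-1"})    # query by attribute dict
by_id = soup.find_all(id="list-1")                  # same query as a keyword argument
by_class = soup.find_all(class_="element")          # class needs the trailing underscore

# Text nodes whose content matches the regular expression.
matching_text = [str(t) for t in soup.find_all(string=re.compile("link"))]

# find() returns only the first match, or None when nothing matches.
first_li = soup.find("li")
missing = soup.find("table")

print(len(by_name), len(by_class), matching_text, first_li.string, missing)
```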
CSS Selector#
Call the select() method and pass in a CSS selector: soup.select('CSS selector statement') returns a list of Tag objects
1. Supports nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
            
2. Get attributes
for ul in soup.select('ul'):
    content = ul['id']
    content = ul.attrs['id']
3. Get text
for ul in soup.select('ul'):
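    content = ul.string       --->Get the text if the node has a single child; returns None if it has more than one child
    content = ul.get_text()   --->Get all text inside the node, including the text of its descendants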
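The three CSS selector patterns above can be combined in one small sketch. The markup and the selectors '#list-1 li' and '#list-2 li' are hypothetical, 'html.parser' is assumed in place of 'lxml', and select_one() is used as a convenience that returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two lists, written without whitespace between tags.
html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul><ul id="list-2"><li>Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Attributes of every matched node.
ids = [ul["id"] for ul in soup.select("ul")]

# Nested selection expressed as a single descendant selector.
items = [li.get_text() for li in soup.select("#list-1 li")]

first_ul = soup.select("ul")[0]
direct = first_ul.string          # None, because this <ul> has two children
all_text = first_ul.get_text()    # concatenated text of all descendants

single = soup.select_one("#list-2 li").string

print(ids, items, direct, all_text, single)
```

The difference between string and get_text() is visible here: string gives nothing for a node with several children, while get_text() still returns the combined text.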