BeautifulSoup Study Notes

Basic Usage of Beautiful Soup

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
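
The two lines above can be tried out with a concrete document. This is a minimal sketch: the sample markup is made up for illustration, and the standard-library parser 'html.parser' is swapped in for 'lxml' so the snippet runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original notes
html = "<html><head><title>Demo</title></head><body><p class='intro'>Hello</p></body></html>"

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text of the first <title> tag
intro_classes = soup.p["class"]  # class is multi-valued, so this is a list

print(title_text)      # Demo
print(intro_classes)   # ['intro']
```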

Node Selector

To get the text inside a node, access the node by its tag name as an attribute of soup, then read its string attribute.

soup.a.string   --->Get the text of the first a tag, returns a NavigableString (None if the tag has more than one child)
soup.a    --->Get the first a tag (Tag type); selections can be chained, e.g. soup.head.title
soup.a.attrs['class']   --->attrs is a dict of all the tag's attributes
soup.a['class'] --->Get the attribute; multi-valued attributes such as class return a list; single-valued attributes such as id or href return a string
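
A short sketch of the node selectors above. The markup and variable names are invented for illustration, and 'html.parser' is assumed as the parser. It shows that class comes back as a list while id comes back as a string, and that Tag.get() is a safe way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<p><a id="first" class="btn primary" href="/one"><b>One</b></a><a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a               # first <a> Tag in the document
nested_b = soup.a.b            # chained selection: the <b> inside the first <a>
text = soup.a.string           # <a> has one child <b> with one string, so this is "One"
classes = soup.a["class"]      # multi-valued attribute, returned as a list
link_id = soup.a["id"]         # single-valued attribute, returned as a string
missing = soup.a.get("title")  # None; soup.a["title"] would raise KeyError

print(classes, link_id, missing)
```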

Child nodes

soup.a.contents   --->Returns a list of the direct children (tags and text nodes)
soup.a.children  --->Returns an iterator over the same direct children

Descendant nodes

soup.a.descendants  --->Returns a generator
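
The difference between child and descendant nodes is easier to see on a small tree. The sketch below uses made-up markup and assumes 'html.parser'. Text nodes are included in both results, so it uses isinstance() to tell tags and strings apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup
html = "<ul><li>A</li><li><b>B</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

contents = ul.contents                       # list of the two direct <li> children
child_names = [c.name for c in ul.children]  # iterate over direct children only

# descendants walks the whole subtree in document order, text nodes included
descendants = [d.name if isinstance(d, Tag) else str(d) for d in ul.descendants]

print(child_names)   # ['li', 'li']
print(descendants)   # ['li', 'A', 'li', 'b', 'B']
```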

Parent node

soup.a.parent  --->Returns the parent node of the first a node

Ancestor nodes

soup.a.parents  --->Returns a generator
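
As an illustration of parent versus parents, the sketch below walks upward from an a tag. The markup is hypothetical and 'html.parser' is assumed. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = "<html><body><div><p><a href='#'>x</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name                   # direct parent only
ancestor_names = [p.name for p in soup.a.parents]  # every ancestor, innermost first

print(parent_name)     # p
print(ancestor_names)  # ['p', 'div', 'body', 'html', '[document]']
```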

Sibling nodes

soup.a.next_sibling  --->Returns the next sibling node (it may be a text node such as whitespace, or None if there is none)
soup.a.previous_sibling  --->Returns the previous sibling node (same caveats)
soup.a.next_siblings  --->Returns the sibling nodes after, generator type
soup.a.previous_siblings  --->Returns the sibling nodes before, generator type

When one of the properties above returns a single node, you can read string, attrs, and other attributes directly on it to get its text and attribute values.
When the result is a generator over multiple nodes, convert it to a list (or iterate over it), pick out an element, and then read string, attrs, and other attributes from that element.
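
The sibling properties and the generator-to-list conversion described above can be sketched as follows. The markup is invented and 'html.parser' is assumed. Siblings include text nodes, so the example filters for Tag objects before reading attributes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup
html = "<p><a id='1'>one</a>, <a id='2'>two</a> and <a id='3'>three</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling       # the text node ", ", not the next <a>
prev = first.previous_sibling  # None: the first <a> has nothing before it

later = list(first.next_siblings)                      # generator converted to a list
later_tags = [s for s in later if isinstance(s, Tag)]  # keep only the tags
ids = [t["id"] for t in later_tags]
second_text = later_tags[0].string

# find_next_sibling() skips text nodes and returns the next matching tag
next_a = first.find_next_sibling("a")

print(repr(str(nxt)), ids, second_text, next_a["id"])
```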

Method Selector

find_all(name, attrs, recursive, string, limit, **kwargs): Find all elements that meet the conditions (string was called text before Beautiful Soup 4.4.0, and the old name appears in many examples)

name: Query by node name, returns a list of Tag objects
attrs: Query by attributes, returns a list of Tag objects

soup.find_all(attrs={'id':'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element')  --->'class' is a Python keyword, add '_' after it
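
A minimal sketch of the name and attribute queries above, run against made-up markup with 'html.parser' assumed. It also shows that class_ matches a tag when the given value is any one of the tag's classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = """
<ul id="list-1" class="list"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2" class="list small"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("ul")                     # query by node name
by_attrs = soup.find_all(attrs={"id": "list-1"})  # query by attrs dict
by_kwarg = soup.find_all(id="list-1")             # same query as a keyword argument
by_class = soup.find_all(class_="element")        # class_ because class is reserved
small = soup.find_all(class_="small")             # matches class="list small"

print(len(by_name), [li.get_text() for li in by_class], small[0]["id"])
```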

text (named string since Beautiful Soup 4.4.0): This parameter matches against the text of a node. The input can be a string or a compiled regular expression object

soup.find_all(text=re.compile('link')) --->Returns a list of the text nodes (NavigableString) that match the regular expression, not the tags that contain them
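
The text query can be sketched as below. The markup is hypothetical, 'html.parser' is assumed, and the newer keyword string is used in place of text. On its own the query returns text nodes; combined with a tag name it returns the tags whose string matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<p><a href="/a">First link</a><a href="/b">Second link</a><a href="/c">Other</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Without a tag name the result is a list of matching text nodes
strings = soup.find_all(string=re.compile("link"))
hrefs = [s.parent["href"] for s in strings]  # step up to the enclosing tag

# With a tag name the result is a list of Tag objects whose .string matches
tags = soup.find_all("a", string=re.compile("link"))

print([str(s) for s in strings])   # ['First link', 'Second link']
print([t["href"] for t in tags])   # ['/a', '/b']
```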

find(name, attrs, recursive, string, **kwargs): Find the first element that meets the conditions. Usage is the same as find_all(), but it returns a single Tag, or None if nothing matches
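
A brief sketch contrasting find() with find_all(), using invented markup and assuming 'html.parser'. Because find() returns None when nothing matches, check the result before reading attributes from it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li", class_="element")  # a single Tag
missing = soup.find("table")                  # None, no <table> in the document
limited = soup.find_all("li", limit=1)        # a list holding the same first match

# Guard against None before touching attributes
table_text = missing.get_text() if missing is not None else ""

print(first_li.string, missing, len(limited))
```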

CSS Selector

Call the select() method and pass in a CSS selector: soup.select('CSS selector') returns a list of Tag objects (an empty list if nothing matches)

1. Supports nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
            
2. Get attributes
for ul in soup.select('ul'):
    content = ul['id']
    content = ul.attrs['id']

3. Get text
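for ul in soup.select('ul'):
    content = ul.string       --->Get the node's single string; None if the node has more than one child, as a ul with several li usually does
    content = ul.get_text()   --->Get all text inside the node, including text of descendants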
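
The three CSS selector uses above can be combined into one runnable sketch. The markup is made up and 'html.parser' is assumed. select_one() is used here as a shortcut for taking the first match. The example also shows why string is None on a ul with several children while get_text() still returns the combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = (
    '<div><ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Jay</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")

# 1. Nested selection: select() can be called again on each result
nested = [[li.get_text() for li in ul.select("li")] for ul in soup.select("ul")]

# 2. Attributes: index the Tag directly or go through attrs
ids = [ul["id"] for ul in soup.select("ul")]

# 3. Text: .string needs a single child, get_text() concatenates everything
first_ul = soup.select_one("#list-1")
ul_string = first_ul.string                           # None, the ul has two children
ul_text = first_ul.get_text()                         # "FooBar"
li_string = soup.select_one("#list-2 .element").string

print(nested, ids, ul_string, ul_text, li_string)
```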