Basic Usage of Beautiful Soup#
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
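As a minimal sketch of the two lines above, the following parses a small document and reads one value back. The markup in html is a made-up sample, and the built-in 'html.parser' is used here only so the example runs without the lxml package; 'lxml' works the same way when it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text of the first <title> tag
print(title_text)
print(soup.prettify())           # the parsed tree, re-indented
```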
Node Selector#
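To get the text inside a node, access the node by its tag name as an attribute of soup, then read its string attribute.
soup.a.string --->Get the text of the first a tag; returns a string, or None if the tag has more than one child
soup.a --->Get the first a tag (Tag type, prints as its HTML code); access can be nested, e.g. soup.head.title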
soup.a.attrs['class']
soup.a['class'] --->Get the attribute; multi-valued attributes such as class return a list, single-valued attributes such as id or href return a string;
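A short sketch of the node selector, assuming a made-up fragment with two a tags. It shows that the tag-name shortcut only ever reaches the first matching tag, and how class differs from other attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<p><a class="link external" id="first" href="/one">One</a><a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                  # first <a> tag only; the second is not reachable this way
text = first_a.string             # text of the tag
classes = first_a["class"]        # class is multi-valued, so this is a list
link_id = first_a["id"]           # id is single-valued, so this is a string
missing = first_a.get("title")    # .get() returns None instead of raising KeyError

print(text, classes, link_id, missing)
```

Using first_a.get('name') rather than first_a['name'] is the safer choice when an attribute may be absent.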
Related Selector#
- Child nodes
soup.a.contents --->Returns a list
soup.a.children --->Returns an iterator over the direct children
- Descendant nodes
soup.a.descendants --->Returns a generator
- Parent node
soup.a.parent --->Returns the parent node of the first a node
- Ancestor nodes
soup.a.parents --->Returns a generator
- Sibling nodes
soup.a.next_sibling --->Returns the next sibling node (this may be a text node such as whitespace, or None if there is none)
soup.a.previous_sibling --->Returns the previous sibling node (same caveats)
soup.a.next_siblings --->Returns the sibling nodes after, generator type
soup.a.previous_siblings --->Returns the sibling nodes before, generator type
If one of the above attributes returns a single node, you can directly read string, attrs, and other attributes on it to get its text and attribute content.
If the result is a generator of multiple nodes, convert it to a list, pick out the element you need, and then read string, attrs, and other attributes on that element.
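The related selectors can be sketched as follows, assuming a made-up fragment written without whitespace between tags so that every sibling is a Tag rather than a text node. Generators are converted to lists before indexing, as described above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; no whitespace between tags on purpose.
html = '<div id="box"><a href="/1">one</a><a href="/2"><b>two</b></a><a href="/3">three</a></div>'
soup = BeautifulSoup(html, "html.parser")

second = soup.find_all("a")[1]                       # start from the middle <a>

children = list(second.children)                     # direct children: the <b> tag
descendants = list(second.descendants)               # <b> tag plus the text inside it
parent_id = second.parent["id"]                      # attribute of the single parent node
ancestor_names = [p.name for p in second.parents]    # names of all ancestors, nearest first
next_text = second.next_sibling.string               # text of the following <a>
prev_href = second.previous_sibling["href"]          # attribute of the preceding <a>
later = [s["href"] for s in second.next_siblings]    # every sibling after this node

print(children, descendants, parent_id, ancestor_names, next_text, prev_href, later)
```

With real, indented HTML the sibling attributes often return whitespace text nodes first, so filtering for Tag objects may be needed.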
Method Selector#
find_all(name, attrs, recursive, text, **kwargs)
: Find all elements that meet the conditions
name
: Query based on the node name; returns a list of Tag objects
attrs
: Query based on attributes; returns a list of Tag objects
soup.find_all(attrs={'id':'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element') --->'class' is a Python keyword, add '_' after it
text
: Matches the text of a node. The value can be a string or a compiled regular expression object. Beautiful Soup 4.4.0 and later call this parameter string; text is the older name.
soup.find_all(text=re.compile('link')) --->Returns a list of all text nodes that match the regular expression (requires import re)
find(name, attrs, recursive, text, **kwargs)
: Find the first element that meets the conditions; usage is the same as find_all(), but it returns a single Tag, or None if nothing matches
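The queries in this section can be tried together in one sketch. The markup is a made-up sample, and the text search uses the string keyword (the newer name of the text parameter), which is an assumption that Beautiful Soup 4.4.0 or later is installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '''
<ul id="list-1">
  <li class="element">first link</li>
  <li class="element">second link</li>
</ul>
<ul id="list-2">
  <li class="element">other</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("li")                       # query by node name
by_attrs = soup.find_all(attrs={"id": "list-1"})    # query by attribute dictionary
by_id = soup.find_all(id="list-1")                  # same query as a keyword argument
by_class = soup.find_all(class_="element")          # class_ avoids the Python keyword
texts = soup.find_all(string=re.compile("link"))    # text nodes, not Tag objects

first_li = soup.find("li")                          # first match only
nothing = soup.find("table")                        # no match gives None, not an empty list

print(len(by_name), len(by_id), len(by_class), texts, first_li, nothing)
```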
CSS Selector#
Call the select() method and pass in the corresponding CSS selector: soup.select('CSS selector statement') returns a list, with elements of Tag type
1. Supports nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))
2. Get attributes
for ul in soup.select('ul'):
    content = ul['id']
    content = ul.attrs['id'] --->Equivalent to the line above
3. Get text
for ul in soup.select('ul'):
    content = ul.string --->Get the text when the node has a single child; returns None otherwise
    content = ul.get_text() --->Get all text inside the node, including text of descendants
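The three CSS selector patterns above can be combined in one runnable sketch. The markup is a made-up sample; select_one() is used as a convenience for fetching only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '''
<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element">Baz</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# 1. Nested selection: select() can be called again on each result.
nested = [[li.get_text() for li in ul.select("li")] for ul in soup.select("ul")]

# 2. Attributes: index the Tag directly.
ids = [ul["id"] for ul in soup.select("ul")]

# 3. Text: .string only works for a single child, get_text() joins everything.
first_ul = soup.select_one("#list-1")
direct = first_ul.string          # None, because this <ul> has two children
all_text = first_ul.get_text()    # text of both <li> tags joined together
single = soup.select("#list-2 .element")[0].string

print(nested, ids, direct, all_text, single)
```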