Python BeautifulSoup Loop

Thanks to this board I have managed to retrieve the name and the price of the item I want using this code:

import re
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://www.toolventure.co.uk/hand-tools/saws/').read()
soup = BeautifulSoup(html)

# First product name, with runs of whitespace collapsed to single spaces
item = re.sub(r'\s+', ' ', soup.h2.a.text)

# First price paragraph, reduced to just the numeric price
price = soup.find('p', '*price').text
price = re.search(r'\d+\.\d+', price).group(0)

print item, price

This is great, as it returns one result perfectly. Moving on, I am now trying to retrieve ALL the results on the page. I have been playing around with loops, but I am very new to this and cannot work out how to loop it.

Can someone more knowledgeable point me in the right direction?

Many thanks

Comments

I'd use findAll for this:

soup = BeautifulSoup(html)

# The two product container classes used on the page (the class names
# include a trailing space, matched as whole attribute strings here)
mostwant = {'class': 'productlist_mostwanted_item '}
griditem = {'class': 'productlist_grid_item '}

divs = soup.findAll(attrs=mostwant) + soup.findAll(attrs=griditem)

for product in divs:
    item = product.h2.a.text.strip()
    price = re.search(r'\d+\.\d+', product.findAll('p')[1].text).group(0)
    print '%s - %s' % (item, price)
