
How can I scrape data from a website within a frame using R?

The following link contains the results of the Paris Marathon: http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon. I want to scrape these results, but the information lies within a frame. I know the basics of scraping with rvest and RSelenium, but I am clueless about how to retrieve data from inside such a frame. To give an idea, here is one of the things I tried:

url = "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
site = read_html(url)
ParisResults = site %>% html_node("iframe") %>% html_table()
ParisResults = as.data.frame(ParisResults)

Any help in solving this problem would be very welcome!

Answer

The results are loaded by AJAX from the following URL:

url="http://www.aso.fr/massevents/resultats/ajax.php?v=1460995792&course=mar16&langue=us&version=3&action=search"
  table <- url %>%
    read_html(encoding="UTF-8") %>%
    html_nodes(xpath='//table[@class="footable"]') %>%
    html_table()
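
html_table() returns a list of data frames, one per matching table. A minimal follow-up, assuming the endpoint still responds as it did at the time:

# The first list element is the results table itself
ParisResults <- results[[1]]
head(ParisResults)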

PS: I don't know exactly what AJAX is; I just know the basics of rvest.

EDIT: to answer the question in the comments: I don't have a lot of experience in web scraping. If you only use very basic techniques with rvest or XML, you have to understand the website a little better, and every site has its own structure. Here is how I did it for this one:

  1. As you can see, the page source shows no results because they live in an iframe. When inspecting the code, you can see the following after "RESULTS OF 2016 EDITION":

    class="iframe-xdm iframe-resultats" data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=3"

  2. Now you can use this URL directly: http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=2

  3. But you still can't get the results from the page source. You can then use Chrome developer tools > Network > XHR. When refreshing the page, you can see that the data is loaded from this URL (here with the Women category selected): http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F&limiter=&order=

  4. Now you can get the results!

  5. And if you want the second page, etc., you can click on the page number, then use the developer tools to see what happens. A consolidated sketch of these steps follows below.
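
Putting the steps together, here is a minimal sketch (untested; the .iframe-resultats selector comes from the markup shown in step 1, and the paging parameter mentioned in the final comment is unknown and must be discovered in the Network tab):

library(rvest)

# Step 1: read the main page and pull the iframe URL from its data-href attribute
main_url <- "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
iframe_url <- read_html(main_url) %>%
  html_node(".iframe-resultats") %>%
  html_attr("data-href")

# Steps 3-4: query the ajax endpoint directly (here the Women category, sex=F)
ajax_url <- paste0(
  "http://www.aso.fr/massevents/resultats/ajax.php",
  "?course=mar16&langue=us&version=2&action=search",
  "&fields%5Bsex%5D=F&limiter=&order="
)
women <- ajax_url %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()

# Step 5: to page through results, click a page number with the Network tab
# open and append whatever extra query parameter the XHR request reveals.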
