Using Scrapy recursively to scrape a phpBB forum

Question: I am trying to use Scrapy to scrape a phpBB-based forum. My Scrapy knowledge is quite basic (but improving). Extracting the contents of the first page of a forum thread is more or less easy. My working scraper looks like this:

import scrapy
from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto3"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()')
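To crawl the thread recursively, phpBB's `st` query parameter is the post offset, so the spider can yield a request for the next page from `parse()` until a page comes back empty. A small helper sketch, assuming the default phpBB page size of 15 posts per page (adjust to the forum's actual setting):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def next_page_url(url, step=15):
    """Return the URL of the next page of a phpBB thread by advancing
    the `st` (start offset) query parameter by one page of posts."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["st"] = str(int(query.get("st", "0")) + step)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside `parse()`, after the item loop, something like `yield scrapy.Request(next_page_url(response.url), callback=self.parse)`, guarded by a check that the current page actually yielded posts, walks the whole thread.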

2022-05-12 05:29:07    Category: Technical Sharing    python-2.7   xpath   web-scraping   scrapy   screen-scraping

Getting SIGNATURE (parameter names) MISMATCH with Java POST but not with curl.

Question: curl handles the parameter list fine (curl -d), but when I run the POST with OpenURL I get a SIGNATURE MISMATCH. I am finding it hard to see what is missing. This works:

$ curl -d "term_in=201510&sel_subj=dummy&sel_day=dummy&sel_schd=dummy&sel_insm=dummy&sel_camp=dummy&sel_levl=dummy&sel_sess=dummy&sel_instr=dummy&sel_ptrm=dummy&sel_attr=dummy&sel_subj=%&sel_crse=&sel_title=&sel_from_cred=&sel_to_cred=&sel_camp=%25&sel_levl=%25&begin_hh=0&begin_mi=0&begin_ap=a&end_hh=0&end_mi=0&end_ap=a" https://somesite.ca/banprod/bwckschd.p_get_crse_unsec

This fails:

NameValuePair[] data = {
    new NameValuePair("term_code", termCode),
    new NameValuePair("sel_subj", "dummy"),
    new
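The Oracle PL/SQL gateway raises this mismatch when the posted parameter names do not match the procedure's signature; note that the working curl command sends `term_in` while the Java code sends `term_code`, and that `sel_subj` appears twice and must stay duplicated. A Python sketch (used for brevity throughout) of the key principle, keeping duplicates intact by encoding an ordered list of pairs rather than a map:

```python
from urllib.parse import urlencode

# curl -d sends the raw body string verbatim, so repeated names like
# sel_subj survive. To reproduce that from code, keep the parameters as
# a list of (name, value) pairs; a map would silently collapse the
# duplicate names and change the procedure's effective signature.
params = [
    ("term_in", "201510"),      # curl sends term_in, not term_code
    ("sel_subj", "dummy"),
    ("sel_subj", "%"),          # second occurrence must survive encoding
    ("sel_crse", ""),
]
body = urlencode(params)
```

In Java, a `NameValuePair[]` preserves duplicates the same way, so the parameter names themselves are the first thing to diff against the curl body.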

2022-05-08 09:17:05    Category: Technical Sharing    java   oracle   curl   plsql   screen-scraping

Logging into website with multiple pages using Python (urllib2 and cookielib)

Question: I am writing a script to retrieve transaction information from my bank's home-banking site for use in a personal mobile app. The site is laid out as follows:

https://homebanking.purduefed.com/OnlineBanking/Login.aspx -> enter username -> submit form ->
https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx -> enter password -> submit form ->
https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx

Because there are two separate pages that POST, I first assumed the problem was session information being lost. But I used urllib2's HTTPCookieProcessor to store cookies and made GET and POST requests against the site, and found that this was not the issue. My current code is:

import urllib
import urllib2
import cookielib

loginUrl = 'https://homebanking.purduefed.com/OnlineBanking/Login.aspx'
passwordUrl = 'https://homebanking.purduefed.com
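ASPX pages typically require their hidden form fields (__VIEWSTATE, __EVENTVALIDATION and friends) to be posted back along with the visible ones, and losing them between the username and password pages causes exactly this kind of failure even when cookies are handled correctly. A minimal stdlib sketch of collecting those fields from each page before building the next POST (whether the Purdue Fed pages actually use them is an assumption to verify in the page source):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect <input type="hidden"> name/value pairs from a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields
```

The flow is then: GET Login.aspx, merge `hidden_fields(html)` with the username and POST; repeat for Password.aspx with the password, reusing the same cookie-aware opener for every request.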

2022-05-05 09:09:16    Category: Technical Sharing    python   screen-scraping   urllib2   cookielib

How to scrape hidden class data using selenium and beautiful soup

I'm trying to scrape the content of a JavaScript-enabled web page. I need to extract the data in a table on that site; however, each row of the table has a button (an arrow) that reveals additional information for that row, and I need to extract that additional description for each row. Inspecting the page shows that the contents behind each row's arrow belong to the same class, but that class is hidden in the page source; it can be seen only in the inspector. The data I'm trying to parse is from that webpage. I have used Selenium and Beautiful Soup, and I'm able to scrape the table data but not the content of
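Since the extra description only exists in the DOM after interaction, the usual pattern is to click every arrow with Selenium first and then parse `driver.page_source`. A stdlib sketch of pulling the text out of elements carrying a given class token; the token `"hidden"` in the test is a placeholder, so copy the real class name from the inspector:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element whose class list contains
    `token` (assumes well-formed, non-void markup for simplicity)."""
    def __init__(self, token):
        super().__init__()
        self.token, self.depth, self.chunks = token, 0, []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        # Enter the element if it matches, or stay inside a matched one.
        if self.depth or self.token in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def texts_with_class(html, token):
    parser = ClassTextExtractor(token)
    parser.feed(html)
    return parser.chunks
```

With Selenium, click the arrows first (for example via `driver.find_elements` over the arrow buttons), then call `texts_with_class(driver.page_source, "<real-class-name>")`; Beautiful Soup's `soup.select(".classname")` does the same job if you prefer it.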

2022-05-03 00:26:30    Category: Q&A    python   selenium   web-scraping   beautifulsoup   screen-scraping

How do I download the highest resolution image from a JavaScript rendered responsive page?

Suppose this is the website page: "https://www.dior.com/en_us/products/couture-943C105A4655_C679-technical-fabric-cargo-pants-covered-in-tulle", from which I want to download all the images of the product showcased (four images in this case). I am using Selenium and extracting image links. The problem is that if I click the images they are as large as 2000x3000 pixels, but I am only able to get versions around 480 pixels in resolution. Where are these images stored, and how do I extract them? (Basically, I want to download the maximum possible size of those images.)
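Responsive pages commonly declare every available size in the image's `srcset` attribute, so rather than saving the ~480px candidate the browser happened to render, you can read `element.get_attribute("srcset")` in Selenium and keep the widest entry. Whether Dior's page uses width descriptors in `srcset` (rather than, say, a size segment encoded in the URL) is an assumption to check in DevTools. A sketch of the selection step:

```python
def largest_from_srcset(srcset):
    """Pick the URL with the largest width descriptor (e.g. '2000w')
    from a srcset attribute string."""
    best_url, best_w = None, -1
    for candidate in srcset.split(","):
        parts = candidate.split()
        if len(parts) == 2 and parts[1].endswith("w"):
            width = int(parts[1][:-1])
            if width > best_w:
                best_url, best_w = parts[0], width
    return best_url
```

The returned URL can then be downloaded directly, bypassing whatever size the responsive layout chose for your window.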

2022-05-01 20:56:00    Category: Q&A    javascript   python   screen-scraping   responsive

requests.get(url) not returning for this specific url

I'm trying to use requests.get(url).text to get the HTML from this website. However, when requests.get(url) is called with this specific URL, it never returns no matter how long I wait. It works with other URLs, but this one specifically is giving me trouble. The code is below:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.carmax.com/cars/all', allow_redirects=True).text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify().encode('utf-8'))

Thanks for any help!
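A request that never returns usually means the server is deliberately holding the connection open, often because bot protection stalls clients that don't look like browsers; and `requests` applies no timeout by default, so the call blocks forever. A hedged sketch, assuming header-based fingerprinting is the cause on this site:

```python
import requests

# Headers a real browser would send; some bot-protection layers hold
# connections from clients without them open indefinitely.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/100.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(url, timeout=15):
    # `timeout` bounds the connect and read phases, so a stalling server
    # raises requests.exceptions.Timeout instead of blocking forever.
    return requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
```

Even if the headers turn out not to be the issue here, always passing a `timeout` converts a silent hang into a diagnosable exception.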

2022-04-30 12:38:31    Category: Q&A    python   web   python-requests   screen-scraping

How to fill in the Amazon payment form using Selenium (Python)

This is one part of my code. Here I click 'Add a credit or debit card' and switch to the frame; I then proceed to fill the form but get this error:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="pp-QqmNYT-14"]"}

This is the relevant part of my code:

self.__driver.find_element(By.LINK_TEXT, 'Add a credit or debit card').click()
self.__driver.switch_to.frame(self.__driver.find_element_by_tag_name('iframe'))
self.__driver.find_element(By.XPATH, '//*[@id="pp-QqmNYT-14"]').send_keys("user admin")
self._
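Ids like `pp-QqmNYT-14` are auto-generated per session, which is why a hard-coded XPath stops matching on the next run. The usual fix is twofold: locate the field by a stable attribute instead of the generated id, and wait for the iframe to load (e.g. with `WebDriverWait` and `expected_conditions.frame_to_be_available_and_switch_to_it`) rather than switching immediately after the click. A tiny helper for the attribute-based locator; the attribute value in the test is a hypothetical placeholder to be replaced with whatever DevTools shows on the real form:

```python
def xpath_for(attr, value):
    """Build an XPath matching any element by a stable attribute,
    instead of an auto-generated id like pp-QqmNYT-14."""
    return f"//*[@{attr}={value!r}]"
```

Used as `driver.find_element(By.XPATH, xpath_for("name", "cardNumber"))` after the explicit wait for the iframe has succeeded.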

2022-04-30 12:10:11    Category: Q&A    forms   selenium   iframe   amazon   screen-scraping

Argentina supermarket web scraping

Question: I am trying to scrape data, such as live prices, product names and images, from the site https://www.disco.com.ar/Comprar/Home.aspx#_atCategory=false&_atGrilla=true&_id=21063 using a macro in Excel 2013. I have tried an Excel web query, but it doesn't work. Is there a way to do this?

Answer 1: Here is an example showing how to retrieve data from the website using XHRs and JSON parsing; it consists of several steps. Retrieving the data: I did some research on the XHRs with the Chrome Developer Tools Network tab. The most relevant data I found is the JSON string returned by a POST XHR to https://www.disco.com.ar/Comprar/HomeService.aspx/ObtenerLimiteDeProductos. The POST XHR did not work for me without a cookie header, so I first had to add an extra HEAD XHR to retrieve the ASP.NET_SessionId cookie; the server version of XMLHTTP is used to control the cookies. The only response whose headers return that cookie is a GET XHR to https://www.disco.com.ar/Login/PreHome.aspx. The retrieved JSON string should be parsed twice
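The answer's flow ends with parsing the JSON twice. That double decode is standard for ASP.NET AJAX page methods, which wrap their payload as a JSON string inside a "d" member; a Python sketch of just that step (the original answer is VBA, and the "d" wrapper is an assumption based on typical `.aspx/MethodName` responses):

```python
import json

def parse_asmx_json(raw):
    """ASP.NET page-method responses wrap the payload as a JSON string
    inside a "d" member, so the body must be decoded twice: once for
    the envelope, once for the string it carries."""
    outer = json.loads(raw)
    return json.loads(outer["d"])
```

The cookie step still applies regardless of language: request PreHome.aspx first, capture ASP.NET_SessionId, and send it with the POST to ObtenerLimiteDeProductos before handing the body to this parser.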

2022-04-30 06:09:28    Category: Technical Sharing    excel   vba   web   web-scraping   screen-scraping

Getting HTML from a page behind a login

This question is a follow-up to my previous question about getting the HTML from an ASPX page. I decided to try using the WebClient object, but the problem is that I get the login page's HTML because login is required. I tried "logging in" using the WebClient object:

WebClient ww = new WebClient();
ww.DownloadString("Login.aspx?UserName=&Password=");
string html = ww.DownloadString("Internal.aspx");

But I still get the login page every time. I know that the username info is not stored in a cookie. I must be doing something wrong or leaving out an important part. Does anyone know what it
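The snippet is C#, but the two likely problems are language-independent: WebClient does not keep cookies between requests unless given a cookie container, and the credentials are being sent as a GET query string rather than a POST body, so the server never issues an authenticated session that the second request could reuse. A sketch of the cookie-preserving shape in Python for brevity; the UserName/Password field names are hypothetical placeholders to be copied from the real login form:

```python
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor
from urllib.parse import urlencode

def login_body(username, password):
    # Credentials belong in the POST body, not in the query string as in
    # "Login.aspx?UserName=&Password=". Field names are hypothetical.
    return urlencode({"UserName": username, "Password": password}).encode()

def logged_in_opener(login_url, username, password):
    # One opener owns one cookie jar; reuse it for the internal page so
    # the session cookie issued at login is sent back automatically.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    opener.open(login_url, data=login_body(username, password))
    return opener
```

The equivalent fix in C# is an HttpWebRequest (or HttpClient) configured with a shared CookieContainer, POSTing the form fields to Login.aspx before requesting Internal.aspx.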

2022-04-29 17:26:28    Category: Q&A    asp.net   html   screen-scraping

R - Error scraping an ASPX web page

Question:

library(rvest)
url <- "http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=14-12-2016&venue=HV&raceno=1&lang=en"
R1odds <- url %>% read_html() %>% html_nodes("table") %>% .[[2]] %>% html_table(fill=TRUE)
R1odds

I get this error message:

Error: input conversion failed due to input error, bytes 0x3C 0x2F 0x6E 0x6F [6003]

How can I fix this?

Answer 1: For anyone else who may run into a similar situation in a non-gambling context, here is a solution to get around the null bytes; you will have to deal with the gambling-data issue yourself:

library(rvest)
library(curl)
url <- "http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=14-12-2016&venue=HV&raceno=1&lang=en"
pg <- curl_fetch_memory(url)
pg$content %>% readBin(what=character()) %>% read_html() -> doc
html
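The R answer works because curl_fetch_memory returns raw bytes and readBin skips past the embedded NULs that make read_html abort with the input-conversion error. The same idea sketched in Python (assuming NUL bytes are indeed the culprit, as the answer's "get around the nulls" framing suggests): fetch the body as bytes, drop the NULs, then parse.

```python
def strip_nulls(raw):
    """Remove embedded NUL bytes that make strict HTML/XML parsers
    abort with input-conversion errors before parsing begins."""
    return raw.replace(b"\x00", b"")
```

The cleaned bytes can then go straight into any parser, e.g. `BeautifulSoup(strip_nulls(response.content), "lxml")`, the rough equivalent of the readBin step above.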

2022-04-29 05:22:19    Category: Technical Sharing    asp.net   r   screen-scraping