Scraping

Question

Camden Powell

Perhaps someone here is more familiar with this. I constantly check the websites of two different banks to see what prices they quote. Instead of going back and forth all day, I am writing my own Java program to display the current values, which I plan to get by scraping both banks' websites.

I have written a simple class (using JSoup) that extracts an element by its CSS selector (pic related). In Chrome, I go on their website and right-click > [Inspect] to find the CSS node. This approach works fine for Deutsche Bank (see below) but not for Commerzbank (see below). What am I missing here?

I would really appreciate some help. I have a feeling Commerzbank works with some "lightstreamer" application that is not easily scrapable, as I can't extract anything from their site. It basically returns nothing instead of a value or string.

July 7, 2017 - 18:36

Elijah Davis

Works for Deutsche Bank (pic related).

July 7, 2017 - 18:37

Josiah Scott

Stop cheapening out and get a data provider. Scraping is for beggars

July 7, 2017 - 18:37

Dylan Baker

Does not work for Commerzbank (pic related).

July 7, 2017 - 18:38

William Moore

And let me tell you that sites like Yahoo Finance have started spoofing live prices and historical prices to combat scrapers, so your scraping days won't last.

July 7, 2017 - 18:38

Tyler Ortiz

I work at a bank and I have access to Reuters Eikon and Bloomberg market data services there. This, however, is for a side project at home, which has not progressed very far yet. If it does, I would be willing to pay money. Scraping would be the interim solution for now.

July 7, 2017 - 18:39

Logan Murphy

set your user-agent

July 7, 2017 - 18:40

Cooper Long

Set a useragent. I am sure this will help.
Some sites do not output anything if you are a bot or they see that you have no useragent.
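
For reference, a minimal sketch of setting a User-Agent header using only the JDK's built-in HTTP client (assumes Java 11 or later). The class name, the User-Agent string, and the example.com URL are placeholders, not anything taken from the banks' sites. With jsoup, the equivalent would be calling userAgent(...) on the connection before get().

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class UserAgentFetch {
    // Placeholder browser-like User-Agent string (an assumption); substitute any realistic value.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/59.0 Safari/537.36";

    // Builds a GET request that carries the User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // The URL is a placeholder; pass the real page as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the body still lacks the values with a browser-like User-Agent, the header is probably not the problem.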

July 7, 2017 - 18:41

Caleb Gutierrez

Thanks for the tip. I will read up on that. You mean something like in the picture? Still does not seem to work.

July 7, 2017 - 18:49

Ryan Cooper

Dump the HTML and compare with the web site

July 7, 2017 - 19:25

Gavin Adams

I'd guess the second bank retrieves the data with some javascript executed on load rather than the data being returned inside the HTML

July 7, 2017 - 19:34

Connor Foster

I guess one has to send some form of request to the "lightstreamer" to receive any values. Just looking at the HTML does not yield much. (pic related).

July 7, 2017 - 19:37

Mason Brooks

Sup Forums also blocks URL requests with no UserAgent, so you can confirm if you've set it up correctly by trying to scrape this thread, before trying the bank website again.

July 7, 2017 - 19:38

Asher Nguyen

They're using websockets (fucking why) to get the data.
So you're shit out of luck. You could try Headless Chrome to actually load the page and then use JS to pull out the data.
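
A rough sketch of the Headless Chrome route from Java, assuming a Chrome binary is on the PATH under the name "google-chrome" (that name, the class name, and the URL are assumptions). It uses the --headless and --dump-dom flags to print the DOM after the page's scripts have run. Values that only arrive later over the websocket may still be missing from the dump, in which case a scripted browser would be needed instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class HeadlessDump {
    // Assembles the command line: binary, headless flags, then the URL to load.
    static List<String> buildCommand(String chromeBinary, String url) {
        return List.of(chromeBinary, "--headless", "--disable-gpu", "--dump-dom", url);
    }

    // Runs Chrome and returns whatever it prints on stdout (the serialized DOM).
    static String dumpDom(String chromeBinary, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(chromeBinary, url))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop Chrome's log noise
                .start();
        String dom = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return dom;
    }

    public static void main(String[] args) throws Exception {
        // Binary name and URL are placeholders; adjust both for the local setup.
        System.out.println(dumpDom("google-chrome", "https://example.com/"));
    }
}
```

The returned string can then be handed to the same CSS-selector extraction that already works for the other bank.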

July 7, 2017 - 19:44

Anthony Cruz

Sup Forums seems to work with or without setting a user agent. I tried both

July 7, 2017 - 19:45

Jace Kelly

Just mentioned it because when I tried scraping inside a Python script, it would fail when using urllib2, so whatever Java lib you're using is probably already sending a known-good UserAgent. But some sites are pickier about the UserAgent string than others, and require stuff like info about character encodings and charsets.

July 7, 2017 - 19:51

Nolan Morales

Hmm, that sounds like a lot of work unfortunately, and I wouldn't be surprised if it still doesn't work then. Seems like Commerzbank has guarded their website against automated access... I managed to get it done before with iMacros, a program loosely based on Visual Basic (pic related), but that required the browser to stay open, and the program would essentially go through all the fields you would like to extract and copy them (simulating roughly what a user would do with mouse and CTRL+C, I guess). It was also slow.

July 7, 2017 - 20:05

Dominic Perry

>developers.google.com/web/updates/2017/04/headless-chrome
I realized that it won't support Windows until the next version of Chrome. It's a full browser, it just doesn't render to screen.

Maybe that page will help, but I haven't used it personally.

July 7, 2017 - 20:10

Owen Thomas

I wrote quite a few scrapers and as a general suggestion I would recommend you use Groovy instead of Java for this kind of project. You will save literally days.

July 7, 2017 - 21:06

Carter James

Why not an actual websockets library?
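
A sketch of what that could look like with the JDK's own WebSocket client (assumes Java 11 or later). The endpoint is passed in as an argument because the real one is not known from this thread, and Lightstreamer presumably expects its own session handshake on top of the raw socket, so treat this only as the receiving half. The class name FrameCollector is made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

final class FrameCollector implements WebSocket.Listener {
    private final StringBuilder buffer = new StringBuilder();
    private final Consumer<String> onMessage;

    FrameCollector(Consumer<String> onMessage) {
        this.onMessage = onMessage;
    }

    // Text messages can arrive in fragments; hand over the message once the last part is in.
    void append(CharSequence data, boolean last) {
        buffer.append(data);
        if (last) {
            onMessage.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        append(data, last);
        webSocket.request(1); // ask for the next message or fragment
        return null;          // null: the data may be reclaimed immediately
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the ws:// or wss:// endpoint, e.g. found via the browser's network tab.
        URI endpoint = URI.create(args[0]);
        HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(endpoint, new FrameCollector(System.out::println))
                .join();
        Thread.sleep(10_000); // keep the JVM alive briefly to receive messages
    }
}
```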

July 7, 2017 - 21:38

Carter Campbell

Gonna further this and suggest using the Geb library within Groovy. Makes creating browser bots really quick.

July 8, 2017 - 01:26

Carson White

Use a browser proxy to find the requests the site makes?

July 8, 2017 - 01:56

Andrew Brown

Why are you using Java when Python is literally made for this shit? BeautifulSoup and Mechanize.

Or, selenium.

July 8, 2017 - 01:57

Caleb Gonzalez

Are you sure this would work? I tried Python and BeautifulSoup before and it did not work (though I did not use Selenium or Mechanize).

July 8, 2017 - 04:56

Liam Price

If the bank requires javascript enabled for whatever reason, try using a headless browser. Or a ws library

July 8, 2017 - 04:58

Josiah Brooks

check out selenium for java, used it a couple of times and it's surprisingly easy

July 8, 2017 - 05:11

John Wright

thanks will take a look into both ways.

July 8, 2017 - 05:21

Benjamin Rivera

Take a look at the page's source code (not via Inspect Element) and check if the data is there. If it is not, the site may be generated with JavaScript. Then you need something like Selenium.
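
That check can be automated with a few lines of Java (assumes Java 11 or later). This sketch takes a file holding the raw page source, saved via View Source rather than Inspect, and a value currently visible in the browser; the class name and both arguments are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class RawSourceCheck {
    // True if the value seen in the browser also occurs in the raw, unrendered HTML.
    static boolean appearsInSource(String rawHtml, String visibleValue) {
        return rawHtml != null
                && visibleValue != null
                && !visibleValue.isEmpty()
                && rawHtml.contains(visibleValue);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the saved page source, args[1]: a price as shown in the browser.
        String rawHtml = Files.readString(Path.of(args[0]));
        if (appearsInSource(rawHtml, args[1])) {
            System.out.println("Value is in the static HTML; a plain HTML parser should do.");
        } else {
            System.out.println("Value not found; it is probably filled in by JavaScript.");
        }
    }
}
```

Prices change quickly, so the source should be saved and the value read off at roughly the same moment.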

July 8, 2017 - 05:23

Alexander Richardson

This

July 8, 2017 - 05:24

Adam Long

Not sure if you are dead set on using Java but I use a JS approach with CasperJS, it loads all the scripts properly and can fire click events and whatnot

casperjs.org/

July 8, 2017 - 05:26

Charles Richardson

Use Groovy/Geb for Selenium drivers. Raw Selenium is an unnecessary headache.

July 8, 2017 - 06:01

Michael Lopez

Your Java is severely triggering me

July 8, 2017 - 08:31
