Scraping

Perhaps someone here is more familiar with this. I constantly check the websites of 2 different banks to see what prices they quote. Instead of going back and forth all day, I am writing my own Java program to display the current values for me, which I plan to get by scraping the websites of both banks.

I have written a simple class (using JSoup) that extracts an element by its CSS selector (pic related). In Chrome, I go on their website and right-click > [Inspect] to find the CSS node. This approach works fine for Deutsche Bank (see below) but not for Commerzbank (see below). What am I missing here?

Would really appreciate some help. I have a feeling Commerzbank works with some "lightstreamer" application which is not easily scrapable, as I can't extract anything from their site. It basically gives me nothing instead of any value or string.

Works for Deutsche Bank (pic related).

Stop cheapening out and get a data provider. Scraping is for beggars

Does not work for Commerzbank (pic related).

And let me tell you that sites like Yahoo Finance have started spoofing live prices and historical prices to combat scrapers, so your scraping days won't last

I work at a bank and I have access to Reuters Eikon and Bloomberg market data services there. This is however for a side project at home, which is not as far progressed yet. If it does, I would be willing to pay money. But this would be the interim solution for now.

set your user-agent

Set a useragent. I am sure this will help.
Some sites do not output anything if you are a bot or they see that you have no useragent.
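
A minimal sketch of what setting a user agent can look like in plain Java 11+, using the JDK's own `java.net.http` client rather than JSoup so it has no dependencies (with JSoup the equivalent is `Jsoup.connect(url).userAgent(...)`). The user-agent string, the class name `UserAgentFetch` and the `example.com` URL are placeholders, not something the banks are known to require.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class UserAgentFetch {
    // Assumption: an illustrative desktop browser string; any common one should do.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";

    /** Builds a GET request that identifies itself with a browser-like User-Agent. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("User-Agent", USER_AGENT)
            .header("Accept", "text/html")
            .timeout(Duration.ofSeconds(10))
            .GET()
            .build();
    }

    /** Sends the request and returns the raw response body (no JavaScript is executed). */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
            client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass the real page as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(fetch(url));
    }
}
```

If the body that comes back is still missing the values, the user agent is probably not the problem.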

Thanks for the tip. I will read up on that. You mean something like in the picture? Still does not seem to work.

Dump the HTML and compare with the web site

I'd guess the second bank retrieves the data with some javascript executed on load rather than the data being returned inside the HTML

I guess one has to send some form of request to the "lightstreamer" to receive any values. Just looking at the HTML does not yield much. (pic related).

Sup Forums also blocks URL requests with no UserAgent, so you can confirm if you've set it up correctly by trying to scrape this thread, before trying the bank website again.

They're using websockets (fucking why) to get the data.
So you're shit out of luck. You could try Headless Chrome to actually load the page and then use JS to pull the data out.
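
A rough sketch of driving Headless Chrome from Java by launching it as a child process. The `--headless --disable-gpu --dump-dom` flags are the ones shown on the headless-chrome page linked in this thread; the binary name `google-chrome`, the class name `HeadlessDump` and the `example.com` URL are assumptions that depend on the machine. One caveat: `--dump-dom` prints the DOM after the page has loaded, so values that only arrive later over a websocket may still be missing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class HeadlessDump {
    /** Command line that asks Chrome to load the page without a window and print the DOM. */
    static List<String> command(String chromeBinary, String url) {
        return List.of(chromeBinary, "--headless", "--disable-gpu", "--dump-dom", url);
    }

    /** Runs Chrome and returns whatever it printed on stdout (the serialized DOM). */
    static String renderedDom(String chromeBinary, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(chromeBinary, url))
            .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore Chrome's log noise
            .start();
        String dom = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return dom;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: Chrome is on the PATH under this name; adjust per system.
        String binary = args.length > 0 ? args[0] : "google-chrome";
        String url = args.length > 1 ? args[1] : "https://example.com/";
        System.out.println(renderedDom(binary, url));
    }
}
```

The returned string can then be handed to the same CSS-selector code that already works for the static page.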

Sup Forums seems to work with or without setting a user agent. I tried both

Just mentioned it because when I tried scraping inside a Python script it'd fail when using urllib2, so whatever Java lib you're using is probably already sending a known-good UserAgent. But some sites are pickier about the UserAgent string than others, and require stuff like info about character encodings and charsets.

hmm, that sounds like a lot of work unfortunately, and I wouldn't be surprised if it still doesn't work then. Seems like Commerzbank has guarded their website against automated access... I managed to get it done before with iMacros, a program loosely based on Visual Basic (pic related), but that required the browser to stay open, and the program would essentially go through all the fields you would like to extract and copy them (roughly simulating what a user would do with mouse and CTRL+C, I guess). It was also slow.

>developers.google.com/web/updates/2017/04/headless-chrome
I realized that it won't support Windows until the next version of Chrome. It's a full browser, it just doesn't render to screen.

Maybe that page will help, but I haven't used it personally.

I wrote quite a few scrapers and as a general suggestion I would recommend you use Groovy instead of Java for this kind of project. You will save literally days.

Why not an actual websockets library?
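
For what a websocket client looks like without any extra library, here is a hedged sketch using the `java.net.http.WebSocket` API that ships with Java 11+. The endpoint `wss://example.com/stream` and the class name `QuoteSocket` are hypothetical, and the sketch only prints raw text messages: whatever handshake or subscription messages the bank's streaming server expects are unknown and not handled here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

final class QuoteSocket implements WebSocket.Listener {
    private final StringBuilder buffer = new StringBuilder();
    private final Consumer<String> onMessage;

    QuoteSocket(Consumer<String> onMessage) {
        this.onMessage = onMessage;
    }

    /** Collects fragments and delivers the whole message once the last one arrives. */
    void append(CharSequence data, boolean last) {
        buffer.append(data);
        if (last) {
            onMessage.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        append(data, last);
        webSocket.request(1); // ask for the next frame
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the real one would have to be read from the browser's network tab.
        String endpoint = args.length > 0 ? args[0] : "wss://example.com/stream";
        WebSocket socket = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create(endpoint), new QuoteSocket(System.out::println))
            .join();
        Thread.sleep(10_000); // listen for ten seconds, then close politely
        socket.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```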

Gonna further this and suggest using the Geb library within Groovy. Makes creating browser bots really quick.

Use a browser proxy to find the requests the site makes?

Why are you using Java when Python is literally made for this shit? BeautifulSoup and Mechanize.

Or, selenium.

Are you sure this would work? I tried Python and BeautifulSoup before and it did not work (though I did not use Selenium or Mechanize).

If the bank requires javascript enabled for whatever reason, try using a headless browser. Or a ws library

check out selenium for java, used it a couple of times and it's surprisingly easy

thanks will take a look into both ways.

Take a look at the page's source code (not via Inspect Element), check if the data is there. If it is not the site may be generated with Javascript. Then you need something like Selenium.
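
That check can be automated in a few lines. A small sketch, assuming you already have the raw response body as a string and know a value you can currently see in the browser; the class name `RawHtmlCheck`, the sample markup and the dump file name are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class RawHtmlCheck {
    /** True if the text seen in the browser is already present in the raw HTML response. */
    static boolean isServerRendered(String rawHtml, String valueSeenInBrowser) {
        return rawHtml.contains(valueSeenInBrowser);
    }

    /** Writes the raw HTML to disk so it can be compared with the browser's view. */
    static Path dump(String rawHtml, Path target) throws IOException {
        return Files.writeString(target, rawHtml);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical responses: one with the price in the markup, one filled in by a script.
        String staticPage = "<td id=\"bid\">101.25</td>";
        String scriptedPage = "<td id=\"bid\"></td><script src=\"stream.js\"></script>";

        System.out.println(isServerRendered(staticPage, "101.25"));   // prints true
        System.out.println(isServerRendered(scriptedPage, "101.25")); // prints false

        dump(scriptedPage, Path.of("page-dump.html"));
    }
}
```

A `false` here means a plain HTML parser will never see the value, no matter which selector is used.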

This

Not sure if you are dead set on using Java but I use a JS approach with CasperJS, it loads all the scripts properly and can fire click events and whatnot

casperjs.org/

Use Groovy/Geb for Selenium drivers. Raw Selenium is an unnecessary headache.

Your Java is severely triggering me