Perhaps someone here is more familiar with this. I constantly check the websites of 2 different banks to see what prices they quote. Now, instead of going back and forth the whole day, I am writing my own Java program to display the current values for me, which I plan to get by scraping both banks' websites.
I have written a simple class (using JSoup) that extracts an element by its CSS selector (pic related). In Chrome, I go on their website and right-click > [Inspect] to find the CSS node. This approach works fine for Deutsche Bank (see below) but not for Commerzbank (see below). What am I missing here?
Would really appreciate some help. I have a feeling Commerzbank works with some "lightstreamer" application which is not easily scrapable, as I can't extract anything from their site. It basically gives me nothing instead of any value or string.
Stop cheapening out and get a data provider. Scraping is for beggars
Dylan Baker
Does not work for Commerzbank (pic related).
William Moore
And let me tell you that sites like Yahoo Finance have started spoofing live prices and historical prices to combat scrapers, so your scraping days won't last
Tyler Ortiz
I work at a bank and I have access to Reuters Eikon and Bloomberg market data services there. This, however, is for a side project at home which has not progressed that far yet. If it does, I would be willing to pay money. But this would be the interim solution for now.
Logan Murphy
set your user-agent
Cooper Long
Set a useragent. I am sure this will help. Some sites do not output anything if you are a bot or they see that you have no useragent.
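For reference, a minimal sketch of what "setting a user agent" looks like using only the JDK's built-in HTTP client (Java 11+). The user-agent string and the default URL are placeholders, not values taken from this thread. With JSoup the same idea is `Jsoup.connect(url).userAgent(...)`. Note that this only changes how the request looks; it will not help if the page fills in its values with JavaScript after loading.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class UserAgentFetch {
    // Assumption: an illustrative desktop-browser string; any current one should do.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a GET request that identifies itself like a regular browser.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html,application/xhtml+xml")
                .header("Accept-Language", "en-US,en;q=0.8")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as text.
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical default URL; pass the real page as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(fetch(url).length() + " characters received");
    }
}
```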
Caleb Gutierrez
Thanks for the tip. I will read up on that. You mean something like in the picture? Still does not seem to work.
Ryan Cooper
Dump the HTML and compare with the web site
Gavin Adams
I'd guess the second bank retrieves the data with some javascript executed on load rather than the data being returned inside the HTML
Connor Foster
I guess one has to send some form of request to the "lightstreamer" to receive any values. Just looking at the HTML does not yield much. (pic related).
Mason Brooks
Sup Forums also blocks URL requests with no UserAgent, so you can confirm if you've set it up correctly by trying to scrape this thread, before trying the bank website again.
Asher Nguyen
They're using websockets (fucking why) to get the data. So you're shit out of luck. You could try Headless Chrome to actually load the page and then use JS to pull the data out.
Anthony Cruz
Sup Forums seems to work with or without setting a user agent. I tried both
Jace Kelly
Just mentioned it because when I tried scraping inside a python script it'd fail when using urllib2, so whatever java lib you're using is probably already sending a known-good UserAgent. But some sites are pickier about the UserAgent string than others, and require stuff like info about character encodings and charsets.
Nolan Morales
hmm, that sounds like a lot of work unfortunately and I wouldn't be surprised if it still doesn't work then. Seems like Commerzbank has guarded their website against automated access... I managed to get it done before with iMacros, a program loosely based on Visual Basic (pic related), but that required the browser to stay open, and the program would essentially go through all the fields you would like to extract and copy them (simulating roughly what a user would do with mouse and CTRL+C, I guess). It was also slow.
Maybe that page will help, but I haven't used it personally.
Owen Thomas
I wrote quite a few scrapers and as a general suggestion I would recommend you use Groovy instead of Java for this kind of project. You will save literally days.
Carter James
Why not an actual websockets library?
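A rough sketch of the websocket route using the JDK's own `java.net.http.WebSocket` (Java 11+), so no extra library is needed. The class name `FrameCollector` is made up for illustration, and the endpoint has to be passed in; the real URL would have to be read from the browser's network tab. A streaming server such as Lightstreamer most likely expects its own handshake and subscription messages on top of the socket, so treat this as the plumbing only, not a working client for the bank's feed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.List;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CopyOnWriteArrayList;

class FrameCollector implements WebSocket.Listener {
    private final StringBuilder partial = new StringBuilder();
    final List<String> messages = new CopyOnWriteArrayList<>();

    // A text message may arrive in several fragments; 'last' marks the final one.
    void accept(CharSequence data, boolean last) {
        partial.append(data);
        if (last) {
            messages.add(partial.toString());
            partial.setLength(0);
        }
    }

    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence data, boolean last) {
        accept(data, last);
        webSocket.request(1); // ask the client for the next message
        return null;          // null means the data can be reclaimed immediately
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java FrameCollector wss://host/path");
            return;
        }
        FrameCollector collector = new FrameCollector();
        WebSocket socket = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create(args[0]), collector)
                .join();
        Thread.sleep(5000); // listen for a few seconds, then print what arrived
        collector.messages.forEach(System.out::println);
        socket.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```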
Carter Campbell
Gonna further this and suggest using the Geb library within Groovy. Makes creating browser bots really quick.
Carson White
Use a browser proxy to find the requests the site makes?
Andrew Brown
Why are you using Java when Python is literally made for this shit? BeautifulSoup and Mechanize.
Or, selenium.
Caleb Gonzalez
Are you sure this would work? I tried Python and BeautifulSoup before and it did not work (though I did not use Selenium or Mechanize).
Liam Price
If the bank requires javascript enabled for whatever reason, try using a headless browser. Or a ws library
Josiah Brooks
check out selenium for java, used it a couple of times and it's surprisingly easy
John Wright
thanks will take a look into both ways.
Benjamin Rivera
Take a look at the page's source code (not via Inspect Element) and check if the data is there. If it is not, the site may be generated with Javascript. Then you need something like Selenium.
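That check can also be done from code. A minimal sketch, assuming you note down a value you currently see in the browser and then look for it in the raw response: the class name, the dump file name `dump.html` and the command-line arguments are placeholders for illustration. If the value is missing from the raw HTML, a plain HTTP fetch (JSoup included) will never see it and a real or headless browser is needed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class RawHtmlCheck {
    // True if the value visible in the browser is already present in the raw HTML.
    static boolean appearsInRawHtml(String rawHtml, String valueSeenInBrowser) {
        return rawHtml.contains(valueSeenInBrowser);
    }

    // Downloads the page without running any scripts.
    static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java RawHtmlCheck <url> <value-seen-in-browser>");
            return;
        }
        String html = fetch(args[0]);
        // Keep a copy on disk so it can be compared with the rendered page by hand.
        Files.writeString(Path.of("dump.html"), html);
        System.out.println(appearsInRawHtml(html, args[1])
                ? "Value is in the raw HTML: a CSS selector should work."
                : "Value is missing: probably filled in by JavaScript after load.");
    }
}
```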
Alexander Richardson
This
Adam Long
Not sure if you are dead set on using Java but I use a JS approach with CasperJS, it loads all the scripts properly and can fire click events and whatnot