Hey Sup Forums

I'm working on a project to extract data from the internet.

I want to write a program (preferably C/C++) to open a site, get the contents, then look for a specific header and return the values. Has anyone done anything similar and is prepared to clarify the basic terms?

First of all, is it possible to write it in C/C++, and what libraries should I use? I'm using Visual Studio. Is there an alternative to libcurl? I couldn't manage to get it working. After I have the XML file, what library should I use for working with that format (mshtml, msxml)? I've done basic C/C++ programming up to pointers at uni.

>I'm using Visual Studio

Use a safe language with built-in support for HTTP, like Python, Java, Go or Scala.
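For example, Python's standard library alone can fetch a page and pull the text out of a given tag. A minimal sketch, assuming the tag you want is known up front (the URL and tag name are placeholders):

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagGrabber(HTMLParser):
    """Collects the text inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.values.append(data.strip())


def extract(html, tag):
    """Pure function: list of text values found inside <tag>...</tag>."""
    grabber = TagGrabber(tag)
    grabber.feed(html)
    return [v for v in grabber.values if v]


def scrape(url, tag):
    # Network call: fetch the page, then hand it to the parser.
    with urlopen(url) as resp:
        return extract(resp.read().decode("utf-8", "replace"), tag)
```

Since extract() is pure, you can test it against a saved copy of the page before pointing scrape() at the live site.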

Q1: how much data will you be extracting?
Q2: Who will host all that data, assuming you're talking about the entire internet?

Don't use C/C++ for this; it'll make it a million times harder.

Learn C# and take a look at the AngleSharp library, it has support for CSS selectors and was designed for this kinda thing.

Far easier to develop for and will actually work immediately.

1: only text, max 1000 chars a day

Also, I want an executable file so I can run it 2 or 3 times a day.

You can execute programs written in all 4 of those languages

meant for

You can arrange that with a scheduler like cron on Linux or the Task Scheduler on Windows.
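Concretely, a crontab line on Linux or a schtasks command on Windows does the job; the script path and task name below are placeholders:

```shell
# Linux: run scraper.py at 08:00, 14:00 and 20:00 every day (add via crontab -e)
0 8,14,20 * * * /usr/bin/python3 /home/you/scraper.py

# Windows: one daily run via the built-in Task Scheduler CLI
# (repeat with different /ST times, or a GUI trigger, for multiple runs a day)
schtasks /Create /TN "Scraper" /TR "py C:\scripts\scraper.py" /SC DAILY /ST 08:00
```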

Is it possible in Windows to schedule executing Python programs? Then I'd be switching to py, because it looks easier for this amount of data.

I'm in OP's shoes, except I want to extract data from a program written in C++.

So what should I research first? I've already played with memory reading/editing.
Do I have to read packets? But aren't they encrypted?

>extract data from the internet
fuggin ebic

1 more question. I would store this data in a .txt file. The data would obviously accumulate over the years (strings of text). How big can the file get before searching it for keywords becomes inefficient?

>Is it possible in Windows to schedule executing Python programs?
Yes

are we getting rused in this thread?

There's like six reasons this is retarded.

>preferably C/C++

You haven't said why you have this preference. But seriously, this is about two minutes' work in Ruby or Python, and much longer in C.

>Is there an alternative to libcurl, because I couldn't manage to get it working.

Fix your libcurl problem. It's the most popular and portable library, and if you can't make it work you'll never get anything else to work.

>I would store this data in a .txt file
Stop. Use a fucking database like a normal person.

It will never be efficient, because of the linear nature of searching a text file. You have to sort those strings or build an index somehow.
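That's what the database suggestion is about. A sketch with Python's bundled sqlite3, using an ordinary index so lookups don't scan the whole table (the table and column names are made up for the demo):

```python
import sqlite3

# In-memory DB for the demo; pass a file path ("scrape.db") for real runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (day TEXT, text TEXT)")
con.execute("CREATE INDEX idx_day ON entries(day)")  # fast lookup by day

rows = [("2016-01-01", "first headline"), ("2016-01-02", "second headline")]
con.executemany("INSERT INTO entries VALUES (?, ?)", rows)

# The day filter uses the index; LIKE still scans, but over far fewer rows.
hits = con.execute(
    "SELECT text FROM entries WHERE day = ? AND text LIKE ?",
    ("2016-01-02", "%second%"),
).fetchall()
```

At 1000 chars a day this is overkill for years, but it costs nothing to set up now and you never hit the wall.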

This is my first introductory project of this type, so thanks for all the input. I've switched to Python. So urllib is the way to go?

What should I learn about databases?

OP, learn either Python with BeautifulSoup or Scrapy or C# and HTMLAgilityPack or AngleSharp

If you don't know HTML/CSS, good luck with your frustration.

You're thinking of a basic recursive scraper; there are tons of 'em around the internet. And since a good software engineer uses an existing solution when there is one, be aware that you can do it even with fucking wget piped into a command-line HTML parser. That's if your """"data"""" is not much.

If you actually want a real web scraper, use Scrapy. Or Apache Nutch, or Heritrix, or YaCy, or whatever pleases you the most.

Just, for Jesus' sake, don't write your own Yet Another Retarded Solution In The Language You Like The Most - unless it's just for exercise, to show your e-peen to the prof and then throw it away into oblivion 10 minutes afterwards.

Do I need to know a lot about HTML/CSS? All I need to do in the program is look for a certain tag and extract the info from there.

You don't have to be an expert, but you need to know what elements, attributes, selectors, properties, and values are. Plus the hierarchy of the HTML document: what parents/children and siblings are.
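In code terms that hierarchy is just a tree you walk. A sketch with Python's stdlib ElementTree on an invented, well-formed snippet (real-world HTML is often malformed and needs a forgiving parser like BeautifulSoup first):

```python
import xml.etree.ElementTree as ET

# Tag names and content are invented for the demo.
doc = ET.fromstring(
    "<div><ul><li>first</li><li>second</li></ul><p>after</p></div>"
)

ul = doc.find("ul")             # first child of <div> named "ul"
items = [li.text for li in ul]  # iterating an element yields its children
siblings = [child.tag for child in doc]  # <ul> and <p> are siblings under <div>
```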

You need to learn XPath.
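For the "find a certain tag" case, even the small XPath subset in Python's stdlib ElementTree is enough; lxml supports the full language. A sketch with invented tag and attribute values:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><span class="price">42</span>'
    '<span class="name">widget</span></body></html>'
)

# ".//" means "anywhere below here"; [@class="..."] filters on an attribute.
price = doc.find('.//span[@class="price"]').text
names = [e.text for e in doc.findall('.//span[@class="name"]')]
```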

Scrapy it is. Thanks Sup Forums.

>urllib
Use the "requests" library instead.
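requests rolls the fetch/decode/error-check dance into a couple of calls. A sketch, where the User-Agent string is a placeholder (identifying your scraper is just good manners):

```python
import requests


def make_session():
    # A Session reuses the TCP connection across requests
    # and carries shared headers for every call made through it.
    s = requests.Session()
    s.headers["User-Agent"] = "my-scraper/0.1"  # placeholder name
    return s


def fetch(url):
    resp = make_session().get(url, timeout=10)
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return resp.text         # decoded to str for you, unlike urllib
```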

>no mention of Perl
Pajeet my son.