Hey Sup Forums

I'm working on a project to extract data from the internet.

I want to write a program (preferably C/C++) to open a site, get the contents, then look for a specific header and return the values. Has anyone done anything similar and is prepared to clarify the basic terms?

First of all, is it possible to write it in C/C++, and what libraries should I use? I'm using Visual Studio. Is there an alternative to libcurl? I couldn't manage to get it working. After I have the XML file, what library should I use for working with that format (mshtml, msxml)? I've done basic C/C++ programming up to pointers at uni.

>I'm using Visual Studio

Use a safe language with built-in support for HTTP, like Python, Java, Go or Scala.
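For example, Python's standard library alone can fetch a page and pull the text out of a given tag. A minimal sketch, assuming the tag you want is known up front (the URL and tag name are placeholders):

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagGrabber(HTMLParser):
    """Collects the text inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.values.append(data.strip())


def extract(html, tag):
    """Pure function: list of text values found inside <tag>...</tag>."""
    grabber = TagGrabber(tag)
    grabber.feed(html)
    return [v for v in grabber.values if v]


def scrape(url, tag):
    # Network call: fetch the page, then hand it to the parser.
    with urlopen(url) as resp:
        return extract(resp.read().decode("utf-8", "replace"), tag)
```

Since extract() is pure, you can test it against a saved copy of the page before pointing scrape() at the live site.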

Q1: how much data will you be extracting?
Q2: Who will host all that data, assuming you're talking about the entire internet?

Don't use C/C++ for this; it'll make it a million times harder.

Learn C# and take a look at the AngleSharp library, it has support for CSS selectors and was designed for this kinda thing.

Far easier to develop for and will actually work immediately.

1: only text, max 1000 chars a day

Also, I want an executable file so I can run it 2 or 3 times a day.

You can execute programs written in all 4 of those languages

meant for

You can arrange that with a scheduler like cron on Linux or the Task Scheduler on Windows.
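Concretely, a crontab line on Linux or a schtasks command on Windows does the job; the script path and task name below are placeholders:

```shell
# Linux: run scraper.py at 08:00, 14:00 and 20:00 every day (add via crontab -e)
0 8,14,20 * * * /usr/bin/python3 /home/you/scraper.py

# Windows: one daily run via the built-in Task Scheduler CLI
# (repeat with different /ST times, or a GUI trigger, for multiple runs a day)
schtasks /Create /TN "Scraper" /TR "py C:\scripts\scraper.py" /SC DAILY /ST 08:00
```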

Is it possible in Windows to schedule executing Python programs? Then I'd be switching to py, because it looks easier for this amount of data.

I'm in OP's shoes, except I want to extract data from a program written in C++.

So what should I research first? I've already played with memory reading/editing.
Do I have to read packets? But aren't they encrypted?

>extract data from the internet
fuggin ebic

1 more question. I would store this data in a .txt file. The data would obviously accumulate over the years (strings of text). How big can the file get before searching it for keywords becomes inefficient?

>Is it possible in Windows to schedule executing Python programs?
Yes

are we getting rused in this thread?

There's like six reasons this is retarded.

>preferably C/C++

You haven't said why you have this preference. But seriously, this is about two minutes' work in Ruby or Python, and much longer in C.

>Is there an alternative to libcurl, because I couldn't manage to get it working.

Fix your libcurl problem. It's the most popular and portable library, and if you can't make it work you'll never get anything else to work.

>I would store this data in a .txt file
Stop. Use a fucking database like a normal person.

It will never be efficient, because of the linear nature of searching a text file. You have to sort those strings or build an index somehow.
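That's what the database suggestion is about. A sketch with Python's bundled sqlite3, using an ordinary index so lookups don't scan the whole table (the table and column names are made up for the demo):

```python
import sqlite3

# In-memory DB for the demo; pass a file path ("scrape.db") for real runs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (day TEXT, text TEXT)")
con.execute("CREATE INDEX idx_day ON entries(day)")  # fast lookup by day

rows = [("2016-01-01", "first headline"), ("2016-01-02", "second headline")]
con.executemany("INSERT INTO entries VALUES (?, ?)", rows)

# The day filter uses the index; LIKE still scans, but over far fewer rows.
hits = con.execute(
    "SELECT text FROM entries WHERE day = ? AND text LIKE ?",
    ("2016-01-02", "%second%"),
).fetchall()
```

At 1000 chars a day this is overkill for years, but it costs nothing to set up now and you never hit the wall.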

This is my first introductory project of this type, so thanks for all the input. I've switched to Python. So urllib is the way to go?

What should I learn about databases?

OP, learn either Python with BeautifulSoup or Scrapy or C# and HTMLAgilityPack or AngleSharp

If you don't know HTML/CSS, good luck with your frustration.

You're thinking of a basic recursive scraper; there are tons of 'em around the internet. And since a good software engineer uses an existing solution when there is one, be aware that you can do it even with fucking wget piped into a command-line HTML parser. That's if your """"data"""" is not much.

If you actually want a real web scraper, use Scrapy. Or Apache Nutch, or Heritrix, or YaCy, or whatever pleases you the most.

Just, for Jesus' sake, don't write your own Yet Another Retarded Solution In The Language You Like The Most - unless it's just for exercise, to show your e-peen to the prof and then throw it away into oblivion 10 minutes afterwards.

Do I need to know a lot about HTML/CSS? All I need to do in the program is look for a certain tag and extract the info from there.

You don't have to be an expert, but you need to know what elements, attributes, selectors, properties, and values are. Plus the hierarchy of the HTML document: what parents/children and siblings are.
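In code terms that hierarchy is just a tree you walk. A sketch with Python's stdlib ElementTree on an invented, well-formed snippet (real-world HTML is often malformed and needs a forgiving parser like BeautifulSoup first):

```python
import xml.etree.ElementTree as ET

# Tag names and content are invented for the demo.
doc = ET.fromstring(
    "<div><ul><li>first</li><li>second</li></ul><p>after</p></div>"
)

ul = doc.find("ul")             # first child of <div> named "ul"
items = [li.text for li in ul]  # iterating an element yields its children
siblings = [child.tag for child in doc]  # <ul> and <p> are siblings under <div>
```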

You need to learn XPath.
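For the "find a certain tag" case, even the small XPath subset in Python's stdlib ElementTree is enough; lxml supports the full language. A sketch with invented tag and attribute values:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><span class="price">42</span>'
    '<span class="name">widget</span></body></html>'
)

# ".//" means "anywhere below here"; [@class="..."] filters on an attribute.
price = doc.find('.//span[@class="price"]').text
names = [e.text for e in doc.findall('.//span[@class="name"]')]
```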

Scrapy it is. Thanks Sup Forums.

>urllib
Use the "requests" library instead.
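requests rolls the fetch/decode/error-check dance into a couple of calls. A sketch, where the User-Agent string is a placeholder (identifying your scraper is just good manners):

```python
import requests


def make_session():
    # A Session reuses the TCP connection across requests
    # and carries shared headers for every call made through it.
    s = requests.Session()
    s.headers["User-Agent"] = "my-scraper/0.1"  # placeholder name
    return s


def fetch(url):
    resp = make_session().get(url, timeout=10)
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return resp.text         # decoded to str for you, unlike urllib
```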

>no mention of Perl
Pajeet my son.