I need to download all postings of a large forum. It has around 800k members. All content and links are public...

I need to download all postings of a large forum. It has around 800k members. All content and links are public. The forum is limited to around one connection per second. To not download for half a year, I am looking for someone with a lot of IPs under his control to download the stuff for me in parallel. Like, a botnet herder. What would be a good place to look for this?

Go ask the admin for a dump

Not here.

Hmm. Low chance of success, but it would remove some traffic from them. I'll try, I guess.

this

AWS, or dump.

> AWS, or dump.
With AWS, I'd need to rent a few dozen independent machines to have a few dozen different IPs, I suppose?

Why do you need to download everything?

> Why do you need to download everything?
I'm trying to match written material to one of the forum's members using linguistic methods. Basically, to catch my hacker.

IPs = $$$

No one will do it free for you.

Just reset your router every time, you idiot. New IP for free

How likely is it to get a dump this way? Anyone tried it before? Doesn't it make sense for admins to provide the dump rather than endure scraping?

I usually run wget or a Python script from a bunch of machines. After all, you rent them by the hour, so it isn't expensive. It probably looks like shit on the server, though.
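For the "Python script" part, here is a minimal sketch of a throttled fetch loop that stays at or below a fixed request rate. The names `Throttle` and `fetch_all` are made up for illustration, and the one-second interval is just the limit mentioned earlier in the thread, not something the forum documents.

```python
import time
import urllib.request


class Throttle:
    """Make successive wait() calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, throttle):
    """Yield (url, body) pairs, pausing between requests as the throttle dictates."""
    for url in urls:
        throttle.wait()
        with urllib.request.urlopen(url, timeout=30) as resp:
            yield url, resp.read()


# Hypothetical usage, one request per second:
# for url, body in fetch_all(["https://forum.example/thread/1"], Throttle(1.0)):
#     print(url, len(body))
```

Measuring the gap from the previous request, rather than sleeping a fixed second after each one, means slow responses don't add extra delay on top of the limit.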

Completely depends on the admin and how you ask. Got a rough estimate on cost to rent those machines?

Check out the hourly pricing on Digital Ocean, for example:
>digitalocean.com/pricing/

You set up one machine (the cheapest one will do for scraping, $0.007/hr) and just clone it a bunch of times. I don't know if there's a limit on how many you can set up at once. I think Digital Ocean says in their ToS that you could be shut down for scraping if someone complains, but I never went near causing a denial of service attack and chose a slow pace to scrape at.
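A rough way to turn that hourly price into a total. This is a back-of-the-envelope sketch: the page count is a made-up placeholder (nobody in the thread knows the real number), `rental_estimate` is a hypothetical helper, and rounding billing up to whole hours is an assumption, not something taken from the pricing page.

```python
import math


def rental_estimate(pages, machines, price_per_hour=0.007, requests_per_second=1.0):
    """Return (wall_clock_hours, total_cost_usd) for a crawl split evenly across machines."""
    pages_per_machine = math.ceil(pages / machines)
    hours = pages_per_machine / requests_per_second / 3600
    billed_hours = math.ceil(hours)  # assumption: partial hours are billed as full hours
    return hours, machines * billed_hours * price_per_hour


# Hypothetical numbers: 15 million pages, 30 machines at one request per second each.
hours, cost = rental_estimate(pages=15_000_000, machines=30)
print(f"{hours / 24:.1f} days, about ${cost:.2f}")
```

Under these assumptions the total cost barely depends on how many machines you rent, since the same number of machine-hours is needed either way; more machines only shortens the wall-clock time.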

No one said I expect to get that for free.
I'll pay up to a hundred bucks in Bitcoin for that.

You have no clue whatsoever. The problem is rate-limiting per IP, not "banning" or whatever you experience when you have to reset your router. Protip: you can simply reconnect too.

I started with httrack, where I can set connections/sec and threads. I am effectively limited to 1 HTML download per second, so it takes ten days just to download the list of users.
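The ten-day figure checks out if you assume one page per member and exactly one request per second (both are assumptions; the real member list may be paginated differently):

```python
def crawl_days(pages, requests_per_second=1.0):
    """Days needed to fetch `pages` pages at a fixed request rate."""
    seconds = pages / requests_per_second
    return seconds / 86_400  # seconds per day


print(round(crawl_days(800_000), 1))  # 9.3
```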

>limited to around one connection per second
so don't close the connection? (use pipelining)
or are you trying to say one request per second?

also why don't you download the not so recent stuff from google cache and then only download the diff from source? if that shit is Biblically available google must have indexed it

>Biblically
kek, meant to say publicly, fucking autocorrect

Not cached by google, but it's on archive.org. Will check if there's a useable mirror there, thanks!
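One way to check for a mirror page by page is the Wayback Machine availability endpoint. A small sketch, assuming the endpoint and JSON shape as I remember them from archive.org's public API (`archived_snapshots` → `closest` → `url`); the helper names and the example forum URL are made up.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query for one page URL."""
    return API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the closest snapshot URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url):
    """Query the endpoint over the network and return a snapshot URL or None."""
    with urllib.request.urlopen(availability_url(page_url), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


# Hypothetical usage:
# print(lookup("forum.example/thread/1"))
```

Pages that come back as `None` are the ones that would still have to be fetched from the forum itself.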

In fact it is cached by google! I'll look into that, although at some point archive.org and google will rate-limit or block me as well.

i doubt google even cares about the rate you fetch cached read-only content from them, they probably have that shit stored across multiple instances and with each request you'll be just ping-ponging between them
they have _massive_ data uploading capabilities, your requests are like a rain drop in an ocean
gl

Very true! Still, why do I have to solve captchas to prove I am a human on google search, every other month? Ah well, every independent source helps! archive.org, google and the original forum. Then with three different IPs, that's a tenfold speedup already. Maybe I can do it on my own, after all! Thank you, helpful user.

>hundred bucks to scrape an entire site with bots
wew lad

>why do I have to solve captchas to prove I am a human on google search
cause the last gateway between google and you is used by more people? (you have a shared ip address)

and i better leave it at that, i don't really wanna ask what country you live in or this thread will turn to shit
>poo in loo?
god damn it, i cant help it, sry

>what is DHCP lease time