Sup Forums thread downloader

Isn't there software that doesn't suck that can download all files from a thread with their original filenames?
All the ones I found are trash, and 4chanx doesn't seem to have such an option.
No botnet-tier chrome extensions pls.

Just write one yourself.

I've been using ripme.jar

github.com/4pr0n/ripme

Clover can download all files with filenames and puts each thread in a separate folder

Implying I know how.
Pretty good, but it doesn't save filenames, right? I'd like to use it for a webm database.
I know, but it's on mobile; I don't want to download a thread and then have to export everything to my computer every time.

Just fucking write one yourself. Look. Mine's done in fucking python 3.X. It's 121 lines long because I was verbose with the documentation and liked the idea of expanding it later, so some of it is more complex than it needs to be (calling the API, for example). Some anons have managed an image grabber using one line of bash script. This is not fucking hard.
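For reference, the one-line bash version of the idea, straight off the JSON API (untested sketch; board g and the thread number are placeholders, and it assumes you have jq and wget):

curl -s 'https://a.4cdn.org/g/thread/64369961.json' | jq -r '.posts[] | select(.tim) | "https://i.4cdn.org/g/\(.tim)\(.ext)"' | wget -nc -i -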

I don't know how to code and don't have the time or the will to learn.

Then kindly fuck off, little princess. You've been given multiple solutions and Sup Forums isn't your personal fucking helpdesk anyway.

I have a job and a family; when I come back from work, I just want to spend time with my family or lurk Sup Forums, not work even more.

Fuck you, neither did I. I literally wrote one line every 5 minutes with a dozen or so "do [action] in python 3" searches open in tabs. I only stopped because the scope of what I was trying to accomplish was getting out of hand. Also, I decided to add the option to preserve original filenames in my script just to gloat over having built what you want but are too lazy and incompetent to make yourself.

Coding isn't for everyone. Here's a bash script

#!/bin/bash
#
# Get all images from Sup Forums thread

if [ $# = 0 ]; then
echo 'Usage: ./4chget.sh [thread url]'
exit
fi

LINK=$1
BOARD=${LINK#http*://*/*}
THREAD=${BOARD##*/*/}
BOARD=${BOARD%%/*}
DIR="$BOARD-$THREAD"

echo 'Board: ' $BOARD

mkdir -p $DIR ; cd $DIR

curl $LINK -o 'orig.html'
tr ' ' '\n' < orig.html > tmp.html
grep -oh 'i.4cdn.org/*/.*jpg' tmp.html > jpgs.txt
grep -oh 'i.4cdn.org/*/.*png' tmp.html > pngs.txt
if [[ -s jpgs.txt ]]; then
sort -u jpgs.txt | grep -v 's.*' | wget -i -
fi
if [[ -s pngs.txt ]]; then
sort -u pngs.txt | grep -v 's.*' | wget -i -
fi

#clean up
rm orig.html tmp.html jpgs.txt pngs.txt

...

Just learn python. It won't be that difficult.

Basically this.

4dl() {
board="$(printf -- '%s' "${1:?}" | cut -d '/' -f4)"
thread="$(printf -- '%s' "${1:?}" | cut -d '/' -f6)"
wget -qO- "a.4cdn.org/${board}/thread/${thread}.json" | jq -r '
.posts
| map(select(.tim != null))
| map((.tim | tostring) + .ext)
| map("

Is there a way to "watch" for when a thread is created with certain keywords?
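
Not out of the box afaik, but the catalog endpoint makes polling for it trivial. Rough sketch (a.4cdn.org/[board]/catalog.json is part of the same JSON API; the board and pattern are just examples):

#!/bin/bash
# poll the catalog and print threads whose subject or comment match a pattern
board=g
pattern='thread downloader'
while true; do
    curl -s "https://a.4cdn.org/${board}/catalog.json" \
        | jq -r --arg re "$pattern" --arg b "$board" '
            .[].threads[]
            | select(((.sub // "") + " " + (.com // "")) | test($re; "i"))
            | "https://boards.4chan.org/" + $b + "/thread/" + (.no | tostring)'
    sleep 60    # the API rules say max one request per second; this needs far less
done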

>jpgs
>pngs
literally a whitelist, fkin useless
nice skeleton tho
look, just grab jq and writing a proper bash downloader becomes easy as pie

nice
should be able to extend this to symlink the original filenames too, methinks
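maybe something like this (untested sketch; .filename is the API's field for the original name sans extension, per the docs, and board/thread are assumed set as in the function above):

curl -s "https://a.4cdn.org/${board}/thread/${thread}.json" \
    | jq -r '.posts[] | select(.tim) | "\(.tim)\(.ext)\t\(.filename)\(.ext)"' \
    | while IFS=$'\t' read -r stored orig; do
        wget -nc "https://i.4cdn.org/${board}/${stored}"
        ln -s -- "$stored" "$orig" 2>/dev/null || echo "collision: $orig" >&2
    done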

>doesn't know Firefox has a save function

GNU wget

Wrote this for myself in C, feel free to use it

gitgud.io/sen/fsic
fuck me

>[ ]
Old-style test; use [[ ]] in bash.
>= 0
Should be -eq 0, or (( $# == 0 )).
Always double quote your variables unless they're direct integers. This is some really sloppy code, user.
>grep -oh 'i.4cdn.org/*/.*jpg' tmp.html > jpgs.txt
>grep -oh 'i.4cdn.org/*/.*png' tmp.html > pngs.txt
Don't write out to a file, what the fuck are you doing? Pipe into a while loop.
And more importantly, why the fuck are you parsing HTML when there's an API made specifically for this kind of thing? See >Coding isn't for everyone.
I've never seen such strong irony in my life before.
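For reference, the top of the script done by the book would look something like this (sketch):

if (( $# == 0 )); then
    printf 'Usage: %s [thread url]\n' "$0" >&2
    exit 1
fi

link="$1"
board="${link#*://*/}"; board="${board%%/*}"
thread="${link##*/}"
dir="${board}-${thread}"
mkdir -p -- "$dir" && cd -- "$dir" || exit 1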

...

Holy shit, learn to make webms better.

I need to increase the bitrate, right?

Yes, and also lower the resolution; there's absolutely no reason to have it at full 1080p.
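e.g. with ffmpeg, something like this (the numbers are just a starting point; swap in -c:v libvpx if the board only takes VP8, and -an drops audio for boards without sound):

ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libvpx-vp9 -b:v 800k -an output.webm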

the script btw
pastebin.com/tT62ZWcr

Mine's in C, but it's about 400 lines (though it has additional comforts and its own complete download function and shit). This script gives me the desire to go back and optimize my code. I just love how short it is.

Wrote this about 5 years ago to download images, still works.

fuggg, it has 3 download functions. I forgot to remove the other 2.
1/4 of my program is unused download functions

are you downloading entire boards?
That's some next-level shit but waaaayyy too risky.
damn, I thought I was taking risks blindly downloading random b threads back in the day

What? It downloads images from a thread into a directory, that's all.

This is the reason Chad bullied you in school.

weak b8 m8

>pastebin.com/tT62ZWcr
fuckin' whitelists again
cmon guys, surely you can write dat shite without resorting to hardcoding?

so the number is the thread number? odd approach.
I just pulled that information from the url instead of having to manually select it every time

This is why bash is better than Python.
#!/bin/bash
xdotool search "Mozilla Firefox" windowactivate --sync key --clearmodifiers F6 ctrl+c
url="$(xsel -bo)"
board="$(printf -- '%s' "${url:?}" | cut -d '/' -f4)"
thread="$(printf -- '%s' "${url:?}" | cut -d '/' -f6)"
wget -qO- "a.4cdn.org/${board}/thread/${thread}.json" | jq -r '
.posts
| map(select(.tim != null))
| map((.tim | tostring) + .ext)
| map("

How do you think images are identified?
>hurr
>If(file_ends_with(.*)) download();

>xdotool firefox f6 ^c
ooh, me likey
thanks for teaching me how to do that

>so the number is the thread number
Yes, I know it would be much easier to just copy-paste the link, but the program doesn't only support Sup Forums, so I opted to just limit the options in the GUI instead of manually parsing whatever string the user could enter and then going to all the trouble of deducing what website it needs to connect to. Nobody ever used the program but me, and I don't even use it that much, so I don't mind.

tfw I'm just saving threads into my dedicated Sup Forums folder. I hope I will reach your level of expertise one day.

>extensions=('.webm','.gif','.jpeg','.jpg','.png')
>thumbnails=('s.jpg','s.png','s.jpeg')
>if (files.endswith(extensions) & title.endswith(extensions)) & (not (files.endswith(thumbnails) & title.endswith(thumbnails)))
you're hardcoding

you can instead just select the appropriate nodes in the DOM, ffs
does the job, but is sloppy
taking a step back, the first question should be: why not use the API?

4chins has an API?

Giving up on filenames seems preferable when you collect meme images, where you don't necessarily have one artist and title with one clear alphabetic name. People don't stick to any formatting norm either.

For later retrieval, it's much easier to just have tags, like in a booru, a desktop search/tag engine, or hydrus.

>something something hydrus you pack of

Use the API and filter out the results.

10 points for appropriate pic

yep, jewgle it
also, see (jewgle hydrus network)

oh yeah, I know that feel.
I was going to do the same thing but never got around to it.
I was just going to butcher the URL down to x.com or something, then look at the first part and direct it to the section of the program for that particular site or whatever

>there's an API

you learn something new every day

>dont hardcode here, youre an idiot
>you need to hardcode over here and use whatever meme apis I like
how is that not hardcoding?

you understand these aren't Sup Forums clients, they're image scrapers. The nature of scraping is doing shit yourself, because most sites don't provide an API

there were already half a dozen mentions of an API in this thread before you geniuses got the hint
how can you be this dense?

I'll admit, I didn't read the thread

also, always do a 'just in case'

wget -nc -nd -nv -e robots=off -ER html,s.jpg -rHD i.4cdn.org

If you scrape everything, you're bound to get useless shit. Don't be a moron, use formatting to find useful data.

inb4404.py does what you want, I think. It has an option for original filenames.

>why not use the API?
The API has rules and you have to learn the calls, constantly consulting the docs. The API is fine for projects like Clover or some other thread-browsing application, but it's overkill for a simple thread scraper.

I use jdownloader for threads here on Sup Forums.

>you understand these aren't Sup Forums clients, they're image scrapers
if we're talking about generic image scrapers, sure. But this thread:
>Topic: Sup Forums thread downloader
>Isn't there software that doesn't suck that can download all files from a thread with their original filenames?
so no, if your scope is "fuckin fetch all images from a Sup Forums thread with their original filenames", the hardcoded-extensions sloppy-match approach is retarded. Also, not using the API (with e.g. jq) is retarded, with the exception of a few scenarios.

explain?

Sup Forums's API is literally dumb as bricks.
github.com/4chan/4chan-API

>Do not make more than one request per second.
Fuck you.

what you call whitelists are actually just filters.
I'm not the guy whose code you referenced, but my scraper checks file extensions just like the other guy's, and it eliminates all garbage. I have only a couple of filters and they remove ads, thumbnails, and banners. Worked fine for many years now.

>8-liner using API, bash, jq and wget
>3-20 times larger messes doing sloppy html fuckery
>"API is [...] overkill for a simple thread scraper"
uhh, you sure you're not from Bizarro world?

man, that was a single request

Downloads all files from a thread except thumbnails into the current directory. Check the manpage for options.

>I have a job and a family
Reddit might be more your speed then.

That's only for the API requests. You can download the images as frequently as you want, however.

>3-20 times larger messes doing sloppy html fuckery
Are you autistic about the size of the message being 20KB larger?

Still beats being hostage to an API that can disappear any second, rendering your application useless. I'd rather count on my shit than on someone else's, and Sup Forums's shit isn't exactly reliable.

wget \
--recursive \
--no-directories \
--directory-prefix=$HOME/foo/bar/corn \
--accept '.jpg,.jpeg,.png,.gif,.webm' \
--reject '*s.jpg' \
--span-hosts \
--domains=i.4cdn.org \
--execute robots=off \
--wait=2 \
--random-wait \
--limit-rate=800k \
[Sup Forums-thread-url]

>Worked fine for many years now.
not arguing against that, just saying it's sloppy shit that is more complicated and fragile than needed
e.g.
>I have only a couple of filters and they remove ads, thumbnails, and banners
you don't need fucking filters, or any logic to remove ads/thumbnails/banners
the image URLs - all of them, excluding the ads/banners/etc - can be selected with a single path specifier in the DOM
just do a fuckin inspect element in a thread and literally copy-paste (+tweak) dat shit as a selector in your html parsing library of choice
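concretely, something like this (sketch; pup is a CLI HTML selector tool, and div.fileText is what the file-info element is called in current thread markup afaik - the hrefs come out protocol-relative):

curl -s "$LINK" \
    | pup 'div.fileText a attr{href}' \
    | sed 's|^//|https://|' \
    | wget -nc -i -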

dude's talking about 4chanx and botnet chrome extensions.
I seriously doubt he knows the difference between a run-of-the-mill scraper and a proper archiver.

Also, saving images with the unedited original file name is just asking for trouble with false duplicates when you go to organize your porn hoard. I run into this issue frequently with thread names being reused.

the original file name can be retrieved by just locating the "file:" position in the html when you go to get the image urls.
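right, and with the selector approach above it's just the link text of the same node, e.g. (sketch; pup's json{} output carries href and text, if memory serves):

curl -s "$LINK" \
    | pup 'div.fileText a json{}' \
    | jq -r '.[] | "\(.href)\t\(.text)"'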

yeah, I got the intention
manpages are garbage
I was asking for a tldr on the trick: how does wget build the appropriate URLs just from i.4cdn.org?
is -rHD doing something like "filter: only URLs containing this pattern"?

honestly don't know what any of that is, I haven't fucked with web development since HTML 3 or 4 was hot shit.
I get the HTML and retrieve the images.

If you would like to explain what you're talking about, I'll read it and consider that approach, but at the moment it sounds like some proprietary shit that would only work on Sup Forums.

love it, actually
readable!

>Are you autistic about the size of the message being 20KB larger?
reading comprehension, motherfucker
I'm telling you your code is amateur-tier cuz you can't into DOM selectors, so you resort to shitty fragile whitelist-filtering heuristics

>Still beats being hostage to an API that can disappear any second, rendering your application useless. I'd rather count on my shit than on someone else's, and Sup Forums's shit isn't exactly reliable.
it's an 8-liner that works now, it's a trivial enough piece of code that losing it is no biggie
not like the HTML format can't change enough to cause wrong behavior in your hacked-together amateur shit

github.com/AmmarkoV/MyScripts/blob/master/Tools/4chanleech

>saving images with the unedited original file name is just asking for trouble
yup, agreed
would need fallback strats, not just "overwrite if it exists"
personally, I'm leaning towards the hydrus network / git / general CAS approach
- name is a hash of the content
- metadata (download filename / full filename, tags) kept separately and linking toward the content
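in shell terms the gist would be something like (sketch):

store() {
    local f="${1:?}" h ext
    h="$(sha256sum -- "$f" | cut -d ' ' -f1)"
    ext="${f##*.}"
    mv -n -- "$f" "${h}.${ext}"                  # content-addressed name
    printf '%s\t%s\n' "$h" "$f" >> index.tsv     # hash -> original name / tags
}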

>not like the HTML format can't change enough to cause wrong behavior in your hacked-together amateur shit
You'd be surprised, the essential parts have been the same for 5+ years now. The API has only been around for like what, 2-3 years? No thanks, I'll stick to my shit that works and will work after your hiroshima decides to pull the plug on the API. I'm also not the guy you were originally replying to.

I've often looked into hydrus. Currently I like to name stuff with what I like followed by its original reupload number, like "cool_wallpaper - 143252351243.jpg", and threads in the format "wg_wallpaper thread - 2152354235".

With hydrus you say the name is a hash? I was under the impression that hydrus could be used to identify files and possibly retrieve their proper name, for example the name of a model.
How does hydrus handle resized images? Naturally I want the best resolution and highest quality.

>You'd be surprised, the essential parts have been the same for 5+ years now
wouldn't be surprised
>The API has only been around for like what, 2-3 years?
doesn't mean it will/won't have equivalent stability for our babby-tier purposes
>No thanks, I'll stick to my shit that works
sure thing, not saying you shouldn't use it
just stating my opinion on the simplicity/elegance/robustness of the alternative approaches
> and will work after your hiroshima decides to pull the plug on the API.
will work till the HTML deviates enough
same deal with the API
given mook's history, I guess he'll only remove the API (i.e. do actual work) if it'd benefit him in some way
so I'm not too worried about the tiny risk of the tiny impact of losing an 8-liner's worth of effort
> I'm also not the guy you were originally replying to.
dont matter to me none
that's the beauty of user discussion - I'm replying to the content, not the person

no clue, just learned about it today
but I saw the guide mention it doesn't keep the original filenames in its own copy, so I'd guess it goes for either a content hash or some other sort of uid

Here we go, a powershell image downloader based on . No extension-based filtering or anything.
function Get-4chanImages ($Board,$ThreadNumber,$OutputPath) { wget a.4cdn.org/$Board/thread/$ThreadNumber.json|% content|ConvertFrom-Json|% posts|? filename -ne $null|%{ wget "i.4cdn.org/$Board/$($_.tim)$($_.ext)" -OutFile "$OutputPath\$($_.tim)$($_.ext)" } }

Usage:
Get-4chanImages g 64369961 D:\tmp

First goes the board letter, then the thread number, and lastly the output path; if omitted, images will be saved to the current location.

First fag to make a Fire/Waterfox addon with a big dumb "download thread" button on the toolbar wins the game.

looks about right, take a look at the jq approaches above
PS's integrated object-oriented shite (including ConvertFrom-Json) certainly looks alluring

>powershell is a real boy noaw *u*

>CTRL+F ctw
>CTRL+F chanthreadwatch
>CTRL+F supergouge

Here is EXACTLY what OP was looking for:
github.com/SuperGouge/ChanThreadWatch

Did all of you start lurking in 2018?

>c#

Just download the binaries.
I highly doubt someone who'd make this thread would care about that.

It needs .NET which means it sucks.

Again, I highly doubt someone who'd make this thread would care about that.

>not coding in C or C++ and using the .net framework

>i highly doubt someone would object to install a bloated heap of steaming hot shit for a simple thread scraper

>not coding in C or C++ and using the .net framework

>bloated
Let me laugh at you for a minute

.NET is fucking dogshit, just fucking use MSVC if you want to suck microshaft's cock, .NET is absolute cancer.

>minute
I'm so sorry you can't laugh for much longer because of that microshaft dick clogging your throat.

Well, maybe I want Microshoft's cock down my throat, ever considered that?

You are gay and a faggot.

Seek help.