Sup Forums thread downloader

Isn't there software that doesn't suck that can download all files from a thread with their original filenames?
All the ones I found are trash, and 4chanx doesn't seem to have such an option.
No botnet-tier chrome extensions pls.

Just write one yourself.

I've been using ripme.jar

github.com/4pr0n/ripme

Clover can download all files with filenames and puts each thread in a separate folder

Implying I know how.
Pretty good, but it doesn't save filenames, right? I'd like to use it for a webm database.
I know, but it's on mobile; I don't want to download a thread and then have to export everything to my computer every time.

Just fucking write one yourself. Look. Mine's done in fucking python 3.X. It's 121 lines long because I was verbose with the documentation and liked the idea of expanding it later, so some of it is more complex than it needs to be (calling the API, for example). Some anons have managed an image grabber using one line of bash script. This is not fucking hard.
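For reference, the one-line bash version of the idea, straight off the JSON API (untested sketch; board g and the thread number are placeholders, and it assumes you have jq and wget):

curl -s 'https://a.4cdn.org/g/thread/64369961.json' | jq -r '.posts[] | select(.tim) | "https://i.4cdn.org/g/\(.tim)\(.ext)"' | wget -nc -i -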

I don't know how to code and don't have the time or the will to learn.

Then kindly fuck off, little princess. You've been given multiple solutions and Sup Forums isn't your personal fucking helpdesk anyway.

I have a job and a family; when I come back from work, I just want to spend time with my family or lurk Sup Forums, not work even more.

Fuck you, neither did I. I literally wrote one line every 5 minutes with a dozen or so "do [action] in python 3" searches open in tabs. I only stopped because the scope of what I was trying to accomplish was getting out of hand. Also, I decided to add the option to preserve original filenames in my script just to gloat over having built what you want but are too lazy and incompetent to make yourself.

Coding isn't for everyone. Here's a bash script

#!/bin/bash
#
# Get all images from Sup Forums thread

if [ $# = 0 ]; then
echo 'Usage: ./4chget.sh [thread url]'
exit
fi

LINK=$1
BOARD=${LINK#http*://*/*}
THREAD=${BOARD##*/*/}
BOARD=${BOARD%%/*}
DIR="$BOARD-$THREAD"

echo 'Board: ' $BOARD

mkdir -p $DIR ; cd $DIR

curl $LINK -o 'orig.html'
tr ' ' '\n' < orig.html > tmp.html
grep -oh 'i.4cdn.org/*/.*jpg' tmp.html > jpgs.txt
grep -oh 'i.4cdn.org/*/.*png' tmp.html > pngs.txt
if [[ -s jpgs.txt ]]; then
sort -u jpgs.txt | grep -v 's.*' | wget -i -
fi
if [[ -s pngs.txt ]]; then
sort -u pngs.txt | grep -v 's.*' | wget -i -
fi

#clean up
rm orig.html tmp.html jpgs.txt pngs.txt

...

Just learn python. It won't be that difficult.

Basically this.

4dl() {
board="$(printf -- '%s' "${1:?}" | cut -d '/' -f4)"
thread="$(printf -- '%s' "${1:?}" | cut -d '/' -f6)"
wget -qO- "a.4cdn.org/${board}/thread/${thread}.json" | jq -r '
.posts
| map(select(.tim != null))
| map((.tim | tostring) + .ext)
| map("

Is there a way to "watch" for when a thread is created with certain keywords?
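
Not out of the box afaik, but the catalog endpoint makes polling for it trivial. Rough sketch (a.4cdn.org/[board]/catalog.json is part of the same JSON API; the board and pattern are just examples):

#!/bin/bash
# poll the catalog and print threads whose subject or comment match a pattern
board=g
pattern='thread downloader'
while true; do
    curl -s "https://a.4cdn.org/${board}/catalog.json" \
        | jq -r --arg re "$pattern" --arg b "$board" '
            .[].threads[]
            | select(((.sub // "") + " " + (.com // "")) | test($re; "i"))
            | "https://boards.4chan.org/" + $b + "/thread/" + (.no | tostring)'
    sleep 60    # the API rules say max one request per second; this needs far less
done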

>jpgs
>pngs
literally a whitelist, fkin useless
nice skeleton tho
look, just grab jq and writing a proper bash downloader becomes easy as pie

nice
should be able to extend this to symlink the original filenames too, methinks
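maybe something like this (untested sketch; .filename is the API's field for the original name sans extension, per the docs, and board/thread are assumed set as in the function above):

curl -s "https://a.4cdn.org/${board}/thread/${thread}.json" \
    | jq -r '.posts[] | select(.tim) | "\(.tim)\(.ext)\t\(.filename)\(.ext)"' \
    | while IFS=$'\t' read -r stored orig; do
        wget -nc "https://i.4cdn.org/${board}/${stored}"
        ln -s -- "$stored" "$orig" 2>/dev/null || echo "collision: $orig" >&2
    done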

>doesn't know Firefox has a save function

GNU wget

Wrote this for myself in C, feel free to use it

gitgud.io/sen/fsic
fuck me

>[ ]
Old-style test; use [[ ]] in bash.
>= 0
Should be -eq 0, or (( $# == 0 )).
Always double quote your variables unless they're direct integers. This is some really sloppy code, user.
>grep -oh 'i.4cdn.org/*/.*jpg' tmp.html > jpgs.txt
>grep -oh 'i.4cdn.org/*/.*png' tmp.html > pngs.txt
Don't write out to a file, what the fuck are you doing? Pipe into a while loop.
And more importantly, why the fuck are you parsing HTML when there's an API made specifically for this kind of thing? See >Coding isn't for everyone.
I've never seen such strong irony in my life before.
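For reference, the top of the script done by the book would look something like this (sketch):

if (( $# == 0 )); then
    printf 'Usage: %s [thread url]\n' "$0" >&2
    exit 1
fi

link="$1"
board="${link#*://*/}"; board="${board%%/*}"
thread="${link##*/}"
dir="${board}-${thread}"
mkdir -p -- "$dir" && cd -- "$dir" || exit 1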

...

Holy shit, learn to make webms better.

I need to increase the bitrate, right?

Yes, and also lower the resolution; there's absolutely no reason to have it at full 1080p.
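e.g. with ffmpeg, something like this (the numbers are just a starting point; swap in -c:v libvpx if the board only takes VP8, and -an drops audio for boards without sound):

ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libvpx-vp9 -b:v 800k -an output.webm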

the script btw
pastebin.com/tT62ZWcr

Mine's in C, but it's about 400 lines (though it has additional comforts and its own complete download function and shit). This script gives me the desire to go back and optimize my code. I just love how short it is.

Wrote this about 5 years ago to download images, still works.

fuggg, it has 3 download functions. I forgot to remove the other 2.
1/4 of my program is unused download functions

are you downloading entire boards?
That's some next-level shit but waaaayyy too risky.
damn, I thought I was taking risks blindly downloading random b threads back in the day

What? It downloads images from a thread into a directory, that's all.

This is the reason Chad bullied you in school.

weak b8 m8

>pastebin.com/tT62ZWcr
fuckin' whitelists again
cmon guys, surely you can write dat shite without resorting to hardcoding?

so the number is the thread number? odd approach.
I just pulled that information from the url instead of having to manually select it every time

This is why bash is better than Python.
#!/bin/bash
xdotool search "Mozilla Firefox" windowactivate --sync key --clearmodifiers F6 ctrl+c
url="$(xsel -bo)"
board="$(printf -- '%s' "${url:?}" | cut -d '/' -f4)"
thread="$(printf -- '%s' "${url:?}" | cut -d '/' -f6)"
wget -qO- "a.4cdn.org/${board}/thread/${thread}.json" | jq -r '
.posts
| map(select(.tim != null))
| map((.tim | tostring) + .ext)
| map("

How do you think images are identified?
>hurr
>If(file_ends_with(.*)) download();

>xdotool firefox f6 ^c
ooh, me likey
thanks for teaching me how to do that

>so the number is the thread number
Yes, I know it would be much easier to just copy-paste the link, but the program doesn't only support Sup Forums, so I opted to just limit the options in the GUI instead of manually parsing whatever string the user could enter and then going to all the trouble of deducing what website it needs to connect to. Nobody ever used the program but me, and I don't even use it that much, so I don't mind.

tfw I'm just saving threads into my dedicated Sup Forums folder. I hope I will reach your level of expertise one day.

>extensions=('.webm','.gif','.jpeg','.jpg','.png')
>thumbnails=('s.jpg','s.png','s.jpeg')
>if (files.endswith(extensions) & title.endswith(extensions)) & (not (files.endswith(thumbnails) & title.endswith(thumbnails)))
you're hardcoding

you can instead just select the appropriate nodes in the DOM, ffs
does the job, but is sloppy
taking a step back, the first question should be: why not use the API?

4chins has an API?

Giving up on filenames seems preferable when you collect meme images, where you don't necessarily have one artist and title with one clear alphabetic name. People don't stick to any formatting norm either.

For later retrieval, it's much easier to just have tags, like in a booru, a desktop search/tag engine, or hydrus.

>something something hydrus you pack of

Use the API and filter out the results.

10 points for appropriate pic

yep, jewgle it
also, see (jewgle hydrus network)

oh yeah, I know that feel.
I was going to do the same thing but never got around to it.
I was just going to butcher the URL down to x.com or something, then look at the first part and direct it to the section of the program for that particular site or whatever

>there's an API

you learn something new every day

>dont hardcode here, youre an idiot
>you need to hardcode over here and use whatever meme apis I like
how is that not hardcoding?

you understand these aren't Sup Forums clients, they're image scrapers. The nature of scraping is doing shit yourself, because most sites don't provide an API

there were already half a dozen mentions of an API in this thread before you geniuses got the hint
how can you be this dense?

I'll admit, I didn't read the thread

also, always do a 'just in case'

wget -nc -nd -nv -e robots=off -ER html,s.jpg -rHD i.4cdn.org

If you scrape everything, you're bound to get useless shit. Don't be a moron, use formatting to find useful data.

inb4404.py does what you want, I think. It has an option for original filenames.

>why not use the API?
The API has rules and you have to learn the calls, constantly consulting the docs. The API is fine for projects like Clover or some other thread-browsing application, but it's overkill for a simple thread scraper.

I use jdownloader for threads here on Sup Forums.

>you understand these aren't Sup Forums clients, they're image scrapers
if we're talking about generic image scrapers, sure. But this thread:
>Topic: Sup Forums thread downloader
>Isn't there software that doesn't suck that can download all files from a thread with their original filenames?
so no, if your scope is "fuckin fetch all images from a Sup Forums thread with their original filenames", the hardcoded-extensions sloppy-match approach is retarded. Also, not using the API (with e.g. jq) is retarded, with the exception of a few scenarios.

explain?

Sup Forums's API is literally dumb as bricks.
github.com/4chan/4chan-API

>Do not make more than one request per second.
Fuck you.

what you call whitelists are actually just filters.
I'm not the guy whose code you referenced, but my scraper checks file extensions just like the other guy's, and it eliminates all garbage. I have only a couple of filters and they remove ads, thumbnails, and banners. Worked fine for many years now.

>8-liner using API, bash, jq and wget
>3-20 times larger messes doing sloppy html fuckery
>"API is [...] overkill for a simple thread scraper"
uhh, you sure you're not from Bizarro world?

man, that was a single request

Downloads all files from a thread except thumbnails into the current directory. Check the manpage for options.

>I have a job and a family
Reddit might be more your speed then.

That's only for the API requests. You can download the images as frequently as you want, however.

>3-20 times larger messes doing sloppy html fuckery
Are you autistic about the size of the message being 20KB larger?

Still beats being hostage to an API that can disappear any second, rendering your application useless. I'd rather count on my shit than on someone else's, and Sup Forums's shit isn't exactly reliable.

wget \
--recursive \
--no-directories \
--directory-prefix=$HOME/foo/bar/corn \
--accept '.jpg,.jpeg,.png,.gif,.webm' \
--reject '*s.jpg' \
--span-hosts \
--domains=i.4cdn.org \
--execute robots=off \
--wait=2 \
--random-wait \
--limit-rate=800k \
[Sup Forums-thread-url]

>Worked fine for many years now.
not arguing against that, just saying it's sloppy shit that is more complicated and fragile than needed
e.g.
>I have only a couple of filters and they remove ads, thumbnails, and banners
you don't need fucking filters, or any logic to remove ads/thumbnails/banners
the image URLs - all of them, excluding the ads/banners/etc - can be selected with a single path specifier in the DOM
just do a fuckin inspect element in a thread and literally copy-paste (+tweak) dat shit as a selector in your html parsing library of choice
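concretely, something like this (sketch; pup is a CLI HTML selector tool, and div.fileText is what the file-info element is called in current thread markup afaik - the hrefs come out protocol-relative):

curl -s "$LINK" \
    | pup 'div.fileText a attr{href}' \
    | sed 's|^//|https://|' \
    | wget -nc -i -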

dude's talking about 4chanx and botnet chrome extensions.
I seriously doubt he knows the difference between a run-of-the-mill scraper and a proper archiver.

Also, saving images with the unedited original file name is just asking for trouble with false duplicates when you go to organize your porn hoard. I run into this issue frequently with thread names being reused.

the original file name can be retrieved by just locating the "file:" position in the html when you go to get the image urls.
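right, and with the selector approach above it's just the link text of the same node, e.g. (sketch; pup's json{} output carries href and text, if memory serves):

curl -s "$LINK" \
    | pup 'div.fileText a json{}' \
    | jq -r '.[] | "\(.href)\t\(.text)"'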

yeah, I got the intention
manpages are garbage
I was asking for a tldr on the trick: how does wget build the appropriate URLs just from i.4cdn.org?
is -rHD doing something like "filter: only URLs containing this pattern"?

honestly don't know what any of that is, I haven't fucked with web development since HTML 3 or 4 was hot shit.
I get the HTML and retrieve the images.

If you would like to explain what you're talking about, I'll read it and consider that approach, but at the moment it sounds like some proprietary shit that would only work on Sup Forums.

love it, actually
readable!

>Are you autistic about the size of the message being 20KB larger?
reading comprehension, motherfucker
I'm telling you your code is amateur-tier cuz you can't into DOM selectors, so you resort to shitty fragile whitelist-filtering heuristics

>Still beats being hostage to an API that can disappear any second, rendering your application useless. I'd rather count on my shit than on someone else's, and Sup Forums's shit isn't exactly reliable.
it's an 8-liner that works now, it's a trivial enough piece of code that losing it is no biggie
not like the HTML format can't change enough to cause wrong behavior in your hacked-together amateur shit

github.com/AmmarkoV/MyScripts/blob/master/Tools/4chanleech

>saving images with the unedited original file name is just asking for trouble
yup, agreed
would need fallback strats, not just "overwrite if it exists"
personally, I'm leaning towards the hydrus network / git / general CAS approach
- name is a hash of the content
- metadata (download filename / full filename, tags) kept separately and linking toward the content
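in shell terms the gist would be something like (sketch):

store() {
    local f="${1:?}" h ext
    h="$(sha256sum -- "$f" | cut -d ' ' -f1)"
    ext="${f##*.}"
    mv -n -- "$f" "${h}.${ext}"                  # content-addressed name
    printf '%s\t%s\n' "$h" "$f" >> index.tsv     # hash -> original name / tags
}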

>not like the HTML format can't change enough to cause wrong behavior in your hacked-together amateur shit
You'd be surprised, the essential parts have been the same for 5+ years now. The API has only been around for like what, 2-3 years? No thanks, I'll stick to my shit that works and will work after your hiroshima decides to pull the plug on the API. I'm also not the guy you were originally replying to.

I've often looked into hydrus. Currently I like to name stuff with what I like followed by its original reupload number, like "cool_wallpaper - 143252351243.jpg", and threads in the format "wg_wallpaper thread - 2152354235".

With hydrus you say the name is a hash? I was under the impression that hydrus could be used to identify files and possibly retrieve their proper name, for example the name of a model.
How does hydrus handle resized images? Naturally I want the best resolution and highest quality.

>You'd be surprised, the essential parts have been the same for 5+ years now
wouldn't be surprised
>The API has only been around for like what, 2-3 years?
doesn't mean it will/won't have equivalent stability for our babby-tier purposes
>No thanks, I'll stick to my shit that works
sure thing, not saying you shouldn't use it
just stating my opinion on the simplicity/elegance/robustness of the alternative approaches
> and will work after your hiroshima decides to pull the plug on the API.
will work till the HTML deviates enough
same deal with the API
given mook's history, I guess he'll only remove the API (i.e. do actual work) if it'd benefit him in some way
so I'm not too worried about the tiny risk of the tiny impact of losing an 8-liner's worth of effort
> I'm also not the guy you were originally replying to.
dont matter to me none
that's the beauty of user discussion - I'm replying to the content, not the person

no clue, just learned about it today
but I saw the guide mention it doesn't keep the original filenames in its own copy, so I'd guess it goes for either a content hash or some other sort of uid

Here we go, a powershell image downloader based on . No extension-based filtering or anything.
function Get-4chanImages ($Board,$ThreadNumber,$OutputPath) { wget a.4cdn.org/$Board/thread/$ThreadNumber.json|% content|ConvertFrom-Json|% posts|? filename -ne $null|%{ wget "i.4cdn.org/$Board/$($_.tim)$($_.ext)" -OutFile "$OutputPath\$($_.tim)$($_.ext)" } }

Usage:
Get-4chanImages g 64369961 D:\tmp

First goes the board letter, then the thread number, and lastly the output path; if omitted, images will be saved to the current location.

First fag to make a Fire/Waterfox addon with a big dumb "download thread" button on the toolbar wins the game.

looks about right, take a look at the jq approaches above
PS's integrated object-oriented shite (including ConvertFrom-Json) certainly looks alluring

>powershell is a real boy noaw *u*

>CTRL+F ctw
>CTRL+F chanthreadwatch
>CTRL+F supergouge

Here is EXACTLY what OP was looking for:
github.com/SuperGouge/ChanThreadWatch

Did all of you start lurking in 2018?

>c#

Just download the binaries.
I highly doubt someone who'd make this thread would care about that.

It needs .NET which means it sucks.

Again, I highly doubt someone who'd make this thread would care about that.

>not coding in C or C++ and using the .net framework

>i highly doubt someone would object to install a bloated heap of steaming hot shit for a simple thread scraper

>not coding in C or C++ and using the .net framework

>bloated
Let me laugh at you for a minute

.NET is fucking dogshit, just fucking use MSVC if you want to suck microshaft's cock, .NET is absolute cancer.

>minute
I'm so sorry you can't laugh for much longer because of that microshaft dick clogging your throat.

Well, maybe I want Microshoft's cock down my throat, ever considered that?

You are gay and a faggot.

Seek help.