"Hacking"

I'm trying to get a textbook; the only copy I found online was on a Chinese site with a download paywall (but free viewing).

What I've done so far:
I inspected the source and wrote a script (shown in the pic) to generate 500+ links, one for every page.

What I think I need to do:
Automate Chrome to do this for me (load the page, wait 5-10 seconds for it to load, then use the "Print to PDF" functionality and number each file 1.pdf, 2.pdf, etc.). The only problem is I don't know how to do this (I don't know JS; if I need it, what would I need to learn?).
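From a quick search it looks like Chrome has a headless --print-to-pdf flag, so maybe something like this (completely untested, and I doubt headless Chrome renders the Flash at all, so it might only capture the wrapper page):

#!/bin/bash
# Untested sketch: headless Chrome can print a URL straight to a PDF.
# 'google-chrome' may be 'chromium' or 'chrome' depending on the install.
for i in $(seq 1 500); do
    google-chrome --headless --disable-gpu \
        --virtual-time-budget=10000 \
        --print-to-pdf="$i.pdf" \
        "http://docread.mbalib.com/read/768f505c1c5ed5f21c5bc0874b711328?num=$i&code=35bd670302aea685074b3f8e29015d52&max=0"
done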

That's what I think I need to do, any other ideas maybe?

Both links:

Source of pages (change the number in num= to get the desired page): docread.mbalib.com/read/768f505c1c5ed5f21c5bc0874b711328?num=1&code=35bd670302aea685074b3f8e29015d52&max=0

Original site: doc.mbalib.com/view/768f505c1c5ed5f21c5bc0874b711328.html

If you can get it in plain text or HTML, why not use something like Pandoc to convert it to PDF?
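Something like this, assuming you can save the pages as HTML (the filenames here are just placeholders, and PDF output needs a working LaTeX install):

# pandoc concatenates multiple input files in the order given
pandoc page1.html page2.html -o book.pdf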

Can't check the original site, it doesn't load for me.

The thing is, it's not plaintext, it's something weird... (Flash).

Also, the second link should work.

Tried running it; seems like I'll have to clean it up manually before converting:

Package inputenc Error: Unicode char 智 (U+667A)
(inputenc) not set up for use with LaTeX.
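Maybe forcing xelatex with a CJK font would sidestep the inputenc errors instead of stripping the characters; just a guess, untested (on older pandoc the flag is --latex-engine, and this assumes a Noto CJK font is installed; book.html is a placeholder name):

# xelatex handles Unicode natively, unlike the default pdflatex route
pandoc book.html -o book.pdf --pdf-engine=xelatex -V CJKmainfont="Noto Sans CJK SC"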

Perhaps Flash to HTML5 to PDF? There should be some conversion tools around.

Okay, I cleaned it up and it kind of works. The only thing is that when you save the page source, Google Chrome doesn't save everything.

The content is saved in "unnamed" files (with no extension), and Chrome only saves about 7 or 8 of them.

How do I download everything after waiting for it all to load?

I wonder if gnash could render these SWFs correctly; that would save a lot of time...

Why do you want the textbook to be split in 500 different PDFs? Why not just one PDF with 500 pages?

I guess there are separate programs that can append them together later...
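pdfunite from poppler-utils should do it, for example, assuming the pages end up named 1.pdf through 500.pdf (untested):

# pdfunite concatenates its inputs in argument order; the last argument is the output
pdfunite $(seq -f "%g.pdf" 1 500) book.pdf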

in bash

for i in {1..500}
do
wget "docread.mbalib.com/read/768f505c1c5ed5f21c5bc0874b711328?num="$i"&code=35bd670302aea685074b3f8e29015d52&max=0"
done

I think that should grab the whole page. It's behind that flash/embed, but I think it's just the PDF content, which you should be able to merge with some PDF tool.

I'm not on a Linux box, so I can't test it though.

That way you'll end up with 500 SWFs, which is pretty disgusting. We just need a way to render them to PDF.

Ayyyyy, gnash renders the files fine. For some reason the site blocks the wget user-agent, so you'll need to use wget -U "Mozilla" "<url>".

That was the plan, yeah, but I want one PDF ultimately.

Here's your book: golibgen.io/view.php?id=1495599

Click on the GET! button.

ayy, nice find.

not OP, but I'll finish writing a ripper for this mbalib website because it could be useful in future.

thank you!!

If you don't mind, how did you do the ripping & merging to pdf? (assuming you ripped and uploaded it)

I didn't. There's a magical website called libgen (gen.lib.rus.ec) with all the books you'll ever need for college.

They're compressed SWF files ("cwf"), so you'll need to unzip them. Fuck knows how to parse SWF files though.

But yeah, OP, are you sure you can't get the book elsewhere?

easy, just use swfrender:

swfrender page.swf -X 1024 -o page.png

damn, I could swear I searched libgen and bookzz.org before doing this, apparently not enough

someone posted the link here, so it's done

That would be handy to have, if you don't mind could you send it (or github link or something) to [email protected]

OP here, I made a script (from info on here and on the internet). It's slow as fuck (any ideas on how to do this in C++?)

#!/bin/bash

mkdir scripty
cd scripty

for i in {1..553}
do
echo "Downloading file "$i"..."
wget -U "Mozilla" \
"docread.mbalib.com/read/768f505c1c5ed5f21c5bc0874b711328?num=$i&code=35bd670302aea685074b3f8e29015d52&max=0" -O "$i.cwf"
swfrender "$i.cwf" -X 1024 -o "$i.png"
rm "$i.cwf"
done
convert $(find -maxdepth 1 -type f -name '*.png' | sort -n | paste -sd\ ) output.pdf

mv output.pdf .. ##moves output.pdf to parent directory

cd ..
rm -r scripty

echo "done"

The fact that it's bash isn't what's slowing you down; it's that you're grabbing 553 SWF files and parsing them.
I'll take a look in a minute. How slow are we talking? You could probably do this async or something.

20 mins slow

Also, the page order in the output PDF was fucked up.

Also, it gave errors (probably info from the SWF/CWF getting lost). Is there a way to make a PDF directly from the CWF files, without converting them to PNG first?

>20 mins slow
Yeah, this is to be expected if you're scraping 500 SWF files. I'll try to improve on it a bit...

>also it gave errors (probably info from the SWF/CWF getting lost). Is there a way to make a PDF directly from the CWF files, without converting them to PNG first?
Not with swfrender; there might be alternative and better tools out there.
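For what it's worth, here's roughly what I'd change (untested): run the downloads a few at a time, and give convert an explicitly ordered file list. The shuffled pages come from find | sort -n: every line starts with "./", so the numeric sort treats them all as 0 and falls back to plain lexicographic order, putting 10.png before 2.png.

#!/bin/bash
# Untested sketch: same idea as the script above, but downloads run a few at a
# time and convert gets an explicitly ordered list, so the pages stay in order.
mkdir -p scripty
cd scripty

for i in $(seq 1 553); do
    (
        wget -q -U "Mozilla" \
            "docread.mbalib.com/read/768f505c1c5ed5f21c5bc0874b711328?num=$i&code=35bd670302aea685074b3f8e29015d52&max=0" \
            -O "$i.cwf"
        swfrender "$i.cwf" -X 1024 -o "$i.png"
        rm "$i.cwf"
    ) &
    # keep roughly 8 jobs in flight, then let the batch finish
    if (( i % 8 == 0 )); then wait; fi
done
wait

# seq lists 1.png ... 553.png in numeric order, unlike find | sort -n
convert $(seq -f "%g.png" 1 553) ../output.pdf

cd ..
rm -r scripty

echo "done"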