{S/E/D/HELP}

Is there a SED wizard in the audience?

This sed goes through the text and, for every line containing the keyword "conditions", prints the value that follows "subreddit_id" on that line:

sed -n '/conditions/{s/.*,"subreddit_id":"\([^"]*\)".*/\1/;p;}' /home/a/Desktop/subredditidsearch.txt

The contents of the data set "subredditidsearch.txt":

{"ups":1,"created_utc":"1204329608","subreddit":"reddit.com","link_id":"t3_69ta9","author_flair_text":null,"score":1,"subreddit_id":"t5_6","body":"What conditions would have to be met in order for the U.S. led invasion of Iraq to be considered a genocide?","name":"t1_c03bgax","distinguished":null,"edited":false,"parent_id":"t1_c03bg9s","archived":true,"author_flair_css_class":null,"gilded":0,"retrieved_on":1425832318,"id":"c03bgax","controversiality":0,"downs":0,"author":"bsiviglia9","score_hidden":false}
{"edited":false,"parent_id":"t1_c03bc4g","distinguished":null,"name":"t1_c03bgay","body":"You missed his their/they're mixup.","score":1,"subreddit_id":"t5_6","author_flair_text":null,"link_id":"t3_6aezn","ups":1,"subreddit":"reddit.com","created_utc":"1204329608","score_hidden":false,"author":"Cyrius","downs":0,"controversiality":0,"id":"c03bgay","retrieved_on":1425832318,"author_flair_css_class":null,"gilded":0,"archived":true}

Now, if I change "subreddit_id" to say "author", it will print the correct author, in this case "bsiviglia9".

Is there an easy way to modify this sed to print both "subreddit_id" and "author" when it finds the keyword? Any help would be greatly appreciated!

Why don't you go back to Reddit and ask them

/thread

I've had more luck with sed and regex questions in here than on reddit desu.

Besides, I'm trying to data mine reddit, not ask them for help.

>working with JSON
>not just using the right tool for the job like Node.js

Kill yourself

>JSON file named .txt

Literally kill yourself.

>>working with JSON
That's what the data set is structured as. Blame the dumper, not me.

>not just using the right tool for the job like Node.js

I have nearly two billion reddit comments to look through. With sed I manage to saturate the write speed of the HDDs I'm storing it on.

>JSON file named .txt
It's a set of strings, call the police

read speed*

>Node.js

heartylaugh.flv

bump

underrated post

Bump.
My current solution is to run the entire command once for every item I want to extract, but a full pass over the data set takes me 11 hours, so it would save me a lot of time if I could use a capture group or something to grab, say, both subreddit_id and author and have it print both. I've been trying various things but I can't seem to get it to work, the sex regex is very unfamiliar to me.

sed regex*

bumping for interest.

sed is terrible for this

sed works wonderfully, and the speed is incredible. It also handles a 1TB+ data set with ease.

It's actually perfect for the job.

>sed is perfect for parsing json
lmfao

>because it takes me 11 hours to run through the data set
Use a JSON parser and dump to database you utter retard.

>Use a JSON parser and dump to database you utter retard.

Name a JSON parser that can handle 6 billion lines at 200+ MB/s and I'll give it a try
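
Not claiming it will hit 200+ MB/s, but for comparison the same extraction as a jq one-liner looks roughly like this (a sketch; the // "" guard is only there in case some records have a null body):

jq -r 'select((.body // "") | contains("conditions")) | [.subreddit_id, .author] | @tsv' /home/a/Desktop/subredditidsearch.txt

It parses each line as JSON instead of pattern matching on it, so the key order doesn't matter; on the sample file it prints the same t5_6 / bsiviglia9 pair.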

I took a shot at it lol. maybe it helps? :)

sed -R "s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
(\w+).*/ID: [\1] by \3 contained topic with the word \2/" output.txt

I'll give it a try, hold on

So I tried the following:

sed -r "s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
> (\w+).*/ID: [\1] by \3 contained topic with the word \2/" /home/a/Desktop/subredditidsearch.txt

Which gives me:

sed: -e expression #1, char 59: unterminated `s' command

You're on Linux, right? If so, try replacing " with ' and use capital R instead of sed -r.

Yeah, Ubuntu.

Now I'm getting this:

a@1:~$ sed R 's/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
> (\w+).*/ID: [\1] by \3 contained topic with the word \2/' /home/a/Desktop/subredditidsearch.txt
sed: can't read s/.*subreddit_id...([a-z]{1}\d_\d).*(conditions).*author...
(\w+).*/ID: [\1] by \3 contained topic with the word \2/: No such file or directory
{"ups":1,"created_utc":"1204329608","subreddit":"reddit.com","link_id":"t3_69ta9","author_flair_text":null,"score":1,"subreddit_id":"t5_6","body":"What conditions would have to be met in order for the U.S. led invasion of Iraq to be considered a genocide?","name":"t1_c03bgax","distinguished":null,"edited":false,"parent_id":"t1_c03bg9s","archived":true,"author_flair_css_class":null,"gilded":0,"retrieved_on":1425832318,"id":"c03bgax","controversiality":0,"downs":0,"author":"bsiviglia9","score_hidden":false}
{"edited":false,"parent_id":"t1_c03bc4g","distinguished":null,"name":"t1_c03bgay","body":"You missed his their/they're mixup.","score":1,"subreddit_id":"t5_6","author_flair_text":null,"link_id":"t3_6aezn","ups":1,"subreddit":"reddit.com","created_utc":"1204329608","score_hidden":false,"author":"Cyrius","downs":0,"controversiality":0,"id":"c03bgay","retrieved_on":1425832318,"author_flair_css_class":null,"gilded":0,"archived":true}

-R. Also fix your filepath

>-R

sed: invalid option -- 'R'
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
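
For anyone following along: GNU sed has no -R option; the extended-regex flag is -E (or -r), and \d isn't a sed character class (use [0-9]). The "unterminated `s' command" error came from the line break in the middle of the quoted script, and the "can't read ... No such file or directory" one came from sed R '...' being parsed as the one-letter script R followed by file names. Something like this on a single line should be closer (untested sketch; it assumes "subreddit_id" comes before "conditions" and "author" on each matching line, which holds for the sample data):

sed -En 's/.*"subreddit_id":"([^"]*)".*conditions.*"author":"([^"]*)".*/\1 \2/p' /home/a/Desktop/subredditidsearch.txt

On the two sample lines that should print just "t5_6 bsiviglia9".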

No clue how fast your machine is. But here are fast full JSON parsers:

github.com/fabienrenaud/java-json-benchmark

github.com/miloyip/nativejson-benchmark

You're almost guaranteed to be bottlenecked on IO, not CPU, unless you're running a toaster.
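
If you want to check that on the sed pipeline itself, one rough way (assuming pv is installed, which is my assumption, not something from the thread) is to pipe the file through pv and watch the MB/s with the output thrown away:

pv /home/a/Desktop/subredditidsearch.txt | sed -n '/conditions/{s/.*,"subreddit_id":"\([^"]*\)".*/\1/;p;}' > /dev/null

If pv reports roughly the same rate as a plain pv file > /dev/null, the disks are the bottleneck, not sed.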

> I'm trying to data mine reddit, not ask them for help.
How about you first throw the data into something more suitable, such as Apache Spark?

It'll make the actual data mining far easier going forward.