I am building a search engine

Luis Ross

It is located at wibr.me. I am sick of google and the boring results it returns so I'm building my own. I want to rebuild the web to be like it was in the early days, where mostly personal and hobbyist type pages existed, that were lightweight and easy to read. Please try it out and give me feedback. If you want to submit a page to it, I'd appreciate it. There aren't many pages indexed at the moment.

May 5, 2017 - 11:47

Jeremiah Peterson

What is your policy on copyright?
Will your system be built on top of any other platform or system?
Will your system use or depend on third-party libraries or systems?
What languages do you use for programming?
Do you use any form of logging in your software?
Can we see the source?

May 5, 2017 - 11:52

Ryder Nelson

>What is your policy on copyright?
I'm just a one man show so I honestly haven't really considered this. I don't have a policy except to say if it's illegal where I live, I probably wouldn't be able to have it. :(

>Will your system be built on top of any other platform or system?
The LAMP stack. That's it.

>What languages do you use for programming?
The pages are PHP. The web crawler is pure C (for the speed).

>Do you use any form of logging in your software?
I really, really don't want to log people's IPs. That adds a huge layer of complexity and for what purpose? To feed to the government? Screw them. My fear is what would happen if they tried to compel me? This will probably never gain any real popularity so I don't think it would be a worry.

May 5, 2017 - 11:58

Noah Sanchez

Neat. I like the slim 'n cozy feel.

May 5, 2017 - 12:07

Lucas Green

Thanks

May 5, 2017 - 12:20

Anthony Foster

bump for interest

May 5, 2017 - 12:27

Evan Rogers

are you just not indexing big sites or something?
I searched for wikipedia and got stallman.org

May 5, 2017 - 12:27

Benjamin Rivera

stallman > wikipedia

But in all seriousness I don't think I want to index wikipedia or other wikis all that much. It's not trying to be like google, where it will deliver you the answer to the technical question you may search for at the expense of all the fun interesting sites being buried. I could never compete with them on that level anyway. Rather I want this to be more of a search engine for when you have no idea what you're looking for. Maybe you just have a general topic of interest. It's hard to put into words. Kind of just flying by the seat of my pants here.

May 5, 2017 - 12:34

Jaxon Wright

Also, all the big sites are bloated and full of scripts. They aren't going to make it on this search engine.

May 5, 2017 - 12:36

Joseph Powell

This might just be absolutely genius.

May 5, 2017 - 12:37

Colton Sanchez

results are a little weak

May 5, 2017 - 12:38

Josiah Foster

what do you mean don't know what you're looking for? how does it work?

May 5, 2017 - 12:41

Charles Allen

Its shit

May 5, 2017 - 12:42

Leo Lewis

:D
yeah, it will get better as I can improve the code behind it, but also there just aren't many pages indexed.

May 5, 2017 - 12:42

Jeremiah Anderson

...

May 5, 2017 - 12:44

Alexander Torres

Say you want to find pages about.. cats. But not a page on say, "cat hair removal". "Cat hair removal" is a technical question that google is best suited for. Of course, as more pages get indexed, this sort of query might actually work.

May 5, 2017 - 12:45

Christian Garcia

Sorry :(

May 5, 2017 - 12:46

Colton Stewart

but can't you just google "cat"? what does it do differently?

the surprise me feature is pretty good though

May 5, 2017 - 12:50

Joseph Brooks

What I hope to do, is when you search for 'cat', you will get a bunch of results of people's personal webpages, about their own cats. If you google 'cat', you get SPCA type results, or mainstream news articles, or pages that are mainly profit/ad driven that yield little genuine content. Those are not the type of results I want it to generate.

May 5, 2017 - 12:53

Gavin Carter

>tfw almost every search I've done pointed me to stallman.org
Is this some kind of joke

Also, I appreciate your efforts OP, but nothing I've searched for gave me meaningful results.

May 5, 2017 - 12:53

Mason Hernandez

ah okay makes sense now

you should put a tagline somewhere on the main page that explains that

May 5, 2017 - 12:58

Jackson Foster

heheh.
Well come back in a year, I hope to have more pages indexed by then.

May 5, 2017 - 12:59

Levi Cook

Yeah I think I might have to do just that

May 5, 2017 - 13:00

Jordan Cook

Searched for 007 and got this, that's the kind of thing you would get on old Google, good job OP.
goldeneye007.detstar.com

May 5, 2017 - 13:00

Jackson Perez

>one man show trying to build a search engine
>LAMP
>pure C "for speed"

I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.

May 5, 2017 - 13:05

Dylan Nguyen

Yeah those are the pages I need to find among all the crap on the interweb.

I come from the electronics/hardware side of things, long since graduated. SW is a skill I like to explore and keep up with. Maybe it looks half assed to you but I'm happy with the way it's turning out.

May 5, 2017 - 13:12

Nathaniel Bailey

I like this feature too

wibr.me/surprise/

Not gonna use this for anything I'm really trying to find out, but for finding things you don't know you want to know it seems to work great.

May 5, 2017 - 13:18

Grayson Nelson

>tfw this is the first website I found
motherfuckingwebsite.com/
Seems like I'll be using this often

May 5, 2017 - 13:28

Brandon Baker

Glad you like it. Only way to keep that interesting is to keep the same kind of requirements for what gets indexed.

May 5, 2017 - 13:41

Lucas Martin

I smell a retard.

May 5, 2017 - 13:43

Joseph Jenkins

That took me to
>angelfire.com/trek/caver

May 5, 2017 - 14:00

Thomas Roberts

vxheaven.org/
Neat

May 5, 2017 - 14:23

Levi Perry

Interesting stuff

May 5, 2017 - 14:24

Julian Nelson

yeah i'm liking the surprise feature. feels like late 90s, early 2000s. but then i'm just using it as a stumbleupon which was/is already a thing to go to random pages.

May 5, 2017 - 14:39

Jaxson Howard

Give sites that use non-free javascript lower search priority

Blacklist sites which abuse search hits for clicks.
For instance, sites that create empty pages for certain keystrokes like jiskha.com or chegg.com which puts everything behind a paywall (advertising).

May 5, 2017 - 14:58

Jackson Robinson

Need some work

May 5, 2017 - 15:03

Nolan Peterson

also let us submit sites to blacklist

you can start with twitter, youtube, google, wikipedia, facebook, reddit and Sup Forums.

Although some of these are indeed useful, they do not need to appear in search results.

May 5, 2017 - 15:08

Robert Bailey

You should make a search engine that ignores the top 1000 sites.

May 5, 2017 - 15:12

Chase Carter

These sites will never be able to be indexed on here. They are all retroactively banned I tell you.

You can help me with that...

May 5, 2017 - 15:15

Colton Wilson

>These sites will never be able to be indexed on here. They are all retroactively banned I tell you.
Except the chans

May 5, 2017 - 15:16

Benjamin Peterson

What an awful place to work. Are these tools even on the spectrum? What is privacy? What is personal space? Fucking disgusting

May 5, 2017 - 15:49

Christian Richardson

rude

May 5, 2017 - 15:50

Zachary Kelly

Maybe it's a good thing.

May 5, 2017 - 15:57

Liam Nelson

ultimately if you blacklist enough crap and set up filters, you could create a fairly neat search engine without having to manually approve each address.
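A rough sketch of what the blacklist half of that could look like. Nothing in the thread says how wibr.me actually does it; the domain list and the function names here are made up for illustration, and the entries are just sites that anons named above.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; the entries are only examples named in this thread.
BLACKLIST = {
    "twitter.com", "youtube.com", "facebook.com",
    "reddit.com", "chegg.com", "jiskha.com",
}

def host_of(url):
    """Return the lowercase hostname of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()

def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the host is a blacklisted domain or a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in blacklist)
```

Matching on the dot-prefixed suffix means www.reddit.com is caught while an unrelated host like notreddit.com is not.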

May 5, 2017 - 16:00

Bentley Bennett

>I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.

fucking savage I love it
this is so true

any lone man trying to create a search engine is either delusional, or a sophomore in uni

May 5, 2017 - 16:01

Nicholas Johnson

>learn linux
>learn php
>learn sql
>learn how search engines work
>improve coding skills
>can leverage new skills to get better job

hurr durr totally delusional.

May 5, 2017 - 16:05

Grayson Davis

bbspot.com/News/2000/5/clock_rift.html

Now this is the type of content I want on Sup Forums

May 5, 2017 - 16:05

Hudson Perry

Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?
I think it's cool but how do you make money? Or rather: Why should I use your search engine instead of just keep using google?
You're literally trying to make a separate web happen.
Yes technically every search engine is a separate web that allows you to peek into different parts of the internet.
But which sites will get a higher pagerank than other sites? Those that use less CSS? So a site that could be seriously popular might only show up on page 3 because it is too 'modern' or too corporate?

May 5, 2017 - 16:08

Ian James

user, be careful. One day, you *will* index CP. This could be a life ruining event for you. You are not Google and do not have over 9000 jewish lawyers on your payroll.

May 5, 2017 - 16:09

Zachary Davis

>linking to illegal material is now illegal
>???

May 5, 2017 - 16:12

Owen Evans

>>Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?

This I want to know, How does this algorithm work that filters out 'big' websites? Or do you possibly just go through your web crawler's results yourself?

May 5, 2017 - 16:21

Adam Price

Go set up a CP link site. Lemme know how that works out for you.

Yes, the law is retarded.

May 5, 2017 - 16:23

Grayson Evans

I'm not trying to replace or supplant google. This is as I said earlier up the thread, a totally different kind of approach. Google wants to index the entire web, and deliver you the exact answer you are looking for related usually to a technical question. Whereas I want my search engine to be more of a stroll in a cozy village. You wanted to go for a walk, you weren't sure exactly what you will run into, and you end up seeing some interesting things along the way. The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

Yes it's strange. It's not going to impress most people. Just a niche. But that's how it works. I'm fine with it.

The problem with google indexing literally everything, is that you end up with many awful, cancerous pages. They are primarily ad driven, click bait type pages that make your modern computer sweat. Or they are boring as hell associations, corporate landing pages, mainstream news articles, etc. The gem type pages I am looking for get lost and crowded out by all the other filth.

May 5, 2017 - 16:27

Jack Parker

>use the surprise feature
>get this
>mapsnet.org/pages/kcarr/student webpages/2012-2013/3rd_hr/vermeulen_webpage/index.html

May 5, 2017 - 16:28

Gabriel Long

I'm not sure I want to explain how that part works :D I will say however, that only the page you submit will get indexed. Thought about crawling other sub-pages and decided against it. It relies entirely on user submissions (or me finding the pages myself and submitting them).

May 5, 2017 - 16:35

Nathaniel Long

It's great that you're building your own search-engine because most search-engines are heavily censored and monitored.

There is one alternative that may interest you, yacy.net/

YaCy is P2P based and it's written in Java. It really isn't very good. I've been running it for years anyway, though.

You should probably look into that and think long and hard about how you want to be sorting your index.

YaCy has a ton of pages in its index, and since it's P2P the entire network "knows" about a large portion of the Internet. Where this project really breaks down is its ability to sort search-results: it doesn't. At all. It's almost like that code which supposedly ranks search-results simply pulls /dev/random and outputs the pages in that order.

May 5, 2017 - 16:41

Carson Torres

Thanks for the feedback yall, time for bed.

May 5, 2017 - 16:41

Justin Nelson

I hate the name. Try something more like "frontier.web". It's more meaningful to what it seems like you're trying to do, and I'm pretty sure that people are sick of meaningless names, especially if you're trying to distance yourself from the existing Silicon Valley circle jerk.

May 5, 2017 - 16:42

Samuel Carter

alright will check that out and thanks for the feedback. night

May 5, 2017 - 16:43

David Lewis

> The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

yeah but sites change their design when they get more popular. So at some point you may be removing sites from your search engine which became popular due to your search engine?
that's a funny scenario

May 5, 2017 - 16:44

Jackson Stewart

oh well :)

May 5, 2017 - 16:46

Josiah Carter

>more popular
I think you mean, "too mainstream".

May 5, 2017 - 16:48

Carson Sullivan

This is really neat user. I bookmarked this.

May 5, 2017 - 16:49

Aiden Howard

Any plans to monetize it yet? If so, how?

May 5, 2017 - 16:50

Aaron Torres

look, if you want actual advice: Make a search engine that *only* finds high performance websites. You ping them, if they respond and load well, you include them in search results.
You can start by just requiring sites to be under let's say 250kb or 500kb and go from there to more features.
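As a sketch only, the size and speed cutoff could be a single check on the downloaded page. The 250 KB limit comes from the suggestion above; the 2 second limit and all the names are assumptions, not anything OP has said the site does.

```python
# Hypothetical thresholds: 250 KB from the post above, 2 s picked arbitrarily.
MAX_BYTES = 250 * 1024
MAX_SECONDS = 2.0

def passes_performance_check(body, seconds,
                             max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS):
    """True if the page body is small enough and was fetched fast enough.

    body    -- raw bytes of the downloaded HTML document
    seconds -- how long the download took, measured by the caller
    """
    return len(body) <= max_bytes and seconds <= max_seconds
```

This only measures the HTML document itself. Images, scripts and stylesheets the page pulls in would have to be fetched and counted separately.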

May 5, 2017 - 16:51

Sebastian Thompson

I'm not sure how much I like it either, but it's short and easy to remember. Also, unfortunately most of the premium, English domains are taken, so you have to resort to some kind of weird new word.

May 5, 2017 - 16:52

Ryder Smith

A lightweight, minimal ad to the right of the search results will one day be put in, is the plan anyway.

So every page taken by a domain shark will get indexed then :). They all have responsive, lightweight pages. You need an extra level of QC.

fuck i need to go to bed but can't stop

May 5, 2017 - 16:55

Brody Barnes

> So every page taken by a domain shark will get indexed then

mate then add your own level of QC. But you need some kind of actually objectively useful draw other than "this search engine only has sites I like"

May 5, 2017 - 16:58

Liam Miller

As I said, hobbyist and personal web pages that are lightweight. That's the main criteria.

May 5, 2017 - 17:00

Samuel Sullivan

bump'd

I submitted some websites

May 5, 2017 - 17:13

Elijah Thomas

...

May 5, 2017 - 17:29

Oliver Adams

>motherfuckingwebsite.com
lmfao

May 5, 2017 - 17:33

Julian Harris

Good work OP! i will submit some websites

May 5, 2017 - 23:06

Jose Evans

I like the idea, OP, it's like a niche stumbleupon. I'm somehow getting addicted to it.

nice

May 6, 2017 - 00:40

Grayson Nelson

>I really, really don't want to log people's IPs. That adds a huge layer of complexity
i don't think you know how to program

May 6, 2017 - 00:43

Samuel Carter

That fucking surprise hahhahah

May 6, 2017 - 01:02

Landon Mitchell

congratulations! unlike front end guys, you are really an engineer.

May 6, 2017 - 01:07

Kevin Williams

>surprise me
>home.mcom.com/home/welcome.html

FUCK

May 6, 2017 - 01:14

Easton Thompson

Does it even use keywords?
Looks like completely random redirect from yahoo search. If it does not find shit just make it say "sorry nothing relevant found onii-chan"

May 6, 2017 - 01:19

Connor Myers

Bullshit, i highly doubt every "young models/ family nudism/ lolita" site that links to literal cp is run by some pro. Some of them are obviously amateurs and only half are Russian.

OP, put a disclaimer on the site and have a report feature, that's all you really need.

May 6, 2017 - 01:27

Jeremiah Evans

I got omfgdogs.com/
before I read the other comments I thought, this just brought you to a random but useless site

May 6, 2017 - 01:40

Christopher Morales

>surprise me
>cia.gov

May 6, 2017 - 02:17

Charles Rodriguez

Awesome site, OP. Surprise feature a fucking best.

May 6, 2017 - 03:14

Jackson Mitchell

>surprise me
>heavensgate.com

May 6, 2017 - 03:29

Camden Murphy

>Surprise me
Ok, that's kinda funny m8. But the regular search makes no sense, the few things it brings back don't even have the same words in them.

May 6, 2017 - 03:29

Jace Fisher

I can improve upon it, but honestly for such a technical question like what you are asking, you should probably stick with google.

May 6, 2017 - 04:13

Julian Evans

>OP, put a disclaimer on the site and have a report feature, that's all you really need.
Will look into that! Thanks.

May 6, 2017 - 04:21

Anthony Peterson

Yeah I think that will improve as more pages get indexed.

May 6, 2017 - 04:23

Benjamin Myers

>What is your policy on copyright?
I have thought about creating a search engine too, and this was actually one of the biggest questions on my mind: the legality of running automated bots to spider people's sites, because I don't want to end up in rape you in the ass prison because of some bullshit laws. Remember what they did to that kid at MIT, sentencing him to like 50 years in prison for automatically downloading papers that were available through the campus network?

May 6, 2017 - 04:24

Cooper Hall

>hqdefault.jpg

Question: did u make your image that?

May 6, 2017 - 04:25

Jaxson Baker

search: cats
fimfiction.net/
FUUUUUUCK
REEEEE

May 6, 2017 - 04:29

Christian Thompson

is it actually using the words you input in the "search" OP ?

May 6, 2017 - 04:51

Jack Campbell

yes, but there aren't a lot of pages in this yet so it doesn't have much to work with right now

May 6, 2017 - 05:10

Nolan Diaz

What is the user agent of your crawler and how does it work?

Do you read robots.txts? IMO you shouldn't because fuck 'em

May 6, 2017 - 05:18

Dylan Mitchell

Thanks anons :)

User Agent? Sorry I'm not sure what you mean but I'll try to answer. It uses libcurl to download the submitted page. Then the page goes to an html parser that I wrote to separate the text from the html. Then the text gets put into the db. There were some off-the-shelf solutions for a web crawler and even a full search engine, but figuring them out is often just as hard as building your own.
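The "separate the text from the html" step can be sketched in a few lines. This is not the crawler described here (that one is C with a hand-written parser); it is a stand-in using Python's standard html.parser, and the class and function names are invented for the example. The download and db steps are left out.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and ignores script/style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the page's text with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string is what would then be tokenized and stored for keyword search.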

>robots.txts
I don't read it but if the page has "noindex", it won't get indexed.
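For reference, one way to do a "noindex" check is to look for a robots meta tag in the page. The sketch below assumes that is what is meant; how the real crawler detects it is not stated. Names are hypothetical, and "none" is treated as equivalent because it is the usual shorthand for "noindex, nofollow".

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Sets .noindex if a <meta name="robots"> tag asks not to be indexed."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {k: (v or "") for k, v in attrs}
        if a.get("name", "").strip().lower() != "robots":
            return
        directives = {d.strip().lower() for d in a.get("content", "").split(",")}
        if "noindex" in directives or "none" in directives:
            self.noindex = True

def has_noindex(html):
    """True if the page carries a robots meta tag with noindex (or none)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.noindex
```

This only covers the meta tag. The same directive can also arrive in an X-Robots-Tag HTTP response header, which would need a separate check on the response headers.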

May 6, 2017 - 05:26

Alexander Reyes

cool just checking, glad to know you're awake hope you slept well :3

May 6, 2017 - 05:31

Easton Sanchez

I searched for "dog" and got "puppy linux" as the first result. I like it.

May 6, 2017 - 05:38

Jeremiah Moore

Please continue development. This could be very valuable. Most might not get it but you're right... Searching for interesting things when you don't really know what you're looking for is actually a very useful thing.

May 6, 2017 - 05:42

Jacob Green

This is what it looks like to the site admin when you crawl a website

172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET / HTTP/1.1" 200 1088 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post4.html HTTP/1.1" 200 2923 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post1.html HTTP/1.1" 200 2146 "-" "-"

This is what another random bot looks like

100.43.81.141 - - [06/May/2017:08:16:33 -0600] "GET /contact.html HTTP/1.1" 200 741 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +yandex.com/bots)"

the difference is you have no user agent.
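Setting one is a one-liner in most HTTP clients. In libcurl it is the CURLOPT_USERAGENT option. The sketch below shows the same idea with Python's urllib; the bot name and info URL are placeholders, not the real crawler's identity.

```python
import urllib.request

# Placeholder identity: a name, a version and a page describing the bot.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.com/bot.html)"

def build_request(url):
    """Build a GET request that identifies the crawler to the site admin."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to urllib.request.urlopen would make the last quoted field of the log lines above show the crawler's name instead of "-".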

May 6, 2017 - 05:53

Dominic Johnson

Never do sleep well but that helps when you have projects to work on.

Will do!

May 6, 2017 - 05:54
