I am building a search engine

It is located at wibr.me. I am sick of google and the boring results it returns so I'm building my own. I want to rebuild the web to be like it was in the early days, where mostly personal and hobbyist type pages existed, that were lightweight and easy to read. Please try it out and give me feedback. If you want to submit a page to it, I'd appreciate it. There aren't many pages indexed at the moment.

What is your policy on copyright?
Will your system be built on top of any other platform or system?
Will your system use or depend on third-party libraries or systems?
What languages do you use for programming?
Do you use any form of logging in your software?
Can we see the source?

>What is your policy on copyright?
I'm just a one-man show so I honestly haven't really considered this. I don't have a policy except to say if it's illegal where I live, I probably wouldn't be able to have it. :(

>Will your system be built on top of any other platform or system?
The LAMP stack. That's it.

>What languages do you use for programming?
The pages are PHP. The web crawler is pure C (for the speed).

>Do you use any form of logging in your software?
I really, really don't want to log people's IPs. That adds a huge layer of complexity and for what purpose? To feed to the government? Screw them. My fear is what would happen if they tried to compel me. This will probably never gain any real popularity, so I don't think it would be a worry.

Neat. I like the slim 'n cozy feel.

Thanks

bump for interest

are you just not indexing big sites or something?
I searched for wikipedia and got stallman.org

stallman > wikipedia

But in all seriousness, I don't think I want to index Wikipedia or other wikis all that much. It's not trying to be like Google, where it will deliver you the answer to the technical question you may search for at the expense of all the fun, interesting sites being buried. I could never compete with them on that level anyway. Rather, I want this to be more of a search engine for when you have no idea what you're looking for. Maybe you just have a general topic of interest. It's hard to put into words. Kind of just flying by the seat of my pants here.

Also, all the big sites are bloated and full of scripts. They aren't going to make it on this search engine.

This might just be absolutely genius.

results are a little weak

what do you mean don't know what you're looking for? how does it work?

It's shit

:D
yeah, it will get better as I can improve the code behind it, but also there just aren't many pages indexed.

...

Say you want to find pages about.. cats. But not a page on say, "cat hair removal". "Cat hair removal" is a technical question that google is best suited for. Of course, as more pages get indexed, this sort of query might actually work.

Sorry :(

but can't you just google "cat"? what does it do differently?

the surprise me feature is pretty good though

What I hope to do, is when you search for 'cat', you will get a bunch of results of people's personal webpages, about their own cats. If you google 'cat', you get SPCA type results, or mainstream news articles, or pages that are mainly profit/ad driven that yield little genuine content. Those are not the type of results I want it to generate.

>tfw almost every search I've done pointed me to stallman.org
Is this some kind of joke

Also, I appreciate your efforts OP, but nothing I've searched for gave me meaningful results.

ah okay makes sense now

you should put a tagline somewhere on the main page that explains that

heheh.
Well come back in a year, I hope to have more pages indexed by then.

Yeah I think I might have to do just that

Searched for 007 and got this, that's the kind of thing you would get on old Google, good job OP.
goldeneye007.detstar.com

>one man show trying to build a search engine
>LAMP
>pure C "for speed"

I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.

Yeah those are the pages I need to find among all the crap on the interweb.

I come from the electronics/hardware side of things, long since graduated. SW is a skill I like to explore and keep up with. Maybe it looks half assed to you but I'm happy with the way it's turning out.

I like this feature too

wibr.me/surprise/

Not gonna use this for anything I'm really trying to find out but for finding things you don't know you want to know seems to work great.

>tfw this is the first website I found
motherfuckingwebsite.com/
Seems like I'll be using this often

Glad you like it. Only way to keep that interesting is to keep the same kind of requirements for what gets indexed.

I smell a retard.

That took me to
>angelfire.com/trek/caver

vxheaven.org/
Neat

Interesting stuff

yeah i'm liking the surprise feature. feels like late 90s, early 2000s. but then i'm just using it as a stumbleupon which was/is already a thing to go to random pages.

Give sites that use non-free javascript lower search priority

Blacklist sites which abuse search hits for clicks.
For instance, sites that create empty pages for certain keywords, like jiskha.com, or chegg.com, which puts everything behind a paywall (advertising).

Need some work

also let us submit sites to blacklist

you can start with twitter, youtube, google, wikipedia, twitter, facebook, reddit and Sup Forums.

Although some of these are indeed useful, they do not need to appear in search results.

You should make a search engine that ignores the top 1000 sites.

These sites will never be able to be indexed on here. They are all retroactively banned I tell you.

You can help me with that...

>These sites will never be able to be indexed on here. They are all retroactively banned I tell you.
Except the chans

What an awful place to work. Are these tools even on the spectrum? What is privacy? What is personal space? Fucking disgusting

rude

Maybe it's a good thing.

ultimately if you blacklist enough crap and set up filters, you could create a fairly neat search engine without having to manually approve each address.
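
The blacklist idea above could be sketched roughly like this. It is only an illustration, not how wibr.me works: the `BLACKLIST` set and the `is_blacklisted` helper are hypothetical names, and the domains are just ones mentioned in this thread.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist, seeded with domains named in the thread.
BLACKLIST = {
    "twitter.com", "youtube.com", "google.com", "wikipedia.org",
    "facebook.com", "reddit.com", "jiskha.com", "chegg.com",
}


def is_blacklisted(url, blacklist=BLACKLIST):
    """Return True if the URL's host is a blacklisted domain or a subdomain of one."""
    # Submitted URLs often come without a scheme, so add one before parsing.
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    # Match the domain itself or any subdomain, but not lookalikes
    # such as "nottwitter.com".
    return any(host == d or host.endswith("." + d) for d in blacklist)
```

A submission form could call `is_blacklisted(url)` before the crawler ever fetches the page, so nothing has to be approved by hand for the obvious cases.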

>I smell a college sophomore who is trying to spin his data structures final into a half assed product because they couldn't get an internship for the summer.

fucking savage I love it
this is so true

any lone man trying to create a search engine is either delusional, or a sophomore in uni

>learn linux
>learn php
>learn sql
>learn how search engines work
>improve coding skills
>can leverage new skills to get better job

hurr durr totally delusional.

bbspot.com/News/2000/5/clock_rift.html

Now this is the type of content I want on Sup Forums

Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?
I think it's cool but how do you make money? Or rather: Why should I use your search engine instead of just keep using google?
You're literally trying to make a separate web happen.
Yes technically every search engine is a separate web that allows you to peek into different parts of the internet.
But which sites will get a higher pagerank than other sites? Those that use less CSS? So a site that could be seriously popular might only show up on page 3 because it is too 'modern' or too corporate?

user, be careful. One day, you *will* index CP. This could be a life ruining event for you. You are not Google and do not have over 9000 jewish lawyers on your payroll.

>linking to illegal material is now illegal
>???

>>Your algorithm decides which sites to include by how much CSS they use and the general size of images I suppose?

This I want to know. How does this algorithm work that filters out 'big' websites? Or do you possibly just go through your web crawler's results yourself?

Go set up a CP link site. Lemme know how that works out for you.

Yes, the law is retarded.

I'm not trying to replace or supplant Google. This is, as I said earlier up the thread, a totally different kind of approach. Google wants to index the entire web and deliver you the exact answer you are looking for, usually related to a technical question. Whereas I want my search engine to be more of a stroll in a cozy village. You wanted to go for a walk, you weren't sure exactly what you would run into, and you end up seeing some interesting things along the way. The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

Yes, it's strange. It's not going to impress most people. Just a niche. But that's how it works. I'm fine with it.

The problem with Google indexing literally everything is that you end up with many awful, cancerous pages. They are primarily ad-driven, clickbait-type pages that make your modern computer sweat. Or they are boring-as-hell associations, corporate landing pages, mainstream news articles, etc. The gem-type pages I am looking for get lost and crowded out by all the other filth.

>use the surprise feature
>get this
>mapsnet.org/pages/kcarr/student webpages/2012-2013/3rd_hr/vermeulen_webpage/index.html

I'm not sure I want to explain how that part works :D I will say however, that only the page you submit will get indexed. Thought about crawling other sub-pages and decided against it. It relies entirely on user submissions (or me finding the pages myself and submitting them).

It's great that you're building your own search-engine because most search-engines are heavily censored and monitored.

There is one alternative that may interest you, yacy.net/

YaCy is P2P based and it's written in Java. It really isn't very good. I've been running it for years anyway, though.

You should probably look into that and think long and hard about how you want to be sorting your index.

YaCy has a ton of pages in its index, and since it's P2P the entire network "knows" about a large portion of the Internet. Where this project really breaks down is its ability to sort search results: it doesn't. At all. It's almost like the code which supposedly ranks search results simply pulls from /dev/random and outputs the pages in that order.

Thanks for the feedback y'all, time for bed.

I hate the name. Try something more like "frontier.web". It's more meaningful to what it seems like you're trying to do, and I'm pretty sure that people are sick of meaningless names, especially if you're trying to distance yourself from the existing Silicon Valley circle jerk.

alright will check that out and thanks for the feedback. night

> The sites also will fit a certain aesthetic criteria. Those which do not fit the criteria are not indexed.

yeah but sites change their design when they get more popular. So at some point you may be removing sites from your search engine which became popular due to your search engine?
that's a funny scenario

oh well :)

>more popular
I think you mean, "too mainstream".

This is really neat user. I bookmarked this.

Any plans to monetize it yet? If so, how?

look, if you want actual advice: Make a search engine that *only* finds high performance websites. You ping them, if they respond and load well, you include them in search results.
You can start by just requiring sites to be under, let's say, 250 KB or 500 KB and go from there to more features.
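
A minimal sketch of that size-and-speed check, under stated assumptions: `is_lightweight` and `measure` are hypothetical helpers, the 2-second limit is an arbitrary illustrative number, and only the HTML document itself is measured (images, scripts and stylesheets are not fetched).

```python
import time
import urllib.request


def is_lightweight(body, load_seconds, max_bytes=250 * 1024, max_seconds=2.0):
    """Accept a page only if it is small enough and loaded fast enough."""
    return len(body) <= max_bytes and load_seconds <= max_seconds


def measure(url, timeout=10.0):
    """Fetch a URL and return (body bytes, seconds taken). Needs network access."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    return body, time.monotonic() - start
```

Usage would be something like `body, secs = measure(url)` followed by `is_lightweight(body, secs)`, with `max_bytes=500 * 1024` for the looser limit.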

I'm not sure how much I like it either, but it's short and easy to remember. Also, unfortunately most of the premium English domains are taken, so you have to resort to some kind of weird new word.

A lightweight, minimal ad to the right of the search results will one day be put in, is the plan anyway.

So every page taken by a domain shark will get indexed then :). They all have responsive, lightweight pages. You need an extra level of QC.

fuck I need to go to bed but can't stop

> So every page taken by a domain shark will get indexed then

mate then add your own level of QC. But you need some kind of actually objectively useful draw other than "this search engine only has sites I like"

As I said, hobbyist and personal web pages that are lightweight. That's the main criteria.

bump'd

I submitted some websites

...

>motherfuckingwebsite.com
lmfao

Good work OP! I will submit some websites

I like the idea, OP, it's like a niche stumbleupon. I'm somehow getting addicted to it.

nice

>I really, really don't want to log people's IPs. That adds a huge layer of complexity
I don't think you know how to program

That fucking surprise hahhahah

congratulations! unlike front end guys, you are really an engineer.

>surprise me
>home.mcom.com/home/welcome.html

FUCK

Does it even use keywords?
Looks like a completely random redirect from yahoo search. If it does not find shit just make it say "sorry nothing relevant found onii-chan"

Bullshit, i highly doubt every "young models/ family nudism/ lolita" site that links to literal cp is run by some pro. Some of them are obviously amateurs and only half are Russian.

OP, put a disclaimer on the site and have a report feature, that's all you really need.

I got omfgdogs.com/
before I read the other comments I thought, this just brought you to a random but useless site

>surprise me
>cia.gov

Awesome site, OP. Surprise feature a fucking best.

>surprise me
>heavensgate.com

>Surprise me
Ok, that's kinda funny m8. But the regular search makes no sense, the few things it brings back don't even have the same words in them.

I can improve upon it, but honestly for such a technical question like what you are asking, you should probably stick with Google.

>OP, put a disclaimer on the site and have a report feature, that's all you really need.
Will look into that! Thanks.

Yeah I think that will improve as more pages get indexed.

>What is your policy on copyright?
I have thought about creating a search engine too, but one of the biggest questions on my mind was the legality of running automated bots to spider people's sites, because I don't want to end up in rape-you-in-the-ass prison because of some bullshit laws. Remember what they did to that kid at MIT, threatening him with like 50 years in prison for automatically downloading papers that were available through the campus network?

>hqdefault.jpg

Question: did u make your image that?

search: cats
fimfiction.net/
FUUUUUUCK
REEEEE

is it actually using the words you input in the "search" OP ?

yes, but there aren't a lot of pages in this yet so it doesn't have much to work with right now

What is the user agent of your crawler and how does it work?

Do you read robots.txts? IMO you shouldn't because fuck 'em

Thanks anons :)

User agent? Sorry, I'm not sure what you mean, but I'll try to answer. It uses libcurl to download the submitted page. Then the page goes to an HTML parser that I wrote to separate the text from the HTML. Then the text gets put into the DB. There were some off-the-shelf solutions for a web crawler and even a full search engine, but figuring them out is often just as hard as building your own.
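
The parse-and-store steps described above might look roughly like this. It is a sketch only: OP's crawler is C with libcurl and his own parser on a LAMP stack, while this swaps in Python's standard `html.parser` and an in-memory SQLite database as stand-ins. The table layout and the `extract_text`, `index_page` and `search` names are hypothetical, and the download step is left out.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Separate the text from the HTML markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def index_page(db, url, html):
    """Store the extracted text of one submitted page (hypothetical schema)."""
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    db.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        (url, extract_text(html)),
    )
    db.commit()


def search(db, word):
    """Naive substring search over the stored text, with no ranking at all."""
    rows = db.execute(
        "SELECT url FROM pages WHERE body LIKE ? ORDER BY url", (f"%{word}%",)
    )
    return [row[0] for row in rows]
```

With `db = sqlite3.connect(":memory:")`, calling `index_page(db, url, html)` once per submitted page and then `search(db, "cats")` gives the basic submit-then-search loop. A real engine would need an actual ranking step on top.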

>robots.txts
I don't read it, but if the page has "noindex", it won't get indexed.
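
One way the "noindex" check could work, assuming it means the standard `<meta name="robots" content="noindex">` tag (OP doesn't say exactly what he looks for). `RobotsMetaParser` and `has_noindex` are hypothetical names. The `X-Robots-Tag` HTTP header, which can carry the same directive, is not covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Look for a robots meta tag that asks not to be indexed."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values keep their original case.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in attributes.get("content", "").split(",")]
        # "none" is shorthand for "noindex, nofollow".
        if "noindex" in directives or "none" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.noindex
```

The crawler would call `has_noindex(html)` right after downloading and skip the page if it returns True.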

cool, just checking. glad to know you're awake, hope you slept well :3

I searched for "dog" and got "puppy linux" as the first result. I like it.

Please continue development. This could be very valuable. Most might not get it but you're right... Searching for interesting things when you don't really know what you're looking for is actually a very useful thing.

This is what it looks like to the site admin when you crawl a website

172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET / HTTP/1.1" 200 1088 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post4.html HTTP/1.1" 200 2923 "-" "-"
172.93.49.59 - - [06/May/2017:13:39:48 -0600] "GET /posts/post1.html HTTP/1.1" 200 2146 "-" "-"


This is what another random bot looks like

100.43.81.141 - - [06/May/2017:08:16:33 -0600] "GET /contact.html HTTP/1.1" 200 741 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +yandex.com/bots)"


the difference is you have no user agent.
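
For illustration, here is roughly what sending a user agent looks like, plus a small helper that pulls the user-agent field out of log lines like the ones above. This is a Python sketch, not OP's code. In his C crawler the equivalent would be libcurl's `CURLOPT_USERAGENT` option. The bot name and info URL in `USER_AGENT` are made up, and `build_request` and `logged_user_agent` are hypothetical helpers.

```python
import re
import urllib.request

# Hypothetical bot name and info URL, following the same pattern as the
# YandexBot string in the log above.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/0.1; +example.org/bot)"

# In these log lines the user agent is the last double-quoted field.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the crawler to the site admin."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def logged_user_agent(log_line):
    """Return the user-agent field of an access log line, or None if absent."""
    match = _LAST_QUOTED_FIELD.search(log_line)
    return match.group(1) if match else None
```

Passing the result of `build_request(url)` to `urllib.request.urlopen` would make the crawler show up in access logs with a name instead of "-", which gives site admins a way to recognize it or contact its owner.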

Never do sleep well but that helps when you have projects to work on.

Will do!