I am building a search engine

Located at wiby.me. I made a thread a couple of months ago; this will be my final update. I want to rebuild the web to be like the 90s / early 00s, when pages weren't so bloated and were based more on a subject of interest than on making money. I am trying to gather as many of these kinds of pages as I can, so if you know of any, please submit them.
It used to be called wibr, but I changed it to wiby because it's easier to sound out. The search engine will now crawl indexed pages on a weekly basis to stay up to date.
Some server info: it's a LAMP server hosted on a VPS with an SSD. I wrote the crawler and update scheduler in C; the binary is about 60 KB compiled and does the job well. The site is usually pretty snappy, but there aren't many users either.
Anyhoo, I will keep trying to find old-school style pages to keep it growing even if submissions slow down, although I've had a good number of contributions so far.

Other urls found in this thread:

prog.ide.sk/pas.php
catb.org/esr/
ifs.nog.cc/
wwwtxt.org/tagged/txt
txti.es/
cpuville.com/
members.iinet.net.au/~daveb/simplex/simplex.html
homebrewcpu.com/magic-16.htm
textfiles.com/

bump
Can't provide anything since I'm a babby but I'm rooting for you user

Same. good luck mane

prog.ide.sk/pas.php is a nice page for learning the basics of the Pascal language. It's in Hungarian though.

Ubuntu server, eww

Keep posting, I gladly provide cool pages to index.
catb.org/esr/
ifs.nog.cc/
wwwtxt.org/tagged/txt

bonus:
txti.es/

does it werk without js

cpuville.com/

This is a nice page about a DIY CPU a guy built out of transistors.

And here's another DIY CPU; this guy made his in the 1970s out of salvaged parts.
members.iinet.net.au/~daveb/simplex/simplex.html

There's a whole plethora of pages about homebrew CPUs in this thing called the Homebrew CPUs Webring. What's even better is that they are all comfy web 1.0 pages.

Ima go submit a few to your engine.

Thanks g/ents
Indexed
200 MB RAM usage
Yes
Appreciate you providing all those quality pages thanks!

Thanks!!!!

This is probably the funniest thing I've read on any of the Homebrew CPUs Webring pages:

An encore for Magic-1?

Shortly after I declared Magic-1 "hardware complete", I casually mentioned to my wife that I was starting to think about Magic-2. Her response was swift, and final:

"No, there will be no Magic-2!"

I can't blame her. She was an extraordinarily good sport during Magic-1's design and construction - especially during the wire-wrapping phase. For most of a year, she put up with electronic junk littering the kitchen table, wire-wrap insulation fragments on the floor and a husband often lost in concentration while the kids were hollering for attention.

She's the love of my life, the woman I plan on growing old with, mother of my children, my partner and best friend. I have to respect her wishes on this.

So, there will be no Magic-2.

Instead, we'll call the follow-on project "Magic-16".

homebrewcpu.com/magic-16.htm

>I have to respect her wishes on this.
>So, there will be no Magic-2.
>Instead, we'll call the follow-on project "Magic-16".

I happen to do search for a living, so I'm curious about your tech.

* What analysis/normalization do you apply to text?
* How do you rank? Classic TF/IDF, PageRank-style, ...?
* How do you deal with different languages?
* How do you handle semantics and synonyms?
* What user input error correction mechanisms do you employ?

I'm happy for any info, and would be glad to share experience.

why not
textfiles.com/

it has
textfiles.com/sex/fuckdead.txt

suckless
cat-v
9p.io

>* What analysis/normalization do you apply to text?
Whatever is employed within the realm of SQL full-text search.
>* How do you rank? Classic TF/IDF, PageRank-style, ...?
No page ranking. There are no cancerous pages on wiby worth suppressing.
>* How do you deal with different languages?
This will only be an issue for the 'surprise me' feature. Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
>* How do you handle semantics and synonyms?
It's a one-man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
>* What user input error correction mechanisms do you employ?
None. Only checking for security h4x. Users will have to know how to spell I guess.

>>* What analysis/normalization do you apply to text?
>Whatever is employed within the realm of the SQL full text search
Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

>>* How do you rank? Classic TF/IDF, PageRank-style, ...?
>No page ranking. There are no cancerous pages on wiby worth suppressing.
Ranking is not about "suppressing" things, but about putting more relevant content first. TF/IDF is quite useful, because it's a great tradeoff between precision and recall. If you're not doing that, what does determine the order of results?
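For reference, TF/IDF is simple enough to sketch in a few lines. This is a toy in-memory version just to show the idea (it's not wiby's code, and a real index would precompute these statistics):

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents by summed TF-IDF weight of the query terms.

    docs: list of token lists; query: list of tokens.
    Returns document indices, best match first.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if df[term]:
                # term frequency * inverse document frequency
                score += (tf[term] / len(doc)) * math.log(n / df[term])
        scores.append((score, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

Terms that appear everywhere get a near-zero IDF, so they contribute little; rare terms that match dominate the score, which is exactly the precision/recall tradeoff mentioned above.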

>>* How do you deal with different languages?
>This will only be an issue for the 'surprise me' feature.
I don't think so. Multi-language capabilities of RDBMS'es are currently rather limited as far as I know. So systems I've worked with before employ different ways of processing different languages because they have different stopwords, and **very** different ways of stemming.
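To make the point concrete, here's a deliberately tiny sketch of why you need one analysis pipeline per language. The stopword tables and suffix rules below are made-up miniatures (real systems use full stopword lists and proper stemmers like Snowball):

```python
# Illustrative only: real stopword lists and stemmers are far larger;
# these tiny tables just show that the pipelines genuinely differ.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is"},
    "de": {"der", "die", "das", "und", "ist"},
}

SUFFIXES = {"en": ("ing", "ed", "s"), "de": ("en", "er", "e")}

def naive_stem(token, lang):
    # Crude suffix stripping; English and German strip different endings.
    for suf in SUFFIXES[lang]:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def analyze(text, lang):
    tokens = text.lower().split()
    return [naive_stem(t, lang) for t in tokens if t not in STOPWORDS[lang]]
```

Run the English analyzer over German text (or vice versa) and you keep the wrong stopwords and strip the wrong suffixes, which is why a single generic pipeline degrades non-English search.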

>Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
I don't like the idea of limiting content based on the physical location. Think about people traveling, or VPN users. Maybe the browser's locale would be a better approach.
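The browser already sends its language preferences on every request via the Accept-Language header, so no geolocation is needed. A minimal parser might look like this (a sketch, not anything wiby runs):

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into language tags, best first.

    e.g. "hu,en-US;q=0.8,en;q=0.6" -> ["hu", "en-US", "en"]
    """
    langs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";" in piece:
            tag, _, q = piece.partition(";")
            try:
                quality = float(q.split("=", 1)[1])
            except (IndexError, ValueError):
                quality = 0.0
            langs.append((quality, tag.strip()))
        else:
            # No q-value means full preference (q=1).
            langs.append((1.0, piece))
    langs.sort(key=lambda pair: -pair[0])
    return [tag for _, tag in langs]
```

The 'surprise me' feature could then filter to pages whose detected language appears in this list, always falling back to English as planned.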

cont.

>>* How do you handle semantics and synonyms?
>Is a one man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
That's probably worth looking at. It's been a very important topic for me as an implementer, and a huge point of frustration as a user; I often have trouble finding the right word to look things up. Not finding something because of the wrong words and then giving up is a very annoying experience.

>>* What user input error correction mechanisms do you employ?
>None. Only checking for security h4x. Users will have to know how to spell I guess.
Also worth looking at. Simple fuzzy search based on permutations up to a certain Levenshtein distance can already get you most of the way.
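The edit distance itself is a short dynamic-programming routine; a naive fuzzy lookup just compares the query against each vocabulary word (a sketch for illustration; production systems precompute permutation indexes rather than scanning):

```python
def levenshtein(a, b):
    """Classic DP edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = cur
    return prev[-1]

def fuzzy_match(query, vocabulary, max_dist=2):
    """Return vocabulary words within max_dist edits of the query."""
    return [w for w in vocabulary if levenshtein(query, w) <= max_dist]
```

So a misspelled query like "pascl" still finds "pascal" (one deletion away), which covers most real-world typos without any machine learning.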

>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.

>what does determine the order of results?

It's prioritizing matches in titles first, followed by matches within the body, using the full-text search algorithm.
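A sketch of that "titles first, then body" ordering, assuming a MySQL-style FULLTEXT setup since the stack is LAMP; the table and column names are hypothetical, and the in-memory Python version below just mirrors the same two-level sort key:

```python
# The SQL equivalent (hypothetical schema) would rank title relevance
# above body relevance, roughly:
#
#   SELECT url,
#          MATCH(title) AGAINST (%s) AS title_score,
#          MATCH(body)  AGAINST (%s) AS body_score
#   FROM pages
#   WHERE MATCH(title, body) AGAINST (%s)
#   ORDER BY title_score DESC, body_score DESC;
#
# (MATCH(...) requires matching FULLTEXT indexes on those columns.)

def rank_pages(query, pages):
    """pages: list of (url, title, body) tuples, names hypothetical."""
    q = query.lower()
    def key(page):
        url, title, body = page
        # Any title hit outranks any number of body hits.
        return (q in title.lower(), body.lower().count(q))
    return [p[0] for p in sorted(pages, key=key, reverse=True)]
```

Because the sort key is a tuple, a title match always beats a body-only match, and body frequency only breaks ties within each group.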

>because they have different stopwords, and **very** different ways of stemming.

Hwelp, I don't know what to say. I'm hoping the black box that is SQL can handle it. Would have to do a lot more research on these subjects.

>Maybe the browser's locale would be a better approach.

I shall remember this when I do eventually have to tackle it.

>It's been a very important topic for me as an implementer, and a huge point of frustration as a user, often having trouble to find the right word to look up things. Not finding something because of wrong words and giving up then is a very annoying experience.

Yes, I will try to improve the search algorithm over time, though I prefer to stick to simple, brutish solutions since it's more of a hobby project for me at the moment.

>>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.
>You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.
The question was basically what database software you're using.

In any case, is there any possibility to get a dump of your index in a generic format (CSV, JSON, whatever) to play around with? I feel like trying some things to improve search quality by running it through some specialized setups, and I'm too lazy to write a crawler and curate some sites.