I am building a search engine

Located at wiby.me. I made a thread a couple of months ago; this will be my final update. I want to rebuild the web to be like the 90s / early 00s, when pages weren't so bloated and were based more on a subject of interest than on making money. I am trying to gather as many of these kinds of pages as I can, so if you know of any, please submit them.
It used to be called wibr, but I changed it to wiby because it's easier to sound out. The search engine will now crawl indexed pages on a weekly basis to stay up to date.
Some server info: it's a LAMP server hosted on a VPS with an SSD. I wrote the crawler and update scheduler in C; the binary is about 60 KB compiled and does the job well. The site is usually pretty snappy, but there aren't many users either.
Anyhoo, I will keep trying to find old-school style pages to keep it growing even if submissions slow down, although I've had a good number of contributions so far.

Other urls found in this thread:

prog.ide.sk/pas.php
catb.org/esr/
ifs.nog.cc/
wwwtxt.org/tagged/txt
txti.es/
cpuville.com/
members.iinet.net.au/~daveb/simplex/simplex.html
homebrewcpu.com/magic-16.htm
textfiles.com/

bump
Can't provide anything since I'm a babby but I'm rooting for you user

Same. good luck mane

prog.ide.sk/pas.php is a nice page for learning the basics of the Pascal language. It's in Hungarian though.

Ubuntu server, eww

Keep posting, I gladly provide cool pages to index.
catb.org/esr/
ifs.nog.cc/
wwwtxt.org/tagged/txt

bonus:
txti.es/

does it werk without js

cpuville.com/

This is a nice page about a DIY CPU a guy built out of transistors.

And here's another DIY CPU; this guy made his in the 1970s out of salvaged parts.
members.iinet.net.au/~daveb/simplex/simplex.html

There's a whole plethora of pages about homebrew CPUs in this thing called the Homebrew CPUs Webring. What's even better is that they are all comfy web 1.0 pages.

Ima go submit a few to your engine.

Thanks g/ents
Indexed
200 MB RAM usage
Yes
Appreciate you providing all those quality pages thanks!

Thanks!!!!

This is probably the funniest thing I've read on any of the Homebrew CPUs Webring pages:

An encore for Magic-1?

Shortly after I declared Magic-1 "hardware complete", I casually mentioned to my wife that I was starting to think about Magic-2. Her response was swift, and final:

"No, there will be no Magic-2!"

I can't blame her. She was an extraordinarily good sport during Magic-1's design and construction - especially during the wire-wrapping phase. For most of a year, she put up with electronic junk littering the kitchen table, wire-wrap insulation fragments on the floor and a husband often lost in concentration while the kids were hollering for attention.

She's the love of my life, the woman I plan on growing old with, mother of my children, my partner and best friend. I have to respect her wishes on this.

So, there will be no Magic-2.

Instead, we'll call the follow-on project "Magic-16".

homebrewcpu.com/magic-16.htm

>I have to respect her wishes on this.
>So, there will be no Magic-2.
>Instead, we'll call the follow-on project "Magic-16".

I happen to do search for a living, so I'm curious about your tech.

* What analysis/normalization do you apply to text?
* How do you rank? Classic TF/IDF, PageRank-style, ...?
* How do you deal with different languages?
* How do you handle semantics and synonyms?
* What user input error correction mechanisms do you employ?

I'm happy for any info, and would be glad to share experience.

why not
textfiles.com/

it has
textfiles.com/sex/fuckdead.txt

suckless
cat-v
9p.io

>* What analysis/normalization do you apply to text?
Whatever is employed within the realm of SQL full-text search.
>* How do you rank? Classic TF/IDF, PageRank-style, ...?
No page ranking. There are no cancerous pages on wiby worth suppressing.
>* How do you deal with different languages?
This will only be an issue for the 'surprise me' feature. Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
>* How do you handle semantics and synonyms?
It's a one-man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
>* What user input error correction mechanisms do you employ?
None. Only checking for security h4x. Users will have to know how to spell I guess.

>>* What analysis/normalization do you apply to text?
>Whatever is employed within the realm of the SQL full text search
Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

>>* How do you rank? Classic TF/IDF, PageRank-style, ...?
>No page ranking. There are no cancerous pages on wiby worth suppressing.
Ranking is not about "suppressing" things, but about putting more relevant content first. TF/IDF is quite useful, because it's a great tradeoff between precision and recall. If you're not doing that, what does determine the order of results?
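For reference, TF/IDF is simple enough to sketch in a few lines. This is a toy in-memory version just to show the idea (it's not wiby's code, and a real index would precompute these statistics):

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Rank documents by summed TF-IDF weight of the query terms.

    docs: list of token lists; query: list of tokens.
    Returns document indices, best match first.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if df[term]:
                # term frequency * inverse document frequency
                score += (tf[term] / len(doc)) * math.log(n / df[term])
        scores.append((score, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

Terms that appear everywhere get a near-zero IDF, so they contribute little; rare terms that match dominate the score, which is exactly the precision/recall tradeoff mentioned above.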

>>* How do you deal with different languages?
>This will only be an issue for the 'surprise me' feature.
I don't think so. Multi-language capabilities of RDBMS'es are currently rather limited as far as I know. So systems I've worked with before employ different ways of processing different languages because they have different stopwords, and **very** different ways of stemming.
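To make the point concrete, here's a deliberately tiny sketch of why you need one analysis pipeline per language. The stopword tables and suffix rules below are made-up miniatures (real systems use full stopword lists and proper stemmers like Snowball):

```python
# Illustrative only: real stopword lists and stemmers are far larger;
# these tiny tables just show that the pipelines genuinely differ.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "is"},
    "de": {"der", "die", "das", "und", "ist"},
}

SUFFIXES = {"en": ("ing", "ed", "s"), "de": ("en", "er", "e")}

def naive_stem(token, lang):
    # Crude suffix stripping; English and German strip different endings.
    for suf in SUFFIXES[lang]:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def analyze(text, lang):
    tokens = text.lower().split()
    return [naive_stem(t, lang) for t in tokens if t not in STOPWORDS[lang]]
```

Run the English analyzer over German text (or vice versa) and you keep the wrong stopwords and strip the wrong suffixes, which is why a single generic pipeline degrades non-English search.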

>Eventually I'll have to automatically determine the language of the page submitted, and then include/remove those for the surprise feature depending on what region of the world you live in based on IP. English pages will always be included however.
I don't like the idea of limiting content based on the physical location. Think about people traveling, or VPN users. Maybe the browser's locale would be a better approach.
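The browser already sends its language preferences on every request via the Accept-Language header, so no geolocation is needed. A minimal parser might look like this (a sketch, not anything wiby runs):

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into language tags, best first.

    e.g. "hu,en-US;q=0.8,en;q=0.6" -> ["hu", "en-US", "en"]
    """
    langs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";" in piece:
            tag, _, q = piece.partition(";")
            try:
                quality = float(q.split("=", 1)[1])
            except (IndexError, ValueError):
                quality = 0.0
            langs.append((quality, tag.strip()))
        else:
            # No q-value means full preference (q=1).
            langs.append((1.0, piece))
    langs.sort(key=lambda pair: -pair[0])
    return [tag for _, tag in langs]
```

The 'surprise me' feature could then filter to pages whose detected language appears in this list, always falling back to English as planned.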

cont.

>>* How do you handle semantics and synonyms?
>Is a one man show here, so I'm not handling this. Hopefully you will find a page with a word match somewhere.
That's probably worth looking at. It's been a very important topic for me as an implementer, and a huge point of frustration as a user; I often have trouble finding the right word to look things up. Not finding something because of the wrong words and then giving up is a very annoying experience.

>>* What user input error correction mechanisms do you employ?
>None. Only checking for security h4x. Users will have to know how to spell I guess.
Also worth looking at. Simple fuzzy search based on permutations up to a certain Levenshtein distance can already get you most of the way.
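The edit distance itself is a short dynamic-programming routine; a naive fuzzy lookup just compares the query against each vocabulary word (a sketch for illustration; production systems precompute permutation indexes rather than scanning):

```python
def levenshtein(a, b):
    """Classic DP edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = cur
    return prev[-1]

def fuzzy_match(query, vocabulary, max_dist=2):
    """Return vocabulary words within max_dist edits of the query."""
    return [w for w in vocabulary if levenshtein(query, w) <= max_dist]
```

So a misspelled query like "pascl" still finds "pascal" (one deletion away), which covers most real-world typos without any machine learning.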

>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.

You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.

>what does determine the order of results?

It's prioritizing matches in titles first, followed by matches within the body, using the full-text search algorithm.
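A sketch of that "titles first, then body" ordering, assuming a MySQL-style FULLTEXT setup since the stack is LAMP; the table and column names are hypothetical, and the in-memory Python version below just mirrors the same two-level sort key:

```python
# The SQL equivalent (hypothetical schema) would rank title relevance
# above body relevance, roughly:
#
#   SELECT url,
#          MATCH(title) AGAINST (%s) AS title_score,
#          MATCH(body)  AGAINST (%s) AS body_score
#   FROM pages
#   WHERE MATCH(title, body) AGAINST (%s)
#   ORDER BY title_score DESC, body_score DESC;
#
# (MATCH(...) requires matching FULLTEXT indexes on those columns.)

def rank_pages(query, pages):
    """pages: list of (url, title, body) tuples, names hypothetical."""
    q = query.lower()
    def key(page):
        url, title, body = page
        # Any title hit outranks any number of body hits.
        return (q in title.lower(), body.lower().count(q))
    return [p[0] for p in sorted(pages, key=key, reverse=True)]
```

Because the sort key is a tuple, a title match always beats a body-only match, and body frequency only breaks ties within each group.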

>because they have different stopwords, and **very** different ways of stemming.

Hwelp, I don't know what to say. I'm hoping the black box that is SQL can handle it. Would have to do a lot more research on these subjects.

>Maybe the browser's locale would be a better approach.

I shall remember this when I do eventually have to tackle it.

>It's been a very important topic for me as an implementer, and a huge point of frustration as a user, often having trouble to find the right word to look up things. Not finding something because of wrong words and giving up then is a very annoying experience.

Yes, I will try to improve the search algorithm over time, though I prefer to stick to simple, brutish solutions since it's more of a hobby project for me at the moment.

>>Depending on which RDBMS you use, the system itself may already cover quite a bit. What are you using? I'm assuming that you're exploiting those capabilities instead of just matching substrings.
>You'll have to excuse me because I'm just flying by the seat of my pants with all of this. I'm not sure what you're asking.
The question was basically what database software you're using.

In any case, is there any possibility to get a dump of your index in a generic format (CSV, JSON, whatever) to play around with? I feel like trying some things to improve search quality by running it through some specialized setups, and I'm too lazy to write a crawler and curate some sites.