Managing large image collections

Hello,

As an imageboard user and data hoarder, I save all the content I find interesting, amusing, or arousing. You know this very well since you are probably the same way.

I have several thousand images sitting on my hard drive, most of which have been saved by hand. This is content I dedicate time to searching for, so it is pretty special. I have several thousand more images, but those are scraped, so I have no trouble organizing them. I have postponed sorting the hand-saved images for a while, and hand-sorting them is just painful and time-consuming. At the moment, I have about 5000 images left to organise into around 200 thematic folders. I thought about deploying simple machine-learning based solutions using Python and TensorFlow, but they are unlikely to fit the granularity level I go for, since vastly different images can fit together in the same folder (think of memes, as an example). Also, labelling the dataset by hand for training would be pretty much what I am trying to accomplish right now.

Do you have a similar collection? Do you have a set-up to manage it?

Some user told me about this: github.com/hydrusnetwork/hydrus

>github.com/hydrusnetwork/hydrus

This is incredible and very much valuable, thank you user.

Interesting. Thanks.

Seconding Hydrus. Just use that.

Note that it will maintain its own file and folder structure (all hash based).

It doesn't preserve YOUR folders and filenames, though you can grab them into the DB at import time.

PS: To get other people's tags, which is one of the main uses of this tool, adding the PTR (the public tag repository by and for Hydrus users) is the most important step:
hydrusnetwork.github.io/hydrus/help/getting_started_tags.html

You can also add downloaded tag database dump files (like from a *booru) in almost the same place, but the PTR is the big one.

hydrus network is so freaking gay

I keep a complex file structure of anime images sorted by character, pairings, show etc etc
and then elo ranked using tournaments and manual comparisons and some fancy algorithms
and then rip tags from boorus using a program i wrote and embed the tags as metadata
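
For anyone wondering what "elo ranked" means in practice, here is a minimal sketch of the standard Elo update applied to pairwise image comparisons. The starting rating of 1500, the K-factor of 32 and the file names are assumptions for illustration, not the poster's actual setup.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one manual comparison."""
    gain = k * (1.0 - expected_score(winner, loser))
    return winner + gain, loser - gain


# Hypothetical file names; every image starts at the same rating.
ratings = {"a.png": 1500.0, "b.png": 1500.0}

# The user picked a.png over b.png in a side-by-side comparison.
ratings["a.png"], ratings["b.png"] = update_elo(ratings["a.png"], ratings["b.png"])
```

An upset (a low-rated image beating a high-rated one) moves both ratings further than an expected result does, which is why a few hundred comparisons are usually enough to get a usable ordering.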

coming soon: ranking individual tags using microsoft TRUESKILL and a gui to put it all together and btfo hydrus once and for all

btw hydrus copies all of your images into its own folder, freaking LAME

You have autism

last time I used hydrus, it lagged horribly with large image collections. was this ever fixed?

>btw hydrus copies all of your images into its own folder, freaking LAME
Wait, it does? That's not gonna work out when I have around 40gb of images saved.

anyone who considers using hydrus network is reddit and also im not autistic

Hydrus is shit, yes, but you have autism.

Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
Find a download for Google Picasa instead.

It doesn't really lag for me, but of course querying the DB and loading thumbnails gets slower if you have a bunch of million files and search for a bunch of tags.

It can move them into its own folder. Doesn't require any more storage space then.

But it will manage files and folders hash-based, yes. It doesn't do another mapping between hash and file storage location.
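
A rough sketch of what hash-based file management looks like, assuming a store that names each file after its SHA-256 and shards by the first two hex characters. The function names and the exact folder layout are made up for illustration; this is not Hydrus's actual code or layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the whole file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_file(source: Path, store: Path, move: bool = False) -> Path:
    """Copy or move a file into the store under a name derived from its hash."""
    file_hash = sha256_of(source)
    target_dir = store / file_hash[:2]  # assumed sharding scheme
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (file_hash + source.suffix.lower())
    if target.exists():
        # Identical content is already stored: nothing to copy.
        if move:
            source.unlink()
        return target
    if move:
        shutil.move(str(source), str(target))  # no extra disk space needed
    else:
        shutil.copy2(source, target)
    return target


# Demo in a throwaway directory with a fake file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "funny cat.JPG"
    original.write_bytes(b"not really a jpeg")
    stored = import_file(original, Path(tmp) / "store", move=True)
    print(stored.name, original.exists())
```

Because the location follows directly from the hash, no separate hash-to-path table is needed, and importing the same file twice ends up at the same place.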

Side question ... is there a self-hosted booru that clones the tags from the public ones?

> Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
It is a problem, symlinks / NTFS junctions like other things aren't going to work on all filesystems and security policies that are still in use. Symlinks are generally used sparingly and only by entities that fully do sysadmin stuff.

Also see:
> why not use filenames and folders?
> can the client manage files from their original locations?
hydrusnetwork.github.io/hydrus/help/faq.html

It's a choice for more performance and less file management problems.

Yes, that's exactly what Hydrus is doing.

Apart from the public tag repository you can use tag database dumps scraped from all the typical boorus and such.

And it CAN host a local web-browser based booru (hydrusnetwork.github.io/hydrus/help/local_booru.html) apart from the python GUI client.

if the dev wasn't a freaking retard he would put the metadata inside the images and kept a db of hashes

It's freaking retarded to alter all the images, it will change their hashes.

And actually not all formats supported will just take some metadata field.

And yes, this IS using a db of hashes, it's just not also doing a mapping from hashes to file paths that needs an extra lookup per file.

Neat. Gonna have a booru with only porn that I like.

Can it auto-download artist tags too?

save the new hash as metadata in the image genius
bwahaha

>Use tool once
>Changes hash of all your images
>Now locked in to that vendor

>new hash
>PTR becomes pointless since the hashes are different

If you let Hydrus download files from a *booru, it'll basically grab all tags if you tell it to. [There is sometimes slightly more fine-grained control, but generally you'll grab all tags.]

The PTR and DB dumps from possibly other boorus can then also augment with their tags.

a large metadata section that just iterates looking for hash collisions for the original image hash

Yea, and everyone else will hate your users when they upload their files from their filesystem directly since now easy duplicate elimination no longer works.

Basically, even if you want to hack the feature into your fork of hydrus, please add another DB to have that slower hash->file paths indirection. Don't fuck with the images if you don't have to.

ohhh noo the ceo of whatever non-encrypted / autistic indie image host will be mad because I used .0000001% more resources

Not entirely sure of the file structure for all image types, but isn't it possible to generate a hash using only data from the non-metadata section of the file? So that you can rename the file, change metadata, tags, etc. but the hash remains the same so long as the image remains the same?

the hash will be calculated using the entire file by whatever other generic service that sees the file; I think it's technically possible to generate a deterministic hash without the metadata though; because normal jpeg metadata is all at the end of the file by itself

Right, so really... that's how it should be done. I'm sure such a system would have to detect the way the drive is formatted, and the file type, to properly hash using only "non-metadata".
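
For JPEG at least, a metadata-independent hash can be sketched with nothing but the standard library. As far as I know, JPEG metadata (EXIF, XMP, comments) usually lives in APPn and COM segments ahead of the compressed image data, so the sketch below walks the segments and leaves those out of the hash. Treating every APPn segment as metadata is a simplifying assumption (some of them, such as colour profiles, affect how the image is displayed), and `jpeg_content_hash` is a made-up name, not something any booru or Hydrus uses.

```python
import hashlib
import struct

SOI = b"\xff\xd8"  # start-of-image marker
SOS = 0xDA         # start-of-scan: compressed image data follows
COM = 0xFE         # comment segment


def jpeg_content_hash(data: bytes) -> str:
    """SHA-256 over a JPEG stream with APPn and COM segments left out."""
    if data[:2] != SOI:
        raise ValueError("not a JPEG stream")
    digest = hashlib.sha256()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        if marker == SOS:
            # Everything from here on is scan data: hash the rest as-is.
            digest.update(data[pos:])
            return digest.hexdigest()
        is_metadata = 0xE0 <= marker <= 0xEF or marker == COM
        if not is_metadata:
            digest.update(data[pos : pos + 2 + length])
        pos += 2 + length  # marker (2 bytes) + length-prefixed payload
    raise ValueError("no start-of-scan marker found")


def segment(marker: int, payload: bytes) -> bytes:
    """Build one length-prefixed segment for the synthetic demo below."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic streams: structurally JPEG-like, not decodable images.
body = segment(0xDB, b"\x00" * 65) + segment(SOS, b"\x01\x02") + b"scan-bytes\xff\xd9"
plain = SOI + body
tagged = SOI + segment(0xE1, b"Exif\x00\x00fake") + segment(COM, b"tags: cat") + body

print(jpeg_content_hash(plain) == jpeg_content_hash(tagged))                # same
print(hashlib.sha256(plain).hexdigest() == hashlib.sha256(tagged).hexdigest())  # differs
```

The catch raised elsewhere in the thread still applies: any other service hashes the whole file, so this value only matches between tools that agree on exactly the same stripping rules, and every other container format needs its own parser.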

If they can't strip the metadata easily, they'll just disable the file uploads, and either way, you'd be a faggot like the sites who watermark everything.

Even then, you can't add your metadata to all file types that Hydrus supports. But feel free to fork if you really need to try your approach, it's an open sauce project after all.

Enjoy doing it for all 16+ container formats apart from JPEG.

And you're generally just asking to wreck performance anyhow.

where do these fantasies about 'disabling uploading of unique files' come from lol

im not forking garbage pythongui garbage shit; i already have my own perfect system built

lol calculate the hash before modifying the file and save that hash as metadata nerd
all these problems are already solved
there are already cpp libraries that can write metadata to just about any image format that supports metadata; otherwise just convert the image

every booru api supports searching for hash directly anyway, you dont even have to do anything funny

the best way to sort this kind of content is chronologically like how your memories are organised. separate them into folders order by month. this way there should be a manageable amount of folders each with a manageable amount of images and if you have a decent memory you should be able to recall roughly what's in each folder
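
The month-folder idea is easy to automate. A minimal sketch, assuming the file's modification time is a good enough stand-in for when it was saved and that `YYYY-MM` folder names are acceptable (both are assumptions, as is the function name):

```python
import os
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def sort_by_month(source: Path, target: Path) -> dict[str, int]:
    """Move every file in `source` into target/YYYY-MM based on its mtime."""
    counts: dict[str, int] = {}
    for item in sorted(source.iterdir()):
        if not item.is_file():
            continue  # leave subfolders alone
        stamp = datetime.fromtimestamp(item.stat().st_mtime)
        folder = target / stamp.strftime("%Y-%m")
        folder.mkdir(parents=True, exist_ok=True)
        destination = folder / item.name
        if destination.exists():
            continue  # never overwrite a file with the same name
        shutil.move(str(item), str(destination))
        counts[folder.name] = counts.get(folder.name, 0) + 1
    return counts


# Demo with empty placeholder files and hand-set timestamps.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    inbox.mkdir()
    for name, when in [("a.png", datetime(2020, 3, 15, 12)),
                       ("b.png", datetime(2021, 7, 1, 12))]:
        path = inbox / name
        path.write_bytes(b"")
        os.utime(path, (when.timestamp(), when.timestamp()))
    print(sort_by_month(inbox, Path(tmp) / "sorted"))
```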

worst post itt

t. someone with a shit memory.

> i already have my own perfect system built
Uh ... good job I guess?

I obviously don't even see why I should believe that it's anywhere near perfect. You generally seem to make everything slower by requiring a lot of filesystem accesses, and be interested in JPEG only.

> every booru api supports searching for hash directly anyway, you dont even have to do anything funny
No shit, because *boorus and various CDN and big data things tend to store and retrieve files by hash "/68/c4/sample_68c416bf307b595173121aad55d829fd.jpg" on gelbooru. Exactly because it's the superior solution.
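
Going by the example path above, the two folder levels are simply the first four hex characters of the hash. A small sketch of that mapping, assuming the hex string is the MD5 of the file (the function names and the `prefix` parameter are made up here):

```python
import hashlib


def sharded_path(md5_hex: str, extension: str = "jpg", prefix: str = "") -> str:
    """Derive a two-level storage path from a hex digest."""
    return f"/{md5_hex[0:2]}/{md5_hex[2:4]}/{prefix}{md5_hex}.{extension}"


def path_for_bytes(data: bytes, extension: str = "jpg") -> str:
    """Hash the file content, then derive where it would be stored."""
    return sharded_path(hashlib.md5(data).hexdigest(), extension)


print(sharded_path("68c416bf307b595173121aad55d829fd", prefix="sample_"))
# /68/c4/sample_68c416bf307b595173121aad55d829fd.jpg
```

Lookup by hash is then a plain path computation with no database query, and the two-character folders keep any single directory from holding millions of entries.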

> otherwise just convert the image
Converting one lossy image format into another is usually such a great idea.

Never mind all the fun you can have converting .swf, .pdf and more.

You don't even need the folders, your file manager can sort files chronologically for you.

But really, this doesn't work very well unless you're not looking at that many files, or have a really fucking good memory.

lol why are my anime images going to .pdfs lmao

So does Hydrus in effect know how to parse through Sup Forums images simply based off the file number and no additional tags? I say this because I have over a hundred twenty thousand images saved in a single four terabyte SATA drive that are completely disorganized outside of being listed by the sequential order as saved from

> So does Hydrus in effect know how to parse through Sup Forums images simply based off the file number and no additional tags?
It will generate checksums from the files when you import the files.

Then it will match them with tag databases you enabled (from *boorus, the PTR, whatever) and you should have quite a lot of images that now have tags.
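
Roughly how that matching step works, as a sketch with an in-memory SQLite table standing in for a tag database. The schema, function names and sample tags are invented for illustration and are not Hydrus's real schema:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tags ("
    " file_hash TEXT NOT NULL,"
    " tag TEXT NOT NULL,"
    " PRIMARY KEY (file_hash, tag))"
)


def add_tags(file_hash: str, tags: list[str]) -> None:
    """Load (hash, tag) pairs, e.g. from a booru dump or a shared repository."""
    db.executemany(
        "INSERT OR IGNORE INTO tags VALUES (?, ?)",
        [(file_hash, tag) for tag in tags],
    )


def tags_for(data: bytes) -> list[str]:
    """Hash the imported file and return whatever tags are known for it."""
    file_hash = hashlib.sha256(data).hexdigest()
    rows = db.execute(
        "SELECT tag FROM tags WHERE file_hash = ? ORDER BY tag", (file_hash,)
    )
    return [row[0] for row in rows]


# Somebody else already tagged this exact file.
known_file = b"bytes of a popular reaction image"
add_tags(hashlib.sha256(known_file).hexdigest(), ["reaction", "cat"])

print(tags_for(known_file))              # tags found via the hash
print(tags_for(b"a file nobody tagged"))  # nothing known
```

This only works for byte-identical files: a re-encoded or resized copy has a different checksum and gets no tags this way.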

It'll also generate a second set of "checksums" (perceptual hashes) to enable finding duplicate files for various supported file types, but they'll not be used in the file names.
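
To give an idea of what a perceptual hash is, here is a sketch of the simple "average hash" variant working on a small grayscale grid. This is not the algorithm Hydrus uses, just the easiest one to show; a real tool would first decode the image and shrink it to something like 8x8 pixels.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")


# Synthetic 8x8 "images": a gradient, a slightly brighter copy, and a negative.
gradient = [[(row * 8 + col) * 4 for col in range(8)] for row in range(8)]
brighter = [[value + 3 for value in row] for row in gradient]
inverted = [[255 - value for value in row] for row in gradient]

print(hamming_distance(average_hash(gradient), average_hash(brighter)))
print(hamming_distance(average_hash(gradient), average_hash(inverted)))
```

Unlike a checksum, this survives small changes such as a brightness tweak or a re-encode, which is what makes it useful for finding duplicates and useless as a file name.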

> over a hundred twenty thousand images
Seems considerably less than I'd typically expect on a 4TB drive, are they all high resolution or something?

That's just really my memes and interesting images folder from Sup Forums the SATA drive just so happens to be 4 terabytes in size.

So the images have to be p*** is that really the caveat in order to make it work? What about all of the the images I have that are strictly not p***and obviously wouldn't show up on you know like most archives with extensive tags. Also I'm using a transcriber have no f****** idea why the phone keeps censoring my language when I curse but I don't think it well sensor specific terms like gook.

lol

> That's just really my memes and interesting images folder from Sup Forums the SATA drive just so happens to be 4 terabytes in size.
Odd. Most of these should be like 200kb to 1MB or something, so I'd expect more like 120GB than 4TB.

I'm not actually sure what that p*** censored word is. Pasta? Porn? Penises? But anyhow, it's pretty funny what exactly your transcriber censors.

I also can't really tell you if there is particularly *good* tag coverage for your images. But of course you can add your own tags to files. [Whether it's on a booru or on the PTR, most tags were ultimately added manually by someone.]

It is limited by your hard drive of course, but you can try running db maintenance or (preferably) moving hydrus to your ssd (you do have an ssd, right user?)

I tell you what I do OP. Every few months I perform what I call a "clean slate" where I move all pictures from my phone, tablet, and laptop onto dedicated flash drives. The drives then go into a box along with the graveyard of drives from previous clean slates.

You would think that you would need to explore these often, but often I find they are forgotten about quite quickly as your fresh devices begin the cycle again with new media just as rapidly

I use Save Image/Link in Folder and save images into set categories. The images go into an unsorted folder within their category so that I can place them into their specific folders later.