Managing large image collections

Noah Parker

Hello,

As an imageboard user and data hoarder, I save all the content I find interesting, amusing, or arousing. You know this very well since you are probably the same way.

I have several thousand images sitting on my hard drive, most of which were saved by hand. This is content I spent time searching for, so it is pretty special to me. I have several thousand more images, but those were scraped, so I have no trouble organizing them. I have postponed sorting the hand-saved images for a while, because hand-sorting them is painful and time-consuming. At the moment, I have about 5000 images left to organise into around 200 thematic folders. I thought about deploying a simple machine-learning solution using Python and TensorFlow, but it is unlikely to match the level of granularity I am going for, since vastly different images can belong in the same folder (think of memes, for example). Also, labelling the dataset by hand for training would be pretty much the same work I am trying to avoid right now.

Do you have a similar collection? Do you have a set-up to manage it?

June 11, 2017 - 07:36

Luke Bennett

Some user told me about this: github.com/hydrusnetwork/hydrus

June 11, 2017 - 07:39

Nolan Howard

>github.com/hydrusnetwork/hydrus

This is incredible and very valuable, thank you user.

June 11, 2017 - 07:40

Angel Smith

Interesting. Thanks.

June 11, 2017 - 07:42

Jose Jackson

Seconding Hydrus. Just use that.

Note that it will maintain its own file and folder structure (all hash based).

It doesn't preserve YOUR folders and filenames, though you can grab them into the DB at import time.
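
For anyone wondering what "hash based" storage looks like in practice, here is a minimal Python sketch. It is not Hydrus's actual code or layout: the SHA-256 choice, the two-character subfolder scheme, and the names `sha256_of_file`, `path_for_hash` and `import_file` are all assumptions made for illustration. It only shows the general idea of filing each image under its content hash while recording the original folder and filename at import time.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def path_for_hash(store_root, file_hash, suffix):
    """Derive the storage path from the hash alone (hypothetical layout)."""
    return Path(store_root) / file_hash[:2] / f"{file_hash}{suffix}"


def import_file(source_path, store_root, move=False):
    """Copy (or move) a file into the store and return a record for a DB."""
    source = Path(source_path)
    file_hash = sha256_of_file(source)
    target = path_for_hash(store_root, file_hash, source.suffix.lower())
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        if move:
            shutil.move(str(source), str(target))
        else:
            shutil.copy2(source, target)
    # The original name and folder survive only as metadata in this record.
    return {
        "hash": file_hash,
        "original_name": source.name,
        "original_folder": source.parent.name,
        "stored_at": target,
    }
```

Two files with identical bytes but different names end up at the same `stored_at` path, which is why duplicates disappear for free in this kind of layout.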

June 11, 2017 - 07:45

Kevin Garcia

PS: To get other people's tags (which is one of the main uses of this tool), adding the PTR, the public tag repository by and for Hydrus users, is the most important step:
hydrusnetwork.github.io/hydrus/help/getting_started_tags.html

You can also add downloaded tag database dump files (like from a *booru) in almost the same place, but the PTR is the big one.

June 11, 2017 - 07:56

Carson Davis

hydrus network is so freaking gay

I keep a complex file structure of anime images sorted by character, pairings, show, etc.
and then elo ranked using tournaments and manual comparisons and some fancy algorithms
and then rip tags from boorus using a program i wrote and embed the tags as metadata
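
The Elo part is simple to sketch. The following is the textbook Elo update for a single pairwise comparison between two images, not the poster's actual program; the K-factor of 32 and the function names are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1.0 if image A was preferred, 0.0 if B was, 0.5 for a draw.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating pool stays constant.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b
```

Running every manual "which one is better" choice through `update_elo` gives each image a rating that can be sorted on.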

coming soon: ranking individual tags using microsoft TRUESKILL and a gui to put it all together and btfo hydrus once and for all

btw hydrus copies all of your images into its own folder, freaking LAME

June 11, 2017 - 09:33

Jacob Parker

You have autism

June 11, 2017 - 09:34

Blake Ward

last time I used hydrus, it lagged horribly with large image collections. was this ever fixed?

June 11, 2017 - 09:35

Bentley Carter

>btw hydrus copies all of your images into its own folder, freaking LAME
Wait, it does? That's not gonna work out when I have around 40gb of images saved.

June 11, 2017 - 09:36

Kevin Morales

anyone who considers using hydrus network is reddit and also im not autistic

June 11, 2017 - 09:38

Alexander Rogers

Hydrus is shit, yes, but you have autism.

Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
Find a download for Google Picasa instead.

June 11, 2017 - 09:40

Kevin Hall

It doesn't really lag for me, but of course querying the DB and loading thumbnails gets slower if you have a few million files and search for a bunch of tags.

June 11, 2017 - 12:55

Aiden Lopez

It can move them into its own folder. Doesn't require any more storage space then.

But it will manage files and folders hash-based, yes. It doesn't do another mapping between hash and file storage location.

June 11, 2017 - 12:57

Blake Jackson

Side question ... is there a self-hosted booru that clones the tags from the public ones?

June 11, 2017 - 13:04

Christopher Campbell

> Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
It is a problem: symlinks and NTFS junctions, among other things, aren't going to work on all the filesystems and security policies that are still in use. Symlinks are generally used sparingly, and mostly by people who do full-on sysadmin work.

Also see:
> why not use filenames and folders?
> can the client manage files from their original locations?
hydrusnetwork.github.io/hydrus/help/faq.html

It's a choice for more performance and fewer file management problems.

June 11, 2017 - 13:04

Charles Ortiz

Yes, that's exactly what Hydrus is doing.

Apart from the public tag repository you can use tag database dumps scraped from all the typical boorus and such.

And it CAN host a local web-browser based booru (hydrusnetwork.github.io/hydrus/help/local_booru.html) apart from the python GUI client.

June 11, 2017 - 13:06

Jeremiah Baker

if the dev wasn't a freaking retard he would put the metadata inside the images and kept a db of hashes

June 11, 2017 - 13:06

Charles Nguyen

It's freaking retarded to alter all the images, it will change their hashes.

And actually not all formats supported will just take some metadata field.

And yes, this IS using a db of hashes, it's just not also doing a mapping from hashes to file paths that needs an extra lookup per file.
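
A tiny Python illustration of the hash problem, assuming SHA-256 as the file identifier and a made-up `shared_tags` dictionary standing in for any shared tag database (both are assumptions, not a description of a real service):

```python
import hashlib


def file_id(data: bytes) -> str:
    """Identify a file by the SHA-256 of its exact bytes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for real image bytes.
original = b"\xff\xd8 pretend these are image bytes \xff\xd9"

# Hypothetical shared tag database, keyed by the hash of the untouched file.
shared_tags = {file_id(original): {"meme", "reaction image"}}

# "Embedding" metadata changes the bytes, and therefore the hash.
tagged_copy = original + b"tags=meme"


def lookup(data: bytes) -> set:
    """Fetch tags for a file, or an empty set if its hash is unknown."""
    return shared_tags.get(file_id(data), set())
```

`lookup(original)` finds the tags, while `lookup(tagged_copy)` comes back empty, even though both would display as the same picture.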

June 11, 2017 - 13:08

Aaron Wright

Neat. Gonna have a booru with only porn that I like.

Can it auto-download artist tags too?

June 11, 2017 - 13:09

Nolan Flores

save the new hash as metadata in the image genius
bwahaha

June 11, 2017 - 13:09

Cameron Ramirez

>Use tool once
>Changes hash of all your images
>Now locked in to that vendor

June 11, 2017 - 13:10

Liam Hill

>new hash
>PTR becomes pointless since the hashes are different

June 11, 2017 - 13:11

Jose Peterson

If you let Hydrus download files from a *booru, it'll basically grab all tags if you tell it to. [There is sometimes slightly more fine-grained control, but generally you'll grab all tags.]

The PTR and DB dumps from other boorus can then add their tags on top.

June 11, 2017 - 13:12

Elijah Ramirez

a large metadata section that just iterates looking for hash collisions for the original image hash

June 11, 2017 - 13:14

Isaiah Miller

Yea, and everyone else will hate your users when they upload their files from their filesystem directly since now easy duplicate elimination no longer works.

Basically, even if you want to hack the feature into your fork of hydrus, please add another DB to have that slower hash->file paths indirection. Don't fuck with the images if you don't have to.

June 11, 2017 - 13:14

David Myers

ohhh noo the ceo of whatever non-encrypted / autistic indie image host will be mad because I used .0000001% more resources

June 11, 2017 - 13:17

Adam James

Not entirely sure of the file structure for all image types, but isn't it possible to generate a hash using only data from the non-metadata section of the file? So that you can rename the file, change metadata, tags, etc. but the hash remains the same so long as the image remains the same?

June 11, 2017 - 13:18

Matthew Reyes

the hash will be calculated using the entire file by whatever other generic service sees the file. I think it's technically possible to generate a deterministic hash without the metadata though, because normal jpeg metadata lives in its own marker segments, separate from the image data
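
A rough Python sketch of that idea for JPEG only. JPEG metadata generally sits in APPn and COM marker segments, which come before the compressed scan data, so the sketch walks the segments, skips those, and hashes everything else. Treating every APPn segment as metadata is a simplifying assumption (some carry colour information that affects rendering), the name `jpeg_content_hash` is made up, and unusual files with padding bytes or standalone markers before the scan are not handled.

```python
import hashlib
import struct


def jpeg_content_hash(data: bytes) -> str:
    """SHA-256 over a JPEG's non-metadata segments plus its scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    digest = hashlib.sha256()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: hash the rest of the file (compressed pixels).
            digest.update(data[pos:])
            return digest.hexdigest()
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE  # APPn, COM
        if not is_metadata:
            digest.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")
```

This only stays stable as long as a tool edits metadata without re-encoding the image, and it would need a separate parser for every other container format.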

June 11, 2017 - 13:21

Evan White

Right, so really... that's how it should be done. I'm sure such a system would have to detect the way the drive is formatted, and the file type, to properly hash using only "non-metadata".

June 11, 2017 - 13:23

Hunter Phillips

If they can't strip the metadata easily, they'll just disable the file uploads, and either way, you'd be a faggot like the sites who watermark everything.

Even then, you can't add your metadata to all file types that Hydrus supports. But feel free to fork if you really need to try your approach, it's an open sauce project after all.

June 11, 2017 - 13:25

Jordan Brown

Enjoy doing it for all 16+ container formats apart from JPEG.

And you're generally just asking to wreck performance anyhow.

June 11, 2017 - 13:28

Julian Martin

where do these fantasies about 'disabling uploading of unique files' come from lol

im not forking garbage pythongui garbage shit; i already have my own perfect system built

lol calculate the hash before modifying the file and save that hash as metadata nerd
all these problems are already solved
there are already cpp libraries that can write metadata to just about any image format that supports metadata; otherwise just convert the image

every booru api supports searching for hash directly anyway, you dont even have to do anything funny

June 11, 2017 - 13:31

Andrew Price

the best way to sort this kind of content is chronologically, like how your memories are organised. separate them into folders ordered by month. this way there should be a manageable amount of folders each with a manageable amount of images and if you have a decent memory you should be able to recall roughly what's in each folder
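
If you want to try the month-folder approach without dragging files around by hand, a short Python sketch could look like this. It assumes the file's modification time is a good enough stand-in for "when I saved it", and the `YYYY-MM` folder naming and the function name `sort_into_month_folders` are arbitrary choices.

```python
import shutil
from datetime import datetime
from pathlib import Path


def sort_into_month_folders(source_dir, dest_dir):
    """Move each file in source_dir into dest_dir/YYYY-MM by its mtime."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    moved = {}
    # sorted() takes a snapshot, so moving files does not disturb the loop.
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        stamp = datetime.fromtimestamp(path.stat().st_mtime)
        month_dir = dest / stamp.strftime("%Y-%m")
        month_dir.mkdir(parents=True, exist_ok=True)
        target = month_dir / path.name
        if target.exists():
            continue  # never overwrite; leave the clash for manual review
        shutil.move(str(path), str(target))
        moved[path.name] = target
    return moved
```

Subfolders are left alone, and files whose names clash with something already in the month folder stay where they are.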

June 11, 2017 - 13:40

Robert Sullivan

worst post itt

June 11, 2017 - 13:46

Evan Foster

t. someone with a shit memory.

June 11, 2017 - 13:47

Kayden Green

> i already have my own perfect system built
Uh ... good job I guess?

I obviously don't even see why I should believe that it's anywhere near perfect. You generally seem to make everything slower by requiring a lot of filesystem accesses, and to be interested in JPEG only.

> every booru api supports searching for hash directly anyway, you dont even have to do anything funny
No shit, because *boorus and various CDN and big data things tend to store and retrieve files by hash, e.g. "/68/c4/sample_68c416bf307b595173121aad55d829fd.jpg" on gelbooru. Exactly because it's the superior solution.

> otherwise just convert the image
Converting one lossy image format into another is usually such a great idea.

Never mind all the fun you can have converting .swf, .pdf and more.

June 11, 2017 - 13:57

Isaac Barnes

You don't even need the folders, your file manager can sort files chronologically for you.

But really, this doesn't work very well unless you're not looking at that many files, or have a really fucking good memory.

June 11, 2017 - 13:59

Andrew Thompson

lol why are my anime images going to .pdfs lmao

June 11, 2017 - 14:00

Thomas Ortiz

So does Hydrus in effect know how to parse through Sup Forums images simply based off the file number and no additional tags? I say this because I have over a hundred twenty thousand images saved on a single four terabyte SATA drive that are completely disorganized outside of being listed by the sequential order as saved from

June 11, 2017 - 14:05

Ryder Reyes

> So does Hydrus in effect know how to parse through Sup Forums images simply based off the file number and no additional tags?
It will generate checksums from the files when you import the files.

Then it will match them with tag databases you enabled (from *boorus, the PTR, whatever) and you should have quite a lot of images that now have tags.

It'll also generate a second set of "checksums" (perceptual hashes) to enable finding duplicate files for various supported file types, but they'll not be used in the file names.
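
To give a rough idea of what a perceptual hash is, here is a sketch of the simple "average hash" technique in plain Python. This is a stand-in for illustration and not necessarily the algorithm Hydrus uses. It assumes the image has already been decoded into a grid of grayscale values (a list of rows) whose width and height are multiples of the hash size; the function names are made up.

```python
def average_hash(pixels, hash_size=8):
    """Return a hash_size*hash_size-bit integer describing coarse brightness.

    pixels is a list of rows of grayscale integers, e.g. 0-255.
    """
    height = len(pixels)
    width = len(pixels[0])
    if height % hash_size or width % hash_size:
        raise ValueError("dimensions must be multiples of hash_size")
    block_h = height // hash_size
    block_w = width // hash_size
    block_sums = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block_sums.append(sum(
                pixels[y][x]
                for y in range(by * block_h, (by + 1) * block_h)
                for x in range(bx * block_w, (bx + 1) * block_w)
            ))
    total = sum(block_sums)
    count = len(block_sums)
    bits = 0
    for block_sum in block_sums:
        # One bit per block: is it brighter than the average block?
        # Integer form of block_sum > total / count, so no float rounding.
        bits = (bits << 1) | (1 if block_sum * count > total else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(hash_a ^ hash_b).count("1")
```

Unlike a checksum, two files that look alike (a resized or slightly brightened copy, say) tend to land a small Hamming distance apart, which is what makes duplicate finding possible.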

> over a hundred twenty thousand images
Seems considerably less than I'd typically expect on a 4TB drive, are they all high resolution or something?

June 11, 2017 - 14:11

Connor Torres

That's just really my memes and interesting images folder from Sup Forums the SATA drive just so happens to be 4 terabytes in size.

So the images have to be p*** is that really the caveat in order to make it work? What about all of the the images I have that are strictly not p***and obviously wouldn't show up on you know like most archives with extensive tags. Also I'm using a transcriber have no f****** idea why the phone keeps censoring my language when I curse but I don't think it well sensor specific terms like gook.

June 11, 2017 - 14:17

Jacob Cook

lol

June 11, 2017 - 14:18

Anthony Jones

> That's just really my memes and interesting images folder from Sup Forums the SATA drive just so happens to be 4 terabytes in size.
Odd. Most of these should be like 200KB to 1MB or something, so for 120,000 images I'd expect something more like 120GB than 4TB.

I'm not actually sure what that p*** censored word is. Pasta? Porn? Penises? But anyhow, it's pretty funny what exactly your transcriber censors.

I also can't really tell you if there is particularly *good* tag coverage for your images. But of course you can add your own tags to files. [Whether it's on a booru or on the PTR, most tags were ultimately added manually by someone.]

June 11, 2017 - 14:27

Ian Gonzalez

It is limited by your hard drive of course, but you can try running db maintenance or (preferably) moving hydrus to your ssd (you do have an ssd, right user?)

June 11, 2017 - 19:02

Ian Perry

I'll tell you what I do, OP. Every few months I perform what I call a "clean slate", where I move all pictures from my phone, tablet, and laptop onto dedicated flash drives. The drives then go into a box along with a graveyard of drives from previous clean slates.

You would think that you would need to explore these often, but I find they are forgotten quite quickly, as your fresh devices begin the cycle again with new media just as rapidly.

June 11, 2017 - 19:34

Grayson Hernandez

I use Save Image/Link in Folder and save images into set categories. The images go into an unsorted folder within their category so that I can place them into their specific folders later.

June 11, 2017 - 21:42
