Deduplication capabilities in ProGet

c4buildmasters_2588

As stated in the title, are there deduplication capabilities within ProGet? As an organization, we have an extremely large deployment that we are looking to fully migrate to ProGet. Currently, the package managers we are running store around 15 TB of packages across a variety of package feeds, including Debian, Conan, Maven, npm and Docker. With deduplication, we are at around 5 TiB of storage usage. (The majority of it is in Debian and Maven, but there is quite a bit in the other feed types as well.)

A question along the same lines: at what number of packages would we start seeing performance degradation in ProGet? We maintain around 5 million packages in our current solution. Would that many packages cause any performance degradation within ProGet?

Thanks in advance!

dean-houston

Hi @c4buildmasters_2588

Short answer: yes, and you'd probably see a bit better than a 15 -> 5 TB reduction with those artifacts. We usually see 90-95% storage space reduction. Pair it with ProGet's retention rules and I wouldn't be surprised to see that drop to 500 GB.

Long answer, file deduplication is something you want handled by the operating system (e.g. Windows Data Deduplication, RHEL VDO, etc), not the application. It's way too complex -- you have to routinely index a fileset, centralize chunks in a compressed store, and then rebuild those files with reparse points.
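To give a rough feel for the "index, centralize chunks, rebuild" steps, here is a toy Python sketch. It is only an illustration under loud assumptions: fixed-size chunks, an in-memory dict as the chunk store, and made-up file names and contents. Real OS-level deduplication is far more involved (compression, on-disk metadata, reparse points, and so on), and none of this reflects how any particular product implements it.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks so the example is easy to follow


def dedupe(files):
    """Index a fileset: keep each unique chunk once, plus a per-file recipe."""
    chunk_store = {}  # chunk digest -> chunk bytes (the centralized store)
    recipes = {}      # file name -> ordered list of chunk digests
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)  # store a chunk only the first time it is seen
            recipe.append(digest)
        recipes[name] = recipe
    return chunk_store, recipes


def rebuild(name, chunk_store, recipes):
    """Reassemble a file from its recipe of chunk digests."""
    return b"".join(chunk_store[digest] for digest in recipes[name])


# Hypothetical package contents: two versions that differ only in the last chunk.
files = {
    "pkg-1.0.0": b"AAAABBBBCCCCDDDD",
    "pkg-1.0.1": b"AAAABBBBCCCCEEEE",
}

chunk_store, recipes = dedupe(files)
raw_bytes = sum(len(data) for data in files.values())
stored_bytes = sum(len(chunk) for chunk in chunk_store.values())
print(raw_bytes, stored_bytes)
```

The two files share three of their four chunks, so the store holds five chunks instead of eight, and each file can still be rebuilt byte-for-byte from its recipe.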

Maybe this wasn't the case a couple decades ago. But these days, rolling your own file deduplication would be like implementing your own hacky encryption or compression. Pointless and a bad idea.

That being said, you may be using a tool by our friends at JFrog. They advertise a feature called "data deduplication", which IMHO is something between deceptive and a clever marketing flex.

Because they store files by their hash instead of file names, the files are automatically "deduplicated"... so long as it's the exact same contents. Which, in most cases, it will not be.
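A minimal Python sketch of that idea, assuming a made-up in-memory store keyed by SHA-256 of the whole file (the class, paths, and contents are hypothetical and are not Artifactory's or ProGet's actual code): identical bytes uploaded under two names are kept once, but a file that differs by a single byte gets its own full copy.

```python
import hashlib


class HashAddressedStore:
    """Toy store: blobs are keyed by the SHA-256 of their full contents."""

    def __init__(self):
        self.blobs = {}  # digest -> file bytes
        self.names = {}  # logical path -> digest

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical contents are stored only once
        self.names[path] = digest
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = HashAddressedStore()
v1 = b"A" * 1000
v2 = b"A" * 999 + b"B"  # differs from v1 by one byte

store.put("repo-a/pkg-1.0.0.tgz", v1)
store.put("repo-b/pkg-1.0.0.tgz", v1)  # exact same contents: no extra storage
store.put("repo-a/pkg-1.0.1.tgz", v2)  # nearly identical: stored in full again

print(len(store.names), len(store.blobs), store.stored_bytes())
```

Three logical files end up as two blobs: the exact duplicate costs nothing, while the near-duplicate saves nothing.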

Here’s an article that digs into how things are stored in Artifactory, and also should give you an idea of their “file-based” approach: https://blog.inedo.com/proget-migration/how-files-and-packages-work-in-proget-for-artifactory-users/

As for the package count, 5M is obviously a lot of packages. It's not going to be as fast as 5 packages, but it probably won't be noticeably slower either. There are lots of database indexes, etc.

Hope that helps.

-- Dean